News: 11 March 2016 - Forum Rules
Current Moderators - DarkSol, KingMike, MathOnNapkins, Azkadellia, Danke

Author Topic: About FF6 multiply algorithm  (Read 4223 times)

magno

  • RHDN Patreon Supporter!
  • Full Member
  • *****
  • Posts: 157
    • View Profile
    • Traducciones Magno
About FF6 multiply algorithm
« on: July 17, 2013, 05:11:19 am »
Hi,

I've just finished dumping and labelling all ASM code in $C3 bank (menu-related code), which makes possible to add and change whatever code you want without effort, since all code is 100% relocatable and 80% tested (haven't reached to Colisseum yet).

Anyway, my question is pretty technical. I was amazed when discovered how multiply was done in the most part of funcitions for menus. They are this kind:

Code: [Select]
MULTIPLY_SPELL_ID_BY_14
PHA
Tag_$C350F6
    LDA.w $4212
AND.b #$40
BEQ Tag_$C350F6
PLA
STA.w $211B
STZ.w $211B
LDA.b #$0E
STA.w $211C
STA.w $211C
RTS

This means that the CPU waits for H-Blank (polling $4212) and then uses matrix multiply through register $211D and $211C to get the result in $2134. Why do they do multiply in such way? Some kind of frame synchronization through H-Blank? I analyzed almost all code in this bank and didn't find any justification for doing things in such way. Maybe some of you know the reason.

I'm replacing that "weird" code with mine:

Code: [Select]
MULTIPLY_SPELL_ID_BY_14
PHP
REP #$30
AND.w #$00FF
ASL A
PHA
ASL A
ASL A
ASL A
SEC
SBC.b $01,S
TAX
PLA
PLP
RTS
...which is faster and smaller and didn't make any visible changes in the menu (at least for the moment).

assassin

  • Full Member
  • ***
  • Posts: 140
    • View Profile
    • My Barren Webpage
Re: About FF6 multiply algorithm
« Reply #1 on: July 17, 2013, 08:32:06 am »
Hi,

I've just finished dumping and labelling all ASM code in $C3 bank (menu-related code), which makes possible to add and change whatever code you want without effort, since all code is 100% relocatable and 80% tested (haven't reached to Colisseum yet).

a cool undertaking.  be sure to check out:
http://slickproductions.org/docs.php?id=FF6

for lots of comments of Bank C3, among others.

as for your question, you're probably right that there's no good reason for them to use a slower method.  the $211B - $2120 Mode 7 registers need to be accessed during a blank:
http://bin.smwcentral.net/u/4842/regs.txt

that explains the wait, but not why they're using those registers in the first place.  how do they compare on average to calculating with $00420n registers, which require an ~8-cycle wait to obtain the results?

i doubt they're synchronizing with that function.  the menus do use their share of HDMA, but it's not like any of the C3/50F5 callers are handling the actual transfer; that'll occur later.

anyway, it's a running theory that much of Bank C3 was written by baboons, so don't be surprised when you can outcode them.

Lenophis

  • IRC Staff
  • Hero Member
  • *****
  • Posts: 958
  • The return of the sombrero!
    • View Profile
    • Slick Productions
Re: About FF6 multiply algorithm
« Reply #2 on: July 17, 2013, 11:12:42 am »
Generally speaking, using the $4202 register to multiply will be faster than the matrix one used by bank C3. However, one advantage the matrix multiplying gives you is that it can naturally do 24-bit multiplying. No, bank C3 has no need for 24-bit results, at least not with anything that is done currently.

...which is faster and smaller and didn't make any visible changes in the menu (at least for the moment).
Actually, I hate to burst your bubble, but your routine is overall slower. The original routine uses 39 cycles IF it loads $4212 and it's in H-Blank. Your routine uses 45 cycles. I doubt 6 cycles will make any impact on performance. If you would like a faster routine, you can borrow what C2 does:
Code: [Select]
Multiplication Function
Multiplies low bit of A * high bit of A. Stores result in 16-bit A.

C2/4781: 08           PHP
C2/4782: C2 20        REP #$20
C2/4784: 8F 02 42 00  STA $004202
C2/4788: EA           NOP
C2/4789: EA           NOP
C2/478A: EA           NOP
C2/478B: EA           NOP
C2/478C: AF 16 42 00  LDA $004216
C2/4790: 28           PLP
C2/4791: 60           RTS
You could axe that last NOP and it would take 34 cycles, as opposed to the original 39, or your 45. You could also get 2 more cycles since the bank entering would be 00 (it has to be, otherwise the H-blank check wouldn't work) by changing the STA and LDA to absolutes. :thumbsup: And since it's in absolute, you could LDX instead of LDA TAX. Just make sure you check the calling code for instructions that would destroy your result.

anyway, it's a running theory that much of Bank C3 was written by baboons, so don't be surprised when you can outcode them.
:crazy:
« Last Edit: July 17, 2013, 11:34:18 am by Lenophis »


https://ff6randomizer.codeplex.com/ - Randomize your FF6 experience!

magno

  • RHDN Patreon Supporter!
  • Full Member
  • *****
  • Posts: 157
    • View Profile
    • Traducciones Magno
Re: About FF6 multiply algorithm
« Reply #3 on: July 17, 2013, 03:00:23 pm »
Generally speaking, using the $4202 register to multiply will be faster than the matrix one used by bank C3. However, one advantage the matrix multiplying gives you is that it can naturally do 24-bit multiplying. No, bank C3 has no need for 24-bit results, at least not with anything that is done currently.

That could be a good reason :) I also though that MAYBE standard multiplication registers ($4202) could be "reserved" for another "task" (logically speaking, not phisically) but I double-checked that and $4202 is not used in bank $C3.
So seeing as there is no need for 24-bit multiplies, I think I will go forwad with my code.


Actually, I hate to burst your bubble, but your routine is overall slower. The original routine uses 39 cycles IF it loads $4212 and it's in H-Blank. Your routine uses 45 cycles. I doubt 6 cycles will make any impact on performance.

I disagree: original code is faster in the best case, but not in average. If you take into account the probabilities of being in H-Blank when first checking $4212 against the probability of not being in H-Blank, you could see that the loop is executed 3 times in average, which makes my code faster, since it always takes 39 cycles.
Anyway, you are right, a few cycles won't make any difference, but if my code is faster (in average) and smaller, I prefer it :)
As for non-constant multiplications, I already changed the code to use standard $4202 multiply registers.

Thanks to both for your answers! I admire you both a lot for your FF6 fixes, which I'm planning to use in my code if you give your permission.

Regards!

Lenophis

  • IRC Staff
  • Hero Member
  • *****
  • Posts: 958
  • The return of the sombrero!
    • View Profile
    • Slick Productions
Re: About FF6 multiply algorithm
« Reply #4 on: July 17, 2013, 03:11:10 pm »
The patches made public are made specifically for that purpose, for the public to use them. ;)


https://ff6randomizer.codeplex.com/ - Randomize your FF6 experience!

assassin

  • Full Member
  • ***
  • Posts: 140
    • View Profile
    • My Barren Webpage
Re: About FF6 multiply algorithm
« Reply #5 on: July 17, 2013, 06:01:52 pm »
If you take into account the probabilities of being in H-Blank when first checking $4212 against the probability of not being in H-Blank, you could see that the loop is executed 3 times in average,

can i see your math for this?  i'm getting closer to 7.8 times, but admittedly don't know what i'm doing.


Quote
Thanks to both for your answers! I admire you both a lot for your FF6 fixes, which I'm planning to use in my code if you give your permission.

sure.  just link to both of my webpages, and try to give the patch version numbers (so users can easily see whether the component fixes have been updated since the latest release of your larger patch).

magno

  • RHDN Patreon Supporter!
  • Full Member
  • *****
  • Posts: 157
    • View Profile
    • Traducciones Magno
Re: About FF6 multiply algorithm
« Reply #6 on: July 18, 2013, 05:21:15 am »
can i see your math for this?  i'm getting closer to 7.8 times, but admittedly don't know what i'm doing.

Sure. HCounter counts from 0 to 339 (340 dots) in 1364 master cycles (i.e., 4 cycles per dot), and assumin HBlank = '0' when HCount is greater than 1 and less than 277 (though there is some confusing information about this), this means that chances of being in HBlank are: (339-277)/340 = 0.18 of total processing time (maybe less due to 40 cyles per scanline to refresh WRAM).
The waiting-for-HBlank loop takes (4+2+2)*6 = 48 master cycles (accessing register and ROM at 3.68MHz), so in 1 scanline, you check $4212 at most 28 times (1364 / 48), but you don't loop the 18% of times (because you are in Hblank), so finally you only loop 23 times AT MOST and once AT LEAST. Seeing as there is equal probability of looping (since it is equally probable to be at any position of the scanline anytime) and this probability is 1/23, it results:

 2 loop * (1/23) chances + 3 loops * (1/23) chances + 4 loops * (1/23) chances + ......... + 24 loops * (1/23) chances = 13 loops in mean.

Anyway, the result of 3 times I posted aboves was empirically proved: I logged 30 times each function where $4212 is polled and then I calculated the mean of number of times the loop is executed. The result was 3.8 and I rounded down to 3. I didn't intend to be precise, but to prove that code is slower in mean.


sure.  just link to both of my webpages, and try to give the patch version numbers (so users can easily see whether the component fixes have been updated since the latest release of your larger patch).

 :thumbsup:
« Last Edit: July 18, 2013, 05:26:38 am by magno »

assassin

  • Full Member
  • ***
  • Posts: 140
    • View Profile
    • My Barren Webpage
Re: About FF6 multiply algorithm
« Reply #7 on: July 18, 2013, 08:05:35 pm »
Sure. HCounter counts from 0 to 339 (340 dots) in 1364 master cycles (i.e., 4 cycles per dot), and assumin HBlank = '0' when HCount is greater than 1 and less than 277 (though there is some confusing information about this), this means that chances of being in HBlank are: (339-277)/340 = 0.18 of total processing time (maybe less due to 40 cyles per scanline to refresh WRAM).
The waiting-for-HBlank loop takes (4+2+2)*6 = 48 master cycles (accessing register and ROM at 3.68MHz), so in 1 scanline, you check $4212 at most 28 times (1364 / 48), but you don't loop the 18% of times (because you are in Hblank), so finally you only loop 23 times AT MOST and once AT LEAST. Seeing as there is equal probability of looping (since it is equally probable to be at any position of the scanline anytime) and this probability is 1/23, it results:

 2 loop * (1/23) chances + 3 loops * (1/23) chances + 4 loops * (1/23) chances + ......... + 24 loops * (1/23) chances = 13 loops in mean.

Anyway, the result of 3 times I posted aboves was empirically proved: I logged 30 times each function where $4212 is polled and then I calculated the mean of number of times the loop is executed. The result was 3.8 and I rounded down to 3. I didn't intend to be precise, but to prove that code is slower in mean.

thanks!  that's very thorough.  a nitpick on the final equation..  shouldn't it be:

(1 loop * (5/28) chance) + [ ( 2 loop * (1/23) chances + 3 loops * (1/23) chances + 4 loops * (1/23) chances + ......... + 24 loops * (1/23) chances ) * (23/28) chance ]

?

in other words, you're accounting for the 18% instant-success rate once, but maybe it should be accounted for twice: in both shortening the maximum delay if HBlank is 0 when the code is first executed, *and* having an 18% occurrence of HBlank being 1 the first time.

i could be wrong.

magno

  • RHDN Patreon Supporter!
  • Full Member
  • *****
  • Posts: 157
    • View Profile
    • Traducciones Magno
Re: About FF6 multiply algorithm
« Reply #8 on: July 19, 2013, 01:23:08 am »
Yes, you are abolutely right. I did calculations that way because I calculated the times the loop is executed more than once, since at least it must be executed 1 time to check HBlank. In short, the original code is faster if the loop executes once, so I calculated the number of times it is executed making my code faster.

The proper exact calculation would be yours, which implies that loop is executed about 11 times in average.

Looking at the original code in a statistical way, maybe changing the multiply code could make the difference, because you are saving a lot of cycles, although I think those task are not very repetitive, since multiply is used to access name tables to print items, weapons, armors and such...



Code: [Select]
Multiplication Function
Multiplies low bit of A * high bit of A. Stores result in 16-bit A.

C2/4781: 08           PHP
C2/4782: C2 20        REP #$20
C2/4784: 8F 02 42 00  STA $004202
C2/4788: EA           NOP
C2/4789: EA           NOP
C2/478A: EA           NOP
C2/478B: EA           NOP
C2/478C: AF 16 42 00  LDA $004216
C2/4790: 28           PLP
C2/4791: 60           RTS
You could axe that last NOP and it would take 34 cycles, as opposed to the original 39, or your 45.

Why could I axe the last NOP? Isn't it supposed hardware multiplication is 8 machine cycles long? I just checked that all $4202 multiplications wait for 6 machine cycles, not 8, as I always thought was the correct number.


By the way, I just discovered some 24 bit multiplications in bank $C3 in routines:
* $C3/6D79
* $C3/8CD6

 I'll post all my relocatable dissassembly when it's done.
« Last Edit: July 19, 2013, 06:28:07 am by magno »

assassin

  • Full Member
  • ***
  • Posts: 140
    • View Profile
    • My Barren Webpage
Re: About FF6 multiply algorithm
« Reply #9 on: July 19, 2013, 07:32:12 pm »
Quote
Why could I axe the last NOP? Isn't it supposed hardware multiplication is 8 machine cycles long? I just checked that all $4202 multiplications wait for 6 machine cycles, not 8, as I always thought was the correct number.

8 is correct, to my knowledge.  as for Lenophis' reasoning:

http://softpixel.com/~cwright/sianse/docs/65c816.txt

see Table 9.  the "LDA $004216" won't actually start accessing the register until a few cycles into the instruction.

i'm still paranoid about incorporating this knowledge into my own code to save space (i have one patch that could possibly have a 4-byte helper function eliminated), but probably for no good reason.

magno

  • RHDN Patreon Supporter!
  • Full Member
  • *****
  • Posts: 157
    • View Profile
    • Traducciones Magno
Re: About FF6 multiply algorithm
« Reply #10 on: October 24, 2013, 04:23:39 am »
As I promised, I gave credit to authors in my FF6 spanish patch (not released yet):




I also hacked the full credits routine to add spanish characters:



I'll upload my $C3 bank dissasembly as soon as I have access to my webpage again. The code is fully relocatable (every routine and every branch are tagged) and data tables are properly listed and formatted. The code is aimed for x816 assembler, but can be easily converted to others. Besides, there are some new routines I coded for the spanish version and inserted almost ALL bugfix patches released so far.

Hope somebody find it useful.