
Started by KingMike, April 06, 2011, 10:00:49 AM


I thought I shouldn't hijack RedComet's DMA thread since this is a different issue.

I suspect I may have found a bottleneck in Maka-maka's decompression routine.
I wonder if this explains the awful loading time?
It writes FF to 64K of RAM one byte at a time.
Isn't that the point you should use DMA, or even MVP/MVN?

Whereas the routine to decompress the graphics to WRAM uses the DMA registers to write ONE BYTE AT A TIME to WRAM. :banghead:

$00/9721 DA          PHX                     A:D31C X:D3AF Y:0017 D:0000 DB:07 S:1FE7
$00/9722 5A          PHY                     A:D31C X:D3AF Y:0017 D:0000 DB:07 S:1FE5
$00/9723 8B          PHB                     A:D31C X:D3AF Y:0017 D:0000 DB:07 S:1FE3
$00/9724 08          PHP                     A:D31C X:D3AF Y:0017 D:0000 DB:07 S:1FE2
$00/9725 C2 20       REP #$20                A:D31C X:D3AF Y:0017 D:0000 DB:07 S:1FE1
$00/9727 48          PHA                     A:D31C X:D3AF Y:0017 D:0000 DB:07 S:1FE1
$00/9728 E2 20       SEP #$20                A:D31C X:D3AF Y:0017 D:0000 DB:07 S:1FDF
$00/972A A9 00       LDA #$00                A:D31C X:D3AF Y:0017 D:0000 DB:07 S:1FDF
$00/972C 48          PHA                     A:D300 X:D3AF Y:0017 D:0000 DB:07 S:1FDF
$00/972D AB          PLB                     A:D300 X:D3AF Y:0017 D:0000 DB:07 S:1FDE
$00/972E A2 00 00    LDX #$0000              A:D300 X:D3AF Y:0017 D:0000 DB:00 S:1FDF
$00/9731 3A          DEC A                   A:D300 X:0000 Y:0017 D:0000 DB:00 S:1FDF
$00/9732 9F 00 00 7F STA $7F0000,x[$7F:0000] A:D3FF X:0000 Y:0017 D:0000 DB:00 S:1FDF
$00/9736 CA          DEX                     A:D3FF X:0000 Y:0017 D:0000 DB:00 S:1FDF
$00/9737 D0 F9       BNE $F9    [$9732]      A:D3FF X:FFFF Y:0017 D:0000 DB:00 S:1FDF
"My watch says 30 chickens" Google, 2018


I don't know if it explains all of it, but the 0xFF byte fill could be easily achieved with a DMA transfer using registers $2180-$2183.
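
A minimal sketch of such a fill, assuming DMA channel 0 is free, no HDMA is active on it, and the data bank is one where the $21xx/$43xx registers are visible (the trace sets DB to $00). FillByte is a hypothetical label for a single $FF byte in ROM; the source has to be outside WRAM, since WRAM-to-WRAM DMA through $2180 does not work. Label and bank-operator syntax vary by assembler.

```asm
        PHP
        SEP #$20          ; 8-bit A
        REP #$10          ; 16-bit X/Y
        STZ $2181         ; WRAM port address, low byte
        STZ $2182         ; WRAM port address, high byte
        LDA #$01
        STA $2183         ; address bit 16 = 1, so the port points at $7F0000
        LDA #$08          ; A-bus to B-bus, fixed source address, one register per byte
        STA $4300
        LDA #$80          ; B-bus destination is $2180
        STA $4301
        LDX #FillByte     ; source address, low 16 bits (hypothetical label)
        STX $4302
        LDA #^FillByte    ; source bank (operator syntax is assembler-specific)
        STA $4304
        LDX #$0000        ; a byte count of 0 means 65536
        STX $4305
        LDA #$01
        STA $420B         ; start the transfer on channel 0
        PLP

; Somewhere in ROM (hypothetical):
FillByte:
        .db $FF
```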

As for decompression, it's typical that it proceeds one or two bytes at a time. If it's using DMA to write the bytes, well, that would probably be a huge bottleneck too, but without seeing any code that shows that, I can't really jump to conclusions. Which DMA registers is it using, $2180? $2180 makes sense for stuff like that, as it provides a nice way to write to WRAM and also get the benefit of auto-incrementing the target address on single-byte writes. (Word writes with $2180 aren't possible without additional logic, however.)


King M, whether or not that's a bottleneck depends on how often the decompression routine is called and thus how often it's executed. If it runs say once per scene change or something like that, despite being inefficient, I doubt it's really causing a problem. If however it's being called regularly, it very well could be.

I don't have the data in front of me, but those three instructions are probably somewhere on the order of 4 CPU cycles each on average, at about 6 master cycles per CPU cycle. So we have maybe 12x6=72 master cycles per loop iteration. Multiply 72 master cycles by ~65,000 iterations and divide by the ~21 MHz master clock, and we're talking somewhere in the neighborhood of just under a quarter of a second, I think. Feel free to grab Anomie's timing docs and a 65816 doc to do more precise math.
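
For a closer count, here is the loop from the trace annotated per instruction. This is a sketch under stated assumptions, not a measurement: it assumes 8 master cycles per ROM access in bank $00, 8 per WRAM access, 6 per internal cycle, and an NTSC master clock of roughly 21.477 MHz. WRAM refresh pauses are ignored.

```asm
; A is 8-bit, X is 16-bit, native mode.
fill:   STA $7F0000,x     ; 5 CPU cycles: 4 ROM fetches + 1 WRAM write   = 40 master cycles
        DEX               ; 2 CPU cycles: 1 ROM fetch + 1 internal       = 14 master cycles
        BNE fill          ; 3 CPU cycles (taken): 2 fetches + 1 internal = 22 master cycles
; Per byte:   10 CPU cycles, 76 master cycles.
; Per clear:  65536 * 76 = 4,980,736 master cycles, about 0.23 seconds.
; DMA moves one byte per 8 master cycles, so the same 64K would take
; 65536 * 8 = 524,288 master cycles, about 0.024 seconds.
```

At 20 calls per load, that difference is roughly 4 seconds.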

How and when is this particular loop being called? If my estimate is correct, it can't be called too terribly often or you'd have massive delay.

Simple fix as MoN said is to do DMA to WRAM registers. Alternatively, just doing 16-bit manual copy or MVN will cut it down significantly.
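
The two non-DMA alternatives might look roughly like this. Both are sketches that assume the whole of bank $7F is to be filled with $FF; the label name is hypothetical.

```asm
; 16-bit manual fill: two bytes per iteration.
        PHP
        REP #$30          ; 16-bit A and X/Y
        LDA #$FFFF
        LDX #$0000
fill16: STA $7F0000,x     ; writes offsets X and X+1
        INX
        INX
        BNE fill16        ; X wraps to $0000 after $FFFE, ending the loop
        PLP

; MVN fill: seed one byte, then let an overlapping block move spread it.
        PHP
        PHB               ; MVN leaves DB set to the destination bank
        SEP #$20
        LDA #$FF
        STA $7F0000       ; seed byte
        REP #$30
        LDX #$0000        ; source offset
        LDY #$0001        ; destination offset, one byte ahead of the source
        LDA #$FFFE        ; count minus one: 65535 bytes
        MVN $7F,$7F       ; same bank on both sides, so operand order does not matter
        PLB
        PLP
```

The 16-bit loop works out to 13 CPU cycles per two bytes, and MVN costs 7 CPU cycles per byte, against 10 per byte for the original loop.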

Also, it's probably not even needed. When you decompress a stream, you're typically not reading ahead in the output buffer. What purpose would it serve? I don't see why you'd need to clear it out every time it's run. Did you check whether the game would fail without it and, if so, why?
TransCorp - Over 20 years of community dedication.
Dual Orb 2, Wozz, Emerald Dragon, Tenshi No Uta, Glory of Heracles IV SFC/SNES Translations


Quote from: MathOnNapkins on April 06, 2011, 11:24:31 AM
Which DMA registers is it using, $2180? $2180 makes sense for stuff like that, as it provides a nice way to write to WRAM
Every time it writes a byte, it resets $2181-2183 and then writes to $2180.
Actually, it loads $2181/2 from a value in RAM, and it reads $2183 (which I hear is not meant to be read, so the result is garbage), then does ORA #$01, STA $2183. That works because I read that only bit 0 of $2183 is used anyway, and the ORA guarantees it is a 1. But why not just an LDA #$01 if they really need to reset the WRAM address every time?

Instead of just something like "ldx $offset sta $7F0000,x" Actually, I just replaced the original routine with that and NOPed the rest and it ran.

Using Geiger's debugger, I count about 6 calls each time it loads the overworld (when entering and again after each fight), and the first two "interior" maps average about 20 calls per load.
Resulting in about a 9 second load time for the overworld and about 13 for the interiors (using Windows clock to time it). It also stops to reload each time you exit a battle, as well.


Is the WRAM target address in $2181-$2182 predictable or is it all over the place? Is DMA actually being used or is it something like

LDA someArray, X
STA $002181

LDA someArray+1, X
STA $002182

LDA $002183
ORA #$01
STA $002183

LDA someOtherArray, Y
STA $2180

I mean, is this actually compressed data or is it just a fancy way of coordinating the writes to WRAM?

If the offsets for $2181 are predictable, you can also automate that with a DMA transfer, sending up to 0x100 bytes at a time, and then messing with $2182 between DMA transfers. Automating with all 4 registers $2180-$2183 being written to is of course also an option, but for a lot of data you're going to waste a lot of space on the address registers. $2183 is supposed to be fixed anyway at 0x01 (bank 0x7F) in this instance. Would have been nice if there were a "write to register p, p+1, p+2" option for DMA, but no such luck.


Glancing at my notes, it seems Y is the RAM write position
REP #$20
STA $2181

I'm tempted to try to rewrite the whole routine myself. It does JSRs with lots of push/pulls, which adds up to a lot of seemingly extra cycles during the actual data decompression.

Using DMA for the 64K clear routine saw results in bsnes. Reduced overworld loading to about 6 seconds and interiors to about 8, which is about a 33-40% reduction.


I really doubt there is a real need to clear 64K of RAM each time you decompress something, such as 20 times per load. I'd strongly recommend looking into this to see why that might be and whether it really is necessary. Also, even if it is needed for some reason, maybe it doesn't need to clear all 64K, or maybe it doesn't need to do it more than once. Hell, it's also highly doubtful the game would ever actually decompress 64K of anything in RAM, especially 20 times per load. That's half the total RAM in the system at a time and serious processing! It's likely all much smaller data lengths and the coders simply didn't know better. You should definitely be able to save more time looking into this.

It's laughable that they'd use $218X loading up with a new address for each SINGLE byte. That defeats the entire purpose of using it. I agree with MoN. See if it's sequential and/or predictable. I imagine it would be. I can't see each byte write going to totally different RAM locations during a decompression sequence. Then either fix up the routine to properly use $218X or DMA it as appropriate as mentioned.
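
If the writes are sequential, using the port properly could look something like the sketch below. It assumes Y holds the output offset within bank $7F (as in KingMike's notes) and that SrcPtr is a hypothetical 24-bit direct-page pointer to the next byte to emit; neither name comes from the game's code.

```asm
; Once per block: point the WRAM port at the output position.
        REP #$20          ; 16-bit A
        TYA
        STA $2181         ; one 16-bit store fills $2181 and $2182
        SEP #$20          ; 8-bit A
        LDA #$01
        STA $2183         ; address bit 16 = 1 selects bank $7F

; Per output byte: just write the data port.
        LDA [SrcPtr]      ; hypothetical source pointer
        STA $2180         ; the port address increments by itself after each write
```

The address registers would then only need rewriting when the output position actually jumps.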



I forget exactly what the error was, but I do recall it is one of only two games where I had to modify my template LZ tool to compensate for a bug in the game's decompressor (something to do with handling buffer wrap-around), as naturally the game's original graphics were compressed with the bug in effect.
But at least it's not as bad as the other game, Jelly Boy 2. It's a HiROM game, yet its decompressor detects when the input stream pointer crosses a 32KB "boundary", so it jumps from ROM offset x7FFF to (x+1)0000. I'm amazed the end credits music was the only thing broken in that one.

NEW QUESTION: It's doing SEP/REP #$02, which seems to be the same thing as SEC/CLC except wasting a cycle and a byte, right? :P


Quote from: KingMike on April 07, 2011, 08:36:01 PM
NEW QUESTION: It's doing SEP/REP #$02, which seems to be the same thing as SEC/CLC except wasting a cycle and a byte, right? :P
Carry is the first bit, not the second. Looks like the zero flag is the second bit. (No, I'm not sure.)
Randomize your FF6 experience!


Yep, that's the zero flag. Since there's no CLZ (a hypothetical "clear zero flag" instruction), it's only a waste if they're doing something like:

REP #$02
REP #$20

which naturally could be combined as

REP #$22

I've seen optimizations like:

REP #$31 ; switch to 16-bit A/X/Y registers and clear carry

used in anticipation of an ADC instruction, which saves the separate CLC if there is in fact an ADC occurring shortly thereafter. Of course, that's a pretty paltry optimization, as it usually is only of any use at the beginning of a subroutine. Not to mention it makes the code harder to read. But hey, it saves a byte of code size and two CPU cycles so it's super worth it! </fake enthusiasm>
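
For reference, a short sketch of how the REP/SEP mask bits line up with the processor status flags in native mode:

```asm
; Status register bits, usable as REP/SEP masks:
;   $01 C carry         $02 Z zero               $04 I IRQ disable   $08 D decimal
;   $10 X index size    $20 M accumulator size   $40 V overflow      $80 N negative
        REP #$01          ; same effect as CLC
        SEP #$01          ; same effect as SEC
        REP #$02          ; clears Z; no dedicated opcode does this
        SEP #$02          ; sets Z
        REP #$22          ; clears M and Z together
        REP #$31          ; clears M, X and C: 16-bit A/X/Y with carry clear
```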


I've fixed my replacement decompression routine.
Now finishes each load in about 5 seconds.

While it is LZ, it seems each LZ offset pair contains a relative offset, which it expects to resolve to an absolute 16-bit signed pointer (+ value, it reads from the decompressed data and writes to the insertion point, - value, it writes zeros until either it has finished the copy length or the pointer has wrapped back to a positive value).
So it would seem to have only a 32KB accessible range, but when I changed the DMA routine to only clear 32KB, it glitched the graphics in one of the early maps (a few tiles had wrong colors). Changing the clear routine back to 64K fixed that.