65816 (SNES) Shift and Add-3 (converts binary to bcd/decimal)

Started by gauveldt, April 07, 2014, 02:21:37 PM

Previous topic - Next topic

gauveldt

I've seen others in various romhacking posts and other places avoid writing routines that convert binary numbers to decimal to display results of calculations or otherwise on screen.  Any decent game needs to do this really and it's not all that hard to do.  One will note that professional games rarely store values in RAM using BCD format as it would use up a lot of extra memory.

The way to do this very quickly is an algorithm called Shift and Add-3 or "double dabble".

It's not that hard to explain in pseudocode:
Edit: so much for pseudo code.... I basically recalled the entire ASM from memory while house sitting and am away from the computer I actually write code on... o_O

1. start with your binary number padding with 3 nybbles or so for every 3 bits (or portion) of input (or you know that a 32-bit unsigned 4 byte integer has a max value of 4.29 billion and requires 10 digits hence 10 nybbles).

That's the original binary value with zero bytes (10 nybbles after it):
in0 in1 in2 in3 nyb01 nyb23 nyb45 nyb67 nyb89 = 9 bytes

B2D_DIGITS = 10 bytes (stick this somewhere in wram or bw-ram)

we're going to use 16-bit operations so let's pad that odd byte with a 00
in0 in1 in2 in3 nyb01 nyb23 nyb45 nyb67 nyb89 00 = 10 bytes
clear nyb00 thru to the pad byte after nyb89 inclusive to $00

; precondition: in0 thru in3 are set to the 32-bit value to convert
;   -- pad lesser sizes (ie: bytes, words, 24-bit ptrs) with $00 in higher order inX values)
Bin2Dec:
-   ldx.w #$09
    lda #$00
    sta nyb01,x
    dex
    bpl -


set the shift count to 0:


stz B2D_SHIFTCOUNT      ; just define this to be somewhere in WRAM (or BW-RAM if you use SA-1)


2. shift everything left starting from in0,in1 and working towards nyb89 00

B2D_Loop:
rep #$20
clc
rol in0        ; +in1
rol in2        ; +in3
rol nyb01   ; +nyb23
rol nyb45   ; +nyb67
rol nyb89   ; +00
sep #$20

3. count this shift.  we are done if this is 32nd shift.

inc B2D_SHIFTCOUNT
lda B2D_SHIFTCOUNT
cmp #$20
beq B2D_UnpackDigits

4. now the fun part...  we scan each byte in nyb01 thru nyb89 for nybbles >= 5 and add 3 any that are
instead of trying to mess with a whole shwak of logic ops shifts, pushes, pops, etc I used a lookup table to shirnk it down to only a few instructions, the most complicated being to turn the read nybbles into a pointer value (in my code I tend to leave the index regs at 16 bits.  The 160-byte lookup table will be described at the end).  NB: The order you do the adds does not matter only that you check every nybble.

ldy.w #$04
-   lda nyb01,y
    xba
    lda #0
    xba       ; makes xxnn into 00nn
    rep #$20
    tax
    sep #$20
    lda Add3Tbl,x
    sta nyb01,y
    dey
    bpl -

5. Repeat for remaining shifts

    bra B2D_Loop

6. Technically we are done the heavy part but these will be awkward to use in printing functions and they will need to unpack the nybbles to display text so we may as well do the unpacking in the same routine:

B2D_UnpackDigits:
    ; we will unpack so the MSB digits come first as humans would read them
    ldx.w #9        ; points to digit to store lsd
    ldy.w #0        ; points to nybbles lsb
-   lda nyb01,y
    pha
    and #$0f
    sta B2D_DIGITS,x
    dex
    pla
    lsr : lsr : lsr : lsr
    and #$0f
    sta B2D_DIGITS,x
    dex
    iny
    cpy #$05
    bne -

7. That's all, folks

    rts


The look up table:

first 16 bytes the table looks as follows (based on a truth table for shift and add-3):
$00,$01,$02,$03,$04,$08,$09,$0a,$0b,$0c
For the table I've zeroed the unreachable entries (it needs to be padded to 16 bytes)
this is the horizontal part of the table that takes care of the low nybble...
now we repeat the pattern with the upper nybble counting the same pattern vertically:

Add3Tbl:
db $00,$01,$02,$03,$04,$08,$09,$0a,$0b,$0c,$00,$00,$00,$00,$00,$00
db $10,$11,$12,$13,$14,$18,$19,$1a,$1b,$1c,$00,$00,$00,$00,$00,$00
db $20,$21,$22,$23,$24,$28,$29,$2a,$2b,$2c,$00,$00,$00,$00,$00,$00
db $30,$31,$32,$33,$34,$38,$39,$3a,$3b,$3c,$00,$00,$00,$00,$00,$00
db $40,$41,$42,$43,$44,$48,$49,$4a,$4b,$4c,$00,$00,$00,$00,$00,$00
db $80,$81,$82,$83,$84,$88,$89,$8a,$8b,$8c,$00,$00,$00,$00,$00,$00
db $90,$91,$92,$93,$94,$98,$99,$9a,$9b,$9c,$00,$00,$00,$00,$00,$00
db $a0,$a1,$a2,$a3,$a4,$a8,$a9,$aa,$ab,$ac,$00,$00,$00,$00,$00,$00
db $b0,$b1,$b2,$b3,$b4,$b8,$b9,$ba,$bb,$bc,$00,$00,$00,$00,$00,$00
db $c0,$c1,$c2,$c3,$c4,$c8,$c9,$ca,$cb,$cc

You can safely lose the last six 00 bytes on the final ($cx) row, making this a 154-byte table.

There now everyone should be able to write their players' experience and money to the screen in decimal rather than hex :P

Disch

How does this perform compared to the more traditional "keep dividing by 10 until you run out of data" approach?  While clever, it seems like this would be slower (and with the LUT, take up more ROM space).  This is a pretty sizable loop that has to run 32 times... whereas the division loop would only have to run at maximum 10 times.

gauveldt

Quote from: Disch on April 09, 2014, 12:31:17 PM
How does this perform compared to the more traditional "keep dividing by 10 until you run out of data" approach?  While clever, it seems like this would be slower (and with the LUT, take up more ROM space).  This is a pretty sizable loop that has to run 32 times... whereas the division loop would only have to run at maximum 10 times.
Staying in integer land with contstant complexity and no divisions to require the multiply/divide hardware just to write out numbers seems like a suitable tradeoff for the small lookup table.

Not using the hardware, the divide by 10 becomes a loop to count how many tens count to the input value per pass, making it O(n*n).

The routine may be optimized for specific numeral sizes (ie: 16-bit word maximums could be use a version of the above algorithm only using 16 passes).

Disch

Constant complexity is not necessarily a good thing.  "Early out" approaches are typically better performers.

And if the division hardware is there... why not use it?  Especially if using it results in faster and smaller code.

Measuring performance is pretty easy on the SNES.  There's really only 2 things you'd need to look at:  cycles and bytes.  IE:  how fast is the code, and how big is it.

There's some wiggle room where some games might want faster code at the cost of a few extra bytes... and other games might want smaller code that runs a little slower.  But if an implementation is both big and slow, it's not favorable.

I'm not trying to be judgmental or anything.  This is definitely an interesting approach... I'm just curious as to how efficient it is.  If it doesn't perform as well as the divide-by-10 approach, then it doesn't seem very practical to me.

I might whip up a divide-by-10 routine later and tally up the cycle count to see how it compares.


QuoteNot using the hardware, the divide by 10 becomes a loop to count how many tens count to the input value per pass, making it O(n*n).

Yeah but not using the hardware is retarded.  That'd be like trying to do this without using ROL.  Sure you could do it, but what's the point?

QuoteThe routine may be optimized for specific numeral sizes (ie: 16-bit word maximums could be use a version of the above algorithm only using 16 passes).

Well... the 16-bit divide-by-10 approach would have 5 passes... at most (frequently it would have less -- early outs are good)

assassin

i like Andrew Jacobs' "More Hexadecimal to Decimal Conversion" approach here:

http://www.6502.org/source/

you'd want to modify it to include 16-bit instructions, as well as to handle 32-bit input and output variables.  should be a trivial adjustment.

by the author's admission, it uses somewhat more cycles.  but on the bright side, it takes less space and doesn't make my eyes bleed.

i understand nothing from the original post here, and regret having even looked at it, but will offer that the "lda #$00" can be moved in front of the "Bin2Dec" loop. :P

gauveldt

Quote from: assassin on April 27, 2014, 07:41:16 AM
i understand nothing from the original post here, and regret having even looked at it, but will offer that the "lda #$00" can be moved in front of the "Bin2Dec" loop. :P
'
Read either source link (copied from the OP) for information about the algorithm:
http://en.wikipedia.org/wiki/Double_dabble
http://people.ee.duke.edu/~dwyer/courses/ece52/Binary_to_BCD_Converter.pdf



April 27, 2014, 11:46:46 AM - (Auto Merged - Double Posts are not allowed before 7 days.)

Quote from: Disch on April 09, 2014, 01:49:09 PM
Well... the 16-bit divide-by-10 approach would have 5 passes... at most (frequently it would have less -- early outs are good)

So this is why games slow down in the menu when you are at 9999/9999 HP, 999/999 MP and have 9,999,999 GP. :P

FinS

This might be too simple but you could switch to decimal mode and bitshift and add the decimal value of each shifted bit. It would require approximately 42 operations at about 4 cycles per op for the first 13 bits then 21 more operations for the last 3 bits because the converted decimal value would exceed the 16 bit limit. So in total to convert a 16 bit number it would only cost 63 operations and around 250 cycles.

assassin

Quote from: gauveldtRead either source link (copied from the OP) for information about the algorithm:

thanks.  i should have hit those links before having a brain meltdown from staring at the code.  my bad.

----

Quote from: FinSThis might be too simple but you could switch to decimal mode and bitshift and add the decimal value of each shifted bit.

that's what the code i linked does.  it doesn't separate the 16 iterations into the 13/3 you propose, but just does all 16 the same way.  so it's giving up a few dozen cycles in exchange for saving about a dozen bytes.

Bregalad

QuoteNot using the hardware, the divide by 10 becomes a loop to count how many tens count to the input value per pass, making it O(n*n).
No this is completely wrong. You can divide an integer by 10 without any dividing instruction/hardware and in constant time.

If you're interested as an example in how exactly I can link to here (I wrote that) it's not 65816 it's plain old 6502, but the concept applies to all CPUs without a divide instruction.
In all simplicity, this means you have to compare with 10, 20, 40, 80, etc... (in reverse order) until the max. value you want to support for the quotient.

FinS

Here's a method I got out of a game then altered for optimization. It just subtracts the hex values of each decimal digit and counts how many times it can be subtracted. Then places the digits one per byte starting at $20 ram.  For the largest number ($FFFFFFFF = 4,294,967,295) it takes 776 operations to do the conversion. It could be further optimized by using another subroutine for the last 4 which don't require math on the 16 high bits.

//enter hex equivalents for each decimal digit in $04 and $06, largest first
.al
.xl
REP #$30               
STZ $0A   
LDX $0A
LDA #$CA00
STA $04   
LDA #$3B9A 
STA $06   
JSR SUBTR
LDA #$E100 
STA $04   
LDA #$05F5 
STA $06   
JSR SUBTR
LDA #$9680 
STA $04   
LDA #$0098 
STA $06   
JSR SUBTR
LDA #$4240 
STA $04   
LDA #$000F 
STA $06   
JSR SUBTR
LDA #$86A0 
STA $04   
LDA #$0001 
STA $06   
JSR SUBTR
LDA #$2710 
STA $04   
STZ $06   
JSR SUBTR
LDA #$03E8 
STA $04   
STZ $06   
JSR SUBTR
LDA #$0064 
STA $04   
STZ $06   
JSR SUBTR
LDA #$000A 
STA $04   
STZ $06   
JSR SUBTR
.as
SEP #$20
LDA $00
STA $20,X
RTS   

//subtract equivalents from hex number in $00 and $02, increment base 10 digit
SUBTR
.al
.xl
REP #$30 
STZ $08   
YOKE
SEC       
LDA $00   
SBC $04   
TAY
LDA $02   
SBC $06   
BCC VALE 
STY $00
STA $02
INC $08   
BRA YOKE   
VALE
.as
SEP #$20 
LDA $0A   // $0a is flag to check if a number has come up yet
BNE XRAY 
LDA $08   
BNE ZOO
LDA #$20    // $20 is tile value for empty space if no numbers have come up yet
BRA WHY   
ZOO 
INC $0A 
XRAY
LDA $08   
WHY 
STA $20,x
INX       
.al
REP #$20               
RTS