News:

11 March 2016 - Forum Rules

Main Menu

6 bit text encoding

Started by Pennywise, September 03, 2020, 12:29:01 PM

Previous topic - Next topic

abw

If you're setting up a brand new text encoding from scratch and your primary concern is being able to fit a large amount of text into a small amount of space, then yes, using some encoding that's specifically designed to compress text will very likely give a better compression ratio than some encoding that's not designed to do that.

Pennywise's question, however, does not appear to be about choosing a particular compression algorithm (he may well also have that in mind, but that's not what he actually asked). Instead, I suspect (since we've been talking about it elsewhere) that he's looking at a particular game that uses 6-bit tokens and is trying to get a better understanding of how that's implemented.

With that in mind, here's a commented example of reading a stream of 6-bit tokens and converting them to 8-bit tokens from Dragon Quest IV (NES):

; read the next 6-bit token into A, updating script variables appropriately
; control flow target (from $DF76, $DF8F)
0x03DFA9|$0F:$DF99:A0 00    LDY #$00   
0x03DFAB|$0F:$DF9B:B1 49    LDA ($49),Y ; read current token byte 0
0x03DFAD|$0F:$DF9D:85 5F    STA $5F    ; store current token byte 0 in $5F
0x03DFAF|$0F:$DF9F:A5 5E    LDA $5E    ; read token counter
0x03DFB1|$0F:$DFA1:E6 5E    INC $5E    ; increment token counter
0x03DFB3|$0F:$DFA3:29 03    AND #$03    ; check low 2 bits of token counter: a sequence of 6-bit tokens can only be aligned in 4 possible ways relative to 8-bit boundaries
0x03DFB5|$0F:$DFA5:F0 10    BEQ $DFB7  ; handle 0bSSSSSSxx; we're going to need to re-read the current script byte for the next token, so don't update the script read address
0x03DFB7|$0F:$DFA7:20 D5 DF JSR $DFD5  ; update the script read address, taking into account reserved regions of RAM, the script crossing bank boundaries, etc.; leaves A unchanged
0x03DFBA|$0F:$DFAA:C9 02    CMP #$02   
0x03DFBC|$0F:$DFAC:90 0E    BCC $DFBC  ; 0 was handled earlier, so A < 2 => A == 1; handle 0bxxxxxxSS 0bSSSSxxxx
0x03DFBE|$0F:$DFAE:F0 18    BEQ $DFC8  ; A == 2; handle 0bxxxxSSSS 0bSSxxxxxx
0x03DFC0|$0F:$DFB0:A5 5F    LDA $5F    ; A == 3; handle 0bxxSSSSSS (the easy case); load current token byte 0
0x03DFC2|$0F:$DFB2:29 3F    AND #$3F    ; and chop off the bits we don't care about
0x03DFC4|$0F:$DFB4:60      RTS       


; code -> unknown
0x03DFC5|$0F:$DFB5:EA      ; NOP       
0x03DFC6|$0F:$DFB6:EA      ; NOP       

; unknown -> code
; handle 0bSSSSSSxx
; control flow target (from $DFA5)
0x03DFC7|$0F:$DFB7:A5 5F    LDA $5F    ; load current token byte 0
0x03DFC9|$0F:$DFB9:4A      LSR        ; shift from 0bSSSSSSxx to 0bxSSSSSSx
0x03DFCA|$0F:$DFBA:4A      LSR        ; shift from 0bxSSSSSSx to 0bxxSSSSSS
0x03DFCB|$0F:$DFBB:60      RTS       

; handle 0bxxxxxxSS 0bSSSSxxxx
; control flow target (from $DFAC)
0x03DFCC|$0F:$DFBC:B1 49    LDA ($49),Y ; $DFD5 updated $49-$4A to the next script byte, so this reads current token byte 1
0x03DFCE|$0F:$DFBE:46 5F    LSR $5F    ; shift from 0bxxxxxxSS 0bSSSSxxxx to 0bxxxxxxxS 0bSSSSSxxx
0x03DFD0|$0F:$DFC0:6A      ROR       
0x03DFD1|$0F:$DFC1:46 5F    LSR $5F    ; shift from 0bxxxxxxxS 0bSSSSSxxx to 0bxxxxxxxx 0bSSSSSSxx
0x03DFD3|$0F:$DFC3:6A      ROR       
0x03DFD4|$0F:$DFC4:4A      LSR        ; shift from 0bSSSSSSxx to 0bxSSSSSSx
0x03DFD5|$0F:$DFC5:4A      LSR        ; shift from 0bxSSSSSSx to 0bxxSSSSSS
0x03DFD6|$0F:$DFC6:60      RTS       


; code -> unknown
0x03DFD7|$0F:$DFC7:EA      ; NOP       

; unknown -> code
; handle 0bxxxxSSSS 0bSSxxxxxx
; control flow target (from $DFAE)
0x03DFD8|$0F:$DFC8:B1 49    LDA ($49),Y ; $DFD5 updated $49-$4A to the next script byte, so this reads current token byte 1
0x03DFDA|$0F:$DFCA:0A      ASL        ; shift from 0bxxxxSSSS 0bSSxxxxxx to 0bxxxSSSSS 0bSxxxxxxx
0x03DFDB|$0F:$DFCB:26 5F    ROL $5F   
0x03DFDD|$0F:$DFCD:0A      ASL        ; shift from 0bxxxSSSSS 0bSxxxxxxx to 0bxxSSSSSS 0bxxxxxxxx
0x03DFDE|$0F:$DFCE:26 5F    ROL $5F   
0x03DFE0|$0F:$DFD0:A5 5F    LDA $5F    ; load 0bxxSSSSSS
0x03DFE2|$0F:$DFD2:29 3F    AND #$3F    ; and chop off the bits we don't care about
0x03DFE4|$0F:$DFD4:60      RTS       

Basically that keeps track of the current script read address in $49-$4A and the number of tokens that have been read so far (from which we can deduce the alignment of the current token) in $5E, then reads the current script byte into $5F and (if necessary) also reads the next script byte, and then performs some bit shifts to get the 6 desired bits into the low 6 bits of A, setting the high 2 bits of A to 0.

After that, here's an example of using context (table switching) while converting from 6-bit tokens (or 8-bit tokens with the high 2 bits clear, depending on how you want to think of them) to characters (PPU tile IDs):

; control flow target (from $86D1)
; call to code in a different bank ($1F:$DF86)
0x0586CA|$16:$86BA:20 86 DF JSR $DF86  ; swap in needed banks then call $DF99 to read the next 6-bit token into A
0x0586CD|$16:$86BD:C9 3C    CMP #$3C    ; table switch tokens are #$3C - #$3F
0x0586CF|$16:$86BF:90 05    BCC $86C6  ; A < #$3C => normal table entry
0x0586D1|$16:$86C1:F0 06    BEQ $86C9  ; A == #$3C => switch between hiragana and katakana tables
0x0586D3|$16:$86C3:4C D4 86 JMP $86D4  ; A > #$3C => switch to dictionary table

; handle normal table entry
; control flow target (from $86BF)
0x0586D6|$16:$86C6:4C 4E 87 JMP $874E  ; target code is located too far away to use a branch, so JMP

; switch between hiragana and katakana tables
; control flow target (from $86C1)
0x0586D9|$16:$86C9:AD 66 05 LDA $0566  ; current table ID
0x0586DC|$16:$86CC:49 01    EOR #$01    ; switch between hiragana and katakana tables
0x0586DE|$16:$86CE:8D 66 05 STA $0566  ; current table ID
0x0586E1|$16:$86D1:4C BA 86 JMP $86BA  ; loop to read next token

; ...

; handle normal table entry
0x05875E|$16:$874E:85 5F    STA $5F    ; current 6-bit token 0b00SSSSSS
0x058760|$16:$8750:AD 66 05 LDA $0566  ; current table ID
0x058763|$16:$8753:0A      ASL        ; shift from 0bxxxxxxTT to 0bTTxxxxxx (yes, ROR x 3 would have been both faster and shorter)
0x058764|$16:$8754:0A      ASL       
0x058765|$16:$8755:0A      ASL       
0x058766|$16:$8756:0A      ASL       
0x058767|$16:$8757:0A      ASL       
0x058768|$16:$8758:0A      ASL       
0x058769|$16:$8759:18      CLC       
0x05876A|$16:$875A:65 5F    ADC $5F    ; combine table ID and 6-bit token to get 0bTTSSSSSS (yes, ORA $5F would have been both faster and shorter)
0x05876C|$16:$875C:AA      TAX        ; use combined value as an index into the hiragana/katakana mapping table
0x05876D|$16:$875D:BD 65 87 LDA $8765,X ; get the 8-bit value (usually a hiragana/katakana PPU tile ID)
0x058770|$16:$8760:C9 F0    CMP #$F0    ; check for control code #$F0 (wait for input)
0x058772|$16:$8762:F0 B7    BEQ $871B  ; if found, go handle #$F0
0x058774|$16:$8764:60      RTS        ; otherwise return the new 8-bit value

Basically that utilizes some extra storage in $0566 to remember the current table, updates it whenever a table switch token is read, and, in conjunction with the 6-bit token that was read earlier, uses it as an index into a lookup table to get an 8-bit value (usually a PPU tile ID, but could also be a control code) and then returns that for its caller to process (write to PPU or handle control code).

For another example, Dragon Warrior II (NES) uses a 5-bit encoding with a different implementation of table switching; you can check the commentary on https://datacrystal.romhacking.net/wiki/Dragon_Warrior_II::ROM_map/ASM_bank_02 starting at $02:$B393 for further details.

Quote from: slidelljohn on September 05, 2020, 04:39:41 AM
You don't need another 6-bit table for 6 extra tokens. If you need more tokens you might
as well add some sort of compression.
Yes, if you're going to make up your own encoding, you can definitely come up with something that has better space efficiency than using 6 bits to store 1 bit of information, but again, I don't think that's what the question was about ;).

slidelljohn

#21
 :thumbsup:

Anime_World

Quote from: abw on September 05, 2020, 12:06:28 PM
If you're setting up a brand new text encoding from scratch and your primary concern is being able to fit a large amount of text into a small amount of space, then yes, using some encoding that's specifically designed to compress text will very likely give a better compression ratio than some encoding that's not designed to do that.

Pennywise's question, however, does not appear to be about choosing a particular compression algorithm (he may well also have that in mind, but that's not what he actually asked). Instead, I suspect (since we've been talking about it elsewhere) that he's looking at a particular game that uses 6-bit tokens and is trying to get a better understanding of how that's implemented.

With that in mind, here's a commented example of reading a stream of 6-bit tokens and converting them to 8-bit tokens from Dragon Quest IV (NES):

; read the next 6-bit token into A, updating script variables appropriately
; control flow target (from $DF76, $DF8F)
0x03DFA9|$0F:$DF99:A0 00    LDY #$00   
0x03DFAB|$0F:$DF9B:B1 49    LDA ($49),Y ; read current token byte 0
0x03DFAD|$0F:$DF9D:85 5F    STA $5F    ; store current token byte 0 in $5F
0x03DFAF|$0F:$DF9F:A5 5E    LDA $5E    ; read token counter
0x03DFB1|$0F:$DFA1:E6 5E    INC $5E    ; increment token counter
0x03DFB3|$0F:$DFA3:29 03    AND #$03    ; check low 2 bits of token counter: a sequence of 6-bit tokens can only be aligned in 4 possible ways relative to 8-bit boundaries
0x03DFB5|$0F:$DFA5:F0 10    BEQ $DFB7  ; handle 0bSSSSSSxx; we're going to need to re-read the current script byte for the next token, so don't update the script read address
0x03DFB7|$0F:$DFA7:20 D5 DF JSR $DFD5  ; update the script read address, taking into account reserved regions of RAM, the script crossing bank boundaries, etc.; leaves A unchanged
0x03DFBA|$0F:$DFAA:C9 02    CMP #$02   
0x03DFBC|$0F:$DFAC:90 0E    BCC $DFBC  ; 0 was handled earlier, so A < 2 => A == 1; handle 0bxxxxxxSS 0bSSSSxxxx
0x03DFBE|$0F:$DFAE:F0 18    BEQ $DFC8  ; A == 2; handle 0bxxxxSSSS 0bSSxxxxxx
0x03DFC0|$0F:$DFB0:A5 5F    LDA $5F    ; A == 3; handle 0bxxSSSSSS (the easy case); load current token byte 0
0x03DFC2|$0F:$DFB2:29 3F    AND #$3F    ; and chop off the bits we don't care about
0x03DFC4|$0F:$DFB4:60      RTS       


; code -> unknown
0x03DFC5|$0F:$DFB5:EA      ; NOP       
0x03DFC6|$0F:$DFB6:EA      ; NOP       

; unknown -> code
; handle 0bSSSSSSxx
; control flow target (from $DFA5)
0x03DFC7|$0F:$DFB7:A5 5F    LDA $5F    ; load current token byte 0
0x03DFC9|$0F:$DFB9:4A      LSR        ; shift from 0bSSSSSSxx to 0bxSSSSSSx
0x03DFCA|$0F:$DFBA:4A      LSR        ; shift from 0bxSSSSSSx to 0bxxSSSSSS
0x03DFCB|$0F:$DFBB:60      RTS       

; handle 0bxxxxxxSS 0bSSSSxxxx
; control flow target (from $DFAC)
0x03DFCC|$0F:$DFBC:B1 49    LDA ($49),Y ; $DFD5 updated $49-$4A to the next script byte, so this reads current token byte 1
0x03DFCE|$0F:$DFBE:46 5F    LSR $5F    ; shift from 0bxxxxxxSS 0bSSSSxxxx to 0bxxxxxxxS 0bSSSSSxxx
0x03DFD0|$0F:$DFC0:6A      ROR       
0x03DFD1|$0F:$DFC1:46 5F    LSR $5F    ; shift from 0bxxxxxxxS 0bSSSSSxxx to 0bxxxxxxxx 0bSSSSSSxx
0x03DFD3|$0F:$DFC3:6A      ROR       
0x03DFD4|$0F:$DFC4:4A      LSR        ; shift from 0bSSSSSSxx to 0bxSSSSSSx
0x03DFD5|$0F:$DFC5:4A      LSR        ; shift from 0bxSSSSSSx to 0bxxSSSSSS
0x03DFD6|$0F:$DFC6:60      RTS       


; code -> unknown
0x03DFD7|$0F:$DFC7:EA      ; NOP       

; unknown -> code
; handle 0bxxxxSSSS 0bSSxxxxxx
; control flow target (from $DFAE)
0x03DFD8|$0F:$DFC8:B1 49    LDA ($49),Y ; $DFD5 updated $49-$4A to the next script byte, so this reads current token byte 1
0x03DFDA|$0F:$DFCA:0A      ASL        ; shift from 0bxxxxSSSS 0bSSxxxxxx to 0bxxxSSSSS 0bSxxxxxxx
0x03DFDB|$0F:$DFCB:26 5F    ROL $5F   
0x03DFDD|$0F:$DFCD:0A      ASL        ; shift from 0bxxxSSSSS 0bSxxxxxxx to 0bxxSSSSSS 0bxxxxxxxx
0x03DFDE|$0F:$DFCE:26 5F    ROL $5F   
0x03DFE0|$0F:$DFD0:A5 5F    LDA $5F    ; load 0bxxSSSSSS
0x03DFE2|$0F:$DFD2:29 3F    AND #$3F    ; and chop off the bits we don't care about
0x03DFE4|$0F:$DFD4:60      RTS       

Basically that keeps track of the current script read address in $49-$4A and the number of tokens that have been read so far (from which we can deduce the alignment of the current token) in $5E, then reads the current script byte into $5F and (if necessary) also reads the next script byte, and then performs some bit shifts to get the 6 desired bits into the low 6 bits of A, setting the high 2 bits of A to 0.

After that, here's an example of using context (table switching) while converting from 6-bit tokens (or 8-bit tokens with the high 2 bits clear, depending on how you want to think of them) to characters (PPU tile IDs):

; control flow target (from $86D1)
; call to code in a different bank ($1F:$DF86)
0x0586CA|$16:$86BA:20 86 DF JSR $DF86  ; swap in needed banks then call $DF99 to read the next 6-bit token into A
0x0586CD|$16:$86BD:C9 3C    CMP #$3C    ; table switch tokens are #$3C - #$3F
0x0586CF|$16:$86BF:90 05    BCC $86C6  ; A < #$3C => normal table entry
0x0586D1|$16:$86C1:F0 06    BEQ $86C9  ; A == #$3C => switch between hiragana and katakana tables
0x0586D3|$16:$86C3:4C D4 86 JMP $86D4  ; A > #$3C => switch to dictionary table

; handle normal table entry
; control flow target (from $86BF)
0x0586D6|$16:$86C6:4C 4E 87 JMP $874E  ; target code is located too far away to use a branch, so JMP

; switch between hiragana and katakana tables
; control flow target (from $86C1)
0x0586D9|$16:$86C9:AD 66 05 LDA $0566  ; current table ID
0x0586DC|$16:$86CC:49 01    EOR #$01    ; switch between hiragana and katakana tables
0x0586DE|$16:$86CE:8D 66 05 STA $0566  ; current table ID
0x0586E1|$16:$86D1:4C BA 86 JMP $86BA  ; loop to read next token

; ...

; handle normal table entry
0x05875E|$16:$874E:85 5F    STA $5F    ; current 6-bit token 0b00SSSSSS
0x058760|$16:$8750:AD 66 05 LDA $0566  ; current table ID
0x058763|$16:$8753:0A      ASL        ; shift from 0bxxxxxxTT to 0bTTxxxxxx (yes, ROR x 3 would have been both faster and shorter)
0x058764|$16:$8754:0A      ASL       
0x058765|$16:$8755:0A      ASL       
0x058766|$16:$8756:0A      ASL       
0x058767|$16:$8757:0A      ASL       
0x058768|$16:$8758:0A      ASL       
0x058769|$16:$8759:18      CLC       
0x05876A|$16:$875A:65 5F    ADC $5F    ; combine table ID and 6-bit token to get 0bTTSSSSSS (yes, ORA $5F would have been both faster and shorter)
0x05876C|$16:$875C:AA      TAX        ; use combined value as an index into the hiragana/katakana mapping table
0x05876D|$16:$875D:BD 65 87 LDA $8765,X ; get the 8-bit value (usually a hiragana/katakana PPU tile ID)
0x058770|$16:$8760:C9 F0    CMP #$F0    ; check for control code #$F0 (wait for input)
0x058772|$16:$8762:F0 B7    BEQ $871B  ; if found, go handle #$F0
0x058774|$16:$8764:60      RTS        ; otherwise return the new 8-bit value

Basically that utilizes some extra storage in $0566 to remember the current table, updates it whenever a table switch token is read, and, in conjunction with the 6-bit token that was read earlier, uses it as an index into a lookup table to get an 8-bit value (usually a PPU tile ID, but could also be a control code) and then returns that for its caller to process (write to PPU or handle control code).

For another example, Dragon Warrior II (NES) uses a 5-bit encoding with a different implementation of table switching; you can check the commentary on https://datacrystal.romhacking.net/wiki/Dragon_Warrior_II::ROM_map/ASM_bank_02 starting at $02:$B393 for further details.
Yes, if you're going to make up your own encoding, you can definitely come up with something that has better space efficiency than using 6 bits to store 1 bit of information, but again, I don't think that's what the question was about ;).

Perfect explanation @abw.

Pennywise

Thanks everyone.

Yes, I'm probably going to poke around with DQ4, which uses 6 bit encoding and wanted to some reading material so to speak when I start looking at the code.

I figured this public discourse would be beneficial to others since the subject is a bit uncommon.

Anime_World

Hint: abcde (atlas/cartographer) supports 6-bit encoding.