Romhacking.net

Romhacking => Programming => Topic started by: Pennywise on September 03, 2020, 12:29:01 pm

Title: 6 bit text encoding
Post by: Pennywise on September 03, 2020, 12:29:01 pm
What is the process behind encoding text as 6 bits and then converting it to 8 bits in the code? I'm assuming that's what happens with 6 bit text encoding.

I get the general idea is if you have 4 bytes/characters, with 6 bit encoding, they'll only take up 3 bytes of space. How exactly does that work? I know that instead of thinking of every as bytes, the magic happens on a binary level. Some type of bit shifting?
Title: Re: 6 bit text encoding
Post by: Raeven0 on September 03, 2020, 08:32:28 pm
Bit shifting works well on the SNES, where bytes are individually addressable and the accumulator can be 1 byte wide. Here's a naive 6-bit-to-8-bit decoder:
Code: [Select]
SR_DecodeText:
; Assumes: X = address of encoded source, Y = address of buffer destination
; Assumes: srcAddr>$8000, destAddr<$2000
;   (otherwise use long addressing for src and set DBR for dest)
; Assumes terminator byte is $00
.FirstByte:
lda $0000,x : lsr : lsr : sta $0000,y
  beq .Done
.SecondByte:
lda $0000,x : and #$03
  asl : asl : asl : asl : sta $0001,y
lda $0001,x : and #$F0
  lsr : lsr : lsr : lsr : ora $0001,y : sta $0001,y
  beq .Done
.ThirdByte:
lda $0001,x : and #$0F
  asl : asl : sta $0002,y
lda $0002,x : and #$C0
  lsr : lsr : lsr : lsr : lsr : lsr : ora $0002,y : sta $0002,y
  beq .Done
.FourthByte:
lda $0002,x : and #$3F : sta $0003,y
  beq .Done
.ToNextGroup:
inx : inx : inx
iny : iny : iny : iny
bra .FirstByte
.Done:
rts
Not tested. Definitely not optimized.
Title: Re: 6 bit text encoding
Post by: slidelljohn on September 03, 2020, 09:18:13 pm
 :thumbsup:
Title: Re: 6 bit text encoding
Post by: abw on September 03, 2020, 10:02:21 pm
For a wordier somewhat generalized take on the issue:

With 8 bits, you can represent 2^8 = 256 different tokens (where a "token" could be a sequence of one or more text characters, a control code, etc.). With 6 bits, you can represent 2^6 = 64 different tokens. If you have more than 64 different tokens to encode, you can't do it with just 6 bits, so you need to use something else to provide additional information about your tokens, and that something else is context, where the meaning of a 6-bit token depends on not just its own bits but also on some other bits, almost always on bits that occur earlier in the input. So a 0b000000 token could mean different things depending on its context, which is to say it could mean different things depending on the tokens that have preceded it. As an example, maybe that 0b000000 starts off meaning "A", but if it comes after a 0b111111 token, it means "a" instead.

In ROM hacking, we usually represent a text encoding in a table file, and the extra tokens that provide the context necessary to determine the meaning of a token are sometimes referred to as table switches since switching the context used to interpret a token corresponds to switching to a different table. In order to encode some input, you'll need to have at least enough contexts (tables) to represent all of your tokens including some extra tokens for indicating context changes (table switches). I.e. if you have a total of 70 6-bit tokens, you'll need at least 2 64-entry tables to represent them all; if you have 170 6-bit tokens, you'll need at least 3 64-entry tables. Once you've decided on the specifics of the encoding scheme, you can convert your text input to binary and burn it to ROM (or whatever your storage medium actually is, since in practice we almost never touch physical ROM chips).

So, getting back to your question, when decoding a stream of context-sensitive 6-bit input tokens to an 8-bit output stream, the decoder has to a) chop the input up into groups of 6-bit tokens (on a byte-based processor this will indeed usually involve lots of bit shifting) and b) keep track of changes to the context it's supposed to use to interpret each of those tokens. After that, it has to map each token from 6 bits to 8 bits and output the 8-bit value (which could then be used as a PPU tile ID or a control code etc.). The mapping method will vary from game to game, but two of the simplest options are to add a fixed value depending on the context (e.g. if we're interpreting tokens as uppercase characters, then the 8-bit value could be the 6-bit value + $80, or if we're interpreting tokens as lowercase characters, then the 8-bit value could be the 6-bit value + $C0; The Battle of Olympus [NES] does something like this) or to use a separate lookup table for each context (e.g. if we're interpreting tokens as uppercase characters, then the 8-bit value could be given by the byte at RAM ($DEAD + the 6-bit value), or if we're interpreting tokens as lowercase characters, then the 8-bit value could be given by the byte at RAM ($BEEF + the 6-bit value); Dragon Quest IV [NES] does something like this).
Title: Re: 6 bit text encoding
Post by: slidelljohn on September 03, 2020, 11:40:11 pm
 :thumbsup:
Title: Re: 6 bit text encoding
Post by: Anime_World on September 03, 2020, 11:57:36 pm
An example in python:
Code: [Select]
output = b''
text = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
for character in text:
 value = ord(character) &0x3F
 output += value.to_bytes(1,'big')
print(output)

result will be:
Code: [Select]
01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10 11 12 13 14 15 15 17 18 19 1A

Now to reduce in a group of 3 bytes you can do:
Code: [Select]
output = b''
text = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
converted = []
for character in text:
 value = ord(character) &0x3F
 converted.append(value)

next = 0
for i in range(0, len(converted)//3):
 value = int('0b' + '{:06b}'.format(converted[next]) + '{:06b}'.format(converted[next+1]) + '{:06b}'.format(converted[next+2]), 2)
 output += value.to_bytes(3,'big')
 next += 3

print(output)

result will be:
Code: [Select]
00 10 83 00 41 46 00 72 09 00 A2 CC 00 D3 8F 01 04 52 01 35 15 01 65 D8
Or you can use a dict compression like LZ or Huffman.
Title: Re: 6 bit text encoding
Post by: slidelljohn on September 04, 2020, 07:53:34 pm
If you changed: text = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
to: text = 'aBcD'

What would the results of print(output) be after "Now to reduce in a group of 3 bytes you can do:"?
Title: Re: 6 bit text encoding
Post by: Anime_World on September 04, 2020, 11:20:47 pm
If you changed: text = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
to: text = 'aBcD'

What would the results of print(output) be after "Now to reduce in a group of 3 bytes you can do:"?

Code: [Select]
00 18 83

It's only a sketch... if u want to output de 4th char, some changes are needed.
Like this:
Code: [Select]
output = b''
text = 'JOHN'
converted = []
for character in text:
 value = ord(character) &0x3F
 converted.append(value)

next = 0
for i in range(0, len(converted)//4):
 value = int('0b' + '{:06b}'.format(converted[next]) + '{:06b}'.format(converted[next+1]) + '{:06b}'.format(converted[next+2]) + '{:06b}'.format(converted[next+3]), 2)
 output += value.to_bytes(3,'big')
 next += 4

print(output)

now result for "JOHN" will be:
Code: [Select]
28 F2 0E
Title: Re: 6 bit text encoding
Post by: slidelljohn on September 04, 2020, 11:42:49 pm
What are the results if text = 'jOhN?
Title: Re: 6 bit text encoding
Post by: Anime_World on September 04, 2020, 11:46:13 pm
What are the results if text = 'jOhN?

Code: [Select]
A8 FA 0E

Right?

Code: [Select]
>>> bin(ord('J'))
'0b1001010'
>>> bin(ord('j'))
'0b1101010'
Title: Re: 6 bit text encoding
Post by: slidelljohn on September 04, 2020, 11:52:02 pm
How does it go from 0x28, 0xF2, 0x0E to 0xA8, 0xFA, 0x0E?

Can you show me what the bits mean because it looks like data is missing?

Example in 8 bit, 4 bytes:
JOHN =
0x4a, 0x4f, 0x48, 0x4e
01001010, 01001111, 01001000, 01001110

Example in 6 bit 3 bytes:
JOHN =
001010, 001111, 001000, 001110

0x28, 0xF2, 0x0E
00101000, 11110010, 00001110
Title: Re: 6 bit text encoding
Post by: Anime_World on September 04, 2020, 11:57:43 pm
How does it go from 0x28, 0xF2, 0x0E to 0xA8, 0xFA, 0x0E?

Can you show me what the bits mean because it looks like data is missing?

Example in 8 bit, 4 bytes:
JOHN =
0x4a, 0x4f, 0x48, 0x4e
01001010, 01001111, 01001000, 01001110

Example in 6 bit 3 bytes:
JOHN =
001010, 001111, 001000, 001110

0x28, 0xF2, 0x0E
00101000, 11110010, 00001110


Code: [Select]
01101010 = j
01001111 = O
01101000 = h
01001110 = N

to 6-bits grouped:
Code: [Select]
   j     O       h     N
101010 001111 101000 001110 = A8FA0E
Title: Re: 6 bit text encoding
Post by: slidelljohn on September 05, 2020, 12:04:44 am
It looks like you are removing data to make it smaller than what it acutely is.
Why not show the full data?

Lets see the table(s).
Title: Re: 6 bit text encoding
Post by: Anime_World on September 05, 2020, 12:12:40 am
It looks like you are removing data to make it smaller than what it acutely is.
Why not show the full data?

Lets see the tables.

OK.  :thumbsup:
Check tables...
Title: Re: 6 bit text encoding
Post by: slidelljohn on September 05, 2020, 12:20:38 am
Did you post them? I dont see any.

September 05, 2020, 12:34:45 am - (Auto Merged - Double Posts are not allowed before 7 days.)
I don't understand the trickery. :-\
All you did was flip the 6th bit on the j and the h to make them lower case.

Example in 6 bit 3 bytes:
JOHN =
001010, 001111, 001000, 001110

Example in 6 bit 3 bytes:
jOhN =
101010, 001111, 101000, 001110

September 05, 2020, 12:49:48 am - (Auto Merged - Double Posts are not allowed before 7 days.)
Lower case, upper case, numbers 0-9, and (!,.?'"space) will not fit in a 64 6-bit table.
Title: Re: 6 bit text encoding
Post by: Anime_World on September 05, 2020, 02:52:15 am
Did you post them? I dont see any.

September 05, 2020, 12:34:45 am - (Auto Merged - Double Posts are not allowed before 7 days.)
I don't understand the trickery. :-\
All you did was flip the 6th bit on the j and the h to make them lower case.

Example in 6 bit 3 bytes:
JOHN =
001010, 001111, 001000, 001110

Example in 6 bit 3 bytes:
jOhN =
101010, 001111, 101000, 001110

September 05, 2020, 12:49:48 am - (Auto Merged - Double Posts are not allowed before 7 days.)
Lower case, upper case, numbers 0-9, and (!,.?'"space) will not fit in a 64 6-bit table.

Look:
Code: [Select]
text = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
converted = []
for character in text:
  value = ord(character) &0x3F
  converted.append(value)
print(converted)

and the result:
Code: [Select]
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58

If u want to use numbers and other characters, combine them with one byte as bitmask every 8 bytes.
1 = 3bytes 6-bit encoded, counts as 1 byte decoded
0 = normal ascii value = decoded byte
Title: Re: 6 bit text encoding
Post by: slidelljohn on September 05, 2020, 03:36:45 am
Glad you came back! :woot!:

Can you finish filling your table from 0x00-0x3f (64 total)?

How did you get this from your table?
to 6-bits grouped:
Code: [Select]
   j     O       h     N
101010 001111 101000 001110 = A8FA0E




This is why I say data is missing and why I'm trying to get you to show me your
data after encoding certain text(chars). I'm just trying to show you what happens
when you do it this way.
If u want to use numbers and other characters, combine them with one byte as bitmask every 8 bytes.
1 = 3bytes 6-bit encoded, counts as 1 byte decoded
0 = normal ascii value = decoded byte
Title: Re: 6 bit text encoding
Post by: Bregalad on September 05, 2020, 04:24:58 am
If you're going to need some complex bit shifting anyway, why not use huffman encoding ? It will always encode into less bits than fixed bitwidth. You know, some people already thought about compression for years and came up with their solutions :)

Including myself : http://www.romhacking.net/utilities/882/ (http://www.romhacking.net/utilities/882/)
Title: Re: 6 bit text encoding
Post by: Anime_World on September 05, 2020, 04:37:02 am
Glad you came back! :woot!:

This is close to your table but your numbers dont add up right.
(https://i.imgur.com/9cRps9L.png)

Can you finish filling your table from 0x00-0x3f (64 total)?

How did you get this from your table?



This is why I say data is missing and why I'm trying to get you to show me your
data after encoding certain text(chars). I'm just trying to show you what happens
when you do it this way.

OK.  8)
Title: Re: 6 bit text encoding
Post by: slidelljohn on September 05, 2020, 04:39:41 am
@Bregalad

Yes, Bregalad. That's one reason why I didn't agree with this statement.
if you have a total of 70 6-bit tokens, you'll need at least 2 64-entry tables to represent them all

You don't need another 6-bit table for 6 extra tokens. If you need more tokens you might
as well add some sort of compression.

Table #2

10  = 65
11  = 66
000 = 67
001 = 68
010 = 69
011 = 70

@Anime_World
Your output data wasn't accurate. I was just trying to help. Also
certain char combinations are going to make your data larger and
that's why I said data was missing.

@Bregalad
I knew you knew a lot about snes music but I didn't realize you
knew so much about compression. I looked at your tool and its
got a lot of different compressions! You also included the source! :woot!:
Thanks! :)
Title: Re: 6 bit text encoding
Post by: abw on September 05, 2020, 12:06:28 pm
If you're setting up a brand new text encoding from scratch and your primary concern is being able to fit a large amount of text into a small amount of space, then yes, using some encoding that's specifically designed to compress text will very likely give a better compression ratio than some encoding that's not designed to do that.

Pennywise's question, however, does not appear to be about choosing a particular compression algorithm (he may well also have that in mind, but that's not what he actually asked). Instead, I suspect (since we've been talking about it elsewhere) that he's looking at a particular game that uses 6-bit tokens and is trying to get a better understanding of how that's implemented.

With that in mind, here's a commented example of reading a stream of 6-bit tokens and converting them to 8-bit tokens from Dragon Quest IV (NES):
Code: [Select]
; read the next 6-bit token into A, updating script variables appropriately
; control flow target (from $DF76, $DF8F)
0x03DFA9|$0F:$DF99:A0 00    LDY #$00   
0x03DFAB|$0F:$DF9B:B1 49    LDA ($49),Y ; read current token byte 0
0x03DFAD|$0F:$DF9D:85 5F    STA $5F    ; store current token byte 0 in $5F
0x03DFAF|$0F:$DF9F:A5 5E    LDA $5E    ; read token counter
0x03DFB1|$0F:$DFA1:E6 5E    INC $5E    ; increment token counter
0x03DFB3|$0F:$DFA3:29 03    AND #$03    ; check low 2 bits of token counter: a sequence of 6-bit tokens can only be aligned in 4 possible ways relative to 8-bit boundaries
0x03DFB5|$0F:$DFA5:F0 10    BEQ $DFB7  ; handle 0bSSSSSSxx; we're going to need to re-read the current script byte for the next token, so don't update the script read address
0x03DFB7|$0F:$DFA7:20 D5 DF JSR $DFD5  ; update the script read address, taking into account reserved regions of RAM, the script crossing bank boundaries, etc.; leaves A unchanged
0x03DFBA|$0F:$DFAA:C9 02    CMP #$02   
0x03DFBC|$0F:$DFAC:90 0E    BCC $DFBC  ; 0 was handled earlier, so A < 2 => A == 1; handle 0bxxxxxxSS 0bSSSSxxxx
0x03DFBE|$0F:$DFAE:F0 18    BEQ $DFC8  ; A == 2; handle 0bxxxxSSSS 0bSSxxxxxx
0x03DFC0|$0F:$DFB0:A5 5F    LDA $5F    ; A == 3; handle 0bxxSSSSSS (the easy case); load current token byte 0
0x03DFC2|$0F:$DFB2:29 3F    AND #$3F    ; and chop off the bits we don't care about
0x03DFC4|$0F:$DFB4:60      RTS       


; code -> unknown
0x03DFC5|$0F:$DFB5:EA      ; NOP       
0x03DFC6|$0F:$DFB6:EA      ; NOP       

; unknown -> code
; handle 0bSSSSSSxx
; control flow target (from $DFA5)
0x03DFC7|$0F:$DFB7:A5 5F    LDA $5F    ; load current token byte 0
0x03DFC9|$0F:$DFB9:4A      LSR        ; shift from 0bSSSSSSxx to 0bxSSSSSSx
0x03DFCA|$0F:$DFBA:4A      LSR        ; shift from 0bxSSSSSSx to 0bxxSSSSSS
0x03DFCB|$0F:$DFBB:60      RTS       

; handle 0bxxxxxxSS 0bSSSSxxxx
; control flow target (from $DFAC)
0x03DFCC|$0F:$DFBC:B1 49    LDA ($49),Y ; $DFD5 updated $49-$4A to the next script byte, so this reads current token byte 1
0x03DFCE|$0F:$DFBE:46 5F    LSR $5F    ; shift from 0bxxxxxxSS 0bSSSSxxxx to 0bxxxxxxxS 0bSSSSSxxx
0x03DFD0|$0F:$DFC0:6A      ROR       
0x03DFD1|$0F:$DFC1:46 5F    LSR $5F    ; shift from 0bxxxxxxxS 0bSSSSSxxx to 0bxxxxxxxx 0bSSSSSSxx
0x03DFD3|$0F:$DFC3:6A      ROR       
0x03DFD4|$0F:$DFC4:4A      LSR        ; shift from 0bSSSSSSxx to 0bxSSSSSSx
0x03DFD5|$0F:$DFC5:4A      LSR        ; shift from 0bxSSSSSSx to 0bxxSSSSSS
0x03DFD6|$0F:$DFC6:60      RTS       


; code -> unknown
0x03DFD7|$0F:$DFC7:EA      ; NOP       

; unknown -> code
; handle 0bxxxxSSSS 0bSSxxxxxx
; control flow target (from $DFAE)
0x03DFD8|$0F:$DFC8:B1 49    LDA ($49),Y ; $DFD5 updated $49-$4A to the next script byte, so this reads current token byte 1
0x03DFDA|$0F:$DFCA:0A      ASL        ; shift from 0bxxxxSSSS 0bSSxxxxxx to 0bxxxSSSSS 0bSxxxxxxx
0x03DFDB|$0F:$DFCB:26 5F    ROL $5F   
0x03DFDD|$0F:$DFCD:0A      ASL        ; shift from 0bxxxSSSSS 0bSxxxxxxx to 0bxxSSSSSS 0bxxxxxxxx
0x03DFDE|$0F:$DFCE:26 5F    ROL $5F   
0x03DFE0|$0F:$DFD0:A5 5F    LDA $5F    ; load 0bxxSSSSSS
0x03DFE2|$0F:$DFD2:29 3F    AND #$3F    ; and chop off the bits we don't care about
0x03DFE4|$0F:$DFD4:60      RTS       
Basically that keeps track of the current script read address in $49-$4A and the number of tokens that have been read so far (from which we can deduce the alignment of the current token) in $5E, then reads the current script byte into $5F and (if necessary) also reads the next script byte, and then performs some bit shifts to get the 6 desired bits into the low 6 bits of A, setting the high 2 bits of A to 0.

After that, here's an example of using context (table switching) while converting from 6-bit tokens (or 8-bit tokens with the high 2 bits clear, depending on how you want to think of them) to characters (PPU tile IDs):
Code: [Select]
; control flow target (from $86D1)
; call to code in a different bank ($1F:$DF86)
0x0586CA|$16:$86BA:20 86 DF JSR $DF86  ; swap in needed banks then call $DF99 to read the next 6-bit token into A
0x0586CD|$16:$86BD:C9 3C    CMP #$3C    ; table switch tokens are #$3C - #$3F
0x0586CF|$16:$86BF:90 05    BCC $86C6  ; A < #$3C => normal table entry
0x0586D1|$16:$86C1:F0 06    BEQ $86C9  ; A == #$3C => switch between hiragana and katakana tables
0x0586D3|$16:$86C3:4C D4 86 JMP $86D4  ; A > #$3C => switch to dictionary table

; handle normal table entry
; control flow target (from $86BF)
0x0586D6|$16:$86C6:4C 4E 87 JMP $874E  ; target code is located too far away to use a branch, so JMP

; switch between hiragana and katakana tables
; control flow target (from $86C1)
0x0586D9|$16:$86C9:AD 66 05 LDA $0566  ; current table ID
0x0586DC|$16:$86CC:49 01    EOR #$01    ; switch between hiragana and katakana tables
0x0586DE|$16:$86CE:8D 66 05 STA $0566  ; current table ID
0x0586E1|$16:$86D1:4C BA 86 JMP $86BA  ; loop to read next token

; ...

; handle normal table entry
0x05875E|$16:$874E:85 5F    STA $5F    ; current 6-bit token 0b00SSSSSS
0x058760|$16:$8750:AD 66 05 LDA $0566  ; current table ID
0x058763|$16:$8753:0A      ASL        ; shift from 0bxxxxxxTT to 0bTTxxxxxx (yes, ROR x 3 would have been both faster and shorter)
0x058764|$16:$8754:0A      ASL       
0x058765|$16:$8755:0A      ASL       
0x058766|$16:$8756:0A      ASL       
0x058767|$16:$8757:0A      ASL       
0x058768|$16:$8758:0A      ASL       
0x058769|$16:$8759:18      CLC       
0x05876A|$16:$875A:65 5F    ADC $5F    ; combine table ID and 6-bit token to get 0bTTSSSSSS (yes, ORA $5F would have been both faster and shorter)
0x05876C|$16:$875C:AA      TAX        ; use combined value as an index into the hiragana/katakana mapping table
0x05876D|$16:$875D:BD 65 87 LDA $8765,X ; get the 8-bit value (usually a hiragana/katakana PPU tile ID)
0x058770|$16:$8760:C9 F0    CMP #$F0    ; check for control code #$F0 (wait for input)
0x058772|$16:$8762:F0 B7    BEQ $871B  ; if found, go handle #$F0
0x058774|$16:$8764:60      RTS        ; otherwise return the new 8-bit value
Basically that utilizes some extra storage in $0566 to remember the current table, updates it whenever a table switch token is read, and, in conjunction with the 6-bit token that was read earlier, uses it as an index into a lookup table to get an 8-bit value (usually a PPU tile ID, but could also be a control code) and then returns that for its caller to process (write to PPU or handle control code).

For another example, Dragon Warrior II (NES) uses a 5-bit encoding with a different implementation of table switching; you can check the commentary on https://datacrystal.romhacking.net/wiki/Dragon_Warrior_II::ROM_map/ASM_bank_02 starting at $02:$B393 for further details.

You don't need another 6-bit table for 6 extra tokens. If you need more tokens you might
as well add some sort of compression.
Yes, if you're going to make up your own encoding, you can definitely come up with something that has better space efficiency than using 6 bits to store 1 bit of information, but again, I don't think that's what the question was about ;).
Title: Re: 6 bit text encoding
Post by: slidelljohn on September 05, 2020, 12:18:23 pm
 :thumbsup:
Title: Re: 6 bit text encoding
Post by: Anime_World on September 09, 2020, 04:25:14 pm
If you're setting up a brand new text encoding from scratch and your primary concern is being able to fit a large amount of text into a small amount of space, then yes, using some encoding that's specifically designed to compress text will very likely give a better compression ratio than some encoding that's not designed to do that.

Pennywise's question, however, does not appear to be about choosing a particular compression algorithm (he may well also have that in mind, but that's not what he actually asked). Instead, I suspect (since we've been talking about it elsewhere) that he's looking at a particular game that uses 6-bit tokens and is trying to get a better understanding of how that's implemented.

With that in mind, here's a commented example of reading a stream of 6-bit tokens and converting them to 8-bit tokens from Dragon Quest IV (NES):
Code: [Select]
; read the next 6-bit token into A, updating script variables appropriately
; control flow target (from $DF76, $DF8F)
0x03DFA9|$0F:$DF99:A0 00    LDY #$00   
0x03DFAB|$0F:$DF9B:B1 49    LDA ($49),Y ; read current token byte 0
0x03DFAD|$0F:$DF9D:85 5F    STA $5F    ; store current token byte 0 in $5F
0x03DFAF|$0F:$DF9F:A5 5E    LDA $5E    ; read token counter
0x03DFB1|$0F:$DFA1:E6 5E    INC $5E    ; increment token counter
0x03DFB3|$0F:$DFA3:29 03    AND #$03    ; check low 2 bits of token counter: a sequence of 6-bit tokens can only be aligned in 4 possible ways relative to 8-bit boundaries
0x03DFB5|$0F:$DFA5:F0 10    BEQ $DFB7  ; handle 0bSSSSSSxx; we're going to need to re-read the current script byte for the next token, so don't update the script read address
0x03DFB7|$0F:$DFA7:20 D5 DF JSR $DFD5  ; update the script read address, taking into account reserved regions of RAM, the script crossing bank boundaries, etc.; leaves A unchanged
0x03DFBA|$0F:$DFAA:C9 02    CMP #$02   
0x03DFBC|$0F:$DFAC:90 0E    BCC $DFBC  ; 0 was handled earlier, so A < 2 => A == 1; handle 0bxxxxxxSS 0bSSSSxxxx
0x03DFBE|$0F:$DFAE:F0 18    BEQ $DFC8  ; A == 2; handle 0bxxxxSSSS 0bSSxxxxxx
0x03DFC0|$0F:$DFB0:A5 5F    LDA $5F    ; A == 3; handle 0bxxSSSSSS (the easy case); load current token byte 0
0x03DFC2|$0F:$DFB2:29 3F    AND #$3F    ; and chop off the bits we don't care about
0x03DFC4|$0F:$DFB4:60      RTS       


; code -> unknown
0x03DFC5|$0F:$DFB5:EA      ; NOP       
0x03DFC6|$0F:$DFB6:EA      ; NOP       

; unknown -> code
; handle 0bSSSSSSxx
; control flow target (from $DFA5)
0x03DFC7|$0F:$DFB7:A5 5F    LDA $5F    ; load current token byte 0
0x03DFC9|$0F:$DFB9:4A      LSR        ; shift from 0bSSSSSSxx to 0bxSSSSSSx
0x03DFCA|$0F:$DFBA:4A      LSR        ; shift from 0bxSSSSSSx to 0bxxSSSSSS
0x03DFCB|$0F:$DFBB:60      RTS       

; handle 0bxxxxxxSS 0bSSSSxxxx
; control flow target (from $DFAC)
0x03DFCC|$0F:$DFBC:B1 49    LDA ($49),Y ; $DFD5 updated $49-$4A to the next script byte, so this reads current token byte 1
0x03DFCE|$0F:$DFBE:46 5F    LSR $5F    ; shift from 0bxxxxxxSS 0bSSSSxxxx to 0bxxxxxxxS 0bSSSSSxxx
0x03DFD0|$0F:$DFC0:6A      ROR       
0x03DFD1|$0F:$DFC1:46 5F    LSR $5F    ; shift from 0bxxxxxxxS 0bSSSSSxxx to 0bxxxxxxxx 0bSSSSSSxx
0x03DFD3|$0F:$DFC3:6A      ROR       
0x03DFD4|$0F:$DFC4:4A      LSR        ; shift from 0bSSSSSSxx to 0bxSSSSSSx
0x03DFD5|$0F:$DFC5:4A      LSR        ; shift from 0bxSSSSSSx to 0bxxSSSSSS
0x03DFD6|$0F:$DFC6:60      RTS       


; code -> unknown
0x03DFD7|$0F:$DFC7:EA      ; NOP       

; unknown -> code
; handle 0bxxxxSSSS 0bSSxxxxxx
; control flow target (from $DFAE)
0x03DFD8|$0F:$DFC8:B1 49    LDA ($49),Y ; $DFD5 updated $49-$4A to the next script byte, so this reads current token byte 1
0x03DFDA|$0F:$DFCA:0A      ASL        ; shift from 0bxxxxSSSS 0bSSxxxxxx to 0bxxxSSSSS 0bSxxxxxxx
0x03DFDB|$0F:$DFCB:26 5F    ROL $5F   
0x03DFDD|$0F:$DFCD:0A      ASL        ; shift from 0bxxxSSSSS 0bSxxxxxxx to 0bxxSSSSSS 0bxxxxxxxx
0x03DFDE|$0F:$DFCE:26 5F    ROL $5F   
0x03DFE0|$0F:$DFD0:A5 5F    LDA $5F    ; load 0bxxSSSSSS
0x03DFE2|$0F:$DFD2:29 3F    AND #$3F    ; and chop off the bits we don't care about
0x03DFE4|$0F:$DFD4:60      RTS       
Basically that keeps track of the current script read address in $49-$4A and the number of tokens that have been read so far (from which we can deduce the alignment of the current token) in $5E, then reads the current script byte into $5F and (if necessary) also reads the next script byte, and then performs some bit shifts to get the 6 desired bits into the low 6 bits of A, setting the high 2 bits of A to 0.

After that, here's an example of using context (table switching) while converting from 6-bit tokens (or 8-bit tokens with the high 2 bits clear, depending on how you want to think of them) to characters (PPU tile IDs):
Code: [Select]
; control flow target (from $86D1)
; call to code in a different bank ($1F:$DF86)
0x0586CA|$16:$86BA:20 86 DF JSR $DF86  ; swap in needed banks then call $DF99 to read the next 6-bit token into A
0x0586CD|$16:$86BD:C9 3C    CMP #$3C    ; table switch tokens are #$3C - #$3F
0x0586CF|$16:$86BF:90 05    BCC $86C6  ; A < #$3C => normal table entry
0x0586D1|$16:$86C1:F0 06    BEQ $86C9  ; A == #$3C => switch between hiragana and katakana tables
0x0586D3|$16:$86C3:4C D4 86 JMP $86D4  ; A > #$3C => switch to dictionary table

; handle normal table entry
; control flow target (from $86BF)
0x0586D6|$16:$86C6:4C 4E 87 JMP $874E  ; target code is located too far away to use a branch, so JMP

; switch between hiragana and katakana tables
; control flow target (from $86C1)
0x0586D9|$16:$86C9:AD 66 05 LDA $0566  ; current table ID
0x0586DC|$16:$86CC:49 01    EOR #$01    ; switch between hiragana and katakana tables
0x0586DE|$16:$86CE:8D 66 05 STA $0566  ; current table ID
0x0586E1|$16:$86D1:4C BA 86 JMP $86BA  ; loop to read next token

; ...

; handle normal table entry
0x05875E|$16:$874E:85 5F    STA $5F    ; current 6-bit token 0b00SSSSSS
0x058760|$16:$8750:AD 66 05 LDA $0566  ; current table ID
0x058763|$16:$8753:0A      ASL        ; shift from 0bxxxxxxTT to 0bTTxxxxxx (yes, ROR x 3 would have been both faster and shorter)
0x058764|$16:$8754:0A      ASL       
0x058765|$16:$8755:0A      ASL       
0x058766|$16:$8756:0A      ASL       
0x058767|$16:$8757:0A      ASL       
0x058768|$16:$8758:0A      ASL       
0x058769|$16:$8759:18      CLC       
0x05876A|$16:$875A:65 5F    ADC $5F    ; combine table ID and 6-bit token to get 0bTTSSSSSS (yes, ORA $5F would have been both faster and shorter)
0x05876C|$16:$875C:AA      TAX        ; use combined value as an index into the hiragana/katakana mapping table
0x05876D|$16:$875D:BD 65 87 LDA $8765,X ; get the 8-bit value (usually a hiragana/katakana PPU tile ID)
0x058770|$16:$8760:C9 F0    CMP #$F0    ; check for control code #$F0 (wait for input)
0x058772|$16:$8762:F0 B7    BEQ $871B  ; if found, go handle #$F0
0x058774|$16:$8764:60      RTS        ; otherwise return the new 8-bit value
Basically that utilizes some extra storage in $0566 to remember the current table, updates it whenever a table switch token is read, and, in conjunction with the 6-bit token that was read earlier, uses it as an index into a lookup table to get an 8-bit value (usually a PPU tile ID, but could also be a control code) and then returns that for its caller to process (write to PPU or handle control code).

For another example, Dragon Warrior II (NES) uses a 5-bit encoding with a different implementation of table switching; you can check the commentary on https://datacrystal.romhacking.net/wiki/Dragon_Warrior_II::ROM_map/ASM_bank_02 starting at $02:$B393 for further details.
Yes, if you're going to make up your own encoding, you can definitely come up with something that has better space efficiency than using 6 bits to store 1 bit of information, but again, I don't think that's what the question was about ;).

Perfect explanation @abw.
Title: Re: 6 bit text encoding
Post by: Pennywise on September 09, 2020, 04:43:51 pm
Thanks everyone.

Yes, I'm probably going to poke around with DQ4, which uses 6 bit encoding and wanted to some reading material so to speak when I start looking at the code.

I figured this public discourse would be beneficial to others since the subject is a bit uncommon.
Title: Re: 6 bit text encoding
Post by: Anime_World on September 10, 2020, 03:21:17 pm
Hint: abcde (atlas/cartographer) supports 6-bit encoding.