News: 11 March 2016 - Forum Rules

### Author Topic: 6 bit text encoding  (Read 4267 times)

#### Pennywise

• Forum Moderator
• Hero Member
• Posts: 2409
• Prepare thyself for holy judgment
##### 6 bit text encoding
« on: September 03, 2020, 12:29:01 pm »
What is the process behind encoding text as 6 bits and then converting it to 8 bits in the code? I'm assuming that's what happens with 6 bit text encoding.

I get the general idea is if you have 4 bytes/characters, with 6 bit encoding, they'll only take up 3 bytes of space. How exactly does that work? I know that instead of thinking of every as bytes, the magic happens on a binary level. Some type of bit shifting?

#### Raeven0

• Jr. Member
• Posts: 43
##### Re: 6 bit text encoding
« Reply #1 on: September 03, 2020, 08:32:28 pm »
Bit shifting works well on the SNES, where bytes are individually addressable and the accumulator can be 1 byte wide. Here's a naive 6-bit-to-8-bit decoder:
Code: [Select]
`SR_DecodeText:; Assumes: X = address of encoded source, Y = address of buffer destination; Assumes: srcAddr>\$8000, destAddr<\$2000;   (otherwise use long addressing for src and set DBR for dest); Assumes terminator byte is \$00.FirstByte:lda \$0000,x : lsr : lsr : sta \$0000,y  beq .Done.SecondByte:lda \$0000,x : and #\$03  asl : asl : asl : asl : sta \$0001,ylda \$0001,x : and #\$F0  lsr : lsr : lsr : lsr : ora \$0001,y : sta \$0001,y  beq .Done.ThirdByte:lda \$0001,x : and #\$0F  asl : asl : sta \$0002,ylda \$0002,x : and #\$C0  lsr : lsr : lsr : lsr : lsr : lsr : ora \$0002,y : sta \$0002,y  beq .Done.FourthByte:lda \$0002,x : and #\$3F : sta \$0003,y  beq .Done.ToNextGroup:inx : inx : inxiny : iny : iny : inybra .FirstByte.Done:rts`Not tested. Definitely not optimized.

#### slidelljohn

• Sr. Member
• Posts: 353
##### Re: 6 bit text encoding
« Reply #2 on: September 03, 2020, 09:18:13 pm »

« Last Edit: September 06, 2020, 07:40:36 am by slidelljohn »

#### abw

• Hero Member
• Posts: 592
##### Re: 6 bit text encoding
« Reply #3 on: September 03, 2020, 10:02:21 pm »
For a wordier somewhat generalized take on the issue:

With 8 bits, you can represent 2^8 = 256 different tokens (where a "token" could be a sequence of one or more text characters, a control code, etc.). With 6 bits, you can represent 2^6 = 64 different tokens. If you have more than 64 different tokens to encode, you can't do it with just 6 bits, so you need to use something else to provide additional information about your tokens, and that something else is context, where the meaning of a 6-bit token depends on not just its own bits but also on some other bits, almost always on bits that occur earlier in the input. So a 0b000000 token could mean different things depending on its context, which is to say it could mean different things depending on the tokens that have preceded it. As an example, maybe that 0b000000 starts off meaning "A", but if it comes after a 0b111111 token, it means "a" instead.

In ROM hacking, we usually represent a text encoding in a table file, and the extra tokens that provide the context necessary to determine the meaning of a token are sometimes referred to as table switches since switching the context used to interpret a token corresponds to switching to a different table. In order to encode some input, you'll need to have at least enough contexts (tables) to represent all of your tokens including some extra tokens for indicating context changes (table switches). I.e. if you have a total of 70 6-bit tokens, you'll need at least 2 64-entry tables to represent them all; if you have 170 6-bit tokens, you'll need at least 3 64-entry tables. Once you've decided on the specifics of the encoding scheme, you can convert your text input to binary and burn it to ROM (or whatever your storage medium actually is, since in practice we almost never touch physical ROM chips).

So, getting back to your question, when decoding a stream of context-sensitive 6-bit input tokens to an 8-bit output stream, the decoder has to a) chop the input up into groups of 6-bit tokens (on a byte-based processor this will indeed usually involve lots of bit shifting) and b) keep track of changes to the context it's supposed to use to interpret each of those tokens. After that, it has to map each token from 6 bits to 8 bits and output the 8-bit value (which could then be used as a PPU tile ID or a control code etc.). The mapping method will vary from game to game, but two of the simplest options are to add a fixed value depending on the context (e.g. if we're interpreting tokens as uppercase characters, then the 8-bit value could be the 6-bit value + \$80, or if we're interpreting tokens as lowercase characters, then the 8-bit value could be the 6-bit value + \$C0; The Battle of Olympus [NES] does something like this) or to use a separate lookup table for each context (e.g. if we're interpreting tokens as uppercase characters, then the 8-bit value could be given by the byte at RAM (\$DEAD + the 6-bit value), or if we're interpreting tokens as lowercase characters, then the 8-bit value could be given by the byte at RAM (\$BEEF + the 6-bit value); Dragon Quest IV [NES] does something like this).

#### slidelljohn

• Sr. Member
• Posts: 353
##### Re: 6 bit text encoding
« Reply #4 on: September 03, 2020, 11:40:11 pm »

« Last Edit: September 06, 2020, 07:41:31 am by slidelljohn »

#### Anime_World

• RHDN Patreon Supporter!
• Jr. Member
• Posts: 76
##### Re: 6 bit text encoding
« Reply #5 on: September 03, 2020, 11:57:36 pm »
An example in python:
Code: [Select]
`output = b''text = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'for character in text: value = ord(character) &0x3F output += value.to_bytes(1,'big')print(output)`
result will be:
Code: [Select]
`01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10 11 12 13 14 15 15 17 18 19 1A`
Now to reduce in a group of 3 bytes you can do:
Code: [Select]
`output = b''text = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'converted = []for character in text: value = ord(character) &0x3F converted.append(value)next = 0for i in range(0, len(converted)//3): value = int('0b' + '{:06b}'.format(converted[next]) + '{:06b}'.format(converted[next+1]) + '{:06b}'.format(converted[next+2]), 2) output += value.to_bytes(3,'big') next += 3print(output)`
result will be:
Code: [Select]
`00 10 83 00 41 46 00 72 09 00 A2 CC 00 D3 8F 01 04 52 01 35 15 01 65 D8`
Or you can use a dict compression like LZ or Huffman.
« Last Edit: September 04, 2020, 06:49:22 pm by Anime_World »

#### slidelljohn

• Sr. Member
• Posts: 353
##### Re: 6 bit text encoding
« Reply #6 on: September 04, 2020, 07:53:34 pm »
If you changed: text = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
to: text = 'aBcD'

What would the results of print(output) be after "Now to reduce in a group of 3 bytes you can do:"?
« Last Edit: September 04, 2020, 07:59:41 pm by slidelljohn »

#### Anime_World

• RHDN Patreon Supporter!
• Jr. Member
• Posts: 76
##### Re: 6 bit text encoding
« Reply #7 on: September 04, 2020, 11:20:47 pm »
If you changed: text = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
to: text = 'aBcD'

What would the results of print(output) be after "Now to reduce in a group of 3 bytes you can do:"?

Code: [Select]
`00 18 83`
It's only a sketch... if u want to output de 4th char, some changes are needed.
Like this:
Code: [Select]
`output = b''text = 'JOHN'converted = []for character in text: value = ord(character) &0x3F converted.append(value)next = 0for i in range(0, len(converted)//4): value = int('0b' + '{:06b}'.format(converted[next]) + '{:06b}'.format(converted[next+1]) + '{:06b}'.format(converted[next+2]) + '{:06b}'.format(converted[next+3]), 2) output += value.to_bytes(3,'big') next += 4print(output)`
now result for "JOHN" will be:
Code: [Select]
`28 F2 0E`
« Last Edit: September 04, 2020, 11:41:06 pm by Anime_World »

#### slidelljohn

• Sr. Member
• Posts: 353
##### Re: 6 bit text encoding
« Reply #8 on: September 04, 2020, 11:42:49 pm »
What are the results if text = 'jOhN?

#### Anime_World

• RHDN Patreon Supporter!
• Jr. Member
• Posts: 76
##### Re: 6 bit text encoding
« Reply #9 on: September 04, 2020, 11:46:13 pm »
What are the results if text = 'jOhN?

Code: [Select]
`A8 FA 0E`
Right?

Code: [Select]
`>>> bin(ord('J'))'0b1001010'>>> bin(ord('j'))'0b1101010'`
« Last Edit: September 04, 2020, 11:51:21 pm by Anime_World »

#### slidelljohn

• Sr. Member
• Posts: 353
##### Re: 6 bit text encoding
« Reply #10 on: September 04, 2020, 11:52:02 pm »
How does it go from 0x28, 0xF2, 0x0E to 0xA8, 0xFA, 0x0E?

Can you show me what the bits mean because it looks like data is missing?

Example in 8 bit, 4 bytes:
JOHN =
0x4a, 0x4f, 0x48, 0x4e
01001010, 01001111, 01001000, 01001110

Example in 6 bit 3 bytes:
JOHN =
001010, 001111, 001000, 001110

0x28, 0xF2, 0x0E
00101000, 11110010, 00001110

#### Anime_World

• RHDN Patreon Supporter!
• Jr. Member
• Posts: 76
##### Re: 6 bit text encoding
« Reply #11 on: September 04, 2020, 11:57:43 pm »
How does it go from 0x28, 0xF2, 0x0E to 0xA8, 0xFA, 0x0E?

Can you show me what the bits mean because it looks like data is missing?

Example in 8 bit, 4 bytes:
JOHN =
0x4a, 0x4f, 0x48, 0x4e
01001010, 01001111, 01001000, 01001110

Example in 6 bit 3 bytes:
JOHN =
001010, 001111, 001000, 001110

0x28, 0xF2, 0x0E
00101000, 11110010, 00001110

Code: [Select]
`01101010 = j01001111 = O01101000 = h01001110 = N`
to 6-bits grouped:
Code: [Select]
`   j     O       h     N101010 001111 101000 001110 = A8FA0E`

#### slidelljohn

• Sr. Member
• Posts: 353
##### Re: 6 bit text encoding
« Reply #12 on: September 05, 2020, 12:04:44 am »
It looks like you are removing data to make it smaller than what it acutely is.
Why not show the full data?

Lets see the table(s).
« Last Edit: September 05, 2020, 12:13:07 am by slidelljohn »

#### Anime_World

• RHDN Patreon Supporter!
• Jr. Member
• Posts: 76
##### Re: 6 bit text encoding
« Reply #13 on: September 05, 2020, 12:12:40 am »
It looks like you are removing data to make it smaller than what it acutely is.
Why not show the full data?

Lets see the tables.

OK.
Check tables...

#### slidelljohn

• Sr. Member
• Posts: 353
##### Re: 6 bit text encoding
« Reply #14 on: September 05, 2020, 12:20:38 am »
Did you post them? I dont see any.

September 05, 2020, 12:34:45 am - (Auto Merged - Double Posts are not allowed before 7 days.)
I don't understand the trickery.
All you did was flip the 6th bit on the j and the h to make them lower case.

Example in 6 bit 3 bytes:
JOHN =
001010, 001111, 001000, 001110

Example in 6 bit 3 bytes:
jOhN =
101010, 001111, 101000, 001110

September 05, 2020, 12:49:48 am - (Auto Merged - Double Posts are not allowed before 7 days.)
Lower case, upper case, numbers 0-9, and (!,.?'"space) will not fit in a 64 6-bit table.
« Last Edit: September 05, 2020, 12:51:07 am by slidelljohn »

#### Anime_World

• RHDN Patreon Supporter!
• Jr. Member
• Posts: 76
##### Re: 6 bit text encoding
« Reply #15 on: September 05, 2020, 02:52:15 am »
Did you post them? I dont see any.

September 05, 2020, 12:34:45 am - (Auto Merged - Double Posts are not allowed before 7 days.)
I don't understand the trickery.
All you did was flip the 6th bit on the j and the h to make them lower case.

Example in 6 bit 3 bytes:
JOHN =
001010, 001111, 001000, 001110

Example in 6 bit 3 bytes:
jOhN =
101010, 001111, 101000, 001110

September 05, 2020, 12:49:48 am - (Auto Merged - Double Posts are not allowed before 7 days.)
Lower case, upper case, numbers 0-9, and (!,.?'"space) will not fit in a 64 6-bit table.

Look:
Code: [Select]
`text = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'converted = []for character in text:  value = ord(character) &0x3F  converted.append(value) print(converted)`
and the result:
Code: [Select]
`1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58`

If u want to use numbers and other characters, combine them with one byte as bitmask every 8 bytes.
1 = 3bytes 6-bit encoded, counts as 1 byte decoded
0 = normal ascii value = decoded byte
« Last Edit: September 05, 2020, 03:18:16 am by Anime_World »

#### slidelljohn

• Sr. Member
• Posts: 353
##### Re: 6 bit text encoding
« Reply #16 on: September 05, 2020, 03:36:45 am »

Can you finish filling your table from 0x00-0x3f (64 total)?

How did you get this from your table?
to 6-bits grouped:
Code: [Select]
`   j     O       h     N101010 001111 101000 001110 = A8FA0E`

This is why I say data is missing and why I'm trying to get you to show me your
data after encoding certain text(chars). I'm just trying to show you what happens
when you do it this way.
If u want to use numbers and other characters, combine them with one byte as bitmask every 8 bytes.
1 = 3bytes 6-bit encoded, counts as 1 byte decoded
0 = normal ascii value = decoded byte
« Last Edit: September 06, 2020, 08:52:11 am by slidelljohn »

• Hero Member
• Posts: 2763
##### Re: 6 bit text encoding
« Reply #17 on: September 05, 2020, 04:24:58 am »
If you're going to need some complex bit shifting anyway, why not use huffman encoding ? It will always encode into less bits than fixed bitwidth. You know, some people already thought about compression for years and came up with their solutions

Including myself : http://www.romhacking.net/utilities/882/

#### Anime_World

• RHDN Patreon Supporter!
• Jr. Member
• Posts: 76
##### Re: 6 bit text encoding
« Reply #18 on: September 05, 2020, 04:37:02 am »

Can you finish filling your table from 0x00-0x3f (64 total)?

How did you get this from your table?

This is why I say data is missing and why I'm trying to get you to show me your
data after encoding certain text(chars). I'm just trying to show you what happens
when you do it this way.

OK.
« Last Edit: September 05, 2020, 04:48:11 am by Anime_World »

#### slidelljohn

• Sr. Member
• Posts: 353
##### Re: 6 bit text encoding
« Reply #19 on: September 05, 2020, 04:39:41 am »

Yes, Bregalad. That's one reason why I didn't agree with this statement.
if you have a total of 70 6-bit tokens, you'll need at least 2 64-entry tables to represent them all

You don't need another 6-bit table for 6 extra tokens. If you need more tokens you might
as well add some sort of compression.

Table #2

10  = 65
11  = 66
000 = 67
001 = 68
010 = 69
011 = 70

@Anime_World
Your output data wasn't accurate. I was just trying to help. Also
certain char combinations are going to make your data larger and
that's why I said data was missing.