News:

11 March 2016 - Forum Rules

Main Menu

6 bit text encoding

Started by Pennywise, September 03, 2020, 12:29:01 PM

Previous topic - Next topic

Pennywise

What is the process behind encoding text as 6 bits and then converting it to 8 bits in the code? I'm assuming that's what happens with 6 bit text encoding.

I get the general idea is if you have 4 bytes/characters, with 6 bit encoding, they'll only take up 3 bytes of space. How exactly does that work? I know that instead of thinking of every as bytes, the magic happens on a binary level. Some type of bit shifting?

Raeven0

Bit shifting works well on the SNES, where bytes are individually addressable and the accumulator can be 1 byte wide. Here's a naive 6-bit-to-8-bit decoder:

SR_DecodeText:
; Assumes: X = address of encoded source, Y = address of buffer destination
; Assumes: srcAddr>$8000, destAddr<$2000
;   (otherwise use long addressing for src and set DBR for dest)
; Assumes terminator byte is $00
.FirstByte:
lda $0000,x : lsr : lsr : sta $0000,y
  beq .Done
.SecondByte:
lda $0000,x : and #$03
  asl : asl : asl : asl : sta $0001,y
lda $0001,x : and #$F0
  lsr : lsr : lsr : lsr : ora $0001,y : sta $0001,y
  beq .Done
.ThirdByte:
lda $0001,x : and #$0F
  asl : asl : sta $0002,y
lda $0002,x : and #$C0
  lsr : lsr : lsr : lsr : lsr : lsr : ora $0002,y : sta $0002,y
  beq .Done
.FourthByte:
lda $0002,x : and #$3F : sta $0003,y
  beq .Done
.ToNextGroup:
inx : inx : inx
iny : iny : iny : iny
bra .FirstByte
.Done:
rts

Not tested. Definitely not optimized.

slidelljohn

#2
 :thumbsup:

abw

For a wordier somewhat generalized take on the issue:

With 8 bits, you can represent 2^8 = 256 different tokens (where a "token" could be a sequence of one or more text characters, a control code, etc.). With 6 bits, you can represent 2^6 = 64 different tokens. If you have more than 64 different tokens to encode, you can't do it with just 6 bits, so you need to use something else to provide additional information about your tokens, and that something else is context, where the meaning of a 6-bit token depends on not just its own bits but also on some other bits, almost always on bits that occur earlier in the input. So a 0b000000 token could mean different things depending on its context, which is to say it could mean different things depending on the tokens that have preceded it. As an example, maybe that 0b000000 starts off meaning "A", but if it comes after a 0b111111 token, it means "a" instead.

In ROM hacking, we usually represent a text encoding in a table file, and the extra tokens that provide the context necessary to determine the meaning of a token are sometimes referred to as table switches since switching the context used to interpret a token corresponds to switching to a different table. In order to encode some input, you'll need to have at least enough contexts (tables) to represent all of your tokens including some extra tokens for indicating context changes (table switches). I.e. if you have a total of 70 6-bit tokens, you'll need at least 2 64-entry tables to represent them all; if you have 170 6-bit tokens, you'll need at least 3 64-entry tables. Once you've decided on the specifics of the encoding scheme, you can convert your text input to binary and burn it to ROM (or whatever your storage medium actually is, since in practice we almost never touch physical ROM chips).

So, getting back to your question, when decoding a stream of context-sensitive 6-bit input tokens to an 8-bit output stream, the decoder has to a) chop the input up into groups of 6-bit tokens (on a byte-based processor this will indeed usually involve lots of bit shifting) and b) keep track of changes to the context it's supposed to use to interpret each of those tokens. After that, it has to map each token from 6 bits to 8 bits and output the 8-bit value (which could then be used as a PPU tile ID or a control code etc.). The mapping method will vary from game to game, but two of the simplest options are to add a fixed value depending on the context (e.g. if we're interpreting tokens as uppercase characters, then the 8-bit value could be the 6-bit value + $80, or if we're interpreting tokens as lowercase characters, then the 8-bit value could be the 6-bit value + $C0; The Battle of Olympus [NES] does something like this) or to use a separate lookup table for each context (e.g. if we're interpreting tokens as uppercase characters, then the 8-bit value could be given by the byte at RAM ($DEAD + the 6-bit value), or if we're interpreting tokens as lowercase characters, then the 8-bit value could be given by the byte at RAM ($BEEF + the 6-bit value); Dragon Quest IV [NES] does something like this).

slidelljohn

#4
 :thumbsup:

Anime_World

#5
An example in python:

output = b''
text = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
for character in text:
value = ord(character) &0x3F
output += value.to_bytes(1,'big')
print(output)


result will be:

01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10 11 12 13 14 15 15 17 18 19 1A


Now to reduce in a group of 3 bytes you can do:

output = b''
text = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
converted = []
for character in text:
value = ord(character) &0x3F
converted.append(value)

next = 0
for i in range(0, len(converted)//3):
value = int('0b' + '{:06b}'.format(converted[next]) + '{:06b}'.format(converted[next+1]) + '{:06b}'.format(converted[next+2]), 2)
output += value.to_bytes(3,'big')
next += 3

print(output)


result will be:

00 10 83 00 41 46 00 72 09 00 A2 CC 00 D3 8F 01 04 52 01 35 15 01 65 D8


Or you can use a dict compression like LZ or Huffman.

slidelljohn

#6
If you changed: text = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
to: text = 'aBcD'

What would the results of print(output) be after "Now to reduce in a group of 3 bytes you can do:"?

Anime_World

#7
Quote from: slidelljohn on September 04, 2020, 07:53:34 PM
If you changed: text = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
to: text = 'aBcD'

What would the results of print(output) be after "Now to reduce in a group of 3 bytes you can do:"?


00 18 83


It's only a sketch... if u want to output de 4th char, some changes are needed.
Like this:

output = b''
text = 'JOHN'
converted = []
for character in text:
value = ord(character) &0x3F
converted.append(value)

next = 0
for i in range(0, len(converted)//4):
value = int('0b' + '{:06b}'.format(converted[next]) + '{:06b}'.format(converted[next+1]) + '{:06b}'.format(converted[next+2]) + '{:06b}'.format(converted[next+3]), 2)
output += value.to_bytes(3,'big')
next += 4

print(output)


now result for "JOHN" will be:

28 F2 0E

slidelljohn

What are the results if text = 'jOhN?

Anime_World

#9
Quote from: slidelljohn on September 04, 2020, 11:42:49 PM
What are the results if text = 'jOhN?


A8 FA 0E


Right?


>>> bin(ord('J'))
'0b1001010'
>>> bin(ord('j'))
'0b1101010'

slidelljohn

How does it go from 0x28, 0xF2, 0x0E to 0xA8, 0xFA, 0x0E?

Can you show me what the bits mean because it looks like data is missing?

Example in 8 bit, 4 bytes:
JOHN =
0x4a, 0x4f, 0x48, 0x4e
01001010, 01001111, 01001000, 01001110

Example in 6 bit 3 bytes:
JOHN =
001010, 001111, 001000, 001110

0x28, 0xF2, 0x0E
00101000, 11110010, 00001110

Anime_World

Quote from: slidelljohn on September 04, 2020, 11:52:02 PM
How does it go from 0x28, 0xF2, 0x0E to 0xA8, 0xFA, 0x0E?

Can you show me what the bits mean because it looks like data is missing?

Example in 8 bit, 4 bytes:
JOHN =
0x4a, 0x4f, 0x48, 0x4e
01001010, 01001111, 01001000, 01001110

Example in 6 bit 3 bytes:
JOHN =
001010, 001111, 001000, 001110

0x28, 0xF2, 0x0E
00101000, 11110010, 00001110



01101010 = j
01001111 = O
01101000 = h
01001110 = N


to 6-bits grouped:

   j     O       h     N
101010 001111 101000 001110 = A8FA0E

slidelljohn

#12
It looks like you are removing data to make it smaller than what it acutely is.
Why not show the full data?

Lets see the table(s).

Anime_World

Quote from: slidelljohn on September 05, 2020, 12:04:44 AM
It looks like you are removing data to make it smaller than what it acutely is.
Why not show the full data?

Lets see the tables.

OK.  :thumbsup:
Check tables...

slidelljohn

#14
Did you post them? I dont see any.

September 05, 2020, 12:34:45 AM - (Auto Merged - Double Posts are not allowed before 7 days.)

I don't understand the trickery. :-\
All you did was flip the 6th bit on the j and the h to make them lower case.

Example in 6 bit 3 bytes:
JOHN =
001010, 001111, 001000, 001110

Example in 6 bit 3 bytes:
jOhN =
101010, 001111, 101000, 001110

September 05, 2020, 12:49:48 AM - (Auto Merged - Double Posts are not allowed before 7 days.)

Lower case, upper case, numbers 0-9, and (!,.?'"space) will not fit in a 64 6-bit table.

Anime_World

#15
Quote from: slidelljohn on September 05, 2020, 12:20:38 AM
Did you post them? I dont see any.

September 05, 2020, 12:34:45 AM - (Auto Merged - Double Posts are not allowed before 7 days.)

I don't understand the trickery. :-\
All you did was flip the 6th bit on the j and the h to make them lower case.

Example in 6 bit 3 bytes:
JOHN =
001010, 001111, 001000, 001110

Example in 6 bit 3 bytes:
jOhN =
101010, 001111, 101000, 001110

September 05, 2020, 12:49:48 AM - (Auto Merged - Double Posts are not allowed before 7 days.)

Lower case, upper case, numbers 0-9, and (!,.?'"space) will not fit in a 64 6-bit table.

Look:

text = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
converted = []
for character in text:
  value = ord(character) &0x3F
  converted.append(value)
print(converted)


and the result:

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58



If u want to use numbers and other characters, combine them with one byte as bitmask every 8 bytes.
1 = 3bytes 6-bit encoded, counts as 1 byte decoded
0 = normal ascii value = decoded byte

slidelljohn

#16
Glad you came back! :woot!:

Can you finish filling your table from 0x00-0x3f (64 total)?

How did you get this from your table?
Quote from: Anime_World on September 04, 2020, 11:57:43 PM
to 6-bits grouped:

   j     O       h     N
101010 001111 101000 001110 = A8FA0E





This is why I say data is missing and why I'm trying to get you to show me your
data after encoding certain text(chars). I'm just trying to show you what happens
when you do it this way.
Quote from: Anime_World on September 05, 2020, 02:52:15 AM
If u want to use numbers and other characters, combine them with one byte as bitmask every 8 bytes.
1 = 3bytes 6-bit encoded, counts as 1 byte decoded
0 = normal ascii value = decoded byte

Bregalad

If you're going to need some complex bit shifting anyway, why not use huffman encoding ? It will always encode into less bits than fixed bitwidth. You know, some people already thought about compression for years and came up with their solutions :)

Including myself : http://www.romhacking.net/utilities/882/

Anime_World

#18
Quote from: slidelljohn on September 05, 2020, 03:36:45 AM
Glad you came back! :woot!:

This is close to your table but your numbers dont add up right.


Can you finish filling your table from 0x00-0x3f (64 total)?

How did you get this from your table?



This is why I say data is missing and why I'm trying to get you to show me your
data after encoding certain text(chars). I'm just trying to show you what happens
when you do it this way.

OK.  8)

slidelljohn

#19
@Bregalad

Yes, Bregalad. That's one reason why I didn't agree with this statement.
Quote from: abw on September 03, 2020, 10:02:21 PM
if you have a total of 70 6-bit tokens, you'll need at least 2 64-entry tables to represent them all

You don't need another 6-bit table for 6 extra tokens. If you need more tokens you might
as well add some sort of compression.

Table #2

10  = 65
11  = 66
000 = 67
001 = 68
010 = 69
011 = 70

@Anime_World
Your output data wasn't accurate. I was just trying to help. Also
certain char combinations are going to make your data larger and
that's why I said data was missing.

@Bregalad
I knew you knew a lot about snes music but I didn't realize you
knew so much about compression. I looked at your tool and its
got a lot of different compressions! You also included the source! :woot!:
Thanks! :)