
Author Topic: How to approach reversing this text compression?  (Read 1002 times)

Crane

  • Newbie
  • *
  • Posts: 3
How to approach reversing this text compression?
« on: December 09, 2018, 09:55:26 am »
So I'm trying to hack the translation from the PS4 version of SAO: Hollow Fragment into the Vita version.
The script files were conveniently uncompressed, and I was able to drop them in and have it working without a hitch - great!

However, various other message files and quest summaries and the like are not so convenient.

Here's a comparison of two files in a hex editor:



At first I thought it was a simple byte-pair compression, but there's obviously more to it than that: if you look at the places which correspond to "Let's go!" and "Let's do this!", they don't start with the same string of characters at all.

Does anyone have any suggestions for how I could go about reverse-engineering this?

FAST6191

  • Hero Member
  • *****
  • Posts: 2404
Re: How to approach reversing this text compression?
« Reply #1 on: December 09, 2018, 02:46:11 pm »
Any compression for a vaguely modern system which leaves that many 00s needs to be taken out the back and shot. To that end I would not call that compression.

The distribution of the first characters in the 16-bit chunks would point me somewhere towards http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml (as opposed to the Shift JIS or EUC-JP ones, also listed on that site), but all those 1 values are not ideal for that. Alternatively, as it is usually FFFF1027, it could be some kind of identifier, markup or formatting (and it appears the same in the English one too).
Bonus is that those numbers in the English are also in the Japanese, but with a 00 padding between them, which is a common way programmers of games like to make their lives easier rather than doing big boy Unicode ( https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/ ).
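
To illustrate the 00 padding point, here is a minimal Python sketch. It assumes the text is plain UTF-16LE, which the dump suggests but does not prove; `hex_pairs` is just a helper name made up for this example. ASCII characters come out as their usual byte followed by 00, which is the pattern you would expect to see in a hex editor.

```python
def hex_pairs(text: str) -> str:
    """Encode text as UTF-16LE and show it as space-separated 16-bit chunks."""
    raw = text.encode("utf-16-le")
    # Each UTF-16 code unit is two bytes, low byte first.
    return " ".join(raw[i:i + 2].hex().upper() for i in range(0, len(raw), 2))


# ASCII text: every character becomes <byte> 00.
print(hex_pairs("Let's go!"))
# -> 4C00 6500 7400 2700 7300 2000 6700 6F00 2100
```

If the English strings in the file show this pattern, that would be consistent with uncompressed 16-bit text rather than any kind of compression.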


Mauron

  • Sr. Member
  • ****
  • Posts: 455
Re: How to approach reversing this text compression?
« Reply #2 on: December 09, 2018, 05:12:19 pm »
I believe that's the English Vita and English PS4 data, not English and Japanese.

The Vita translation of Hollow Fragment was pretty bad, but I've heard they actually did a decent job with the PS4 version. I'm rather disappointed the developers didn't add the new translation as a patch to the Vita version themselves.
Mauron wuz here.

KingMike

  • Forum Moderator
  • Hero Member
  • *****
  • Posts: 6651
  • *sigh* A changed avatar. Big deal.
Re: How to approach reversing this text compression?
« Reply #3 on: December 09, 2018, 10:06:07 pm »
If you're talking about the bad translation of SAO, isn't that specifically the "Asian" English release?
From what I heard, it got enough bad publicity that Bandai-Namco fixed it before the NA/EU/AU Vita digital releases.
"My watch says 30 chickens" Google, 2018

Mauron

  • Sr. Member
  • ****
  • Posts: 455
Re: How to approach reversing this text compression?
« Reply #4 on: December 10, 2018, 01:25:17 am »
I played the NA Vita digital release. It's pretty terrible. I've heard the "Asian" English release is the same one as NA. Can't confirm about EU/AU though.
Mauron wuz here.

Crane

  • Newbie
  • *
  • Posts: 3
Re: How to approach reversing this text compression?
« Reply #5 on: December 10, 2018, 03:41:42 am »
Second Edit: Well, I found the US text for these files. Unfortunately it's actually all glommed together in a single big localize_msg.dat file, which the game obviously pulls from to overwrite the Japanese text when you boot it in English. And I have no reasonable way of working out how to persuade the Vita version to do the same thing - that version of the game handles multiple languages by literally having different copies of all the individual message files in separate folders indexed by language.

Not sure there's really anywhere I can go from here with my level of knowledge. I'm gonna have a look at the PC version to see if maybe it handles its localisation structure in a more convenient fashion, but I suspect not.


Edit: All of the below stuff is invalid, because I crossposted this to Stackexchange, and someone with more experience at this than me pointed out that it is actually the Japanese file I have there. I'm not quite sure how, since I got this from the same .pkg as the English script files. I obviously need to go and dig through the folder structure again, or possibly source an EU copy of the PS4 title instead of a US one...

----

Quote from Mauron: "I believe that's the English Vita and English PS4 data, not English and Japanese."

That's right. (Edit, it wasn't!) But I think FAST might not be entirely off-base by suggesting it could be using a Japanese encoding for the text - he's right about it looking like that sort of table with the weird "30"s attached to everything.

I'll upload a couple of files with more readable text in for you guys to take a look at:
Vita
PS4

"Monster" appears to be E230 F330 B930 BF30 / â0ó0¹0¿0


---

Goblin Thief
´0Ö0ê0ó0·0ü0Õ0
B430 D630 EA30 F330 B730 FC30 D530

Goblin Thief Archer
´0Ö0ê0ó0·0ü0Õ0¢0ü0Á0ã0ü0
B430 D630 EA30 F330 B730 FC30 D530 A230 FC30 C130 E330 FC30

---

Ancient Grief
¨0ó0·0§0ó0È0°0ê0ü0Õ0
A830 F330 B730 A730 F330 C830 B030 EA30 FC30 D530

Grief Screamer
°0ê0ü0Õ0¹0¯0ê0ü0Þ0ü0
B030 EA30 FC30 D530 B930 AF30 EA30 FC30 DE30 FC30

---

Grief
EA30 FC30 D530

Thief
B730 FC30 D530

So you can see that FC30 D530 is "ief".
But then I look for more occurrences of "gr":

Deep Grudge
Ç0£0ü0×0°0é0Ã0¸0
C730 A330 FC30 D730 B030 E930 C330 B830

And you don't see the EA30 that starts off "Grief".

I have a feeling FC30 could be some kind of switch byte, either an upper case indication or possibly marking the use of some kind of lookup table? It's also interesting that all the lines which are just objectives/boss names have the --30 structure, but some of the descriptive passages don't seem to.
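
In light of the edit at the top of this post, a quick way to test the "Japanese text" reading is to treat the chunks as UTF-16LE and decode them. This is a minimal Python sketch under that assumption; `decode_chunks` is a hypothetical helper name, and the hex strings are copied from the examples in this post.

```python
def decode_chunks(hex_text: str) -> str:
    """Treat space-separated hex as raw bytes and decode them as UTF-16LE."""
    return bytes.fromhex(hex_text).decode("utf-16-le")


samples = {
    "Goblin Thief": "B430 D630 EA30 F330 B730 FC30 D530",
    "Grief Screamer": "B030 EA30 FC30 D530 B930 AF30 EA30 FC30 DE30 FC30",
    "Deep Grudge": "C730 A330 FC30 D730 B030 E930 C330 B830",
}

for english, hex_text in samples.items():
    print(english, "->", decode_chunks(hex_text))
# Goblin Thief -> ゴブリンシーフ
# Grief Screamer -> グリーフスクリーマー
# Deep Grudge -> ディープグラッジ
```

Read this way, the chunks are katakana spellings of the English names. FC30 is U+30FC, the long vowel mark, which would explain why it shows up in so many names without being a switch byte. It would also explain why "Grief" and "Grudge" share B030 (グ) but not EA30 (リ): the match is on Japanese syllables, not English letters.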
« Last Edit: December 10, 2018, 01:46:23 pm by Crane »

theflyingzamboni

  • Jr. Member
  • **
  • Posts: 76
Re: How to approach reversing this text compression?
« Reply #6 on: December 11, 2018, 09:51:15 am »
Could it be compressed block-by-block, instead of the whole file at once? That would cause the same combinations of letters to be defined differently when compressed, if they occurred in different blocks.
ROM wasn't hacked in a day.

Crane

  • Newbie
  • *
  • Posts: 3
Re: How to approach reversing this text compression?
« Reply #7 on: December 11, 2018, 10:38:00 am »
Nah, as I said in my edit it turns out that the file I was looking at was actually the Japanese one.

The PS4 English text is stored in a separate file that doesn't mimic the data structure of the Vita version, and is in fact gonna be a real goddamn pain to patch into the Vita version.

You can have a look at it here.

What I want to do is almost certainly achievable in theory, but may in practice be so much of a pain in the arse that I can't be bothered. For a preliminary test I've hacked the new translations into one of the old message files by copy-pasting them with a hex editor; when I get home this evening I'll repack it and see if that loads nicely.

I give it a 50-50 chance of working, depending on how the Vita version actually locates strings within the file.

If it's byte-addressed or something then it'll fail horribly because the new strings are different sizes to the old ones. However, the structure of the original message files looks as though each individual message is fenced off by some kind of tags, so it's possible that it'll correctly read whatever I dump in between those.
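
To make the byte-addressed worry concrete, here is a Python sketch using a purely HYPOTHETICAL layout: a 32-bit string count, a table of 32-bit offsets, then null-terminated UTF-16LE strings. This is not the actual Vita message format, which is unknown here; `pack_messages` and `unpack_messages` are made-up names for illustration only.

```python
import struct


def pack_messages(messages):
    """Build a blob in the HYPOTHETICAL layout: count, offset table, strings."""
    encoded = [m.encode("utf-16-le") + b"\x00\x00" for m in messages]
    header_size = 4 + 4 * len(encoded)
    offsets, pos = [], header_size
    for chunk in encoded:
        offsets.append(pos)  # offset of this string from the start of the blob
        pos += len(chunk)
    header = struct.pack("<I", len(encoded))
    header += struct.pack(f"<{len(offsets)}I", *offsets)
    return header + b"".join(encoded)


def unpack_messages(blob):
    """Read every string back by following the offset table."""
    (count,) = struct.unpack_from("<I", blob, 0)
    offsets = struct.unpack_from(f"<{count}I", blob, 4)
    messages = []
    for start in offsets:
        end = start
        # Scan 16-bit units until the 0000 terminator (or end of data).
        while end + 2 <= len(blob) and blob[end:end + 2] != b"\x00\x00":
            end += 2
        messages.append(blob[start:end].decode("utf-16-le"))
    return messages


old = pack_messages(["Let's go!", "Let's do this!"])
new = pack_messages(["Come on, let's go!", "Let's do this!"])

# The second string's offset moves when the first string changes length.
print(struct.unpack_from("<I", old, 8)[0], struct.unpack_from("<I", new, 8)[0])
print(unpack_messages(new))
```

In a layout like this, pasting a longer string over the old one in a hex editor would leave the offset table pointing at the wrong bytes, so the whole file would need rebuilding. If the real files are delimited by tags instead, no such table exists and a straight paste has a better chance of working.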