
Topic: Fuzzy string search tool

Nagato

Fuzzy string search tool
« on: August 06, 2014, 11:02:34 am »
I'd like some feedback on a tool I made before submitting this to the utilities database if anyone has some free time and is interested.

I started development on this tool last night after playing around with a GBC game I've been interested in for a while. The tool might be a bit specific in its usage, but it was enough for me to find what I needed. I made it because I wanted to find the strings in a certain Japanese GBC game, but I don't have much experience with reverse engineering GBC games and I didn't know what character encoding the game was using. I looked in the tile viewer and saw the entire character set, so I made a table and then wrote this tool to do a fuzzy string search. It turns out that the table I wrote was mostly right (most of the text in the game started at index 0), but not for all of the text, so the tool was still useful for finding those odd strings. The last example at the bottom shows text whose character encoding starts at index 0xb0, and the tool is still able to work with it. The search and replace functions determine the starting index of the character encoding automatically, so they can be used to search/replace every instance in the ROM.

Admittedly I didn't do too much searching to see if there were already popular tools like this, but is it something that would be useful for anyone?
Does anyone have any suggestions/improvements/comments they could offer?
Are there any obvious bugs or crashes that I forgot to handle?

The tool is written in C# (minimum .NET 3.5), entirely open source, licensed under the MIT license, and hosted on GitHub: https://github.com/polaris-/romtool-textutil
Binaries can be found here: https://github.com/polaris-/romtool-textutil/releases
I'm willing to accept pull requests if anyone wants to add their own features or bugfixes.

I've included the charset.xml for the game I was working on in the release file. It can be used as an example of how to make your own charset.xml. Sorry for the Xml. :(
The Index field can be in base 10 or base 16 (must be prefixed with 0x). The Character field can be a string of any length.
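
As a rough illustration of the Index rule above (this is not the tool's actual C# code, and `parse_index` is a made-up name), the two accepted notations could be parsed like this in Python:

```python
def parse_index(text: str) -> int:
    """Parse an Index value written in base 10, or in base 16 with a 0x prefix."""
    text = text.strip()
    if text.lower().startswith("0x"):
        # Hexadecimal form, e.g. "0xb0"
        return int(text[2:], 16)
    # Plain decimal form, e.g. "176"
    return int(text, 10)


print(parse_index("176"))   # decimal notation
print(parse_index("0xb0"))  # hexadecimal notation for the same index
```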

Edit, just for more clarification in case I didn't state it clearly: The charset.xml does not have to be the exact character table. If, for instance, you know that A comes before B, which comes before C, you could define it as A = 0, B = 1, C = 2, then search for "ACBAC" and it'll search the entire ROM for any sequence of bytes that matches the pattern x+0 x+2 x+1 x+0 x+2, where x is an unknown shared base index. Everything is relative here, so A could be 15, B could be 16, and C could be 17 and it should still work.
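
To make the x+0 x+2 x+1 x+0 x+2 idea concrete, here is a minimal Python sketch of a relative search. It is not the tool's actual implementation. The function name `relative_search`, the one-byte-per-character assumption, and the wrap-around at 256 are all assumptions made for the example.

```python
def relative_search(data: bytes, indexes: list[int]) -> list[tuple[int, int]]:
    """Return (offset, base) pairs where data matches indexes shifted by one shared base."""
    matches = []
    n = len(indexes)
    for offset in range(len(data) - n + 1):
        # The first byte fixes the candidate base; every other byte must agree with it.
        base = (data[offset] - indexes[0]) % 256
        if all(data[offset + i] == (base + idx) % 256 for i, idx in enumerate(indexes)):
            matches.append((offset, base))
    return matches


charset = {"A": 0, "B": 1, "C": 2}             # only the relative order matters
pattern = [charset[ch] for ch in "ACBAC"]      # [0, 2, 1, 0, 2]
rom = bytes([0xFF, 15, 17, 16, 15, 17, 0x00])  # "ACBAC" stored with A = 15
print(relative_search(rom, pattern))           # [(1, 15)]
```

The result reports both where the string was found and which base it was stored with, which is the same information as the "Character Base" lines in the examples below.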

How to use, with detailed examples and results:
Code: [Select]
Usage:
search - Searching for string with unknown base offset
        romtool-textutil.exe search (datafile.bin) (search string)
        Ex: romtool-textutil.exe search rom.gbc "Hello!"

convert - Convert string using given base
        romtool-textutil.exe convert (string) (charset base)
        Ex: romtool-textutil.exe convert "Hello!" 145

gentable - Generate Thingy table based on character mapping with base offset added
        romtool-textutil.exe gentable (output.tbl) (charset base)
        Ex: romtool-textutil.exe gentable output.tbl 145

replace - Replace text of equal or less length
        romtool-textutil.exe replace (datafile.bin) (outputfile.bin) (search string) (replace string)
        Ex: romtool-textutil.exe replace rom.gbc rom2.gbc "Hello!" "World!"

forcereplace - Force replace text of a greater length
        romtool-textutil.exe forcereplace (datafile.bin) (outputfile.bin) (search string) (replace string)
        Ex: romtool-textutil.exe forcereplace rom.gbc rom2.gbc "Hello!" "Overwritten"

display - Convert the string at the given address to plaintext
        romtool-textutil.exe display (datafile.bin) (hex address) (length) (charset base)
        Ex: romtool-textutil.exe display rom.gbc ce9fa 8 0

Finding arbitrary string in ROM based on character table:
Code: [Select]
C:\Projects\romtool-textutil\romtool-textutil\bin\Debug>romtool-textutil.exe search rom.gbc "はじめから"
Searching for 'はじめから' in rom.gbc...
Searching for pattern:
00000000 | 1a 39 22 06 27

Found 3 matches:
Character Base: 00
0006d443 | 1a 39 22 06 27 00 02 07 1f 0d fe e1 14 00 ec ff
0006d453 | 55 4c 2d 00 14 28 1a 3a 0c 10 19 1a fd 4d 4f 74

Character Base: 00
0007b939 | 1a 39 22 06 27 fe 0b 0f 2c 15 02 0a 14 40 0d fc
0007b949 | 5c 99 6b 33 fd 02 2f 47 02 15 27 00 0b 0f 2c 15

Character Base: 00
000ce81b | 1a 39 22 06 27 ff ff 12 3f 07 06 27 ff ff 9a a0
000ce82b | 9e a7 ad cb fd 9c a5 9a ac ac cb fd a2 9d 00 00

Convert string to use character table encoding:
Code: [Select]
C:\Projects\romtool-textutil\romtool-textutil\bin\Debug>romtool-textutil.exe convert "Hello, world!" 0
00000000 | a1 9e a5 a5 a8 ff 00 b0 a8 ab a5 9d c2

C:\Projects\romtool-textutil\romtool-textutil\bin\Debug>romtool-textutil.exe convert "Hello, world!" 10
00000000 | ab a8 af af b2 09 0a ba b2 b5 af a7 cc
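
The two outputs above differ only by the base added to every byte, with values wrapping around past ff (ff becomes 09 and 00 becomes 0a at base 10). A small Python sketch of that step, where `apply_base` is a hypothetical name and not something from the tool's source:

```python
def apply_base(indexes: bytes, base: int) -> bytes:
    """Add a shared base to every character index, wrapping around at 256."""
    return bytes((b + base) % 256 for b in indexes)


# Bytes printed for "Hello, world!" at base 0 in the example above.
base0 = bytes.fromhex("a1 9e a5 a5 a8 ff 00 b0 a8 ab a5 9d c2")
print(apply_base(base0, 10).hex(" "))
# ab a8 af af b2 09 0a ba b2 b5 af a7 cc
```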

Generate Thingy table using character table:
Code: [Select]
C:\Projects\romtool-textutil\romtool-textutil\bin\Debug>romtool-textutil.exe gentable output.tbl 0
Generated new file output.tbl with 236 entries.
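
For anyone unfamiliar with Thingy tables, they are plain text with one `HEX=character` entry per line. The sketch below shows roughly how a character mapping plus a base could be turned into such lines in Python. The function name `make_table` and the three-entry charset are assumptions for illustration (the indexes are the ones visible in the examples in this post), and the tool's real output may differ in ordering or encoding.

```python
def make_table(charset: dict[str, int], base: int) -> str:
    """Build Thingy-style 'HEX=character' lines from a charset with the base added."""
    lines = []
    for char, index in sorted(charset.items(), key=lambda item: item[1]):
        # Assumption: values wrap around at 256, like the convert output does.
        lines.append(f"{(index + base) % 256:02X}={char}")
    return "\n".join(lines) + "\n"


charset = {"え": 0x04, "な": 0x15, "ま": 0x1F}  # partial, illustrative charset
print(make_table(charset, 0xB0), end="")
# B4=え
# C5=な
# CF=ま
```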

Replace text of equal or less length:
Code: [Select]
C:\Projects\romtool-textutil\romtool-textutil\bin\Debug>romtool-textutil.exe replace rom.gbc rom2.gbc "はじめから" "Start"
Replacing 'はじめから' with 'Start' in rom.gbc...
Character Base: 00
0006d443 | 1a 39 22 06 27 00 02 07 1f 0d fe e1 14 00 ec ff
0006d453 | 55 4c 2d 00 14 28 1a 3a 0c 10 19 1a fd 4d 4f 74
->
0006d443 | ac ad 9a ab ad 00 02 07 1f 0d fe e1 14 00 ec ff
0006d453 | 55 4c 2d 00 14 28 1a 3a 0c 10 19 1a fd 4d 4f 74

Character Base: 00
0007b939 | 1a 39 22 06 27 fe 0b 0f 2c 15 02 0a 14 40 0d fc
0007b949 | 5c 99 6b 33 fd 02 2f 47 02 15 27 00 0b 0f 2c 15
->
0007b939 | ac ad 9a ab ad fe 0b 0f 2c 15 02 0a 14 40 0d fc
0007b949 | 5c 99 6b 33 fd 02 2f 47 02 15 27 00 0b 0f 2c 15

Character Base: 00
000ce81b | 1a 39 22 06 27 ff ff 12 3f 07 06 27 ff ff 9a a0
000ce82b | 9e a7 ad cb fd 9c a5 9a ac ac cb fd a2 9d 00 00
->
000ce81b | ac ad 9a ab ad ff ff 12 3f 07 06 27 ff ff 9a a0
000ce82b | 9e a7 ad cb fd 9c a5 9a ac ac cb fd a2 9d 00 00

Generated new file rom2.gbc.
Code: [Select]
C:\Projects\romtool-textutil\romtool-textutil\bin\Debug>romtool-textutil.exe replace rom.gbc rom2.gbc "はじめから" "Start Game"
Replacing 'はじめから' with 'Start Game' in rom.gbc...
ERROR: Length of replacement string (10) is greater than the length of the original string (5).
Use forcereplace if this was intentional.

Replace text of greater length:
Code: [Select]
C:\Projects\romtool-textutil\romtool-textutil\bin\Debug>romtool-textutil.exe forcereplace rom.gbc rom2.gbc "はじめから" "Start Game"
Replacing 'はじめから' with 'Start Game' in rom.gbc...
Character Base: 00
0006d443 | 1a 39 22 06 27 00 02 07 1f 0d fe e1 14 00 ec ff
0006d453 | 55 4c 2d 00 14 28 1a 3a 0c 10 19 1a fd 4d 4f 74
->
0006d443 | ac ad 9a ab ad 00 a0 9a a6 9e fe e1 14 00 ec ff
0006d453 | 55 4c 2d 00 14 28 1a 3a 0c 10 19 1a fd 4d 4f 74

Character Base: 00
0007b939 | 1a 39 22 06 27 fe 0b 0f 2c 15 02 0a 14 40 0d fc
0007b949 | 5c 99 6b 33 fd 02 2f 47 02 15 27 00 0b 0f 2c 15
->
0007b939 | ac ad 9a ab ad 00 a0 9a a6 9e 02 0a 14 40 0d fc
0007b949 | 5c 99 6b 33 fd 02 2f 47 02 15 27 00 0b 0f 2c 15

Character Base: 00
000ce81b | 1a 39 22 06 27 ff ff 12 3f 07 06 27 ff ff 9a a0
000ce82b | 9e a7 ad cb fd 9c a5 9a ac ac cb fd a2 9d 00 00
->
000ce81b | ac ad 9a ab ad 00 a0 9a a6 9e 06 27 ff ff 9a a0
000ce82b | 9e a7 ad cb fd 9c a5 9a ac ac cb fd a2 9d 00 00

Generated new file rom2.gbc.
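
Both replace modes amount to writing the encoded replacement over the bytes at each match. The only difference visible in the output above is that forcereplace keeps writing past the end of the original string. A minimal Python sketch of that overwrite, using the first match from the examples (`overwrite_at` is a made-up name, and what the tool does when the replacement is shorter than the original is not covered here):

```python
def overwrite_at(data: bytes, offset: int, replacement: bytes) -> bytes:
    """Return a copy of data with replacement written over the bytes at offset."""
    if offset < 0 or offset + len(replacement) > len(data):
        raise ValueError("replacement would run past the end of the data")
    patched = bytearray(data)
    patched[offset:offset + len(replacement)] = replacement
    return bytes(patched)


row = bytes.fromhex("1a 39 22 06 27 00 02 07 1f 0d fe e1 14 00 ec ff")
start = bytes.fromhex("ac ad 9a ab ad")                      # "Start"
start_game = bytes.fromhex("ac ad 9a ab ad 00 a0 9a a6 9e")  # "Start Game"

print(overwrite_at(row, 0, start).hex(" "))       # same length: only the match changes
print(overwrite_at(row, 0, start_game).hex(" "))  # longer: the next 5 bytes are clobbered
```

The second call shows why forcereplace needs care: the five bytes after the original string are lost, whatever they were.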

Display arbitrary binary data using character table:
Code: [Select]
C:\Projects\romtool-textutil\romtool-textutil\bin\Debug>romtool-textutil.exe display rom.gbc 0007b939 0x32 0
00000000 | 1a 39 22 06 27 fe 0b 0f 2c 15 02 0a 14 40 0d fc
00000010 | 5c 99 6b 33 fd 02 2f 47 02 15 27 00 0b 0f 2c 15
00000020 | 02 fe 0a 14 3d 15 fc ec eb 2a 2e 27 08 15 41 c9
00000030 | c9 c9
Plaintext: はじめから さそわないことです チームが いっぱいなら さそわない ことだな   れんらくなど・・・
Code: [Select]
C:\Projects\romtool-textutil\romtool-textutil\bin\Debug>romtool-textutil.exe display rom.gbc 0xd11ca 20 0xb0
00000000 | c5 cf b4 29 03 0e 02 c8 de da 79 b2 0f 02 07 04
00000010 | be b2 f5 c2
Plaintext: なまえ    ねんれ い    せいべつ
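
Display is the reverse of convert: subtract the base from each byte and look the result up in the character table. Here is a short Python sketch of that idea, assuming a partial table built only from the indexes visible in the examples above. The name `decode` and the "?" placeholder for unknown bytes are choices made for this example, not the tool's behaviour.

```python
def decode(data: bytes, base: int, table: dict[int, str]) -> str:
    """Subtract the base from each byte and map the result through the table."""
    return "".join(table.get((b - base) % 256, "?") for b in data)


# Partial table, taken from the byte patterns shown in the examples above.
table = {0x04: "え", 0x06: "か", 0x15: "な", 0x1A: "は",
         0x1F: "ま", 0x22: "め", 0x27: "ら", 0x39: "じ"}

print(decode(bytes.fromhex("c5 cf b4"), 0xB0, table))       # なまえ
print(decode(bytes.fromhex("1a 39 22 06 27"), 0x00, table))  # はじめから
```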
« Last Edit: August 06, 2014, 11:55:26 am by Nagato »

KaioShin

Re: Fuzzy string search tool
« Reply #1 on: August 06, 2014, 01:10:07 pm »
I haven't heard of a game that needs such a thing, but that doesn't mean it's not worth submitting.

Relative Searching is based on the same concept: you know the order of the characters but not the offset where it starts counting them. That the offset changes around within the game is slightly different, though. Not related to this tool, but have you thought further about re-inserting your text? How do you determine which table offset to use for writing later? It seems to me like you'll have to figure out why the offsets change before you can do anything with the text dump. And then this searching method will probably not be needed anymore. I'm not saying your tool is the wrong approach; it just seems like a key piece of understanding is missing if the tool is even needed.

Nagato

Re: Fuzzy string search tool
« Reply #2 on: August 06, 2014, 01:48:58 pm »
It's not so much that it was "needed", but that it made it easier for me to find what I was looking for so I thought maybe someone else might find it useful.

The relative searching thing is pretty much the same thing I'm doing here. This tool isn't meant to be used for serious text insertion so that's not a problem I'm trying to solve with this tool (I put in replacement functions just so I could easily test things). I want it to be used together with things like Atlas or whatever everyone's favorite tools are. It's meant to be used to find text strings of an unknown encoding based on the relative offsets assigned to each character. It can also output a character table for use in other tools based on the charset.xml relative to whatever base index you specify (which can be found using this tool). It's up to the user to figure out things like why the offsets change. This tool will just be able to tell them where the text is regardless of whether the table changed or not, and it'll say what the new base index is relative to the current table.

The problem I faced last night was that I had a rough idea about the order of the characters in the character table after viewing everything in the tile viewer, but I did not know where the text was or even the starting index and I didn't want to try every combination of 3 bytes to find the word "なまえ" for example. Using the table I created (charset.xml) that contained a value assigned to each character, it tries to find all strings of 3 bytes that match the pattern for "なまえ". You can see the result of that search here. There may be some false positives for shorter strings like this, but most of them are the correct word even with different base indexes.
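
To put a rough number on the false positives: in a relative search the first byte of a window only fixes the base, so a window of random bytes matches a pattern of length n with probability 1/256^(n-1). The Python sketch below estimates the expected number of chance hits. It assumes a 1 MiB ROM filled with uniformly random, independent bytes, which real ROM data is not, so treat it as an order-of-magnitude guide only; `expected_chance_matches` is a made-up name.

```python
def expected_chance_matches(rom_size: int, length: int) -> float:
    """Expected number of accidental relative-pattern matches in random data."""
    windows = rom_size - length + 1
    # The first byte picks the base; each of the remaining bytes matches with chance 1/256.
    return windows / 256 ** (length - 1)


print(expected_chance_matches(1 << 20, 3))  # about 16 chance hits for a 3-character string
print(expected_chance_matches(1 << 20, 5))  # far below 1 for a 5-character string
```

Under those assumptions a 3-character search is expected to pick up a handful of accidental hits, while a 5-character search is expected to pick up essentially none.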

I guess you could consider it just a relative search tool but with some extra features that made it easier for me to find unknown text in the ROM I was looking at. I don't think it's all that useful of a tool once you're past the stage where you're trying to find the text and already know the character table, but I think it has some uses in the early stages.

Pennywise

Re: Fuzzy string search tool
« Reply #3 on: August 06, 2014, 03:22:55 pm »
The GB/C hardware is one of the simplest and easiest systems to hack out there. If you're trying to build a table, all you have to do is look at VRAM and see if it stores the entire font there. If not, you can easily debug to find where the text is loaded by setting VRAM breakpoints etc. It really only takes a few minutes to do as the debugging capabilities are quite advanced for the GB/C.

My honest opinion here, but I feel like a case study on how to find text encoding/locations etc by debugging would be more useful than the tool as that is the ideal method to do all this stuff. I stopped searching for text in a hex editor when I learned how to debug and reverse-engineer games.

My personal opinions aside, there's probably some use to it for beginners though.
« Last Edit: August 06, 2014, 04:20:02 pm by Pennywise »

FAST6191

Re: Fuzzy string search tool
« Reply #4 on: August 06, 2014, 04:51:20 pm »
More open source tools are nice to see.

That said, other than your workflow perhaps being more obvious/slightly less esoteric and the easier replace/alteration options, what is all that different from the wildcards supported in something like http://www.romhacking.net/utilities/513/ ?

Nagato

Re: Fuzzy string search tool
« Reply #5 on: August 06, 2014, 06:26:01 pm »
Quote from: Pennywise on August 06, 2014, 03:22:55 pm
The GB/C hardware is one of the simplest and easiest systems to hack out there. If you're trying to build a table, all you have to do is look at VRAM and see if it stores the entire font there. If not, you can easily debug to find where the text is loaded by setting VRAM breakpoints etc. It really only takes a few minutes to do as the debugging capabilities are quite advanced for the GB/C.

My honest opinion here, but I feel like a case study on how to find text encoding/locations etc by debugging would be more useful than the tool as that is the ideal method to do all this stuff. I stopped searching for text in a hex editor when I learned how to debug and reverse-engineer games.

My personal opinions aside, there's probably some use to it for beginners though.

Thanks for the comment. I think you're focusing too much on the GBC part, though. I would consider myself pretty decent at reverse engineering on other platforms already, but I didn't want to spend however many hours learning how to use a debugger for the GBC and then learning how the hardware works and learning the instruction set at the same time. It's easy to say that it'd take a few minutes to do if you already know the tools and the workflow, but it probably would've taken me a lot longer just to even become familiar with the system. I don't have plans to mess with the GBC anymore now that my itch has been scratched so that's more effort than I am willing to put in right now. The same problem would also exist if I wanted to look at another game on the SNES or Genesis or something like that. The tool is platform agnostic so I'd be able to reuse it anywhere.

Quote from: FAST6191 on August 06, 2014, 04:51:20 pm
That said, other than your workflow perhaps being more obvious/slightly less esoteric and the easier replace/alteration options, what is all that different from the wildcards supported in something like http://www.romhacking.net/utilities/513/ ?

I've never seen that program before so I wouldn't be able to tell you. It doesn't look like it'd be all that different really.

Karatorian

Re: Fuzzy string search tool
« Reply #6 on: November 02, 2014, 02:20:33 pm »
Honestly, a standard table file will probably see more use from the community than some new XML format. We've only just managed to get some consensus hashed out on table formats. The hope is that new tools will implement that consensus and not propagate further incompatible formats.