[DMG] How to disassemble a ROM?

Started by hackfleisch, March 25, 2016, 02:58:18 PM


hackfleisch

Hello,

As I'm getting familiar with Z80 opcodes and using the BGB disassembler, I'd like to document the subs and loops found in the game ROM. It would be a great help if some experienced users could give me advice on "how to do it right". I'm sure there are some best-practice rules to follow.

My goal is to assemble the generated source code into a new binary ROM file and get it to run. My starting point is address 0x0100, of course ;-)

KingMike

I'm not sure if anyone has successfully made a program to "intelligently" disassemble a ROM (separating code from data somehow, never mind how a bot could automatically figure out things like indirect jumps, if those are in the Z80 set). Otherwise it is an extremely tedious effort to dump ASM manually (bgb has an option to disassemble a selected address range to a file).

Another example to challenge that automation: I've heard of pirates using a very simple code obfuscation:

jp Continue
(insert some random byte)
Continue:
...
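
To illustrate why that trick trips up a naive disassembler, here is a minimal Python sketch comparing a straight byte-after-byte sweep with one that follows jumps. The four-entry opcode table and the function names `linear_sweep` and `follow_flow` are made up for this demo (a real GB table has hundreds of entries), so treat it as a toy under those assumptions rather than a working tool.

```python
# Toy subset of GB opcodes: opcode -> (mnemonic, total length in bytes).
# Only the four opcodes used in this demo are listed.
OPCODES = {
    0xC3: ("jp a16", 3),
    0x3E: ("ld a,d8", 2),
    0xAF: ("xor a", 1),
    0xC9: ("ret", 1),
}


def linear_sweep(rom, start=0):
    """Decode byte after byte, assuming everything is code."""
    addr, starts = start, []
    while addr < len(rom) and rom[addr] in OPCODES:
        starts.append(addr)
        addr += OPCODES[rom[addr]][1]
    return starts


def follow_flow(rom, entry=0):
    """Decode only the bytes reachable from `entry` by following control flow."""
    todo, seen = [entry], set()  # a fuller version would push call/jr targets here
    while todo:
        addr = todo.pop()
        while addr < len(rom) and addr not in seen and rom[addr] in OPCODES:
            seen.add(addr)
            op = rom[addr]
            if op == 0xC3:  # unconditional jump: continue at the target only
                addr = rom[addr + 1] | (rom[addr + 2] << 8)
                continue
            if op == 0xC9:  # ret ends this path
                break
            addr += OPCODES[op][1]
    return sorted(seen)


# jp 0x0004 / junk byte 0x3E / xor a / ret
rom = bytes([0xC3, 0x04, 0x00, 0x3E, 0xAF, 0xC9])
print(linear_sweep(rom))  # [0, 3, 5]: the junk byte swallows the xor a
print(follow_flow(rom))   # [0, 4, 5]: the junk byte is never decoded
```

The sweep treats the junk byte as `ld a,d8` and loses the real instruction at offset 4, while the flow-following version skips it. Indirect jumps such as `jp (hl)` would still defeat this sketch.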


The one GB game I've done extensive hacking on was a fairly simple game, Ayakashi no Shiro. It was a small enough game (64KB) to fit all its code/data within the 32KB addressable limit (the higher banks only contained the graphics). I can only imagine the extra complexity of logging all the code from a game that regularly uses bankswapping.
"My watch says 30 chickens" Google, 2018

AWJ

Quote from: hackfleisch on March 25, 2016, 02:58:18 PM
Hello,

As I'm getting familiar with Z80 opcodes and using the BGB disassembler, I'd like to document the subs and loops found in the game ROM. It would be a great help if some experienced users could give me advice on "how to do it right". I'm sure there are some best-practice rules to follow.

My goal is to assemble the generated source code into a new binary ROM file and get it to run. My starting point is address 0x0100, of course ;-)

As well as 0x0100, you should also trace from each of the interrupt vectors, especially VBLANK (0x0040) and LCDSTAT (0x0048). In most console games, the main execution loop essentially revolves around the VBLANK interrupt. Interrupts are one of the main differences between the custom GB CPU and a real Z80. The GB has the same software interrupt instructions as a Z80 (RST 0x38, etc.) but hardware interrupts don't correspond to any of the three Z80 interrupt modes. On the GB each interrupt source has its own fixed vector.
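
As a rough sketch of what "trace from each vector" means in practice, the following Python snippet builds a list of starting addresses to seed a trace. The vector addresses are the fixed ones on the original Game Boy; the label names (`Start`, `VBlankInterrupt`, `Rst38`, and so on) are just placeholders invented for this example.

```python
# Fixed entry points on the original Game Boy (DMG).
RST_VECTORS = [0x00, 0x08, 0x10, 0x18, 0x20, 0x28, 0x30, 0x38]
INTERRUPT_VECTORS = {
    0x40: "VBlank",
    0x48: "LCDStat",
    0x50: "Timer",
    0x58: "Serial",
    0x60: "Joypad",
}
CART_ENTRY = 0x0100


def entry_points():
    """Return sorted (address, label) pairs; the labels are hypothetical names."""
    points = [(CART_ENTRY, "Start")]
    points += [(addr, f"{name}Interrupt") for addr, name in INTERRUPT_VECTORS.items()]
    points += [(addr, f"Rst{addr:02X}") for addr in RST_VECTORS]
    return sorted(points)


for addr, label in entry_points():
    print(f"{addr:04X} {label}")
```

Not every game uses every vector, so some of these addresses may hold data or padding rather than code.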

Remember that GB cartridges are bankswitched. When you see a JP or CALL to an address between 0x4000 and 0x7FFF, you have to figure out which ROM bank is switched in to determine where the jump actually goes. I'd start by disassembling the entire fixed bank (0x0000-0x3FFF) and looking for writes to ROM addresses (typically 0x2100 on the GB). On 8-bit consoles, figuring out how a particular game manages ROM bankswitching is a good first step to figuring out the overall structure of the program.
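
A small Python sketch of both ideas, assuming 16KB banks and a mapper whose bank-select register sits somewhere in 0x2000-0x3FFF (which covers the 0x2100 case above). The function names `rom_offset` and `find_bank_writes` are made up for illustration, and the scan is a plain byte-pattern match for `ld (a16),a` (opcode 0xEA), so it can also hit data that happens to look like that instruction.

```python
BANK_SIZE = 0x4000


def rom_offset(addr, bank):
    """Map a CPU address plus the selected ROM bank to an offset in the ROM file."""
    if 0x0000 <= addr < 0x4000:
        return addr  # fixed bank, always bank 0
    if 0x4000 <= addr < 0x8000:
        return bank * BANK_SIZE + (addr - 0x4000)
    raise ValueError(f"{addr:#06x} is not in cartridge ROM space")


def find_bank_writes(rom):
    """Yield (offset, target) for each `ld (a16),a` in the fixed bank
    whose target lies in 0x2000-0x3FFF."""
    for i in range(min(len(rom), BANK_SIZE) - 2):
        if rom[i] == 0xEA:
            target = rom[i + 1] | (rom[i + 2] << 8)
            if 0x2000 <= target < 0x4000:
                yield i, target


# nop / ld (0x2100),a / ret
sample = bytes([0x00, 0xEA, 0x00, 0x21, 0xC9])
print([(i, hex(t)) for i, t in find_bank_writes(sample)])
print(hex(rom_offset(0x4123, 3)))
```

Games can also switch banks through `ld (hl),a` or similar, which this pattern will not find, so the hits are only a starting point for locating the bank-switch routine.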

I'll just warn you ahead of time that converting a ROM of a commercial game to source code that can be reassembled into the original ROM is a very complex and time-consuming process.

By any chance do you know Python (the programming language)? I've been working on a multi-target disassembler written in Python for the last little while. It's designed to handle common console-game gotchas like bankswitching and inline subroutine arguments (subroutines that pop their own return address off the stack and manipulate it--very common on 8-bit consoles). It's not fully automated; it requires some Python programming to use, but I think it's much more efficient than trying to disassemble an entire game using bgb's built-in disassembler. If you're interested in trying it out, drop me a PM.

STARWIN

Is there a use case which benefits greatly from intelligent disassembly vs dumb disassembly? In other words, code/data separation is an interesting task intellectually, but how big of a difference can it make? A bit higher level than asm (or equivalent comments) would IMO be a reasonable goal for understanding/modding a game, but that's manual work in any case (?).

AWJ

Quote from: STARWIN on March 25, 2016, 05:06:29 PM
Is there a use case which benefits greatly from intelligent disassembly vs dumb disassembly? In other words, code/data separation is an interesting task intellectually, but how big of a difference can it make? A bit higher level than asm (or equivalent comments) would IMO be a reasonable goal for understanding/modding a game, but that's manual work in any case (?).

I'm suspicious of disassemblers that claim to be able to "automatically" separate data from code, but if you don't have some way of handling inline data then some games like Final Fantasy Legend 3 are impossible to get a useful disassembly out of...

STARWIN

Well I was more wondering what is considered useful and what not.

I don't really know the system and I don't really know what this inline data looks like. (But I'm curious about something being impossible to do.)

IIMarckus

Quote from: hackfleisch on March 25, 2016, 02:58:18 PM
As I'm getting familiar with Z80 opcodes and using the BGB disassembler, I'd like to document the subs and loops found in the game ROM. It would be a great help if some experienced users could give me advice on "how to do it right". I'm sure there are some best-practice rules to follow.

I started this project and this project many years ago, although at this point my work has been eclipsed by the larger community that grew out of it.

If you're interested in doing a large‐scale disassembly, data is more interesting than code at first. Start with the easy stuff, text and graphics, and pointers to them if practical. A debugger will be your best friend (BGB is the only choice right now).

Automation is helpful where you can fit it in. I was skeptical at first, and was doing everything basically by hand, but after someone automatically disassembled all the map scripts at once (which in Pokémon Red are written in assembly), it became much easier to figure out what particular subroutines did by comparing to the gameplay behavior I knew the maps had.

Do what you can to release quality work, but don't be afraid to share ugly code. Initially I kept everything private, with the intent of releasing it publicly at some point; after a while, though, I came to my senses and realized that if I ever set the project down it would end like all those other ROM hacking projects that never get finished. Losing my ego, releasing my unfinished work, and accepting that reputation is not important is the only reason these two disassemblies got finished.

Yes, they are 100% disassembled thanks to many people, and have been used for countless new projects:

Sorry, this went a bit off the rails. But the ROM hacking community has a real problem with hoarding, and I just want to share how moving past that caused my own project to grow beyond my wildest expectations.

AWJ

Quote from: STARWIN on March 25, 2016, 06:21:23 PM
Well I was more wondering what is considered useful and what not.

I don't really know the system and I don't really know what this inline data looks like. (But I'm curious about something being impossible to do.)

The simplest case of inline data on the GB looks like this:

SomeFunction:
pop hl
ldi a,(hl)
push hl
(do stuff...)
ret


This function pops its return address off the stack, reads the byte at that address, and pushes the modified return address back onto the stack. What this means is that every time there's a call SomeFunction in the game, the byte after the call opcode is actually an argument to SomeFunction, and not an instruction. If you disassemble it as an instruction, your disassembly will get out of alignment. And this pattern is not rare at all in 8-bit assembly code, either on consoles or computers (it's the way syscalls work on the Apple II, for example).
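
One way a disassembler can cope with this is to keep a table of functions known to take inline arguments and emit the following bytes as data. The Python sketch below assumes a hypothetical SomeFunction at 0x0200 that consumes one inline byte; the `INLINE_ARGS` table and `decode_call` helper are invented for this example and are not taken from any existing tool.

```python
# Hypothetical table: address of a function -> number of inline argument bytes.
INLINE_ARGS = {0x0200: 1}

CALL_A16 = 0xCD  # GB opcode for `call a16`


def decode_call(rom, addr):
    """Decode a `call a16` at `addr`.

    Returns (lines, address of the next real instruction), emitting any
    inline argument bytes as `db` data instead of decoding them as code.
    """
    assert rom[addr] == CALL_A16
    target = rom[addr + 1] | (rom[addr + 2] << 8)
    lines = [f"call ${target:04X}"]
    next_addr = addr + 3
    for _ in range(INLINE_ARGS.get(target, 0)):
        lines.append(f"db ${rom[next_addr]:02X}  ; inline argument")
        next_addr += 1
    return lines, next_addr


# call $0200 / inline byte $3E / ret
rom = bytes([0xCD, 0x00, 0x02, 0x3E, 0xC9])
lines, next_addr = decode_call(rom, 0)
print("\n".join(lines))
print(f"next instruction at ${next_addr:04X}")
```

Without the table entry, the 0x3E byte would be decoded as `ld a,d8` and swallow the `ret` that follows it, which is exactly the misalignment described above. Filling in the table still takes a human reading each callee.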

STARWIN

So the useful part is down to alignment? Would doing a dumb disassembly over all alignments and then merging fix all alignment issues? Assuming that disassembled data is a detectable sort of garbage most of the time (as that would result in non-overlapping code segments anyway).

KingMike

Quote from: STARWIN on March 25, 2016, 05:06:29 PM
Is there a use case which benefits greatly from intelligent disassembly vs dumb disassembly? In other words, code/data separation is an interesting task intellectually, but how big of a difference can it make? A bit higher level than asm (or equivalent comments) would IMO be a reasonable goal for understanding/modding a game, but that's manual work in any case (?).
It's possible that following "data", the disassembler could end up not resyncing on actual code correctly and end up improperly disassembling sections of code. I can at least tell that Z80 instructions can be a variable number of bytes (from 1 to 3?) whereas I think CPUs like ARM and MIPS are a constant 4 bytes per instruction. (68000 seems to always be an even length?)

zonk47

If you run the ROM in an emulator and play the game before disassembling, you can make a list of all the addresses from which graphics are loaded during play. The emulator could also conceivably make a list of the addresses on which the load and store instructions operate in addition to these (these are the stat and map data). By avoiding the areas so identified, you could get a perfect automatic disassembly 98% of the time... so yeah, you can't get an automatic disassembly without actually running the program in question, but if you do, there's no reason the emulator can't do it for you.

Disassembly is only part of the problem though: the real issue is labeling, which is always complicated and can't be done with a machine.
A good slave does not realize he is one; the best slave will not accept that he has become one.

FAST6191

Perfect 98% of the time? That is some optimism, perhaps even more than your automated translation stuff from that "how to increase hacking popularity" thread. Maybe for a single-shot program or a "my first homebrew", but for something as inherently variable as a game, no, not even on something as basic as the GB/GBC. Once you get into anything in the emulation or dynamic/runtime compilation world (I would say self-modifying code, but the world gave up on that), not a chance.
Emulator logging/play-driven analysis is a very valuable technique, or perhaps group of techniques, but I barely trust it to give me an area of free memory big enough to do some tests with.
Maybe if you really helped it and went through everything really methodically (every menu, every minigame, every variable level...) you might have something. However, said something would probably have taken more time than human-driven analysis/directed hacking, and either way you are probably going to want to know assembly.

Labelling by machine though... maybe not by machine but combined with traditional cheat making type approaches there might be something there.

"whereas I think CPUs like ARM and MIPS are a constant 4 bytes per instruction. "
The GBA's ARM7TDMI has THUMB mode, which uses 16-bit instructions and poses a problem for those hacking things; indeed, emulators have a nice automatic mode in the disassembler that will switch accordingly. That said, both modes are fixed length, and instructions only reference other registers or contain the immediates within the instruction rather than as a trailing value or some memory location, so there is that.

zonk47

I don't think you can do machine disassembly without going through the program's motions, but it's not a problem even then. Game loops are pretty simple stuff... plus you can program an emulator to read (without executing) the subroutines into the dump cache (until they call themselves, anyway).

As far as that goes though, why not just use an instruction dumper?

Today or tomorrow I'll release my Z80 tree disassembler/decompiler, which uses a tree view widget to disassemble routines on command.

hackfleisch

Distinguishing code from data seems to be a common challenge.
Completely disassembling and commenting the game is more of a learning exercise for me. I guess real cheaters/patchers use more pragmatic solutions.

And again, my expectation that an old console game is easy to hack has turned out to be wrong... and the simple Z80 CPU is more complicated than expected.

So, what I can do is follow the code using the debugger, identify the data areas by the registers loaded (HL, BC, DE) and write down the executed code, one instruction at a time. What would help in this process is a kind of memory list to mark the addresses seen.
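
Such a memory list could be as simple as one tag per ROM offset. Here is a minimal Python sketch of that idea; the `AddressMap` class and its single-character tags are invented for this example, and a real tool would probably also want to record the bank and a label for each range.

```python
class AddressMap:
    """Track which ROM offsets have been seen as code or data."""

    UNKNOWN, CODE, DATA = ".", "C", "D"

    def __init__(self, size):
        self.tags = [self.UNKNOWN] * size

    def mark(self, start, length, tag):
        """Tag a range; refuse to silently relabel code as data or vice versa."""
        for offset in range(start, start + length):
            old = self.tags[offset]
            if old not in (self.UNKNOWN, tag):
                raise ValueError(f"{offset:#06x} already marked {old}, not {tag}")
            self.tags[offset] = tag

    def unknown_ranges(self):
        """Return half-open (start, end) ranges that are still unmarked."""
        ranges, start = [], None
        for offset, tag in enumerate(self.tags + [None]):  # None closes a trailing run
            if tag == self.UNKNOWN and start is None:
                start = offset
            elif tag != self.UNKNOWN and start is not None:
                ranges.append((start, offset))
                start = None
        return ranges


amap = AddressMap(16)
amap.mark(0, 3, AddressMap.CODE)  # e.g. a 3-byte jp stepped through in the debugger
amap.mark(8, 4, AddressMap.DATA)  # e.g. a table that HL pointed at
print("".join(amap.tags))         # CCC.....DDDD....
print(amap.unknown_ranges())      # [(3, 8), (12, 16)]
```

The conflict check in `mark` is a deliberate choice: a range that shows up as both code and data usually means either a misaligned disassembly or an inline argument, and both are worth a closer look.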

March 26, 2016, 09:05:49 AM - (Auto Merged - Double Posts are not allowed before 7 days.)

Quote from: zonk47 on March 26, 2016, 07:50:23 AM
Today or tomorrow I'll release my Z80 tree disassembler/decompiler, which uses a tree view widget to disassemble routines on command.
Sounds interesting! Where can I get this tool?

zonk47

Quote from: hackfleisch on March 26, 2016, 08:59:59 AM
Sounds interesting! Where can I get this tool?

Whenever I finish the tree view tie-ins to the massive CASE statement. :P