Romhacking.net

Romhacking => Personal Projects => Topic started by: tcaudilllg on August 20, 2010, 01:21:20 pm

Title: Code Naturalizer
Post by: tcaudilllg on August 20, 2010, 01:21:20 pm
I'm meaning to write a code naturalizer, a kind of disassembler which uses natural language. (to an extent, anyway) I think this will greatly improve the readability of code and make hacking easier.

The program will function by reading all memory within a range specified by the user. It then attempts to convert the memory into opcodes.

Features I want to include:

Example output:
Code: [Select]
"MOV AX, &FFFF" -> "Move data at 0xFFFF/65535 into Accumulator"

I suspect it will not be difficult to write, although I may need some help with the details. The NES is my first target.
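
A minimal sketch of the table-driven idea described above, not the actual tool. It assumes 6502 opcodes (the NES CPU) rather than the x86-style example, covers only four opcodes, and the names `OPCODES`, `hex`, and `naturalize` are hypothetical:

```javascript
// Hypothetical lookup table: opcode byte -> instruction size and wording.
// Only four 6502 opcodes are covered here; a real table would need all of them.
const OPCODES = {
  0xA9: { size: 2, text: (v) => `Load ${v} into Accumulator` },                 // LDA #imm
  0xAD: { size: 3, text: (v) => `Load value at ${hex(v)} into Accumulator` },   // LDA abs
  0x8D: { size: 3, text: (v) => `Move Accumulator to address ${hex(v)}` },      // STA abs
  0xD8: { size: 1, text: () => "Disable decimal mode" },                        // CLD
};

// Format a number as a 4-digit hex address, e.g. 8193 -> "$2001".
function hex(v) {
  return "$" + v.toString(16).toUpperCase().padStart(4, "0");
}

// Walk the byte range linearly and emit one sentence per instruction.
function naturalize(bytes, start = 0) {
  const lines = [];
  let pc = start;
  while (pc < bytes.length) {
    const entry = OPCODES[bytes[pc]];
    if (!entry) {
      lines.push(`${hex(pc)}: (unknown byte ${bytes[pc]})`);
      pc += 1;
      continue;
    }
    if (pc + entry.size > bytes.length) break; // instruction runs past the range
    let operand = 0;
    if (entry.size === 2) operand = bytes[pc + 1];
    if (entry.size === 3) operand = bytes[pc + 1] | (bytes[pc + 2] << 8); // little-endian
    lines.push(`${hex(pc)}: ${entry.text(operand)}`);
    pc += entry.size;
  }
  return lines;
}

console.log(naturalize([0xD8, 0xA9, 0x06, 0x8D, 0x01, 0x20]).join("\n"));
```

A plain linear walk like this is the simplest approach, but it will misread any data bytes that sit between instructions.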
Title: Re: Code Naturalizer
Post by: Tauwasser on August 20, 2010, 01:30:59 pm
Bad example, as it doesn't translate back into ASM.

cYa,

Tauwasser
Title: Re: Code Naturalizer
Post by: tcaudilllg on August 20, 2010, 02:20:59 pm
Quote from: Tauwasser
Bad example, as it doesn't translate back into ASM.

cYa,

Tauwasser

Why does it have to?

What I aim to do with this is make it easier for hackers to figure out what's going on in the first place.

I'm reviewing the JavaScript port of vNES. I'd like to excerpt the CPU core for use in my project. However, I'm not sure if its design will allow me that option.
Title: Re: Code Naturalizer
Post by: Tauwasser on August 20, 2010, 04:46:59 pm
I thought your approach was to make hacking easier. I understood your initial post as if it would be going from ASM to naturalized code and then back in order for the hacker to tinker with the naturalized code.

IMO just naturalizing isn't worth programming. Opcodes are there for a reason: they are concise and exact. Anybody with half a brain will be able to read them just as easily as your naturalized code, if not more easily, simply because a defined and concise syntax is worth its money. Even three lines of your naturalized code will become incomprehensible and unreadable.

So basically, you want a way to output ASM code in a naturalized form to help people understand it. It's not worth the effort, and nobody will be able to properly learn from it anyway. Abstraction is the way to go. Understanding what a piece of code does is more than understanding the sum of its parts. Just look at the old threads Rai had going on (sometimes as many as three for the same piece of code/topic): he understood all the opcodes, yet couldn't find the abstract routine they implemented or what they accomplished.

cYa,

Tauwasser
Title: Re: Code Naturalizer
Post by: tcaudilllg on August 20, 2010, 06:42:32 pm
Here's what I've got so far:
(this is a partial rip from JNES)

Link (http://pastebin.com/nVyCGWRe)

August 20, 2010, 06:43:05 pm - (Auto Merged - Double Posts are not allowed before 7 days.)
Link (http://pastebin.com/MAvzeEQh)

I disagree that abstraction is the way to go, for the simple reason that the opcodes are kinda abstract and difficult to deal with. Next I mean to put in the memory labeling scheme, followed by an option to view either memory addresses or their functional labels. Long term, divining intent (within reason) is an option as well.

--Moderator edit--
Replaced extensively long code tags with pasty code links.
Title: Re: Code Naturalizer
Post by: DarknessSavior on August 20, 2010, 07:23:57 pm
You need to make sure that if you do this, your opcode descriptions for output are exactly what they are supposed to be. For example, you have ADC as "Add value at address (address) to Accumulator". It should include something about it adding one, if the carry flag is set. Maybe you could make a separate case for when the carry flag is set, and when it isn't. But make sure those little details are included.
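
A small sketch of how the ADC wording could carry the carry-flag detail. The function name `describeADC` and the three-state `carry` argument (true, false, or undefined when the tracer can't tell) are assumptions for illustration:

```javascript
// Describe ADC, including the carry flag's contribution.
// carry: true = known set, false = known clear, undefined = unknown at this point.
function describeADC(operandText, carry) {
  if (carry === true) {
    return `Add ${operandText} + 1 to Accumulator (carry is set)`;
  }
  if (carry === false) {
    return `Add ${operandText} to Accumulator (carry is clear)`;
  }
  // Without tracing we cannot know the flag, so state the condition explicitly.
  return `Add ${operandText} to Accumulator (+ 1 if carry set)`;
}

console.log(describeADC("value at address 17"));
```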

~DS
Title: Re: Code Naturalizer
Post by: tcaudilllg on August 20, 2010, 08:08:35 pm
About the address load functions... I think what's needed is to rip out the actual address reading functions and to replace them instead with the address itself.

Here's the original:
Code: [Select]

    load: function(addr){
        if (addr < 0x2000) {
            return mem[addr & 0x7FF];
        }
        else {
            return nes_mmap_load(addr);
        }
    },
   
    load16bit: function(addr){
        if (addr < 0x1FFF) {
            return mem[addr&0x7FF]
                | (mem[(addr+1)&0x7FF]<<8);
        }
        else {
            return nes_mmap_load(addr) | (nes_mmap_load(addr+1) << 8);
        }
    },


And here's the rip:
Code: [Select]

    load: function(addr){
        if (addr < 0x2000) {
            return addr & 0x7FF;
        }
        else {
            return addr;
        }
    },
   
    load16bit: function(addr){
        if (addr < 0x1FFF) {
            return addr & 0x7FF;
        }
        else {
            return addr;
        }
    },

I think that's right.

The key to assessing the meaning of the address is implicit in that "addr < 0x2000" check: if the address falls in a certain range, then it represents a specific function of the system. For example, anything below 0x2000 is the console's 2 KB of internal RAM (mirrored), while cartridge PRG-ROM normally sits at 0x8000 and up. It can be confusing to recall the meaning of specific address ranges when trying to figure out what a program is doing. It's another language. I can see how people who like seeing things in terms of different symbolic systems would like dealing with the numbers in and of themselves. But that doesn't make it easier to figure out what's going on, unless you are an adept in that kind of mind transplant. (I'm not putting people who can do that down -- it's an important skill -- but correlating address ranges with functions is not the best use of that talent).
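
A rough sketch of what range-based labeling could look like, assuming the commonly documented NES CPU memory map (internal RAM mirrored below $2000, PPU registers mirrored every 8 bytes up to $3FFF, PRG-ROM usually at $8000 and up). The function name `labelAddress` and the label wording are hypothetical, and the $4020-$7FFF area really depends on the cartridge:

```javascript
// Names of the eight PPU registers at $2000-$2007 (mirrored through $3FFF).
const PPU_REGISTERS = [
  "PPUCTRL", "PPUMASK", "PPUSTATUS", "OAMADDR",
  "OAMDATA", "PPUSCROLL", "PPUADDR", "PPUDATA",
];

function hex4(v) {
  return "$" + v.toString(16).toUpperCase().padStart(4, "0");
}

// Turn a CPU address into a functional label based on the range it falls in.
function labelAddress(addr) {
  if (addr < 0x2000) return `Internal RAM ${hex4(addr & 0x07FF)}`; // 2 KB, mirrored
  if (addr < 0x4000) return `PPU register ${PPU_REGISTERS[addr & 7]}`;
  if (addr < 0x4020) return "APU and I/O registers";
  if (addr < 0x8000) return "Cartridge space (mapper-dependent)";
  return "PRG-ROM";
}

console.log(labelAddress(0x2006)); // PPU register PPUADDR
```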

August 20, 2010, 08:10:44 pm - (Auto Merged - Double Posts are not allowed before 7 days.)
Quote from: DarknessSavior
For example, you have ADC as "Add value at address (address) to Accumulator". It should include something about it adding one, if the carry flag is set. Maybe you could make a separate case for when the carry flag is set, and when it isn't. But make sure those little details are included.

Thanks for pointing that out. I'll have to make an adjustment, certainly.

August 20, 2010, 08:32:14 pm - (Auto Merged - Double Posts are not allowed before 7 days.)
This JNES is really awesome. All sorts of tools that can be made from it.

August 20, 2010, 10:13:51 pm - (Auto Merged - Double Posts are not allowed before 7 days.)
UPDATE: I'm a little confused... $2000 is the beginning of the attribute table, and yet it's also the first register of the PPU?  :huh:

August 21, 2010, 12:37:52 pm - (Auto Merged - Double Posts are not allowed before 7 days.)
OK I figured it out. Thanks, Dwedit. :)

Given the PPU situation, it is evident that accurate and effective naturalization will not be straightforward. Here's my solution: use partial emulation to basically fill the registers to the extent that their content can be filled. While not reading the actual intent, basic procedures themselves can be identified by interpolating instructions. For example:

Code: [Select]
1 LDA #$01
2 STA $2006
3 LDA #$02
4 STA $2006

is assuredly a redesignation of the VRAM I/O address. My technique involves using a tracer to keep a running tab of the values of the registers at each point in the execution. Of course such a trace would be imperfect because user-inputted values would be untraceable. Not really an issue though because it's plainly apparent when a value is untraceable.

Checking the STA on line 4, we see that $2006 has already had one byte written. That means this second byte completes the VRAM I/O redesignation. We might as well not verbalize any initial write to $2006, just making note of the initial write when the second comes along.

Code: [Select]
VRAM Address = $0102
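
A minimal sketch of the running-tab idea for the $2006 case, under some loud assumptions: instructions arrive already decoded as plain objects, only immediate LDA and absolute STA are modeled, and reads of $2002 (which reset the write latch on real hardware) are ignored. The name `traceVramAddress` is made up for illustration:

```javascript
// Track the accumulator and pair up writes to $2006 (high byte first, then low).
// program: array of { op: "LDA", value } or { op: "STA", addr }.
function traceVramAddress(program) {
  let a = null;            // accumulator value, null = untraceable
  let pendingHigh = null;  // value of the first $2006 write
  let havePending = false; // true once the first write has been seen
  const notes = [];

  for (const ins of program) {
    if (ins.op === "LDA") {
      a = ins.value;
    } else if (ins.op === "STA" && ins.addr === 0x2006) {
      if (!havePending) {
        // First write: stay silent, just remember it.
        pendingHigh = a;
        havePending = true;
      } else {
        // Second write completes the address; report both writes at once.
        if (pendingHigh === null || a === null) {
          notes.push("VRAM Address = (untraceable)");
        } else {
          const addr = ((pendingHigh & 0x3F) << 8) | (a & 0xFF); // 14-bit address
          notes.push("VRAM Address = $" + addr.toString(16).toUpperCase().padStart(4, "0"));
        }
        havePending = false;
      }
    }
  }
  return notes;
}

console.log(traceVramAddress([
  { op: "LDA", value: 1 }, { op: "STA", addr: 0x2006 },
  { op: "LDA", value: 2 }, { op: "STA", addr: 0x2006 },
]));
```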

I envision three levels of naturalization: outlining (as illustrated in the first example), register-only tracing, and full tracing (full CPU and RAM emulation).

This seems like a broad project... I'd like to work on it with others if there is interest. In particular, it seems to be in line with NESICIDE's goals.

August 27, 2010, 03:57:36 am - (Auto Merged - Double Posts are not allowed before 7 days.)
I've got the basic functionality down. It works, but not for all addressing modes. Need help with this (see the 6502 addressing modes post in Newbie Forum).

August 27, 2010, 04:33:06 pm - (Auto Merged - Double Posts are not allowed before 7 days.)
OK I'm going to make this a release and see if that gets it more attention.
Title: Re: Code Naturalizer
Post by: Kiyoshi Aman on August 27, 2010, 10:44:40 pm
Attention. Attention.

The edit button is less noisy than the reply button.

Please consider using it instead of spamming Reply whenever you have a thought two hours later.
Title: Re: Code Naturalizer
Post by: tcaudilllg on August 28, 2010, 12:17:45 am
If I edit it instead, will this thread still be marked as containing new posts?
Title: Re: Code Naturalizer
Post by: Trax on August 28, 2010, 01:48:56 am
I agree with Tauwasser on the pertinence of the project. When you know ASM and you're used to it, reading becomes clear enough. Just like when someone doesn't know C, it will look like gibberish until it's mastered to a degree. However, some commands, especially console-specific ones, like IO ports, bank switching, PPU writing/reading, etc, could be treated in a different way and read differently, and that could be interesting, in my opinion...
Title: Re: Code Naturalizer
Post by: BRPXQZME on August 28, 2010, 02:29:30 am
Quote from: tcaudilllg
If I edit it instead, will this thread still be marked as containing new posts?

If it is the last post, yes.

Quote from: Trax
I agree with Tauwasser on the pertinence of the project. When you know ASM and you're used to it, reading becomes clear enough. Just like when someone doesn't know C, it will look like gibberish until it's mastered to a degree. However, some commands, especially console-specific ones, like IO ports, bank switching, PPU writing/reading, etc, could be treated in a different way and read differently, and that could be interesting, in my opinion...

There’s a fairly interesting article (http://www.joelonsoftware.com/articles/Wrong.html) this reminds me of. Definitely worth a read, though it’s longish and gets into fairly strange topics (such as why there actually is a point to the much-reviled “hungarian notation”, if you understand what you should be using it for).
Title: Re: Code Naturalizer
Post by: tcaudilllg on August 28, 2010, 04:12:06 am
Well I've almost got the register and memory tracing done. In the next version there will be an option to preload the Accumulator, X, and Y registers. Come to think of it, I should also implement a memory reader so that that too can be pre-initialized. Then I aim to implement spot decompilation into BASIC, so that the example I mentioned above works. That will at least make it much clearer what the machine is actually doing. I relish the thought of seeing PPU writes reduced to one line of code. >:D

It'd be totally awesome if this functionality were implemented in say... FCEUX.  ;)
Title: Re: Code Naturalizer
Post by: Ryusui on August 28, 2010, 04:33:05 am
I'd recommend setting this up so that the original ASM still gets printed, but with the "natural language" versions supplied in comments.
Title: Re: Code Naturalizer
Post by: tcaudilllg on August 29, 2010, 05:32:34 am
OK the naturalizer is open for beta.

http://progressivesocionics.co.cc/downloads/NESNaturalizer.html

Instructions:
Load the ROM and hit start. The start address will be calculated. Put the start address of the ROM in the first range field. Put an ending address in the second field. Hit Start.


For the next version, I'm thinking it will be possible to implement some self-documenting algorithms. I noticed in the Bomb Sweeper ROM that after they stored immediate PPU states in the accumulator, the author stored the states in memory BEFORE storing the accumulator to the PPU. And then, just left them there.

Of course there was a purpose for that: they intended to pull those states out of memory later instead of explicitly specifying the states every time. And that's the thing: those states are associated, under certain conditions anyway, with the PPU. So as far as the naturalizer is concerned, you might as well just mark them as such from that point on.
Title: Re: Code Naturalizer
Post by: KingMike on August 29, 2010, 10:40:16 am
Quote from: tcaudilllg
For the next version, I'm thinking it will be possible to implement some self-documenting algorithms. I noticed in the Bomb Sweeper ROM that after they stored immediate PPU states in the accumulator, the author stored the states in memory BEFORE storing the accumulator to the PPU. And then, just left them there.

I haven't looked at the code, but I believe that the NES PPU control registers are write-only. (as in, LDA $2000/$2001 should not be expected to return valid data)
So, you have to copy the values to RAM if you want to be able to read them later.
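
A sketch of how a naturalizer might spot such RAM copies automatically. It assumes decoded instructions as plain objects, treats any non-STA instruction as possibly changing the accumulator (a deliberately conservative simplification), and only looks at $2000/$2001; `findShadowCopies` is a hypothetical name:

```javascript
// Find RAM addresses that receive the same accumulator value as a following
// write to PPU control register $2000 or $2001, and mark them as shadow copies.
// program: array of { op, addr } objects, e.g. { op: "STA", addr: 0x2001 }.
function findShadowCopies(program) {
  const shadows = new Map(); // RAM address -> PPU register address
  let pendingRam = [];       // RAM stores made since the accumulator last changed

  for (const ins of program) {
    if (ins.op !== "STA") {
      // Conservative: assume anything else may have changed the accumulator.
      pendingRam = [];
      continue;
    }
    if (ins.addr < 0x0800) {
      pendingRam.push(ins.addr);
    } else if (ins.addr === 0x2000 || ins.addr === 0x2001) {
      for (const ram of pendingRam) shadows.set(ram, ins.addr);
      pendingRam = [];
    }
  }
  return shadows;
}

const found = findShadowCopies([
  { op: "LDA" }, { op: "STA", addr: 0x03 }, { op: "STA", addr: 0x2001 },
  { op: "LDA" }, { op: "STA", addr: 0x02 }, { op: "STA", addr: 0x2000 },
]);
console.log(found); // $03 shadows $2001, $02 shadows $2000
```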
Title: Re: Code Naturalizer
Post by: UglyJoe on August 29, 2010, 01:30:49 pm
Couldn't a tiny bit of devious coding break this entire method?  Like this example (it's GB asm, but applies universally):

from http://web.archive.org/web/20070807121547/http://www.bripro.com/low/obscure/index.php?page=hko_sm3s:
Quote
Additionally, they do other tricks using jump opcodes which will fool even BGB's disassembly. However, stepping through and using an interactive disassembler as reference on the side, it only slows down hacking, it could never stop it!

Here's an example:
How the disassembler sees it:
 1234: xx xx  jr 1237
 1236: xx xx  (a data byte of an opcode with operands)
 1238: <error invalid opcode>
How the real code is:
 1234: xx xx  jr 1237
 1236: xx     .db $xx
 1237: xx xx  (a valid opcode)
Title: Re: Code Naturalizer
Post by: Tauwasser on August 29, 2010, 03:23:23 pm
Quote from: UglyJoe
Couldn't a tiny bit of devious coding break this entire method?  Like this example (it's GB asm, but applies universally):

from http://web.archive.org/web/20070807121547/http://www.bripro.com/low/obscure/index.php?page=hko_sm3s:
Quote
Additionally, they do other tricks using jump opcodes which will fool even BGB's disassembly. However, stepping through and using an interactive disassembler as reference on the side, it only slows down hacking, it could never stop it!

Here's an example:
How the disassembler sees it:
 1234: xx xx  jr 1237
 1236: xx xx  (a data byte of an opcode with operands)
 1238: <error invalid opcode>
How the real code is:
 1234: xx xx  jr 1237
 1236: xx     .db $xx
 1237: xx xx  (a valid opcode)

Sorry to disappoint you, but this is neither devious coding nor an attempt to fool disassemblers. Z80gb has varying opcode widths, so it will often happen that code doesn't start aligned with the preceding code. It's a problem in the disassembly logic of BGB that makes it unable to display code starting from an offset it considers invalid or interpreted differently, even when specifying the offset by hand! As long as the code naturalizer only follows jumps and interprets pointers and doesn't try to automatically disassemble all code, you will be fine, because the actual jumps will go to 0x1237.
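
To illustrate the difference, here is a toy comparison of a linear sweep against a disassembler that only follows jumps. It uses four 6502 opcodes instead of GB ones (since the naturalizer targets the NES); the byte sequence and the function names are made up for the example:

```javascript
// Instruction lengths for the four opcodes used in this toy example.
const LENGTHS = { 0x4C: 3, 0xAD: 3, 0xA9: 2, 0x60: 1 }; // JMP abs, LDA abs, LDA #imm, RTS

// Linear sweep: decode one instruction after another from offset 0.
function linearSweep(bytes) {
  const starts = [];
  let pc = 0;
  while (pc < bytes.length) {
    starts.push(pc);
    pc += LENGTHS[bytes[pc]] || 1;
  }
  return starts;
}

// Flow following: start at an entry point and only decode what is reachable.
function followFlow(bytes, entry) {
  const seen = new Set();
  const work = [entry];
  while (work.length > 0) {
    let pc = work.pop();
    while (pc < bytes.length && !seen.has(pc)) {
      const op = bytes[pc];
      const len = LENGTHS[op];
      if (len === undefined) break;      // unknown opcode: give up on this path
      seen.add(pc);
      if (op === 0x4C) {                 // JMP: continue at the target only
        work.push(bytes[pc + 1] | (bytes[pc + 2] << 8));
        break;
      }
      if (op === 0x60) break;            // RTS: path ends here
      pc += len;
    }
  }
  return [...seen].sort((x, y) => x - y);
}

// 0: JMP $0004   3: stray data byte $AD   4: LDA #$01   6: RTS
const code = [0x4C, 0x04, 0x00, 0xAD, 0xA9, 0x01, 0x60];
console.log(linearSweep(code));   // [0, 3, 6]  (the data byte swallows the LDA)
console.log(followFlow(code, 0)); // [0, 4, 6]  (matches the real code)
```

The stray byte at offset 3 happens to look like a three-byte opcode, so the linear sweep misreads the instruction the jump actually lands on.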

cYa,

Tauwasser
Title: Re: Code Naturalizer
Post by: UglyJoe on August 29, 2010, 03:31:37 pm
Quote from: Tauwasser
Sorry to disappoint you

Why would I be disappointed?

Quote from: Tauwasser
As long as the code naturalizer only follows jumps and interprets pointers and doesn't try to automatically disassemble all code, you will be fine, because the actual jumps will go to 0x1237.

Then I suppose the better question for me to have asked is, does it do this or doesn't it?
Title: Re: Code Naturalizer
Post by: tcaudilllg on August 29, 2010, 04:15:28 pm
Not yet. But it will with your help. :)

It might be a good idea to include naturalization functions in FCEUX, don't you think?

I think the primary advantage of naturalization is that reading natural code is simply easier from a conceptual standpoint than using mnemonics. If that weren't true, high level languages wouldn't be used in the first place.

Edit: I just realized you were referring to whether it auto-disassembles. What the naturalizer does is this: it starts trying to read instructions from a user-specified point in memory to a point specified by the user. Ideally the naturalizer would be a component of a debugger. It's the responsibility of the programmer to judge whether they are reading a correct or incorrect interpretation of the code.

Here's how I suggest using the naturalizer: naturalize 1 page of code starting from the code reset vector. This is calculated for you automatically. Copy the output to a text file. Then go back and naturalize from each subroutine address. Make a map of the code. To confirm, use a debugger.
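
For reference, a sketch of how the reset vector can be read out of an iNES file. It assumes the standard 16-byte iNES header, an optional 512-byte trainer, and that the last 16 KB PRG bank is the one mapped at the top of CPU memory (true for mapper 0 and many others, but not guaranteed); `resetVector` is a hypothetical name, not the tool's own function:

```javascript
// Read the 6502 reset vector ($FFFC/$FFFD) from an iNES ROM image.
// rom: Uint8Array holding the whole .nes file.
function resetVector(rom) {
  // Header magic is "NES" followed by 0x1A.
  if (rom[0] !== 0x4E || rom[1] !== 0x45 || rom[2] !== 0x53 || rom[3] !== 0x1A) {
    throw new Error("not an iNES file");
  }
  const prgBanks = rom[4];                       // number of 16 KB PRG banks
  const trainer = (rom[6] & 0x04) ? 512 : 0;     // optional trainer before PRG data
  const prgEnd = 16 + trainer + prgBanks * 0x4000;
  // The last 6 bytes of PRG hold NMI, RESET, IRQ vectors; RESET is at end - 4.
  return rom[prgEnd - 4] | (rom[prgEnd - 3] << 8);
}

// Build a tiny fake one-bank ROM with a made-up vector to show the call.
const demoRom = new Uint8Array(16 + 0x4000);
demoRom.set([0x4E, 0x45, 0x53, 0x1A, 1], 0);
demoRom[16 + 0x4000 - 4] = 0x55;
demoRom[16 + 0x4000 - 3] = 0xC6;
console.log("$" + resetVector(demoRom).toString(16).toUpperCase());
```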

If you make a map of the code, you can of course figure out where the pitfalls are. In fact, it should be possible to design the naturalizer to automatically fix code by comparing code naturalized from subroutine start addresses to the code following the subroutines (if the naturalizer disassembles the subroutine code).


Here's my plan for the next version of the naturalizer:
The machine emulator project will be siphoned off into a separate "advanced" version.

August 30, 2010, 11:38:08 pm - (Auto Merged - Double Posts are not allowed before 7 days.)
Quote from: tcaudilllg
If I edit it instead, will this thread still be marked as containing new posts?
Quote from: BRPXQZME
If it is the last post, yes.

I did a test, and the results were negative. So until you guys make the appropriate modification, I'm going to have to use additional replies to draw attention.

August 31, 2010, 05:28:14 am - (Auto Merged - Double Posts are not allowed before 7 days.)
UPDATE: I just tried loading the entire code page for the Bomb Sweeper ROM. Firefox began to choke. I only have 512 MB of memory, so that might have had something to do with it. Does anyone else have problems?

Oh yeah I don't think I mentioned it but this only works in Firefox.
Title: Re: Code Naturalizer
Post by: Lenophis on September 01, 2010, 01:51:26 am
Quote from: tcaudilllg
I did a test, and the results were negative. So until you guys make the appropriate modification, I'm going to have to use additional replies to draw attention.

It won't show new for you, because you already read your own edit. It shows up as new for everybody else. The auto-merging this board does is just a glorified edit with an extra header.
Title: Re: Code Naturalizer
Post by: UglyJoe on September 01, 2010, 07:56:28 pm
Quote from: tcaudilllg
UPDATE: I just tried loading the entire code page for the Bomb Sweeper ROM. Firefox began to choke. I only have 512M memory so that might have had something to do with it. Does anyone else have problems?

It does work for me, but the CPU usage spikes to 100%.  Taking a closer look with Firebug shows me that the processing done by naturalizeCode is very fast and not a problem.  The CPU usage cranks up once the script dumps the naturalized code into the textarea.  The reason for this, I suspect, is that the textarea form element is not really intended to hold 1.2 megabytes of text :P

A better solution is to fill a "pre" element with the output.  That is, add this to the bottom of the page:

Code: [Select]
<pre id="fillit"></pre>
And then, at the end of naturalizeCode, do this:

Code: [Select]
document.getElementById('fillit').innerHTML = output;
instead of this:

Code: [Select]
ReadoutBox.value = output;
Title: Re: Code Naturalizer
Post by: tcaudilllg on September 01, 2010, 08:24:10 pm
Yeah I think you're right.

However, I think this complicates matters greatly, because now users have to scroll the entire page to read the code. Although there are workarounds, they aren't easily implemented.

UPDATE #1: I put the code at the bottom, like you suggested, UglyJoe.

UPDATE #2: OK I'm realizing a bunch of complications with the design. For one, the naturalizer isn't distinguishing between reads and writes when it labels addresses.

Second, I'm beginning to see the full implications of "mapper hell". Locations in ROM have precise offsets, but the program only refers to these in the context of pages. Somehow I must simulate the mapper's bank switching.
Title: Re: Code Naturalizer
Post by: UglyJoe on September 02, 2010, 08:26:44 pm
Quote from: tcaudilllg
However, I think this complicates matters greatly, because now users have to scroll the entire page to read the code. Although there are work arounds, they aren't easily implemented.

If I'm understanding you correctly, it's actually very easily implemented.  Just use CSS to make the pre element act more like a textarea.

Ditch the pre element at the bottom of the page.  Remove the textarea (ReadoutBox) altogether and, in its place, drop in this:

Code: [Select]
<pre style="width: 600px; height: 150px; overflow: auto; border: 1px solid black; float: left;" id="fillit"></pre>
That'll give you a scrolling pre element.  Tweak the height/width/border to your liking.
Title: Re: Code Naturalizer
Post by: tcaudilllg on September 03, 2010, 04:52:53 am
There's a complication, though. See I already put it at the bottom, and used the top space to make room for the documentation function. The user can load and save label/address correspondences by typing them in a textbox.

Ah I think I've got an idea. I'll use the display property to hide the boxes when not in use, and use a switch up top to switch between them.

UPDATE:
I've made a decision about the line-based user doc system. It won't be implemented... best only to document addresses that have a specific, stable purpose, the same being the responsibility of the user to determine. I may implement the line-based doc system as an extension of the non-line based system at some point in the future. There will obviously be a need for mappers, the implementation of which is my focus for the next version.

I aim to make a list of mappers to be tested and read when the program loads. This file will be in the same directory as the program. If not present, it won't be used.

UglyJoe, more people should know about the power of CSS scrollbars.
Title: Re: Code Naturalizer
Post by: tcaudilllg on October 07, 2010, 02:51:40 am
Adding basic mapper support. The naturalizer can now determine which ROM address an indirect or absolute address is referring to. (only if the address is > 32,767 (0x7FFF)) Coming feature: naturalize by bank.
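
A sketch of the address-to-ROM translation, not the tool's actual code. It assumes 16 KB PRG banks, mirroring for single-bank ROMs, and otherwise a UxROM-style layout where $8000-$BFFF is switchable and $C000-$FFFF is fixed to the last bank; other mappers differ. `cpuToPrgOffset` and its parameters are hypothetical names:

```javascript
// Translate a CPU address into an offset within the PRG-ROM data.
// Add 16 (the iNES header size, plus 512 if a trainer is present) for a file offset.
// prgBanks: number of 16 KB banks; switchableBank: bank currently mapped at $8000.
function cpuToPrgOffset(addr, prgBanks, switchableBank = 0) {
  if (addr < 0x8000) return null;            // not ROM: nothing to translate
  if (prgBanks === 1) return addr & 0x3FFF;  // single bank mirrored at $8000 and $C000
  // Assumed layout: low half switchable, high half fixed to the last bank.
  const bank = addr < 0xC000 ? switchableBank : prgBanks - 1;
  return bank * 0x4000 + (addr & 0x3FFF);
}

console.log(cpuToPrgOffset(0xC655, 1).toString(16)); // 655
```

With a single 16 KB bank, CPU address $C655 lands at PRG offset $655, which lines up with the offsets shown in the listings below.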

(The bank feature was inspired by Silenthal's GBRead).

User doc support is not perfect because it doesn't distinguish between writes and reads yet. (although I don't see how that could be a problem)

UPDATE:
I experimented with decompilation. Only two commands were involved, LDA and STA. LDA is omitted from the readout, and STA is replaced with an assignment operator.

The non-decompiled naturalization of BombSweeper's first 100 program bytes.
Code: [Select]
1621|$655: Disable decimal mode
1622|$656: Load value at 0 into X-index
1624|$658: Move X-index to address 8192 / 2000 [Picture Processor Control 1]
1627|$65B: Move X-index to address 8193 / 2001 [Picture Processor Control 2]
1630|$65E: X-index - 1
1631|$65F: Set stack pointer address to Index-X
1632|$660: Load value at 8194 / 2002 [Picture Processor Status] into Accumulator
1635|$663: Jump to instruction at address 1384 if last operation returned positive
1637|$665: Load value at 8194 / 2002 [Picture Processor Status] into Accumulator
1640|$668: Jump to instruction at address 1389 if last operation returned positive
1642|$66A: Load value at 192 into Accumulator
1644|$66C: Move Accumulator to address 16407 / 4017 [Control Pad 2 / Expansion Slot]
1647|$66F: Call subroutine at address 51813 / ca65 [ROM address 51813, bank 0] (program counter address saved to stack)
1650|$672: Call subroutine at address 49472 / c140 [ROM address 49472, bank 0] (program counter address saved to stack)
1653|$675: Load value at 6 into Accumulator
1655|$677: Move Accumulator to address 3
1657|$679: Move Accumulator to address 8193 / 2001 [Picture Processor Control 2]
1660|$67C: Load value at 136 into Accumulator
1662|$67E: Move Accumulator to address 2
1664|$680: Move Accumulator to address 8192 / 2000 [Picture Processor Control 1]
1667|$683: Load value at 0 into Accumulator
1669|$685: Move Accumulator to address 5
1671|$687: Jump to instruction at address 50832 / c690 [ROM address 50832, bank 0]
1674|$68A: Call subroutine at address 51547 / c95b [ROM address 51547, bank 0] (program counter address saved to stack)
1677|$68D: Call subroutine at address 51868 / ca9c [ROM address 51868, bank 0] (program counter address saved to stack)
1680|$690: Call subroutine at address 51530 / c94a [ROM address 51530, bank 0] (program counter address saved to stack)
1683|$693: Jump to instruction at address 50826 / c68a [ROM address 50826, bank 0]
1686|$696: Move X-index to address 16
1688|$698: Move Y-index to address 17
1690|$69A: Shift the value in Accumulator left one bit
1691|$69B: Move Accumulator to X-index
1692|$69C: Load value at 56204 / db8c [ROM address 56204, bank 0] into Accumulator
1695|$69F: Move Accumulator to address 248
1697|$6A1: Load value at 56205 / db8d [ROM address 56205, bank 0] into Accumulator
1700|$6A4: Move Accumulator to address 249
1702|$6A6: Load value at 0 into Y-index
1704|$6A8: Load value at 248 into Accumulator
1706|$6AA: Y-index + 1
1707|$6AB: Move Accumulator to address 20
1709|$6AD: Load value at 19 into X-index
1711|$6AF: Load value at 248 into Accumulator
1713|$6B1: Y-index + 1
1714|$6B2: Set carry bit to 0
1715|$6B3: Add value at address 17 to Accumulator (+ 1 if carry set)
1717|$6B5: Move Accumulator to address 512
1720|$6B8: Load value at 248 into Accumulator

The decompiled version:
Code: [Select]
1621|$655: Disable decimal mode
1622|$656: Load value at 0 into X-index
1624|$658: Move X-index to address 8192 / 2000 [Picture Processor Control 1]
1627|$65B: Move X-index to address 8193 / 2001 [Picture Processor Control 2]
1630|$65E: X-index - 1
1631|$65F: Set stack pointer address to Index-X
1635|$663: Jump to instruction at address 1384 if last operation returned positive
1640|$668: Jump to instruction at address 1389 if last operation returned positive
1644|$66C: 16407 / 4017 [Control Pad 2 / Expansion Slot] = 192
1647|$66F: Call subroutine at address 51813 / ca65 [ROM address 51813, bank 0] (program counter address saved to stack)
1650|$672: Call subroutine at address 49472 / c140 [ROM address 49472, bank 0] (program counter address saved to stack)
1655|$677: 3 = 6
1657|$679: 8193 / 2001 [Picture Processor Control 2] = 6
1662|$67E: 2 = 136
1664|$680: 8192 / 2000 [Picture Processor Control 1] = 136
1669|$685: 5 = 0
1671|$687: Jump to instruction at address 50832 / c690 [ROM address 50832, bank 0]
1674|$68A: Call subroutine at address 51547 / c95b [ROM address 51547, bank 0] (program counter address saved to stack)
1677|$68D: Call subroutine at address 51868 / ca9c [ROM address 51868, bank 0] (program counter address saved to stack)
1680|$690: Call subroutine at address 51530 / c94a [ROM address 51530, bank 0] (program counter address saved to stack)
1683|$693: Jump to instruction at address 50826 / c68a [ROM address 50826, bank 0]
1686|$696: Move X-index to address 16
1688|$698: Move Y-index to address 17
1690|$69A: Shift the value in Accumulator left one bit
1691|$69B: Move Accumulator to X-index
1695|$69F: 248 = 56204 / db8c [ROM address 56204, bank 0]
1700|$6A4: 249 = 56205 / db8d [ROM address 56205, bank 0]
1702|$6A6: Load value at 0 into Y-index
1706|$6AA: Y-index + 1
1707|$6AB: 20 = 248
1709|$6AD: Load value at 19 into X-index
1713|$6B1: Y-index + 1
1714|$6B2: Set carry bit to 0
1715|$6B3: Add value at address 17 to Accumulator (+ 1 if carry set)
1717|$6B5: 512 = 248

Clearly there are some bugs to work out. The decompiler doesn't take into account the accumulator's addressing mode, leading to some confusion over what values are actually being used. It would probably be better to include the LDAs so as to prevent confusion when branching instructions read the accumulator.

I want to emphasize that the point of the decompiler is not to port the code, but to make it more readable and less intimidating.
Title: Re: Code Naturalizer
Post by: tcaudilllg on October 14, 2010, 09:32:29 am
I completed the decompiler (mostly, no memory model yet) and distinguished between addresses and immediate operands. As such, this tool may be ready for regular use (I would like your opinion on whether that is true or not). The mode distinguishing function also enabled me to clean up the naturalizer's output a little.

Here's the naturalized output for Bombsweeper:
Code: [Select]
1621|$655: Disable decimal mode
1622|$656: Load 0 into X-index
1624|$658: Move X-index to @8192 / 2000 [Picture Processor Control 1]
1627|$65B: Move X-index to @8193 / 2001 [Picture Processor Control 2]
1630|$65E: X-index - 1
1631|$65F: Set stack pointer address to Index-X
1632|$660: Load @8194 / 2002 [Picture Processor Status] into Accumulator
1635|$663: Jump to instruction at @1384 if last operation returned positive
1637|$665: Load @8194 / 2002 [Picture Processor Status] into Accumulator
1640|$668: Jump to instruction at @1389 if last operation returned positive
1642|$66A: Load 192 into Accumulator
1644|$66C: Move Accumulator to @16407 / 4017 [Control Pad 2 / Expansion Slot]
1647|$66F: Call subroutine at @51813 / ca65 [ROM address 51813, bank 0] (program counter address saved to stack)
1650|$672: Call subroutine at @49472 / c140 [ROM address 49472, bank 0] (program counter address saved to stack)
1653|$675: Load 6 into Accumulator
1655|$677: Move Accumulator to @3
1657|$679: Move Accumulator to @8193 / 2001 [Picture Processor Control 2]
1660|$67C: Load 136 into Accumulator
1662|$67E: Move Accumulator to @2
1664|$680: Move Accumulator to @8192 / 2000 [Picture Processor Control 1]
1667|$683: Load 0 into Accumulator
1669|$685: Move Accumulator to @5
1671|$687: Jump to instruction at @50832 / c690 [ROM address 50832, bank 0]
1674|$68A: Call subroutine at @51547 / c95b [ROM address 51547, bank 0] (program counter address saved to stack)
1677|$68D: Call subroutine at @51868 / ca9c [ROM address 51868, bank 0] (program counter address saved to stack)
1680|$690: Call subroutine at @51530 / c94a [ROM address 51530, bank 0] (program counter address saved to stack)
1683|$693: Jump to instruction at @50826 / c68a [ROM address 50826, bank 0]
1686|$696: Move X-index to @16
1688|$698: Move Y-index to @17
1690|$69A: Shift Accumulator left one bit
1691|$69B: Move Accumulator to X-index
1692|$69C: Load @56204 / db8c [ROM address 56204, bank 0] into Accumulator
1695|$69F: Move Accumulator to @248
1697|$6A1: Load @56205 / db8d [ROM address 56205, bank 0] into Accumulator
1700|$6A4: Move Accumulator to @249
1702|$6A6: Load 0 into Y-index
1704|$6A8: Load @248 into Accumulator
1706|$6AA: Y-index + 1
1707|$6AB: Move Accumulator to @20
1709|$6AD: Load @19 into X-index
1711|$6AF: Load @248 into Accumulator
1713|$6B1: Y-index + 1
1714|$6B2: Set carry bit to 0
1715|$6B3: Add @17 to Accumulator (+ 1 if carry set)
1717|$6B5: Move Accumulator to @512
1720|$6B8: Load @248 into Accumulator

The decompiled version:
Code: [Select]
1621|$655: Decimals OFF
1624|$658: @8192 / 2000 [Picture Processor Control 1] = X (0)
1627|$65B: @8193 / 2001 [Picture Processor Control 2] = X (0)
1630|$65E: X (0) - 1
1631|$65F: Set stack pointer address to Index-X
1635|$663: if Accumulator (@8194 / 2002 [Picture Processor Status]) >= 0 then goto @1384
1640|$668: if Accumulator (@8194 / 2002 [Picture Processor Status]) >= 0 then goto @1389
1644|$66C: @16407 / 4017 [Control Pad 2 / Expansion Slot] = Accumulator (192)
1647|$66F: Call @51813 / ca65 [ROM address 51813, bank 0]
1650|$672: Call @49472 / c140 [ROM address 49472, bank 0]
1655|$677: @3 = Accumulator (6)
1657|$679: @8193 / 2001 [Picture Processor Control 2] = Accumulator (6)
1662|$67E: @2 = Accumulator (136)
1664|$680: @8192 / 2000 [Picture Processor Control 1] = Accumulator (136)
1669|$685: @5 = Accumulator (0)
1671|$687: Goto @50832 / c690 [ROM address 50832, bank 0]
1674|$68A: Call @51547 / c95b [ROM address 51547, bank 0]
1677|$68D: Call @51868 / ca9c [ROM address 51868, bank 0]
1680|$690: Call @51530 / c94a [ROM address 51530, bank 0]
1683|$693: Goto @50826 / c68a [ROM address 50826, bank 0]
1686|$696: @16 = X (0)
1688|$698: @17 = Y (0)
1690|$69A: Shift Accumulator left one bit
1691|$69B: X = Accumulator (0)
1695|$69F: @248 = Accumulator (@56204 / db8c [ROM address 56204, bank 0])
1700|$6A4: @249 = Accumulator (@56205 / db8d [ROM address 56205, bank 0])
1706|$6AA: Y (0) + 1
1707|$6AB: @20 = Accumulator (@248)
1713|$6B1: Y (0) + 1
1714|$6B2: RESET Carry
1715|$6B3: Add @17 to Accumulator (+ 1 if carry set)
1717|$6B5: @512 = Accumulator (@248)

I removed output for all the loading instructions. Instead, each register's current value is shown in parentheses after the register name whenever an operation uses it.
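
For anyone curious how that folding might look, here is a minimal sketch in Python. It is not the naturalizer's actual code: the function name, the (mnemonic, operand) tuple format and the assumed starting register contents are all made up for illustration, and it only handles straight-line loads and stores (no transfers such as TAX, no branches).

```python
# Hypothetical sketch: drop load instructions from the listing and show the
# last loaded value in parentheses whenever the register is stored.

REG_NAMES = {"A": "Accumulator", "X": "X", "Y": "Y"}
LOADS = {"LDA": "A", "LDX": "X", "LDY": "Y"}
STORES = {"STA": "A", "STX": "X", "STY": "Y"}


def fold_loads(instructions):
    """instructions: iterable of (mnemonic, operand) pairs, where operand is
    a string such as "#6" (immediate) or "@8193" (memory address)."""
    # Assumption: registers are treated as holding 0 before the first load.
    known = {"A": "0", "X": "0", "Y": "0"}
    lines = []
    for mnemonic, operand in instructions:
        if mnemonic in LOADS:
            # Remember what was loaded instead of printing a line for it.
            known[LOADS[mnemonic]] = operand.lstrip("#")
        elif mnemonic in STORES:
            reg = STORES[mnemonic]
            lines.append(f"{operand} = {REG_NAMES[reg]} ({known[reg]})")
        else:
            # Anything else is passed through unchanged in this sketch.
            lines.append(f"{mnemonic} {operand}".strip())
    return lines


for line in fold_loads([("LDA", "#6"), ("STA", "@3"), ("STA", "@8193")]):
    print(line)
```

Feeding it LDA #6 followed by two stores prints two lines in the same "@3 = Accumulator (6)" shape as the listing above, with no line for the load itself.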
Title: Re: Code Naturalizer
Post by: tcaudilllg on October 24, 2010, 09:19:55 pm
So what are the opinions on this project?
Title: Re: Code Naturalizer
Post by: Valendian on October 24, 2010, 10:10:50 pm
So is this gonna be a tool that generates the type of comments an amateur asm coder would write, or will it actually be more of a decompiler?

If it's gonna be a decompiler I'd suggest you forget about the natural language interface and just focus on decompilation. The ideal user of such a tool would be someone well versed in coding asm and some form of HLL. Cater to this person's needs; let the person who's only learning asm pick things up in their own good time.

The difference is between a tool that looks at a push opcode and says "a value is being pushed on top of the stack" and a tool that analyzes further and finds that this is a local variable that is kept on the stack and it appears to be an unsigned int, or a pointer to a structure which is made up of 5 ints and 2 chars. It's this kind of feedback that is needed IMO.

http://www.backerstreet.com/decompiler/introduction.htm
Title: Re: Code Naturalizer
Post by: tcaudilllg on October 24, 2010, 10:48:16 pm
Well, the naturalizer is pretty well finished. It shares most of its routines with the decompiler, so improvements to those shared routines carry over to the naturalizer as well.

I hadn't thought about figuring the types of the datums.

Honestly, seeing code output by the naturalizer makes the program more understandable. I'd rather see the naturalized code than a simple disassembly, but I'd prefer the decompiled code to either. Seeing the naturalized version did assist in the development of the decompiler, however.

Right now I'm trying to figure a routine that distinguishes between writes and reads. (memory writes are, of course, the means by which mapper functions are invoked).


UPDATE: I finished the mapper routine by trashing most of the output from the descriptor attachment function when the instruction is a memory write (STA/STX/STY). Wasteful, true, but irrelevant. I'm also removing the NES-specific bits (register labels) from the main routine, so that the program can support configurations for other 6502 systems. This marks a move to a general 6502 naturalizer/decompiler.
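
As a rough illustration of the read/write distinction, the check can be driven by the mnemonic alone. This is a sketch in Python, not the routine described above: the function names and the mnemonic tables are hypothetical, and the mapper test assumes mapper registers sit in the $8000-$FFFF cartridge range, which is common on the NES but not universal.

```python
# Hypothetical classification of 6502 mnemonics by how they touch memory.
WRITE_ONLY = {"STA", "STX", "STY"}
# These read a memory operand, change it, and write it back.
READ_MODIFY_WRITE = {"INC", "DEC", "ASL", "LSR", "ROL", "ROR"}
READ_ONLY = {"LDA", "LDX", "LDY", "ADC", "SBC", "AND", "ORA", "EOR",
             "CMP", "CPX", "CPY", "BIT"}


def memory_access(mnemonic, address):
    """Return "write", "read-write", "read" or "none".

    address is the operand's memory address as an int, or None when the
    instruction has no memory operand (implied, immediate, accumulator)."""
    if address is None:
        return "none"
    m = mnemonic.upper()
    if m in WRITE_ONLY:
        return "write"
    if m in READ_MODIFY_WRITE:
        return "read-write"
    if m in READ_ONLY:
        return "read"
    return "none"  # jumps, branches and the like


def is_mapper_write(mnemonic, address):
    """Assumption: any write landing in $8000-$FFFF is a mapper access."""
    access = memory_access(mnemonic, address)
    return access in ("write", "read-write") and 0x8000 <= address <= 0xFFFF


print(memory_access("STA", 0x8000), is_mapper_write("STA", 0x8000))
print(memory_access("LDA", 0xDB8C), is_mapper_write("LDA", 0xDB8C))
```

Note that the read-modify-write group (INC, DEC and the shifts/rotates on a memory operand) also counts as a write, so a check limited to STA/STX/STY would miss those.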

I've not updated the program since the last (stable) update, so the version accessible from my site is the same as before.

The new version (retitled "6502 Naturalizing Decompiler") uses a script-based plug-in format. Once I get the NES config completed, I'll move on to a PC-Engine configuration.
Title: Re: Code Naturalizer
Post by: PolishedTurd on October 27, 2010, 06:22:13 pm
I just wanted to chime in and say I think the naturalizer is a very good idea. I've worked with assembly somewhat before, but I still find it helpful to have verbose clarification about what is happening, especially as things go into the wee hours and the likelihood of dumb errors on my part increases. If I'm stuck somewhere, it's just one more tool to help me get through. Can't turn that down. An offline version would be great, too.
Title: Re: Code Naturalizer
Post by: tcaudilllg on October 31, 2010, 03:27:23 am
Thank you PT. I'll have the next version of the naturalizer, which includes full RAM and mapper simulation, available soon.

I agree that the naturalizer is a good idea, and I think it would be helpful to see similar tools for other CPUs and systems. My chief reason for doing an NES naturalizer is to illustrate the method by which such a program would be designed. Maybe no one will use the naturalizer... but if its design is implemented in a disassembler then all the better.

For the next version, I'm going to try something new. I was thinking about the advantages of RAM simulation and realized that, for purposes of reverse engineering, being able to run the simulation is not necessarily as important as making the overall calculation carried out by the operations as transparent as possible.

Consider:

LDA 4
ADC ACC, $3 (234)
SBC ACC, $10F5 (3)

which, computed, yields:

4 + 234 = 238 (assume carry zero)
238 - 3 = 235

Now let's imagine instead that we took the documentation approach.

Accumulator = 4
Accumulator = Accumulator [4] + $3 (234)
Accumulator = Accumulator [4 + $3 (234)] - $10F5 (3)

[ Accumulator = 4 + 234 ($3) - 3 ($10F5) ]

And of course the two approaches could be mixed to offer both excellent simulation and documentation.

The most obvious application is the study of compression routines; however, math formulas could be exposed as well and even identified.
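
A minimal Python sketch of the documentation approach from the example above: it carries a symbolic expression for the accumulator alongside the running value. The function name and the (mnemonic, label, value) tuple format are hypothetical, the sequence is assumed to start with LDA, and the arithmetic is simplified the same way as the example (no carry or borrow, no 8-bit wraparound), whereas a real 6502 also folds the carry flag into ADC and SBC.

```python
# Hypothetical sketch: build a readable formula for the accumulator while
# also computing its value, using simplified arithmetic (carry ignored).

def document(ops):
    """ops: list of (mnemonic, label, value) tuples.

    label is the memory operand as text (e.g. "$3"), or None for an
    immediate; value is the number read. Assumes the first op is LDA."""
    expr = None    # symbolic form of the accumulator so far
    total = None   # simulated value of the accumulator so far
    lines = []
    for mnemonic, label, value in ops:
        term = str(value) if label is None else f"{label} ({value})"
        if mnemonic == "LDA":
            expr, total = term, value
        elif mnemonic == "ADC":
            expr, total = f"{expr} + {term}", total + value
        elif mnemonic == "SBC":
            expr, total = f"{expr} - {term}", total - value
        else:
            raise ValueError(f"unsupported mnemonic: {mnemonic}")
        lines.append(f"Accumulator = {expr}")
    return lines, total


lines, total = document([("LDA", None, 4),
                         ("ADC", "$3", 234),
                         ("SBC", "$10F5", 3)])
for line in lines:
    print(line)
print(total)
```

The last line of the formula reads "Accumulator = 4 + $3 (234) - $10F5 (3)", and the simulated value comes out as 235, so both the documentation and the simulation fall out of one pass.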
Title: Re: Code Naturalizer
Post by: tcaudilllg on February 03, 2011, 06:43:13 pm
I've not worked on this for a while. I recently underwent a lifestyle change (working part-time + school) and with that happening my enthusiasm for this project has diminished. I'd like to work with someone to bring out the next version, but on the matter of the technical issues of the design I am nonplussed.
Title: Re: Code Naturalizer
Post by: Vanya on February 03, 2011, 11:09:32 pm
I hope things work out for you. I would definitely like to have this tool for my remake projects.
It would be very useful to be able to study NES code without having to spend hours translating ASM.
Title: Re: Code Naturalizer
Post by: tcaudilllg on February 09, 2011, 04:25:34 am
Quote
Right now I'm trying to figure a routine that distinguishes between writes and reads. (memory writes are, of course, the means by which mapper functions are invoked).

This is where I had trouble.
Title: Re: Code Naturalizer
Post by: tcaudilllg on March 29, 2011, 11:14:21 am
I submitted a zip of the naturalizer.