
Author Topic: Code Naturalizer  (Read 25561 times)

tcaudilllg

Code Naturalizer
« on: August 20, 2010, 01:21:20 pm »
I mean to write a code naturalizer, a kind of disassembler which uses natural language (to an extent, anyway). I think this will greatly improve the readability of code and make hacking easier.

The program will function by reading all memory within a range specified by the user. It then attempts to convert the memory into opcodes.

Features I want to include:
  • hex and base 10 address readout
  • supplementing reserved addresses with their functional names

Example output:
Code: [Select]
"MOV AX, &FFFF" -> "Move data at 0xFFFF/65535 into Accumulator"

I suspect it will not be difficult to write, although I may need some help with the details. The NES is my first target.
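
As a rough illustration of the idea for the NES target, the sketch below turns one already-decoded 6502 instruction into a sentence with both hex and base 10 address readouts. The `naturalize` function, the `TEMPLATES` table, and the shape of the instruction object are hypothetical names invented for this sketch; they are not taken from any existing tool.

```javascript
// Format an address with both readouts, e.g. "$FFFF/65535".
function formatAddress(addr) {
  const hex = addr.toString(16).toUpperCase().padStart(4, "0");
  return "$" + hex + "/" + addr;
}

// Describe where a loaded value comes from.
function describeOperand(instr) {
  return instr.mode === "immediate"
    ? "the constant " + instr.operand
    : "the value at " + formatAddress(instr.operand);
}

// Hypothetical sentence templates for a few 6502 mnemonics.
const TEMPLATES = {
  LDA: (instr) => "Load Accumulator with " + describeOperand(instr),
  LDX: (instr) => "Load X register with " + describeOperand(instr),
  STA: (instr) => "Store Accumulator to " + formatAddress(instr.operand),
};

// instr is assumed to look like { mnemonic, mode, operand }.
function naturalize(instr) {
  const template = TEMPLATES[instr.mnemonic];
  return template ? template(instr) : "Unknown instruction " + instr.mnemonic;
}
```

Calling `naturalize({ mnemonic: "LDA", mode: "absolute", operand: 0xFFFF })` yields "Load Accumulator with the value at $FFFF/65535".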

Tauwasser

Re: Code Naturalizer
« Reply #1 on: August 20, 2010, 01:30:59 pm »
Bad example, as it doesn't translate back into ASM.

cYa,

Tauwasser

tcaudilllg

Re: Code Naturalizer
« Reply #2 on: August 20, 2010, 02:20:59 pm »
Quote from: Tauwasser
Bad example, as it doesn't translate back into ASM.

Why does it have to?

What I aim to do with this is make it easier for hackers to figure out what's going on in the first place.

I'm reviewing the Javascript port of vNES. I'd like to excerpt the CPU core for use in my project. However, I'm not sure if its design will allow me that option.
« Last Edit: August 20, 2010, 02:34:34 pm by tcaudilllg »

Tauwasser

Re: Code Naturalizer
« Reply #3 on: August 20, 2010, 04:46:59 pm »
I thought your approach was to make hacking easier. I understood your initial post as if it would be going from ASM to naturalized code and then back in order for the hacker to tinker with the naturalized code.

IMO just naturalizing isn't worth programming. Opcodes are there for a reason: they are concise and exact. Anybody with half a brain will be able to read them just as easily as your naturalized code, if not more easily, simply because a defined and concise syntax is worth its money. Even three lines of your naturalized code will already be incomprehensible and unreadable.

So basically, you want to achieve a way to output ASM code in a naturalized way to help people understand it. It's not worth the effort and nobody will be able to properly learn from it anyway. Abstraction is the way to go. Understanding what a piece of code does is not the sum of its parts. Just look at the old threads Rai had going on ― sometimes as many as three for the same piece of code/topic ― and he understood all opcodes, yet couldn't find the abstract routine they implemented or what they accomplished.

cYa,

Tauwasser

tcaudilllg

Re: Code Naturalizer
« Reply #4 on: August 20, 2010, 06:42:32 pm »
Here's what I've got so far:
(this is a partial rip from JNES)

Link

August 20, 2010, 06:43:05 pm - (Auto Merged - Double Posts are not allowed before 7 days.)
Link

I disagree that abstraction is the way to go, for the simple reason that the opcodes are kinda abstract and difficult to deal with. Next I mean to put in the memory labeling scheme, followed by an option to view either memory addresses or their functional labels. Long term, divining intent (within reason) is an option as well.

--Moderator edit--
Replaced extensively long code tags with pasty code links.
« Last Edit: August 27, 2010, 10:38:56 pm by Lenophis »

DarknessSavior

Re: Code Naturalizer
« Reply #5 on: August 20, 2010, 07:23:57 pm »
You need to make sure that if you do this, your opcode descriptions for output are exactly what they are supposed to be. For example, you have ADC as "Add value at address (address) to Accumulator". It should include something about it adding one, if the carry flag is set. Maybe you could make a separate case for when the carry flag is set, and when it isn't. But make sure those little details are included.
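
To make the carry point concrete, here is a minimal sketch of what ADC computes, alongside a description string that mentions the carry. It assumes binary arithmetic only (the NES CPU has no decimal mode); the function names `adc` and `describeAdc` are made up for illustration.

```javascript
// 6502 ADC in binary mode: A = A + M + C, with C updated from the overflow.
function adc(accumulator, value, carryIn) {
  const sum = accumulator + value + (carryIn ? 1 : 0);
  return {
    accumulator: sum & 0xff, // result wraps to 8 bits
    carryOut: sum > 0xff,    // carry is set when the sum does not fit in a byte
  };
}

// Hypothetical description text that states the carry behaviour explicitly.
function describeAdc(address) {
  const hex = address.toString(16).toUpperCase().padStart(4, "0");
  return "Add value at $" + hex + " to Accumulator, plus 1 if the carry flag is set";
}
```

With a tracer that knows the carry flag at that point, the description could be narrowed to one of the two cases; without one, the conditional wording is the safe default.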

~DS

tcaudilllg

Re: Code Naturalizer
« Reply #6 on: August 20, 2010, 08:08:35 pm »
About the address load functions... I think what's needed is to rip out the actual address reading functions and to replace them instead with the address itself.

Here's the original:
Code: [Select]

    load: function(addr){
        if (addr < 0x2000) {
            return mem[addr & 0x7FF];
        }
        else {
            return nes_mmap_load(addr);
        }
    },
   
    load16bit: function(addr){
        if (addr < 0x1FFF) {
            return mem[addr&0x7FF]
                | (mem[(addr+1)&0x7FF]<<8);
        }
        else {
            return nes_mmap_load(addr) | (nes_mmap_load(addr+1) << 8);
        }
    },


And here's the rip:
Code: [Select]

    load: function(addr){
        if (addr < 0x2000) {
            return addr & 0x7FF;
        }
        else {
            return addr;
        }
    },
   
    load16bit: function(addr){
        if (addr < 0x1FFF) {
            return addr & 0x7FF;
        }
        else {
            return addr;
        }
    },

I think that's right.

The key to assessing the meaning of the address is implicit in that "addr < 0x2000" check: if the address falls in a certain range, then it represents a specific function of the system. For example if it's in the first page, it's CHR-ROM. After that, it's PRG-ROM. It can be a source of confusion to recall the meaning of specific address ranges when trying to figure out what a program is doing. It's another language. I can see how people who like seeing things in terms of different symbolic systems would like dealing with the numbers in and of themselves. But that doesn't make it easier to figure out what's going on, unless you are an adept in that kind of mind transplant. (I'm not putting people who can do that down -- it's an important skill -- but correlating address ranges with functions is not the best use of that talent).
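
A range-to-label lookup of the kind described could be sketched as follows. Note that the ranges used here follow the commonly documented NES CPU memory map, in which addresses below $2000 are internal RAM (mirrored every $800 bytes, which is what the `& 0x7FF` in the code above does) rather than ROM. The function names and label strings are assumptions made for this sketch.

```javascript
// CPU-side address ranges, per the commonly documented NES memory map.
const CPU_REGIONS = [
  { start: 0x0000, end: 0x07ff, label: "internal RAM" },
  { start: 0x0800, end: 0x1fff, label: "internal RAM (mirror)" },
  { start: 0x2000, end: 0x2007, label: "PPU registers" },
  { start: 0x2008, end: 0x3fff, label: "PPU registers (mirror)" },
  { start: 0x4000, end: 0x4017, label: "APU and I/O registers" },
  { start: 0x4018, end: 0x401f, label: "APU and I/O test registers" },
  { start: 0x4020, end: 0xffff, label: "cartridge space" },
];

// Return the functional label for a CPU address.
function labelAddress(addr) {
  const region = CPU_REGIONS.find((r) => addr >= r.start && addr <= r.end);
  return region ? region.label : "out of range";
}

// Conventional names of the eight PPU registers at $2000-$2007.
const PPU_REGISTER_NAMES = [
  "PPUCTRL", "PPUMASK", "PPUSTATUS", "OAMADDR",
  "OAMDATA", "PPUSCROLL", "PPUADDR", "PPUDATA",
];

// The eight registers repeat every 8 bytes up to $3FFF.
function ppuRegisterName(addr) {
  if (addr < 0x2000 || addr > 0x3fff) return null;
  return PPU_REGISTER_NAMES[addr & 7];
}
```

For example, `labelAddress(0x2006)` gives "PPU registers" and `ppuRegisterName(0x2006)` gives "PPUADDR".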

August 20, 2010, 08:10:44 pm - (Auto Merged - Double Posts are not allowed before 7 days.)
Quote from: DarknessSavior
For example, you have ADC as "Add value at address (address) to Accumulator". It should include something about it adding one, if the carry flag is set. Maybe you could make a separate case for when the carry flag is set, and when it isn't. But make sure those little details are included.

Thanks for pointing that out. I'll have to make an adjustment, certainly.

August 20, 2010, 08:32:14 pm - (Auto Merged - Double Posts are not allowed before 7 days.)
This JNES is really awesome. All sorts of tools that can be made from it.

August 20, 2010, 10:13:51 pm - (Auto Merged - Double Posts are not allowed before 7 days.)
UPDATE: I'm a little confused... $2000 is the beginning of the attribute table, and yet it's also the first register of the PPU?  :huh:

August 21, 2010, 12:37:52 pm - (Auto Merged - Double Posts are not allowed before 7 days.)
OK I figured it out. Thanks, Dwedit. :)

Given the PPU situation, it is evident that accurate and effective naturalization will not be straightforward. Here's my solution: use partial emulation to basically fill the registers to the extent that their content can be filled. While not reading the actual intent, basic procedures themselves can be identified by interpolating instructions. For example:

Code: [Select]
1 LDA #$01
2 STA $2006
3 LDA #$02
4 STA $2006

is assuredly a redesignation of the VRAM I/O address. My technique involves using a tracer to keep a running tab of the values of the registers at each point in the execution. Of course such a trace would be imperfect because user-inputted values would be untraceable. Not really an issue though because it's plainly apparent when a value is untraceable.

Checking the STA on line 4, we see that $2006 already has one byte written. That means this second byte completes the VRAM I/O redesignation. We might as well not verbalize the initial write to $2006, just making note of it when the second comes along.

Code: [Select]
VRAM Address = $0102

I envision three levels of naturalization: outlining (as illustrated in the first example), register-only tracing, and full tracing (full CPU and RAM emulation).
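
The register-only tracing level can be sketched for the $2006 example above. This is a minimal illustration under stated assumptions: instructions arrive already decoded as `{ op, value }` or `{ op, addr }` objects (a made-up format), only LDA immediate and STA absolute are modelled, and the hardware detail that reading $2002 resets the two-write sequence is ignored.

```javascript
// Format a 16-bit value as "$0102".
function hex4(n) {
  return "$" + n.toString(16).toUpperCase().padStart(4, "0");
}

// Walk the instructions, keeping a running value for the accumulator and
// pairing up writes to $2006 (high byte first, then low byte).
function traceVramAddress(program) {
  let a = null;            // accumulator; null means "untraceable"
  let highByte = null;     // value of the first $2006 write
  let secondWrite = false; // true once the first write has been seen
  const notes = [];

  for (const instr of program) {
    if (instr.op === "LDA") {
      a = instr.value;
    } else if (instr.op === "STA" && instr.addr === 0x2006) {
      if (!secondWrite) {
        highByte = a;       // first write is remembered, not verbalized
        secondWrite = true;
      } else {
        notes.push(
          a === null || highByte === null
            ? "VRAM Address = (untraceable)"
            : "VRAM Address = " + hex4((highByte << 8) | a)
        );
        secondWrite = false;
      }
    }
  }
  return notes;
}
```

Feeding it the four instructions from the example produces the single line `VRAM Address = $0102`; a store whose accumulator value was never traced is reported as untraceable instead of guessed.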

This seems like a broad project... I'd like to work on it with others if there is interest. In particular, it seems to be in line with NESICIDE's goals.

August 27, 2010, 03:57:36 am - (Auto Merged - Double Posts are not allowed before 7 days.)
I've got the basic functionality down. It works, but not for all addressing modes. Need help with this (see the 6502 addressing modes post in Newbie Forum).

August 27, 2010, 04:33:06 pm - (Auto Merged - Double Posts are not allowed before 7 days.)
OK I'm going to make this a release and see if that gets it more attention.
« Last Edit: August 27, 2010, 04:33:06 pm by tcaudilllg »

Kiyoshi Aman

Re: Code Naturalizer
« Reply #7 on: August 27, 2010, 10:44:40 pm »
Attention. Attention.

The edit button is less noisy than the reply button.

Please consider using it instead of spamming Reply whenever you have a thought two hours later.

tcaudilllg

Re: Code Naturalizer
« Reply #8 on: August 28, 2010, 12:17:45 am »
If I edit it instead, will this thread still be marked as containing new posts?

Trax

Re: Code Naturalizer
« Reply #9 on: August 28, 2010, 01:48:56 am »
I agree with Tauwasser on the pertinence of the project. When you know ASM and you're used to it, reading becomes clear enough. Just like when someone doesn't know C, it will look like gibberish until it's mastered to a degree. However, some commands, especially console-specific ones, like IO ports, bank switching, PPU writing/reading, etc, could be treated in a different way and read differently, and that could be interesting, in my opinion...

BRPXQZME

Re: Code Naturalizer
« Reply #10 on: August 28, 2010, 02:29:30 am »
Quote from: tcaudilllg
If I edit it instead, will this thread still be marked as containing new posts?
If it is the last post, yes.

Quote from: Trax
I agree with Tauwasser on the pertinence of the project. When you know ASM and you're used to it, reading becomes clear enough. Just like when someone doesn't know C, it will look like gibberish until it's mastered to a degree. However, some commands, especially console-specific ones, like IO ports, bank switching, PPU writing/reading, etc, could be treated in a different way and read differently, and that could be interesting, in my opinion...
There’s a fairly interesting article this reminds me of. Definitely worth a read, though it’s longish and gets into fairly strange topics (such as why there actually is a point to the much-reviled “hungarian notation”—if you understand what you should be using it for).

tcaudilllg

Re: Code Naturalizer
« Reply #11 on: August 28, 2010, 04:12:06 am »
Well I've almost got the register and memory tracing done. In the next version there will be an option to preload the Accumulator, X, and Y registers. Come to think of it, I should also implement a memory reader so that that too can be pre-initialized. Then I aim to implement spot decompilation into BASIC, so that the example I mentioned above works. That will at least make it much clearer what the machine is actually doing. I relish the thought of seeing PPU writes reduced to one line of code. >:D

It'd be totally awesome if this functionality were implemented in say... FCEUX.  ;)

Ryusui

Re: Code Naturalizer
« Reply #12 on: August 28, 2010, 04:33:05 am »
I'd recommend setting this up so that the original ASM still gets printed, but with the "natural language" versions supplied in comments.
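
A minimal sketch of that layout, assuming each line is available as a hypothetical `{ asm, text }` pair (original assembly plus its naturalized description):

```javascript
// Print the original assembly with the description as a trailing comment,
// padding the assembly column so the comments line up.
function annotate(lines) {
  const width = Math.max(...lines.map((l) => l.asm.length));
  return lines
    .map((l) => l.asm.padEnd(width) + " ; " + l.text)
    .join("\n");
}
```

Because the assembly column is left untouched, the output stays usable as a reference next to a debugger, which a prose-only listing would not be.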

tcaudilllg

Re: Code Naturalizer
« Reply #13 on: August 29, 2010, 05:32:34 am »
OK the naturalizer is open for beta.

http://progressivesocionics.co.cc/downloads/NESNaturalizer.html

Instructions:
Load the ROM and hit start. The start address will be calculated. Put the start address of the ROM in the first range field. Put an ending address in the second field. Hit Start.


For the next version, I'm thinking it will be possible to implement some self-documenting algorithms. I noticed in the Bomb Sweeper ROM that after they stored immediate PPU states in the accumulator, the author stored the states in memory BEFORE storing the accumulator to the PPU. And then, just left them there.

Of course there was a purpose for that: they intended to pull those states out of memory later instead of explicitly specifying the states every time. And that's the thing: those states are associated, under certain conditions anyway, with the PPU. So as far as the naturalizer is concerned, you might as well just mark them as such from that point on.
« Last Edit: August 29, 2010, 09:44:24 am by tcaudilllg »

KingMike

Re: Code Naturalizer
« Reply #14 on: August 29, 2010, 10:40:16 am »
Quote from: tcaudilllg
For the next version, I'm thinking it will be possible to implement some self-documenting algorithms. I noticed in the Bomb Sweeper ROM that after they stored immediate PPU states in the accumulator, the author stored the states in memory BEFORE storing the accumulator to the PPU. And then, just left them there.
I haven't looked at the code, but I believe that the NES PPU control registers are write-only. (as in, LDA $2000/$2001 should not be expected to return valid data)
So, you have to copy the values to RAM if you want to be able to read them later.
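
The pattern described here (keeping a readable copy in RAM because the register itself cannot be read back) can be modelled with a toy sketch. Everything below is illustrative: the register model, the `setPpuCtrl` helper, and the shadow location `0x0010` are assumptions, not taken from any particular game.

```javascript
// Toy model of a write-only hardware register.
function makeWriteOnlyRegister() {
  let value = 0; // hidden hardware state
  return {
    write(v) { value = v & 0xff; },
    read() { return null; },       // stand-in for "no valid data"
    debugPeek() { return value; }, // for this sketch only; real code has no such access
  };
}

const ram = new Uint8Array(0x800); // 2 KB of internal RAM
const PPUCTRL_SHADOW = 0x0010;     // hypothetical RAM location for the copy

// Store the value in RAM first, then write it to the register.
function setPpuCtrl(reg, v) {
  ram[PPUCTRL_SHADOW] = v;
  reg.write(v);
}
```

For a naturalizer, this suggests that a RAM address written immediately before an identical store to $2000 or $2001 is a good candidate for an automatic "shadow of PPU register" label.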

UglyJoe

Re: Code Naturalizer
« Reply #15 on: August 29, 2010, 01:30:49 pm »
Couldn't a tiny bit of devious coding break this entire method?  Like this example (it's GB asm, but applies universally):

from http://web.archive.org/web/20070807121547/http://www.bripro.com/low/obscure/index.php?page=hko_sm3s:
Quote
Additionally, they do other tricks using jump opcodes which will fool even BGB's disassembly. However, stepping through and using an interactive disassembler as reference on the side, it only slows down hacking, it could never stop it!

Here's an example:
How the disassembler sees it:
 1234: xx xx  jr 1237
 1236: xx xx  (a data byte of an opcode with operands)
 1238: <error invalid opcode>
How the real code is:
 1234: xx xx  jr 1237
 1236: xx     .db $xx
 1237: xx xx  (a valid opcode)

Tauwasser

Re: Code Naturalizer
« Reply #16 on: August 29, 2010, 03:23:23 pm »
Quote from: UglyJoe
Couldn't a tiny bit of devious coding break this entire method?  Like this example (it's GB asm, but applies universally):

from http://web.archive.org/web/20070807121547/http://www.bripro.com/low/obscure/index.php?page=hko_sm3s:
Quote
Additionally, they do other tricks using jump opcodes which will fool even BGB's disassembly. However, stepping through and using an interactive disassembler as reference on the side, it only slows down hacking, it could never stop it!

Here's an example:
How the disassembler sees it:
 1234: xx xx  jr 1237
 1236: xx xx  (a data byte of an opcode with operands)
 1238: <error invalid opcode>
How the real code is:
 1234: xx xx  jr 1237
 1236: xx     .db $xx
 1237: xx xx  (a valid opcode)

Sorry to disappoint you, but this is neither devious coding nor an attempt to fool disassemblers. Z80gb has varying opcode widths, so code doesn't necessarily start aligned with whatever precedes it. It's a problem in the disassembly logic of BGB that makes it unable to display code starting from an offset it considers invalid or interprets differently, even when specifying the offset by hand! As long as the code naturalizer only follows jumps and interprets pointers and doesn't try to automatically disassemble all code, you will be fine, because the actual jumps will go to 0x1237.
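
The difference between decoding every byte in order and following jumps can be shown with a small sketch. The three-opcode instruction set below is invented purely for illustration (it is not the Game Boy's), and the byte values are chosen to reproduce the 0x1234 example quoted above.

```javascript
// Toy instruction set, invented for this sketch:
//   0x01 nn  jr   (jump forward nn bytes past the end of the instruction)
//   0x02 nn  ld   (two-byte load)
//   0x03     ret
function decode(bytes, base, addr) {
  const i = addr - base;
  const op = bytes[i];
  if (op === 0x01 && i + 1 < bytes.length) {
    return { addr, size: 2, text: "jr", target: addr + 2 + bytes[i + 1] };
  }
  if (op === 0x02 && i + 1 < bytes.length) return { addr, size: 2, text: "ld" };
  if (op === 0x03) return { addr, size: 1, text: "ret" };
  return { addr, size: 1, text: "invalid" };
}

// Linear sweep: decode every byte in order, ignoring control flow.
function linearSweep(bytes, base) {
  const out = [];
  for (let addr = base; addr < base + bytes.length; ) {
    const ins = decode(bytes, base, addr);
    out.push(ins);
    addr += ins.size;
  }
  return out;
}

// Flow-following: decode only what execution can reach from `entry`.
function followFlow(bytes, base, entry) {
  const out = [];
  const seen = new Set();
  const work = [entry];
  while (work.length > 0) {
    let addr = work.pop();
    while (addr >= base && addr < base + bytes.length && !seen.has(addr)) {
      seen.add(addr);
      const ins = decode(bytes, base, addr);
      out.push(ins);
      if (ins.text === "ret" || ins.text === "invalid") break;
      if (ins.text === "jr") { work.push(ins.target); break; }
      addr += ins.size;
    }
  }
  return out.sort((x, y) => x.addr - y.addr);
}
```

With the bytes `[0x01, 0x01, 0x02, 0x02, 0x05, 0x03]` loaded at 0x1234, the linear sweep reads the data byte at 0x1236 as an opcode and reports an invalid instruction at 0x1238, while the flow-following version decodes 0x1234, 0x1237, and 0x1239 and never touches the data byte.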

cYa,

Tauwasser

UglyJoe

Re: Code Naturalizer
« Reply #17 on: August 29, 2010, 03:31:37 pm »
Quote from: Tauwasser
Sorry to disappoint you

Why would I be disappointed?

Quote from: Tauwasser
As long as the code naturalizer only follows jump and interprets pointers and doesn't try to automatically disassemble all code, you will be fine, because the actual jumps will go to 0x1237.

Then I suppose the better question for me to have asked is, does it do this or doesn't it?

tcaudilllg

Re: Code Naturalizer
« Reply #18 on: August 29, 2010, 04:15:28 pm »
Not yet. But it will with your help. :)

It might be a good idea to include naturalization functions in FCEUX, don't you think?

I think the primary advantage of naturalization is that reading natural code is simply easier from a conceptual standpoint than using mnemonics. If that weren't true, high level languages wouldn't be used in the first place.

Edit: I just realized you were referring to whether it auto-disassembles. What the naturalizer does is this: it tries to read instructions from a user-specified start address up to a user-specified end address. Ideally the naturalizer would be a component of a debugger. It's the responsibility of the programmer to judge whether they are reading a correct or incorrect interpretation of the code.

Here's how I suggest using the naturalizer: naturalize 1 page of code starting from the code reset vector. This is calculated for you automatically. Copy the output to a text file. Then go back and naturalize from each subroutine address. Make a map of the code. To confirm, use a debugger.
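
How the start address might be calculated can be sketched as below. This is not the naturalizer's actual code; it is a minimal sketch assuming an iNES-format file whose PRG-ROM ends at the top of CPU address space (true for simple NROM cartridges, and only sometimes for other mappers). The reset vector lives at CPU $FFFC-$FFFD, stored low byte first.

```javascript
// Read the reset vector from an iNES image (Uint8Array or plain array).
function resetVector(rom) {
  const isInes =
    rom[0] === 0x4e && rom[1] === 0x45 && rom[2] === 0x53 && rom[3] === 0x1a;
  if (!isInes) throw new Error("not an iNES file");

  const prgSize = rom[4] * 16384;              // PRG-ROM size, in 16 KB units
  const trainer = (rom[6] & 0x04) ? 512 : 0;   // optional 512-byte trainer
  const prgStart = 16 + trainer;               // the header is 16 bytes
  const vectorOffset = prgStart + prgSize - 4; // CPU $FFFC in the last bank

  // 6502 vectors are little-endian: low byte first.
  return rom[vectorOffset] | (rom[vectorOffset + 1] << 8);
}
```

The returned value is a CPU address (for example $C000), so it still has to be translated back to a file offset before reading instruction bytes from the ROM image.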

If you make a map of the code, you can of course figure out where the pitfalls are. In fact, it should be possible to design the naturalizer to automatically fix code by comparing code naturalized from subroutine start addresses to the code following the subroutines (if the naturalizer disassembles the subroutine code).


Here's my plan for the next version of the naturalizer:
  • Allow user documentation. The plan is to offer an input box into which a file can be loaded with appropriate labels for memory addresses depending on the line. Memory addresses may be used for different purposes from line to line, thus it is necessary to distinguish between purposes according to line.
  • Saving readouts to disk.
  • Output sub and jump addresses to a list to make mapping easier.
The machine emulator project will be siphoned off into a separate "advanced" version.
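
The planned list of sub and jump addresses could be gathered along these lines. The opcode table is deliberately partial (six real 6502 opcodes, enough for the demonstration); a real tool would need the full instruction set, and the function name `collectTargets` is hypothetical.

```javascript
// Partial 6502 opcode table: just enough for this sketch.
const OPCODES = {
  0x20: { name: "JSR", size: 3, kind: "sub" },  // jump to subroutine, absolute
  0x4c: { name: "JMP", size: 3, kind: "jump" }, // jump, absolute
  0xa9: { name: "LDA", size: 2 },               // load A, immediate
  0x8d: { name: "STA", size: 3 },               // store A, absolute
  0xea: { name: "NOP", size: 1 },
  0x60: { name: "RTS", size: 1 },
};

// Decode instruction by instruction and record JSR/JMP targets.
function collectTargets(bytes) {
  const subs = new Set();
  const jumps = new Set();
  let i = 0;
  while (i < bytes.length) {
    const info = OPCODES[bytes[i]];
    // Stop at an opcode this sketch does not know, or a truncated instruction.
    if (!info || i + info.size > bytes.length) break;
    if (info.kind) {
      const target = bytes[i + 1] | (bytes[i + 2] << 8); // little-endian operand
      (info.kind === "sub" ? subs : jumps).add(target);
    }
    i += info.size;
  }
  const sorted = (s) => [...s].sort((a, b) => a - b);
  return { subs: sorted(subs), jumps: sorted(jumps) };
}
```

Using sets means a subroutine called from many places appears once in the list, which keeps the map short enough to work through by hand.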

August 30, 2010, 11:38:08 pm - (Auto Merged - Double Posts are not allowed before 7 days.)
Quote from: BRPXQZME
Quote from: tcaudilllg
If I edit it instead, will this thread still be marked as containing new posts?
If it is the last post, yes.

I did a test, and the results were negative. So until you guys make the appropriate modification, I'm going to have to use additional replies to draw attention.

August 31, 2010, 05:28:14 am - (Auto Merged - Double Posts are not allowed before 7 days.)
UPDATE: I just tried loading the entire code page for the Bomb Sweeper ROM. Firefox began to choke. I only have 512 MB of memory so that might have had something to do with it. Does anyone else have problems?

Oh yeah I don't think I mentioned it but this only works in Firefox.
« Last Edit: August 31, 2010, 05:28:14 am by tcaudilllg »

Lenophis

Re: Code Naturalizer
« Reply #19 on: September 01, 2010, 01:51:26 am »
Quote from: tcaudilllg
I did a test, and the results were negative. So until you guys make the appropriate modification, I'm going to have to use additional replies to draw attention.
It won't show new for you, because you already read your own edit. It shows up as new for everybody else. The auto-merging this board does is just a glorified edit with an extra header.

