logo
 drop

Main

Community

Submissions

Help

71498478 visitors

Author Topic: Program Design and Data Structure (TextAngel)  (Read 8935 times)

Nightcrawler

  • Hero Member
  • *****
  • Posts: 5734
    • View Profile
    • Nightcrawler's Translation Corporation
Program Design and Data Structure (TextAngel)
« on: April 07, 2011, 02:34:52 pm »
I'm looking to get a little help or ideas on some design aspects of my WIP utility TextAngel. In a bit of a novel approach, this program aims to both dump and insert using the same interface. I found that if I had enough information to get the text out, that same information is usually enough to get the text back in. That holds true for most supported cases. I offer Atlas compatible dumping output in the event you need to go beyond what my program can handle insertion wise. Sometimes a novel approach is novel only because everybody else thought it was too stupid to bother with! And for every successful idea, there are many failed ones. I hope that is not the case here! :laugh:

Here are some screenshots and an overview of how the utility is currently laid out:
http://transcorp.parodius.com/forum/YaBB.pl?num=1273690996

Software design wise, I have an input, a pipeline, and an output. Then I simply have descriptor attributes about each to determine exactly what to do with it in the pipeline, and helper objects to do the work. The idea here is when you are dumping, you assemble all source pointers and encoded text strings into a standardized data structure. When you insert, you would assemble all destination pointers and decoded hex strings into that same standardized data structure. Then you can post process in any variety of ways to transform, order, and process before you sent it over to the final output stage. In the output stage, you handle outputting to text files, inserting into ROM etc. The output stage is somewhat simple because it's just going to traverse the data structure and insert pointers and strings per what's defined there.
 
The first question one might ask me is what kind of data structure are we talking about here and how it can serve both dumping and inserting needs? That's a good question and one of the main design issues I'm looking for assistance on.

I'm sure I may need to collect more information later, but this is what I have now which works for most of my test cases thus far.

StringInfo Object:
Properties:
StringStart - Starting location from data file
StringEnd - Ending location from data file
Text - Text String
Hex - Hex Encoded Representation of String
EncodedByteLength - Original byte length when dumping (assumes post processing will alter 'Hex' property.

PointerInfo Object:
Properties:
Format - A PointerFormat structure with all associated parameters describing the format of this pointer.
PointerLocation - The location the pointer was taken from for dumping. The location it should be written to for insertion.
PointerValue - The pointer value itself.

Note: Methods are immaterial here and not listed. The issue at hand is with data structure/representation only.

Now, I just tie them together and make an associative array out of it. Each entry has a pointer and associated string and all information we need to do a variety of things with them before output. So, the standardized data structure is the associative array.


Questions:

1. Does this approach make sense to anybody else? I am a solid mechanical coder, but I have never been great at software design.

2. If the approach doesn't sound like one a retarded monkey would take, does anyone have any better ideas for a standardized data structure? Mine feels a bit off to me.

3. I seemed to back myself into a bit of a wall if I attempt to do something a bit more complicated. For example, the pointer tree method, even with just two layers, assumes you'd have a 'master' pointer table with every pointer pointing to smaller pointer tables with pointers to strings. How do I represent this? Well, with my seemingly naive design, I would simply have a list of pointers (that do not have string info) with each pointing to it's own associative string/pointer pair array. That doesn't sound too horrible at first until I think about adding more layers, and how cumbersome it will become to navigate. It seems to point to a problem in my data structure.

Any feedback or ideas would be helpful.
TransCorp - Over 15 years of community dedication.
Dual Orb 2, Wozz, Emerald Dragon, Tenshi No Uta, Herakles IV SFC/SNES Translations

Klarth

  • Sr. Member
  • ****
  • Posts: 422
  • Location: Pittsburgh
    • View Profile
Re: Program Design and Data Structure (TextAngel)
« Reply #1 on: April 07, 2011, 04:44:56 pm »
When doing a joint structure for dumping and insertion, there's a few considerations.  First is how (and if) you are going to keep all the data persistent between the dumping and insertion stages.  No problem here for XML.  But if you dump all the data to a single text file, you can easily overwhelm the translator/editor with information that's unnecessary from their standpoint.  Ideally, the latter approach would have either text <-> spreadsheet or a companion editor.

I'd need some examples about the type of trees you're going to support.  And probably more input from other people who have dumped tree structures.  Whether they're true trees or if it's more like an exercise in building a VFS and realigning a lot of data so that expansion is possible.

abw

  • Newbie
  • *
  • Posts: 42
    • View Profile
Re: Program Design and Data Structure (TextAngel)
« Reply #2 on: April 07, 2011, 11:40:17 pm »
StringInfo looks quite reasonable, but PointerInfo looks a little on the light side compared to what I'd expect. What exactly does your PointerFormat contain? Does it include string format information? If so, would it make more sense to store that information inside StringInfo? Assuming C# has decent regular expression support, I'd be tempted to store string format as a regular expression (passing in any required info to StringInfo's constructor, of course):

C-style: [^\0]*\0 (or whatever your end token is)
1-byte Pascal: (.).{hex2dec(\1)}
fixed length: .{length}
from next pointer: .{hex2dec(next pointer's target address - this pointer's target address)}
etc.

You could then use the regular expression to figure out the length/end point in cases where it might otherwise be difficult (or indeed in any case, which should cut down on the required coding). Maybe you could pre-define common formats and allow users to specify arbitrary formats with interpolated quantities.

There are probably about a million things I haven't taken into consideration, but I think I'd start with a structure something like this (and then vacillate on names for a few days :P):

Pointer:
property nameproperty descriptionexampledump sourceinsert source
source addresswhere the pointer lives in ROMthe $28016 part of #W16($28016)specified by user or calculated from pointer tablespecified by user or calculated from pointer table
formatendianness, length2-byte little-endian pointer: B2B1
3-byte big-endian pointer: B1B2B3
specified by userspecified by user
contentvalue of the pointer in ROMread from ROMcalculated from target address and offset
target addressaddress of beginning of string
(purely for convenience, since this information is replicated in the String object)
calculated from content and offsetspecified by user or calculated from target address and offset
stringdata to which the pointer pointsconstructed String objectconstructed String object
offsetconstant used to convert between content and target address
(this conversion is always target address = content + offset, right?)
calculated from #JMP($29500) and #W16($28016)specified by userspecified by user

String:
property nameproperty descriptiondump sourceinsert source
start addressROM address of beginning of stringpointer's target addresspointer's target address
end addressROM address of ending of stringcalculated from formatcalculated from format
hexROM contentROMcalculated by insert function
textdump/insert file contentcalculated by dump functionspecified by user
formatregular expression (created during object construction)calculated by dump functioncalculated by insert function

Nightcrawler

  • Hero Member
  • *****
  • Posts: 5734
    • View Profile
    • Nightcrawler's Translation Corporation
Re: Program Design and Data Structure (TextAngel)
« Reply #3 on: April 08, 2011, 10:07:45 am »
When doing a joint structure for dumping and insertion, there's a few considerations.  First is how (and if) you are going to keep all the data persistent between the dumping and insertion stages.  No problem here for XML.  But if you dump all the data to a single text file, you can easily overwhelm the translator/editor with information that's unnecessary from their standpoint.  Ideally, the latter approach would have either text <-> spreadsheet or a companion editor.

I don't think the data needs to be persistent much at all. Although you use the same interface, many parameters are going to be different. It's very typical for me to dump a script and when I insert, stick it somewhere entirely different, change pointer format, and change string formats. Or I may choose to do none of that and insert exactly where it came from. Either way, you'd have to set up a different 'project' file for insertion if nothing else but to choose your new input and output file scheme. Most necessary information is provided in the GUI. The only data that would need to be persistent outside of the GUI would be a few markers for ends of pointer tables or blocks/tree levels to be able to insert (The markers could simply be a subset of Atlas commands for simplicity). There will be a point where you'd need to use Atlas because the level of tuning or changes you want to make is going to exceed what the utility provides. I do not plan on duplicating full Atlas functionality by parsing all Atlas commands. I want to do it entirely via the GUI, but understand some markings, especially in the case where dump output went all to one file is necessary. I wasn't even going to have the utility insert at all originally, but I found it very convenient for my own usage for many simpler scenarios.

Quote
I'd need some examples about the type of trees you're going to support.  And probably more input from other people who have dumped tree structures.  Whether they're true trees or if it's more like an exercise in building a VFS and realigning a lot of data so that expansion is possible.

OK, let's take this real world example that I have seen. I'm limited to in-depth knowledge of application on the NES/SNES here as I have not worked with other systems enough to really know what type of trees you would have or how closely it aligns with typical VFS scenarios.

You start with a 24-bit pointer table. Each pointer points to a small group of 16-bit pointers. Each 16-bit pointer there points to a final table with 16-bit relative pointers to strings. So, you have pointers that point to strings and two layers of pointers that just point to pointer tables. It probably doesn't get much deeper than that for the systems I have worked with, but it's probably possible.


abw:

PointerFormat contains much. It contains all of the information from the Pointer Tab and methods to convert to and from various formats and platforms. It does not contain StringFormat information. I have a StringFormat object in the project settings containing that information. It's assumed the same string format will be used throughout the project. I did this because I view the string format as only really being needed to parse into strings initially and do hex-to-text or hex-to-text conversion. After that, I didn't need it anymore. I set up different projects for main dialog, items etc. that may use different string formats. I think the interface can get very complicated/cumbersome if it's not set up like that. However, I can see it might be useful and logical to attach this to each individual string in StringInfo. I will give it some thought.

I don't know if regular expressions work. With C-style, you have any number of end tokens from any of your tables that can indicate the end. I don't need to know what they all are, just know that I stop when I hit one because it's C-style. The table object will indicate to the decoding stream when an end token has been parsed. Pascal is more complicated as well. The length byte may or may not be included, the pascal length may be multiple bytes, and then endian would come into play. I have a hard time imaging how all of the operations I do would work with regular expressions. Although I have a difficult time in general with regular expressions. Can you expand on this?

We more or less have the same actual data. You fleshed out the meaning better than me. :) You organized differently, which is helpful in seeing why my original was a bit 'off'. A few ideas I liked. You put the string data/object in the Pointer object. I initially thought of it as not really describing the Pointer. I also thought sometimes there would be no pointers involved such as a raw offset to offset dump. But, that's still technically a pointer and it simplifies everything greatly to add it in there rather than glue a string and pointer together like I did with an associative array. What about pointers that only point to pointers such as the pointer tree example above to Klarth? You can have a List of pointers, but what if it goes a layer up where you have pointers to pointers to the pointers with strings? I'm a little hung up here.

You added the StringFormat to String. I guess that makes sense. I touched on why I didn't already do that above. But it does simplify things even if I left the GUI the same.

I have the 'offset' in my PointerFormat object as I believe it's part of the format. There are several parameters in there and the offset may not always be a constant. There needs to be a function to convert from linear to machine and vice versa. I took the approach of storing all linear addresses internally and then use the conversion function when necessary for output of machine addresses or upon input converting a pointer.

I initially wanted to label it a String object as well, but I don't think you can do that. The compiler complained. That's how I ended up with StringInfo.
TransCorp - Over 15 years of community dedication.
Dual Orb 2, Wozz, Emerald Dragon, Tenshi No Uta, Herakles IV SFC/SNES Translations

Geiger

  • Newbie
  • *
  • Posts: 38
  • Location: Indianapolis, IN, USA
    • View Profile
    • Geiger's Crypt
Re: Program Design and Data Structure (TextAngel)
« Reply #4 on: April 08, 2011, 12:27:51 pm »
From a cursory glance, I do not immediately see anything wrong with your layout.  The pointer data (as I understand it) would not be able to support the complex situations I have encountered in Chrono Trigger though.  You would need the ability to have an arbitrary number of pointers and the capability to support pointers that do not have all of their data together.  You might also want to add the ability to read substring data from the ROM as part of the string table.
This is the patent age of new inventions -/- For killing bodies, and for saving souls, -/- All propagated with the best intentions. --Lord Byron

Nightcrawler

  • Hero Member
  • *****
  • Posts: 5734
    • View Profile
    • Nightcrawler's Translation Corporation
Re: Program Design and Data Structure (TextAngel)
« Reply #5 on: April 08, 2011, 01:21:58 pm »
From a cursory glance, I do not immediately see anything wrong with your layout.  The pointer data (as I understand it) would not be able to support the complex situations I have encountered in Chrono Trigger though.  You would need the ability to have an arbitrary number of pointers and the capability to support pointers that do not have all of their data together.  You might also want to add the ability to read substring data from the ROM as part of the string table.

I'd love to see some specific examples. I'm pretty sure I've got it covered in some way if it's practical to do in a generic utility. SNES is my forte after-all. Arbitrary pointers can be handled via the pointer list option. You can specify individual pointers anywhere throughout the entire ROM if you so desire. Additionally, there is auto discovery mode that will find and extract the pointers for you from a given range based on your pointer parameters and string start locations from a string data range (subject to possible false positives for small pointers). Between those, you should even be able to use it on cases where the 'pointer' is hard-coded via LDA immediate instructions. If it goes beyond that, you probably need a custom solution.

The term 'substring' is a bit ambiguous. We've spoken briefly in another topic on substrings as overlapping string pointers, which are possible to handle. However, you may be talking about a 'substring' in the context of an embedded pointer within a string that commands the game to jump to another string. If that's the case, that's much more difficult requiring some form of script interpretation and game specific logic. At this time, that's probably not practical unless you have some ideas on implementation or program design modifications one could make to support something like this. Also, there's the whole user interface issue for something like that.

Perhaps script interpretation could be done via a post processing routine that uses some  instructions in an external file where your game specific stuff would be. Maybe that file would have some regular expression rules that could dictate when you see a certain sequence, it loads a new pointer. Or maybe something more advanced like a scripting language extension. Just bouncing some ideas around. I don't know if I want to get that ambitious. However, it is smart to think about it now so groundwork is laid so it could be done later if desired.

If you have any specific examples that you think are still not handled after my explanation, feel free to run them by me for benefit of all. :)
TransCorp - Over 15 years of community dedication.
Dual Orb 2, Wozz, Emerald Dragon, Tenshi No Uta, Herakles IV SFC/SNES Translations

abw

  • Newbie
  • *
  • Posts: 42
    • View Profile
Re: Program Design and Data Structure (TextAngel)
« Reply #6 on: April 09, 2011, 03:11:11 am »
Yeah, "String" is a reserved word in several languages, so I can see C# maybe complaining too. If you're willing to assume all strings in a project are of the same format, then storing that format at a higher level should reduce your program's memory requirements. On the other hand, storing string format information at the string level allows you to handle strings in the same dump/insert session with different formats, though I agree this is unlikely to happen very often. It might also help with code compartmentalization, depending on where you do other related activities. Since it isn't used very often, I don't see any compelling reason to store it in one place as opposed to another; it just seems more logical to store it in StringInfo, what with it being information about the string and all  :-\.

A string format is just a pattern, and regular expressions are all about pattern matching. I can spit out some examples from perl (which has excellent regular expression support) off the top of my head, if that's helpful. I'm short a compiler at the moment, so let's just assume I got the syntax right. I'm using them inline, but you could also store the format in StringInfo if you ever needed it later. Also, these examples are not particularly concerned with computational efficiency. My main point here is that I can imagine writing a hundred lines of character-by-character string parsing code that would not accomplish anything more than these few lines using regular expressions, so if you can do the same thing in C#, you might be able to save yourself a lot of time.

C-style: the process is identical whether reading from ROM or from file; only the end tokens change
Code: [Select]
my @end_tokens = ('<end>', '<really end>', '<end right now>', '<this time I mean it!>'); # these would really come from the Table object
my $combined_end_token = join('|', @end_tokens); # $combined_end_token is now '<end>|<really end>|<end right now>|<this time I mean it!>'
my ($string) = (substr($data, $start_address) =~ /^(.*?$combined_end_token)/);
That last line looks complicated but really isn't - starting from $start_address, it searches through $data until it finds the first occurance of any of the end tokens, then stores whatever it found in $string.

Pascal: this one only works for dumping... people write pascal inserts in C-style, no? If not, there are other ways to do this
Code: [Select]
my ($pascal_length, $pascal_endian, $include_length) = (3, 'big', 1); # these would really be user-specified (also: 1 is a true value in perl)
my ($string) = (substr($data, $start_address) =~ /^(.{$pascal_length}).{hex2dec(\1, $pascal_endian, $include_length)}/);
I'm pretty sure no common programming language comes with a built in multibyte endian aware radix converter, so you'd have to write your own hex2dec function for this to work, but that wouldn't take long.

Fixed Length: the process is identical whether reading from ROM or from file; only the length changes
Code: [Select]
my $length = 432; # this would be user-specified
my ($string) = (substr($data, $start_address) =~ /^(.{$length})/);

From Next Pointer: probably also bidirectional, but I don't think there's a standard insert representation for this type
Code: [Select]
my $length = $next_start_address - $start_address; # assuming these are in decimal; these would come from the Pointer object referencing this String
my ($string) = (substr($data, $start_address) =~ /^(.{$length})/);
As you noted in the other thread, these are almost identical... with the addition of a fake n+1 pointer, Fixed Length could be implemented as From Next Pointer where all the pointers just happen to differ by the same amount.


I also had pointer offset as a method rather than a property but didn't feel like introducing methods since you hadn't either ;). In order to handle things like Cartographer's POINTER_RELATIVE_PC, it would have to be a function rather than a constant, but you could always just store the results of that function in a property.

As a general comment, when working with tree-like data structures, recursion is your friend. In this case, though, I think polymorphism is what you're looking for. Instead of having PointerInfo contain a StringInfo reference, I'd probably make a Pointee (or whatever you want to call it) class, and then have PointerTable, PointerInfo, and StringInfo all inherit from Pointee. Then you could say things like:

Code: [Select]
pure virtual function Pointee.dump() {} // or however you say it in C# - this is a method that must be independently implemented by any subclass of Pointee

function PointerTable.dump() {
    for (Pointee in this.PointerList) {
        Pointee.dump();
    }
}

function PointerInfo.dump() {
    pointee.dump(); // is pointee (the name of this object's Pointee) a PointerTable or a StringInfo (or even another PointerInfo)? it doesn't matter anymore!
}

function StringInfo.dump() {
    output text/hex;
}

How you get the user to specify an arbitrary pointer tree structure is another question entirely, of course, but at least you'd be able to support it if they did manage to specify one :P

How much progress have you made on mapping out program flow apart from the GUI (which I see you've already got :thumbsup:) and file I/O (which is probably boring)?

Nightcrawler

  • Hero Member
  • *****
  • Posts: 5734
    • View Profile
    • Nightcrawler's Translation Corporation
Re: Program Design and Data Structure (TextAngel)
« Reply #7 on: April 11, 2011, 03:10:58 pm »
A string format is just a pattern, and regular expressions are all about pattern matching. I can spit out some examples from perl (which has excellent regular expression support) off the top of my head, if that's helpful. I'm short a compiler at the moment, so let's just assume I got the syntax right. I'm using them inline, but you could also store the format in StringInfo if you ever needed it later. Also, these examples are not particularly concerned with computational efficiency. My main point here is that I can imagine writing a hundred lines of character-by-character string parsing code that would not accomplish anything more than these few lines using regular expressions, so if you can do the same thing in C#, you might be able to save yourself a lot of time.

Thanks for the expansion. I understand. However, I think I disagree here with application on C-style and I'll tell you why.

Quote
C-style: the process is identical whether reading from ROM or from file; only the end tokens change
Code: [Select]
my @end_tokens = ('<end>', '<really end>', '<end right now>', '<this time I mean it!>'); # these would really come from the Table object
my $combined_end_token = join('|', @end_tokens); # $combined_end_token is now '<end>|<really end>|<end right now>|<this time I mean it!>'
my ($string) = (substr($data, $start_address) =~ /^(.*?$combined_end_token)/);
That last line looks complicated but really isn't - starting from $start_address, it searches through $data until it finds the first occurrence of any of the end tokens, then stores whatever it found in $string.

I agree this is true for the insertion direction, however it does not work in the dumping direction. You cannot parse based on your hex end tokens because they may appear elsewhere in the string and NOT be an end token. Example of two bytes for Kanji and one byte end token. The end token byte may appear as part of a Kanji character. Additionally, you have things like linked entries where the end token hex may appear as parameter bytes there. You can't know the hex is actually an end token without parsing the data to that point. You need to know the context of which it appears.

Actually, that fact has been a pain for me with program flow and design simplicity. I cannot extract into strings without parsing. Then, if I'm already going to have to parse, I end up having to do the string separation and hex to text conversion in the same step. The text encoder ends up having to yell out to the higher ups when it has decoded an end token, so they can then do the string separation and reset the stream variables etc. It would be much simpler if the it was able to be parsed into strings ahead of time without that intimate knowledge.

Quote
Pascal: this one only works for dumping... people write pascal inserts in C-style, no? If not, there are other ways to do this
Code: [Select]
my ($pascal_length, $pascal_endian, $include_length) = (3, 'big', 1); # these would really be user-specified (also: 1 is a true value in perl)
my ($string) = (substr($data, $start_address) =~ /^(.{$pascal_length}).{hex2dec(\1, $pascal_endian, $include_length)}/);
I'm pretty sure no common programming language comes with a built in multibyte endian aware radix converter, so you'd have to write your own hex2dec function for this to work, but that wouldn't take long.

There is still need to insert strings in pascal. Depending on your game, it can be a monumental task to recode the engine to use C-style and pointers rather than say Pascal and string counting. It can be much simpler just to insert in pascal. I certainly would not discount or omit the ability to insert in pascal. The utility should be able to dump or insert with all of it's supporting string types. Lastly, I actually have a project myself where I want this pascal ability.

Quote
As a general comment, when working with tree-like data structures...,
How you get the user to specify an arbitrary pointer tree structure is another question entirely, of course, but at least you'd be able to support it if they did manage to specify one :P

Nice idea there with the polymorphism. I will think about it further. The hangup is there are pointers to strings and pointers to pointers. To start, to support the trees I know of, all that is needed from the user is the pointer format and start/end for each tree level. I'm sure it may be limited, but it is experimental and I'm just dipping feet in to see how useful it will be. Primarily it will cover what I am familiar with.

Quote
How much progress have you made on mapping out program flow apart from the GUI (which I see you've already got :thumbsup:) and file I/O (which is probably boring)?

The program is functional in it's rawest form. I've used it on a few projects of mine. It needs lots of polish and several areas fleshed out. It also needs some test code written as it supports many possibilities. A small change for case three may break case twelve without decent regression testing. Since my design has been in constant flux, I haven't done a lot of that stuff.

As I mentioned, there's Input, Pipeline, and Output. I've tried to do the whole thing with a MVP design pattern. I'm good at MVC on the web with a database, but I'm not sure how to properly handle a multi-tab desktop application like this. So, I have a single large View and Presenter for it all that handles the whole user interface end. Upon trying to dump or insert, the data is handed off to the Model in suitable format by the Presenter.

TextAngelProject - This is a master object that ends up getting all data from all tabs into organized Input, Output, and helper objects, including global program options. It can save and load itself from XML. It handles initiating the pipeline process and tying it together with input and output.

Currently the input object has a set of methods such as RawDump, RawInsert, PointerListDump, PointerListInsert, PointerDump, PointerInsert etc. Those will take care of what they need to do each of those operations. The result of each would be the StringInfo/PointerInfo data structure we spoke about. That's passed on to the output object for writing to file. It just traverses the data structure and turns into the final output. In the dump case, a UTF-8 text file. I haven't implemented any serious post processing yet.

Simplified version of method TextAngleProject.DumpText():

Code: [Select]
public void DumpText()
        {

            int block = 1;
            foreach(DataFileInfo f in inputsource)
            {
               
                Dictionary<Pointer,StringInfo> output;
                if(f.PointerMode==DataFileInfo.PointerHandlingType.NO_POINTERS)
                {
                    output = f.RawDump(tablefiles, encodingformat, stringformat);
                }
                else if(f.PointerMode==DataFileInfo.PointerHandlingType.POINTER_LIST)
                {
                    output = f.PointerListDump(tablefiles, encodingformat, stringformat, pointerformat);
                }
                outputsource.WriteFile(output,f,block,encodingformat);
                block++;
            }
           
        }

It's a bit bastardized right now from changing design and there's no post processing options yet, but that's what I've got at present. The idea was the input should know everything it needs to know to dump or insert itself to our standardized format (the dictionary in the code above). Then it would be post processed and passed to the output object for final output. There's an awful lot of logic in the input object. I don't know if it makes sent to have it all there. I've been looking at extracting some out. I blur my own definition of what my input object is. I sort of treat it as a file sometimes and other times I treat it more as any given input operation.

Anyway, I wouldn't mind suggestions here either. I can always create working code to get the job done, but I struggle with smart, flexible design. The only way to get better is to keep trying and keep exposing yourself to others code and ideas to further build upon your own. :)
TransCorp - Over 15 years of community dedication.
Dual Orb 2, Wozz, Emerald Dragon, Tenshi No Uta, Herakles IV SFC/SNES Translations

abw

  • Newbie
  • *
  • Posts: 42
    • View Profile
Re: Program Design and Data Structure (TextAngel)
« Reply #8 on: April 11, 2011, 11:41:26 pm »
Ah, sorry, I should have been more clear with my Pascal example. I meant that the insert file would use artificial end tokens (e.g. /<END>\n\n) which look like C-style terminators rather than explicitly saying e.g.

#W16($0002)<$05>Hello
#W16($0004)<$06>World!

Also, just as a proof of concept, here's a more robust regexp for C-style strings:
Code: [Select]
# these would be programmatically constructed
my $regular_tokens = '01|FD00|';
my $end_tokens = 'FD94|FF|';
my $linked_tokens = 'E0....|';

my $non_end_tokens = $regular_tokens . $linked_tokens;
$non_end_tokens =~ s/\|\|/\|/g; # compress duplicate |
$non_end_tokens =~ s/^\||\|$//g; # remove leading and trailing |
# $non_end_tokens is now 01|FD00|E0....

my ($string) = (substr($data, $start_address) =~ /^($non_end_tokens*?$end_tokens)/);

However, as you quite rightly point out, you are writing an entire tokenizer. How could you not? And if you're doing that already, you might as well leverage it for higher level functionality. Clearly I wasn't thinking enough about the bigger picture here... Actually, this looks kind of like writing a compiler. Anyway, I think you could still do everything you need with regular expressions, but given that you already have a working implementation, there's not much point to changing it. Even if you didn't already have this implemented, I'm not actually sure regular expressions would be better. I just brought them up as another tool worth considering.


The input stage does need to know an awful lot. I suspect that's where the bulk of my important code would live too. I've spent enough time cleaning up after other people's bad design decisions to have some notion of what works and what doesn't. Maybe I'll sit down with pencil and paper and try to glue the various components I've been thinking about together into a cohesive whole.


On a related tangent, is this still the most recent new table file format documentation?

In section 2.3, how would you feel about allowing \ as a general escape character, as is the case in many other languages? It doesn't matter much when \n is the only defined escape sequence, but as this standard evolves more such sequences may be added (\t comes to mind). Thinking about evolution, it might also be handy to provide some mechanism for including comments and version information.

Should the example in 2.5 read "$E0=<Color>,2" instead of just "E0=<Color>,2"?

Also, I'd like to take this opportunity to say that I like specs written by developers much more than I like specs written by marketing people :beer:

Geiger

  • Newbie
  • *
  • Posts: 38
  • Location: Indianapolis, IN, USA
    • View Profile
    • Geiger's Crypt
Re: Program Design and Data Structure (TextAngel)
« Reply #9 on: April 12, 2011, 09:00:21 am »
I'd love to see some specific examples.

A lot of the string pointers are implemented in LDA statements.  (The first three arguments to creating a PointerRecord here are raw Bank, Range, and Byte addresses.)

Code: [Select]
Rec.Pointer[0] = new PointerRecord(0x02E4D9, 0x02E4D4, 0x02E4D3, 0, true, true);
Rec.Pointer[1] = new PointerRecord(0x02CA8B, 0x02CA86, 0x02CA85, 0, true, true);
Rec.Pointer[2] = new PointerRecord(0x02E5F7, 0x02E5F2, 0x02E5F1, 0, true, true);
Rec.Pointer[3] = new PointerRecord(0x02E85B, 0x02E856, 0x02E855, 0, true, true);
Rec.nRecords = 0x07;

And while I don't know that this is within the scope of your program, Flux also updates some hard-coded indexes so string banks can be moved.  (Seems the code tag removes tabs.)

// Adjust index based Years pointers
if(nID == (byte) StrRecType.Years) {
   switch(G.nRomType) {
   // Japanese
   case 0:
      // Indexed origin
      for(byte i = 0; i < 8; i++) {
         G.WorkingData[0x026944 + i] = i;
      }
      // '???' string
      G.WorkingData[0x0268F0] = 0x08;
      G.WorkingData[0x0268DE] = 0x08;
      // Other hardcoded year indexes
      G.WorkingData[0x0268E5] = 0x01;
      G.WorkingData[0x0268D3] = 0x02;
      break;
   // English
   case 1:
   default:
      // Indexed origin
      for(byte i = 0; i < 8; i++) {
         G.WorkingData[0x0269F0 + i] = i;
      }
      // '???' string
      G.WorkingData[0x02699C] = 0x08;
      G.WorkingData[0x02698A] = 0x08;
      // Other hardcoded year indexes
      G.WorkingData[0x026991] = 0x01;
      G.WorkingData[0x02697F] = 0x02;
      break;
   // Beta
   case 2:
      // Indexed origin
      for(byte i = 0; i < 8; i++) {
         G.WorkingData[0x026DD3 + i] = i;
      }
      // '???' string
      G.WorkingData[0x026D7F] = 0x08;
      // Other hardcoded year indexes
      G.WorkingData[0x026D74] = 0x02;
      break;
   }
}
This is the patent age of new inventions -/- For killing bodies, and for saving souls, -/- All propagated with the best intentions. --Lord Byron

Nightcrawler

  • Hero Member
  • *****
  • Posts: 5734
    • View Profile
    • Nightcrawler's Translation Corporation
Re: Program Design and Data Structure (TextAngel)
« Reply #10 on: April 12, 2011, 11:28:11 am »
Ah, sorry, I should have been more clear with my Pascal example. I meant that the insert file would use artificial end tokens (e.g. /<END>\n\n) which look like C-style terminators rather than explicitly saying e.g.

Gotcha. Agreed. Although, I have done some work in the past where I didn't use line breaks or end tokens because I wanted a a really clean script dump. Insertion was done by matching double line breaks as string end versus single as line break. A bit cumbersome, but did end up working fine. I can't say I'd do it again though, and I won't be supporting such a method for insertion in TextAngel either! Artificial control codes it is!

Quote
On a related tangent, is this still the most recent new table file format documentation?
Yes.

Quote
In section 2.3, how would you feel about allowing \ as a general escape character, as is the case in many other languages? It doesn't matter much when \n is the only defined escape sequence, but as this standard evolves more such sequences may be added (\t comes to mind). Thinking about evolution, it might also be handy to provide some mechanism for including comments and version information.

That has been a hot topic. Fine cases have been made to remove them entirely. Fine cases have been made to do a general escape character. A little history. Originally, there were a few other controls. They ended up becoming unnecessary and were removed. '\t' was one of them. You don't need it. Tab is a valid character that can be used on the text side of your table entries. Why do you need a special control for it? So, only '\n' remains. As such, for parsing simplicity, it was made not to be an escape character after some discussion. The standard needs to be simple enough that amateur coders can take a crack at using it for it to be successful. I know when we added table switching that may have defeated that ideal right there, but simplicity was always in mind nonetheless. Also, as proof, most utilities source code that already supported such a feature also only did it with a simple string replace. Raising the bar on implementation to need escape sequences, especially for only one, seemed counterproductive to the cause of wider use.

So, that's how it ended up the way it is. It's certainly not a bad idea to have an escape, but there is only one and I don't really see the format expanding. It's successor should probably be a new XML based format that goes to spreadsheet or something. But, for that to ever happen and catch on, a suite of associated utilities need to be made such as table makers, editors, and/or converters etc.

Quote
Should the example in 2.5 read "$E0=<Color>,2" instead of just "E0=<Color>,2"?

Also, I'd like to take this opportunity to say that I like specs written by developers much more than I like specs written by marketing people :beer:

Yes. Thanks for catching that mistake.  :beer:


Geiger:
Quote
A lot of the string pointers are implemented in LDA statements.  (The first three arguments to creating a PointerRecord here are raw Bank, Range, and Byte addresses.)

I had originally intended upon supporting this via pointer list and a pointer format option for spacing between the individual pointer bytes (Not found in GUI yet). However, your implementation gave me a good idea. I should instead expand what you can put on a line in the pointer list file to do something similar. I could allow specifying location of each byte of the pointer there looking a bit similar to your PointerRecord call. That's a more intuitive way to do it than I could do via the GUI because it would allow you to have a list of pointers that may use DIFFERENT methods of loading or LDA statements. Example, some pointers may be loaded byte by byte with 8-bit LDA's while others are a 16-bit and 8-bit load etc. This would allow you to group both together if you desire. What do you think of that idea?

Quote
And while I don't know that this is within the scope of your program, Flux also updates some hard-coded indexes so string banks can be moved.  (Seems the code tag removes tabs.)

That looks pretty game specific.  However, I think you can do something similar in POINTER_TREE mode. You'd have a single 'pointer' in the upper tree level that would be that hard coded index to the string bank pointer list. If you have many of those indexes and string banks, you'd need to setup a separate operation for each one with using POINTER_TREE and the associated single index. A little effort required by the user but I believe it is doable that way and far less work than a custom utility. I hadn't thought about a usage scenario like that. Interesting. :)
TransCorp - Over 15 years of community dedication.
Dual Orb 2, Wozz, Emerald Dragon, Tenshi No Uta, Herakles IV SFC/SNES Translations

Geiger

  • Newbie
  • *
  • Posts: 38
  • Location: Indianapolis, IN, USA
    • View Profile
    • Geiger's Crypt
Re: Program Design and Data Structure (TextAngel)
« Reply #11 on: April 12, 2011, 04:58:08 pm »
What do you think of that idea?

Sounds very similar to my setup, which should work well.  Mine is a little more general purpose though, since PointerRecord is used for all record types.  The other three arguments to the new operator up there are, in order:

. Address Adjustment (short) - when the stored pointer does not point directly to the data.  Usually it points to some value in the data of a fixed size record.  It is used in the Treasure data (and probably several other record types that I do not move).
. Bank Present (bool) - the pointer's Bank address is present
. Address Present (bool) - the pointer's Range and Byte addresses are present

The booleans are true for all instances of strings in Chrono Trigger, but other data sometimes only has a Bank or the Range and Byte addresses present.  While local pointers should be no mystery, the reason there is sometimes only a Bank pointer is because the Range and Byte addresses have already been loaded somewhere else.  Frequently through special register byte math.
This is the patent age of new inventions -/- For killing bodies, and for saving souls, -/- All propagated with the best intentions. --Lord Byron

Nightcrawler

  • Hero Member
  • *****
  • Posts: 5734
    • View Profile
    • Nightcrawler's Translation Corporation
Re: Program Design and Data Structure (TextAngel)
« Reply #12 on: April 14, 2011, 03:14:18 pm »
I'm pretty happy with the direction gone in thus far with our Pointer/String data structures in handling basic strings and pointers. However, I'm still trying to wrap my head around this pointer tree deal. I think it's important to support this as I've seen in many times. I have a real world test case I'm using as an example and I'm trying to figure out the best way to represent this to handle dumping and inserting.

32-bit pointer table:
Each pointer in this table is 32-bits and points to a secondary pointer table.

16-bit secondary pointer table:
Each pointer here is a 16-bit relative pointer added to the base 32-bit pointer that points to a 16-bit string pointer table.

16-bit string pointer table:
Each pointer here points to a string.

In my game specific utilities, I'd just load up a unsigned int list of the 32-bit table and loop through. For each loop, I'd put together the 16-bit pointer list and do a nested loop through those. In the inner most loop, I'd load up the string pointer list and finally dump the strings. More or less the same nested process for insertion.

For TextAngel, this doesn't quite work:

1. The number of tree layers should be variable (but probably wouldn't practically be more then say 1-4).
2. Our pointer format is flexible and thus pointers aren't just unsigned int's, they're PointerInfo objects.
3. The pointer objects we've spoken about in this topic are really string pointers and point to string data. Now we have a pointer, which is mostly the same, however it points to pointer data instead of string data and should contain such data.

I just keep getting get hung up over pointers that point to strings and pointers that point to pointers that may point to more pointers that point to strings. A pointer that points to a string and a pointer that points to a pointer are mostly the same object. So, it made me think about having base Pointer object and then inheritance to make a StringPointer object with the added string data and a PointerPointer (sounds like a dumb name huh?) that contains the added target pointer data. So, then you have a master list of PointerPointers (32-bit pointers). Each one of those contains it's own list of PointerPointers (16-bit pointers) which each contain a list of StringPointers.

That just sounds way overcomplicated.  :-\
TransCorp - Over 15 years of community dedication.
Dual Orb 2, Wozz, Emerald Dragon, Tenshi No Uta, Herakles IV SFC/SNES Translations

BRPXQZME

  • Hero Member
  • *****
  • Posts: 4750
  • Location: Centreville, Virginia, USA
  • everything sucks forever
    • View Profile
    • The BRPXQZME Network
Re: Program Design and Data Structure (TextAngel)
« Reply #13 on: April 14, 2011, 04:17:11 pm »
P-P-P-Pointers

Spoiler:
And drawing upon the whole of my college CS education, that is my contribution to this thread. :laugh:
we are in a horrible and deadly danger

Klarth

  • Sr. Member
  • ****
  • Posts: 422
  • Location: Pittsburgh
    • View Profile
Re: Program Design and Data Structure (TextAngel)
« Reply #14 on: April 15, 2011, 12:04:30 am »
Here's a quick mockup (UI code only) I made that may help you think through a few things (I left out some properties).  I envisioned property editing in the TreeView itself, but the PropertyGrid isn't bad.  You could use a "TreeListView" with custom controls since the properties between node types are almost the same.


There are two types of nodes, PointerToPointerTable (1 pointer still considered a table) and PointerTableToStrings.  The Room nodes are all PointerTableToStrings and one is being edited on the screen. 

Process for setting this up:
  • Add a master node - This can be a real pointer or "virtual pointer" to a pointer table.  Virtual pointer meaning the start address is defined by the user so the user doesn't have to find the topmost pointer (usually buried in game code and could be split up between instructions).  This tree is a "contiguous" tree, meaning that there's no data unrelated to the tree inside of the data structure.  This contrasts from SNES games that can have banks strewn throughout the ROM.  Contiguous trees are focused a bit differently in rebuilding as pointer tables will be displaced from their original locations (due to strings being longer or shorter).
  • Setup properties for the master node - Basically we're defining a struct (a real struct definition would be better than the way this UI works), but only including the details we want.  For this sample such as PointerSpacing and PointerStartOffset help us dodge around data.
  • Setup properties for each subnode (could be automated from user defined structures) - Process is the same as above
  • ...Setup anymore subnodes below those
  • Setup properties for each room node (ie. nodes that have pointers to strings) - Similar to above, but we add String Settings for String Type (and other settings you might need) as well.

As far as your data structures go, TreeView can organize it for you.  I'm not sure how strict you are with MVP/MVC, but the TreeView can -be- your data structure or you could write a wrapper to convert back and forth.

So there's a few things missing:
  • Structure definitions - Particularly being able to reference pointer counts which are often in the data instead of making the user define it each time.  You could put the field such as "SceneNode1\Room3[3]:2" meaning the path of the node, the offset (4 bytes from StartAddress) and 2 bytes long for the size so you can read the number of pointers.  Doesn't cover if the struct stores size (in bytes) of the pointer table rather than count though.  You could append a [3]:2S or [3]:2C I guess.
  • Priority for converting back to a tree structure.  We could store SceneNodes in order, then Room Nodes, then the string tables afterwards...or we could store one scenenode, one room node, its strings, one more room node, its strings, and so on.  The UI I presented shows pointer hierarchy well, but not data order.
  • Pointers to data other than strings that may need recalculated...not sure how useful this is.

On a side note, you could easily create an XML Serializer for a game's text (and misc junk data) using this method.  Which is interesting in itself.  Then again, you can think of an Atlas script as a Word document.  Things like string type, fixed length strings, etc are like bold and italics which apply to ranges of text.  Pointers can be like the various formatting options (indentation, spacing, tables) that position text and apply only at one spot.  The difference is that the user in Word never sees the actual metacode behind it, only the text on the screen.

abw

  • Newbie
  • *
  • Posts: 42
    • View Profile
Re: Program Design and Data Structure (TextAngel)
« Reply #15 on: April 17, 2011, 01:51:18 am »
A lot of the string pointers are implemented in LDA statements.

Just to make sure I'm following this: CT loads a byte from here, a byte from there, a byte from somewhere else, glues them together, and the result is a pointer? Hmm.  I guess the only difference from normal is that the bytes aren't continuous. And maybe that also means we should keep track of operations applied to individual bytes of a pointer instead of to the pointer as a whole. Complexity++ :P

Here's something I haven't seen discussed here yet: what kind of support will there be for items referenced from multiple locations (e.g. if pointers at $7410 and $8520 both point to the same pointer table at $9630)?

That has been a hot topic.

No escape and no version works for me, but I still think the ability to add comments would be useful. You could makes notes about which game the table applies to, which version of that game (or even a checksum for a clean ROM), which portions of text within that game (assuming the game uses multiple tables), and any other interesting bits of information (maybe even - gasp - a bookmark/jumplist like I make much use of in Windhex  >:D). Basically it could contain its own metadata.

This is kind of a nitpick since 2.3 doesn't have a full example, but saying
Quote
Sample script string one, line one.<linebreak>
//Sample script string one line two.<end>

//Sample script string two, line one.<linebreak>
//
etc.
instead might better illustrate the behaviour of tokens with a newline followed by comment delimiters, just to make it clear that they aren't quite the same as, say, Cartographer's #COMMENTS option.

(1 pointer still considered a table)

A pointer and a pointer table are close enough that it's very tempting to consider one of them as a special case of the other. How do you define a pointer table? Start/end addresses? Pointer format * number of pointers? Something else? I like thinking of a pointer as just a pointer table of size 1, but I've been playing around with this on paper, and while I haven't finished yet, I'm currently in favour of going the other way - thinking that there is no such thing as a "pointer table". Instead the pointer to the pointer table just points to a regular pointer and then there's a list of other pointers, each defined in terms of the previous pointer (or in terms of the pointer to the pointer table, if that's easier). I think this better reflects what actually happens in-game, and it avoids issues I was having with pointer tables needing to know too much about their contents (e.g. in the from-next-pointer case, where the pointer table points to n things but takes up n+1 pointers' worth of space).

Whichever way you slice it, though, the TreeView UI is much better than any UI I was thinking about :).

I just keep getting get hung up over pointers that point to strings and pointers that point to pointers that may point to more pointers that point to strings.

Assuming you went with polymorphism, is your problem really that you can't tell whether to cast a pointer's pointee as a PointerInfo or a StringInfo? If so, +1 for weakly-typed languages :P. Polymorphism should be able to handle 1 and 3 easily. What kind of operations are you calling on the pointees (so far I've wrapped everything except parsing and post-processing into their constructors, and I suspect parsing won't stay separate for much longer)? For 2, you might find it useful for the current pointer to keep a reference to its parent (maybe also its previous and next siblings?). Then you could make your offset function powerful enough to handle situations like this.

Klarth

  • Sr. Member
  • ****
  • Posts: 422
  • Location: Pittsburgh
    • View Profile
Re: Program Design and Data Structure (TextAngel)
« Reply #16 on: April 17, 2011, 03:23:12 am »
A pointer and a pointer table are close enough that it's very tempting to consider one of them as a special case of the other. How do you define a pointer table? Start/end addresses? Pointer format * number of pointers? Something else? I like thinking of a pointer as just a pointer table of size 1, but I've been playing around with this on paper, and while I haven't finished yet, I'm currently in favour of going the other way - thinking that there is no such thing as a "pointer table". Instead the pointer to the pointer table just points to a regular pointer and then there's a list of other pointers, each defined in terms of the previous pointer (or in terms of the pointer to the pointer table, if that's easier). I think this better reflects what actually happens in-game, and it avoids issues I was having with pointer tables needing to know too much about their contents (e.g. in the from-next-pointer case, where the pointer table points to n things but takes up n+1 pointers' worth of space).

PointerToPointerTable contains no information about how the child pointer table is used, but only how it is accessed and the number of pointers (type of child pointer is determined by each child node).

Take the example of a 24bit "master" pointer to a 16bit "child" pointer table.  This master node would contain information on the address of the 24bit master pointer itself, the method on how to calculate to reach the child pointer table (HiROM and add 0x10000 as an example), and a structural format of the child pointer table (increment 6 bytes after each child pointer because there's other struct data inside of it).  You would then define the child pointer table in the same fashion for every child pointer (you could code an "apply node type to all" feature to automate this).

For systems that require two pointers (to calculate data size is most common), you'd need a tree where nodes can access siblings.  It would be a pretty niche feature.

Nightcrawler

  • Hero Member
  • *****
  • Posts: 5734
    • View Profile
    • Nightcrawler's Translation Corporation
Re: Program Design and Data Structure (TextAngel)
« Reply #17 on: June 24, 2011, 04:16:07 pm »
I ended up making some decent progress with this.
http://www.romhacking.net/forum/index.php/topic,10333.msg186888.html#new

I took some of your ideas and started meshing them with what I have. A suitable and convenient interface is difficult, and the code can get crazy depending on how flexible you make the interface. So, I ended up deciding the skeletal starting structure will be automatically generated from the previous tabs information. In those tabs you can define the number of levels, master pointer information, some default string/pointer formats for all files etc. That will generate something pretty close to what you need automatically. Then, the advanced tab will let you fine tune and tweak the already created levels, tables, and formats. Combined it allows advanced users to do quite a lot and it keeps it somewhat simple and easy to use in other cases. It also reduces how much I'd have to allow someone to do from the advanced tab if they were starting entirely from scratch. Obviously some more controls will be added to the Advanced screen to apply some changes across all child nodes and some other items that are currently lacking.

One small design issue is I'm having some difficulty deciding what to do with groups of strings that have no pointers from say a raw start/stop dump or cluster of pascal strings where the game uses a counting system to count strings rather than point to them. What we have discussed previously with the tree structure is fine for strings that have pointers to them. It just seems a bit more awkward when a pointer table's pointers point to simply a group of strings with no pointers to the individual strings. What I have done so far is still try to set them up the same and fabricating individual string pointers based on their discovered start/stop locations. The 'pointer' to each string then becomes virtual as it won't actually write to anywhere for insertion or come out in the dump. The whole thing seems a little hackish.  :-\
TransCorp - Over 15 years of community dedication.
Dual Orb 2, Wozz, Emerald Dragon, Tenshi No Uta, Herakles IV SFC/SNES Translations

Pennywise

  • Hero Member
  • *****
  • Posts: 1816
  • I'm curious
    • View Profile
    • Yojimbo's Translations
Re: Program Design and Data Structure (TextAngel)
« Reply #18 on: June 24, 2011, 06:13:54 pm »
I'm not really that knowledgeable to speak on programming matters, but those strings don't really have pointers or anything to write to when inserting. So, having pointers for them seems kind of silly. What if you just specified a start and end address for the block and broke them down into individual strings labeled 1, 2 ,3 etc?

Nightcrawler

  • Hero Member
  • *****
  • Posts: 5734
    • View Profile
    • Nightcrawler's Translation Corporation
Re: Program Design and Data Structure (TextAngel)
« Reply #19 on: June 24, 2011, 06:49:27 pm »
I think that would require another type of node, and additional configuration overhead to differentiate the new type and handle it. I thought about that, but it's fundamentally the same as what we already have. The only difference is the pointers are virtual and don't actually go in the dump or be inserted.

Current Node Hierarchy Types are:
File
PointerToPointerTable
PointerTableToStrings
PointerToString
All inherent from Pointer

My smallest node for a string is a PointerToString Node because for every string, you always need the accompanying 'pointer' information. The very least of which is the location of the string, which is the pointer.
TransCorp - Over 15 years of community dedication.
Dual Orb 2, Wozz, Emerald Dragon, Tenshi No Uta, Herakles IV SFC/SNES Translations