A string format is just a pattern, and regular expressions are all about pattern matching. I can spit out some examples from perl (which has excellent regular expression support) off the top of my head, if that's helpful. I'm short a compiler at the moment, so let's just assume I got the syntax right. I'm using them inline, but you could also store the format in StringInfo if you ever needed it later. Also, these examples are not particularly concerned with computational efficiency. My main point here is that I can imagine writing a hundred lines of character-by-character string parsing code that would not accomplish anything more than these few lines using regular expressions, so if you can do the same thing in C#, you might be able to save yourself a lot of time.
Thanks for the expansion. I understand. However, I think I disagree with its application to C-style strings, and I'll tell you why.
C-style: the process is identical whether reading from ROM or from file; only the end tokens change
my @end_tokens = ('<end>', '<really end>', '<end right now>', '<this time I mean it!>'); # these would really come from the Table object
my $combined_end_token = join('|', @end_tokens); # $combined_end_token is now '<end>|<really end>|<end right now>|<this time I mean it!>'
my ($string) = (substr($data, $start_address) =~ /^(.*?(?:$combined_end_token))/s); # (?:...) keeps the anchor applying to every alternative; /s lets . match newline bytes
That last line looks complicated but really isn't - starting from $start_address, it searches through $data until it finds the first occurrence of any of the end tokens, then stores whatever it found in $string.
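For comparison, here is a rough Python equivalent of that one-liner (the tokens are the same hypothetical ones as above; `re.escape` guards against metacharacters in the tokens, and `re.DOTALL` lets `.` cross newline bytes in binary data):

```python
import re

# hypothetical end tokens, mirroring the Perl example above
end_tokens = ['<end>', '<really end>', '<end right now>', '<this time I mean it!>']
combined = '|'.join(re.escape(t) for t in end_tokens)

def extract_string(data, start_address):
    """Return the first string (including its end token) at start_address, or None."""
    m = re.match('(.*?(?:%s))' % combined, data[start_address:], re.DOTALL)
    return m.group(1) if m else None
```

For example, `extract_string('abc<end>def<really end>', 0)` stops at the first token and returns `'abc<end>'`.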
I agree this is true for the insertion direction; however, it does not work in the dumping direction. You cannot parse based on your hex end tokens, because they may appear elsewhere in the string and NOT be an end token. Take two-byte Kanji with a one-byte end token: the end-token byte may appear as part of a Kanji character. Additionally, you have things like linked entries, where the end-token hex may appear as parameter bytes. You can't know the hex is actually an end token without parsing the data up to that point; you need to know the context in which it appears.
Actually, that fact has been a pain for me with program flow and design simplicity. I cannot extract into strings without parsing. Then, since I'm already going to have to parse, I end up having to do the string separation and hex-to-text conversion in the same step. The text encoder ends up having to yell up to the higher-ups when it has decoded an end token, so they can do the string separation, reset the stream variables, and so on. It would be much simpler if the data could be parsed into strings ahead of time without that intimate knowledge.
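To make the dumping problem concrete, here's a toy Python sketch (the byte values and the "any byte >= 0x80 starts a two-byte sequence" rule are invented for illustration): the end-token byte appears inside a two-byte Kanji sequence, so a naive search for that byte splits too early, while a sequential decode that respects character widths finds the real end.

```python
END = 0xFF          # hypothetical one-byte end token
KANJI_LEAD = 0x80   # hypothetical: any byte >= 0x80 starts a two-byte sequence

data = bytes([0x41, 0x80, 0xFF, 0x42, 0xFF])  # 'A', <kanji 80 FF>, 'B', <end>

# naive: first occurrence of the end-token byte -- wrong, it's inside the kanji
naive_end = data.index(END)

def find_end(data):
    """Step through the stream respecting character widths, so the end-token
    byte is only treated as an end token when it appears in context."""
    i = 0
    while i < len(data):
        if data[i] == END:
            return i
        i += 2 if data[i] >= KANJI_LEAD else 1
    return None

real_end = find_end(data)
```

Here `naive_end` lands on the second byte of the Kanji (offset 2), while `real_end` correctly finds the end token at offset 4.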
Pascal: this one only works for dumping... people write Pascal inserts in C-style, no? If not, there are other ways to do this
my ($pascal_length, $pascal_endian, $include_length) = (3, 'big', 1); # these would really be user-specified (also: 1 is a true value in perl)
my $length_field = substr($data, $start_address, $pascal_length);
my $payload_length = hex2dec($length_field, $pascal_endian, $include_length);
my ($string) = (substr($data, $start_address) =~ /^(.{$pascal_length}.{$payload_length})/s); # can't call hex2dec inside a quantifier, so grab the length field first
I'm pretty sure no common programming language comes with a built-in multibyte, endian-aware radix converter, so you'd have to write your own hex2dec function for this to work, but that wouldn't take long.
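Indeed it wouldn't take long. A Python sketch of what such a hex2dec might look like (the name and the include_length parameter are taken from the snippet above; the interpretation that the stored count includes the length field itself is my assumption):

```python
def hex2dec(raw, endian='big', include_length=False):
    """Interpret raw bytes as an unsigned integer in the given byte order.
    If include_length is true, assume the stored count includes the length
    field itself, and subtract its size to get the payload length."""
    ordered = raw if endian == 'big' else raw[::-1]
    value = 0
    for b in ordered:
        value = (value << 8) | b
    return value - len(raw) if include_length else value
```

So `hex2dec(bytes([0x01, 0x02]), 'big')` gives 258, and the same bytes read little-endian give 513.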
There is still a need to insert strings in Pascal style. Depending on your game, it can be a monumental task to recode the engine to use C-style strings and pointers rather than, say, Pascal and string counting; it can be much simpler just to insert in Pascal. I certainly would not discount or omit the ability to insert in Pascal: the utility should be able to both dump and insert with all of its supported string types. Lastly, I actually have a project myself where I want this Pascal ability.
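For what it's worth, the Pascal insertion direction is short too. A minimal Python sketch, using the same hypothetical field width, endianness, and include_length convention as the dumping snippet above:

```python
def insert_pascal(payload, pascal_length=3, endian='big', include_length=False):
    """Prefix payload bytes with a fixed-width length field.
    If include_length is true, the count covers the length field itself."""
    count = len(payload) + (pascal_length if include_length else 0)
    prefix = count.to_bytes(pascal_length, endian)
    return prefix + payload
```

For example, `insert_pascal(b'ABC')` produces a 3-byte big-endian count of 3 followed by the payload.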
As a general comment, when working with tree-like data structures...
How you get the user to specify an arbitrary pointer tree structure is another question entirely, of course, but at least you'd be able to support it if they did manage to specify one.
Nice idea there with the polymorphism; I will think about it further. The hangup is that there are both pointers to strings and pointers to pointers. To start, to support the trees I know of, all that is needed from the user is the pointer format and the start/end for each tree level. I'm sure that may be limited, but this is experimental and I'm just dipping my toes in to see how useful it will be. Primarily it will cover what I am familiar with.
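On the pointers-to-strings vs. pointers-to-pointers hangup, the polymorphic version might look something like this Python sketch (class names and the read_string callback are mine, not from the project): each node either resolves to a string or to another level of pointers, so arbitrary tree depths fall out of plain recursion.

```python
class StringPointer:
    """Leaf: a pointer that resolves directly to a string."""
    def __init__(self, address):
        self.address = address
    def dump(self, read_string):
        return [read_string(self.address)]

class PointerTable:
    """Branch: a pointer-to-pointers level whose children may be string
    pointers or further tables; dumping is just a recursive walk."""
    def __init__(self, children):
        self.children = children
    def dump(self, read_string):
        out = []
        for child in self.children:
            out.extend(child.dump(read_string))
        return out

# toy usage: a two-level tree over a fake address->string lookup
strings = {0: 'HERO', 4: 'SLIME'}
tree = PointerTable([StringPointer(0), PointerTable([StringPointer(4)])])
result = tree.dump(strings.get)
```

The nice part is that the dumping code never needs to know how deep the tree goes.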
How much progress have you made on mapping out program flow apart from the GUI (which I see you've already got) and file I/O (which is probably boring)?
The program is functional in its rawest form; I've used it on a few projects of mine. It needs lots of polish and several areas fleshed out. It also needs some test code written, since it supports many possibilities: a small change for case three may break case twelve without decent regression testing. Since my design has been in constant flux, I haven't done a lot of that stuff.
As I mentioned, there's Input, Pipeline, and Output. I've tried to do the whole thing with an MVP design pattern. I'm good at MVC on the web with a database, but I'm not sure how to properly handle a multi-tab desktop application like this. So I have a single large View and a single Presenter that handle the whole user interface end. Upon trying to dump or insert, the data is handed off to the Model in a suitable format by the Presenter.
TextAngelProject - This is a master object that ends up getting all data from all tabs into organized Input, Output, and helper objects, including global program options. It can save and load itself from XML. It handles initiating the pipeline process and tying it together with input and output.
Currently the input object has a set of methods such as RawDump, RawInsert, PointerListDump, PointerListInsert, PointerDump, PointerInsert, etc. Each takes care of what it needs to do for that operation. The result of each is the StringInfo/PointerInfo data structure we spoke about. That's passed on to the output object for writing to file; it just traverses the data structure and turns it into the final output, which in the dump case is a UTF-8 text file. I haven't implemented any serious post-processing yet.
Simplified version of method TextAngelProject.DumpText():
public void DumpText()
{
    int block = 1;
    foreach (DataFileInfo f in inputsource)
    {
        Dictionary<Pointer, StringInfo> output;
        if (f.PointerMode == DataFileInfo.PointerHandlingType.NO_POINTERS)
        {
            output = f.RawDump(tablefiles, encodingformat, stringformat);
        }
        else if (f.PointerMode == DataFileInfo.PointerHandlingType.POINTER_LIST)
        {
            output = f.PointerListDump(tablefiles, encodingformat, stringformat, pointerformat);
        }
        else
        {
            // without a final branch, 'output' may be unassigned and this won't compile
            throw new NotSupportedException("Unhandled pointer mode: " + f.PointerMode);
        }
        outputsource.WriteFile(output, f, block, encodingformat);
        block++;
    }
}
It's a bit bastardized right now from the changing design, and there are no post-processing options yet, but that's what I've got at present. The idea was that the input should know everything it needs to dump or insert itself into our standardized format (the dictionary in the code above). Then it would be post-processed and passed to the output object for final output. There's an awful lot of logic in the input object, and I don't know if it makes sense to have it all there; I've been looking at extracting some of it out. I blur my own definition of what my input object is: I sort of treat it as a file sometimes, and other times I treat it more as any given input operation.
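One way to un-blur that file-vs-operation split is to pull each Dump/Insert pair out into its own small operation object and let the file object stay pure data. A Python-flavoured sketch of the idea, with all names and the stubbed return values hypothetical:

```python
class RawDumpOperation:
    def run(self, f, tables, encoding, stringformat):
        # would do the character-by-character decode; stubbed for the sketch
        return {'raw': f}

class PointerListDumpOperation:
    def run(self, f, tables, encoding, stringformat):
        # would follow the pointer list; stubbed for the sketch
        return {'pointer_list': f}

# the project object just looks up the operation for each file's mode
OPERATIONS = {
    'NO_POINTERS': RawDumpOperation(),
    'POINTER_LIST': PointerListDumpOperation(),
}

def dump_text(files, tables, encoding, stringformat):
    """files is a sequence of (file, pointer_mode) pairs."""
    for f, mode in files:
        yield OPERATIONS[mode].run(f, tables, encoding, stringformat)

results = list(dump_text([('a', 'NO_POINTERS'), ('b', 'POINTER_LIST')],
                         None, None, None))
```

Adding a new string type then means adding one operation class and one table entry, instead of growing the input object's if/else chain.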
Anyway, I wouldn't mind suggestions here either. I can always create working code to get the job done, but I struggle with smart, flexible design. The only way to get better is to keep trying and to keep exposing yourself to others' code and ideas to build upon your own.
