News: 11 March 2016 - Forum Rules
Current Moderators - DarkSol, KingMike, MathOnNapkins, Azkadellia, Danke

Author Topic: Algorithm to replace "overlapping" tags in a script  (Read 1270 times)

USC

  • Submission Reviewer
  • Sr. Member
  • *****
  • Posts: 342
  • Obviously Outdated
    • View Profile
Algorithm to replace "overlapping" tags in a script
« on: June 13, 2021, 09:55:34 am »
Hello! I'm building a script importer for a group working on a Dreamcast project.
To save space, the programmer has implemented MTE using tags - for instance, "[ a ]" in the script will convert to a single short value rather than 3, saving a good amount of space.

I'm trying to make the conversion process seamless - when the translators/reviewers open the script, I replace all the tags with the actual words, then when they save, I convert it back.

What I've realized is that my current code does not produce the smallest file. For instance, if we have a line "I was ", my code converts it to "[I ]wa[s ]" (4 shorts) when the smallest value would be "I[ was ]" (2 shorts).

Can anyone recommend an algorithm I can use to optimize the save conversion? My current approach is:

For each letter 'x' in original_string:
   Get a substring from 'x' to 'x' + longest tag length
   while substring > 0
      if taglist contains substring
            add taglist[substring] to return_string
            x += length of substring
            break

      else
            substring = substring - 1


    if no tag matches
         add x to return_string
         x++

         

Cyneprepou4uk

  • Hero Member
  • *****
  • Posts: 662
  • I am the baldest romhacker
    • View Profile
Re: Algorithm to replace "overlapping" tags in a script
« Reply #1 on: June 13, 2021, 10:25:35 am »
If I understood all of this correctly, first you need to sort your taglist array from longest to shortest tags (or the other way around, depending how the search works in your language), and then do the searching. Because it looks like your script gets the very first match it has found and settles with it. Or you can collect all found matches and choose the longest.

abw

  • Hero Member
  • *****
  • Posts: 585
    • View Profile
Re: Algorithm to replace "overlapping" tags in a script
« Reply #2 on: June 13, 2021, 04:07:58 pm »
It looks like you're using a longest prefix algorithm for encoding, which as you've noticed produces sub-optimal results. You might like giving abcde a try; among other things, it implements an A* graph search algorithm for insertion that guarantees an optimal encoding, assuming one exists.

Once you have an optimal insertion algorithm in place, I probably wouldn't bother storing the MTE markers in the script at all, which makes them easier for everybody to work with - just use the plain text in your script files and let the inserter choose the best binary to use.

USC

  • Submission Reviewer
  • Sr. Member
  • *****
  • Posts: 342
  • Obviously Outdated
    • View Profile
Re: Algorithm to replace "overlapping" tags in a script
« Reply #3 on: June 13, 2021, 04:46:20 pm »
Cyneprepou4uk: Thanks, that gave me an idea! The other issue that I have to deal with is overlapping tags. E.g. There's a tag for [that ] *and* a tag for [t ], so if I just go through the list and replace all, we get invalid tags like "[tha[t ]]."

I was trying to avoid that, but I think a better solution is to just replace-all (starting from the longest to the shortest), and then walk-through the string to check for errors. So if I come across a "[" and then another "[" before I hit the closing "]", I run a function to remove the next 'n' [ & ] I see to fix it. I think that will do it.

abw: Thanks for the suggestion! I'll definitely look into that for future projects. The script importer I wrote is actually working on custom file formats for the game, so I think trying to integrate abcde into that workflow would be more challenging then the idea Cyneprepou4uk inspired. I'll let you guys know how it turns out!


[Unknown]

  • Jr. Member
  • **
  • Posts: 61
    • View Profile
    • PPSSPP
Re: Algorithm to replace "overlapping" tags in a script
« Reply #4 on: June 14, 2021, 12:04:06 am »
This is essentially a compression algorithm case, and it sounds like you're working with smaller strings.

In that case, I'd recommend working backwards over the string to find the best match.  Like this:

Code: [Select]
array tags = new uint16_t[string.length];
array scores = new size_t[string.length];
for (i = string.length - 1; i >= 0; --i) {
    if (i == string.length - 1) {
        // If we end up here, we have to encode as a single short.
        tags[i] = string[i];
        scores[i] = 2;
        continue;
    }

    size_t best = 2 + scores[i + 1];
    for (tag : tags) {
        if (tag[] == string[i .. i + tag.length]) {
            size_t score = 2 + lengths[i + tag.length];
            if (score < best) {
                best = score;
                tags[i] = tag.id;
            }
        }
    }
    scores[i] = best;
}

Basically, you go backwards and find the optimal encoding one more letter at a time.  This way, you can look into the future, and decide whether a shorter tag is more optimal, because you know the future benefit or cost of that choice.

Looking at your example "I was ", this would produce these scores:

" " = 2(" ")
"s " = 2("s") + 2(" ")
"as " = 2("a") + 4("s ")
"was " = 2"w") + 6("as ")
" was " = 2(" was ") + 0(end of string)
"I was " = 2("I") + 2(" was ")

Ultimately, the value at scores[0] gives you the total size required for the optimal encoding of the entire string.

You then move forward, and start by encoding the first tag (at [ 0 ]), then jump ahead by that tag's length and encode the next tag, etc.

This is slow and exhaustive, but it mostly costs memory.  Assuming you don't actually loop over all tags and instead group tags by length, it can be fairly fast especially for smaller strings.

-[Unknown]
« Last Edit: June 14, 2021, 12:11:14 am by [Unknown] »

abw

  • Hero Member
  • *****
  • Posts: 585
    • View Profile
Re: Algorithm to replace "overlapping" tags in a script
« Reply #5 on: June 17, 2021, 09:08:22 pm »
abw: Thanks for the suggestion! I'll definitely look into that for future projects. The script importer I wrote is actually working on custom file formats for the game, so I think trying to integrate abcde into that workflow would be more challenging then the idea Cyneprepou4uk inspired. I'll let you guys know how it turns out!
You're welcome! abcde has some fairly powerful insertion capabilities (even more so than Atlas, which was already pretty good), so there's a decent chance it might be able to handle whatever format you're working with. :beer:

In that case, I'd recommend working backwards over the string to find the best match.  Like this:
+1

Even as a text-based approach for minimizing the number of tokens, this is a stronger algorithm than longest prefix, and unlike a purely text-based approach, this uses the binary length to decide which tokenization is shorter, so it works just as well for variable-length encodings as fixed-length encodings. As long as you don't need to handle things like table switching or encodings where some codes are binary prefixes of other codes, this is an excellent choice - I've used it myself in the past, and in practice I've found that it isn't much slower than longest-prefix since usually there are only a couple of matches to check at each position.