Tablelib ported to .NET

Started by Malias, November 28, 2015, 11:26:35 PM

Malias

So, because I was doing script work in C# and thought it might be useful to others doing the same, I rewrote Klarth's C++ library in C#.  It can be found here: https://github.com/greenjaed/tablelib.NET.

Along the way, I cleaned up the code and extended the functionality of both TextEncoder and TextDecoder.  The documentation is still in development, but the IntelliSense descriptions should be complete.  Let me know what you think.
The great achievement is to lose one's reason for no reason, and to let my lady know that if I can do this without cause, what should I do if there were cause?
     ~Don Quixote~

BlackDog61

Any chance you'd create an equivalent of sjis_dump for any table (instead of just SJIS), since you're at it? ;D
I'd gladly use it, and I'm feeling lazy.  8)

Malias

Well, that was simple enough (once I got around to it).  I believe this is what you're asking for: TextScan.  Again, feedback is always appreciated.

BlackDog61

Thanks! The command line options are exactly what I had in mind.
I tried it on a 3.7MB file and I stopped it after 13 min. It hadn't finished. :o (Last dump was at 0x29AD5.) Any idea why it would take so much time? I'm using similar parameters to the ones I use with sjis_dump (SJIS-encoded table, minimum string length of 10). sjis_dump consistently finishes in less than 2 min on my system.
Also, why always dump in UTF-8 even if the table isn't encoded that way? (Just being curious. ;))
Well, thanks for your work.

Malias

Quote from: BlackDog61 on December 06, 2015, 09:06:09 AM
Thanks! The command line options are exactly what I had in mind.
I tried it on a 3.7MB file and I stopped it after 13 min. It hadn't finished. :o (Last dump was at 0x29AD5.) Any idea why it would take so much time? I'm using similar parameters to the ones I use with sjis_dump (SJIS-encoded table, minimum string length of 10). sjis_dump consistently finishes in less than 2 min on my system.
Yeah, I noticed that when I first ran it.  I've optimized the code to run faster.  It should now finish in less than a minute.

Quote from: BlackDog61 on December 06, 2015, 09:06:09 AM
Also, why always dump in UTF-8 even if the table isn't encoded that way? (Just being curious. ;))
Well, thanks for your work.
You're welcome  :).  The dump encoding was a minor oversight.  The output file now uses the same encoding as the table file.
I've updated the download file.  If you download and run it again, you should notice a marked improvement.

BlackDog61

Quote from: Malias on December 07, 2015, 02:27:20 AM
I've updated the download file.  If you download and run it again, you should notice a marked improvement.
OK, that's better.
I have to confess that sjis_dump still does that in barely 2 seconds :) but that's an improvement.
Are you going to post this with your lib in the utilities section?

Malias

Quote from: BlackDog61 on December 07, 2015, 04:02:16 PM
OK, that's better.
I have to confess that sjis_dump still does that in barely 2 seconds :) but that's an improvement.
Are you going to post this with your lib in the utilities section?
Well, that's to be expected.  It's a lot easier and faster to find strings with a predefined encoding than with a custom one provided by the user.

I've just submitted TextScan as a utility and will submit tablelib.NET once I've finalized things and solidified the documentation.

BlackDog61

Thank you!
I'll be advertising it to newbs when they have relevant questions. ;D

tryphon

Quote from: Malias on December 08, 2015, 04:12:54 PM
Well, that's to be expected.  It's a lot easier and faster to find strings with a predefined encoding than with a custom one provided by the user.

Sorry, I don't know what this software is supposed to do, but I don't understand why it's easier. Could you elaborate?

Malias

It's easier because, with a predefined encoding, the program doesn't have to consult a list of definitions when evaluating a character, whereas with a table-defined encoding it does.

Take Shift JIS, for example: if the first byte is within the normal ASCII range, then it's a normal ASCII character.  If the first byte falls within a certain subset of values and the second byte is within a certain range, then the character is a JIS character.  So checking whether a character is a Shift JIS character is merely a matter of checking whether the first byte, and possibly the second, are within a certain value range.
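
To make the range check concrete, here is a minimal sketch in Python. It is not the code of sjis_dump or tablelib.NET; the function name `sjis_char_length` is made up for illustration, and the byte ranges are the commonly documented Shift JIS ones (the half-width katakana range 0xA1-0xDF is an addition not mentioned above).

```python
def sjis_char_length(data: bytes, pos: int) -> int:
    """Return 1 or 2 if a Shift JIS character starts at pos, else 0.

    Hypothetical helper for illustration only.
    """
    first = data[pos]
    # Single-byte characters: the ASCII range, plus half-width katakana.
    if first <= 0x7F or 0xA1 <= first <= 0xDF:
        return 1
    # Lead byte of a two-byte character.
    if 0x81 <= first <= 0x9F or 0xE0 <= first <= 0xEF:
        if pos + 1 < len(data):
            second = data[pos + 1]
            # Trail byte: 0x40-0xFC, excluding 0x7F.
            if 0x40 <= second <= 0xFC and second != 0x7F:
                return 2
    return 0


if __name__ == "__main__":
    sample = "あA".encode("shift_jis")  # bytes 82 A0 41
    print(sjis_char_length(sample, 0), sjis_char_length(sample, 2))
```

Every check is a fixed number of comparisons, no matter how many characters the encoding defines. A real dumper would probably also reject control characters below 0x20, which this sketch does not do.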

Encodings defined by tables are a lot more free-form.  The character definitions can be any value and can be any number of bytes long; the table can have gaps where no characters are defined; and most importantly, none of this is known beforehand.  So, to see if a certain sequence of bytes is a character, the program has to look it up in the table.  If it doesn't find a match, it checks progressively shorter sequences of bytes.  And it has to do this for every single character.
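
As a rough sketch of that lookup (in Python, and not the actual tablelib.NET code), assume the table has been loaded into a dict mapping byte sequences to text. The name `match_longest` is hypothetical, and the dict stands in for whatever lookup structure the real library uses.

```python
def match_longest(table, data, pos):
    """Return (text, byte_count) for the longest table entry at pos, or None.

    `table` maps bytes -> str. Hypothetical helper for illustration only.
    """
    max_len = max(map(len, table), default=0)
    # Try the longest possible sequence first, then progressively shorter ones.
    for length in range(min(max_len, len(data) - pos), 0, -1):
        text = table.get(data[pos:pos + length])
        if text is not None:
            return text, length
    return None  # no table entry starts at this position


if __name__ == "__main__":
    table = {b"\x01": "A", b"\x02": "B", b"\x01\x02": "C"}
    print(match_longest(table, bytes([0x01, 0x02, 0x03]), 0))
```

Unlike a fixed range check, this needs up to one table lookup per candidate length at every position in the file.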

So, for those that are familiar with Big O notation:

  • Shift jis character identification (and any other regular encoding): Θ(1)
  • Table character identification: O(m log n), where m is the length in bytes of the longest defined sequence and n is the number of defined characters

tryphon

Just to clarify, suppose the encoding is:

01: A
02: B
0102: C

And you have to analyze the following bytes:

01 02 03

What's the result of your algo?

nothing (because of the 3)?
"AB"
"C"
or the two above?
other? (["A", "B", "AB", "C"] for example)

Or would the encoding be rejected because of ambiguity?

henke37

He said to use progressively smaller table entries. So the longest entry is tried first. It matches two octets. Then there is a match failure for the remaining octet. He did not specify how to handle match failure.

Malias

Quote from: tryphon on December 10, 2015, 04:31:15 AM
Just to clarify, suppose the encoding is:

01: A
02: B
0102: C

And you have to analyze the following bytes:

01 02 03

What's the result of your algo?

nothing (because of the 3)?
"AB"
"C"
or the two above?
other? (["A", "B", "AB", "C"] for example)

Or would the encoding be rejected because of ambiguity?

Tablelib.NET's TextDecoder, which TextScan uses, can be configured to either terminate a string on an undefined character or write out the hex value of the byte.  TextScan itself is configured to terminate a string on an undefined character.  So, in your scenario, it would find the string "C": the longest entry, 0102, is matched first, and the undefined 03 then ends the string.
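
For anyone who wants to trace that by hand, here is a small Python sketch of the terminate-on-undefined behaviour. It is only an approximation under stated assumptions, not TextScan's code: `scan_strings` and `min_chars` are hypothetical names, the table is a plain dict of bytes to text, and skipping exactly one byte after a failed match is an assumption.

```python
def scan_strings(table, data, min_chars=1):
    """Return [(offset, text)] for each run of table-defined characters.

    A run ends at the first byte with no table match (assumed behaviour).
    """
    max_len = max(map(len, table), default=0)
    results, chars, start, pos = [], [], 0, 0
    while pos < len(data):
        # Longest match first, then progressively shorter sequences.
        for length in range(min(max_len, len(data) - pos), 0, -1):
            text = table.get(data[pos:pos + length])
            if text is not None:
                break
        else:
            text, length = None, 1  # undefined byte: skip it
        if text is None:
            # Terminate the current string, keeping it if long enough.
            if len(chars) >= min_chars:
                results.append((start, "".join(chars)))
            chars = []
        else:
            if not chars:
                start = pos  # remember where this string began
            chars.append(text)
        pos += length
    if len(chars) >= min_chars:
        results.append((start, "".join(chars)))
    return results


if __name__ == "__main__":
    table = {b"\x01": "A", b"\x02": "B", b"\x01\x02": "C"}
    print(scan_strings(table, bytes([0x01, 0x02, 0x03])))
```

With the table and bytes from the question, this yields a single string "C" at offset 0, since 0102 is tried before 01 and the trailing 03 has no entry.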