
Author Topic: kanji recognition  (Read 3791 times)

tryphon

  • Hero Member
  • *****
  • Posts: 722
kanji recognition
« on: December 25, 2014, 10:06:38 am »
Hello,

I've dumped this bitmap and would like to get an ordered list of the kanji it contains:



Thanks in advance to anyone who can help.

Pyriel

  • Jr. Member
  • **
  • Posts: 23
Re: kanji recognition
« Reply #1 on: December 25, 2014, 01:17:07 pm »
Chunk it out and use an OCR program.  There are several available for free, and Google used to have one (online) that worked pretty well.  Chances are that will only get you 95% of the characters at best, but you're probably more likely to find help for 5-6 unknowns than a whole sheet.

tryphon

Re: kanji recognition
« Reply #2 on: December 25, 2014, 02:08:38 pm »
I've asked about OCR at least twice on this board, and the answer was always that OCR programs suck at kanji recognition.

i2ocr recognized... nothing.
nhocr gave:
Code:
霧鑓瀞遜藩
瀞.∴
瀞.脅
.......;
欝難.

.
麟笛溌習縫瀞縫霧
鱗竣
..A....t....
霧電.霧
蟹懇麟擬
鎌態襲襲

畳骨
Seems there's still a lot of work.

If you know of a good OCR program to try, please name it.

As for chunking them: there are more than 400 kanji on this sheet. If it takes, say, 30 seconds to upload one, get the result and copy it to the right spot, it'll take 200 minutes to process the sheet, and that time would be better spent trying to disassemble the game's compression routine (by the way, it's Surging Aura MD).

mz

  • Sr. Member
  • ****
  • Posts: 418
  • Whore
Re: kanji recognition
« Reply #3 on: December 25, 2014, 02:16:28 pm »
Can't you dump that image again with some black pixels between characters? Maybe that could make it easier for OCR programs to parse (and a bit easier to read for humans too.)
There has to be a better life.

FAST6191

  • Hero Member
  • *****
  • Posts: 2422
Re: kanji recognition
« Reply #4 on: December 25, 2014, 02:45:15 pm »
Quote from: mz
Can't you dump that image again with some black pixels between characters? Maybe that could make it easier for OCR programs to parse (and a bit easier to read for humans too.)

I was bored, so I did that. I also scaled it (just nearest neighbour) and made an alt colour version. I was not, however, bored enough to add the vertical spacing as well, though I guess I could have delved into the dark arts of ImageMagick.

Anyway


http://postimg.org/image/dwklplz81/
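
A minimal sketch of the spacing step discussed above, covering both the horizontal and the vertical gaps. It assumes the font dump is already in memory as a list of pixel rows and that every glyph sits in a fixed-size cell; the function name `add_gutters` and its parameters are hypothetical, and this is not the method anyone in the thread actually used (the posts mention GIMP and ImageMagick).

```python
def add_gutters(bitmap, cell_w, cell_h, gap=1, background=0):
    """Return a copy of `bitmap` with `gap` background pixels between cells.

    `bitmap` is a list of rows, each a list of pixel values. Its width and
    height must be exact multiples of `cell_w` and `cell_h`.
    """
    height, width = len(bitmap), len(bitmap[0])
    if width % cell_w or height % cell_h:
        raise ValueError("bitmap size is not a multiple of the cell size")

    # Width of every output row once the vertical gutters are in place.
    new_width = width + gap * (width // cell_w - 1)

    def spread_row(row):
        # Copy one pixel row, inserting a gutter before every cell but the first.
        out = []
        for x in range(0, width, cell_w):
            if x:
                out.extend([background] * gap)
            out.extend(row[x:x + cell_w])
        return out

    result = []
    for y in range(0, height, cell_h):
        if y:
            # Horizontal gutter between two rows of cells.
            result.extend([background] * new_width for _ in range(gap))
        result.extend(spread_row(row) for row in bitmap[y:y + cell_h])
    return result


# Hypothetical 4x4 sheet holding four 2x2 glyph cells, all "ink" (1).
sheet = [[1] * 4 for _ in range(4)]
for row in add_gutters(sheet, cell_w=2, cell_h=2, gap=1):
    print("".join("#" if px else "." for px in row))
```

The background value defaults to 0 on the assumption that 0 is the black background colour in the dump; pass a different value if the palette is inverted.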

goldenband

  • Sr. Member
  • ****
  • Posts: 286
Re: kanji recognition
« Reply #5 on: December 25, 2014, 03:25:44 pm »
I should have this done in a few minutes.

EDIT: OK, here it is. I added whitespace (or, really, blackspace) between the characters in GIMP, but even then the OCR I used (NHocr) was very finicky, and I ended up uploading no more than two rows and 1/2 column (i.e. 7-9 characters) at a time.

In proofreading this, I spotted about 16-18 errors. I've corrected a few, but there are a bunch left, which I've marked in bold red. All this could use a going-over by a fluent Japanese speaker/kanji reader, which I'm certainly not.

EDIT #2: Corrected the first one in row D.

1: 愛悪安闇衣遣域運衛炎王屋下家会壊
2: 海界外活官間気義救強恐軍撃結月剣
3: 建見軒鍵古光囗広港行国婚魂座砦罪
4: 殺使士姉子師死紙時式疾者邪守手酒
5: 呪周襲宿出所女勝将小輔堤心神臣身
6: 人水世正生精聖跡説宣戦想装賊退
7: 代大男知地中仲町長典天伝東逃動
8: 道日入年配買泊発妃備姫父部風復
9: 物兵平宝法北魔娘名命目勇理旅
A: 力老和話                  放盗船
B: 元親失抱少修航牢声姿獄気防御護
C: 々扉      攻                        青龍刀
D: 波鮮斬曲刃丸白狼輝止切幻斧鬼紫鎚
E: 爆杖弓明吠翔槍爪布服皮鎧鉄赤
F: 招帽木楯破指輪幸腕空胸飾符套売右
G: 雨何火向左山事上西村南方薬様用石
H: 竜牙雷太陽真紅鉛鋼茨硬革騎仮面全
I: 快書固念属性岬舞五玉敗決着詠唱
J: 体味方敵逆回復補助思陰状態髪貴私
K: 僧期持片腹痛希望形消再来実専短持
L: 両半重幅帯鋭能昇速度特定確率避無
M: 遠距離可武器円秘謎軽量堅作同付込
N: 高削除象徴脱巻病治不闘記念兜塔最
O: 具前呼未輿集野言祝砲母化新
« Last Edit: December 25, 2014, 05:18:55 pm by goldenband »
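
For a rough sense of what batching buys, here is a small sketch that splits a glyph grid into upload-sized boxes and estimates the total time. The 24 x 16 grid is an assumption taken from the shape of the transcribed table (rows 1-9 and A-O), the 8-glyph batch is an assumption based on the "7-9 characters" figure, and the 30 seconds per upload is tryphon's estimate from earlier in the thread. The function names are hypothetical.

```python
def plan_batches(n_rows, n_cols, rows_per_batch=1, cols_per_batch=8):
    """Return (row_start, row_end, col_start, col_end) boxes in cell units.

    End indices are exclusive; boxes at the edges are clipped to the grid.
    """
    batches = []
    for r in range(0, n_rows, rows_per_batch):
        for c in range(0, n_cols, cols_per_batch):
            batches.append((r, min(r + rows_per_batch, n_rows),
                            c, min(c + cols_per_batch, n_cols)))
    return batches


def estimate_minutes(n_uploads, seconds_per_upload=30):
    """Total time in minutes for a given number of uploads."""
    return n_uploads * seconds_per_upload / 60


batches = plan_batches(n_rows=24, n_cols=16)
print(len(batches), "uploads, about", estimate_minutes(len(batches)), "minutes")
print("one glyph per upload:", estimate_minutes(400), "minutes")
```

The estimate ignores the proofreading pass, which the posts suggest is where most of the effort goes.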

Seihen

  • Sr. Member
  • ****
  • Posts: 405
Re: kanji recognition
« Reply #6 on: December 25, 2014, 07:51:58 pm »
Did a manual check of the table kindly put together by goldenband. Red text marks the ones that I corrected.

1: 愛悪安闇衣域運衛炎王屋下家会壊
2: 海界外活官間気義救強恐軍撃結月剣
3: 建見軒鍵古光囗広港行国婚魂座砦罪
4: 殺使士姉子師死紙時式疾者邪守手酒
5: 呪周襲宿出所女勝将小城場心神臣身
6: 人水世正生精聖跡説宣戦想装賊退
7: 代大男知地中仲町長撤典天伝東逃動
8: 道日入年配買泊発妃備姫父部風復
9: 物兵平宝法北魔妹娘名命目勇理旅
A: 力老和話『』「」≪≫盗船渡金
B: 元親失抱少修航牢声姿獄気防御護
C: 々扉⁉‼攻---------青龍刀
D: 波鮮斬曲刃丸白狼輝止切幻斧鬼紫鎚
E: 爆裂棒杖弓明吠翔槍爪布服皮鎧鉄赤
F: 招帽木楯破指輪幸腕空胸飾符套売右
G: 雨何火向左山事上西村南方薬様用石
H: 竜牙雷太陽真紅鉛鋼茨硬革騎仮面全
I: 快書固念属性岬舞五玉敗決着詠唱
J: 体味方敵逆回復補助思陰状態髪貴私
K: 僧期片腹痛希望形消再来専短持
L: 両半重幅帯鋭能昇速度特定確率避無
M: 遠距離可武器円秘謎軽量堅作同付込
N: 高削除象徴脱巻病治不闘記念兜塔最
O: 具前呼未輿集野言祝砲母化新
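
Since the red highlighting is lost in plain text, one way to recover what changed between the two tables is to diff them row by row. The sketch below uses Python's standard `difflib.SequenceMatcher`; the helper name `diff_rows` is hypothetical, and the sample rows are rows 1, 5 and 7 copied from the two posts above.

```python
from difflib import SequenceMatcher


def diff_rows(old, new):
    """Return (tag, old_segment, new_segment) for every changed span."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, old[i1:i2], new[j1:j2]))
    return changes


# Rows as posted by goldenband (first) and Seihen (second).
pairs = {
    "1": ("愛悪安闇衣遣域運衛炎王屋下家会壊", "愛悪安闇衣域運衛炎王屋下家会壊"),
    "5": ("呪周襲宿出所女勝将小輔堤心神臣身", "呪周襲宿出所女勝将小城場心神臣身"),
    "7": ("代大男知地中仲町長典天伝東逃動", "代大男知地中仲町長撤典天伝東逃動"),
}
for label, (old, new) in pairs.items():
    print(label, diff_rows(old, new))
```

A diff only shows where the two transcriptions disagree; it cannot say which one matches the font dump.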

goldenband

Re: kanji recognition
« Reply #7 on: December 25, 2014, 09:19:12 pm »
Hey, thanks for that, Seihen! You caught several errors I missed, confirmed a few kanji I wasn't sure about (I should've been clearer that a few red ones were just me being uncertain), and filled out the table, which I'm grateful for on all counts.

BTW just in case it's relevant for tryphon, you also corrected 姉 in Row 9 to 妹.

Also, interesting that O-5 is indeed correctly 輿 as the OCR had it -- the kanji in the font dump looks more like an asterisk shape in the center, but I take it that's inbounds?

mz

Re: kanji recognition
« Reply #8 on: December 25, 2014, 09:25:12 pm »
Quote from: goldenband
Also, interesting that O-5 is indeed correctly 輿 as the OCR had it -- the kanji in the font dump looks more like an asterisk shape in the center, but I take it that's inbounds?
I'm pretty sure that's actually 奥.

tryphon

Re: kanji recognition
« Reply #9 on: December 26, 2014, 02:46:28 am »
Thanks to both of you :)

I didn't know adding spaces around the glyphs could make such a difference. I will think about it the next time I dump a font.

FAST6191

Re: kanji recognition
« Reply #10 on: December 26, 2014, 07:18:43 am »
Quote from: tryphon
Thanks to both of you :)

I didn't know adding spaces around the glyphs could make such a difference. I will think about it the next time I dump a font.

I have not done much work with Japanese OCR, and you can probably get OCR programs to work, given that Japanese is a fixed-width, self-contained/line-contained character system. However, most OCR tends to use some kind of edge detection, and a glyph that touches an adjacent character will probably confuse it.
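
To illustrate why touching glyphs are a problem, here is a sketch of a very simple segmenter. It is not edge detection and not what NHocr does internally (the thread does not say); it swaps in a plain vertical projection, which finds character boundaries by looking for columns with no ink. The name `column_segments` and the tiny sample bitmaps are hypothetical.

```python
def column_segments(bitmap, background=0):
    """Return (start, end) column ranges that contain ink, end exclusive."""
    width = len(bitmap[0])
    # A column counts as inked if any pixel in it differs from the background.
    inked = [any(row[x] != background for row in bitmap) for x in range(width)]

    segments, start = [], None
    for x, has_ink in enumerate(inked):
        if has_ink and start is None:
            start = x                      # a new run of inked columns begins
        elif not has_ink and start is not None:
            segments.append((start, x))    # an empty column closes the run
            start = None
    if start is not None:
        segments.append((start, width))
    return segments


# Two 2-pixel-wide glyphs packed edge to edge, then with a 1-pixel gutter.
touching = [[1, 1, 1, 1],
            [1, 0, 0, 1]]
spaced = [[1, 1, 0, 1, 1],
          [1, 0, 0, 0, 1]]
print(column_segments(touching))
print(column_segments(spaced))
```

Without a gutter the two glyphs come back as a single segment, which matches the behaviour described in this thread once spacing was added.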

Seihen

Re: kanji recognition
« Reply #11 on: December 26, 2014, 09:19:09 am »
At the very least, the spacing between the characters is worlds better for human proofreaders. I swear, my eyes start to hurt after looking at some of those blobs.

I'm curious why 方 (lines G and J) and 気 (lines 2 and B) repeat in the table, and if both copies are used or if one can be ignored. I've seen this in several other kanji dumps before, though, so it's not too surprising. Just weird (and annoying).
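
Repeats like these can be located mechanically once the table is typed out. The sketch below is one way to do it; the helper name `find_repeats` is hypothetical, and only the four rows mentioned in the post are included, so running it over the full table may flag further glyphs.

```python
from collections import defaultdict


def find_repeats(table):
    """Map each glyph seen more than once to its (row label, column) positions.

    Columns are counted from 1, matching the "O-5" style used in the thread.
    """
    positions = defaultdict(list)
    for label, row in table.items():
        for col, glyph in enumerate(row, start=1):
            positions[glyph].append((label, col))
    return {g: p for g, p in positions.items() if len(p) > 1}


# Rows 2, B, G and J from the corrected table.
table = {
    "2": "海界外活官間気義救強恐軍撃結月剣",
    "B": "元親失抱少修航牢声姿獄気防御護",
    "G": "雨何火向左山事上西村南方薬様用石",
    "J": "体味方敵逆回復補助思陰状態髪貴私",
}
for glyph, where in find_repeats(table).items():
    print(glyph, where)
```

Whether the game's text actually references both copies is a separate question that only the script data can answer.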

goldenband

Re: kanji recognition
« Reply #12 on: December 26, 2014, 11:32:37 am »
NHocr definitely needed the extra whitespace (both horizontal and vertical), and I added at least 3 px on every side of each character. But it also behaved very inconsistently and even paradoxically, e.g. it would repeatedly only read rows 1 & 4 of a four-row image, even with plenty of whitespace above and below rows 2 & 3 and between characters. (And as soon as it got confused in any way, it'd bail out completely for that row.)

But then if I just uploaded row 2 and row 3 on their own, with no other changes, they'd OCR fine with no trouble.

It'd be worth doing a set of systematic tests with NHocr to see if any combination could get the whole table to dump properly: vary the number of pixels between characters, see if doubling the size (without anti-aliasing) would make any difference, and so on.

But that should be done by one of us who doesn't have tryphon's hacking skills. :D
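
The systematic test suggested above could be organised roughly as follows. This sketch only builds the list of configurations and provides the pixel-doubling step; it does not call NHocr, whose interface is not described in the thread. The parameter values and the names `scale_nearest` and `build_test_matrix` are hypothetical.

```python
from itertools import product


def scale_nearest(bitmap, factor):
    """Upscale by an integer factor by repeating pixels (no anti-aliasing)."""
    scaled = []
    for row in bitmap:
        wide = [px for px in row for _ in range(factor)]
        # Repeat the widened row `factor` times, copying so rows stay independent.
        scaled.extend(list(wide) for _ in range(factor))
    return scaled


def build_test_matrix(gaps=(1, 2, 3, 4), factors=(1, 2), rows_per_image=(1, 2, 4)):
    """Return every combination of spacing, scale and rows per uploaded image."""
    return [
        {"gap": g, "scale": f, "rows_per_image": r}
        for g, f, r in product(gaps, factors, rows_per_image)
    ]


configs = build_test_matrix()
print(len(configs), "configurations to try")
print(scale_nearest([[1, 0]], 2))
```

Each configuration would then be rendered, uploaded and scored by hand against the proofread table, since the recognised text still needs a human check.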

tryphon

Re: kanji recognition
« Reply #13 on: December 27, 2014, 06:17:13 pm »
My hacking skills are not so great, but my Japanese is much worse :)

In fact, I don't know any kanji. There's (at least) one other table in the game, but it contains mainly kana, and those I can identify myself. :)

Once again, thanks for your help :)

STARWIN

  • Sr. Member
  • ****
  • Posts: 444
Re: kanji recognition
« Reply #14 on: December 27, 2014, 06:50:10 pm »
Quote from: goldenband
But then if I just uploaded row 2 and row 3 on their own, with no other changes, they'd OCR fine with no trouble.

Does that mean it would work well if the entire table was put on one long line?
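
Trying this is cheap: the rows of cells can be laid side by side into a single strip before uploading. A minimal sketch, assuming the sheet is a list of pixel rows with a fixed cell height; the name `to_single_line` is hypothetical, and whether NHocr copes with such a long line is not answered in the thread.

```python
def to_single_line(bitmap, cell_h):
    """Concatenate each band of `cell_h` pixel rows side by side into one strip."""
    if len(bitmap) % cell_h:
        raise ValueError("bitmap height is not a multiple of the cell height")
    strip = [[] for _ in range(cell_h)]
    for y in range(0, len(bitmap), cell_h):
        for dy in range(cell_h):
            # Append this band's pixel row to the matching row of the strip.
            strip[dy].extend(bitmap[y + dy])
    return strip


# Hypothetical sheet: two bands of cells, each 2 pixels tall and 2 pixels wide.
sheet = [[1, 2],
         [3, 4],
         [5, 6],
         [7, 8]]
print(to_single_line(sheet, cell_h=2))
```

Any gutters between characters should be added before this step so that the last glyph of one band does not touch the first glyph of the next.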