OCR-specific approximate string matching library - Java

I have a text extracted from image using OCR. Some of the words are not correctly recognized in the text as follows:
'DRDER 0F OFF1CE RESTAURAUT, QNE THO...'
As you can see, some characters are optically easy to mistake for others: 1 -> I, O -> D -> Q, H -> W, U -> N, and so on.
Question: Apart from standard algorithms like Levenshtein distance, is there a Java or Python library implementing an OCR-specific algorithm that can compare words to a predefined dictionary and give a score, taking into account possible OCR character mix-ups?

I don't know of anything OCR-specific, but you might be able to make this work with Biopython, because the basic problem of comparing one string to another using a matrix that scores each character's similarity to every other character is very common in bioinformatics. We call it a sequence alignment problem.
Have a look at the pairwise2 module that Biopython provides; you would be able to compare each input word against each dictionary word with pairwise2.align.globaldx, using a dict that has all the pairwise character similarities. There are also functions in there for scoring deleted/inserted characters.
Computing the pairwise character similarities would be something you'd have to do yourself, maybe by rendering each character in your chosen font and comparing the images, or maybe manually by just rating which characters look similar to you. You could also have a look at this other SO answer where characters are broken into classes based on the presence/absence of strokes.
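Since the question also allows Java, the core of that alignment approach can be rolled by hand as a weighted Levenshtein distance whose substitution cost comes from a character-confusion table. A minimal sketch; the confusion costs below are made-up illustrations, not measured values:
import java.util.HashMap;
import java.util.Map;

public class OcrDistance {
    // Illustrative OCR confusion costs; look-alike pairs are cheap to substitute.
    private static final Map<String, Double> CONFUSION = new HashMap<>();
    static {
        CONFUSION.put("0O", 0.1); CONFUSION.put("DO", 0.2); CONFUSION.put("QO", 0.2);
        CONFUSION.put("1I", 0.1); CONFUSION.put("HW", 0.4); CONFUSION.put("NU", 0.4);
    }

    private static double substCost(char a, char b) {
        if (a == b) return 0.0;
        Double c = CONFUSION.get("" + a + b);
        if (c == null) c = CONFUSION.get("" + b + a);
        return c != null ? c : 1.0; // unrelated characters cost a full edit
    }

    // Standard Levenshtein dynamic program with a custom substitution cost.
    public static double distance(String s, String t) {
        double[][] d = new double[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                d[i][j] = Math.min(
                        d[i - 1][j - 1] + substCost(s.charAt(i - 1), t.charAt(j - 1)),
                        Math.min(d[i - 1][j] + 1.0, d[i][j - 1] + 1.0));
            }
        }
        return d[s.length()][t.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("DRDER", "ORDER")); // 0.2: cheap D/O mix-up
        System.out.println(distance("DRDER", "UDDER")); // 2.0: two unrelated edits
    }
}
Scoring each dictionary word this way and taking the lowest distance gives the OCR-aware score the question asks about.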
If you want something better than O(input * dictionary), you'd have to switch from brute force comparison to some kind of seed-match-based algorithm. If you assume that you'll always have a 2-character perfect match for example, you can index your dictionary by which words contain each length-2 string, and only compare the input words against the dictionary words that share a length-2 string with them.
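A rough sketch of that length-2 (bigram) index; the class name is a placeholder:
import java.util.*;

public class BigramIndex {
    private final Map<String, Set<String>> index = new HashMap<>();

    public BigramIndex(Collection<String> dictionary) {
        for (String word : dictionary) {
            for (int i = 0; i + 2 <= word.length(); i++) {
                index.computeIfAbsent(word.substring(i, i + 2), k -> new HashSet<>())
                     .add(word);
            }
        }
    }

    // Only dictionary words sharing at least one length-2 substring with the
    // input are candidates, so the expensive alignment runs on far fewer pairs.
    public Set<String> candidates(String input) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 2 <= input.length(); i++) {
            result.addAll(index.getOrDefault(input.substring(i, i + 2),
                                             Collections.emptySet()));
        }
        return result;
    }
}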

Related

Resetting fancy font to normal

I have a String named fancy; the String fancy is "𝖑𝖒𝖆𝖔", but I need to make "lmao" out of it.
I've tried calling String#trim, but with no success.
Example code:
var fancy = "𝖑𝖒𝖆𝖔";
var normal = //Magic to convert 𝖑𝖒𝖆𝖔 to lmao
EDIT: So I figured out that if I take the Unicode code point of this fancy character and subtract 120101 from it, I get the original character. However, there are more types of these fancy texts, so this does not seem like a solution to my problem.
You can take advantage of the fact that your "𝖆" character decomposes to a regular "a":
Decomposition LATIN SMALL LETTER A (U+0061)
Java's java.text.Normalizer class implements the different normalization forms. The NFKD and NFKC forms use the above decomposition rule.
String normal = Normalizer.normalize(fancy, Normalizer.Form.NFKC);
Using compatibility equivalence is what you need here:
Compatibility equivalence is a weaker type of equivalence between characters or sequences of characters which represent the same abstract character (or sequence of abstract characters), but which may have distinct visual appearances or behaviors.
(The reason you do not lose diacritics is because this process simply separates these diacritic marks from their base letters - and then re-combines them if you use the relevant form.)
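As a minimal complete example (the class name is just for illustration):
import java.text.Normalizer;

public class Defancify {
    public static void main(String[] args) {
        String fancy = "𝖑𝖒𝖆𝖔";
        // NFKC applies the compatibility decomposition, then recomposes.
        String normal = Normalizer.normalize(fancy, Normalizer.Form.NFKC);
        System.out.println(normal); // prints "lmao"
    }
}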
Those are Unicode characters: https://unicode-table.com also provides reverse lookup to identify them (copy-paste them into the search).
The fancy characters identify as:
𝖑 Mathematical Bold Fraktur Small L (U+1D591)
𝖒 Mathematical Bold Fraktur Small M (U+1D592)
𝖆 Mathematical Bold Fraktur Small A (U+1D586)
𝖔 Mathematical Bold Fraktur Small O (U+1D594)
You can also find them as 'old style english alphabet' on this list: https://unicode-table.com/en/sets/fancy-letters. There we notice that they are ordered in the same way as the regular alphabetic characters, so each character has a fixed offset from its plain counterpart:
int offset = 0x1D586 - 'a'; // 𝖆 is U+1D586
You can thus transform the characters back by subtracting that offset.
Now comes the tricky part: these Unicode code points cannot be represented by a single char data type, which is only 16 bits wide; code points outside the Basic Multilingual Plane need two chars (a surrogate pair).
The proper way to deal with this is to work with the code points directly:
String fancy = "𝖑𝖒𝖆𝖔";
int offset = 0x1D586 - 'a'; // 𝖆 is U+1D586
String plain = fancy.codePoints()
        .map(i -> i - offset)                    // undo the fixed offset
        .mapToObj(c -> String.valueOf((char) c)) // safe: results are ASCII
        .collect(java.util.stream.Collectors.joining());
System.out.println(plain);
This then prints lmao.

convert similar sound word parts

I'm having trouble searching for the right terms here to solve the below problem; I'm sure it's a done thing, I just can't find the right terms to express the problem!
I'm basically trying to create a classifier that will take word-comparison outputs (e.g. some outputs from Levenshtein distances) and decide whether the words are sufficiently different. An important input would probably be something like a Soundex comparison. The trouble I'm having is creating the training set for the algorithm (an SVM in this case). I have a long list of names, and I need to mutate them a bit (based on similar sounds within the word).
E.g. John and Jon would be a mutation to make, and I could label this in the test set as being equivalent. John and Johann have sufficiently different sound and letter distance to be considered different.
So what I'm asking for is a way to achieve a phoneme variation generator that retains the English lettering structure.
Even simple translation might suffice, like "f" could (sometimes) be replaced by "ph". I'm doing this in Java so any tips in that direction would be great too! Thanks.
EDIT
This is the closest I've come across so far: http://www.isi.edu/natural-language/people/hovy/papers/07IJCAI-spelling-variants.pdf
I'm just thinking aloud.
Rule-based: Apply a rule-based system where you use standard substitution rules such as 'ph' for 'f', and insertion rules such as inserting an h between a vowel and a consonant (see the sketch after this list).
Character n-gram alignment:
Use a word alignment tool such as Giza++ to align character n-grams from parallel corpora such as Europarl. I guess you would be able to find interesting word spelling variations such as "house", "haus" etc. You can play with various values of n.
Bootstrapping character n-gram alignment with rule-based: You might also want to use a combination of the two, in which you could, in principle, boost the probabilities of some alignments by using a set of external rules and heuristics.
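For the rule-based option, a minimal Java sketch of a substitution-based variant generator; the rule table is illustrative, not a complete phonetic rule set:
import java.util.HashSet;
import java.util.Set;

public class VariantGenerator {
    // Illustrative sound-alike substitutions; a real system would need a much
    // richer, position-aware rule set.
    private static final String[][] RULES = {
        {"ph", "f"}, {"f", "ph"}, {"ck", "k"}, {"oh", "o"}, {"y", "i"}
    };

    // Generate spelling variants by applying each rule at every position
    // where its left-hand side occurs.
    public static Set<String> variants(String word) {
        Set<String> result = new HashSet<>();
        for (String[] rule : RULES) {
            int idx = word.indexOf(rule[0]);
            while (idx >= 0) {
                result.add(word.substring(0, idx) + rule[1]
                        + word.substring(idx + rule[0].length()));
                idx = word.indexOf(rule[0], idx + 1);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(variants("john"));   // [jon]
        System.out.println(variants("philip")); // [filip]
    }
}
Pairs like "john"/"jon" generated this way can then be labelled as equivalent in the training set.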

java string and hashset-membership matching

I am converting single Chinese characters into roman letters (pinyin) using the pinyin4j package in Java. However, this often yields multiple pinyins for one character (the same character has different pronunciations). Say character C1 converts to 2 pinyin forms, p1 and p2, and character C2 converts to 3 pinyin forms, q1, q2, q3.
When I combine C1C2 into a word, this yields 2*3 = 6 combinations. Usually only one of these is a real word. I want to check these combinations against a lexicon text file I built, in which many lines start with \w followed by a lexical entry (so, for instance, only p1q2 out of the 6 combinations is found in the lexicon). I'm thinking about reading the lexicon file into a HashSet, but I'm not sure how to best implement the whole process. Any suggestions?
A HashSet seems quite alright. If the lexicon is extra large and you have to be super fast, consider using a Trie data structure instead. There is, however, no Trie implementation in the standard Java library.
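A minimal sketch of the whole flow, assuming one lexicon entry per line prefixed with \w (the file name and pinyin arrays are placeholders):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class PinyinLookup {
    public static void main(String[] args) throws IOException {
        // Load the lexicon: keep the entry that follows the "\w" prefix.
        Set<String> lexicon = new HashSet<>();
        for (String line : Files.readAllLines(Paths.get("lexicon.txt"))) {
            if (line.startsWith("\\w")) {
                lexicon.add(line.substring(2).trim());
            }
        }

        // Candidate pinyin readings for each character (placeholders for
        // whatever pinyin4j returns).
        String[] c1 = {"p1", "p2"};
        String[] c2 = {"q1", "q2", "q3"};

        // Try every combination; usually only one survives the lexicon check.
        for (String p : c1) {
            for (String q : c2) {
                if (lexicon.contains(p + q)) {
                    System.out.println("Found: " + p + q);
                }
            }
        }
    }
}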

guess words using dictionary

I am guessing the key of a less-simple simple substitution ciphertext. The metric by which I evaluate the correctness of a key is the number of English words in the putative decryption.
Are there any tools in Java that can count the number of English words in a string? For example,
"thefoitedstateswasat"-> 4 words
"thefortedxyzstateswasathat"->5 words.
I loaded a word list and am using a HashSet as a dictionary. Since I don't know where the inter-word spaces belong in the text, I can't validate words using a simple dictionary lookup.
Thanks.
I gave an answer to a similar question here:
If a word is made up of two valid words
It has some Java-esque pseudocode in it that might be adaptable into something that solves this problem.
Sorry, I'm new and don't have the rep to comment yet.
But wouldn't the code be very slow, given that the number of checks and permutations is very big?
I guess you just have to brute-force your way through with nested loops over the (n-1) possible word boundaries, searching the dictionary for each substring.
Surely there's a better way to test the accuracy of your key?
But that's not the point, here's what I'd do:
Using "quackdogsomethinggodknowswhat"
I'd have a recursive method: starting at the beginning of the string, call the method recursively for each dictionary word the subject string starts with, in this case "qua" and "quack", passing in the string with the word removed ("dogsomethinggodknowswhat" for quack). Return whichever is greater: 1 + the greatest value returned from those calls, OR 0 + the result of calling the method on the string starting at index 1 ("uackdogsomethinggodknowswhat").
This would probably work best if you kept your wordlist in a tree of some sort.
If you need some pseudocode, ask!
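Since pseudocode was offered, here is a minimal memoized Java sketch of that recursion; the class name is a placeholder and the dictionary Set is assumed to be loaded already:
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class WordCounter {
    private final Set<String> dictionary;
    private final Map<String, Integer> memo = new HashMap<>();

    public WordCounter(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    // Maximum number of dictionary words that can be carved out of s,
    // skipping characters that belong to no word.
    public int maxWords(String s) {
        if (s.isEmpty()) return 0;
        Integer cached = memo.get(s);
        if (cached != null) return cached;

        // Option 1: skip the first character (the "index 1" branch above).
        int best = maxWords(s.substring(1));

        // Option 2: consume any dictionary word that s starts with.
        for (int end = 1; end <= s.length(); end++) {
            if (dictionary.contains(s.substring(0, end))) {
                best = Math.max(best, 1 + maxWords(s.substring(end)));
            }
        }
        memo.put(s, best);
        return best;
    }
}
The memo map is what keeps this from blowing up combinatorially; a tree over the word list, as suggested, would further avoid testing every prefix length.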

Java Anagram Solver

I can work out how to create anagrams of a string but I don't know how I can compare them to a dictionary of real words to check if the anagram is a real word. Is there a class in the Java API that contains the entire English dictionary?
No, but you can get a wordlist from various places. From there, you could read the wordlist file into a list:
List<String> lines = new ArrayList<String>();
BufferedReader in = new BufferedReader(new FileReader("wordlist.txt"));
String line;
while ((line = in.readLine()) != null) {
    lines.add(line);
}
in.close();
And finally, check each candidate word with lines.contains() (or sort the list and use Collections.binarySearch for faster lookups).
One method of determining whether a set of characters is an anagram of a word involves using prime numbers. Assign each letter a prime number, for example, a=2, b=3, c=5, d=7. Now precompute the product of primes for each word in your dictionary. For example, 'add' = 2*7*7 = 98, or 'bad' = 3*2*7 = 42.
Now determining if a set of letters is an anagram of any word in a dictionary can be done by computing the value of the set of letters. For example, the letters 'abd'= 2*3*7 = 42 = 'bad'. Just check whether the computed value for the letters exists in your precomputed dictionary. For any anagram, you need only do this computation once versus trying to generate every possible anagram. Note however this method will only work well for relatively small words, otherwise you will run into overflow issues and need to use BigInteger.
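A minimal sketch of the prime-product idea (the letter-to-prime table and the tiny dictionary are illustrative):
import java.util.HashMap;
import java.util.Map;

public class PrimeAnagram {
    // One prime per letter 'a'..'z'; anagrams always share the same product.
    private static final long[] PRIMES = {
        2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
        43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
    };

    static long product(String word) {
        long p = 1;
        for (char c : word.toCharArray()) {
            p *= PRIMES[c - 'a']; // overflows for long words: switch to BigInteger
        }
        return p;
    }

    public static void main(String[] args) {
        // Precompute the product for every dictionary word.
        String[] dictionary = {"bad", "add"};
        Map<Long, String> byProduct = new HashMap<>();
        for (String word : dictionary) {
            byProduct.put(product(word), word);
        }
        // 'abd' = 2*3*7 = 42 = 'bad', so the lookup succeeds.
        System.out.println(byProduct.get(product("abd"))); // prints "bad"
    }
}
Note that dictionary words which are anagrams of each other share a key, so a real implementation would map each product to a list of words.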
No, you have to use an external library, such as JWNL, which is a wrapper for WordNet -- a machine-readable lexical database organized by meanings that contains pretty much every English word.
Maybe the English dictionary in jazzy can help you.
There's no such specialized class in the standard Java library, but you can use any implementation of the Set interface and initialize it by loading it up with words of your choosing, picked from any of the innumerable word lists you can find in many places. (Just check carefully that the license for the word list you choose is compatible with your intended application, e.g., that it allows commercial use or closed-source apps if that's what you require.)
