java string and hashset-membership matching - java

I am converting single Chinese characters into roman letters (pinyin) using the pinyin4j package in Java. However, this often yields multiple pinyin forms for one character, because the same character can have different pronunciations. Say character C1 converts to 2 pinyin forms, p1 and p2, and character C2 converts to 3 pinyin forms, q1, q2 and q3.
When I combine C1C2 into a word, it yields 2*3=6 combinations. Usually only one of these is a real word. I want to check these combinations against a lexicon text file I built, in which each line starting with \w is a lexical entry (so, for instance, only p1q2 out of the 6 combinations is found in the lexicon). I'm thinking about reading the lexicon file into a HashSet. However, I'm not sure how best to implement this whole process. Any suggestions?

A HashSet seems quite all right. If the lexicon is extremely large and lookups have to be very fast, consider using a trie data structure. There is, however, no trie implementation in the Java standard library.
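The HashSet approach could look roughly like the sketch below. It assumes that a lexical entry line has the form `\w ` followed by the entry, which is a guess at the file format, and the names `PinyinLexiconFilter`, `parseLexicon`, `combinations` and `knownWords` are made up for illustration. The per-character readings are passed in as plain lists, so the pinyin4j call itself is left out.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PinyinLexiconFilter {

    // Collects the entries of all lines that start with "\w " (assumed format).
    static Set<String> parseLexicon(List<String> lines) {
        Set<String> lexicon = new HashSet<>();
        for (String line : lines) {
            if (line.startsWith("\\w ")) {
                String entry = line.substring(3).trim();
                if (!entry.isEmpty()) {
                    lexicon.add(entry);
                }
            }
        }
        return lexicon;
    }

    // Builds every concatenation that takes one reading per character, in order.
    static List<String> combinations(List<List<String>> readingsPerChar) {
        List<String> result = new ArrayList<>();
        result.add("");
        for (List<String> readings : readingsPerChar) {
            List<String> next = new ArrayList<>();
            for (String prefix : result) {
                for (String reading : readings) {
                    next.add(prefix + reading);
                }
            }
            result = next;
        }
        return result;
    }

    // Keeps only the combinations that are present in the lexicon.
    static List<String> knownWords(List<List<String>> readingsPerChar, Set<String> lexicon) {
        List<String> hits = new ArrayList<>();
        for (String candidate : combinations(readingsPerChar)) {
            if (lexicon.contains(candidate)) {
                hits.add(candidate);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<String> lexicon = parseLexicon(List.of("\\w p1q2", "\\w x1y1"));
        List<List<String>> readings =
                List.of(List.of("p1", "p2"), List.of("q1", "q2", "q3"));
        System.out.println(knownWords(readings, lexicon));
    }
}
```

For a real file, the lines could come from `Files.readAllLines(path)`. Each `contains` call is a constant-time hash lookup on average, so checking all 6 combinations is cheap.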

Related

Java String.split() spiralling out of control

I am trying to parse strings (some can be very long, paragraphs) based on white space (spaces, return keys, tabs). Currently using String.split("\\s++"). In the previous project we are updating, we had simply used StringTokenizer. Using String.split("\\s++") works just fine in all our testing and with all our beta testers.
The minute we release it to expanded users, it runs for a while until it soaks up all server resources. From what I've researched, it appears to be catastrophic backtracking. We get errors like:
....was in progress with java.base#11.0.5/java.util.regex.Pattern$GroupHead.match(Pattern.java:4804)
java.base#11.0.5/java.util.regex.Pattern$Start.match(Pattern.java:3619)
java.base#11.0.5/java.util.regex.Matcher.search(Matcher.java:1729)
java.base#11.0.5/java.util.regex.Matcher.find(Matcher.java:746)
java.base#11.0.5/java.util.regex.Pattern.split(Pattern.java:1264)
java.base#11.0.5/java.lang.String.split(String.java:2317)
Users can type some crazy text. What is the best option to parse strings that could be anywhere from 10 characters to 1000 characters long? I am at a brick wall. Been trying different patterns (regex is not my strongest area) for the past 4 days without long term success.
The simple solution, if you don't trust the regex, is to use a non-regex solution such as Apache Commons StringUtils#split. Alternatively, it's pretty easy to write one yourself.
Keep in mind that the difference between StringTokenizer and a split function is that the tokenizer is lazy. If you only retrieve a subset of the split results, a split may eat up more memory than necessary. I would only expect this to be a problem with large strings though.
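As an illustration of the "write one yourself" option, here is a minimal sketch of a regex-free whitespace splitter. The class and method names are hypothetical. It uses `Character.isWhitespace`, which is close to but not identical to the regex class `\s`, and unlike `String.split` it never produces an empty leading token.

```java
import java.util.ArrayList;
import java.util.List;

final class WhitespaceSplitter {

    // Splits on runs of whitespace in a single left-to-right pass, without regex.
    static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // start index of the current token, or -1 when between tokens
        for (int i = 0; i < text.length(); i++) {
            if (Character.isWhitespace(text.charAt(i))) {
                if (start >= 0) {
                    tokens.add(text.substring(start, i));
                    start = -1;
                }
            } else if (start < 0) {
                start = i;
            }
        }
        if (start >= 0) {
            tokens.add(text.substring(start)); // last token runs to the end
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(splitOnWhitespace("  some\tcrazy\n\n user text "));
    }
}
```

The loop visits each character once, so the running time is linear in the input length regardless of what users type.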

OCR specific approximate string matching library

I have a text extracted from image using OCR. Some of the words are not correctly recognized in the text as follows:
'DRDER 0F OFF1CE RESTAURAUT, QNE THO...'
As you can see, some characters are easy to mistake for others optically: 1 -> I, O -> D -> Q, H -> W, U -> N and so on.
Question: Apart from standard algorithms like Levenshtein distance, is there a Java or Python library implementing an OCR-specific algorithm that can help compare words to a predefined dictionary and give a score, taking into account possible OCR character mixups?
I don't know of anything OCR-specific, but you might be able to make this work with Biopython, because the basic problem of comparing one string to another using a matrix that scores each character's similarity to every other character is very common in bioinformatics. We call it a sequence alignment problem.
Have a look at the pairwise2 module that Biopython provides; you would be able to compare each input word against each dictionary word with pairwise2.align.globaldx, using a dict that has all the pairwise character similarities. There are also functions in there for scoring deleted/inserted characters.
Computing the pairwise character similarities would be something you'd have to do yourself, maybe by rendering each character in your chosen font and comparing the images, or maybe manually by just rating which characters look similar to you. You could also have a look at this other SO answer where characters are broken into classes based on the presence/absence of strokes.
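The scoring-matrix idea is not tied to Biopython. As a rough sketch in Java (not the pairwise2 API), a weighted edit distance can charge less for substituting characters that look alike. The confusable pairs below are taken from the question, plus 0/O from its sample text. The cost values 0.2 and 1.0 and the name `OcrDistance` are arbitrary assumptions.

```java
import java.util.HashSet;
import java.util.Set;

final class OcrDistance {

    // Pairs of characters assumed to be easily confused by OCR (stored in both orders).
    private static final Set<String> CONFUSABLE = new HashSet<>();
    static {
        String[][] pairs = {
            {"1", "I"}, {"0", "O"}, {"O", "D"}, {"O", "Q"},
            {"D", "Q"}, {"H", "W"}, {"U", "N"}
        };
        for (String[] p : pairs) {
            CONFUSABLE.add(p[0] + p[1]);
            CONFUSABLE.add(p[1] + p[0]);
        }
    }

    // Cheap substitution for look-alike characters, full cost otherwise (assumed weights).
    static double substitutionCost(char a, char b) {
        if (a == b) {
            return 0.0;
        }
        return CONFUSABLE.contains("" + a + b) ? 0.2 : 1.0;
    }

    // Levenshtein-style dynamic programme with weighted substitutions.
    static double distance(String s, String t) {
        double[][] d = new double[s.length() + 1][t.length() + 1];
        for (int i = 1; i <= s.length(); i++) {
            d[i][0] = i; // delete all of s[0..i)
        }
        for (int j = 1; j <= t.length(); j++) {
            d[0][j] = j; // insert all of t[0..j)
        }
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                double sub = d[i - 1][j - 1] + substitutionCost(s.charAt(i - 1), t.charAt(j - 1));
                double del = d[i - 1][j] + 1.0;
                double ins = d[i][j - 1] + 1.0;
                d[i][j] = Math.min(sub, Math.min(del, ins));
            }
        }
        return d[s.length()][t.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("DRDER", "ORDER"));
        System.out.println(distance("DRDER", "UNDER"));
    }
}
```

A lower score means a more plausible dictionary word, so an OCR-style mixup ranks ahead of an unrelated word with the same number of differing characters.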
If you want something better than O(input * dictionary), you'd have to switch from brute force comparison to some kind of seed-match-based algorithm. If you assume that you'll always have a 2-character perfect match for example, you can index your dictionary by which words contain each length-2 string, and only compare the input words against the dictionary words that share a length-2 string with them.
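A minimal sketch of that length-2 seed index, written in Java with hypothetical names (`BigramIndex`, `candidates`): the dictionary is indexed by every 2-character substring, and an input word is only compared against words that share at least one of them.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class BigramIndex {

    // Maps each length-2 substring to the dictionary words that contain it.
    private final Map<String, Set<String>> index = new HashMap<>();

    BigramIndex(Collection<String> dictionary) {
        for (String word : dictionary) {
            for (int i = 0; i + 2 <= word.length(); i++) {
                index.computeIfAbsent(word.substring(i, i + 2), k -> new HashSet<>()).add(word);
            }
        }
    }

    // Returns the dictionary words sharing at least one length-2 substring with the input.
    Set<String> candidates(String input) {
        Set<String> result = new TreeSet<>();
        for (int i = 0; i + 2 <= input.length(); i++) {
            Set<String> words = index.get(input.substring(i, i + 2));
            if (words != null) {
                result.addAll(words);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        BigramIndex idx = new BigramIndex(List.of("ORDER", "OFFICE", "RESTAURANT", "ONE", "TWO"));
        System.out.println(idx.candidates("DRDER"));
    }
}
```

The trade-off is the one stated above: an input with no intact 2-character stretch finds no candidates at all, so a full comparison may still be needed as a fallback.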

Java : How to find string patterns in a LARGE binary file?

I'm trying to write a program that will read a VERY LARGE binary file and try to find the occurrence of 2 different strings and then print the indexes that matches the patterns. For the example's sake let's assume the character sequences are [H,e,l,l,o] and [H,e,l,l,o, ,W,o,r,l,d].
I was able to code this for small binary files because I was reading each character as a byte and then saving it in an Arraylist. Then starting from the beginning of the Arraylist, I was comparing the byte arraylist(byte[] data) with the byte[] pattern.
I need to find a way to do the same but WITHOUT writing the entire binary file in memory for comparison. That means I should be able to compare while reading each character (I should not save the entire binary file in memory). Assume the binary file only contains characters.
Any suggestions on how this can be achieved ? Thank you all in advance.
Seems like you are really looking for Aho-Corasick string matching algorithm.
The algorithm builds an automaton from the given dictionary you have, and then allows you to find matches using a single scan of your input string.
The wikipedia article links to this java implementation
Google "finite state machine".
Or, read the file one byte at a time, if the byte just doesn't match the first character of the search term, go on to the next byte. If it does match, now you're looking for the next character in the sequence. I.e., your state has gone from 0, to 1. If your state equals (or passes) the length of the search string, you found it!
Implementation/debugging left to the reader.
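One way to fill in that state machine is sketched below. Note that simply resetting the state to 0 on a mismatch can miss matches (searching for "aab" in "aaab", for example), so this sketch swaps in the Knuth-Morris-Pratt failure table to decide which state to fall back to. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class StreamingKmp {

    // failure[i] = length of the longest proper prefix of pattern[0..i] that is also its suffix.
    static int[] failureTable(byte[] pattern) {
        int[] failure = new int[pattern.length];
        int k = 0;
        for (int i = 1; i < pattern.length; i++) {
            while (k > 0 && pattern[i] != pattern[k]) {
                k = failure[k - 1];
            }
            if (pattern[i] == pattern[k]) {
                k++;
            }
            failure[i] = k;
        }
        return failure;
    }

    // Reads the stream one byte at a time and returns the start offset of every match.
    static List<Long> findAll(InputStream in, byte[] pattern) throws IOException {
        List<Long> starts = new ArrayList<>();
        if (pattern.length == 0) {
            return starts;
        }
        int[] failure = failureTable(pattern);
        int state = 0;     // number of pattern bytes matched so far
        long position = 0; // number of bytes consumed from the stream
        int b;
        while ((b = in.read()) != -1) {
            byte current = (byte) b;
            while (state > 0 && current != pattern[state]) {
                state = failure[state - 1]; // fall back instead of restarting from 0
            }
            if (current == pattern[state]) {
                state++;
            }
            position++;
            if (state == pattern.length) {
                starts.add(position - pattern.length);
                state = failure[state - 1]; // keep going to find overlapping matches
            }
        }
        return starts;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "Hello World Hello".getBytes(StandardCharsets.US_ASCII);
        byte[] pattern = "Hello".getBytes(StandardCharsets.US_ASCII);
        System.out.println(findAll(new ByteArrayInputStream(data), pattern));
    }
}
```

Only the pattern, its failure table and two counters are kept in memory, so the file size does not matter. For a real file, the stream would typically be a `BufferedInputStream` around a `FileInputStream`.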
There are specialised algorithms for this but let's try a simple one first.
You can start with making the comparison on the fly, always after reading the next byte. Once you do that, it's easy to spot that you don't need to keep any bytes that are from earlier than your longest pattern.
So you can just use a buffer that is as long as your longest pattern, put new bytes in at one end and drop them at the other.
As I said, there are algorithms more effective than this but it's a good start.
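The buffer idea can be sketched as a ring buffer as long as the longest pattern. After each byte is read, every pattern is checked against the most recent bytes. The names `SlidingWindowSearch` and `find` are made up for this example, and the per-byte check costs time proportional to the total pattern length, so this is the simple version rather than the efficient one.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class SlidingWindowSearch {

    // Returns, for each pattern (same order), the start offsets of all its occurrences.
    static List<List<Long>> find(InputStream in, List<byte[]> patterns) throws IOException {
        List<List<Long>> hits = new ArrayList<>();
        int max = 0;
        for (byte[] p : patterns) {
            hits.add(new ArrayList<>());
            max = Math.max(max, p.length);
        }
        if (max == 0) {
            return hits; // nothing to search for
        }
        byte[] window = new byte[max]; // ring buffer holding the last `max` bytes
        long read = 0;                 // total number of bytes consumed so far
        int b;
        while ((b = in.read()) != -1) {
            window[(int) (read % max)] = (byte) b;
            read++;
            for (int k = 0; k < patterns.size(); k++) {
                byte[] p = patterns.get(k);
                if (p.length > 0 && p.length <= read && endsWith(window, read, p)) {
                    hits.get(k).add(read - p.length);
                }
            }
        }
        return hits;
    }

    // True if the last p.length bytes read are equal to p.
    private static boolean endsWith(byte[] window, long read, byte[] p) {
        for (int i = 0; i < p.length; i++) {
            long pos = read - p.length + i; // absolute stream offset being compared
            if (window[(int) (pos % window.length)] != p[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "Hello World, Hello".getBytes(StandardCharsets.US_ASCII);
        List<byte[]> patterns = List.of(
                "Hello".getBytes(StandardCharsets.US_ASCII),
                "Hello World".getBytes(StandardCharsets.US_ASCII));
        System.out.println(find(new ByteArrayInputStream(data), patterns));
    }
}
```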
Use a FileInputStream wrapped in a BufferedInputStream and compare each byte. Keep a buffer the length of the sequence you're looking for so you backtrack if it doesn't match at some point. If the sequence you're looking for is too large, you could save the offset and re-open the file for reading.
Working with streams: http://docs.oracle.com/javase/tutorial/essential/io/
String matching algorithms: http://en.wikipedia.org/wiki/String_searching_algorithm
Or if you just want something to copy and paste you could look at this SO question.

Regex search pattern in very large file

I'd like to search for a pattern in a very large file (e.g. above 1 GB) that consists of a single line.
It is not possible to load it into memory. Currently, I use BufferedReader to read into buffers (1024 chars).
The main steps:
1. Read data into two buffers.
2. Search for the pattern in those buffers.
3. Increment a counter if the pattern was found.
4. Copy the second buffer into the first.
5. Load data into the second buffer.
6. Search for the pattern in both buffers.
7. Increment the counter if the pattern was found.
8. Repeat the above steps (starting from 4) until EOF.
That algorithm (two buffers) lets me avoid the situation where the searched piece of text is split across chunks. It works like a charm as long as the match is shorter than the combined length of the two buffers. For example, I can't handle the case where the match is longer, let's say as long as 3 buffers (I only have data in two buffers, so the match will fail!). What's more, I can construct such a case:
Prepare a 1 GB single-line file that consists of "baaaaaaa(....)aaaaab"
Search for the pattern ba*b.
The whole file matches the pattern!
I don't have to print the result, I only have to be able to say: "Yes, I was able to find the pattern" or "No, I wasn't able to find it".
Is this possible with Java? I mean:
The ability to determine whether a pattern is present in the file (without loading the whole line into memory, see the case above)
A way to handle the case where the match is longer than a chunk.
I hope my explanation is pretty clear.
I think the solution for you would be to implement CharSequence as a wrapper over very large text files.
Why? Because building a Matcher from a Pattern takes a CharSequence as an argument.
Of course, easier said than done... But then you only have three methods to implement, so that shouldn't be too hard...
EDIT I took the plunge and I ate my own dog's food. The "worst part" is that it actually works!
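A minimal sketch of such a wrapper, not the implementation referred to above: it memory-maps the file and exposes it as a CharSequence. It assumes one byte per character (ISO-8859-1 style) and a file no larger than Integer.MAX_VALUE bytes, since both a single mapping and CharSequence indices are limited to int. The name `FileCharSequence` is hypothetical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class FileCharSequence implements CharSequence {

    private final ByteBuffer bytes; // memory-mapped file contents
    private final int start;        // inclusive offset of this view
    private final int end;          // exclusive offset of this view

    private FileCharSequence(ByteBuffer bytes, int start, int end) {
        this.bytes = bytes;
        this.start = start;
        this.end = end;
    }

    // Maps the whole file read-only; the mapping stays valid after the channel is closed.
    static FileCharSequence open(Path path) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            long size = channel.size();
            if (size > Integer.MAX_VALUE) {
                throw new IOException("File too large for a single mapping: " + size);
            }
            ByteBuffer mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
            return new FileCharSequence(mapped, 0, (int) size);
        }
    }

    @Override
    public int length() {
        return end - start;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        // Assumption: one byte per character, interpreted as an unsigned value.
        return (char) (bytes.get(start + index) & 0xFF);
    }

    @Override
    public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException(from + ".." + to);
        }
        return new FileCharSequence(bytes, start + from, start + to); // shares the mapping
    }

    @Override
    public String toString() {
        // Copies the view into memory, so only call this on small subsequences.
        StringBuilder sb = new StringBuilder(length());
        for (int i = 0; i < length(); i++) {
            sb.append(charAt(i));
        }
        return sb.toString();
    }
}
```

With this in place, something like `Pattern.compile("ba*b").matcher(FileCharSequence.open(path)).find()` answers the yes/no question without reading the line into a String. How quickly the regex engine gets through a very long match is a separate matter that this sketch does not address.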
It seems like you may need to break that search-pattern down into pieces, since, given your restrictions, searching for it in its entirety is failing.
Can you determine that a buffer contains the beginning of a match? If so, save that state and then search the next portion for the next part of the match. Continue until the entire search-term is found.

Print Arabic (RTL) and English (LTR) in the correct directions at the same time

I want to output "Arabic" and "English" text at the same time in Java for example, outputting the following statement: مرحبا I am Adham.
I searched the internet and found that the BiDi algorithm is needed in this case. Are there any Java classes for BiDi?
I tried the class BiDiReferenceJava, but when I call runSample() in the class BidiReferenceTest with an Arabic string as the parameter, I get an OutOfIndexException because the character count is doubled (exactly at this line of code in the class BidiReferenceTestCharmap):
byte[] result = new byte[count];
Where if the string length is 4 the count is 8!
ICU4J is more or less the standard comprehensive Unicode library for Java, and thus supports the bidirectional algorithm. I really wonder why you need this, though; BiDi is usually applied by the display layer, unless you're writing a word processor or something similar.
BidiReference.java is apparently a demonstration piece; it's designed to show how the algorithm works on ASCII characters instead of using actual Unicode characters.
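For a quick look at how the algorithm segments mixed text without adding a dependency, the JDK also ships `java.text.Bidi`. The sketch below uses that class rather than ICU4J, and the helper names `isMixed` and `describeRuns` are invented for illustration. It only reports the directional runs; it does not reorder or display anything.

```java
import java.text.Bidi;

final class BidiRuns {

    // True if the text contains both left-to-right and right-to-left runs.
    static boolean isMixed(String text) {
        return new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT).isMixed();
    }

    // Lists each directional run as DIRECTION[start,limit) in logical order.
    static String describeRuns(String text) {
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bidi.getRunCount(); i++) {
            int level = bidi.getRunLevel(i);
            // Even embedding levels are left-to-right, odd levels are right-to-left.
            sb.append(level % 2 == 0 ? "LTR" : "RTL")
              .append('[').append(bidi.getRunStart(i))
              .append(',').append(bidi.getRunLimit(i)).append(") ");
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The sample sentence from the question, with the Arabic word written as escapes.
        String sample = "\u0645\u0631\u062D\u0628\u0627 I am Adham.";
        System.out.println(isMixed(sample));
        System.out.println(describeRuns(sample));
    }
}
```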
