convert similar sound word parts - java

I'm having trouble searching for the right terms here to solve the below problem; I'm sure it's a done thing, I just can't find the right terms to express the problem!
I'm basically trying to create a classifier that will take word comparison outputs (e.g. some outputs from Levenstein distances) and decide whether the words are sufficiently different. An important input would probably be something like a soundex comparison. The trouble I'm having is creating the training set for the algorithm (an SVM in this case). I have a long list of names and I need to mutate them a bit (based on similar sounds within the word).
E.g. John and Jon would be a mutation to make, and I could label this in the test set as being equivalent. John and Johann have sufficiently different sound and letter distance to be considered different.
So I'm kinda asking for is a way to achieve a phoneme variation generator, but need to be able to retain the English lettering structure.
Even simple translation might suffice, like "f" could (sometimes) be replaced by "ph". I'm doing this in Java so any tips in that direction would be great too! Thanks.
EDIT
This is the closest I've come across so far: http://www.isi.edu/natural-language/people/hovy/papers/07IJCAI-spelling-variants.pdf

I'm just thinking aloud.
Rule-based: Apply a rule-based system where you could use standard substitution rules such as 'ph' for 'f', and insertion rules such as insert an h between a vowel and a consonant.
Character n-gram alignment:
Use a word alignment tool such as Giza++ to align character n-grams from parallel corpora such as Europarl. I guess you would be able to find interesting word spelling variations such as "house", "haus" etc. You can play with various values of n.
Bootstraping character n-gram alignment with rule-based: You might also want to use a combination of the two, in which you could, in principle, boost the probabilities of some alignments by using a set of external rules and heuristics.

Related

Detecting phone numbers using sapi

This is regarding to the question in Detecting phone numbers using sapi. I used that grammar in the answer.
But it gets numbers with space how can I create grammar get numbers without space?
Short answer - it would be quite difficult to do that, as SAPI is designed to recognize words, which have spaces in between them.
However, it's pretty straightforward to annotate the grammar so that you can find the start & end position of the phone number within the reco result, at which point you can remove the spaces using string replacement.
Alternatively, it's also pretty straightforward to place the phone number in a tag, at which point you can extract the number from the rules structure. (This might be trickier from Java.)

Does java world has the counterpart Regexp::Optimizer in perl? [duplicate]

I wrote a Java program which can generate a sequence of symbols, like "abcdbcdefbcdbcdefg". What I need is Regex optimizer, which can result "a((bcd){2}ef){2}g".
As the input may contain unicodes, like "a\u0063\u0063\bbd", I prefer a Java version.
The reason I want to get a "shorter" expression is for saving space/memory. The sequence of symbols here could be very long.
In general, to find the "shortest" optimized regex is hard. So, here, I don't need ones that guarantee the "shortest" criteria.
I've got a nasty feeling that the problem of creating the shortest regex that matches a given input string or set of strings is going to be computationally "difficult". (There are parallels with the problem of computing Kolmogorov Complexity ...)
It is also worth noting that the optimal regex for abcdbcdefbcdbcdefg in terms of matching speed is likely to be abcdbcdefbcdbcdefg. Adding repeating groups may make the regex string shorter, but it won't make the regex faster. In fact, it is likely to be slower unless the regex engine unrolls the repeating groups.
The reason that I need this is due to the space/memory limits.
Do you have clear evidence that you need to do this?
I suspect that you won't save a worthwhile amount of space by doing this ... unless the input strings are really long. (And if they are long, then you'll get better results using a regular text compression algorithm to compress the strings.)
Regular expressions are not a substitute for compression
Don't use a regular expression to represent a string constant. Regular expressions are designed to be used to match one of many strings. That's not what you're doing.
I assume you are trying to find a small regex to encode a finite set of input strings. If so, you haven't chosen the best possible subject line.
I can't give you an existing program, but I can tell you how to approach writing one.
There is no canonical minimum regex form and determining the true minimum size regex is NP hard. Certainly your sets are finite, so this may be a simpler problem. I'll have to think about it.
But a good heuristic algorithm would be:
Construct a trivial non-deterministic finite automaton (NFA) that accepts all your strings.
Convert the NFA to a deterministic finite automaton (DFA) with the subset construction.
Minimize the DFA with the standard algorithm.
Use the construction from the proof of Kleene's theorem to get to a regex.
Note that step 3 does give you a unique minimum DFA. That would probably be the best way to encode your string sets.

Java regular expression for checking inputed names

I know regular expressions aren't necessarily the best tool for the job, but I was wondering whether this would be possible with Java regexen:
Let's say I have a data set with names separated by line breaks like so:
John Doe
Jane Roe
Richard Miles
(there are naturally a lot more names in the actual system)
I'll be reading in data where I'll get both the first and the last name separately, but they won't necessarily be in the same order.
Now, the question is, is there any way to construct a regex for, say, Richard Miles that would match both "Miles Richard" and "Richard Miles"? I know there are plenty of other ways to do this, but I'm specifically looking for a regex-based solution (not because it's necessarily practical, but I'd find it interesting)
Edit for clarification: what I mean is that I need a regex for, say, "Richard Miles" that will match on both "Richard Miles" and "Miles Richard", and preferably not just (Richard Miles|Miles Richard) because where's the fun in that?
This is in no way supposed to be practical, I'm merely interested whether regexen can do something like this.
Does it need to be complex and clever? I mean this would work.
\b(Miles Richard|Richard Miles)\b
Yes, take a look at this: -
^(\\w+)\\s(\\w+)$
It will match a word at the beginning(^\\w+), followed by a space (\\s), followed by another word at the end(\\w+$)
Are you looking for this only?

Regex unordered matches

This feels like it should be an extremely simple thing to do with regex but I can't quite seem to figure it out.
I would like to write a regex which checks to see if a list of certain words appear in a document, in any order, along with any of a set of other words in any order.
In boolean logic the check would be:
If allOfTheseWords are in this text and atLeastOneOfTheseWords are in this text, return true.
Example
I'm searching for (john and barbara) with (happy or sad).
Order does not matter.
"Happy birthday john from barbara" => VALID
"Happy birthday john" => INVALID
I simply cannot figure out how to get the and part to match in an orderless way, any help would be appreciated!
You don't really want to use a regex for this unless the text is very small, which from your description I doubt.
A simple solution would be to dump all the words into a HashSet, at which point checking to see if a word is present becomes a very quick and easy operation.
If you want to do it with regex, I'd try positive lookahead:
// searching for (john and barbara) with (happy or sad)
"^(?=.*\bjohn\b)(?=.*\bbarbara\b).*\b(happy|sad)\b"
The performance should be comparable to doing a full text search for each of the words in the allOfTheseWords group separately.
If you really need a single regex, then it would be very large and very slow due to backtracking. For your particular example of (John AND Barbara) AND (Happy or Sad), it would start like this:
\bJohn\b.*?\bBarbara\n.*?\bHappy\b|\bJohn\b.*?\bBarbara\n.*?\bSad\b|......
You'd ultimately need to put all combinations in the regex. Something like:
JBH, JBS, JHB, JSB, HJB, SJB, BJH, BJS, BHJ, BSJ, HBJ, SBJ
Again backtracking would be prohibitive, as would the explosion in the number of cases. Stay away from regexes here.
With your example, this is a regex that may help you :
Regex
(?:happy|sad).*?john.*?barbara|
(?:happy|sad).*?barbara.*?john|
barbara.*?john.*?(?:happy|sad)|
john.*?barbara.*?(?:happy|sad)|
barbara.*?(?:happy|sad).*?john|
john.*?(?:happy|sad).*?barbara
Output
happy birthday john from barbara => Matched
Happy birthday john => Not matched
As mentionned in other responses, a regex may not be well suited here.
It might be possible to do it with regexp, but it would be so complicated that it's better to use some different way (for example using a HashSet, as mentioned in the other answers).
One way to do it with regex would be to calculate all the permutations of the words which you are looking for, and then write a regex which mentions all those permutations. With 2 words there would be 2 permutations, as in (.*foo.*bar.*)|(.*bar.*foo.*) (plus word boundaries), with 3 words there would be 6 permutations, and quite soon the number of permutations would be larger than your input file.
If your data is relatively constant, and you are planning on searching a lot, using Apache Lucene will ensure better peformance.
Using information retrieval techniques, you will first index all your documents/sentences, and then search for your words, in your example you would want to search for "+(+john +barbara) +(sad happy)" [or "(john AND barbarar) AND (sad OR HAPPY)" ]
this approach will consume some time when indexing, however, searching will be much faster then any regex/hashset approach (since you don't need to iterate over all documents...)

Fuzzy string search in Java, including word swaps

I am a Java beginner, trying to write a program that will match an input to a list of predefined strings. I have looked at Levenshtein distance, but I have come to problems such as this:
If I have an input such as "fillet of beef" I want it to be matched to "beef fillet". The problem is that "fillet of beef" is closer, according to Levenshtein distance, to something like "fillet of tuna", which of course is wrong.
Should I be using something like Lucene for this? Does one use Lucene methods within a Java class?
Thanks!
You need to compute the relevance of your search terms to the input strings. Lucene does have relevance calculations built in, and this article might be a good start to understanding them (I just scanned it, but it seems reasonably authoritative).
The basic process is this:
Initialization: tokenize your search terms, and store them in a series of HashSets, one per term. Or, if you want to give different weights to each word, use HashMap where the word is the key.
Processing: tokenize each input string, and probe each of the sets of search terms to determine how closely they apply to the input. See above for a description of algorithms.
There's an easy trick to handle misspellings: during initialization, you create sets containing potential misspellings of the search terms. Peter Norvig's post on "How to Write a Spelling Corrector" describes this process (it uses Python code, but a Java implementation is certainly possible).
Lucene does support fuzzy search based on Levenshtein distance.
https://lucene.apache.org/java/2_4_0/queryparsersyntax.html#Fuzzy%20Searches
But lucene is meant to search on set of documents rather than string search, so lucene might be an overkill for you. There are other Java implementation available. Take a look at http://www.merriampark.com/ldjava.htm
It should be possible to apply the Levenshtein distance to words, not characters. Then, to match words, you could again apply Levenshtein on the character level, so that "filet" in "filet of beef" should match "fillet" in "beef fillet".

Categories