I know regular expressions aren't necessarily the best tool for the job, but I was wondering whether this would be possible with Java regexen:
Let's say I have a data set with names separated by line breaks like so:
John Doe
Jane Roe
Richard Miles
(there are naturally a lot more names in the actual system)
I'll be reading in data where I'll get both the first and the last name separately, but they won't necessarily be in the same order.
Now, the question is, is there any way to construct a regex for, say, Richard Miles that would match both "Miles Richard" and "Richard Miles"? I know there are plenty of other ways to do this, but I'm specifically looking for a regex-based solution (not because it's necessarily practical, but I'd find it interesting)
Edit for clarification: what I mean is that I need a regex for, say, "Richard Miles" that will match on both "Richard Miles" and "Miles Richard", and preferably not just (Richard Miles|Miles Richard) because where's the fun in that?
This is in no way supposed to be practical, I'm merely interested whether regexen can do something like this.
Does it need to be complex and clever? I mean this would work.
\b(Miles Richard|Richard Miles)\b
Yes, take a look at this: -
^(\\w+)\\s(\\w+)$
It will match a word at the beginning(^\\w+), followed by a space (\\s), followed by another word at the end(\\w+$)
Are you looking for this only?
Related
I'm trying to select top 3 strings which contains the most matches..
I'll explain it like this:
assume that we have the following keywords: "pc, programming, php, java"
and the following sentences:
a[0]="what is java??"<br>
a[1]="I love playing and programming on pc"<br>
a[2]="I'm good at programming php and java"<br>
a[3]="I'm programming php and java on my pc"<br>
so only the last 3 strings must be selected cause they are the top 3 strings containing the most matches.
How to do this in java???
If your dataset is small and you only care about exact matches, you could do something like the following:
Loop over each of your sentences performing an indexOf check for each keyword. If this returns something that isn't -1 then increment a counter for that sentence. Repeat for each keyword. At the end find the 3 sentences that have the highest counter.
This approach will have all kinds of issues though including things such as:
Case insensitivity
Tags matching partial words, e.g. "java" matching "javascript"
Ideally you would use a full text engine like Lucene/Solr/ElasticSearch and let that do all the heavy lifting for you
Arguably the easiest method would be to use Regex, an expression based system which searches for patterns within strings.
Pick up a website which teaches Regex. I suggest this one for starters.
http://regexone.com/
Afterwards, familiarize yourself with Java Regex. I suggest looking into capture groups.
I will not give you code to do this, because I believe there are many online examples you can look at, and it is in your best interest to learn how to do this by yourself.
I'm having trouble searching for the right terms here to solve the below problem; I'm sure it's a done thing, I just can't find the right terms to express the problem!
I'm basically trying to create a classifier that will take word comparison outputs (e.g. some outputs from Levenstein distances) and decide whether the words are sufficiently different. An important input would probably be something like a soundex comparison. The trouble I'm having is creating the training set for the algorithm (an SVM in this case). I have a long list of names and I need to mutate them a bit (based on similar sounds within the word).
E.g. John and Jon would be a mutation to make, and I could label this in the test set as being equivalent. John and Johann have sufficiently different sound and letter distance to be considered different.
So I'm kinda asking for is a way to achieve a phoneme variation generator, but need to be able to retain the English lettering structure.
Even simple translation might suffice, like "f" could (sometimes) be replaced by "ph". I'm doing this in Java so any tips in that direction would be great too! Thanks.
EDIT
This is the closest I've come across so far: http://www.isi.edu/natural-language/people/hovy/papers/07IJCAI-spelling-variants.pdf
I'm just thinking aloud.
Rule-based: Apply a rule-based system where you could use standard substitution rules such as 'ph' for 'f', and insertion rules such as insert an h between a vowel and a consonant.
Character n-gram alignment:
Use a word alignment tool such as Giza++ to align character n-grams from parallel corpora such as Europarl. I guess you would be able to find interesting word spelling variations such as "house", "haus" etc. You can play with various values of n.
Bootstraping character n-gram alignment with rule-based: You might also want to use a combination of the two, in which you could, in principle, boost the probabilities of some alignments by using a set of external rules and heuristics.
This is one of those questions that has been asked and answered hundreds of times over, but I'm having a hard time adapting other solutions to my needs.
In my Java-application I have a method for censoring bad words in chat messages. It works for most of my words, but there is one particular (and popular) curse word that I can't seem to get rid of. The word is "faen" (which is simply a modern slang for "satan", in the language in question).
Using the pattern "fa+e+n" for matching multiple A's and E's actually works; however, in this language, the word for "that couch" or "that sofa" is "sofaen". I've tried a lot of different approaches, using variations of [^so] and (?!=so), but so far I haven't been able to find a way to match one and not the other.
The real goal here, is to be able to match the bad words, regardless of the number of vowels, and regardless of any non-letters in between the components of the word.
Here's a few examples of what I'm trying to do:
"String containing faen" Should match
"String containing sofaen" Should not match
"Non-letter-censored string with f-a#a-e.n" Should match
"Non-letter-censored string with sof-a#a-e.n" Should not match
Any tips to set me off in the right direction on this?
You want something like \bf[^\s]+a[^\s]+e[^\s]+n[^\s]\b. Note that this is the regular expression; if you want the Java then you need to use \\b[^\\s]+f[^\\s]+a[^\\s]+e[^\\s]+n[^\\s]\b.
Note also that this isn't perfect, but does handle the situations that you have suggested.
It's a terrible idea to begin with. You think, your users would write something like "f-aeen" to avoid your filter but would not come up with "ffaen" or "-faen" or whatever variation that you did not prepare for? This is a race you cannot win and the real loser is usability.
This feels like it should be an extremely simple thing to do with regex but I can't quite seem to figure it out.
I would like to write a regex which checks to see if a list of certain words appear in a document, in any order, along with any of a set of other words in any order.
In boolean logic the check would be:
If allOfTheseWords are in this text and atLeastOneOfTheseWords are in this text, return true.
Example
I'm searching for (john and barbara) with (happy or sad).
Order does not matter.
"Happy birthday john from barbara" => VALID
"Happy birthday john" => INVALID
I simply cannot figure out how to get the and part to match in an orderless way, any help would be appreciated!
You don't really want to use a regex for this unless the text is very small, which from your description I doubt.
A simple solution would be to dump all the words into a HashSet, at which point checking to see if a word is present becomes a very quick and easy operation.
If you want to do it with regex, I'd try positive lookahead:
// searching for (john and barbara) with (happy or sad)
"^(?=.*\bjohn\b)(?=.*\bbarbara\b).*\b(happy|sad)\b"
The performance should be comparable to doing a full text search for each of the words in the allOfTheseWords group separately.
If you really need a single regex, then it would be very large and very slow due to backtracking. For your particular example of (John AND Barbara) AND (Happy or Sad), it would start like this:
\bJohn\b.*?\bBarbara\n.*?\bHappy\b|\bJohn\b.*?\bBarbara\n.*?\bSad\b|......
You'd ultimately need to put all combinations in the regex. Something like:
JBH, JBS, JHB, JSB, HJB, SJB, BJH, BJS, BHJ, BSJ, HBJ, SBJ
Again backtracking would be prohibitive, as would the explosion in the number of cases. Stay away from regexes here.
With your example, this is a regex that may help you :
Regex
(?:happy|sad).*?john.*?barbara|
(?:happy|sad).*?barbara.*?john|
barbara.*?john.*?(?:happy|sad)|
john.*?barbara.*?(?:happy|sad)|
barbara.*?(?:happy|sad).*?john|
john.*?(?:happy|sad).*?barbara
Output
happy birthday john from barbara => Matched
Happy birthday john => Not matched
As mentionned in other responses, a regex may not be well suited here.
It might be possible to do it with regexp, but it would be so complicated that it's better to use some different way (for example using a HashSet, as mentioned in the other answers).
One way to do it with regex would be to calculate all the permutations of the words which you are looking for, and then write a regex which mentions all those permutations. With 2 words there would be 2 permutations, as in (.*foo.*bar.*)|(.*bar.*foo.*) (plus word boundaries), with 3 words there would be 6 permutations, and quite soon the number of permutations would be larger than your input file.
If your data is relatively constant, and you are planning on searching a lot, using Apache Lucene will ensure better peformance.
Using information retrieval techniques, you will first index all your documents/sentences, and then search for your words, in your example you would want to search for "+(+john +barbara) +(sad happy)" [or "(john AND barbarar) AND (sad OR HAPPY)" ]
this approach will consume some time when indexing, however, searching will be much faster then any regex/hashset approach (since you don't need to iterate over all documents...)
InputString: A soldier may have bruises , wounds , marks , dislocations or other Injuries that hurt him .
ExpectedOutput:
bruises
wounds
marks
dislocations
Injuries
Generalized Pattern Tried:
".[\s]?(\w+?)"+ // bruises.
"(?:(\s)?,(\s)?(\w+?))*"+ // wounds marks dislocations
"[\s]?(?:or|and) other (\w+)."; // Injuries
The pattern should be able to match other input strings like: A soldier may have bruiser or other injuries that hurt him.
On trying the generalized pattern above, the output is:
bruises
dislocations
Injuries
There is something wrong with the capturing group for "(?:(\s)?,(\s)?(\w+?))*". The capturing group has one more occurences.. but it returns only "dislocations". "marks" and "dislocation: are devoured.
Could you please suggest what should be the right pattern, and where is the mistake?
This question comes closest to this question, but that solution didn't help.
Thanks.
When the capture group is annotated with a quantifier [ie: (foo)*] then you will only get the last match. If you wanted to get all of them then you need to quantifier inside the capture and then you will have to manually parse out the values. As big a fan as I am of regex, I don't think it's appropriate here for any number of reasons... even if you weren't ultimately doing NLP.
How to fix: (?:(\s)?,(\s)?(\w+?))*
Well, the quantifier basically covers the whole regex in that case and you might as well use Matcher.find() to step through each match. Also, I'm curious why you have capture groups for the whitespace. If all you are trying to do is find a comma-separated set of words then that's something like: \w+(?:\s*,\s*\w+)* Then don't bother with capture groups and just split the whole match.
And for anything more complicated re: NLP, GATE is a pretty powerful tool. The learning curve is steep at times but you have a whole industry of science-guys to draw from: http://gate.ac.uk/
Regex in not suited for (natural) language processing. With regex, you can only match well defined patterns. You should really, really abandon the idea of doing this with regex.
You may want to start a new question where you specify what programming language you're using to perform this task and ask for pointers there.
EDIT
PSpeed posted a promising link to a 3rd party library, Gate, that's able to do many language processing tasks. And it's written in Java. I have not used it myself, but looking at the people/institutions working on it, it seems pretty solid.
The pattern that works is: \w+(?:\s*,\s*\w+)* and then manually separate CSV
There is no other method to do this with Java Regex.
Ideally, Java regex is not suitable for NLP. A useful tool for text mining is: gate.ac.uk
Thanks to Bart K. , and PSpeed.