This feels like it should be an extremely simple thing to do with regex but I can't quite seem to figure it out.
I would like to write a regex that checks whether all of a given list of words appear in a document, in any order, along with at least one of a second set of words.
In boolean logic the check would be:
If allOfTheseWords are in this text and atLeastOneOfTheseWords are in this text, return true.
Example
I'm searching for (john and barbara) with (happy or sad).
Order does not matter.
"Happy birthday john from barbara" => VALID
"Happy birthday john" => INVALID
I simply cannot figure out how to get the "and" part to match in an orderless way. Any help would be appreciated!
You don't really want to use a regex for this unless the text is very small, which from your description I doubt.
A simple solution would be to dump all the words into a HashSet, at which point checking to see if a word is present becomes a very quick and easy operation.
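A minimal sketch of the HashSet idea, assuming the text can be lower-cased and split on runs of non-letters (a simplification; real tokenization may need more care). The class and method names are hypothetical, and the search words are assumed to be passed in lower case:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class WordSetCheck {
    // True when every word in allOf and at least one word in anyOf occurs in text.
    // Assumption: allOf and anyOf are already lower case.
    static boolean matches(String text, List<String> allOf, List<String> anyOf) {
        // Lower-case the text and split on runs of non-letters.
        Set<String> words = new HashSet<>(
                Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")));
        if (!words.containsAll(allOf)) {
            return false;
        }
        for (String candidate : anyOf) {
            if (words.contains(candidate)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> allOf = Arrays.asList("john", "barbara");
        List<String> anyOf = Arrays.asList("happy", "sad");
        System.out.println(matches("Happy birthday john from barbara", allOf, anyOf));
        System.out.println(matches("Happy birthday john", allOf, anyOf));
    }
}
```

Building the set costs one pass over the text; each lookup after that is a constant-time hash check.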
If you want to do it with regex, I'd try positive lookahead:
// searching for (john and barbara) with (happy or sad)
"^(?=.*\bjohn\b)(?=.*\bbarbara\b).*\b(happy|sad)\b"
The performance should be comparable to doing a full text search for each of the words in the allOfTheseWords group separately.
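For reference, a sketch of how the lookahead pattern might be used from Java. Two things here are assumptions rather than part of the answer: the backslashes are doubled because this is a Java string literal, and the CASE_INSENSITIVE flag is added so that "Happy" in the example matches "happy". The class name is hypothetical.

```java
import java.util.regex.Pattern;

class LookaheadCheck {
    // Each lookahead asserts that a required word occurs somewhere in the text;
    // the remainder of the pattern then needs one of the optional words.
    private static final Pattern PATTERN = Pattern.compile(
            "^(?=.*\\bjohn\\b)(?=.*\\bbarbara\\b).*\\b(?:happy|sad)\\b",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static boolean matches(String text) {
        return PATTERN.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(matches("Happy birthday john from barbara"));
        System.out.println(matches("Happy birthday john"));
    }
}
```

DOTALL lets `.` cross line breaks, which matters if the document spans several lines.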
If you really need a single regex, then it would be very large and very slow due to backtracking. For your particular example of (John AND Barbara) AND (Happy or Sad), it would start like this:
\bJohn\b.*?\bBarbara\b.*?\bHappy\b|\bJohn\b.*?\bBarbara\b.*?\bSad\b|......
You'd ultimately need to put all combinations in the regex. Something like:
JBH, JBS, JHB, JSB, HJB, SJB, BJH, BJS, BHJ, BSJ, HBJ, SBJ
Again backtracking would be prohibitive, as would the explosion in the number of cases. Stay away from regexes here.
With your example, this regex may help:
Regex
(?:happy|sad).*?john.*?barbara|
(?:happy|sad).*?barbara.*?john|
barbara.*?john.*?(?:happy|sad)|
john.*?barbara.*?(?:happy|sad)|
barbara.*?(?:happy|sad).*?john|
john.*?(?:happy|sad).*?barbara
Output
happy birthday john from barbara => Matched
Happy birthday john => Not matched
As mentioned in other responses, a regex may not be well suited here.
It might be possible to do it with regexp, but it would be so complicated that it's better to use some different way (for example using a HashSet, as mentioned in the other answers).
One way to do it with regex would be to calculate all the permutations of the words which you are looking for, and then write a regex which mentions all those permutations. With 2 words there would be 2 permutations, as in (.*foo.*bar.*)|(.*bar.*foo.*) (plus word boundaries), with 3 words there would be 6 permutations, and quite soon the number of permutations would be larger than your input file.
If your data is relatively constant, and you are planning on searching a lot, using Apache Lucene will ensure better performance.
Using information retrieval techniques, you first index all your documents/sentences and then search for your words. In your example you would search for "+(+john +barbara) +(sad happy)" [or "(john AND barbara) AND (sad OR happy)"].
This approach takes some time when indexing; however, searching will be much faster than any regex/HashSet approach (since you don't need to iterate over all documents...).
I'm trying to select the top 3 strings that contain the most matches.
I'll explain it like this:
assume that we have the following keywords: "pc, programming, php, java"
and the following sentences:
a[0]="what is java??"
a[1]="I love playing and programming on pc"
a[2]="I'm good at programming php and java"
a[3]="I'm programming php and java on my pc"
So only the last 3 strings must be selected, because they are the top 3 strings containing the most matches.
How can I do this in Java?
If your dataset is small and you only care about exact matches, you could do something like the following:
Loop over each of your sentences performing an indexOf check for each keyword. If this returns something that isn't -1 then increment a counter for that sentence. Repeat for each keyword. At the end find the 3 sentences that have the highest counter.
This approach will have all kinds of issues though including things such as:
Case sensitivity (indexOf is case-sensitive)
Keywords matching partial words, e.g. "java" matching "javascript"
Ideally you would use a full-text engine like Lucene/Solr/ElasticSearch and let that do all the heavy lifting for you.
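The indexOf loop described above could be sketched as follows. The class and method names are hypothetical. Lower-casing both sides is an added assumption that addresses the case issue; the partial-word issue is left as is.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

class KeywordRanker {
    // Counts how many distinct keywords occur in the sentence (substring match).
    static int countMatches(String sentence, List<String> keywords) {
        String lower = sentence.toLowerCase(Locale.ROOT);
        int count = 0;
        for (String keyword : keywords) {
            if (lower.indexOf(keyword.toLowerCase(Locale.ROOT)) != -1) {
                count++;
            }
        }
        return count;
    }

    // Returns up to n sentences, highest keyword count first.
    static List<String> top(List<String> sentences, List<String> keywords, int n) {
        List<String> sorted = new ArrayList<>(sentences);
        // List.sort is stable, so sentences with equal counts keep their input order.
        sorted.sort(Comparator.comparingInt((String s) -> countMatches(s, keywords)).reversed());
        return sorted.subList(0, Math.min(n, sorted.size()));
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("pc", "programming", "php", "java");
        List<String> sentences = Arrays.asList(
                "what is java??",
                "I love playing and programming on pc",
                "I'm good at programming php and java",
                "I'm programming php and java on my pc");
        top(sentences, keywords, 3).forEach(System.out::println);
    }
}
```

Note that this recomputes the counts during sorting, which is fine for a small dataset but wasteful for a large one.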
Arguably the easiest method would be to use Regex, an expression based system which searches for patterns within strings.
Pick up a website which teaches Regex. I suggest this one for starters.
http://regexone.com/
Afterwards, familiarize yourself with Java Regex. I suggest looking into capture groups.
I will not give you code to do this, because I believe there are many online examples you can look at, and it is in your best interest to learn how to do this by yourself.
This is one of those questions that has been asked and answered hundreds of times over, but I'm having a hard time adapting other solutions to my needs.
In my Java application I have a method for censoring bad words in chat messages. It works for most of my words, but there is one particular (and popular) curse word that I can't seem to get rid of. The word is "faen" (which is simply modern slang for "satan", in the language in question).
Using the pattern "fa+e+n" for matching multiple A's and E's actually works; however, in this language, the word for "that couch" or "that sofa" is "sofaen". I've tried a lot of different approaches, using variations of [^so] and (?!=so), but so far I haven't been able to find a way to match one and not the other.
The real goal here is to be able to match the bad words, regardless of the number of vowels, and regardless of any non-letters in between the components of the word.
Here's a few examples of what I'm trying to do:
"String containing faen" Should match
"String containing sofaen" Should not match
"Non-letter-censored string with f-a#a-e.n" Should match
"Non-letter-censored string with sof-a#a-e.n" Should not match
Any tips to set me off in the right direction on this?
You want something like \bf[^\s]*a[^\s]*e[^\s]*n\b. Note that this is the regular expression; if you want the Java string literal then you need to double the backslashes: \\bf[^\\s]*a[^\\s]*e[^\\s]*n\\b.
Note also that this isn't perfect, but does handle the situations that you have suggested.
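A stricter variant, offered as a sketch rather than a finished filter: it allows repeated a's and e's, allows only non-letter, non-space characters between the letters, and uses lookarounds on \p{L} instead of \b so that a preceding letter (as in "sofaen") blocks the match. The class name is hypothetical, and the case-insensitive flags are an assumption.

```java
import java.util.regex.Pattern;

class CensorFilter {
    // (?<!\p{L})      no letter directly before the f
    // [^\p{L}\s]*     filler such as '-', '#', '.', or digits
    // (?:a...)+ etc.  one or more a's, then one or more e's, each with optional filler
    // (?!\p{L})       no letter directly after the n
    private static final Pattern FAEN = Pattern.compile(
            "(?<!\\p{L})f[^\\p{L}\\s]*(?:a[^\\p{L}\\s]*)+(?:e[^\\p{L}\\s]*)+n(?!\\p{L})",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    static boolean containsBadWord(String message) {
        return FAEN.matcher(message).find();
    }

    public static void main(String[] args) {
        System.out.println(containsBadWord("String containing faen"));
        System.out.println(containsBadWord("String containing sofaen"));
        System.out.println(containsBadWord("Non-letter-censored string with f-a#a-e.n"));
        System.out.println(containsBadWord("Non-letter-censored string with sof-a#a-e.n"));
    }
}
```

This still misses variations such as a doubled "f" or letters swapped for look-alikes, so it covers the four examples from the question and not much more.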
It's a terrible idea to begin with. You think, your users would write something like "f-aeen" to avoid your filter but would not come up with "ffaen" or "-faen" or whatever variation that you did not prepare for? This is a race you cannot win and the real loser is usability.
Let's say I have about 1000 sentences that I want to offer as suggestions when the user is typing into a field.
I was thinking about running a Lucene in-memory search and then feeding the results into the suggestions set.
The trigger for running the searches would be the space character and exit from the input field.
I intend to use this with GWT, so the client will just be getting the results from the server.
I don't want to do what Google is doing, where they complete each word and then make suggestions on each set of keywords. I just want to check the keywords and make suggestions based on that. Sort of like when I'm typing the title for the question here on Stack Overflow.
Has anyone done something like this before? Is there already a library I could use?
I was working on a similar solution. The paper titled Effective Phrase Prediction was quite helpful for me. You will have to prioritize the suggestions as well.
If you've only got 1000 sentences, you probably don't need a powerful indexer like Lucene. I'm not sure whether you want to do "complete the sentence" suggestions or "suggest other queries that have the same keywords" suggestions. Here are solutions to both:
Assuming that you want to complete the sentence input by the user, then you could put all of your strings into a SortedSet, and use the tailSet method to get a list of strings that are "greater" than the input string (since the string comparator considers a longer string A that starts with string B to be "greater" than B). Then, iterate over the top few entries of the set returned by tailSet to create a set of strings where the first inputString.length() characters match the input string. You can stop iterating as soon as the first inputString.length() characters don't match the input string.
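A rough sketch of the tailSet approach, assuming a TreeSet as the SortedSet and lower-cased strings. The class name, method name, and `limit` parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

class PrefixSuggester {
    private final SortedSet<String> sentences = new TreeSet<>();

    PrefixSuggester(List<String> input) {
        for (String s : input) {
            sentences.add(s.toLowerCase(Locale.ROOT));
        }
    }

    // Returns up to `limit` stored sentences that start with the typed text.
    List<String> complete(String typed, int limit) {
        String prefix = typed.toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        // tailSet yields everything >= prefix in sorted order, so all strings
        // sharing the prefix come first and are contiguous.
        for (String candidate : sentences.tailSet(prefix)) {
            if (out.size() >= limit || !candidate.startsWith(prefix)) {
                break;
            }
            out.add(candidate);
        }
        return out;
    }

    public static void main(String[] args) {
        PrefixSuggester suggester = new PrefixSuggester(Arrays.asList(
                "How to sort a list", "How to split a string", "What is Java"));
        System.out.println(suggester.complete("how to s", 5));
    }
}
```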
If you want to do keyword suggestions instead of "complete the sentence" suggestions, then the overhead depends on how long your sentences are, and how many unique words there are in the sentences. If this set is small enough, you'll be able to get away with a HashMap<String,Set<String>>, where you mapped keywords to the sentences that contained them. Then you could handle multiword queries by intersecting the sets.
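The keyword map could look roughly like this. It is a sketch under the assumption that splitting on non-alphanumeric characters is an acceptable tokenizer; the class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class KeywordIndex {
    // Maps each lower-cased keyword to the sentences that contain it.
    private final Map<String, Set<String>> index = new HashMap<>();

    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
    }

    void add(String sentence) {
        for (String word : tokenize(sentence)) {
            if (!word.isEmpty()) {
                index.computeIfAbsent(word, k -> new HashSet<>()).add(sentence);
            }
        }
    }

    // Returns the sentences containing every word of the query (set intersection).
    Set<String> search(String query) {
        Set<String> result = null;
        for (String word : tokenize(query)) {
            if (word.isEmpty()) {
                continue;
            }
            Set<String> hits = index.getOrDefault(word, Collections.emptySet());
            if (result == null) {
                result = new HashSet<>(hits);
            } else {
                result.retainAll(hits);
            }
        }
        return result == null ? Collections.emptySet() : result;
    }
}
```

Copying the first hit set before calling retainAll keeps the index itself from being modified by a search.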
In both cases, I'd probably convert all strings to lower case first (assuming that's appropriate in your application). I don't think either solution would scale to hundreds of thousands of suggestions either. Do either of those do what you want? Happy to provide code if you'd like it.
I accidentally answered a question where the original problem involved splitting a sentence into separate words.
The author suggested using BreakIterator to tokenize input strings, and some people liked this idea.
I just don't get that madness: how can 25 lines of complicated code be better than a simple one-liner with a regexp?
Please explain to me the pros of using BreakIterator and the real cases when it should be used.
If it's really so cool and proper then I wonder: do you really use the approach with BreakIterator in your projects?
From looking at the code posted at that answer, it looks like BreakIterator takes into consideration the language and locale of the text. Getting that level of support via regex will surely be a considerable pain. Perhaps that is the main reason it is preferred over a simple regex?
The BreakIterator gives some nice explicit control and iterates cleanly in a nested way over each sentence and word. I'm not familiar with exactly what specifying the locale does for you, but I'm sure it's quite helpful sometimes as well.
It didn't strike me as complicated at all. Just set up one iterator for the sentence level, another for the word level, and nest the word one inside the sentence one.
If the problem changed into something different the solution you had on the other question might've just been out the window. However, that pattern of iterating through sentences and words can do a lot.
Find the sentence where any word occurs the most repeated times. Output it along with that word
Find the word used most times throughout the whole string.
Find all words that occur in every sentence
Find all words that occur a prime number of times in 2 or more sentences
The list goes on...
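The nested sentence/word iteration might look like the sketch below, which uses java.text.BreakIterator. The class and method names are hypothetical, and filtering tokens by whether they start with a letter or digit is an assumption about what counts as a "word".

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceWordIterator {
    // Outer loop: one entry per sentence, each holding that sentence's words.
    static List<List<String>> wordsBySentence(String text, Locale locale) {
        List<List<String>> result = new ArrayList<>();
        BreakIterator sentences = BreakIterator.getSentenceInstance(locale);
        sentences.setText(text);
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE;
                start = end, end = sentences.next()) {
            result.add(words(text.substring(start, end), locale));
        }
        return result;
    }

    // Inner loop: the word iterator also reports spaces and punctuation as
    // tokens, so keep only tokens that begin with a letter or digit.
    static List<String> words(String sentence, Locale locale) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(sentence);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = sentence.substring(start, end);
            if (Character.isLetterOrDigit(token.codePointAt(0))) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(wordsBySentence("Hello world. Goodbye world.", Locale.US));
    }
}
```

Each of the tasks listed above then becomes a matter of counting or intersecting over the nested lists.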
InputString: A soldier may have bruises , wounds , marks , dislocations or other Injuries that hurt him .
ExpectedOutput:
bruises
wounds
marks
dislocations
Injuries
Generalized Pattern Tried:
".[\s]?(\w+?)"+ // bruises.
"(?:(\s)?,(\s)?(\w+?))*"+ // wounds marks dislocations
"[\s]?(?:or|and) other (\w+)."; // Injuries
The pattern should be able to match other input strings like: A soldier may have bruiser or other injuries that hurt him.
On trying the generalized pattern above, the output is:
bruises
dislocations
Injuries
There is something wrong with the capturing group for "(?:(\s)?,(\s)?(\w+?))*". The group matches more than once, but it returns only "dislocations"; "wounds" and "marks" are devoured.
Could you please suggest what should be the right pattern, and where is the mistake?
This question comes closest to this question, but that solution didn't help.
Thanks.
When the capture group is annotated with a quantifier [i.e. (foo)*] then you will only get the last match. If you want to get all of them then you need to put the quantifier inside the capture, and then you will have to manually parse out the values. As big a fan as I am of regex, I don't think it's appropriate here for any number of reasons... even if you weren't ultimately doing NLP.
How to fix: (?:(\s)?,(\s)?(\w+?))*
Well, the quantifier basically covers the whole regex in that case and you might as well use Matcher.find() to step through each match. Also, I'm curious why you have capture groups for the whitespace. If all you are trying to do is find a comma-separated set of words then that's something like: \w+(?:\s*,\s*\w+)* Then don't bother with capture groups and just split the whole match.
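As a sketch of that advice: capture the whole comma-separated run in one group, then split it. The trailing "or/and other" part is taken from the pattern in the question, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ListExtractor {
    // Group 1: the whole comma-separated run (quantifier inside the capture).
    // Group 2: the word after "or other" / "and other".
    private static final Pattern LIST = Pattern.compile(
            "(\\w+(?:\\s*,\\s*\\w+)*)\\s+(?:or|and)\\s+other\\s+(\\w+)");

    static List<String> extract(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = LIST.matcher(text);
        while (m.find()) {
            // Split the captured run on commas with optional surrounding spaces.
            out.addAll(Arrays.asList(m.group(1).split("\\s*,\\s*")));
            out.add(m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "A soldier may have bruises , wounds , marks , dislocations "
                + "or other Injuries that hurt him .";
        extract(input).forEach(System.out::println);
    }
}
```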
And for anything more complicated re: NLP, GATE is a pretty powerful tool. The learning curve is steep at times but you have a whole industry of science-guys to draw from: http://gate.ac.uk/
Regex is not suited for (natural) language processing. With regex, you can only match well-defined patterns. You should really, really abandon the idea of doing this with regex.
You may want to start a new question where you specify what programming language you're using to perform this task and ask for pointers there.
EDIT
PSpeed posted a promising link to a 3rd party library, Gate, that's able to do many language processing tasks. And it's written in Java. I have not used it myself, but looking at the people/institutions working on it, it seems pretty solid.
The pattern that works is: \w+(?:\s*,\s*\w+)* and then manually separate CSV
There is no other method to do this with Java Regex.
Ideally, Java regex is not suitable for NLP. A useful tool for text mining is: gate.ac.uk
Thanks to Bart K. and PSpeed.