Sentence Auto-Complete with Java

Let's say I have about 1000 sentences that I want to offer as suggestions while a user is typing into a field.
I was thinking about running an in-memory Lucene search and then feeding the results into the suggestion set.
The trigger for running the searches would be the space character and exit from the input field.
I intend to use this with GWT, so the client will just be getting the results from the server.
I don't want to do what Google does, where they complete each word and then make suggestions on each set of keywords. I just want to check the keywords and make suggestions based on that, sort of like when I'm typing the title for a question here on Stack Overflow.
Has anyone done something like this before? Is there already a library I could use?

I was working on a similar solution. The paper titled Effective Phrase Prediction was quite helpful for me. You will have to prioritize the suggestions as well.

If you've only got 1000 sentences, you probably don't need a powerful indexer like Lucene. I'm not sure whether you want to do "complete the sentence" suggestions or "suggest other queries that have the same keywords" suggestions. Here are solutions to both:
Assuming that you want to complete the sentence input by the user, then you could put all of your strings into a SortedSet, and use the tailSet method to get a list of strings that are "greater" than the input string (since the string comparator considers a longer string A that starts with string B to be "greater" than B). Then, iterate over the top few entries of the set returned by tailSet to create a set of strings where the first inputString.length() characters match the input string. You can stop iterating as soon as the first inputString.length() characters don't match the input string.
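A minimal sketch of that tailSet approach (the class and method names here are just illustrative):

import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class PrefixSuggester {
    private final TreeSet<String> sentences = new TreeSet<>();

    public void add(String sentence) {
        sentences.add(sentence.toLowerCase());
    }

    // Returns up to maxResults sentences starting with the given prefix.
    public List<String> suggest(String input, int maxResults) {
        String prefix = input.toLowerCase();
        List<String> results = new ArrayList<>();
        for (String candidate : sentences.tailSet(prefix)) {
            // tailSet is sorted, so we can stop at the first non-match.
            if (!candidate.startsWith(prefix) || results.size() == maxResults) {
                break;
            }
            results.add(candidate);
        }
        return results;
    }
}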
If you want to do keyword suggestions instead of "complete the sentence" suggestions, then the overhead depends on how long your sentences are and how many unique words there are in them. If this set is small enough, you'll be able to get away with a HashMap<String, Set<String>> that maps keywords to the sentences that contain them. Then you can handle multiword queries by intersecting the sets.
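A rough sketch of that keyword map, under the same caveats:

import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class KeywordIndex {
    private final Map<String, Set<String>> index = new HashMap<>();

    public void add(String sentence) {
        for (String word : sentence.toLowerCase().split("\\W+")) {
            index.computeIfAbsent(word, k -> new HashSet<>()).add(sentence);
        }
    }

    // Intersects the sentence sets of all words in the query.
    public Set<String> search(String query) {
        Set<String> result = null;
        for (String word : query.toLowerCase().split("\\W+")) {
            Set<String> matches = index.getOrDefault(word, Collections.emptySet());
            if (result == null) {
                result = new HashSet<>(matches);
            } else {
                result.retainAll(matches);
            }
        }
        return result == null ? Collections.emptySet() : result;
    }
}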
In both cases, I'd probably convert all strings to lower case first (assuming that's appropriate in your application). Note that neither solution would scale to hundreds of thousands of suggestions. Does either of those do what you want? Happy to provide more complete code if you'd like it.

Related

Configuring the tokenisation of the search term in an elasticsearch query

I am doing a general search against Elasticsearch (1.7) using a match query against a number of specified fields. This is done in a Java app with one box to enter search terms in. Various search options are allowed (for example, surrounding a phrase with quotes to look for the phrase, not the component words). This means I am doing full-text searches.
All is well except my account refs have forward slashes in them and a search on an account ref produces thousands of results. If I surround the account ref with quotes I get just the result I want. I assume an account ref of AC/1234/A01 is searching for [AC OR 1234 OR A01]. Initially I thought this was a regex issue but I don’t think it is.
I raised a similar question a while ago and one suggestion which I had thought worked was to add "analyzer": "keyword" to the query (in my code
queryStringQueryBuilder.analyzer("keyword")
).
The problem with this is that many of the other fields searched are not keyword fields, and it stops a lot of flexible search options from working (case sensitivity, etc.). I assume it has become something along the lines of an exact match in the text search.
I've been looking at this the wrong way around for a while now. As I see it, I can't fix it in the index or even in the general analyser settings: even if the account ref field is tokenised and analysed perfectly for my requirement, the search will still search all the other fields for [AC OR 1234 OR A01].
Is there a way of configuring the search query not to split the account number on forward slashes? I could test ignoring all punctuation if it is possible to split only on whitespace, although I would prefer not to make such a radical change...
So I guess what I am asking is whether there is another built-in analyzer which would still do a full-text search but would not split the search term on punctuation. If not, is this something I could do with a custom analyzer (without applying it to the index itself)?
Thanks.
The simplest way to do it is by replacing / with some character that doesn't cause the word to be split into two tokens but doesn't interfere with your other terms (_, ., and ' should work), or by removing / completely using a mapping char filter. There is a similar example here: https://stackoverflow.com/a/23640832/783043
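If you control both what gets indexed and what gets searched, the same substitution can also be approximated in application code instead of in an analyzer. This is a hedged sketch, not an Elasticsearch API call; it relies on the standard tokenizer keeping underscores inside alphanumeric runs, which is why _ works:

// Apply the same substitution at index time and at query time so the
// standard tokenizer no longer splits account refs like AC/1234/A01.
public final class RefNormalizer {
    private RefNormalizer() {}

    public static String normalize(String text) {
        return text.replace('/', '_'); // AC/1234/A01 -> AC_1234_A01, one token
    }
}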

How to select strings with the most keywords matches?

I'm trying to select the top 3 strings that contain the most matches.
I'll explain it like this:
Assume that we have the following keywords: "pc, programming, php, java"
and the following sentences:
a[0]="what is java??"<br>
a[1]="I love playing and programming on pc"<br>
a[2]="I'm good at programming php and java"<br>
a[3]="I'm programming php and java on my pc"<br>
So only the last 3 strings should be selected, because they are the top 3 strings containing the most matches.
How can I do this in Java?
If your dataset is small and you only care about exact matches, you could do something like the following:
Loop over each of your sentences, performing an indexOf check for each keyword. If it returns something other than -1, increment a counter for that sentence. At the end, pick the 3 sentences with the highest counters.
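For example, a minimal sketch of that counting loop (exact, case-sensitive matching, so the caveats below still apply; the names are illustrative):

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TopMatches {
    // Returns the top-n sentences ranked by how many keywords they contain.
    static List<String> topMatches(String[] sentences, String[] keywords, int n) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String sentence : sentences) {
            int count = 0;
            for (String keyword : keywords) {
                if (sentence.indexOf(keyword) != -1) {
                    count++;
                }
            }
            counts.put(sentence, count);
        }
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String[] a = {
            "what is java??",
            "I love playing and programming on pc",
            "I'm good at programming php and java",
            "I'm programming php and java on my pc"
        };
        System.out.println(topMatches(a, new String[] {"pc", "programming", "php", "java"}, 3));
    }
}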
This approach will have all kinds of issues, though, including things such as:
Case insensitivity
Tags matching partial words, e.g. "java" matching "javascript"
Ideally you would use a full text engine like Lucene/Solr/ElasticSearch and let that do all the heavy lifting for you
Arguably the easiest method would be to use regex, an expression-based system for searching for patterns within strings.
Pick up a website which teaches Regex. I suggest this one for starters.
http://regexone.com/
Afterwards, familiarize yourself with Java Regex. I suggest looking into capture groups.
I will not give you code to do this, because I believe there are many online examples you can look at, and it is in your best interest to learn how to do this by yourself.

Lucene: Mining email addresses, names, and identifiers from an index

I have a lucene index with approx. 1 million documents. From these documents, I want to mine
email addresses
signatures - ( [whitespace]/s/[whitespace]john doe[whitespace] )
specific identifiers from each of the documents (that follow a regex pattern "\s[0-9]{3}[a-zA-Z0-9]{6}\s").
I understand that this would ideally be done with Solr at index build time, where it's much easier, but how can one do it from an already-built Lucene index?
I am using Java. For the email address search, I tried .setAllowLeadingWildcard(true) and then searched for # to find all email addresses, but I actually got zero results. If I search for # in Luke I get zero results. If I search for #hotmail.com in Luke, I get a bunch of results with valid email addresses such as aaaaa#hotmail.com.
The index was created using StandardAnalyzer. Not sure if it matters, but the text is in UTF-8 I believe.
Any helpful suggestions or pointers are great! Note this is not for a front end, so the query doesn't have to be near-realtime.
Analysis does matter, yes. The standard analyzer will treat whitespace and punctuation, such as #, as a place to split input into tokens. As such, you wouldn't expect to see any of them actually present in the indexed data.
You can use Lucene's regex query, particularly for the third case. A PhraseQuery seems appropriate for the second, I think, though I'm more than slightly confused about what you are trying to accomplish there.
Generally, you might want to use a different analyzer for an email field, in order to keep each address as a single token. You should still get reasonable results searching for a particular email address, since, although the analyzer removes the punctuation, searching for the (usually) three tokens of an email consecutively in a phrase would be expected to find good matches. However, a regex search like \w*#\w*\.\w* won't be particularly effective, since the punctuation isn't actually indexed and searchable, and a regex search doesn't span multiple terms in the index. Apart from searching for a known set of email domains, or something of that nature, you would want to re-index using analysis more in line with how you need to search in order to do what you are asking.
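For the identifier pattern, a minimal sketch of a regex query against an already-built index (assuming a recent Lucene version; the field name "contents", the index path, and the assumption that the field is stored are all illustrative). Since RegexpQuery matches whole terms, and StandardAnalyzer lowercases and splits on whitespace, the \s anchors are unnecessary and the letter range only needs a-z:

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.RegexpQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;

public class IdentifierMiner {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader =
                DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Matches indexed terms like "123abc456" (already lowercased by StandardAnalyzer).
            RegexpQuery query = new RegexpQuery(new Term("contents", "[0-9]{3}[a-z0-9]{6}"));
            for (ScoreDoc hit : searcher.search(query, 1000).scoreDocs) {
                // Assumes "contents" was stored at index time.
                System.out.println(searcher.doc(hit.doc).get("contents"));
            }
        }
    }
}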

Java & Regex: Matching a substring that is not preceded by specific characters

This is one of those questions that has been asked and answered hundreds of times over, but I'm having a hard time adapting other solutions to my needs.
In my Java-application I have a method for censoring bad words in chat messages. It works for most of my words, but there is one particular (and popular) curse word that I can't seem to get rid of. The word is "faen" (which is simply a modern slang for "satan", in the language in question).
Using the pattern "fa+e+n" for matching multiple A's and E's actually works; however, in this language, the word for "that couch" or "that sofa" is "sofaen". I've tried a lot of different approaches, using variations of [^so] and (?!=so), but so far I haven't been able to find a way to match one and not the other.
The real goal here, is to be able to match the bad words, regardless of the number of vowels, and regardless of any non-letters in between the components of the word.
Here's a few examples of what I'm trying to do:
"String containing faen" Should match
"String containing sofaen" Should not match
"Non-letter-censored string with f-a#a-e.n" Should match
"Non-letter-censored string with sof-a#a-e.n" Should not match
Any tips to set me off in the right direction on this?
You want something like \bf[^\s]*a[^\s]*e[^\s]*n\b. The * (rather than +) lets the plain "faen" match too, and the leading \b rejects "sofaen", where the f is not at a word boundary. Note that this is the regular expression; as a Java string literal the backslashes need to be doubled: "\\bf[^\\s]*a[^\\s]*e[^\\s]*n\\b".
Note also that this isn't perfect, but does handle the situations that you have suggested.
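A quick sketch checking the four examples from the question:

import java.util.regex.Pattern;

public class CensorCheck {
    public static void main(String[] args) {
        Pattern bad = Pattern.compile("\\bf[^\\s]*a[^\\s]*e[^\\s]*n\\b",
                Pattern.CASE_INSENSITIVE);
        String[] samples = {
            "String containing faen",                      // should match
            "String containing sofaen",                    // should not match
            "Non-letter-censored string with f-a#a-e.n",   // should match
            "Non-letter-censored string with sof-a#a-e.n"  // should not match
        };
        for (String s : samples) {
            System.out.println(bad.matcher(s).find() + " <- " + s);
        }
    }
}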
It's a terrible idea to begin with. You think your users would write something like "f-aeen" to avoid your filter but would not come up with "ffaen" or "-faen" or whatever variation you did not prepare for? This is a race you cannot win, and the real loser is usability.

Regex unordered matches

This feels like it should be an extremely simple thing to do with regex but I can't quite seem to figure it out.
I would like to write a regex which checks to see if a list of certain words appear in a document, in any order, along with any of a set of other words in any order.
In boolean logic the check would be:
If allOfTheseWords are in this text and atLeastOneOfTheseWords are in this text, return true.
Example
I'm searching for (john and barbara) with (happy or sad).
Order does not matter.
"Happy birthday john from barbara" => VALID
"Happy birthday john" => INVALID
I simply cannot figure out how to get the "and" part to match in an orderless way; any help would be appreciated!
You don't really want to use a regex for this unless the text is very small, which from your description I doubt.
A simple solution would be to dump all the words into a HashSet, at which point checking to see if a word is present becomes a very quick and easy operation.
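A minimal sketch of that set-based check (splitting on non-word characters is an assumption about your tokenisation):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class WordCheck {
    // True if the text contains all required words and at least one optional word.
    static boolean matches(String text, Set<String> all, Set<String> any) {
        Set<String> words = new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
        if (!words.containsAll(all)) {
            return false;
        }
        for (String word : any) {
            if (words.contains(word)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> all = new HashSet<>(Arrays.asList("john", "barbara"));
        Set<String> any = new HashSet<>(Arrays.asList("happy", "sad"));
        System.out.println(matches("Happy birthday john from barbara", all, any)); // true
        System.out.println(matches("Happy birthday john", all, any));              // false
    }
}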
If you want to do it with regex, I'd try positive lookahead:
// searching for (john and barbara) with (happy or sad)
"^(?=.*\bjohn\b)(?=.*\bbarbara\b).*\b(happy|sad)\b"
The performance should be comparable to doing a full text search for each of the words in the allOfTheseWords group separately.
If you really need a single regex, then it would be very large and very slow due to backtracking. For your particular example of (John AND Barbara) AND (Happy or Sad), it would start like this:
\bJohn\b.*?\bBarbara\b.*?\bHappy\b|\bJohn\b.*?\bBarbara\b.*?\bSad\b|...
You'd ultimately need to put all combinations in the regex. Something like:
JBH, JBS, JHB, JSB, HJB, SJB, BJH, BJS, BHJ, BSJ, HBJ, SBJ
Again backtracking would be prohibitive, as would the explosion in the number of cases. Stay away from regexes here.
With your example, this is a regex that may help you:
Regex
(?:happy|sad).*?john.*?barbara|
(?:happy|sad).*?barbara.*?john|
barbara.*?john.*?(?:happy|sad)|
john.*?barbara.*?(?:happy|sad)|
barbara.*?(?:happy|sad).*?john|
john.*?(?:happy|sad).*?barbara
Output
happy birthday john from barbara => Matched
Happy birthday john => Not matched
As mentioned in other answers, a regex may not be well suited here.
It might be possible to do it with regexp, but it would be so complicated that it's better to use some different way (for example using a HashSet, as mentioned in the other answers).
One way to do it with regex would be to calculate all the permutations of the words which you are looking for, and then write a regex which mentions all those permutations. With 2 words there would be 2 permutations, as in (.*foo.*bar.*)|(.*bar.*foo.*) (plus word boundaries), with 3 words there would be 6 permutations, and quite soon the number of permutations would be larger than your input file.
If your data is relatively constant and you are planning on searching a lot, using Apache Lucene will give better performance.
Using information-retrieval techniques, you first index all your documents/sentences and then search for your words; in your example you would search for "+(+john +barbara) +(sad happy)" [or "(john AND barbara) AND (sad OR happy)"].
This approach takes some time when indexing; however, searching will be much faster than any regex/HashSet approach (since you don't need to iterate over all documents...).
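A minimal, self-contained sketch of that approach with a recent Lucene (8+) and an in-memory index (the field name "body" and the class name are illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class BooleanSearchDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory(); // in-memory index
        StandardAnalyzer analyzer = new StandardAnalyzer();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            for (String text : new String[] {
                    "Happy birthday john from barbara", "Happy birthday john"}) {
                Document doc = new Document();
                doc.add(new TextField("body", text, Field.Store.YES));
                writer.addDocument(doc);
            }
        }
        // "+john +barbara +(happy sad)" = john AND barbara AND (happy OR sad)
        Query q = new QueryParser("body", analyzer).parse("+john +barbara +(happy sad)");
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            for (ScoreDoc hit : searcher.search(q, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("body")); // only the first sentence
            }
        }
    }
}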
