Solr - Match sentence beginning with a particular word


Any tips on how this is done?
I've tried using the PatternTokenizerFactory, but it's not working as expected.
Is it possible to do this without writing a custom tokenizer?

You can tokenize the field in question using KeywordTokenizerFactory and then do a wildcard search:
http://solr.pl/en/2010/12/20/wildcard-queries-and-how-solr-handles-them/
This works provided that you are not doing any other operation on the field that is incompatible with that tokenizer.
Another, more roundabout, way is to create a copyField whose values have their spaces stripped out, using the following technique (or some other):
What is the regular expression to remove spaces in SOLR
You can then tokenize that copyField using WhitespaceTokenizerFactory (which produces a single token, since the copyField values contain no spaces) and do a wildcard search on it.
The second approach can produce false matches in some cases (e.g. "wor them" will match worth* once the spaces are stripped).
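To make the caveat about the second approach concrete, here is a minimal plain-Java sketch. It does not use Solr at all: the class and method names are made up for illustration, and startsWith only stands in for a prefix query such as worth* against a field indexed as a single token.

```java
class PrefixMatchDemo {
    // Stand-in for a prefix query against a field indexed as ONE token
    // (the KeywordTokenizerFactory approach). Not a Solr API call.
    static boolean keywordPrefixMatch(String fieldValue, String prefix) {
        return fieldValue.startsWith(prefix);
    }

    // Stand-in for the copyField approach: whitespace is stripped
    // before the value is indexed as a single token.
    static boolean strippedPrefixMatch(String fieldValue, String prefix) {
        String stripped = fieldValue.replaceAll("\\s+", "");
        return stripped.startsWith(prefix);
    }

    public static void main(String[] args) {
        System.out.println(keywordPrefixMatch("worth the wait", "worth")); // true
        System.out.println(keywordPrefixMatch("wor them", "worth"));       // false
        System.out.println(strippedPrefixMatch("wor them", "worth"));      // true: the false positive
    }
}
```

The last line is the unwanted match: once the space is gone, "wor them" becomes "worthem" and starts with "worth".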

Related

Lucene: searching for string matching a regex

I use Lucene to search for specific patterns using a regular expression. A new use case came up where I need to look up a specific string matching a regex pattern. A good example would be looking up prices in documents: prices can be written in many ways, so just looking for "1256.88" as stored in the database is not enough. The value in the document may have a currency in front of it, behind it, or not present at all ("EUR 1256,88", "1256,88 EUR" or just "1256,88"). The value may or may not have thousands separators. And of course these variations can be combined with each other. So I want to search for a specific, known price ("1256.88") that is part of a regex match at the same time. An example regex would be
[0-9]{1,10}*([\.|,][0-9]{0,2})?([\ ]?[€|$])?
What is the Lucene way of doing this? Is there a way to search with a regex AND an "example"?
Or do I have to search with a regex and then filter out wrong hits manually afterwards? How do I find out which strings triggered the match?

Combine multiple tokenizers in Solr

I'm trying to combine LetterTokenizerFactory with WhitespaceTokenizerFactory and not able to find how to do it without copying content using copyField.
Let me describe my idea:
I have two entries in text, e.g. H&M and Hewlett-Packard
User should be able to find H&M entering h&m - I use WhitespaceTokenizerFactory for this purpose, no need to split tokens on special chars
User should be able to find Hewlett-Packard entering 'packard' - LetterTokenizerFactory serves this case, tokens are split on special characters
Now I want to combine both of these tokenizers
How can I achieve it without declaring 2 different types with different tokenizer factories and then copying value to field with second type?

You can use the WhitespaceTokenizerFactory as the main tokenizer, and then add the WordDelimiterGraphFilter to split your tokens further up into smaller tokens.
From the example for the WordDelimiterGraphFilter (previously named WordDelimiterFilter, but that's deprecated now - so the name will depend on which Solr version you're using):
Non-alphanumeric characters (discarded): "hot-spot" -> "hot", "spot"
That would allow packard to match Hewlett-Packard. Be advised that this will also allow 'm' to match h&m, since you're splitting on non-alphanumeric characters. You can either use the protected setting for the filter to specify a list of words that should not be touched, or even better, if you want everything with & to remain untouched, use the types parameter to redefine what type & should be considered as.
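As a rough illustration of the idea (not the actual filter), here is a plain-Java sketch: split on whitespace first, then split each token on non-alphanumeric characters unless it is in a protected list. The class name, the lowercasing step and the ASCII-only splitting are my assumptions; the real WordDelimiterGraphFilter has more rules and options than this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class TokenSplitDemo {
    // Approximation only: whitespace tokenization followed by a split on
    // non-alphanumeric (ASCII) characters, skipping "protected" tokens.
    static List<String> tokenize(String text, Set<String> protectedWords) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // empty input yields one empty string from split()
            }
            if (protectedWords.contains(word)) {
                tokens.add(word); // left untouched, like a protected word
                continue;
            }
            for (String part : word.split("[^\\p{Alnum}]+")) {
                if (!part.isEmpty()) {
                    tokens.add(part);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Without protection, "h&m" is broken up, so "m" alone would match.
        System.out.println(tokenize("H&M Hewlett-Packard", Set.of()));
        // With "h&m" protected, only Hewlett-Packard is split.
        System.out.println(tokenize("H&M Hewlett-Packard", Set.of("h&m")));
    }
}
```

This sketch uses a protected list because it is the easiest to imitate in a few lines; redefining the type of & via the types parameter, as suggested above, avoids having to enumerate words.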

Configuring the tokenisation of the search term in an elasticsearch query

I am doing a general search against elasticsearch (1.7) using a match query against a number of specified fields. This is done in a Java app with one box to enter search terms in. Various search options are allowed (for example, surrounding a phrase with quotes to look for the phrase rather than the component words). This means I am doing full-text searches.
All is well except my account refs have forward slashes in them and a search on an account ref produces thousands of results. If I surround the account ref with quotes I get just the result I want. I assume an account ref of AC/1234/A01 is searching for [AC OR 1234 OR A01]. Initially I thought this was a regex issue but I don’t think it is.
I raised a similar question a while ago and one suggestion which I had thought worked was to add "analyzer": "keyword" to the query (in my code, queryStringQueryBuilder.analyzer("keyword")).
The problem with this is that many of the other fields searched are not keyword and it is stopping a lot of flexible search options working (case sensitivity etc). I assume this has become something along the lines of an exact match in the text search.
I've looked at this the wrong way around for a while now and as I see it I can't fix it in the index or even in the general analyser settings as even if the account ref field is tokenised and analysed perfectly for my requirement the search will still search all the other fields for [AC OR 1234 OR A01].
Is there a way of configuring the search query to not split the account number on forward slashes? I could test ignoring all punctuation if it is possible to only split by whitespaces although I would prefer not to make such a radical change...
So I guess what I am asking is whether there is another built-in analyzer which would still do a full-text search but would not split the search term on punctuation. If not, is this something I could do with a custom analyzer (without applying it to the index itself)?
Thanks.

The simplest way to do it is to replace / with some character that doesn't cause the word to be split into two tokens and doesn't interfere with your other terms (_, ., ' should work), or to remove / completely using a mapping char filter. There is a similar example here: https://stackoverflow.com/a/23640832/783043

Combining (OR) arbitrary regular expressions

tl;dr Is there a way to OR/combine arbitrary regexes into a single regex (for matching, not capturing) in Java?
In my application I receive two lists from the user:
(1) a list of regular expressions
(2) a list of strings
and I need to output a list of the strings in (2) that were not matched by any of the regular expressions in (1).
I have the obvious naive implementation in place (iterate over all strings in (2); for each string iterate over all patterns in (1); if no pattern matches the string, add it to the list that will be returned) but I was wondering if it was possible to combine all patterns into a single one and let the regex compiler exploit optimization opportunities.
The obvious way to OR-combine regexes is (regex1)|(regex2)|(regex3)|...|(regexN), but I'm pretty sure this is not the correct thing to do considering that I have no control over the individual regexes (e.g. they could contain all manner of back/forward references). I was therefore wondering if you can suggest a better way to combine arbitrary regexes in Java.
note: it's only implied by the above, but I'll make it explicit: I'm only matching against the string - I don't need to use the output of the capturing groups.
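For reference, the naive implementation described above could look roughly like this. It is a minimal sketch: the class and method names are made up, and it assumes "matched" means the whole string matches (matches()); swap in find() if a partial match should count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class UnmatchedFilter {
    // Returns the inputs that are not matched by any of the regexes.
    static List<String> unmatched(List<String> regexes, List<String> inputs) {
        // Compile each pattern once, not once per input string.
        List<Pattern> patterns = new ArrayList<>();
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex));
        }
        List<String> result = new ArrayList<>();
        for (String input : inputs) {
            boolean hit = false;
            for (Pattern pattern : patterns) {
                if (pattern.matcher(input).matches()) { // assumption: whole-string match
                    hit = true;
                    break;
                }
            }
            if (!hit) {
                result.add(input);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> regexes = List.of("(a)\\1", "b+");
        List<String> inputs = List.of("aa", "bbb", "ab");
        System.out.println(unmatched(regexes, inputs)); // [ab]
    }
}
```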

Some regex engines (e.g. PCRE) have the construct (?|...). It's like a non-capturing group, but has the nice feature that in every alternation groups are counted from the same initial value. This would probably immediately solve your problem. So if switching the language for this task is an option for you, that should do the trick.
[edit: In fact, it will still cause problems with clashing named capturing groups. In fact, the pattern won't even compile, since group names cannot be reused.]
Otherwise you will have to manipulate the input patterns. hyde suggested renumbering the backreferences, but I think there is a simpler option: making all groups named groups. You can assure yourself that the names are unique.
So basically, for every input pattern you create a unique identifier (e.g. increment an ID). Then the trickiest part is finding capturing groups in the pattern. You won't be able to do this with a regex. You will have to parse the pattern yourself. Here are some thoughts on what to look out for if you are simply iterating through the pattern string:
Take note when you enter and leave a character class, because inside character classes parentheses are literal characters.
Maybe the trickiest part: ignore all opening parentheses that are followed by ?:, ?=, ?!, ?<=, ?<!, ?>. In addition there are the option setting parentheses: (?idmsuxU-idmsuxU) or (?idmsux-idmsux:somePatternHere) which also capture nothing (of course there could be any subset of those options and they could be in any order - the - is also optional).
Now you should be left only with opening parentheses that are either a normal capturing group or a named one: (?<name>. The easiest thing might be to treat them all the same - that is, having both a number and a name (where the name equals the number if it was not set). Then you rewrite all of those with something like (?<uniqueIdentifier-md5hashOfName> (the hyphen cannot actually be part of the name, you will just have your incremented number followed by the hash - since the hash is of fixed length there won't be any duplicates; pretty much at least). Note that in Java a group name must start with a letter and may only contain letters and digits, so put a letter in front of the number. Make sure to remember which number and name the group originally had.
Whenever you encounter a backslash there are three options:
The next character is a number. You have a numbered backreference. Replace all those numbers with k<name> where name is the new group name you generated for the group.
The next characters are k<...>. Again replace this with the corresponding new name.
The next character is anything else. Skip it. That handles escaping of parentheses and escaping of backslashes at the same time.
I think Java might allow forward references. In that case you need two passes. Take care of renaming all groups first. Then change all the references.
Once you have done this on every input pattern, you can safely combine all of them with |. Any other feature than backreferences should not cause problems with this approach. At least not as long as your patterns are valid. Of course, if you have inputs a(b and c)d then you have a problem. But you will have that always if you don't check that the patterns can be compiled on their own.
I hope this gave you a pointer in the right direction.
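To show what the rewrite buys you, here is a small sketch with the renaming done by hand for two toy patterns, (a)\1 and (b)\1. The parser that performs the renaming is not shown, and the names p1g1 / p2g1 are made up for illustration (plain letters and digits, starting with a letter, to satisfy Java's naming rule).

```java
import java.util.regex.Pattern;

class CombineDemo {
    // Naive combination: in the second alternative, \1 still points at
    // group 1, which belongs to the FIRST pattern and is unset there.
    static final Pattern NAIVE =
            Pattern.compile("(?:(a)\\1)|(?:(b)\\1)");

    // After renaming: every group has a unique name and each
    // backreference points at the group of its own pattern.
    static final Pattern RENAMED =
            Pattern.compile("(?:(?<p1g1>a)\\k<p1g1>)|(?:(?<p2g1>b)\\k<p2g1>)");

    public static void main(String[] args) {
        System.out.println(NAIVE.matcher("aa").matches());   // true
        System.out.println(NAIVE.matcher("bb").matches());   // false, although (b)\1 alone matches "bb"
        System.out.println(RENAMED.matcher("aa").matches()); // true
        System.out.println(RENAMED.matcher("bb").matches()); // true
        System.out.println(RENAMED.matcher("ab").matches()); // false
    }
}
```

The naive version fails on "bb" because, in Java, a backreference to a group that did not take part in the match makes the match attempt fail.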

Java string: classes or packages with advanced functions?

I am doing string manipulations and I need more advanced functions than the original ones provided in Java.
For example, I'd like to return a substring between the (n-1)th and nth occurrence of a character in a string.
My question is, are there classes already written by users which perform this function, and many others for string manipulations? Or should I dig on stackoverflow for each particular function I need?

Check out the Apache Commons class StringUtils, it has plenty of interesting ways to work with Strings.
http://commons.apache.org/lang/api-2.3/index.html?org/apache/commons/lang/StringUtils.html

Have you looked at the regular expression API? That's usually your best bet for doing complex things with strings:
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
Along the lines of what you're looking to do, you can traverse the string against a pattern (in your case a single character) and match everything in the string up to but not including the next instance of the character as what is called a capture group.
It's been a while since I've written a regex, but if you were looking for the character A for instance, then I think you could use the regex A([^A]*) and keep matching that string. The stuff in the parenthesis is a capturing group, which I reference below. To match it, you'd use the matcher method on pattern:
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#matcher%28java.lang.CharSequence%29
On the Matcher instance, you'd keep calling find() and, each time it returns true, group(1), which gets you what is in between the parentheses. Don't call matches() first: it tests the whole input against the pattern rather than looking for the next occurrence. You could use a counter in your loop to make sure you get the (n-1)th instance of the character.
Lastly, Pattern provides flags you can pass in to indicate things like case insensitivity, which you may need.
If I've made some mistakes here, then someone please correct me. Like I said, I don't write regexes every day, so I'm sure I'm a little bit off.
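Here is a minimal sketch of that find()/group(1) loop for the "between the (n-1)th and nth occurrence" case. The class and method names are made up, and the pattern is a variation on the one above: a reluctant (.*?) plus a lookahead, so that a match only counts when a closing occurrence of the character actually exists.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubstringBetween {
    // Returns the text between the (n-1)th and nth occurrence of
    // delimiter, or empty if there are fewer than n occurrences.
    static Optional<String> between(String text, char delimiter, int n) {
        if (n < 2) {
            return Optional.empty(); // there is no "(n-1)th" occurrence
        }
        // Quote the delimiter so characters like '.' are taken literally.
        String d = Pattern.quote(String.valueOf(delimiter));
        // delimiter, shortest possible text, then a lookahead for the
        // next delimiter (not consumed, so it can start the next match).
        Pattern pattern = Pattern.compile(d + "(.*?)(?=" + d + ")", Pattern.DOTALL);
        Matcher matcher = pattern.matcher(text);
        int found = 0;
        while (matcher.find()) {
            found++;
            if (found == n - 1) {
                return Optional.of(matcher.group(1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(between("xAfooAbarAbaz", 'A', 2)); // Optional[foo]
        System.out.println(between("xAfooAbarAbaz", 'A', 3)); // Optional[bar]
        System.out.println(between("xAfooAbarAbaz", 'A', 4)); // Optional.empty
    }
}
```

A plain indexOf loop would do the same job without regexes; the regex version is shown only because it follows the approach described above.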
