Why does Solr ClientUtils::escapeQueryChars escape spaces - java

A Solr query has some special characters that need to be escaped: + - & | ! ( ) { } [ ] ^ " ~ * ? : /.
SolrJ provides a utility method, ClientUtils::escapeQueryChars, which escapes additional characters, including ; and whitespace.
It caused a bug in my application when the search term contains a space, like foo bar, which was turned into foo\ bar by ClientUtils::escapeQueryChars. My solution is to split the search term, escape each part, and join them with AND or OR.
But it's still a pain to write extra code just to handle spaces.
Is there any special reason that spaces and ; are also escaped by this utility method?
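The split-escape-join workaround described above can be sketched like this; escape here is a simplified, self-contained stand-in for ClientUtils.escapeQueryChars (whitespace is deliberately left out of its character set, since the input is split on whitespace first), and the class name is illustrative:

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class QueryEscaper {

    // Simplified stand-in for ClientUtils.escapeQueryChars: backslash-escapes
    // Lucene/Solr special characters in a single term (whitespace excluded,
    // because we split on it before escaping).
    static String escape(String term) {
        StringBuilder sb = new StringBuilder();
        for (char c : term.toCharArray()) {
            if ("\\+-!():^[]\"{}~*?|&;/".indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Split on whitespace, escape each token, rejoin with AND.
    static String escapeAndJoin(String input) {
        return Arrays.stream(input.trim().split("\\s+"))
                     .map(QueryEscaper::escape)
                     .collect(Collectors.joining(" AND "));
    }

    public static void main(String[] args) {
        System.out.println(escapeAndJoin("foo bar"));      // foo AND bar
        System.out.println(escapeAndJoin("C++ tutorial")); // C\+\+ AND tutorial
    }
}
```

Joining with OR instead is a one-character change to the collector; which is right depends on the intended match semantics.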

In Solr (and Lucene), characters can have different meanings in the query syntax depending on which query parser you're using (for example standard, dismax, edismax, etc.).
So when and what to escape depends on which query parser you're using and which query you're trying to run. I know this seems too broad an answer, but I'll add an example to make things clearer.
For example, let's try to use edismax as query parser and have a document with a field named tv_display of type string.
If you write:
http://localhost:8983/solr/buybox/select?q=tv_display:Full HD
edismax will convert the query into +tv_display:Full +tv_display:HD.
This way you'll never find the documents where tv_display is Full HD, but rather all the documents where tv_display is Full and/or HD (and/or depends on your mm configuration).
ClientUtils::escapeQueryChars will convert Full HD into Full\ HD:
http://localhost:8983/solr/buybox/select?q=tv_display:Full\ HD
This way edismax takes the entire string as a single token, and only then will it return all the documents where tv_display is Full HD.
In conclusion, ClientUtils::escapeQueryChars escapes all characters (spaces and semicolons included) that could be misinterpreted by a query parser.

Related

Best way to validate non-printable ascii characters in XML

Our application needs to validate different input XML messages for non-printable ASCII characters. We currently know two options to do this:
Change the XSD to include the restriction.
Validate the input XML string in the Java application using a regular expression.
Which approach is better in terms of performance, given that our application has to return a response within a few seconds? Is there any other option available to do this?
It's mainly a matter of opinion, but if you have an XSD, that seems the natural place to include the validations. The only thing you may need to consider is that via XSD you will either pass or fail, whereas with ad-hoc Java validation you can ignore non-printable characters, replace them, or take some other action without rejecting the input completely.
The only characters that are (a) ASCII, (b) non-printable, and (c) allowed in XML 1.0 documents are CR, NL, and TAB. I find it hard to see why excluding those three characters is especially important, but if you already have an XSD schema, then it makes sense to add the restriction there.
The usual approach is not to make these three characters invalid, but to treat them as equivalent to space characters, which you can do by using a data type that has the whitespace facet value "normalize" or "collapse".
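A minimal Java sketch of the regex option, treating CR, LF, and TAB as allowed, per the character analysis above (class and method names are illustrative):

```java
import java.util.regex.Pattern;

public class NonPrintableCheck {

    // Matches any ASCII control character except TAB (0x09), LF (0x0A),
    // and CR (0x0D), plus DEL (0x7F).
    private static final Pattern NON_PRINTABLE =
            Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F\\x7F]");

    static boolean containsNonPrintable(String xml) {
        return NON_PRINTABLE.matcher(xml).find();
    }

    public static void main(String[] args) {
        System.out.println(containsNonPrintable("plain text\n"));          // false
        System.out.println(containsNonPrintable("bad" + (char) 1 + "x"));  // true
    }
}
```

Precompiling the Pattern once (as a constant) matters for the stated performance requirement; compiling it per message would add avoidable overhead.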

Configuring the tokenisation of the search term in an elasticsearch query

I am doing a general search against elasticsearch (1.7) using a match query against a number of specified fields. This is done in a Java app with one box to enter search terms. Various search options are allowed (for example, surrounding a phrase with quotes to look for the phrase rather than its component words). This means I am doing full text searches.
All is well except my account refs have forward slashes in them and a search on an account ref produces thousands of results. If I surround the account ref with quotes I get just the result I want. I assume an account ref of AC/1234/A01 is searching for [AC OR 1234 OR A01]. Initially I thought this was a regex issue but I don’t think it is.
I raised a similar question a while ago and one suggestion which I had thought worked was to add "analyzer": "keyword" to the query (in my code
queryStringQueryBuilder.analyzer("keyword")
).
The problem with this is that many of the other fields searched are not keyword fields, and it is stopping a lot of flexible search options from working (case sensitivity, etc.). I assume this has become something along the lines of an exact match in the text search.
I've looked at this the wrong way around for a while now, and as I see it I can't fix it in the index or even in the general analyser settings, as even if the account ref field is tokenised and analysed perfectly for my requirement, the search will still search all the other fields for [AC OR 1234 OR A01].
Is there a way of configuring the search query to not split the account number on forward slashes? I could test ignoring all punctuation if it is possible to split only on whitespace, although I would prefer not to make such a radical change...
So I guess what I am asking is whether there is another built-in analyzer which would still do a full text search but would not split the search term on punctuation? If not, is this something I could do with a custom analyzer (without applying it to the index itself)?
Thanks.
The simplest way to do it is by replacing / with some character that doesn't cause the word to be split into two tokens but doesn't interfere with your other terms (_, ., or ' should work), or by removing / completely using a mapping char filter. There is a similar example here: https://stackoverflow.com/a/23640832/783043
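A sketch of the mapping char filter approach in the index settings; all names here (the char filter, analyzer, and filter chain) are illustrative, and the filter maps / to _ as suggested above:

```json
{
  "settings": {
    "analysis": {
      "char_filter": {
        "slash_to_underscore": {
          "type": "mapping",
          "mappings": [ "/ => _" ]
        }
      },
      "analyzer": {
        "ref_analyzer": {
          "type": "custom",
          "char_filter": [ "slash_to_underscore" ],
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}
```

With this, AC/1234/A01 is rewritten to AC_1234_A01 before tokenisation, so it survives as a single token instead of being split into three.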

Lucene: Mining email addresses, names, and identifiers from an index

I have a lucene index with approx. 1 million documents. From these documents, I want to mine
email addresses
signatures - ( [whitespace]/s/[whitespace]john doe[whitespace] )
specific identifiers from each of the documents (that follow a regex pattern "\s[0-9]{3}[a-zA-Z0-9]{6}\s").
I understand that ideally, with Solr, this would be much easier to do at index build time, but how can one do this from an already-built Lucene index?
I am using Java. For the email address search, I tried .setAllowLeadingWildcard(true) and then searched for @ to find all email addresses, but I actually got zero results. If I search for @ in Luke I get zero results. If I search for @hotmail.com in Luke, I get a bunch of results with valid email addresses such as aaaaa@hotmail.com.
The index was created using StandardAnalyzer. Not sure if it matters, but the text is in UTF-8 I believe.
Any helpful suggestions, pointers is great! Note this is not for front end, so query doesn't have to be near realtime.
Analysis does matter, yes. The standard analyzer treats whitespace and punctuation, such as @, as places to split the input into tokens. As such, you wouldn't expect to see any of them actually present in the indexed data.
You can use Lucene's regex query, particularly for the third case. A PhraseQuery seems appropriate for the second, I think, though I'm more than slightly confused about what you are trying to accomplish there.
Generally, you might want to use a different analyzer for an email field, in order to keep each address as a single token. You should still get reasonable results searching for a particular email address, since, although the analyzer removes the punctuation, searching for the (usually) three tokens of an email consecutively as a phrase would be expected to get good matches. However, a regex search like \w*@\w*\.\w* won't be particularly effective, since the punctuation isn't actually indexed and searchable, and a regex search doesn't span multiple terms in the index. Apart from searching for a known set of email domains, or something of that nature, you would want to re-index using analysis more in line with how you need to search in order to do what you are asking.
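For the third case, the identifier pattern from the question can be exercised in plain Java before wiring it into a Lucene regex query (where the surrounding \s would be dropped, since Lucene regexes match whole terms). This is a standalone illustration of the pattern only; the class name is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IdentifierScan {

    // The identifier pattern from the question: three digits followed by
    // six alphanumerics, delimited by whitespace. The capturing group
    // extracts the identifier without its surrounding whitespace.
    static final Pattern ID_PATTERN =
            Pattern.compile("\\s([0-9]{3}[a-zA-Z0-9]{6})\\s");

    static List<String> findIds(String text) {
        List<String> ids = new ArrayList<>();
        Matcher m = ID_PATTERN.matcher(text);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        String text = " ref 123abc456 here, but 12x999 is not a match ";
        System.out.println(findIds(text)); // [123abc456]
    }
}
```

Note the pattern requires whitespace on both sides, so identifiers at the very start or end of a stored field value would need the anchors relaxed.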

Search database table with all special characters

I have a table of projects with a project name, and that project name may contain any special character, any alphanumeric value, or any combination of numbers, words, and special characters.
Now I need to apply a keyword search to it, and the search term may contain any special character.
So my question is: how can we search for either single or multiple special characters in the database?
I am using MySQL 5.0 with Java and the Hibernate API.
This should be possible with some simple sanitization of your query.
e.g: a search for \#(%*#$\ becomes:
SELECT * FROM foo WHERE name LIKE "%\\#(\%*#$\\%";
When evaluated, the backslashes escape so that the search ends up matching anything that contains "\#(%*#$\".
In general, anything that's a special character in a string can be escaped via a backslash. This only really becomes tricky if you have a name such as "\\foo\\bar\\", which properly escaped becomes "\\\\foo\\\\bar\\\\".
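A sketch of that sanitization in Java (the helper name is illustrative); it escapes the backslash first so the later escapes are not themselves doubled:

```java
public class LikeEscaper {

    // Escape the characters that are special inside a SQL LIKE pattern:
    // backslash (the default escape character), % and _ (wildcards).
    // Backslash goes first so the others aren't double-escaped.
    static String escapeForLike(String keyword) {
        return keyword.replace("\\", "\\\\")
                      .replace("%", "\\%")
                      .replace("_", "\\_");
    }

    public static void main(String[] args) {
        // Intended use: bind the result as a parameter, e.g.
        // WHERE name LIKE CONCAT('%', ?, '%')
        System.out.println(escapeForLike("50%_off")); // 50\%\_off
        System.out.println(escapeForLike("\\foo\\")); // \\foo\\
    }
}
```

With bound parameters (as Hibernate uses), only the LIKE wildcards need escaping; the doubling of backslashes inside SQL string literals shown above is otherwise handled by the driver.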
A side note: please proofread your posts before finalizing. It's really depressing and shows a lack of effort when your question's title has spelling errors in it.

Solr - Match sentence beginning with a particular word

Any tips on how this is done?
I've tried using the PatternTokenizerFactory, but it's not working as expected.
Is it possible to do this without writing a custom tokenizer?
You can tokenize the field in question using KeywordTokenizerFactory and then do a wildcard search:
http://solr.pl/en/2010/12/20/wildcard-queries-and-how-solr-handles-them/
provided that you are not doing any other operation which does not work with the above Tokenizer.
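A sketch of such a field type in schema.xml (the type name is illustrative; KeywordTokenizerFactory keeps the whole field value as one token):

```xml
<fieldType name="sentence_exact" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

A query such as q=myfield:hello\ wor* (spaces escaped) would then match only values beginning with "hello wor".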
Another way is a roundabout one. You can create a copyField which will have its spaces stripped out using the following technique (or some other):
What is the regular expression to remove spaces in SOLR
You can then tokenize that copyField using WhitespaceTokenizer (which essentially creates only one token, since the copyField values have no spaces) and then do a wildcard search on it.
The second approach might fail in some cases (e.g. "wor them" will match "worth*" after the spaces are stripped).
