Solr field searching, wildcards, and escaped characters

Solr field searching, wildcards, and escaped characters - java

I'm using solr for the search functionality on my webapp. I append an "*" to the end of each user's search. So, if the search is: foo I change it to filename:foo*
This works fine, except that often a hyphen will be included in the user's search. A search of filename:foo-bar* returns zero results, as the hyphen removes any search results produced from the search term(s) after it. I can escape it, as filename:foo\-bar* but I still get zero results. If I try filename:foo"-"* the search returns all documents.
Any suggestions on how to get - and * to play nice with one another?
Thanks for the help

In my experience, I've had to escape the wildcard character if anything else in the string is escaped. This is to get it to function as a wildcard; you'd think it'd make it look for the character itself, but it seems not to. Note: Escaping the * without other escaped characters seems to search exactly for the character *, and does not use it as a wildcard operator.
field:*and\/or* //would NOT perform a wildcarded search for "and/or"
field:\*and\/or\* //would perform wildcard search for and/or
To be clear, it seems like they are backwards, but that is what hass worked in my cases.

Related

How to write a regular expression that can matches the whole function #Prompt (…) whatever written inside () even if it contains another ()

For example I want replace any prompt function in an SQL query
I have used this expression
Query = Query.replaceAll("#prompt\\s*\\(.*?\\)", "(1)");
This expression works in this example
#Prompt('Bill Cycle','A','MIGRATION\BC',,,)
#Prompt('Bill Cycle','A','MIGRATION\BC',,,)
and the output is is (1)
but when it does not work on this example
#Prompt('Groups','A','[Lookup] Price Group (2)\Nested Group',,,)
the out put is (1) \Nested Group',,,) which is not valid

Sadly, as pointed out by Joe C in a comment, what you are trying to do cannot be done in a regular expression for arbitrary depth parenthesis. The reason is because regular-expressions are not capable of "counting". You need a stack machine for that, or a context-free language parser.
However, you also suggest that the 'prompted' content is always inside single quotes. I assume below the standard Java regexp library. Other regexp libraries might need translation...
"#Prompt\\('[^']*'(\s*,\s*(('[^']*')|([^',)]*)))*\\)"
So, you are searching within prompt for blocks of single-quoted text. The search assumes that each internal bit of content is enclosed in single quotes.
Verify at https://regex101.com/r/nByy0Y/1 (I made a couple fixes). Note that at regex101.com, it will treat the double back-slash as intending a literal back-slash. What you want instead is just to quote the parenthesis so that you want a literal parenthesis.

Because you are using the lazy quantifier '?', it is stopping the match at the end of the first ')'. removing that will let it go to the end greedily, as such:
#prompt\(.*\)
But if there is concern that the entries may have more parans after the one in question, it will cause problems.
Assuming the additional parens will always be in quotes, you can do this:
#prompt\((('([^'])*',*)*|(.*,*)*)\)
Here is it looking for items wrapped in single quotes OR text without parens, which should capture all of the single quoted elements or null params or unquoted text params

Configuring the tokanisation of the search term in an elasticsearch query

I am doing a general search against elasticsearch (1.7) using a match query against a number of specified fields. This is done in a java app with one box to enter search terms in. Various search options are allowed (for example surrounding phrase with quotes to look for the phase not the component words). This means I am doing full test searches.
All is well except my account refs have forward slashes in them and a search on an account ref produces thousands of results. If I surround the account ref with quotes I get just the result I want. I assume an account ref of AC/1234/A01 is searching for [AC OR 1234 OR A01]. Initially I thought this was a regex issue but I don’t think it is.
I raised a similar question a while ago and one suggestion which I had thought worked was to add "analyzer": "keyword" to the query (in my code
queryStringQueryBuilder.analyzer("keyword")
).
The problem with this is that many of the other fields searched are not keyword and it is stopping a lot of flexible search options working (case sensitivity etc). I assume this has become something along the lines of an exact match in the text search.
I've looked at this the wrong way around for a while now and as I see it I can't fix it in the index or even in the general analyser settings as even if the account ref field is tokenised and analysed perfectly for my requirement the search will still search all the other fields for [AC OR 1234 OR A01].
Is there a way of configuring the search query to not split the account number on forward slashes? I could test ignoring all punctuation if it is possible to only split by whitespaces although I would prefer not to make such a radical change...
So I guess what I am asking is whether there is another built in analyzer which would still do a full full text search but would not split the search term up using punctuation ? If not is this something I could do with a custom analyzer (without applying it to the index itself ?)
Thanks.

The simplest way to do it is by replacing / with some character that doesn't cause the word to be split in two tokens, but doesn't interfere with your other terms (_, ., ' should work) or remove / completely using mapping char filter. There is a similar example here https://stackoverflow.com/a/23640832/783043

Search database table with all special characters

I have a table of project in which i have a project name and that project name may contain any special character or any alpha numeric value or any combination of number word or special characters.
Now i need to apply keyword search in that and that may contain any special character in search.
So my question is: How we can search either single or multiple special characters in database?
I am using mysql 5.0 with java hibernate api.

This should be possible with some simple sanitization of you query.
e.g: a search for \#(%*#$\ becomes:
SELECT * FROM foo WHERE name LIKE "%\\#(\%*#$\\%";
when evaluated the back slashes escape so that the search ends up being anything that contains "\#(%*#$\"
In general anything that's a special character in a string can be escaped via a backslash. This only really becomes tricky if you have a name such as: "\\foo\\bar\\" which to escape properly would become "\\\\foo\\\\bar\\\\"
A side note, please proof read your posts prior to finalizing. Its really depressing and shows a lack of effort when your questions title has spelling errors in it.

is it possible to use replaceAll() with wildcards

Good morning. I realize there are a ton of questions out there regarding replace and replaceAll() but i havnt seen this.
What im looking to do is parse a string (which contains valid html to a point) then after I see the second instance of <p> in the string i want to remove everything that starts with & and ends with ; until i see the next </p>
To do the second part I was hoping to use something along the lines of s.replaceAll("&*;","")
That doesnt work but hopefully it gets my point across that I am looking to replace anything that starts with & and ends with ;

You should probably leave the parsing to a DOM parser (see this question). I can almost guarantee you'll have to do this to find text within the <p> tags.
For the replacement logic, String.replaceAll uses regular expressions, which can do the matching you want.
The "wildcard" in regular expressions that you want is the .* expression. Using your example:
String ampStr = "This &escape;String";
String removed = ampStr.replaceAll("&.*;", "");
System.out.println(removed);
This outputs This String. This is because the . represents any character, and the * means "this character 0 or more times." So .* basically means "any number of characters." However, feeding it:
"This &escape;String &anotherescape;Extended"
will probably not do what you want, and it will output This Extended. To fix this, you specify exactly what you want to look for instead of the . character. This is done using [^;], which means "any character that's not a semicolon:
String removed = ampStr.replaceAll("&[^;]*;", "");
This has performance benefits over &.*?; for non-matching strings, so I highly recommend using this version, especially since not all HTML files will contain a &abc; token and the &.*?; version can have huge performance bottle-necks as a result.

The expression you want is:
s.replaceAll("&.*?;","");
But do you really want to be parsing HTML this way? You may be better off using an XML parser.

Java String#contains() using String#matches() with escape character

I need a simple way to implement the contains function using matches. I believe this is my starting point:
xxx.matches("'.*yyy.*'");
But I need to make it a universal method and pre-process whatever I search for to be accepted by matches! This must be done using only the escape '\' character!
Imagine a string SEARCH_FOR that can contain some special characters that must be "regex escaped"...
String SEARCH_FOR="*.\\"
xxx.matches("'.*" + SEARCH_FOR + ".*'");
Are there any catches? Special situations? Any other "special chars should be taken into account?

Are you looking for Pattern.quote(String) ?
This escapes special characters for you.
EDIT:
After reading the comments, I really hope you try Pattern.quote(yourString.toLowerCase()) as it sounds like you've been using Pattern.quote(yourString).toLowerCase(). If DataNucleus is applying the regex then there should be no problems with using the \Q and \E escape sequence.
Since you have really asked for it, ".\\".replaceAll("(\\.|\\$|\\+|\\*|\\\\)", "\\\\\$1") outputs \.\\
This will escape .'s, $'s, + 's, *'s and \'s. Note that the security of this is now all upon you. If you don't escape something you needed to, or you escape it incorrectly, you will either allow people to use regex inside the search term when you weren't expecting to or it won't returns results that you were expecting.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.