Search database table with all special characters

I have a project table with a project name column, and the name may contain any special character, any alphanumeric value, or any combination of numbers, words and special characters.
Now I need to apply a keyword search to that column, and the search term may itself contain any special character.
So my question is: how can we search for single or multiple special characters in the database?
I am using MySQL 5.0 with the Java Hibernate API.

This should be possible with some simple sanitization of your query.
For example, a search for \#(%*#$\ becomes:
SELECT * FROM foo WHERE name LIKE "%\\\\#(\%*#$\\\\%";
MySQL strips backslashes twice, once when it parses the string literal and again when it evaluates the LIKE pattern, so each literal backslash is written as four and the % inside the term is escaped as \%. The search then matches anything that contains "\#(%*#$\".
In general, any character that is special in a LIKE pattern (%, _ and the backslash itself) can be escaped with a backslash. This only really becomes tricky if you have a name such as "\foo\bar\", which to escape properly in the query text would become "\\\\foo\\\\bar\\\\".
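If the pattern is bound as a query parameter instead of being pasted into the SQL text, the string-literal layer of escaping goes away and only the LIKE-level escaping is left. Below is a minimal sketch of that idea, assuming MySQL's default LIKE escape character (the backslash). The class name LikeEscaper, the Project entity and the name property in the comment are hypothetical and not taken from the question.

```java
/** Hypothetical helper: prepares user input for use as a bound LIKE parameter. */
final class LikeEscaper {

    private LikeEscaper() {
    }

    /** Prefixes the LIKE escape character and the two wildcards (% and _) with a backslash. */
    static String escapeLike(String term) {
        StringBuilder sb = new StringBuilder(term.length() + 8);
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c == '\\' || c == '%' || c == '_') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Wraps the escaped term in wildcards for a "contains" search. */
    static String containsPattern(String term) {
        return "%" + escapeLike(term) + "%";
    }

    // Possible use with Hibernate (entity and property names are assumptions):
    //   session.createQuery("from Project p where p.name like :pattern")
    //          .setParameter("pattern", LikeEscaper.containsPattern(userInput))
    //          .list();
}
```

Binding the value also keeps quotes and other characters in the search term from being interpreted as SQL.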
As a side note, please proofread your posts before finalizing them. It's really depressing, and shows a lack of effort, when your question's title has spelling errors in it.

Related

Why does Solr ClientUtils::escapeQueryChars escape spaces

A Solr query has some special chars that need to be escaped: +-&|!(){}[]^"~*?:/
SolrJ provides a utility method, ClientUtils::escapeQueryChars, which escapes more chars, including ; and whitespace.
It caused a bug in my application when the search term contained a space, like foo bar, which was turned into foo\ bar by ClientUtils::escapeQueryChars. My solution is to split the search term, escape each term, and join them with AND or OR.
But it's still a pain to write extra code just to handle spaces.
Is there any special reason that space and ; are also escaped by this utility method?

In Solr (and Lucene), characters can have different meanings in the query syntax depending on which query parser you're using (for example standard, dismax, edismax, etc.).
So when and what to escape depends on which query parser you're using and which query you're trying to run. I know this seems too broad as an answer, but I'll add an example to make things clearer.
For example, let's use edismax as the query parser and have a document with a field named tv_display of type string.
If you write:
http://localhost:8983/solr/buybox/select?q=tv_display:Full HD
edismax will convert the query into +tv_display:Full +tv_display:HD.
This way you'll never find the documents where tv_display is Full HD, only the documents where tv_display is Full and/or HD (and/or depends on your mm configuration).
ClientUtils::escapeQueryChars will convert Full HD into Full\ HD:
http://localhost:8983/solr/buybox/select?q=tv_display:Full\ HD
Now edismax takes the entire string as a single token, and only this way will all the documents where tv_display is Full HD be returned.
In conclusion, ClientUtils::escapeQueryChars escapes all characters (spaces and semicolons included) that could be misunderstood by a query parser.
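The two behaviours discussed here, escaping the whole value as one term versus splitting it and joining the escaped terms, can be compared in a small standalone sketch. The escape method below is a hand-written stand-in that covers the characters listed in the question plus ; and whitespace. It is not SolrJ's source, and a real application would call ClientUtils.escapeQueryChars instead. The class and method names are hypothetical.

```java
/** Hypothetical stand-in for illustration; it does not depend on SolrJ. */
final class SolrTermEscaper {

    // The backslash, the characters listed in the question, and ';'.
    private static final String SPECIAL = "\\+-&|!(){}[]^\"~*?:/;";

    private SolrTermEscaper() {
    }

    /** Backslash-escapes query syntax characters and whitespace. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Treats the whole value as one term, e.g. for a string field. */
    static String exactValueQuery(String field, String value) {
        return field + ":" + escape(value);
    }

    /** The workaround from the question: split on whitespace, escape each term, join with AND. */
    static String allTermsQuery(String field, String input) {
        StringBuilder sb = new StringBuilder();
        for (String term : input.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(" AND ");
            }
            sb.append(field).append(':').append(escape(term));
        }
        return sb.toString();
    }
}
```

Which form is appropriate depends on the field type: the single escaped term suits an untokenized string field such as tv_display, while the split form suits analyzed text fields.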

Does space affect the result in regex - java

I'm using regex to define a set of rules that extract specific information from unstructured resumes.
This information is:
Company that the applicant worked at or is still working at
Role (designation), e.g. software engineer
Date (From-To)
Every applicant writes his/her employment details in his/her own way. However, some resumes share a common style, for example:
2012- 2014.Dean of the Faculty of Engineering Information Technology/
University Name.
So I defined a regex to extract the needed information.
Here is my regex:
(^[0-9]{4})(-|–|.|_|to) ([0-9]{4})(.*) (of the|at|in) (.*).
This regex was able to extract the information from the above example:
role:Dean
company: Faculty of Engineering Information Technology/University Name.
date from: 2012 to :2014
loyalty: 2 years // this depends on the extracted dates
But I have another sample from another resume that has the same style of writing:
1996-1997, Lecturer in Computer Science Department, Jerusalem open
university.
It should match, but it didn't until I removed the space from the regex; then it was able to extract the data.
My question is: does the space matter in a regex?
And how can I fix this so that it extracts the data from both resumes regardless of the space in the regex rule?
Here is my demo.

does the space affect in regex?
You have determined for yourself that it does. Space characters are not regex metacharacters, unless you enable the COMMENTS option in your pattern. Ordinarily, they stand for themselves, just like most other characters.
how I can fix this so that it could extract the data from both resume regarding of the space in the regex rule?
You can apply quantifiers such as ? or * to space characters in your regex, just as you can to any other character or group. So, for example, you might use
(^[0-9]{4})(-|–|.|_|to) *([0-9]{4})(.*) (of the|at|in) (.*).
Do consider also that you might sometimes have to deal with tab characters, too. You can use the escape sequence \s to match any single whitespace character, whether it be a space, a tab, a newline, or any other character recognized as whitespace by Java.

You can allow an optional amount of whitespace by using \\s* instead of a space. \\s means a whitespace character, and the * means zero or more.
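A minimal sketch of that change, using the regex from the answer above with \\s* in place of the literal space after the separator. The class name EmploymentLineParser and the shape of the returned array are assumptions made for illustration, and each sample is joined into a single line before matching.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper built around the regex discussed above. */
final class EmploymentLineParser {

    // \\s* accepts zero or more whitespace characters between the separator and
    // the second year; \u2013 is the en dash from the original pattern.
    private static final Pattern LINE = Pattern.compile(
            "(^[0-9]{4})(-|\u2013|.|_|to)\\s*([0-9]{4})(.*) (of the|at|in) (.*).");

    private EmploymentLineParser() {
    }

    /** Returns {from, to, role, company}, or null when the line does not match. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] {m.group(1), m.group(3), m.group(4).trim(), m.group(6)};
    }
}
```

Note that the unescaped . inside the separator group matches any character, as in the original pattern; writing it as \\. would restrict it to a literal full stop.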

Configuring the tokenisation of the search term in an elasticsearch query

I am doing a general search against elasticsearch (1.7) using a match query against a number of specified fields. This is done in a Java app with one box to enter search terms in. Various search options are allowed (for example, surrounding a phrase with quotes to look for the phrase rather than the component words). This means I am doing full-text searches.
All is well except my account refs have forward slashes in them and a search on an account ref produces thousands of results. If I surround the account ref with quotes I get just the result I want. I assume an account ref of AC/1234/A01 is searching for [AC OR 1234 OR A01]. Initially I thought this was a regex issue but I don’t think it is.
I raised a similar question a while ago, and one suggestion which I had thought worked was to add "analyzer": "keyword" to the query (in my code, queryStringQueryBuilder.analyzer("keyword")).
The problem with this is that many of the other fields searched are not keyword fields, and it stops a lot of flexible search options from working (case sensitivity etc.). I assume the search has become something along the lines of an exact match in the text search.
I've looked at this the wrong way around for a while now and as I see it I can't fix it in the index or even in the general analyser settings as even if the account ref field is tokenised and analysed perfectly for my requirement the search will still search all the other fields for [AC OR 1234 OR A01].
Is there a way of configuring the search query to not split the account number on forward slashes? I could test ignoring all punctuation if it is possible to only split by whitespaces although I would prefer not to make such a radical change...
So I guess what I am asking is whether there is another built-in analyzer which would still do a full-text search but would not split the search term on punctuation. If not, is this something I could do with a custom analyzer (without applying it to the index itself)?
Thanks.

The simplest way to do it is to use a mapping char filter, either to replace / with some character that doesn't cause the word to be split into two tokens and doesn't interfere with your other terms (_, . or ' should work), or to remove / completely. There is a similar example here: https://stackoverflow.com/a/23640832/783043

Solr field searching, wildcards, and escaped characters

I'm using Solr for the search functionality in my webapp. I append an "*" to the end of each user's search. So, if the search is foo, I change it to filename:foo*
This works fine, except that often a hyphen will be included in the user's search. A search of filename:foo-bar* returns zero results, as the hyphen removes any search results produced from the search term(s) after it. I can escape it, as filename:foo\-bar* but I still get zero results. If I try filename:foo"-"* the search returns all documents.
Any suggestions on how to get - and * to play nice with one another?
Thanks for the help

In my experience, I've had to escape the wildcard character if anything else in the string is escaped. This is to get it to function as a wildcard; you'd think it'd make it look for the character itself, but it seems not to. Note: Escaping the * without other escaped characters seems to search exactly for the character *, and does not use it as a wildcard operator.
field:*and\/or* //would NOT perform a wildcarded search for "and/or"
field:\*and\/or\* //would perform wildcard search for and/or
To be clear, it seems like they are backwards, but that is what has worked in my cases.

Java regex to distinguish special characters while allowing non english chars

I am trying to do the above. One option is to build a set of chars that count as special characters and then handle it with some Java logic. But then I have to make sure I include all special chars.
Is there any better way of doing this?

You need to decide what constitutes a special character. One method that may be of interest is Character.getType(char) which returns an int which will match one of the constant values of Character such as Character.LOWERCASE_LETTER or Character.CURRENCY_SYMBOL. This lets you determine the general category of a character, and then you need to decide which categories count as 'special' characters and which you will accept as part of text.
Note that Java uses UTF-16 to encode its char and String values, and consequently you may need to deal with supplementary characters (see the link in the description of the getType method). This is a nuisance, but the Character class does offer methods which help you detect this situation and work around it. See the Character.isSupplementaryCodePoint(int) and Character.codePointAt(char[], int) methods.
Also be aware that Java 6 is far less knowledgeable about Unicode than is Java 7. The newest version of Java has added far more to its Unicode database, but code running on Java 6 will not recognise some (actually quite a few) exotic codepoints as being part of a Unicode block or general category, so you need to bear this in mind when writing your code.
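A minimal sketch of the category-based approach, iterating by code point so that supplementary characters are handled. The choice of which general categories count as ordinary text (letters, combining marks, decimal digits and space separators) is an assumption made for this example, and the class name is hypothetical.

```java
/** Hypothetical classifier built on Character.getType(int). */
final class SpecialCharClassifier {

    private SpecialCharClassifier() {
    }

    /** Returns true for any code point outside the categories accepted as ordinary text. */
    static boolean isSpecial(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.UPPERCASE_LETTER:
            case Character.LOWERCASE_LETTER:
            case Character.TITLECASE_LETTER:
            case Character.MODIFIER_LETTER:
            case Character.OTHER_LETTER:
            case Character.NON_SPACING_MARK:
            case Character.COMBINING_SPACING_MARK:
            case Character.ENCLOSING_MARK:
            case Character.DECIMAL_DIGIT_NUMBER:
            case Character.SPACE_SEPARATOR:
                return false;
            default:
                return true;
        }
    }

    /** Walks the string one code point at a time, so surrogate pairs are read as a unit. */
    static boolean containsSpecial(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isSpecial(cp)) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }
}
```

Non-English letters such as é or Japanese characters fall into the letter categories and are therefore accepted, while punctuation and currency symbols are flagged.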

It sounds like you would like to remove all control characters from a Unicode string. You can accomplish this by using a Unicode character category identifier in a regex. The category "Cc" contains those characters; see http://www.fileformat.info/info/unicode/category/Cc/list.htm.
myString = myString.replaceAll("\\p{Cc}+", "");
