Lucene Search SuggestWords() - java

I used the Apache Lucene library to write a search method:
public static List<String> suggestWords(String word, Directory directory, String field) {
    // ... implementation omitted ...
}
When I search for "Text", the suggested words are:
[Text]
When I search for "text", the suggested words are:
[Next, Text, Heat, Sent, Test, Texts]
Has any of you ever worked with this library? I would like to understand why, when I search for "Text", I get the right word(s), but when I search for "text" the first suggested word is "Next" and not "Text". Should I always uppercase the first letter of the word before calling suggestWords?
Thank you !

In Apache Lucene, indexed terms are case-sensitive unless the analyzer lowercases them. That would explain the difference you see between "Text" and "text".
To avoid the issue, you might add a String.toLowerCase() (or toUpperCase(), as you said) so that you always look up the same form that was put into the suggestion dictionary.
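A minimal sketch of that idea, wrapping the method from the question (the wrapper name is hypothetical and it assumes the suggestion dictionary was built from lowercased terms):
import java.util.List;
import java.util.Locale;
import org.apache.lucene.store.Directory;

// Sketch only: normalize the case of the query word before asking for suggestions.
public static List<String> suggestWordsIgnoreCase(String word, Directory directory, String field) {
    return suggestWords(word.toLowerCase(Locale.ROOT), directory, field);
}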

Related

How to prevent HTML character codes like &#218; from being written into a string?

I have a problem with extracting text from scientific articles.
I use PDFBox to extract text from PDFs. The problem is not with the extraction process itself but with some special math notations: when I want to write the extracted text into an XML file, a character that was not extracted correctly causes trouble. Instead of the character, an HTML character code such as &#218; is inserted into the XML file and ruins the whole file. How can I fix this issue?
The HTML codes I mean look like &#218;; at the moment number 218 is the one causing trouble, but I guess different math notations will be replaced by different HTML codes and cause the same problem afterwards.
I have already tried the following string cleanups, but they didn't help:
nextWord=nextWord.replaceAll("[-+.^:,]", "");
nextWord=nextWord.replaceAll("\\s+", "");
nextWord=nextWord.replaceAll("[^\\x00-\\x7F]", "");
You may add a pre-check before writing each line to the file, to verify that the text does not contain ambiguous characters. The pattern below covers the basic characters found in ordinary text; you may add or remove characters to suit your content.
public boolean isValidCharacters(String word) {
    String pattern = "^[a-zA-Z0-9~##$^*()_+={}|\\,.?: -]*$";
    return word.matches(pattern);
}
You can write something yourself with a regex, or if you have other String manipulations to do, the Apache Commons StringUtils class is really great. It has isAlpha() and isNumeric() methods that are easy to use.
https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html
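For example, a small filter along those lines (a sketch, assuming commons-lang3 is on the classpath; the method name is made up):
import org.apache.commons.lang3.StringUtils;

// Keep only tokens that are purely alphabetic or purely numeric.
public static boolean isPlainToken(String word) {
    return StringUtils.isAlpha(word) || StringUtils.isNumeric(word);
}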

Eclipse change case in regex find and replace

In Eclipse, I would like to be able to do a regex search and replace for some text and slightly modify it, changing the case of one of the letters. For example: find myVariable.getProperty() and change it to myVariable.property.
I can easily use myVariable.get(\w+)\(\) and replace it with myVariable.$1, but that results in myVariable.Property with the capital 'P'.
I believe this is possible with some regex engines, but I cannot find a way to do it within Eclipse.
I don't think Eclipse supports that type of functionality. You would have to get "creative" and do things like:
Search: myVariable\.getP(\w+)\(\)
Replace: myVariable.p$1
But according to regular-expressions.info (http://www.regular-expressions.info/replacecase.html), if you're open to editing your JSP file in a different text editor, there are programs that use other flavors of regex which can make your change.
Using your example, EditPad Lite for instance would allow your search:
Search: (myVariable\.)get(\w+)\(\)
And replace it with:
Replace: \1\L2
This would change:
myVariable.getProperty()
to:
myVariable.property
In this case \L2 changes the contents of the second backreference to lowercase. \U2 would change it to uppercase, \I0 would capitalize the initial letter of each separate word in the string, and \F0 would capitalize just the first letter of the string.
I've done similar things for small but repetitive changes where Eclipse is not exactly equipped for the job, and then gone back to Eclipse once the change has gone through.

Lucene french search

I am using Lucene for indexing/searching and have run into a problem. Lucene displays some results for the query "word" and some for "l'word" ("word" is just an example). I need it to display all of those results for either of the two queries.
I tried changing StandardAnalyzer to FrenchAnalyzer, but it also doesn't recognize these words as the same.
Changing every "l'word" to "word" in the index and in the search string is also not an option; we need the original string to be displayed in the results.
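A diagnostic sketch that prints the tokens an analyzer produces, which makes it easy to compare how "word" and "l'word" end up in the index (assuming a recent Lucene version, 5.x or later, where FrenchAnalyzer has a no-argument constructor):
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PrintTokens {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new FrenchAnalyzer();
        for (String text : new String[] { "word", "l'word" }) {
            try (TokenStream ts = analyzer.tokenStream("field", text)) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                System.out.print(text + " ->");
                while (ts.incrementToken()) {
                    System.out.print(" [" + term + "]");   // one indexed term per bracket
                }
                ts.end();
                System.out.println();
            }
        }
    }
}
If the two inputs do not produce the same term, the queries cannot match the same documents, whatever the stored (displayed) value is.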

Best way to store words for a given scenario

I am working on a Java project (Maven).
I am confused about one point; I don't know what is logically correct.
The problem is as follows:
A sentence is given, and from it I have to extract some particular words.
Solution that I found
I made one regex and put it in a Constants class. Whenever I have to add more words, I simply append them to the regex.
This solves the problem.
I am confused here
I am thinking of putting a number of text files in the resources folder, where each text file denotes one regex expression.
REGEX = (?:A|B|C|D)
A, B, C, D = Word(String)
Is this a good idea? If not, please suggest another approach.
Why would you save regexes in a text file? The fact that you're using a regex seems like an implementation detail that you would want to encapsulate (unless you want the significantly greater functionality, but also overhead, of supporting arbitrary regexes).
Also, why do you need a new file for each word? You could just have one file with one word per line containing all of the words you're interested in, as sketched below. That would be much simpler for a user to understand than 100 files with one regex per file.
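A minimal sketch of that word-per-line approach, assuming a classpath resource named keywords.txt (the file name, class and method are made up for illustration):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public final class KeywordPattern {

    // Builds (?:word1|word2|...) from a classpath resource with one word per line.
    public static Pattern load(String resource) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                KeywordPattern.class.getResourceAsStream(resource), StandardCharsets.UTF_8))) {
            String alternation = reader.lines()
                    .filter(line -> !line.trim().isEmpty())
                    .map(Pattern::quote)                  // escape each word so it matches literally
                    .collect(Collectors.joining("|"));
            return Pattern.compile("(?:" + alternation + ")");
        }
    }
}
With this, adding a word only means editing keywords.txt; no source code changes are needed.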
As I understand it, you want to find some keywords in the input string, and those keywords can be extended according to your requirements.
Your current solution is to put the regex (?:A|B|C|D) in your Constants class and, whenever required, add more keywords to this regex.
If my understanding is correct, one suggestion is to put this regex in a properties file, like this:
REGEX = (?:city|Animal|plant|student)
If it gets too long, it can be continued like this:
REGEX = (?:city|Animal|plant|student|car|computer|clothes|\
furniture|others)
Your second idea, if I understand it correctly, is to use the keywords as file names and put those files in one resources folder; you could then read the file names to compose the final regex. If your regex is always fixed in the (?:A|B|C|D) format, then this solution is good and convenient: every time you add a new keyword file, you don't need to modify any source code or properties file.
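If the regex itself lives in a properties file as suggested, loading it is straightforward with java.util.Properties (a sketch; the file and key names are just examples):
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;
import java.util.regex.Pattern;

public final class RegexConfig {

    // Reads the REGEX entry (e.g. REGEX = (?:city|Animal|plant|student))
    // from a classpath properties file and compiles it.
    public static Pattern loadRegex() throws IOException {
        Properties props = new Properties();
        try (InputStream in = RegexConfig.class.getResourceAsStream("/regex.properties")) {
            props.load(in);
        }
        return Pattern.compile(props.getProperty("REGEX"));
    }
}
The trailing-backslash line continuation shown above is supported by java.util.Properties, so a long alternation can span several lines of the file.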

Lucene Index problems with "-" character

I'm having trouble with a Lucene index that has indexed words containing the "-" character.
The search works for some words that contain "-" but not for all of them, and I can't find the reason why.
The field I'm searching in is analyzed and contains both the version of the word with and without the "-" character.
I'm using the analyzer: org.apache.lucene.analysis.standard.StandardAnalyzer
Here is an example:
If I search for "gsx-*" I get a result; the indexed field contains
"SUZUKI GSX-R 1000 GSX-R1000 GSXR"
But if I search for "v-*" I get no result. The indexed field of the expected result contains:
"SUZUKI DL 1000 V-STROM DL1000V-STROMVSTROM V STROM"
If I search for "v-strom" without the "*" it works, but if I just search for "v-str", for example, I don't get the result. (There should be a result because this is for a live search on a webshop.)
So what's the difference between the two expected results? Why does it work for "gsx-*" but not for "v-*"?
StandardAnalyzer will treat the hyphen as whitespace, I believe. So it turns your query "gsx-*" into "gsx*", and "v-*" into nothing, because it also eliminates single-letter tokens. What you see as the field contents in the search result is the stored value of the field, which is completely independent of the terms that were indexed for that field.
So what you want is for "v-strom" as a whole to be an indexed term. StandardAnalyzer is not suited to this kind of text. Maybe have a go with the WhitespaceAnalyzer or SimpleAnalyzer. If that still doesn't cut it, you also have the option of throwing together your own analyzer, or starting from those two and composing them with further TokenFilters (a sketch follows below). A very good explanation is given in the Lucene Analysis package Javadoc.
BTW there's no need to enter all the variants in the index, like V-strom, V-Strom, etc. The idea is for the same analyzer to normalize all these variants to the same string both in the index and while parsing the query.
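A sketch of that "compose it yourself" option, assuming Lucene 5.x/6.x (where createComponents takes only the field name and these filters live in the core package): a whitespace tokenizer followed by a lowercase filter, so "V-STROM" stays one term and is indexed as "v-strom".
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

// Keeps whitespace-separated tokens intact (hyphens included) and lowercases them.
public final class HyphenFriendlyAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        return new TokenStreamComponents(source, result);
    }
}
As noted above, the same analyzer has to be used both at index time and at query time so that all variants normalize to the same term.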
ClassicAnalyzer handles '-' as a useful, non-delimiter character. As I understand ClassicAnalyzer, it handles '-' like the pre-3.1 StandardAnalyzer, because ClassicAnalyzer uses ClassicTokenizer, which treats a number with an embedded '-' as a product code, so the whole thing is tokenized as one term.
When I was at the Regenstrief Institute I noticed this after upgrading Luke, as the LOINC standard medical terms (LOINC was initiated by R.I.) are identified by a number followed by a '-' and a check digit, like '1-8' or '2857-1'. My searches for LOINCs like '45963-6' failed using StandardAnalyzer in Luke 3.5.0, but succeeded with ClassicAnalyzer (and this was because we had built the index with Lucene.NET 2.9.2).
(Based on Lucene 4.7) StandardTokenizer splits hyphenated words in two: for example, "chat-room" becomes "chat" and "room", and the two words are indexed separately instead of as a single whole word. It is quite common for separate words to be connected with a hyphen: "sport-mad", "camera-ready", "quick-thinking", and so on. A significant number are hyphenated names, such as "Emma-Claire". When doing a whole-word search or query, users expect to find the word within those hyphens. While there are some cases where the parts really are separate words, that is why Lucene keeps the hyphen out of its default definition of a word.
To make StandardAnalyzer support the hyphen, you have to make changes in StandardTokenizerImpl.java, which is a class generated from jFlex.
Refer to this link for the complete guide.
You have to add the following line in SUPPLEMENTARY.jflex-macro, which is included by the StandardTokenizerImpl.jflex file.
MidLetterSupp = ( [\u002D] )
After making the changes, provide the StandardTokenizerImpl.jflex file as input to the jFlex engine and click Generate. The output will be StandardTokenizerImpl.java.
Then rebuild the index using that class.
The ClassicAnalyzer is recommended for indexing text containing product codes like 'GSX-R1000'. It will recognize this as a single term and will not split it into parts. But, for example, the text 'Europe/Berlin' will be split by the ClassicAnalyzer into the words 'Europe' and 'Berlin'. This means that if you have text indexed by the ClassicAnalyzer containing the phrase
Europe/Berlin GSX-R1000
you can search for "europe", "berlin" or "GSX-R1000".
But be careful which analyzer you use for the search. I think the best choice for searching a Lucene index is the KeywordAnalyzer. With the KeywordAnalyzer you can also search for specific fields in a document, and you can build complex queries like:
(processid:4711) (berlin)
This query will search for documents with the phrase 'berlin' but also a field 'processid' containing the number 4711.
But if you search the index for the phrase "europe/berlin" you will get no result! This is because the KeywordAnalyzer does not change your search phrase, while the phrase 'Europe/Berlin' was split into the two separate words 'europe' and 'berlin' by the ClassicAnalyzer at index time. This means you have to search for 'europe' and 'berlin' separately.
To resolve this conflict you can translate a search term entered by the user into a search query that fits your needs, using the following code:
QueryParser parser = new QueryParser("content", new ClassicAnalyzer()); // same analyzer as at index time
Query result = parser.parse(searchTerm);        // analyzes and splits the user's input
searchTerm = result.toString("content");        // renders the analyzed query back as a string
This code will translate the search phrase
Europe/Berlin
into
europe berlin
which will result in the expected document set.
Note: This will also work for more complex situations. The search term
Europe/Berlin GSX-R1000
will be translated into:
(europe berlin) GSX-R1000
which will search correctly for all phrases in combination using the KeywordAnalyzer.
