Search stemmed and exact words in Lucene 4.4.0 - Java

I've stored a Lucene document with a single TextField containing words without stems.
I need to implement a search program that allows users to search for words and exact words,
but if I've stored the words without stemming, a stemmed search cannot be done.
Is there a way to search both exact and/or stemmed words in the documents without
storing two fields?
Thanks in advance.

Indexing two separate fields seems like the right approach to me.
Stemmed and unstemmed text require different analysis strategies, and so require you to provide a different Analyzer to the QueryParser. Lucene doesn't really support indexing text in the same field with different analyzers. That is by design. Furthermore, duplicating the text in the same field could result in some fairly strange scoring impacts (heavier scoring on terms that are not touched by the stemmer, particularly).
There is no need to store the text in both of these fields; it only makes sense to index it in separate fields.
You can apply different analyzers to different fields by using a PerFieldAnalyzerWrapper, by the way. Like:
Map<String,Analyzer> analyzerList = new HashMap<String,Analyzer>();
analyzerList.put("stemmedText", new EnglishAnalyzer(Version.LUCENE_44));
analyzerList.put("unstemmedText", new StandardAnalyzer(Version.LUCENE_44));
PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_44), analyzerList);
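To make that concrete, here is a minimal sketch (the field names and sample text are just illustrations) of indexing the same text into both fields and querying each one with the wrapper analyzer from above:

Directory dir = new RAMDirectory();
IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_44, analyzer));

Document doc = new Document();
String text = "the quick brown foxes jumped";
// index the same text twice, once per analysis strategy; storing it once is enough
doc.add(new TextField("stemmedText", text, Field.Store.NO));
doc.add(new TextField("unstemmedText", text, Field.Store.YES));
writer.addDocument(doc);
writer.close();

IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
// the stemmed field matches "foxes" via its stem, the unstemmed field only matches the exact form
Query stemmed = new QueryParser(Version.LUCENE_44, "stemmedText", analyzer).parse("foxes");
Query exact = new QueryParser(Version.LUCENE_44, "unstemmedText", analyzer).parse("foxes");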
I can see a couple of possibilities for accomplishing it in a single field, though, if you really want to.
One would be to create your own stem filter, based on (or possibly extending) the one you wish to use already, and add in the ability to keep the original tokens after stemming. Mind your position increments, in this case. Phrase queries and the like may be problematic.
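For what it's worth, Lucene already ships a filter that does roughly this: KeywordRepeatFilter emits every token twice, flagging one copy as a keyword so the stemmer leaves it untouched. A hedged sketch of a custom Analyzer along those lines (assuming the Lucene 4.4 analysis API, and using the Porter stemmer as an example):

class StemAndExactAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_44, reader);
        TokenStream result = new LowerCaseFilter(Version.LUCENE_44, source);
        result = new KeywordRepeatFilter(result);            // emit each token twice
        result = new PorterStemFilter(result);               // stems only the non-keyword copy
        result = new RemoveDuplicatesTokenFilter(result);    // drop duplicates where stemming changed nothing
        return new TokenStreamComponents(source, result);
    }
}

Both copies should end up at the same position, so phrase queries ought to behave reasonably, but test that for your data.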
The other (probably worse) possibility would be to add the text to the field normally, then add it again to the same field, but this time after stemming it manually. Two fields added with the same name will effectively be concatenated. You'd want to store the text in a separate field, in this case. Expect wonky scoring.
Again, though, both of these are bad ideas. I see no benefit whatsoever to either of these strategies over the much easier and more useful approach of just indexing two fields.

Related

Elasticsearch - EdgeNgram + highlight + term_vector = bad highlights

When I use an analyzer with edgengram (min=3, max=7, front) + term_vector=with_positions_offsets,
with a document having text = "CouchDB",
when I search for "couc",
my highlight is on "cou" and not "couc".
It seems my highlight is only on the minimum matching token "cou", while I would expect it to be on the exact token (if possible) or at least the longest token found.
It works fine when the analyzed text does not have term_vector=with_positions_offsets.
What's the impact on performance of removing term_vector=with_positions_offsets?
When you set term_vector=with_positions_offsets for a specific field it means that you are storing the term vectors per document, for that field.
When it comes to highlighting, term vectors allow you to use the lucene fast vector highlighter, which is faster than the standard highlighter. The reason is that the standard highlighter doesn't have any fast way to highlight since the index doesn't contain enough information (positions and offsets). It can only re-analyze the field content, intercept offsets and positions and make highlighting based on that information. This can take quite a while, especially with long text fields.
Using term vectors you do have enough information and don't need to re-analyze the text. The downside is the size of the index, which will notably increase. That said, since Lucene 4.2 term vectors are better compressed and stored in an optimized way, and there's also the new PostingsHighlighter, based on the ability to store offsets in the postings list, which requires even less space.
Elasticsearch automatically uses the best way to do highlighting based on the information available. If term vectors are stored, it will use the fast vector highlighter, otherwise the standard one. After you reindex without term vectors, highlighting will be done using the standard highlighter. It will be slower, but the index will be smaller.
Regarding ngram fields, the described behaviour is odd, since the fast vector highlighter should have better support for ngram fields; I would therefore expect exactly the opposite result.
I know this question is old, but it has not yet been answered completely:
There is another option that can lead to such strange behaviour:
You have to set require_field_match to true if you don't want matches on other fields of the document to influence the highlighting of the current field; see: http://www.elasticsearch.org/guide/reference/api/search/highlighting/

Assign a paper to a reviewer based on keywords

I was wondering if you know any algorithm that can do an automatic assignment for the following situation: I have some papers with some keywords defined, and some reviewers that have some specific keywords defined. How could I do an automatic mapping, so that each reviewer gets to review the papers from his/her area of interest?
If you are open to using external tools, Lucene is a library that will allow you to search text based on (from their website):
phrase queries, wildcard queries, proximity queries, range queries and more
fielded searching (e.g., title, author, contents)
date-range searching
sorting by any field
multiple-index searching with merged results
allows simultaneous update and searching
You will basically need to design your own parser, or specialize an existing one according to your needs. You need to scan the papers and, according to your keywords, search and match the tokens. Then the sentences containing these keywords can be separated out and displayed to the reviewer.
I would suggest the Stanford NLP POS tagger. Every keyword that you need will fall under some part of speech. You can then just tag your complete document, search for those tags, and sort out the sentences accordingly.
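For illustration, a rough sketch of tagging a document with the Stanford POS tagger (the model path below is an assumption; use whichever model file ships with the tagger you download):

import edu.stanford.nlp.tagger.maxent.MaxentTagger;

MaxentTagger tagger = new MaxentTagger("models/english-left3words-distsim.tagger");
String paper = "The reviewers evaluate papers on information retrieval and machine learning.";
// produces tokens annotated with their part of speech, e.g. "retrieval_NN"
System.out.println(tagger.tagString(paper));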
Apache Lucene could be one solution.
It allows you to index documents either in a RAM directory, or within a real directory of your file system, and then to perform full-text searches.
It offers a lot of very interesting features, like filters or analyzers. You can for example:
remove the stop words depending on the language of the documents (e.g. for english: a, the, of, etc.);
stem the tokens (e.g. function, functional, functionality, etc., are considered as a single instance);
perform complex queries (e.g. review*, keyw?rds, "to be or not to be", etc.);
and so on and so forth...
You should have a look! Don't hesitate to ask me for some code samples if Lucene is the way you choose :)
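For instance, here is a minimal sketch (the paper ID, field names and keyword strings are made up for illustration) of indexing papers by their keywords in a RAMDirectory and matching a reviewer's keywords against them:

Analyzer analyzer = new EnglishAnalyzer(Version.LUCENE_44);
Directory dir = new RAMDirectory();
IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_44, analyzer));

// index each paper with its keywords
Document paper = new Document();
paper.add(new StringField("id", "paper-42", Field.Store.YES));
paper.add(new TextField("keywords", "information retrieval, text mining, ranking", Field.Store.YES));
writer.addDocument(paper);
writer.close();

// search with a reviewer's keywords; the top hits are candidate assignments
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
Query query = new QueryParser(Version.LUCENE_44, "keywords", analyzer)
        .parse("\"information retrieval\" OR \"machine learning\"");
TopDocs hits = searcher.search(query, 10);
for (ScoreDoc hit : hits.scoreDocs) {
    System.out.println(searcher.doc(hit.doc).get("id") + " score=" + hit.score);
}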

Find arbitrary patterns common to a group of strings

Background:
I am developing a program that iterates over all the movies & TV series episodes stored on my computer, rates them (using Rotten Tomatoes) and sorts them in order of rating.
I extract the movie name by removing all the unnecessary text such as '.avi', '720p' etc. from the file name.
I am using Java.
Problem:
Some folders contain movie files such as:
Episode 301 Rainforest Schmainforest.avi
Episode 302 Spontaneous Combustion.avi
The word 'Episode' and numbers are valid and common in movie names, so I can't simply remove them. However, it is clear from the repetitive nature of the names that 'Episode' and '3XX' should be removed.
Another folder might contain:
720p.S5.E1.cripple fight.avi
720p.S5.E2.towelie.avi
Many arbitrary patterns like these exist in different groups of files, and I need something to recognise these patterns so I can extract the keywords. It would be infeasible to write a regex for each case.
Summary:
Is there a tool or API that I can use to find complex repetitive patterns (it must be able to match sequences of numbers)? [something like a longest common subsequence library]
Well, you could simply take all the filtered names in your dir, and do a simple word-count. You could give extra weight to words that occur in (roughly) the same spot every time.
In the end you'd have a count and a weight, and you need to decide where to draw the line. It's probably not every file in the dir (because of images or samples, perhaps), but if most files share a certain word, it's not "the" or something like that, and maybe it always appears "at the start" or "in the second spot", then you can filter it out.
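A rough sketch of that idea in Java (the fileNames list and the thresholds are assumptions you would tune for your own folders):

Map<String, Integer> counts = new HashMap<String, Integer>();
Map<String, Integer> positionHits = new HashMap<String, Integer>();

for (String name : fileNames) {   // fileNames: the cleaned-up names from one folder
    String[] words = name.toLowerCase().split("[\\s._-]+");
    for (int pos = 0; pos < words.length; pos++) {
        String w = words[pos];
        counts.put(w, counts.containsKey(w) ? counts.get(w) + 1 : 1);
        // crude "same spot" weight: remember how often a word shows up in the first two slots
        if (pos < 2) {
            positionHits.put(w, positionHits.containsKey(w) ? positionHits.get(w) + 1 : 1);
        }
    }
}

Set<String> toRemove = new HashSet<String>();
for (Map.Entry<String, Integer> e : counts.entrySet()) {
    int posWeight = positionHits.containsKey(e.getKey()) ? positionHits.get(e.getKey()) : 0;
    // drop words that appear in most files, giving extra weight to fixed positions
    if (e.getValue() + posWeight > fileNames.size()) {
        toRemove.add(e.getKey());
    }
}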
But this wouldn't work for, as a random example, Friends episodes. They're all called "The one where...", and that would be filtered out by every sane version of your sought-after algorithm.
The bottom line is: I don't think you can, because of the Friends-episode problem. There's just not enough distinction between wanted repetition and unwanted repetition.
The only thing you can do is make a blacklist of stuff you want to filter, like you already seem to do with the avi/720p thing.
I believe that what you are asking for is not trivial. Pattern extraction, as opposed to mere recognition, is well within the fields of artificial intelligence and knowledge discovery. I have encountered several related libraries for Java, but most need a lot of additional code to define even the simplest task.
Since this is a rather hot research area, you might want to perform a cursory search in Google Scholar, using appropriate keywords.
Disclaimer: before you use any library or algorithm found via the Internet, you should investigate its legal status. Unfortunately quite a few of the algorithms that are developed in active research areas are often encumbered by patents and such...
I have a kind-of answer posted here
http://pastebin.com/Eb0cQyKd
I wanted to remove non-unique parts of file names such as '720dpi', 'Episode', 'xvid', 'ac3' without specifying in advance what they would be. But I wanted to keep information like S01E01. I had created a huge blacklist, but it wasn't convenient because the list kept on changing.
The code linked above uses Python (not Java) to remove all non-unique words in a file name.
Basically it creates a list of all the words used in the file names, and any word which comes up in most of the files is put into a dictionary. Then it iterates through the files and deletes all these dictionary words from them.
The script also does some cleaning: some movies use underscores ('_') or periods ('.') to separate words in the filenames. I convert all these to spaces.
I have used it a lot recently and it works well.

String analysis and classification

I am developing a financial manager in my freetime with Java and Swing GUI. When the user adds a new entry, he is prompted to fill in: Moneyamount, Date, Comment and Section (e.g. Car, Salary, Computer, Food,...)
The sections are created "on the fly". When the user enters a new section, it will be added to the section-jcombobox for further selection. The other point is that the comments could be in different languages, so a list of hard-coded words and synonyms would be enormous.
So, my question is: is it possible to analyse the comment (e.g. "Fuel", "Car service", "Lunch at **") and preselect a fitting Section?
My first thought was to do it with a neural network and learn from the input whenever the user selects another section.
But my problem is that I don't know how to start at all. I tried "encog" with Eclipse and did some tutorials (XOR, ...), but all of them only use doubles as input/output.
Could anyone give me a hint on how to start, or any other possible solution for this?
Here is a runnable JAR (current development state, requires Java 7) and the Sourceforge Page
Forget about neural networks. This is a highly technical and specialized field of artificial intelligence, which is probably not suitable for your problem, and it requires solid expertise. Besides, there are a lot of simpler and better solutions for your problem.
The first obvious solution: build a list of words and synonyms for all your sections and parse for these synonyms. You can then collect comments online for synonym analysis, or parse the comments/sections provided by your users to statistically detect relations between words, etc.
There is an infinite number of possible solutions, ranging from the simplest to the most overkill. Now you need to decide whether this feature of your system is critical (prefilling? probably not, then)... and what any development effort will bring you. One hour of work could bring you an 80%-satisfying feature, while aiming for 90% could cost one week of work. Is it really worth it?
Go for the simplest solution and tackle the real challenge of any dev project: delivering. Once your app is delivered, then you can always go back and improve as needed.
// check the comment text for known keywords (case-insensitive)
String myString = paramInput.toUpperCase();
if (myString.contains("FUEL")) {
    // do the fuel functionality, e.g. preselect the "Car" section
}
In a simple app, if you only have some specific sections, you can take the string from the comment, check whether it contains certain keywords, and change the value of Section accordingly.
If you have a lot of categories, I would use something like Apache Lucene, where you could index all the categories with their names and potential keywords/phrases that might appear in a user's description. Then you could simply run the description through Lucene and use the top matched category as a "best guess".
P.S. Neural Network inputs and outputs will always be doubles or floats with a value between 0 and 1. As for how to implement String matching I wouldn't even know where to start.
It seems to me that the following will do:
hard word statistics
maybe a stemming class (English/Spanish) which reduces a word like "lunches" to "lunch".
a list of most frequent non-words (the, at, a, for, ...)
The best fit is a linear problem, so in theory a fit for a neural net, but why not go straight for the numerical best fit.
A machine learning algorithm such as an Artificial Neural Network doesn't seem like the best solution here. ANNs can be used for multi-class classification (i.e. 'which of the provided pre-trained classes does the input belong to?', not just 'does the input represent an X?'), which fits your use case. The problem is that they are supervised learning methods, and as such you need to provide a list of pairs of keywords and classes (Sections) that spans every possible input that your users will provide. This is impossible, and in practice ANNs are re-trained when more data is available to produce better results and create a more accurate decision boundary / representation of the function that maps the inputs to outputs. This also assumes that you know all possible classes before you start, and that each of those classes has training input values that you provide.
The issue is that the input to your ANN (a list of characters or a numerical hash of the string) provides no context by which to classify. There's no higher level information provided that describes the word's meaning. This means that a different word that hashes to a numerically close value can be misclassified if there was insufficient training data.
(As maclema said, the output from an ANN will always be floats with each value representing proximity to a class - or a class with a level of uncertainty.)
A better solution would be to employ some kind of word-relation or synonym graph. A Bag of words model might be useful here.
Edit: In light of your comment that you don't know the Sections beforehand,
an easy solution to program would be to provide a list of keywords in a file that gets updated as people use the program. Simply storing a mapping of provided comments -> Sections, which you will already have in your database, would allow you to filter out non-keywords (and, or, the, ...). One option is then to find a list of the Sections that each typed keyword belongs to, suggest multiple Sections, and let the user pick one. The feedback that you get from user selections would enable improvement of the suggestions in the future. Another would be to calculate a Bayesian probability - the probability that this word belongs to Section X given the previously stored mappings - for all keywords and Sections, and either take the modal Section or normalise over each unique keyword and take the mean. Calculations of the probabilities will need to be updated as you gather more information, of course; perhaps this could be done with every new addition in a background thread.
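A rough sketch of the simple frequency-based variant of this (the stop-word list is a placeholder; feed learn() from the comment/Section pairs you already store):

import java.util.*;

class SectionSuggester {
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("and", "or", "the", "at", "a", "for"));
    // keyword -> (section -> how often that keyword appeared in comments filed under that section)
    private final Map<String, Map<String, Integer>> keywordSections =
            new HashMap<String, Map<String, Integer>>();

    // call this whenever the user saves an entry, so the suggester learns from real usage
    void learn(String comment, String section) {
        for (String word : comment.toLowerCase().split("\\W+")) {
            if (word.isEmpty() || STOP_WORDS.contains(word)) continue;
            Map<String, Integer> counts = keywordSections.get(word);
            if (counts == null) {
                counts = new HashMap<String, Integer>();
                keywordSections.put(word, counts);
            }
            counts.put(section, counts.containsKey(section) ? counts.get(section) + 1 : 1);
        }
    }

    // suggest the Section whose learned keywords best match a new comment (the modal Section)
    String suggest(String comment) {
        Map<String, Integer> votes = new HashMap<String, Integer>();
        for (String word : comment.toLowerCase().split("\\W+")) {
            Map<String, Integer> counts = keywordSections.get(word);
            if (counts == null) continue;
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                Integer v = votes.get(e.getKey());
                votes.put(e.getKey(), (v == null ? 0 : v) + e.getValue());
            }
        }
        String best = null;
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            if (best == null || e.getValue() > votes.get(best)) best = e.getKey();
        }
        return best; // null means "no suggestion yet"
    }
}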

Tagging of names using lucene/java

I have the names of all the employees of my company (5000+). I want to write an engine which can, on the fly, find names in online articles (blogs/wikis/help documents) and tag them with a "mailto" tag pointing to the user's email.
As of now I am planning to remove all the stop words from the article and then search for each word in a Lucene index. But even in that case I see a lot of queries hitting the indexes; for example, if there is an article with 2000 words and only two references to people's names, then most probably there will be 1000 Lucene queries.
Is there a way to reduce these queries? Or a completely other way of achieving the same?
Thanks in advance
If you have only 5000 names, I would just stick them into a hash table in memory instead of bothering with Lucene. You can hash them several ways (e.g., nicknames, first-last or last-first, etc.) and still have a relatively small memory footprint and really efficient performance.
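A minimal sketch of that (the names and article text are placeholders); it checks single tokens and adjacent token pairs, so "First Last" forms are caught as well as single-word nicknames:

Set<String> names = new HashSet<String>();
names.add("jane doe");
names.add("john smith");   // in practice, load all 5000+ names (and their variants) here

String[] tokens = article.toLowerCase().split("[^a-z']+");
for (int i = 0; i < tokens.length; i++) {
    if (names.contains(tokens[i])) {
        // single-token match (e.g. a nickname)
    }
    if (i + 1 < tokens.length && names.contains(tokens[i] + " " + tokens[i + 1])) {
        // "first last" match; wrap this span with the mailto tag
    }
}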
http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
This algorithm might be of use to you. The way this would work is you first compile the entire list of names into a giant finite state machine (which would probably take a while), but then once this state machine is built, you can run it through as many documents as you want and detect names pretty efficiently.
I think it would look at every character in each document only once, so it should be much more efficient than tokenizing the document and comparing each word to a list of known names.
There are a bunch of implementations available for different languages on the web. Check it out.
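As an illustration, a sketch using one such Java implementation (the org.ahocorasick "ahocorasick" library is an assumption here, and its builder API may differ between versions, but any Aho-Corasick implementation will look broadly similar):

import org.ahocorasick.trie.Emit;
import org.ahocorasick.trie.Trie;

// build the automaton once from all employee names (the expensive step)
Trie trie = Trie.builder()
        .ignoreCase()
        .onlyWholeWords()
        .addKeyword("Jane Doe")     // in practice, loop over the 5000+ names
        .addKeyword("John Smith")
        .build();

// then scan each article in a single pass
String article = "Yesterday Jane Doe presented the roadmap to John Smith.";
for (Emit emit : trie.parseText(article)) {
    // emit.getStart()/getEnd() give the offsets to wrap with the mailto tag
    System.out.println(emit.getKeyword() + " at " + emit.getStart() + "-" + emit.getEnd());
}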
