I have been using Lucene for building indexes of documents and performing searches on them. I know that Lucene supports FuzzyQuery, which is based on Levenshtein distance.
FuzzyQuery also has an option to define a prefix length, where we can keep the first few characters of the search term fixed. I want to know whether there is an option to define a suffix length, or please suggest some implementation where I can achieve this.
The main reason for the prefix in FuzzyQuery is that it allows the search to narrow the possible result set before checking for fuzzy matches, and so provides a significant performance improvement. Adding a suffix wouldn't provide any such benefit.
The best way to achieve this and reap the performance benefit may be to index the tokens reversed, by adding a ReverseStringFilter to your analyzer. The same approach is often used to support leading-wildcard queries without the big performance hit that typically comes with them.
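To make the idea concrete, here's a plain-Java sketch (not using Lucene itself, just illustrating the trick ReverseStringFilter applies inside an analyzer chain): reversing tokens at index time turns a suffix match into a cheap prefix match.

```java
import java.util.*;
import java.util.stream.*;

public class SuffixAsPrefix {
    // Reverse a token, mirroring what Lucene's ReverseStringFilter does at index time.
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    public static void main(String[] args) {
        List<String> indexedTerms = Stream.of("running", "jumping", "walked")
                .map(SuffixAsPrefix::reverse)   // store reversed terms in the index
                .collect(Collectors.toList());

        // A query for the suffix "ing" becomes a prefix query on the reversed field.
        String suffixQuery = reverse("ing");    // "gni"
        List<String> hits = indexedTerms.stream()
                .filter(t -> t.startsWith(suffixQuery))
                .map(SuffixAsPrefix::reverse)   // reverse back for display
                .collect(Collectors.toList());

        System.out.println(hits);               // [running, jumping]
    }
}
```

With the reversed field in place, a fuzzy query with a fixed suffix becomes a fuzzy query with a fixed prefix on that field, so the normal prefix-length optimization applies.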
I have a Lucene application with multiple indices in which the relevancy scoring suffers due to differences in the term frequencies across the different indices. My understanding is that the Term Dictionary (.tim file) contains "term statistics" such as the document frequency statistics on each term. I was thinking that one approach might be to modify the .tim files for each index (and related segments) and update the "term statistics". Is it possible to overwrite or modify the .tim and .tip files in such a way?
relevancy scoring suffers
From the FAQ:
score values are meaningful only for purposes of comparison between other documents for the exact same query and the exact same index. When you try to compute a percentage, you are setting up an implicit comparison with scores from other queries.
Is it possible? I suppose, but it strikes me as about as good an idea as attempting to change an application by directly modifying the compiled binaries.
If you need very specific things from scoring, then you should generally implement a Similarity that does what you need; extending TFIDFSimilarity is often a good idea. It's really not clear what the exact problem is, so I can't provide any more specific guidance than that, but perhaps that points you in the right general direction.
I have a DB table which stores a list of all exceptions in Java and their descriptions.
When the user inputs the exception name, it retrieves the respective description. I have used Levenshtein distance to match strings in case they enter a wrong string, but I want to eliminate words irrelevant to a string search, such as "and", "or", etc., from the input string, and to provide fast searching.
Is there an already existing framework or API for doing this kind of searching on a list of strings?
Is there a better way to search for strings than Levenshtein distance?
Actually, you're kind of wrong. Words such as "and" and "or" are extremely relevant to the way some search engines work; furthermore, as you already know, Levenshtein distance is a common and effective metric for checking the similarity between words. Using a (probably hashed) dictionary is also almost as fast as it gets. As already stated, if you really want to filter the input, define your rules for filtering, process the input, and then use the resulting string as the base for the Levenshtein calculation.
Also, I'm kind of provoked to post a LMGTFY link here, since actually reading the Wikipedia article about Levenshtein gives you all the other information you may need. I'd suggest reading more about all the distance metrics and edit distances; there's not much I can add to the coverage already present in the links below.
source: http://en.wikipedia.org/wiki/Levenshtein_distance, http://en.wikipedia.org/wiki/Edit_distance, http://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm
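For reference, here is a self-contained Java sketch of the approach described above: strip the irrelevant words first, then compare with a standard dynamic-programming Levenshtein distance. The stop-word list is purely illustrative.

```java
import java.util.*;

public class FuzzyLookup {
    // Classic dynamic-programming Levenshtein (Wagner-Fischer) distance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    // Remove stop words before matching; this list is just an example.
    static final Set<String> STOP_WORDS = Set.of("and", "or", "the", "a", "an");

    static String stripStopWords(String input) {
        StringBuilder sb = new StringBuilder();
        for (String w : input.toLowerCase().split("\\s+"))
            if (!STOP_WORDS.contains(w)) sb.append(w);
        return sb.toString();
    }

    public static void main(String[] args) {
        String query = stripStopWords("null and pointer exception");
        System.out.println(levenshtein(query, "nullpointerexception")); // 0: exact match once filtered
    }
}
```

In practice you would compute the distance of the filtered input against every key in the table (or a hashed dictionary of keys) and pick the closest one below some threshold.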
I have a set of search terms like [+dog -"jack russels" +"fox terrier"], [+cat +persian -tabby]. These could be quite long with maybe 30 sub-terms making up each term.
I now have some online news article extracts, such as ["My fox terrier is the cutest dog in the world..."] and ["Has anyone seen my lost persian cat? He went missing ..."]. They're not too long, perhaps 500 characters at most each.
In traditional search engines one expects a huge number of articles that are pre-processed into indexes, allowing for speed-ups when searching given 'search terms', using set theory/boolean logic to reduce the articles to only the ones that match the phrases. In this situation, however, the number of my search terms is on the order of 10^5, and I'd like to be able to process a single article at a time, to see ALL the sets of search terms that article would be matched with (i.e. all the + terms are in the text and none of the - terms).
I have a possible solution using two maps (one for the positive sub-phrases, one for the negative sub-phrases), but I don't think it'll be very efficient.
First prize would be a library that solves this problem, second prize is a push in the right direction towards solving this.
Kind regards,
Assuming all the positive sub-terms are required for a match:
Put all the sub-terms from your search terms into a hashtable. The sub-term is the key, the value is a pointer to the full search term data structure (which should include a unique id and a map of sub-terms to a boolean).
Additionally, when processing a news item, create a "candidates" map, indexed by the term id. Each candidate structure has a pointer to the term definition, a set that contains the seen sub-terms and a "rejected" flag.
Iterate over the words of the news article.
For each hit, look up the candidate entry. If not there, create and add an empty one.
If the candidate's rejected flag is set, you are done with that candidate.
Otherwise, look up the sub-term from the term data structure.
If negative, set the rejected flag.
If positive, add the sub-term to the set of seen sub-terms.
In the end, iterate over the candidates. All candidates that are not rejected, and where the size of the seen set equals the number of positive sub-terms of that term, are your hits.
Implementation: https://docs.google.com/document/d/1boieLJboLTy7X2NH1Grybik4ERTpDtFVggjZeEDQH74/edit
Runtime is O(n * m) where n is the number of words in the article and m is the maximum number of terms sharing the same sub-term (expected to be relatively small).
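The steps above can be sketched in Java. This is a minimal single-word version (real sub-terms would include quoted phrases, which need phrase matching on top of this; all the names here are my own, not from the linked document):

```java
import java.util.*;

public class TermMatcher {
    // A search term: required (+) sub-terms and forbidden (-) sub-terms.
    static class Term {
        final int id;
        final Set<String> positives, negatives;
        Term(int id, Set<String> pos, Set<String> neg) { this.id = id; positives = pos; negatives = neg; }
    }

    // Per-article matching state for one term.
    static class Candidate {
        final Term term;
        final Set<String> seen = new HashSet<>();
        boolean rejected = false;
        Candidate(Term term) { this.term = term; }
    }

    // sub-term -> all terms that mention it (positively or negatively)
    static Map<String, List<Term>> buildIndex(List<Term> terms) {
        Map<String, List<Term>> index = new HashMap<>();
        for (Term t : terms) {
            for (String s : t.positives) index.computeIfAbsent(s, k -> new ArrayList<>()).add(t);
            for (String s : t.negatives) index.computeIfAbsent(s, k -> new ArrayList<>()).add(t);
        }
        return index;
    }

    static List<Integer> match(List<Term> terms, List<String> articleWords) {
        Map<String, List<Term>> index = buildIndex(terms);
        Map<Integer, Candidate> candidates = new HashMap<>();
        for (String word : articleWords) {
            for (Term t : index.getOrDefault(word, List.of())) {
                Candidate c = candidates.computeIfAbsent(t.id, k -> new Candidate(t));
                if (c.rejected) continue;            // already ruled out
                if (t.negatives.contains(word)) c.rejected = true;
                else c.seen.add(word);               // one more positive sub-term seen
            }
        }
        List<Integer> hits = new ArrayList<>();
        for (Candidate c : candidates.values())
            if (!c.rejected && c.seen.size() == c.term.positives.size())
                hits.add(c.term.id);
        return hits;
    }

    public static void main(String[] args) {
        Term t1 = new Term(1, Set.of("dog", "fox"), Set.of("tabby"));
        Term t2 = new Term(2, Set.of("cat", "persian"), Set.of("dog"));
        List<String> article = List.of("my", "fox", "terrier", "is", "a", "dog");
        System.out.println(match(List.of(t1, t2), article)); // [1]
    }
}
```

Note that the article is scanned once, and each word only touches the terms that actually mention it, which is where the O(n * m) bound comes from.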
First of all, I think building a suffix tree of your document makes the searching much faster, since you only need to build it once but can then query it repeatedly, with each lookup taking time proportional to the length of the query.
Second, you need to iterate over all of the search terms (both + and - ones) to be sure the answer is yes (that is, the document matches the query). However, for a "no" answer, you don't! If the answer is no, then the order in which you match the search terms against the document really matters: one order may give you a faster "no" than another. Now the question is: what is the optimal order to get a fast NO? It really depends on the application, but a good starting point is that multi-word terms such as "red big cat" are repeated less commonly in documents than short terms such as "cat", and vice versa. So, check the +"Loo ooo ooo ooo ooo ong" and -"short" terms first.
Is there any built-in library in Java for searching strings in large files of about 100GB? I am currently using binary search, but it is not that efficient.
As far as I know Java does not contain any file search engine, with or without an index. There is a very good reason for that too: search engine implementations are intrinsically tied to both the input data set and the search pattern format. A minor variation in either could result in massive changes in the search engine.
For us to be able to provide a more concrete answer you need to:
Describe exactly the data set: the number, path structure and average size of files, the format of each entry and the format of each contained token.
Describe exactly your search patterns: are those fixed strings, glob patterns or, say, regular expressions? Do you expect the pattern to match a full line or a specific token in each line?
Describe exactly your desired search results: do you want exact or approximate matches? Do you want to get a position in a file, or extract specific tokens?
Describe exactly your requirements: are you able to build an index beforehand? Is the data set expected to be modified in real time?
Explain why you can't use third-party libraries such as Lucene that are designed exactly for this kind of work.
Explain why your current binary search, which should have a complexity of O(log n), is not efficient enough. The only thing that might be faster, with constant complexity, would involve the use of a hash table.
It might be best if you described your problem in broader terms. For example, one might assume from your sample data set that what you have is a set of words and associated offset or document identifier lists. A simple way to approach searching in such a set would be to store a word/file-position index in a hash table, to be able to access each associated list in constant time.
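As an illustration of that last suggestion, here is a minimal in-memory sketch in Java. For a 100GB data set you would of course build the index once and persist it (or use a library that does), but the constant-time lookup idea is the same:

```java
import java.util.*;

public class WordIndex {
    // word -> list of character offsets where the word occurs
    static Map<String, List<Integer>> buildIndex(String text) {
        Map<String, List<Integer>> index = new HashMap<>();
        int offset = 0;
        for (String word : text.split(" ")) {
            index.computeIfAbsent(word.toLowerCase(), k -> new ArrayList<>()).add(offset);
            offset += word.length() + 1; // advance past the word and the space
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index =
                buildIndex("the quick fox jumps over the lazy dog");
        System.out.println(index.get("the")); // [0, 25]
        System.out.println(index.get("fox")); // [10]
    }
}
```

Each lookup is then a single hash probe rather than an O(log n) walk over the file, at the cost of building (and storing) the index up front.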
If you don't want to use tools built for search, then store the data in a database and use SQL.
I have a List in C#. This string collection contains the paragraphs read from an MS Word file. For example,
list 0-> The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there are the basic text elements, the primary building blocks for your document. Next up is the table at the bottom of the report which will be discussed in full, including the handy styling effects such as row-banding. Finally the image displayed in the header will be added to finalize the report.
list 1-> The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there are the basic text elements, the primary building blocks for your document. Various other elements of WordprocessingML will also be handled. By moving the formatting information into styles a higher degree of re-use is made possible. The document will be marked using custom XML tags and the insertion of other advanced elements such as a table of contents is discussed. But before all the advanced features can be added, the base of the document needs to be built.
Something like that.
Now my search string is:
The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there are the basic text elements, the primary building blocks for your document. Next up is the table at the bottom of the report which will be discussed in full, including the handy styling effects such as row-banding. Before going over all the elements which make up the sample documents a basic document structure needs to be laid out. When you take a WordprocessingML document and use the Windows Explorer shell to rename the docx extension to zip you will find many different elements, especially in larger documents.
I want to check my search string against those list elements.
My criterion is: if a list element contains an 85% match or an exact match of the search string, then I want to retrieve that list element.
In our case,
list 0 -> satisfies my search string better.
list 1 -> it also matches some text, but I think it falls below my criterion.
How do I do this kind of criterion-based search on strings?
I am also rather confused about this problem.
Your ideas and thoughts are welcome.
The keywords are DISTANCE or "string distance", and also "paragraph similarity".
You seek to implement a function which would express, as a scalar (say a percentage, as suggested in the question), how similar one string is to another.
Plain string distance functions such as Hamming or Levenshtein may not be appropriate, for they work at the character level rather than at the word level, but these algorithms generally convey the idea of what is needed.
Working at the word level, you'll probably also want to take into account some common NLP features: for example, ignore (or give less weight to) very common words (such as 'the', 'in', 'of', etc.) and maybe allow for some form of stemming. The order of the words, or at least their proximity, may also be of import.
One key factor to remember is that even with relatively short strings, many distance functions can be quite expensive, computationally speaking. Before selecting one particular algorithm you'll need to get an idea of the general parameters of the problem:
how many strings would have to be compared? (on average, maximum)
how many words/tokens do the strings contain? (on average, max)
Is it possible to introduce a simple (quick) filter to reduce the number of strings to be compared ?
how fancy do we need to get with linguistic features ?
is it possible to pre-process the strings ?
Are all the records in a single language ?
Comparing Methods for Single Paragraph Similarity Analysis, a scholarly paper, provides a survey of relevant techniques and considerations.
In a nutshell, the amount of design time and run time one can apply to this relatively open problem varies greatly, and the choice is typically a compromise between the level of precision desired and the run-time resources and overall complexity of the solution that are acceptable.
In its simplest form, when the order of the words matters little, computing the sum of factors based on the TF-IDF values of the words which match may be a very acceptable solution.
Fancier solutions may introduce a pipeline of processes borrowed from NLP, for example part-of-speech tagging (say, to avoid false positives such as "saw" as a noun (a tool to cut wood) versus "saw" as the past tense of the verb "to see", or, more likely, to filter out some words outright based on their grammatical function), stemming, and possibly semantic substitutions, concept extraction, or latent semantic analysis.
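In that simplest spirit, here is a small Java sketch (my own illustration, not taken from the paper) that scores two texts by summing the IDF weights of their shared words, so that rare shared words count for more than common ones:

```java
import java.util.*;

public class TfIdfSimilarity {
    // Sum the IDF weight of every word shared by both texts.
    static double similarity(List<String> a, List<String> b, List<List<String>> corpus) {
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(new HashSet<>(b));
        double score = 0;
        for (String word : shared) {
            long docsWithWord = corpus.stream().filter(doc -> doc.contains(word)).count();
            score += Math.log((double) corpus.size() / docsWithWord); // IDF: rare words weigh more
        }
        return score;
    }

    static List<String> tokens(String s) {
        return Arrays.asList(s.toLowerCase().split("\\W+"));
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                tokens("the cat sat on the mat"),
                tokens("the dog chased the cat"),
                tokens("stock prices fell sharply"));
        // Shared rare words contribute more than shared ubiquitous words.
        System.out.println(similarity(corpus.get(0), corpus.get(1), corpus));
    }
}
```

To turn the score into the percentage asked for in the question, you could normalize by the total IDF weight of the search string's words, so that a full match scores 1.0.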
You may want to look into Lucene for Java or Lucene.Net for C#. I don't think it'll meet the percentage requirement you want out of the box, but it's a great tool for doing text matching.
You could maybe run a separate query for each word, and then work out the percentage of matches yourself.
Here's an idea (and not a solution by any means but something to get started with)
private IEnumerable<string> SearchList = GetAllItems(); // load your list

void Search(string searchPara)
{
    char[] delimiters = new char[] { ' ', '.', ',' };
    var wordsInSearchPara = searchPara
        .Split(delimiters, StringSplitOptions.RemoveEmptyEntries)
        .Select(a => a.ToLower())
        .Distinct()
        .ToList();
    foreach (var item in SearchList)
    {
        var wordsInItem = item
            .Split(delimiters, StringSplitOptions.RemoveEmptyEntries)
            .Select(a => a.ToLower())
            .Distinct()
            .ToList();
        // fraction of the search paragraph's words that appear in this item
        double matchRatio = (double)wordsInItem.Intersect(wordsInSearchPara).Count()
                            / wordsInSearchPara.Count;
        if (matchRatio >= 0.85)
            Console.WriteLine(item); // candidate that meets the 85% criterion
    }
}