I have a Lucene application with multiple indices in which the relevancy scoring suffers due to differences in the term frequencies across the different indices. My understanding is that the Term Dictionary (.tim file) contains "term statistics" such as the document frequency statistics on each term. I was thinking that one approach might be to modify the .tim files for each index (and related segments) and update the "term statistics". Is it possible to overwrite or modify the .tim and .tip files in such a way?
relevancy scoring suffers
From the FAQ:
score values are meaningful only for purposes of comparison between
other documents for the exact same query and the exact same index.
when you try to compute a percentage, you are setting up an implicit
comparison with scores from other queries.
Is it possible? I suppose, but it strikes me as about as good an idea as attempting to change an application by directly modifying the compiled binaries.
If you need very specific things from scoring, then you should generally implement a Similarity that does what you need. Extending TFIDFSimilarity is often a good idea. Really not clear on what the exact problem is, so I can't provide any more specific guidance than that, but perhaps that provides a point in the right general direction.
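For illustration only, here is a minimal sketch of that idea, assuming the Lucene 6+ ClassicSimilarity API; the class name and the blending constant are made up for the example, not a recommendation:

import org.apache.lucene.search.similarities.ClassicSimilarity;

// Illustrative sketch: blend the real idf with a fixed baseline so that
// indices with skewed term statistics score more alike.
public class FlattenedIdfSimilarity extends ClassicSimilarity {
    @Override
    public float idf(long docFreq, long docCount) {
        float realIdf = super.idf(docFreq, docCount);
        float baseline = 1.0f; // illustrative constant
        return (realIdf + baseline) / 2.0f;
    }
}

You would then set the same Similarity on the IndexWriterConfig (via setSimilarity) at index time and on the IndexSearcher at query time, so that norms and scoring agree.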
I have been using Lucene for building indexes of documents and performing searches on them. I know that Lucene supports FuzzyQuery, which is based on Levenshtein distance.
FuzzyQuery also has an option to define a prefix length, where we can keep the first few characters of the search term fixed. I want to know whether there is an option to define a suffix length, or please suggest some implementation where I can achieve this.
The main reason for the prefix in FuzzyQuery is that it allows the search to narrow the possible result set before checking for fuzzy matches, and so provides a significant performance improvement. Adding a suffix doesn't provide any such benefit.
The best way to achieve this and reap the performance benefit may be to index the tokens reversed, by adding a ReverseStringFilter to your analyzer. This trick is similarly often used to support leading wildcard queries without the big performance hit that typically comes with them.
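A rough sketch, assuming a Lucene 5+ style Analyzer API; the class name is made up for the example:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.reverse.ReverseStringFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Sketch: index each token reversed, so that a FuzzyQuery prefix on this
// field effectively pins the suffix of the original word.
public class ReversingAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        StandardTokenizer source = new StandardTokenizer();
        TokenStream reversed = new ReverseStringFilter(source);
        return new TokenStreamComponents(source, reversed);
    }
}

At query time you would reverse the user's term as well. For example, to pin the last three characters of "string", something like new FuzzyQuery(new Term("body_rev", "gnirts"), 2, 3), where "body_rev" is a hypothetical field indexed with this analyzer.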
Is there any built-in library in Java for searching strings in large files of about 100GB? I am currently using binary search, but it is not that efficient.
As far as I know Java does not contain any file search engine, with or without an index. There is a very good reason for that too: search engine implementations are intrinsically tied to both the input data set and the search pattern format. A minor variation in either could result in massive changes in the search engine.
For us to be able to provide a more concrete answer you need to:
Describe exactly the data set: the number, path structure and average size of files, the format of each entry and the format of each contained token.
Describe exactly your search patterns: are those fixed strings, glob patterns or, say, regular expressions? Do you expect the pattern to match a full line or a specific token in each line?
Describe exactly your desired search results: do you want exact or approximate matches? Do you want to get a position in a file, or extract specific tokens?
Describe exactly your requirements: are you able to build an index beforehand? Is the data set expected to be modified in real time?
Explain why you can't use third-party libraries such as Lucene, which are designed exactly for this kind of work.
Explain why your current binary search, which should have a complexity of O(log n), is not efficient enough. The only thing that might be faster, with constant complexity, would involve the use of a hash table.
It might be best if you described your problem in broader terms. For example, one might assume from your sample data set that what you have is a set of words and associated offset or document identifier lists. A simple method to approach searching in such a set would be to store a word/file-position index in a hash table to be able to access each associated list in constant time.
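To illustrate that last idea only, a rough sketch in Java, assuming line-oriented data and that the set of distinct words fits in memory; all names here are made up:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

// Sketch: one pass over the file records the byte offset of the first
// line on which each word appears; a lookup is then a single O(1) map
// probe plus one seek, instead of a binary search over the file.
public class OffsetIndex {
    private final Map<String, Long> offsets = new HashMap<>();
    private final RandomAccessFile file;

    public OffsetIndex(String path) throws IOException {
        file = new RandomAccessFile(path, "r");
        long pos = file.getFilePointer();
        String line;
        while ((line = file.readLine()) != null) {
            for (String word : line.split("\\s+")) {
                offsets.putIfAbsent(word, pos);
            }
            pos = file.getFilePointer();
        }
    }

    // Returns the first line containing the word, or null if absent.
    public String lookup(String word) throws IOException {
        Long pos = offsets.get(word);
        if (pos == null) return null;
        file.seek(pos);
        return file.readLine();
    }
}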
If you don't want to use tools built for search, then store the data in a database and use SQL.
I have a List in C#. This string collection contains paragraphs that are read from an MS Word file. For example:
list 0-> The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there are the basic text elements, the primary building blocks for your document. Next up is the table at the bottom of the report which will be discussed in full, including the handy styling effects such as row-banding. Finally the image displayed in the header will be added to finalize the report.
list 1-> The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there are the basic text elements, the primary building blocks for your document. Various other elements of WordprocessingML will also be handled. By moving the formatting information into styles a higher degree of re-use is made possible. The document will be marked using custom XML tags and the insertion of other advanced elements such as a table of contents is discussed. But before all the advanced features can be added, the base of the document needs to be built.
Something like that.
Now my search string is:
The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there are the basic text elements, the primary building blocks for your document. Next up is the table at the bottom of the report which will be discussed in full, including the handy styling effects such as row-banding. Before going over all the elements which make up the sample documents a basic document structure needs to be laid out. When you take a WordprocessingML document and use the Windows Explorer shell to rename the docx extension to zip you will find many different elements, especially in larger documents.
I want to compare my search string against those list elements.
My criterion is: if a list element contains an 85% match (or an exact match) of the search string, then I want to retrieve that element.
In our case:
list 0 -> better satisfies my search string.
list 1 -> it also matches some of the text, but I think it falls below my criterion...
How do I do this kind of criteria-based search on strings?
I am still somewhat confused about my problem.
Your ideas and thoughts are welcome...
The keywords are "string distance" and also "paragraph similarity".
You seek to implement a function which expresses, as a scalar (say, a percentage, as suggested in the question), how similar one string is to another.
Plain string distance functions such as Hamming or Levenshtein may not be appropriate, for they work at the character level rather than at the word level, but these algorithms generally convey the idea of what is needed.
Working at the word level you'll probably also want to take into account some common NLP features, for example ignoring (or giving less weight to) very common words (such as 'the', 'in', 'of', etc.) and perhaps allowing for some form of stemming. The order of the words, or at least their proximity, may also be important.
One key factor to remember is that even with relatively short strings, many distance functions can be quite expensive, computationally speaking. Before selecting one particular algorithm you'll need to get an idea of the general parameters of the problem:
how many strings will have to be compared? (on average, maximum)
how many words/tokens do the strings contain? (on average, maximum)
is it possible to introduce a simple (quick) filter to reduce the number of strings to be compared?
how fancy do we need to get with linguistic features?
is it possible to pre-process the strings?
are all the records in a single language?
"Comparing Methods for Single Paragraph Similarity Analysis", a scholarly paper, provides a survey of relevant techniques and considerations.
In a nutshell, the amount of design-time and run-time effort one can apply to this relatively open problem varies greatly, and is typically a compromise between the desired level of precision on the one hand and the run-time resources and overall solution complexity that are acceptable on the other.
In its simplest form, when the order of the words matters little, computing the sum of factors based on the TF-IDF values of the words which match may be a very acceptable solution.
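A sketch of that simplest form (in Java for brevity; the idea ports directly to C#), where docFreq, the smoothing, and the normalisation are all assumptions made for the example:

import java.util.Map;
import java.util.Set;

// Sketch: weight each query word by a smoothed idf and report the
// fraction of total weight found in the candidate paragraph.
public class TfIdfOverlap {
    // docFreq: in how many corpus paragraphs each word occurs.
    public static double score(Set<String> queryWords,
                               Set<String> paragraphWords,
                               Map<String, Integer> docFreq,
                               int totalParagraphs) {
        double matched = 0.0, total = 0.0;
        for (String w : queryWords) {
            double idf = Math.log((totalParagraphs + 1.0)
                    / (docFreq.getOrDefault(w, 0) + 1.0));
            total += idf;
            if (paragraphWords.contains(w)) {
                matched += idf;
            }
        }
        return total == 0.0 ? 0.0 : matched / total;
    }
}

A result of 0.85 or more would then satisfy the 85% criterion from the question, with rare words counting for more than common ones.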
Fancier solutions may introduce a pipeline of processes borrowed from NLP, for example Part-of-Speech tagging (say, for the purpose of avoiding false positives such as "saw" as a noun (to cut wood) versus "saw" as the past tense of the verb "to see", or, more likely, to filter out some words outright based on their grammatical function), stemming, and possibly semantic substitutions, concept extraction or latent semantic analysis.
You may want to look into Lucene for Java or Lucene.NET for C#. I don't think it will meet your percentage requirement out of the box, but it's a great tool for text matching.
You could perhaps run a separate query for each word, and then work out the percentage of matches yourself.
Here's an idea (not a solution by any means, but something to get started with):
private IEnumerable<string> SearchList = GetAllItems(); // load your list (GetAllItems must be static for a field initializer)

void Search(string searchPara)
{
    char[] delimiters = new char[] { ' ', '.', ',' };
    // Distinct, lower-cased words of the search paragraph.
    var wordsInSearchPara = searchPara.Split(delimiters, StringSplitOptions.RemoveEmptyEntries)
        .Select(a => a.ToLower()).Distinct().ToList();
    foreach (var item in SearchList)
    {
        var wordsInItem = item.Split(delimiters, StringSplitOptions.RemoveEmptyEntries)
            .Select(a => a.ToLower()).Distinct().ToList();
        // Words the item shares with the search paragraph.
        int common = wordsInItem.Intersect(wordsInSearchPara).Count();
        // Fraction of search words found in this item; compare it against the 85% threshold.
        double percentMatch = 100.0 * common / wordsInSearchPara.Count;
    }
}
I'm building a system where I want to show only results indexed in the past few days.
Furthermore, I don't want to maintain a giant index with a million documents if I only want to return results from a couple of days (thousands of documents).
On the other hand, my system heavily relies that the occurrences of terms in documents stored in the index have a realistic distribution (consequently: realistic IDF).
That said, I would like to use a small index to return results, but compute document scores using IDF statistics from a much larger index (or even an external source).
The Similarity API doesn't seem to allow me to do this: the idf method does not receive the term being used as a parameter.
Another possibility is to use TrieRangeQuery to make sure the documents shown are within the last couple of days. But again, I'd rather not maintain a larger index, and this kind of query is not cheap either.
You should be able to extend IndexReader and override the docFreq() methods to provide whatever values you'd like. One thing this implementation can do is open two IndexReader instances -- one for the small index and one for the large index. All the methods are delegated to the small IndexReader, except for docFreq(), which is delegated to the large index. You'll need to scale the value returned, i.e.
int myNewDocFreq = (int) ((double) bigIndexReader.docFreq(t) / bigIndexReader.maxDoc() * smallIndexReader.maxDoc());
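(Note the cast to double: with plain int arithmetic the division would truncate to zero.) For illustration, a sketch using the Lucene 3.x-era FilterIndexReader; the class name is made up here, and you would adapt it to your Lucene version:

import java.io.IOException;
import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// Sketch: delegate everything to the small index ("in"), but answer
// docFreq() from the large index, scaled to the small index's size.
public class BorrowedStatsReader extends FilterIndexReader {
    private final IndexReader big;

    public BorrowedStatsReader(IndexReader small, IndexReader big) {
        super(small);
        this.big = big;
    }

    @Override
    public int docFreq(Term t) throws IOException {
        return (int) ((double) big.docFreq(t) / big.maxDoc() * in.maxDoc());
    }
}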
I am about to do a homework assignment, and I need to store quite a lot of information (a dictionary) in a data structure of my choice. I heard people in my class saying hash tables are the way to go. How come?
Advantages
When you first hear about hash tables they sound too good to be true. The reason is that no matter how many items there are, searching and insertion (and sometimes deletion) take approximately O(1) time, which is pretty much instantaneous from the user's point of view. Given their speed, hash tables are used mainly (though not exclusively) in programs that need to look up thousands of items in under a second, such as spell-checkers and search engines. From my particular point of view I also find hash tables much easier to program than any sort of binary tree, and I am no expert, so if you are a beginner that might be an advantage too.
Disadvantages
As hash tables are based on arrays, they can be difficult to expand once created. Also, I have read that for certain hash tables the speed of operations degrades noticeably once they are full or getting full. As a result of both, when programming you will need a fairly accurate estimate of how many items you need to store. Additionally, it is not possible to traverse the items in a hash table in order, for example from the smallest to the biggest, so if that is something you need, a hash table might not be what you are looking for.
Extra Info
Wikipedia articles - Hash Table, Big O Notation
Tutorial on Hash Tables - Tutorial
All how to's about Hash Tables - Java2S
Book Advice
I advise you to get the book "Data Structures & Algorithms in Java" (Second Edition) by Robert Lafore. It's a big book, but it has everything explained very gently; for me it is the only programming book so far that I can read like a novel.
Additional info regarding Big O notation - O(1)
O(1) doesn't mean "pretty much instantaneous" (an O(1) algorithm could take hours, weeks or years). It means (in this case) "is independent of the size of the collection" (assuming the hash code is good enough). – Ben Lings
Thanks to Ben for his clarification.
P.S.: You might want to be more descriptive in the future when you ask a question; that way other users can pinpoint what you are looking for.
To help you out on deciding what type of collection is better for you, take a look at this Java Tutorials lesson:
Lesson: Introduction to Collections
Reading this you can see which collection fits your needs.
The best structure for your dictionary would be a prefix tree (trie), in which each node's 'key' is a letter from one of your words and each node's 'value' is the meaning of the word (the dictionary translation). Word lookup is linear in the word's length (the same as a hash table, since your hash function would ideally be linear in the word's length), or O(1) if we consider words as a whole. What is better than a hash table is that a hash table will take a lot of space in order to ensure O(1) access and, depending on the words in the dictionary, it might be very sparsely populated. The prefix tree, on the other hand, actually provides compression: the tree itself contains all the original information in less space than before, since common parts of words are shared along the tree structure. Dictionaries usually have tens of thousands of words, making a prefix tree a particularly attractive solution.
P.S. As mentioned earlier, the tree scales almost without limit, in contrast to a hash table.
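A minimal sketch of such a prefix tree in Java (the class and field names are just for illustration):

import java.util.HashMap;
import java.util.Map;

// Sketch: each node maps a character to a child node; a node carries a
// definition only where a complete word ends.
public class Trie {
    private final Map<Character, Trie> children = new HashMap<>();
    private String definition;

    public void put(String word, String def) {
        Trie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Trie());
        }
        node.definition = def;
    }

    // Lookup is O(length of word), independent of dictionary size.
    public String get(String word) {
        Trie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return null;
        }
        return node.definition;
    }
}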
It depends on what you want to store and how you want to access it. You don't really provide enough information.
Hash tables provide O(1) lookup times so they can be used to retrieve values based on a key very quickly. If the hashing algorithm is expensive you may find that it is outperformed by other data structures. This is especially true if you are doing a lot of inserting and removing of items from the structure.
If you are planning on using a hash table implementation from the Java libraries, be sure to note that there are two of them - Hashtable and HashMap. One of them is commonly used these days, and one is outdated and generally found in legacy code. Do some research to find out which is which, and why the newer one is better.
A hashtable allows you to map keys to objects.
If you're storing values that have unique keys, and you will need to lookup the values by their keys, hashtables are the way to go.
If you just want to store an ordered set of objects without unique keys, an ordinary ArrayList is the way to go. (In particular, note that ordinary hash tables are unordered.)
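For example, a tiny key-to-object lookup with the commonly used Map implementation (the entry here is made up):

import java.util.HashMap;
import java.util.Map;

public class DictionaryDemo {
    public static void main(String[] args) {
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("ephemeral", "lasting for a very short time");
        // Expected O(1) lookup by key, independent of dictionary size.
        System.out.println(dictionary.get("ephemeral"));
    }
}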
Hash tables are a good option, but when using one you may have to decide what a good hash function would be; that question can have many answers and depends on the programmer. I personally feel you should check out the B+ tree or the trie. One of the main uses of a trie is dictionary representation. Trie in Wiki
Hope this helps!