Java: Very quick way of searching a string in a huge dictionary

I have a huge dictionary containing around 1.2 million strings. As input I will get a sentence, and for each word of the sentence I need to check whether it is present in the dictionary.
Performance is the highest priority for me, hence I would like to keep this dictionary in memory. I want to complete each dictionary lookup in less than a millisecond. Kindly suggest how I can achieve this. Is there any existing external API which does this?

So you only need a set of the words from the dictionary, and then check whether it contains all the words of the sentence.
Set<String> dictionaryIndex = new HashSet<>(); // fill with the ~1.2 million dictionary words
Set<String> sentence = new HashSet<>();        // fill with the words of the input sentence
if (!dictionaryIndex.containsAll(sentence)) {
    // at least one word of the sentence is not in the dictionary
}
However, if you want to do more, consider a database, maybe an embedded in-memory database like H2 or Derby. You can then do more, and a query would be:
SELECT COUNT(*) FROM dictionary WHERE word IN ('think', 'positive', 'human')
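If you go the embedded route, a minimal sketch using plain JDBC against an in-memory H2 database might look like this (the table layout and word values are just examples, and it assumes the H2 driver is on the classpath):
import java.sql.*;

// In-memory H2 database; the data disappears when the JVM exits.
Connection con = DriverManager.getConnection("jdbc:h2:mem:dict");
try (Statement st = con.createStatement()) {
    st.execute("CREATE TABLE dictionary(word VARCHAR(255) PRIMARY KEY)");
}
// ... insert the 1.2 million words once, then per sentence:
try (PreparedStatement ps = con.prepareStatement(
        "SELECT COUNT(*) FROM dictionary WHERE word IN (?, ?, ?)")) {
    ps.setString(1, "think");
    ps.setString(2, "positive");
    ps.setString(3, "human");
    try (ResultSet rs = ps.executeQuery()) {
        rs.next();
        int found = rs.getInt(1); // how many of the three words exist in the dictionary
    }
}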
You might even consider a spelling library. Those keep a smaller dictionary and use stemming: 'learn' for learning, learner, learned, learns.

If you are open to using external APIs, I would suggest you go for Elasticsearch's percolate API. With performance being the priority, this fits your requirement exactly.
The Java API can be found here.
You can index all the keywords and then feed it a document (in your case the sentence).
Indexing:
for (String obj : keywordLst) {
    client.prepareIndex("myindex", ".percolator", obj)
          .setSource(XContentFactory.jsonBuilder()
              .startObject()
                  .field("query", QueryBuilders.matchPhraseQuery("content", obj))
              .endObject())
          .setRefresh(true)
          .execute().actionGet();
}
Searching:
XContentBuilder docBuilder = XContentFactory.jsonBuilder().startObject();
docBuilder.field("doc").startObject();
docBuilder.field("content", text);
docBuilder.endObject(); // end of the doc field
docBuilder.endObject(); // end of the JSON root object

PercolateResponse response = client.preparePercolate().setSource(docBuilder)
        .setIndices("myindex").setDocumentType("type")
        .execute().actionGet();
for (PercolateResponse.Match match : response) {
    // found matches
}

I think the 1.2 million strings may not fit in memory, or may easily exceed your memory limits (consider a bad case where the average string length is 256 characters).
If some kind of pre-processing is allowed, I think you'd better first reduce the sequence of strings to a set of distinct words, i.e. convert your data into another data set that easily fits in memory.
After that, you can rely on in-memory data structures such as HashMap or HashSet.
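For example, a rough sketch of that idea (the file name and the normalization are assumptions; adapt to wherever your strings actually come from):
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

// Pre-processing: load the deduplicated, lower-cased words once at startup.
Set<String> dictionary = new HashSet<>();
for (String line : Files.readAllLines(Paths.get("dictionary.txt"), StandardCharsets.UTF_8)) {
    dictionary.add(line.trim().toLowerCase());
}

// Per sentence: each contains() call is O(1) on average, well under a millisecond.
String sentence = "the input sentence to check";
for (String word : sentence.toLowerCase().split("\\s+")) {
    boolean inDictionary = dictionary.contains(word);
    // ... handle the word accordingly
}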

Related

How to ignore Stop words search using Lucene, when analysis of Stop Words is required?

How to ignore Stop words during Lucene Search?
I have analyzed all data, including stop words, using a custom analyzer, because that is a requirement for most of the searches.
But now another requirement has come up for one of the modules: exclude stop words from searches on the same fields where stop words are already analyzed.
The analysis looks like:
@Fields({@Field(index = Index.YES, store = Store.NO, analyzer = @Analyzer(impl = CustomStopWordsAccepterAnalyzer.class)),
Now the requirement says to ignore stop words when the search string is "Love With Hubby" and to return the best-scoring results using just "Love Hubby". Kindly suggest!
Once you enable stop words for a field, the stop words are effectively not encoded in the index, so they cannot be made to reappear at query time.
The problem you have is quite common, as people often need to combine the scores of multiple full-text queries performed with different options.
The solution is rather simple: for each property of your Java entity, use multiple @Field annotations and assign a different index field name to each. This way you can target each field with a BooleanQuery and have the scores of the output take both fields into account.
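A minimal sketch of that approach, using the Hibernate Search annotations from the question (StopWordFilteringAnalyzer is a hypothetical second analyzer that drops stop words; only the field names need to differ):
import org.hibernate.search.annotations.*;

// The same property indexed twice under different field names:
// "content" keeps the stop words (the analyzer from the question),
// "contentNoStop" removes them (hypothetical analyzer).
@Fields({
    @Field(name = "content", index = Index.YES, store = Store.NO,
           analyzer = @Analyzer(impl = CustomStopWordsAccepterAnalyzer.class)),
    @Field(name = "contentNoStop", index = Index.YES, store = Store.NO,
           analyzer = @Analyzer(impl = StopWordFilteringAnalyzer.class))
})
private String content;
At query time, a BooleanQuery with one clause against "content" and one against "contentNoStop" lets both fields contribute to the final score.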

Lucene: How to perform search on several independent index sets and merge the result?

Now I have several Lucene index sets (I call them shards), which index different document sets. They are independent, which means I can search each of them without reading the others. Then I get a query request. I want to run it over every index set and combine the results to form the final top documents.
I know that when scoring the documents, Lucene needs to know the idf of every term, and different index sets will give a different idf to the same term (because different index sets hold different document sets). So to my understanding, I cannot compare document scores from different index sets directly. How, then, should I generate the final result?
An obvious solution would be to first merge the indexes and then perform the search over the big index. However, this is too time-consuming for me and thus unacceptable. Does anyone have a better solution?
P.S.: I don't want to use any packages or softwares (like Katta) except Lucene and Hadoop.
I think MultiReader is what you are looking for. If you have multiple IndexReaders, say reader1 and reader2:
MultiReader multiReader = new MultiReader(reader1, reader2);
IndexSearcher searcher = new IndexSearcher(multiReader);
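For completeness, a rough sketch of opening the shard readers and running one query over the combined view (directory paths, field, and term are illustrative). Because MultiReader exposes the shards as one logical index, document frequencies, and therefore idf values, are computed across all shards together, so the scores are comparable:
import java.io.File;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;

IndexReader reader1 = IndexReader.open(FSDirectory.open(new File("/data/shard1")));
IndexReader reader2 = IndexReader.open(FSDirectory.open(new File("/data/shard2")));

MultiReader multiReader = new MultiReader(reader1, reader2);
IndexSearcher searcher = new IndexSearcher(multiReader);

Query query = new TermQuery(new Term("content", "lucene")); // example query
TopDocs top = searcher.search(query, 10);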

Create a Occurrence Vector Using Apache Lucene

We are developing an application to detect plagiarism. We are using Apache Lucene for document indexing. I need to create an occurrence vector for each document using the index we created. I would like to know whether there is a way to do this using Apache Lucene. I tried to use TermFreqVectors but I couldn't find a proper way. Any suggestion or help is highly appreciated.
Thanks.
The TermFreqVector class does what you'd like, I think. It can even give you term positions so that you can detect ordered sequences of words. To generate the vector, you need to specify this at indexing time like this:
String text = "text you want to index; you could also use a Reader here";
Document doc = new Document();
doc.add(new Field("text", text, Store.NO, Index.ANALYZED, TermVector.WITH_POSITIONS));
At retrieval time, you can run phrase queries (e.g., "a b c"~25) or SpanQueries (which you have to construct programmatically).
To get term frequency and position information from the index, do something like this:
TermPositionVector v = (TermPositionVector) this.reader.getTermFreqVector(docnum, this.textField);
int wordIndex = v.indexOf("want");
int[] positions = v.getTermPositions(wordIndex); // should return the position(s) of the word "want" in your text
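To build the full occurrence vector for a document, the same vector also exposes every term of the field together with its frequency (a sketch against the Lucene 3.x API used above; "text" is the field from the indexing example):
TermFreqVector tfv = reader.getTermFreqVector(docnum, "text");
String[] terms = tfv.getTerms();        // all distinct terms of the field
int[] freqs = tfv.getTermFrequencies(); // frequency of each term, in the same order
for (int i = 0; i < terms.length; i++) {
    System.out.println(terms[i] + " -> " + freqs[i]);
}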
If you want to achieve this you could use a RAMDirectory to store your document (assuming you only want to do this for one document).
Then you can use IndexReader.termDocs(Term term) to fetch the TermDocs for this directory, containing the document id (only one if you store one doc) and the frequency of the term in that document.
You can then do this for each term to create your occurrence vector.
You could of course also do this for more than one document and create multiple occurrence vectors at once.
http://lucene.apache.org/java/3_1_0/api/all/org/apache/lucene/index/IndexReader.html
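A rough sketch of that approach against the Lucene 3.x API linked above (the field name, analyzer choice, and example term are assumptions):
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.*;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

String documentText = "the text of the document you want the occurrence vector for";

RAMDirectory dir = new RAMDirectory();
IndexWriter writer = new IndexWriter(dir,
        new IndexWriterConfig(Version.LUCENE_31, new StandardAnalyzer(Version.LUCENE_31)));
Document doc = new Document();
doc.add(new Field("text", documentText, Field.Store.NO, Field.Index.ANALYZED));
writer.addDocument(doc);
writer.close();

IndexReader reader = IndexReader.open(dir);
TermDocs termDocs = reader.termDocs(new Term("text", "plagiarism")); // one term at a time
while (termDocs.next()) {
    int docId = termDocs.doc();  // which document (only one here)
    int freq  = termDocs.freq(); // occurrences of the term in that document
}
// Iterating reader.terms() instead gives you every term, so you can fill the whole vector.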
As I'm sure you are ultimately looking to find similar documents, you might also want to have a look at the MoreLikeThis implementation in Lucene: http://lucene.apache.org/java/3_1_0/api/all/org/apache/lucene/search/similar/MoreLikeThis.html

Is it possible to develop a criteria-based search on strings in C# or Java?

I have a List in C#. This string list contains paragraph elements that are read from an MS Word file. For example,
list 0-> The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there rae the basic text elements, the primary building blocks for your document. Next up is the table at the bottom of the report which will be discussed in full, including the handy styling effects such as row-banding. Finally the image displayed in the header will be added to finalize the report.
list 1->The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there rae the basic text elements, the primary building blocks for your document. Various other elements of WordprocessingML will also be handled. By moving the formatting information into styles a higher degree of re-use is made possible. The document will be marked using custom XML tags and the insertion of other advanced elements such as a table of contents is discussed. But before all the advanced features can be added, the base of the document needs to be built.
Something like that.
Now my search string is:
The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there rae the basic text elements, the primary building blocks for your document. Next up is the table at the bottom of the report which will be discussed in full, including the handy styling effects such as row-banding. Before going over all the elements which make up the sample documents a basic document structure needs to be laid out. When you take a WordprocessingML document and use the Windows Explorer shell to rename the docx extension to zip you will find many different elements, especially in larger documents.
I want to compare my search string with those list elements.
My criterion is: if a list element contains an 85% match, or an exact match, of the search string, then I want to retrieve that list element.
In our case:
list 0 -> satisfies my search string better.
list 1 -> it also matches some text, but I think it falls below my criterion...
How do I do this kind of criteria-based search on strings?
I am also still quite confused about my problem.
Your ideas and thoughts are welcome.
The keywords are "string distance" and also "paragraph similarity".
You seek to implement a function that expresses, as a scalar (say a percentage, as suggested in the question), how similar one string is to another.
Plain string distance functions such as Hamming or Levenshtein may not be appropriate, for they work at the character level rather than the word level, but these algorithms generally convey the idea of what is needed.
Working at the word level you'll probably also want to take into account some common NLP features, for example ignoring (or giving less weight to) very common words (such as 'the', 'in', 'of', etc.) and maybe allowing for some form of stemming. The order of the words, or at least their proximity, may also be important.
One key factor to remember is that even with relatively short strings, many distance functions can be quite expensive, computationally speaking. Before selecting one particular algorithm you'll need to get an idea of the general parameters of the problem:
how many strings will have to be compared? (on average, maximum)
how many words/tokens do the strings contain? (on average, maximum)
is it possible to introduce a simple (quick) filter to reduce the number of strings to be compared?
how fancy do we need to get with linguistic features?
is it possible to pre-process the strings?
are all the records in a single language?
Comparing Methods for Single Paragraph Similarity Analysis, a scholarly paper, provides a survey of relevant techniques and considerations.
In a nutshell, the amount of design-time and run-time effort one can apply to this relatively open problem varies greatly, and it is typically a compromise between the level of precision desired and the run-time resources and overall solution complexity that are acceptable.
In its simplest form, when the order of the words matters little, computing a sum of factors based on the TF-IDF values of the words which match may be a very acceptable solution.
Fancier solutions may introduce a pipeline of processes borrowed from NLP, for example part-of-speech tagging (say for the purpose of avoiding false positives such as "saw" as a noun (to cut wood) versus "saw" as the past tense of the verb "to see", or, more likely, to filter out some of the words outright based on their grammatical function), stemming, and possibly semantic substitutions, concept extraction, or latent semantic analysis.
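Going back to the simple approach, here is a rough Java sketch of that idea (the question also mentions C#): score each candidate paragraph by the IDF-weighted share of search words it contains. All names are illustrative and the tokenization is deliberately naive:
import java.util.*;

static Set<String> words(String text) {
    return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
}

// Returns a value between 0 and 1; e.g. keep candidates scoring >= 0.85.
static double idfWeightedOverlap(String search, String candidate, List<String> corpus) {
    Set<String> searchWords = words(search);
    Set<String> candidateWords = words(candidate);
    double matched = 0, total = 0;
    for (String w : searchWords) {
        // Rare words in the corpus weigh more than common ones like "the".
        long docsWithWord = corpus.stream().filter(p -> words(p).contains(w)).count();
        double idf = Math.log((corpus.size() + 1.0) / (docsWithWord + 1.0)) + 1.0;
        total += idf;
        if (candidateWords.contains(w)) matched += idf;
    }
    return total == 0 ? 0 : matched / total;
}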
You may want to look into Lucene for Java or Lucene.Net for C#. I don't think it will meet the percentage requirement you want out of the box, but it's a great tool for doing text matching.
You could maybe run a separate query for each word, and then work out the percentage of matches yourself.
Here's an idea (not a solution by any means, but something to get started with):
private IEnumerable<string> SearchList = GetAllItems(); // load your list

void Search(string searchPara)
{
    char[] delimiters = new char[] { ' ', '.', ',' };
    var wordsInSearchPara = searchPara.Split(delimiters, StringSplitOptions.RemoveEmptyEntries).Select(a => a.ToLower()).OrderBy(a => a);
    foreach (var item in SearchList)
    {
        var wordsInItem = item.Split(delimiters, StringSplitOptions.RemoveEmptyEntries).Select(a => a.ToLower()).OrderBy(a => a);
        var common = wordsInItem.Intersect(wordsInSearchPara);
        // now that you know the common words, you can get the differential,
        // e.g. the share of search words found in this item:
        double matchPercent = 100.0 * common.Count() / wordsInSearchPara.Count();
    }
}

Data structure for search engine in JAVA?

I am an MCS 2nd-year student. I am doing a project in Java in which I have different images. For storing the description of, say, IMAGE-1, I have an ArrayList named IMAGE-1; similarly, for IMAGE-2 an ArrayList IMAGE-2, and so on.
Now I need to develop a search engine in which I can find all images whose description matches a word entered in the search engine.
For example, if I enter "computer" then I should be able to find all images whose description contains "computer".
So my questions are:
How should I do this efficiently?
How should I maintain all those ArrayLists, since I can have hundreds of them? Or should I use another data structure instead of ArrayList?
A simple implementation is to tokenize the description and use a Map<String, Collection<Item>> to store all items for a token.
Building:
for(String token: tokenize(description)) map.get(token).add(item)
(A collection is needed as multiple entries could be found for a token. The initialization of the collection is missing in the code. But the idea should be clear.)
Use:
List<Item> result = map.get("Computer")
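A minimal sketch of the whole thing, with the collection initialization filled in (Item, items and tokenize() are placeholders for whatever holds your image descriptions):
import java.util.*;

Map<String, List<Item>> index = new HashMap<>();

// Building: register every item under each token of its description.
for (Item item : items) {
    for (String token : tokenize(item.getDescription())) {
        index.computeIfAbsent(token.toLowerCase(), k -> new ArrayList<>()).add(item);
    }
}

// Use: one map lookup per query word.
List<Item> result = index.getOrDefault("computer", Collections.emptyList());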
The general-purpose HashMap implementation is not the most efficient in this case. When you start getting memory problems you can look into a more memory-efficient tree implementation (like a radix tree).
The next step could be to use some (in-memory) database. These could be relational (HSQL) or not (Berkeley DB).
If you have a small number of images and short descriptions (< 1000 characters), load them into an array and search for words using String.indexOf() (i.e. one entry in the array == one complete image description). This is efficient enough for, say, fewer than 10,000 images.
Use toLowerCase() to fold the case of the characters (so users will find "Computer" when they type "computer"). String.indexOf() will also work for partial words (using "comp" to find "Computer" or "compare").
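A quick sketch of that brute-force loop (descriptions and query are illustrative names):
String needle = query.toLowerCase();
List<Integer> matches = new ArrayList<>();
for (int i = 0; i < descriptions.length; i++) {
    // Case-insensitive substring search; also matches partial words like "comp".
    if (descriptions[i].toLowerCase().indexOf(needle) >= 0) {
        matches.add(i); // remember which image description matched
    }
}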
If you have lots of images and long descriptions and/or you want to give your users some comforts for the search (like Google does), then use Lucene.
There is no simple, easy-to-use data structure that supports efficient fulltext search.
But do you actually need efficiency? Is this a desktop app or a web app? In the former case, don't worry about efficiency, a modern CPU can search through megabytes of text in fractions of a second - simply look through all your descriptions using String.contains() (or a regexp to allow more flexible searches).
If you really need efficiency (such as for a webapp where many people could do searches at the same time), look into Apache Lucene.
As for your ArrayLists, it seems strange to use one for the description of a single image. Why a list, what does the index represent? Lines? If so, and unless you actually need to access lines directly, replace the lists with a simple String - it can contain newline characters just fine.
I would suggest using the Hashtable class, or organizing your content into a tree, to optimize searching.