Improving the speed of Solr query over 16 million tweets

Improving the speed of Solr query over 16 million tweets - java

I use Solr (SolrCloud) to index and search my tweets. There are about 16 million tweets and the index size is approximately 3 GB. The tweets are indexed in real time as they come so that real time search is enabled. Currently I use lowercase field type for my tweet body field. For a single search term in the search, it is taking around 7 seconds and with addition of each search term, time taken for search is linearly increasing. 3GB is the maximum RAM allocated for the solr process. Sample solr search query looks like this
tweet_body:*big* AND tweet_body:*data* AND tweet_tag:big_data
Any suggestions on improving the speed of searching? Currently I run only 1 shard which contains the entire tweet collection.

The query tweet_body:*big* can be expected to perform poorly. Trailing wildcards are easy, Leading Wildcards can be readily handled with a ReversedWildcardFilterFactory. Both, however, will have to scan every document, rather than being able to utilize the index to locate matching documents. Combining the two approaches would only allow you to search:
tweet_body:*big tweet_body:big*
Which is not the same thing. If you really must search for terms with a leading AND trailing wildcard, I would recommend looking into indexing your data as N-grams.
I wasn't previously aware of it, but it seems the lowercase field type is a Lowercase filtered KeywordAnalyzer. This is not what you want. That means the entire field is treated as a single token. Good for identification numbers and the like, but not for a body of text you wish to perform a full text search on.
So yes, you need to change it. text_general is probably appropriate. That will index a correctly tokenized field, and you should be able to performt he query you are looking for with:
tweet_body:big AND tweet_body:data AND tweet_tag:big_data
You will have to reindex, but there is no avoiding that. There is no good, performant way to perform a full text search on a keyword field.

Try using filter queries,as filter queries runs in parallel

Related

Finding number of unique terms over multiple fields

I need to find number (or list) of unique terms over a combination of two or more fields in Lucene-Java. I am using Java libraries for Lucene 4.1.0. I checked questions such as this and this, but they discuss finding list of unique terms from a single (specific) field, or over all the fields (no subset).
For example, I am interested in number(unique(height, gender)) rather than number(unique(height)), or number(unique(gender)).
Given the data:
height,gender
1,M
2,F
3,M
3,F
4,M
4,F
number(unique(height)) is 4, number(unique(gender)) is 2 and number(unique(gender,height)) is 6.
Any help will be greatly appreciated.
Thanks!

If you have predefined multiple fields then the simplest and quickest (in search terms) would be to index a combined field, i.e. heightGender (1.23:male). You can then just count the unique terms in this field, however this doesn't offer any flexibility at search time.
A more flexible approach would be to use facets (https://lucene.apache.org/core/4_1_0/facet/index.html). You would then constrain you query to each value of one field (e.g. Gender (male/female)) and retrieve all the values (and document counts) of the other field.
However if you do not have the ability to change the indexing process then you are left with doing a brute force search using Boolean queries to find the number of documents in the index for all combinations of the field values in which you are interested. I presume you are only counting combinations where the number of documents is non-zero.
It is worth noting that this question is exactly what Solr Pivot Facets address (http://lucidworks.com/blog/pivot-facets-inside-and-out/)

What data structure to use for indexing data for partial %infix% searching?

Imagine you have a huge cache of data that is to be searched through by 4 ways :
exact match
prefix%
%suffix
%infix%
I'm using Trie for the first 3 types of searching, but I can't figure out how to approach the fourth one other than sequential processing of huge array of elements.

If your dataset is huge cosider using a search platform like Apache Solr so that you dont end up in a performance mess.

You can construct a navigable map or set (eg TreeMap or TreeSet) for the 2 (with keys in normal order) and 3 (keys in reverse)
For option 4 you can construct a collection with a key for every starting letter. You can simplify this depending on your requirement. This can lead to more space being used but get O(log n) lookup times.

For #4 I am thinking if you pre-compute the number of occurances of each character then you can look up in that table for entires that have at least as many occurances of the characters in the search string.
How efficient this algorithm is will probably depend on the nature of the data and the search string. It might be useful to give some examples of both here to get better answers.

Build in library's to perform effective searching on 100GB files

Is there any build-in library in Java for searching strings in large files of about 100GB in java. I am currently using binary-search but it is not that efficient.

As far as I know Java does not contain any file search engine, with or without an index. There is a very good reason for that too: search engine implementations are intrinsically tied to both the input data set and the search pattern format. A minor variation in either could result in massive changes in the search engine.
For us to be able to provide a more concrete answer you need to:
Describe exactly the data set: the number, path structure and average size of files, the format of each entry and the format of each contained token.
Describe exactly your search patterns: are those fixed strings, glob patterns or, say, regular expressions? Do you expect the pattern to match a full line or a specific token in each line?
Describe exactly your desired search results: do you want exact or approximate matches? Do you want to get a position in a file, or extract specific tokens?
Describe exactly your requirements: are you able to build an index beforehand? Is the data set expected to be modified in real time?
Explain why can't you use third party libraries such as Lucene that are designed exactly for this kind of work.
Explain why your current binary search, which should have a complexity of O(logn) is not efficient enough. The only thing that might be be faster, with a constant complexity would involve the use of a hash table.
It might be best if you described your problem in broader terms. For example, one might assume from your sample data set that what you have is a set of words and associated offset or document identifier lists. A simple method to approach searching in such a set would be to store an word/file-position index in a hash table to be able to access each associated list in constant time.

If u doesn't want to use the tools built for search, then store the data in DB and use sql.

Advice on how to improve a current fuzzy search implementation

I'm currently working on implementing a fuzzy search for a terminology web service and I'm looking for suggestions on how I might improve the current implementation. It's too much code to share, but I think an explanation might suffice to prompt thoughtful suggestions. I realize it's a lot to read but I'd appreciate any help.
First, the terminology is basically just a number of names (or terms). For each word, we split it into tokens by space and then iterate through each character to add it to the trie. On a terminal node (such as when the character y in strawberry is reached) we store in a list an index to the master term list. So a terminal node can have multiple indices (since the terminal node for strawberry will match 'strawberry' and 'allergy to strawberry').
As for the actual search, the search query is also broken up into tokens by space. The search algorithm is run for each token. The first character of the search token must be a match (so traw will never match strawberry). After that, we go through children of each successive node. If there is child with a character that matches, we continue the search with the next character of the search token. If a child does not match the given character, we look at the children using the current character of the search token (so not advancing it). This is the fuzziness part, so 'stwb' will match 'strawberry'.
When we reach the end of the search token, we will search through the rest of the trie structure at that node to get all potential matches (since the indexes to the master term list are only on the terminal nodes). We call this the roll up. We store the indices by setting their value on a BitSet. Then, we simply and the BitSets from the results of each search token result. We then take, say, the first 1000 or 5000 indices from the anded BitSets and find the actual terms they correspond to. We use Levenshtein to score each term and then sort by score to get our final results.
This works fairly well and is pretty fast. There are over 390k nodes in the tree and over 1.1 million actual term names. However, there are problems with this as it stands.
For example, searching for 'car cat' will return Catheterization, when we don't want it to (since the search query is two words, the result should be at least two). That would be easy enough to check, but it doesn't take care of a situation like Catheterization Procedure, since it is two words. Ideally, we'd want it to match something like Cardiac Catheterization.
Based on the need to correct this, we came up with some changes. For one, we go through the trie in a mixed depth/breadth search. Essentially we go depth first as long as a character matches. Those child nodes that didn't match get added to a priority queue. The priority queue is ordered by edit distance, which can be calculated while searching the trie (since if there's a character match, distance remains the same and if not, it increases by 1). By doing this, we get the edit distance for each word.
We are no longer using the BitSet. Instead, it's a map of the index to a Terminfo object. This object stores the index of the query phrase and the term phrase and the score. So if the search is "car cat" and a term matched is "Catheterization procedure" the term phrase indices will be 1 as will the query phrase indices. For "Cardiac Catheterization" the term phrase indices will be 1,2 as will the query phrase indices. As you can see, it's very simple afterward to look at the count of term phrase indices and query phrase indices and if they aren't at least equal to the search word count, they can be discarded.
After that, we add up the edit distances of the words, remove the words from the term that match the term phrase index, and count the remaining letters to get the true edit distance. For example, if you matched the term "allergy to strawberries" and your search query was "straw" you would have a score of 7 from strawberries, then you'd use the term phrase index to discard strawberries from the term, and just count "allergy to" (minus the spaces) to get the score of 16.
This gets us the accurate results we expect. However, it is far too slow. Where before we could get 25-40 ms on one word search, now it could be as much as half a second. It's largely from things like instantiating TermInfo objects, using .add() operations, .put() operations and the fact that we have to return a large number of matches. We could limit each search to only return 1000 matches, but there's no guarantee that the first 1000 results for "car" would match any of the first 1000 matches for "cat" (remember, there are over 1.1. million terms).
Even for a single query word, like cat, we still need a large number of matches. This is because if we search for 'cat' the search is going to match car and roll up all the terminal nodes below it (which will be a lot). However, if we limited the number of results, it would place too heavy an emphasis on words that begin with the query and not the edit distance. Thus, words like catheterization would be more likely to be included than something like coat.
So, basically, are there any thoughts on how we could handle the problems that the second implementation fixed, but without as much of the speed slow down that it introduced? I can include some selected code if it might make things clearer but I didn't want to post a giant wall of code.

Wow... tough one.
Well why don't you implement lucene? It is the best and current state of the art when it comes to problems like yours afaik.
However I want to share some thoughts...
Fuziness isnt something like straw* its rather the mis typing of some words. And every missing/wrong character adds 1 to the distance.
Its generally very, very hard to have partial matching (wildcards) and fuzziness at the same time!
Tokenizing is generally a good idea.
Everything also heavily depends on the data you get. Are there spelling mistakes in the source files or only in the search queries?
I have seen some pretty nice implementations using multi dimensional range trees.
But I really think if you want to accomplish all of the above you need a pretty neat combination of a graph set and a nice indexing algorithm.
You could for example use a semantic database like sesame and when importing your documents import every token and document as a node. Then depending on position in the document etc you can add a weighted relation.
Then you need the tokens in some structure where you can do efficient fuzzy matches such as bk-trees.
I think you could index the tokens in a mysql database and do bit-wise comparision functions to get differences. Theres a function that returns all matching bits, if you translit your strings to ascii and group the bits you could achieve something pretty fast.
However if you matched the tokens to the string you can construct a hypothetical perfect match antity and query your semantic database for the nearest neighbours.
You would have to break the words apart into partial words when tokenizing to achieve partial matches.
However you can do also wildcard matches (prefix, suffix or both) but no fuzziness then.
You can also index the whole word or different concatenations of tokens.
However there may be special bk-tree implementations that support this but i have never seen one.

I did a number of iterations of a spelling corrector ages ago, and here's a recent description of the basic method. Basically the dictionary of correct words is in a trie, and the search is a simple branch-and-bound. I used repeated depth-first trie walk, bounded by lev. distance because, since each additional increment of distance results in much more of the trie being walked, the cost, for small distance, is basically exponential in the distance, so going to a combined depth/breadth search doesn't save much but makes it a lot more complicated.
(Aside: You'd be amazed how many ways physicians can try to spell "acetylsalicylic acid".)
I'm surprised at the size of your trie. A basic dictionary of acceptable words is maybe a few thousand. Then there are common prefixes and suffixes. Since the structure is a trie, you can connect together sub-tries and save a lot of space. Like the trie of basic prefixes can connect to the main dictionary, and then the terminal nodes of the main dictionary can connect to the trie of common suffixes (which can in fact contain cycles). In other words, the trie can be generalized into a finite state machine. That gives you a lot of flexibility.
REGARDLESS of all that, you have a performance problem. The nice thing about performance problems is, the worse they are, the easier they are to find. I've been a real pest on StackOverflow pointing this out. This link explains how to do it, links to a detailed example, and tries to dispel some popular myths. In a nutshell, the more time it is spending doing something that you could optimize, the more likely you will catch it doing that if you just pause it and take a look. My suspicion is that a lot of time is going into operations on overblown data structure, rather than just getting to the answer. That's a common situation, but don't fix anything until samples point you directly at the problem.

Data structure for search engine in JAVA?

I m MCS 2nd year student.I m doing a project in Java in which I have different images. For storing description of say IMAGE-1, I have ArrayList named IMAGE-1, similarly for IMAGE-2 ArrayList IMAGE-2 n so on.....
Now I need to develop a search engine, in which i need to find a all image's whose description matches with a word entered in search engine..........
FOR EX If i enter "computer" then I should be able to find all images whose description contain "computer".
So my question is...
How should i do this efficiently?
How should i maintain all those
ArrayList since i can have 100 of
such...? or should i use another
data structure instead of ArrayList?

A simple implementation is to tokenize the description and use a Map<String, Collection<Item>> to store all items for a token.
Building:
for(String token: tokenize(description)) map.get(token).add(item)
(A collection is needed as multiple entries could be found for a token. The initialization of the collection is missing in the code. But the idea should be clear.)
Use:
List<Item> result = map.get("Computer")
The the general purpose HashMap implementation is not the most efficient in this case. When you start getting memory problems you can look into a tree implementation that is more efficient (like radix trees - implementation).
The next step could be to use some (in-memory) database. These could be relational (HSQL) or not (Berkeley DB).

If you have a small number of images and short descriptions (< 1000 characters), load them into an array and search for words using String.indexOf() (i.e. one entry in the array == one complete image description). This is efficient enough for, say, less than 10'000 images.
Use toLowerCase() to fold the case of the characters (so users will find "Computer" when they type "computer"). String.indexOf() will also work for short words (using "comp" to find "Computer" or "compare").
If you have lots of images and long descriptions and/or you want to give your users some comforts for the search (like Google does), then use Lucene.

There is no simple, easy-to-use data structure that supports efficient fulltext search.
But do you actually need efficiency? Is this a desktop app or a web app? In the former case, don't worry about efficiency, a modern CPU can search through megabytes of text in fractions of a second - simply look through all your descriptions using String.contains() (or a regexp to allow more flexible searches).
If you really need efficiency (such as for a webapp where many people could do searches at the same time), look into Apache Lucene.
As for your ArrayLists, it seems strange to use one for the description of a single image. Why a list, what does the index represent? Lines? If so, and unless you actually need to access lines directly, replace the lists with a simple String - it can contain newline characters just fine.

I would suggest you to use the Hashtable class or to organize your content into a tree to optimize searching.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.