How can I get the size of Solr Facet results?

How can I get the size of Solr Facet results? - java

There is a multi-value field in my schema named XXX. And it may be more 10,0000 documents in my Solr, I want to get how many values exist in XXX without any duplication.
For now, I use facet.field=XXX&facet.limit=-1 to get the facet results size. It will spend a lot of time and sometimes occur Read Timeout.
What I want for the facet results is only the 'size', I don't care about the contents.
By the way, I use Solr 5.0, is there any other better solution to solve my requirement?

The index does maintain a list of unique terms, since that is how the inverted index works. It is also very very fast to compute and return, unlike faceting. If your values are single terms, then that could be a way of getting to what you want. There is a way to get unique terms, given that the TermsComponent is enabled in your solrconfig.xml. For example:
http://localhost:8983/solr/corename/terms?q=*%3A*&wt=json&indent=true&terms=true&terms.fl=XXX
Would return a list of all unique terms, and their counts:
{
"responseHeader":{
"status":0,
"QTime":0},
"terms":{
"XXX":[
"John Backus",3,
"Ada Lovelace",3,
"Charles Babbage",2,
"John Mauchly",1,
"Alan Turing",1
]
}
}
The length of this list is the amount of unique terms, in the example that would be 5. Unfortunately the API doesn't provide a way to just ask for the count, without returning the list of terms, so while it has speed advantage in generating the list, the amount of time required to return full list gives it a similar drawback to the facets approach. Also, the returned list may become quite long.
Check out https://wiki.apache.org/solr/TermsComponent for the API details.

Related

Apache Solr, Require multiple of a single field while optimizing query to only hit specific indexes

My data is partitioned inside solr so that when I send a request "+apple" (required apple) I only hit partition 'a' to search.
Because of this optimization I cannot easily use boolean logic that spans all my data.
Solr query: +fruit:bananna +fruit:apple
Result: there are no fruit with both fields so i get 0 results because I am searching the 'a' partition AND 'b' partition each with both required fields. In this case it is very unlikely that a record with the fruit field has two names, however this is a multi-valued field, so it is possible and I want those records to be at the top of my result from solr.
One way to get what I want would be to change the query to: fruit:bananna fruit:apple
However... this will sometimes return results that are neither apple nor bananna because solr marks both as optional thus I allow it to search all my indexes. For example:
fruit:bananna fruit:apple Country:Mexico
This might return oranges in Mexico... in which case I would rather get 0 results.
Also, doing two separate queries is not an option...does anyone know of a better way to get this 'REQUIRED OR' functionality with my partition optimization?
I am also open to other designs, i'm just looking for input.

Efficient Search Algorithm For Specific Data Fields

So I am actually been assigned to write algorithms on filtering/searching.
Task : Filter: search and list objects that fulfill specified attribute(s)
Say The whole system is a student registration record system.
I have data as shown below. I will need to filter and search by these attributes say search/filter by gender or student name or date of birth etc.
Student Name
, Gender
, Date Of Birth
, Mobile No
Is there specific efficient algorithm formula or method for each of these field.
Example , strings and integers each has their own type of efficient search algorithm right?
Here's what I am going to do.
I am going to code a binary search algorithm for searching/filtering based on these fields above.
That's it. But yeah that's easy to be honest.
But I am just curious like what's the proper and appropriate coding approach for a efficient search/filter algorithm for each of these fields will you guys do?
I will not be using sequential search algorithm obviously as this will involve huge data so I am not going to iterate each of these data to downgrade efficiency performance.
Sequential search algorithm will be used when needed if data is less.

Searching is a very broad topic and it completely depends upon your use case.
while building an efficient Searching algorithm you should take below factors into consideration
What's the size of your data? -is it fixed or it keeps varying
periodically?
How often you are going to Insert/modify/delete
your data?
Is your data sorted or unsorted?
Do you need a prefix based search like autosearch,autocomplete,longest prefix search etc?
Now let's think about the solution/approach
if your data is less and unsorted as you can try Linear
Search(which has O(n)time complexity where "n" is size of your
data/array)
if your data is already sorted which is not always the case you can
use Binary search as it's complexity is 0(log n). if your
data is not sorted then sorting the data again takes
(nlogn)~typically if you are using Java,Arrays.sort() by default uses Merge sort or Quick sort which is (nlogn).
if faster retrieval is the main object you can think of HashMaps or HashMaps. the elements of Hashmap are indexed by Hashcode, the
time to search for any element would almost be 1 or constant time(if
your hash function implementation is good)
Prefix based search :since you mentioned about searching by Names,you also have the option of using
"Tries" data structure.
Tries are excellent option if you are performing Insert/Delete/Update functionalities frequently .
Lookup of an elements in a Trie is 0(k) where "k" is the length of the string to be searched.
Since you have registration data where insert,update,deletion is common TRIES Data Structure is a good option to consider.
Also,check this link to choose between Tries and HashTables TriesVsMaps
Below is the sample representation of Tries(img src:Hackerearth)

Finding number of unique terms over multiple fields

I need to find number (or list) of unique terms over a combination of two or more fields in Lucene-Java. I am using Java libraries for Lucene 4.1.0. I checked questions such as this and this, but they discuss finding list of unique terms from a single (specific) field, or over all the fields (no subset).
For example, I am interested in number(unique(height, gender)) rather than number(unique(height)), or number(unique(gender)).
Given the data:
height,gender
1,M
2,F
3,M
3,F
4,M
4,F
number(unique(height)) is 4, number(unique(gender)) is 2 and number(unique(gender,height)) is 6.
Any help will be greatly appreciated.
Thanks!

If you have predefined multiple fields then the simplest and quickest (in search terms) would be to index a combined field, i.e. heightGender (1.23:male). You can then just count the unique terms in this field, however this doesn't offer any flexibility at search time.
A more flexible approach would be to use facets (https://lucene.apache.org/core/4_1_0/facet/index.html). You would then constrain you query to each value of one field (e.g. Gender (male/female)) and retrieve all the values (and document counts) of the other field.
However if you do not have the ability to change the indexing process then you are left with doing a brute force search using Boolean queries to find the number of documents in the index for all combinations of the field values in which you are interested. I presume you are only counting combinations where the number of documents is non-zero.
It is worth noting that this question is exactly what Solr Pivot Facets address (http://lucidworks.com/blog/pivot-facets-inside-and-out/)

In Lucene, can I search one index but use IDF from another one?

I'm building a system where I want to show only results indexed in the past few days.
Furthermore, I don't want to maintain a giant index with a million documents if I only want to return results from a couple of days (thousands of documents).
On the other hand, my system heavily relies that the occurrences of terms in documents stored in the index have a realistic distribution (consequently: realistic IDF).
That said, I would like to use a small index to return results, but I want to compute documents score using a IDF from a much greater Index (or even an external source).
The Similarity API doesn't seem to allow me to do this. The idf method does not receive as parameter the term being used.
Another possibility is to use TrieRangeQuery to make sure the documents shown are within the last couple of days. Again, I rather not mantain a larger index. Also this kind of query is not cheap.

You should be able to extend IndexReader and override the docFreq() methods to provide whatever values you'd like. One thing this implementation can do is open two IndexReader instances -- one for the small index and one for the large index. All the methods are delegated to the small IndexReader, except for docFreq(), which is delegated to the large index. You'll need to scale the value returned, i.e.
int myNewDocFreq = bigIndexReader.docFreq(t) / bigIndexReader.maxDoc() * smallIndexReader.maxDoc()

Get a value from hashtable by a part of its key

Say I have a Hashtable<String, Object> with such keys and values:
apple => 1
orange => 2
mossberg => 3
I can use the standard get method to get 1 by "apple", but what I want is getting the same value (or a list of values) by a part of the key, for example "ppl". Of course it may yield several results, in this case I want to be able to process each key-value pair. So basically similar to the LIKE '%ppl%' SQL statement, but I don't want to use a (in-memory) database just because I don't want to add unnecessary complexity. What would you recommend?
Update:
Storing data in a Hashtable isn't a requirement. I'm seeking for a kind of a general approach to solve this.

The obvious brute-force approach would be to iterate through the keys in the map and match them against the char sequence. That could be fine for a small map, but of course it does not scale.
This could be improved by using a second map to cache search results. Whenever you collect a list of keys matching a given char sequence, you can store these in the second map so that next time the lookup is fast. Of course, if the original map is changed often, it may get complicated to update the cache. As always with caches, it works best if the map is read much more often than changed.
Alternatively, if you know the possible char sequences in advance, you could pre-generate the lists of matching strings and pre-fill your cache map.
Update: Hashtable is not recommended anyway - it is synchronized, thus much slower than it should be. You are better off using HashMap if no concurrency is involved, or ConcurrentHashMap otherwise. Latter outperforms a Hashtable by far.
Apart from that, out of the top of my head I can't think of a better collection to this task than maps. Of course, you may experiment with different map implementations, to find the one which suits best your specific circumstances and usage patterns. In general, it would thus be
Map<String, Object> fruits;
Map<String, List<String>> matchingKeys;

Not without iterating through explicitly. Hashtable is designed to go (exact) key->value in O(1), nothing more, nothing less. If you will be doing query operations with large amounts of data, I recommend you do consider a database. You can use an embedded system like SQLite (see SQLiteJDBC) so no separate process or installation is required. You then have the option of database indexes.
I know of no standard Java collection that can do this type of operation efficiently.

Sounds like you need a trie with references to your data. A trie stores strings and lets you search for strings by prefix. I don't know the Java standard library too well and I have no idea whether it provides an implementation, but one is available here:
http://www.cs.duke.edu/~ola/courses/cps108/fall96/joggle/trie/Trie.java
Unfortunately, a trie only lets you search by prefixes. You can work around this by storing every possible suffix of each of your keys:
For 'apple', you'd store the strings
'apple'
'pple'
'ple'
'le'
'e'
Which would allow you to search for every prefix of every suffix of your keys.
Admittedly, this is the kind of "solution" that would prompt me to continue looking for other options.

first of all, use hashmap, not hashtable.
Then, you can filter the map using a predicate by using utilities in google guava
public Collection<Object> getValues(){
Map<String,Object> filtered = Maps.filterKeys(map,new Predicate<String>(){
//predicate methods
});
return filtered.values();
}

Can't be done in a single operation
You may want to try to iterate the keys and use the ones that contain your desired string.

The only solution I can see (I'm not Java expert) is to iterate over the keys and check for matching against a regular expression. If it matches, you put the matched key-value pair in the hashtable that will be returned.

If you can somehow reduce the problem to searching by prefix, you might find a NavigableMap helpful.

it will be interesting to you to look throw these question: Fuzzy string search library in Java
Also take a look on Lucene (answer number two)

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.