Finding number of unique terms over multiple fields

Finding number of unique terms over multiple fields - java

I need to find number (or list) of unique terms over a combination of two or more fields in Lucene-Java. I am using Java libraries for Lucene 4.1.0. I checked questions such as this and this, but they discuss finding list of unique terms from a single (specific) field, or over all the fields (no subset).
For example, I am interested in number(unique(height, gender)) rather than number(unique(height)), or number(unique(gender)).
Given the data:
height,gender
1,M
2,F
3,M
3,F
4,M
4,F
number(unique(height)) is 4, number(unique(gender)) is 2 and number(unique(gender,height)) is 6.
Any help will be greatly appreciated.
Thanks!

If you have predefined multiple fields then the simplest and quickest (in search terms) would be to index a combined field, i.e. heightGender (1.23:male). You can then just count the unique terms in this field, however this doesn't offer any flexibility at search time.
A more flexible approach would be to use facets (https://lucene.apache.org/core/4_1_0/facet/index.html). You would then constrain you query to each value of one field (e.g. Gender (male/female)) and retrieve all the values (and document counts) of the other field.
However if you do not have the ability to change the indexing process then you are left with doing a brute force search using Boolean queries to find the number of documents in the index for all combinations of the field values in which you are interested. I presume you are only counting combinations where the number of documents is non-zero.
It is worth noting that this question is exactly what Solr Pivot Facets address (http://lucidworks.com/blog/pivot-facets-inside-and-out/)

Related

Efficient Search Algorithm For Specific Data Fields

So I am actually been assigned to write algorithms on filtering/searching.
Task : Filter: search and list objects that fulfill specified attribute(s)
Say The whole system is a student registration record system.
I have data as shown below. I will need to filter and search by these attributes say search/filter by gender or student name or date of birth etc.
Student Name
, Gender
, Date Of Birth
, Mobile No
Is there specific efficient algorithm formula or method for each of these field.
Example , strings and integers each has their own type of efficient search algorithm right?
Here's what I am going to do.
I am going to code a binary search algorithm for searching/filtering based on these fields above.
That's it. But yeah that's easy to be honest.
But I am just curious like what's the proper and appropriate coding approach for a efficient search/filter algorithm for each of these fields will you guys do?
I will not be using sequential search algorithm obviously as this will involve huge data so I am not going to iterate each of these data to downgrade efficiency performance.
Sequential search algorithm will be used when needed if data is less.

Searching is a very broad topic and it completely depends upon your use case.
while building an efficient Searching algorithm you should take below factors into consideration
What's the size of your data? -is it fixed or it keeps varying
periodically?
How often you are going to Insert/modify/delete
your data?
Is your data sorted or unsorted?
Do you need a prefix based search like autosearch,autocomplete,longest prefix search etc?
Now let's think about the solution/approach
if your data is less and unsorted as you can try Linear
Search(which has O(n)time complexity where "n" is size of your
data/array)
if your data is already sorted which is not always the case you can
use Binary search as it's complexity is 0(log n). if your
data is not sorted then sorting the data again takes
(nlogn)~typically if you are using Java,Arrays.sort() by default uses Merge sort or Quick sort which is (nlogn).
if faster retrieval is the main object you can think of HashMaps or HashMaps. the elements of Hashmap are indexed by Hashcode, the
time to search for any element would almost be 1 or constant time(if
your hash function implementation is good)
Prefix based search :since you mentioned about searching by Names,you also have the option of using
"Tries" data structure.
Tries are excellent option if you are performing Insert/Delete/Update functionalities frequently .
Lookup of an elements in a Trie is 0(k) where "k" is the length of the string to be searched.
Since you have registration data where insert,update,deletion is common TRIES Data Structure is a good option to consider.
Also,check this link to choose between Tries and HashTables TriesVsMaps
Below is the sample representation of Tries(img src:Hackerearth)

How can I get the size of Solr Facet results?

There is a multi-value field in my schema named XXX. And it may be more 10,0000 documents in my Solr, I want to get how many values exist in XXX without any duplication.
For now, I use facet.field=XXX&facet.limit=-1 to get the facet results size. It will spend a lot of time and sometimes occur Read Timeout.
What I want for the facet results is only the 'size', I don't care about the contents.
By the way, I use Solr 5.0, is there any other better solution to solve my requirement?

The index does maintain a list of unique terms, since that is how the inverted index works. It is also very very fast to compute and return, unlike faceting. If your values are single terms, then that could be a way of getting to what you want. There is a way to get unique terms, given that the TermsComponent is enabled in your solrconfig.xml. For example:
http://localhost:8983/solr/corename/terms?q=*%3A*&wt=json&indent=true&terms=true&terms.fl=XXX
Would return a list of all unique terms, and their counts:
{
"responseHeader":{
"status":0,
"QTime":0},
"terms":{
"XXX":[
"John Backus",3,
"Ada Lovelace",3,
"Charles Babbage",2,
"John Mauchly",1,
"Alan Turing",1
]
}
}
The length of this list is the amount of unique terms, in the example that would be 5. Unfortunately the API doesn't provide a way to just ask for the count, without returning the list of terms, so while it has speed advantage in generating the list, the amount of time required to return full list gives it a similar drawback to the facets approach. Also, the returned list may become quite long.
Check out https://wiki.apache.org/solr/TermsComponent for the API details.

Improving the speed of Solr query over 16 million tweets

I use Solr (SolrCloud) to index and search my tweets. There are about 16 million tweets and the index size is approximately 3 GB. The tweets are indexed in real time as they come so that real time search is enabled. Currently I use lowercase field type for my tweet body field. For a single search term in the search, it is taking around 7 seconds and with addition of each search term, time taken for search is linearly increasing. 3GB is the maximum RAM allocated for the solr process. Sample solr search query looks like this
tweet_body:*big* AND tweet_body:*data* AND tweet_tag:big_data
Any suggestions on improving the speed of searching? Currently I run only 1 shard which contains the entire tweet collection.

The query tweet_body:*big* can be expected to perform poorly. Trailing wildcards are easy, Leading Wildcards can be readily handled with a ReversedWildcardFilterFactory. Both, however, will have to scan every document, rather than being able to utilize the index to locate matching documents. Combining the two approaches would only allow you to search:
tweet_body:*big tweet_body:big*
Which is not the same thing. If you really must search for terms with a leading AND trailing wildcard, I would recommend looking into indexing your data as N-grams.
I wasn't previously aware of it, but it seems the lowercase field type is a Lowercase filtered KeywordAnalyzer. This is not what you want. That means the entire field is treated as a single token. Good for identification numbers and the like, but not for a body of text you wish to perform a full text search on.
So yes, you need to change it. text_general is probably appropriate. That will index a correctly tokenized field, and you should be able to performt he query you are looking for with:
tweet_body:big AND tweet_body:data AND tweet_tag:big_data
You will have to reindex, but there is no avoiding that. There is no good, performant way to perform a full text search on a keyword field.

Try using filter queries,as filter queries runs in parallel

Is this possible to develop some criteria based search on the Strings in C# or JAVA?

I have one List in C#.This String array contains elements of Paragraph that are read from the Ms-Word File.for example,
list 0-> The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there rae the basic text elements, the primary building blocks for your document. Next up is the table at the bottom of the report which will be discussed in full, including the handy styling effects such as row-banding. Finally the image displayed in the header will be added to finalize the report.
list 1->The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there rae the basic text elements, the primary building blocks for your document. Various other elements of WordprocessingML will also be handled. By moving the formatting information into styles a higher degree of re-use is made possible. The document will be marked using custom XML tags and the insertion of other advanced elements such as a table of contents is discussed. But before all the advanced features can be added, the base of the document needs to be built.
Some thing like that.
Now My search String is :
The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there rae the basic text elements, the primary building blocks for your document. Next up is the table at the bottom of the report which will be discussed in full, including the handy styling effects such as row-banding. Before going over all the elements which make up the sample documents a basic document structure needs to be laid out. When you take a WordprocessingML document and use the Windows Explorer shell to rename the docx extension to zip you will find many different elements, especially in larger documents.
I want to check my search String with that list elements.
my criteria is "If each list element contains 85% match or exact match of search string then we want to retrieve that list elements.
In our case,
list 0 -> more satisfies my search string.
list 1 -it also matches some text,but i think below not equal to my criteria...
How i do this kind of criteria based search on String...?
I have more confusion on my problem also
Welcome your ideas and thoughts...

The keyword is DISTANCE or "string distance". and also, "Paragraph similarity"
You seek to implement a function which would express as a scalar, say a percentage as suggested in the question, indicative of how similar a string is from another string.
Plain string distance functions such as hamming or Levenstein may not be appropriate, for they work at character level rather than at word level, but generally these algorithms convey the idea of what is needed.
Working at word level you'll probably also want to take into account some common NLP features, for example ignore (or give less weight to) very common words (such as 'the', 'in', 'of' etc.) and maybe allow for some forms of stemming. The order of the words, or for the least their proximity may also be of import.
One key factor to remember is that even with relatively short strings, many distances functions can be quite expensive, computationally speaking. Before selecting one particular algorithm you'll need to get an idea of the general parameters of the problem:
how many strings would have to be compared? (on average, maximum)
how many words/token do the string contain? (on average, max)
Is it possible to introduce a simple (quick) filter to reduce the number of strings to be compared ?
how fancy do we need to get with linguistic features ?
is it possible to pre-process the strings ?
Are all the records in a single language ?
Comparing Methods for Single Paragraph Similarity Analysis, a scholarly paper provides a survey of relevant techniques and considerations.
In a nutshell, the the amount of design-time and run-time one can apply this relatively open problem varies greatly and is typically a compromise between the level of precision desired vs. the run-time resources and the overall complexity of the solution which may be acceptable.
In its simplest form, when the order of the words matters little, computing the sum of factors based on the TF-IDF values of the words which match may be a very acceptable solution.
Fancier solutions may introduce a pipeline of processes borrowed from NLP, for example Part-of-Speech Tagging (say for the purpose of avoiding false positive such as "SAW" as a noun (to cut wood), and "SAW" as the past tense of the verb "to see". or more likely to filter outright some of the words based on their grammatical function), stemming and possibly semantic substitutions, concept extraction or latent semantic analysis.

You may want to look into lucene for Java or lucene.net for c#. I don't think it'll do the percentage requirement you want out of the box, but it's a great tool for doing text matching.
You maybe could run a separate query for each word, and then work out the percentage yourself of ones that matched.

Here's an idea (and not a solution by any means but something to get started with)
private IEnumerable<string> SearchList = GetAllItems(); // load your list
void Search(string searchPara)
{
char[] delimiters = new char[]{' ','.',','};
var wordsInSearchPara = searchPara.Split(delimiters, StringSplitOptions.RemoveEmptyEntries).Select(a=>a.ToLower()).OrderBy(a => a);
foreach (var item in SearchList)
{
var wordsInItem = item.Split(delimiters, StringSplitOptions.RemoveEmptyEntries).Select(a => a.ToLower()).OrderBy(a => a);
var common = wordsInItem.Intersect(wordsInSearchPara);
// now that you know the common items, you can get the differential
}
}

In Lucene, can I search one index but use IDF from another one?

I'm building a system where I want to show only results indexed in the past few days.
Furthermore, I don't want to maintain a giant index with a million documents if I only want to return results from a couple of days (thousands of documents).
On the other hand, my system heavily relies that the occurrences of terms in documents stored in the index have a realistic distribution (consequently: realistic IDF).
That said, I would like to use a small index to return results, but I want to compute documents score using a IDF from a much greater Index (or even an external source).
The Similarity API doesn't seem to allow me to do this. The idf method does not receive as parameter the term being used.
Another possibility is to use TrieRangeQuery to make sure the documents shown are within the last couple of days. Again, I rather not mantain a larger index. Also this kind of query is not cheap.

You should be able to extend IndexReader and override the docFreq() methods to provide whatever values you'd like. One thing this implementation can do is open two IndexReader instances -- one for the small index and one for the large index. All the methods are delegated to the small IndexReader, except for docFreq(), which is delegated to the large index. You'll need to scale the value returned, i.e.
int myNewDocFreq = bigIndexReader.docFreq(t) / bigIndexReader.maxDoc() * smallIndexReader.maxDoc()

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.