Faceting using SolrJ and Solr4

I've gone through the related questions on this site but haven't found a relevant solution.
When querying my Solr4 index using an HTTP request of the form
&facet=true&facet.field=country
The response contains all the different countries along with counts per country.
How can I get this information using SolrJ?
I have tried the following but it only returns total counts across all countries, not per country:
solrQuery.setFacet(true);
solrQuery.addFacetField("country");
The following does seem to work, but I do not want to have to explicitly set all the groupings beforehand:
solrQuery.addFacetQuery("country:usa");
solrQuery.addFacetQuery("country:canada");
Secondly, I'm not sure how to extract the facet data from the QueryResponse object.
So two questions:
1) Using SolrJ how can I facet on a field and return the groupings without explicitly specifying the groups?
2) Using SolrJ how can I extract the facet data from the QueryResponse object?
Thanks.
Update:
I also tried something similar to Sergey's response (below).
List<FacetField> ffList = resp.getFacetFields();
log.info("size of ffList:" + ffList.size());
for(FacetField ff : ffList){
String ffname = ff.getName();
int ffcount = ff.getValueCount();
log.info("ffname:" + ffname + "|ffcount:" + ffcount);
}
The above code shows ffList with size=1 and the loop goes through 1 iteration. In the output ffname="country" and ffcount is the total number of rows that match the original query.
There is no per-country breakdown here.
I should mention that on the same solrQuery object I am also calling addField and addFilterQuery. Not sure if this impacts faceting:
solrQuery.addField("user-name");
solrQuery.addField("user-bio");
solrQuery.addField("country");
solrQuery.addFilterQuery("user-bio:" + "(Apple OR Google OR Facebook)");
Update 2:
I think I got it, again based on what Sergey said below. I extracted the List object using FacetField.getValues().
List<FacetField> fflist = resp.getFacetFields();
for(FacetField ff : fflist){
String ffname = ff.getName();
int ffcount = ff.getValueCount();
List<Count> counts = ff.getValues();
for(Count c : counts){
String facetLabel = c.getName();
long facetCount = c.getCount();
}
}
In the above code, facetLabel holds the name of each facet group and facetCount holds the corresponding count for that group.

Actually, you only need to add the facet field; faceting is then activated automatically (check the SolrJ source code):
solrQuery.addFacetField("country");
Where did you look for the facet information? It should be in QueryResponse.getFacetFields(): call getValues() on each FacetField, then getCount() on each value.

In the Solr response, use QueryResponse.getFacetFields() to get the list of FacetField objects, one of which is "country". Since "country" is the only facet field here, it is identified by QueryResponse.getFacetFields().get(0).
You then iterate over it to get the Count objects using
QueryResponse.getFacetFields().get(0).getValues().get(i)
and get the name of each facet value using QueryResponse.getFacetFields().get(0).getValues().get(i).getName()
and the corresponding count using
QueryResponse.getFacetFields().get(0).getValues().get(i).getCount()

Related

ElasticSearch Java API to query and return single column instead of the whole json document

While searching with the Java API in Elasticsearch, I would like to retrieve only one column.
Currently when I query using the Java API it returns the whole record like this: [{_id=123-456-7890, name=Wonder Woman, gender=FEMALE}, {_id=777-990-7890, name=Cat Woman, gender=FEMALE}]
The records above correctly match the search condition, as shown in the code below:
List<Map<String, Object>> result = new ArrayList<Map<String, Object>>();
SearchRequestBuilder srb = client.prepareSearch("heros")
.setSearchType(SearchType.DFS_QUERY_THEN_FETCH);
MatchQueryBuilder mqb;
mqb = QueryBuilders.matchQuery("name", "Woman");
srb.setQuery(mqb);
SearchResponse response = srb.execute().actionGet();
long totalHitCount = response.getHits().getTotalHits();
System.out.println(response.getHits().getTotalHits());
for (SearchHit hit : response.getHits()) {
result.add(hit.getSource());
}
System.out.println(result);
I want only one column to be returned. If I search for name I just want the full names back in a list: "Wonder Woman", "Cat Woman" only not the whole json record for each of them. If you think I need to iterate over the result list of maps in java please propose an example of how to do that in this case.

You can specify the fields to be returned from a search, per documentation. This can be set via SearchRequestBuilder.addFields(String... fields), ie:
SearchRequestBuilder srb = client.prepareSearch("heros")
.setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
.addFields("name");

Better to combine both:
use .addFields("name") to tell ES that it needs to return only this column
use hit.field("name").getValue().toString() to get the result
It is important to use .addFields when you need only specific fields rather than the whole document, as it lowers the overhead and the network traffic.

I figured it out.
List<String> valuesList= new ArrayList<String>();
for (SearchHit hit : response.getHits()) {
result.add(hit.getSource());
valuesList.add(hit.getSource().get("name").toString());
}
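The same loop can be pulled into a small helper that works on the `result` list of source maps and tolerates hits without a source. This is only a sketch using plain collections; `FieldExtractor` and `extractField` are hypothetical names, not part of the Elasticsearch API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FieldExtractor {
    // Collects the string form of one field from each source map,
    // skipping entries that are null or do not contain the field.
    static List<String> extractField(List<Map<String, Object>> sources, String field) {
        List<String> values = new ArrayList<>();
        for (Map<String, Object> source : sources) {
            if (source == null) {
                continue;
            }
            Object value = source.get(field);
            if (value != null) {
                values.add(value.toString());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Stand-ins for what hit.getSource() returned in the question.
        Map<String, Object> first = new LinkedHashMap<>();
        first.put("_id", "123-456-7890");
        first.put("name", "Wonder Woman");
        first.put("gender", "FEMALE");
        Map<String, Object> second = new LinkedHashMap<>();
        second.put("_id", "777-990-7890");
        second.put("name", "Cat Woman");
        second.put("gender", "FEMALE");

        System.out.println(extractField(Arrays.asList(first, second), "name"));
        // prints [Wonder Woman, Cat Woman]
    }
}
```

Note that this still transfers the whole document from the server; it only narrows the result on the client side.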

The other solutions didn't work for me, hit.getSource() was returning null. Maybe they are deprecated? Not sure. But here was my solution, which FYI can speed things up considerably if you are only getting one field and you are getting lots of results.
Use addFields(Strings) on your SearchRequestBuilder as mentioned, but then when you are getting the values you need to use:
hit.getFields().get( fieldName ).getValue()
or
hit.getFields().get( fieldName ).getValues()
to get a single value or a list of values depending on the field.

How to retrieve the Field that "hit" in Lucene

Maybe I'm really missing something.
I have indexed a bunch of key/value pairs in Lucene (v4.1 if it matters). Say I have
key1=value1 and key2=value2, e.g. as read from a properties file.
They get indexed both as specific fields and into a catchall "ALL" field, e.g.
new Field("key1", "value1", aFieldTypeMimickingKeywords);
new Field("key2", "value2", aFieldTypeMimickingKeywords);
new Field("ALL", "key1=value1", aFieldTypeMimickingKeywords);
new Field("ALL", "key2=value2", aFieldTypeMimickingKeywords);
// then get added to the Document of course...
I can then do a wildcard search, using
new WildcardQuery(new Term("ALL", "*alue1"));
and it will find the hit.
But, it would be nice to get more info, like "what was complete value (e.g. "key1=value1") that goes with that hit?".
The best I can figure out is to get the Document, then get the list of IndexableFields, then loop over all of them and see if field.stringValue().contains("alue1"). (I can look at the data structures in the debugger and all the info is there.)
This seems completely insane, because isn't that what Lucene just did? Shouldn't the hit information return some of the Fields?
Is Lucene missing what seems like "obvious" functionality? Google and staring at the APIs haven't revealed anything straightforward, but I feel like I must be searching for the wrong stuff.

You might want to try with IndexSearcher.explain() method. Once you get the ID of the matching document, prepare a query for each field (using the same search keywords) and invoke Explanation.isMatch() for each query: the ones that yield true will give you the matched field. Example:
for (String field: fields){
Query query = new WildcardQuery(new Term(field, "*alue1"));
Explanation ex = searcher.explain(query, docID);
if (ex.isMatch()){
//Your query matched field
}
}
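For the original goal of recovering the complete value (e.g. "key1=value1") behind a hit, the client-side scan described in the question can be sketched without any Lucene classes. The sketch assumes the stored values of the "ALL" field have already been read from the hit document as a String array (for example via Document.getValues("ALL")); `WildcardValueMatcher` and its methods are hypothetical names, and the wildcard handling is a simplified stand-in for WildcardQuery, not Lucene's own matcher.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class WildcardValueMatcher {
    // Translates a wildcard pattern into a regex:
    // '*' matches any run of characters, '?' matches exactly one character.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    // Returns the stored values that match the wildcard as a whole,
    // mirroring how a keyword-style field is matched term by term.
    static List<String> matchingValues(String[] storedValues, String wildcard) {
        Pattern pattern = toRegex(wildcard);
        List<String> matches = new ArrayList<>();
        for (String value : storedValues) {
            if (pattern.matcher(value).matches()) {
                matches.add(value);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String[] all = {"key1=value1", "key2=value2"};
        System.out.println(matchingValues(all, "*alue1")); // prints [key1=value1]
    }
}
```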

MongoDB can't find by national regex or query

Looks like I'm missing something important.
I have some records in MongoDB that contain fields with national characters. There is no problem inserting them into the DB or listing them, and all values look correct.
But if I try to find a particular one with "find()" or "regex()", nothing is returned. For example:
DBObject query = new BasicDBObject();
query.put("position", Pattern.compile(".*forsøg.*"));
--or--
query.put("position","forsøg");
System.out.println(collection.find(query).count()); // prints 0
In the log:
query={ "position" : { "$regex" : ".*������.*"}}
--or---
query={ "position" : "������"}
The field value for "position" is "forsøg", of course. Pattern.matches(".*forsøg.*", "forsøg") returns true.
If I replace the pattern with one containing only ASCII characters (".abc." for example), all methods work as expected. Collection.findAll() returns all saved instances with readable and correct values.
Versions: MongoDB 2.0.6 64bit, mongo-java-driver 2.8.0, Java 7. I tried the same with spring-data-mongodb 1.0.2.RELEASE but removed it.

Looks like I hit a strange bug related to Maven + TestNG. The same code executed from the .war and from the test suite produces totally different results in the database.
The difference is easy to see if you point your browser to
http://127.0.0.1:28017/baseName/collectionName/
and look at the values after each execution.
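The garbled characters in the logged query would be consistent with the string literal being damaged before it ever reaches the driver, for instance if the test build and the .war build read the source file with different encodings. That is an assumption; the post does not confirm the cause. The sketch below only illustrates how such a mismatch changes the string, and how writing the literal with a Unicode escape makes it independent of the source encoding. `EncodingMismatchDemo` and `misread` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {
    // The search word written with a Unicode escape (\u00f8 is the o-with-stroke),
    // so the compiled constant does not depend on the source-file encoding.
    static final String WORD = "fors\u00f8g";

    // Simulates reading UTF-8 bytes with the wrong charset,
    // which is what a mismatched compiler encoding effectively does.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String damaged = misread(WORD);
        System.out.println(WORD.equals(damaged)); // prints false
        System.out.println(WORD.length());        // prints 6
        System.out.println(damaged.length());     // prints 7
    }
}
```

A query built from the damaged string can never equal the stored value, while pure-ASCII patterns survive unchanged, which matches the behaviour described in the question. If the build is the culprit, checking that the Maven project declares a source encoding (the project.build.sourceEncoding property) would be a reasonable first step.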

hbase: querying for specific value with dynamically created qualifier

Hi,
Hbase allows a column family to have different qualifiers in different rows. In my case a column family has the following specification
abc[cnt] # where cnt is an integer that can be any positive integer
What I want to achieve is to get all the data from a different column family, but only if the value of the described qualifier (in the first column family) matches.
To narrow the Scan down, I just add the two families I need for the query, but that is as far as I have got for now.
I already achieved the same behaviour with a SingleColumnValueFilter, but then the qualifier was known in advance. For this one the qualifier can be abc1, abc2 ... so there would be too many options, and thus too many SingleColumnValueFilters.
Then I tried using the ValueFilter, but this filter only returns those columns that match the value, and thus the wrong column family.
Can you think of any way to achieve my goal, querying for a value within a dynamically created qualifier in a column family and returning the contents of the column family and another column family (as specified when creating the Scan)? preferably only querying once.
Thanks in advance for any input.
UPDATE: (for clarification as discussed in the comments)
in a more graphical way, a row may have the following:
colfam1:aaa
colfam1:aab
colfam1:aac
colfam2:abc1
colfam2:abc2
whereas I want to get all of the family colfam1 if any value of colfam2 has e.g. the value x, with regard to the fact that colfam2:abc[cnt] is dynamically created with cnt being any positive integer

I see two approaches for this: client-side filtering or server-side filtering.
Client-side filtering is more straightforward. The Scan adds only the two families "colfam1" and "colfam2". Then, for each Result you get from scanner.next(), you must filter according to the qualifiers in "colfam2".
byte[] queryValue = Bytes.toBytes("x");
Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("colfam1"));
scan.addFamily(Bytes.toBytes("colfam2"));
ResultScanner scanner = myTable.getScanner(scan);
Result res;
while((res = scanner.next()) != null) {
NavigableMap<byte[],byte[]> colfam2 = res.getFamilyMap(Bytes.toBytes("colfam2"));
boolean foundQueryValue = false;
SearchForQueryValue: while(!colfam2.isEmpty()) {
Entry<byte[], byte[]> cell = colfam2.pollFirstEntry();
if( Bytes.equals(cell.getValue(), queryValue) ) {
foundQueryValue = true;
break SearchForQueryValue;
}
}
if(foundQueryValue) {
NavigableMap<byte[],byte[]> colfam1 = res.getFamilyMap(Bytes.toBytes("colfam1"));
LinkedList<KeyValue> listKV = new LinkedList<KeyValue>();
while(!colfam1.isEmpty()) {
Entry<byte[], byte[]> cell = colfam1.pollFirstEntry();
listKV.add(new KeyValue(res.getRow(), Bytes.toBytes("colfam1"), cell.getKey(), cell.getValue()));
}
Result filteredResult = new Result(listKV);
}
}
(This code was not tested)
And then finally filteredResult is what you want. This approach is not elegant and might also give you performance issues if you have a lot of data in those families. If "colfam1" has a lot of data, you don't want to transfer it to the client if it will end up not being used if value "x" is not in a qualifier of "colfam2".
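One detail of the client-side loop: pollFirstEntry() empties the family map as it walks it. A non-destructive variant of the "does any qualifier in this family hold the value" check can be sketched with plain collections. `FamilyValueCheck`, `familyContainsValue`, and `BYTE_ORDER` are hypothetical names; the comparator only stands in for HBase's own byte-array ordering so that the example runs without HBase on the classpath.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

class FamilyValueCheck {
    // Unsigned lexicographic ordering of byte arrays (illustrative stand-in).
    static final Comparator<byte[]> BYTE_ORDER = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    };

    // True if any qualifier in the family map holds exactly queryValue.
    // Iterates over the values without removing entries, so the map
    // remains usable afterwards. A missing family (null) never matches.
    static boolean familyContainsValue(NavigableMap<byte[], byte[]> family, byte[] queryValue) {
        if (family == null) {
            return false;
        }
        for (byte[] value : family.values()) {
            if (Arrays.equals(value, queryValue)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        NavigableMap<byte[], byte[]> colfam2 = new TreeMap<>(BYTE_ORDER);
        colfam2.put("abc1".getBytes(StandardCharsets.UTF_8), "y".getBytes(StandardCharsets.UTF_8));
        colfam2.put("abc2".getBytes(StandardCharsets.UTF_8), "x".getBytes(StandardCharsets.UTF_8));

        byte[] queryValue = "x".getBytes(StandardCharsets.UTF_8);
        System.out.println(familyContainsValue(colfam2, queryValue)); // prints true
        System.out.println(colfam2.size());                           // prints 2
    }
}
```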
Server-side filtering. This requires you to implement your own Filter class. I believe you cannot use the provided filter types to do this. Implementing your own Filter takes some work, you also need to compile it as a .jar and make it available to all RegionServers. But then, it helps you to avoid sending loads of data of "colfam1" in vain.
It is too much work for me to show you how to custom implement a Filter, so I recommend reading a good book (HBase: The Definitive Guide for example). However, the Filter code will look pretty much like the client-side filtering I showed you, so that's half of the work done.

Why won't this sort in Solr work?

I need to sort on a date field named "mod_date".
It works like this in the browser address bar:
http://localhost:8983/solr/select/?&q=bmw&sort=mod_date+desc
But I am using a PHP Solr client which sends a URL to Solr, and the URL sent is this:
fq=+category%3A%22Bilar%22+%2B+car_action%3AS%C3%A4ljes&version=1.2&wt=json&json.nl=map&q=%2A%3A%2A&start=0&rows=5&sort=mod_date+desc
// This won't work; it is echoed right after this code in PHP:
$queryString = http_build_query($params, null, $this->_queryStringDelimiter);
$queryString = preg_replace('/%5B(?:[0-9]|[1-9][0-9]+)%5D=/', '=', $queryString);
This won't work, and I don't know why!
Everything else works fine, all right fields are returned. But the sort doesn't work.
Any ideas?
Thanks
BTW: The field "mod_date" contains something like:
2010-03-04T19:37:22.5Z
EDIT:
First I use PHP to send this to a SolrPhpClient which is another php-file called service.php:
require_once('../SolrPhpClient/Apache/Solr/Service.php');
$solr = new Apache_Solr_Service('localhost', 8983, '/solr/');
$results = $solr->search($querystring, $p, $limit, $solr_params);
$solr_params is an array which contains the solr-parameters (q, fq, etc).
Now, in service.php:
$params['version'] = self::SOLR_VERSION;
// common parameters in this interface
$params['wt'] = self::SOLR_WRITER;
$params['json.nl'] = $this->_namedListTreatment;
$params['q'] = $query;
$params['sort'] = 'mod_date desc'; // HERE IS THE SORT I HAVE PROBLEM WITH
$params['start'] = $offset;
$params['rows'] = $limit;
$queryString = http_build_query($params, null, $this->_queryStringDelimiter);
$queryString = preg_replace('/%5B(?:[0-9]|[1-9][0-9]+)%5D=/', '=', $queryString);
if ($method == self::METHOD_GET)
{
return $this->_sendRawGet($this->_searchUrl . $this->_queryDelimiter . $queryString);
}
else if ($method == self::METHOD_POST)
{
return $this->_sendRawPost($this->_searchUrl, $queryString, FALSE, 'application/x-www-form-urlencoded');
}
The $results contain the results from Solr...
So this is the way I need to get to work (via php).
The URL below (also at the top of this question) works, but only because I paste it into the address bar manually rather than sending it via the PHP client. That is just for debugging; I need it to work via the PHP client:
http://localhost:8983/solr/select/?&q=bmw&sort=mod_date+desc // Not via the PHP client, but works
UPDATE (2010-03-08):
I have tried Donovan's code (the URLs) and it works fine.
Now I have noticed that one of the parameters is causing the sort not to work.
This parameter is the "wt" parameter. If we take the URL from the top of this question (fq=+category%3A%22Bilar%22+%2B+car_action%3AS%C3%A4ljes&version=1.2&wt=json&json.nl=map&q=%2A%3A%2A&start=0&rows=5&sort=mod_date+desc) and simply remove the "wt" parameter, then the sort works.
BUT the results then come back in a different format, which I believe makes my PHP code unable to recognize them. Donovan would know this, I think. I am guessing that for the PHP client to work, the results must be in a specific structure, which gets messed up as soon as I remove the wt parameter.
Donovan, help me please...
Here is some background what I use your SolrPhpClient for:
I have a classifieds website which uses MySQL. For searching, however, I use Solr to search some indexed fields. Solr then returns an array of ID numbers (for all matches of the search criteria). I use those ID numbers to look up everything in the MySQL db and fetch all the other information (for example, information that is not searchable).
So, simplified: Search -> Solr returns all matches as an array of ID numbers -> the ID numbers from Solr are the same as the ID numbers in the MySQL db, so I can simply match every record against the IDs in the Solr results array.
I don't use Faceting, no boosting, no relevancy or other fancy stuff. I only sort by the latest classified put, and give the option to users to also sort on the cheapest price. Nothing more.
Then I use the "fq" parameter to do queries on different fields in Solr depending on category chosen by users (example "cars" in this case which in my language is "Bilar").
I am really stuck with this problem here...
Thanks for all help

As pointed out in the Stack Overflow comments, your browser query is different from your PHP-client query. To remove that from the equation, you should test with this corrected. To get the same results as the browser-based query, your PHP code should look something like this:
$solr = new Apache_Solr_Service(...);
$searchOptions = array(
'sort' => 'mod_date desc'
);
$results = $solr->search("bmw", 0, 10, $searchOptions);
Instead, I imagine it looks more like:
$searchOptions = array(
'fq' => 'category:"Bilar" + car_action:Säljes',
'sort' => 'mod_date desc'
);
$solr->search("*:*", 0, 10, $searchOptions);
What I expect you to see is that the PHP client results will be the same as the browser-based results, and I imagine the same would happen if you did it the opposite way: take your current parameters from the PHP client and apply them correctly to the browser-based query.
Now onto your problem: you don't see documents sorted properly.
I would try this query, which is the equivalent of the php client based code:
http://localhost:8983/solr/select/?&q=%2A%3A%2A&fq=+category%3A%22Bilar%22+%2B+car_action%3AS%C3%A4ljes&sort=mod_date+desc
versus this query, which moves the filter query into the main query:
http://localhost:8983/solr/select/?&q=+category%3A%22Bilar%22+%2B+car_action%3AS%C3%A4ljes&sort=mod_date+desc
and see if there is a difference. If there is, then it might be a bug in how results from cached filter queries are used and sorted by Solr, which wouldn't be a problem with the client but with the Solr service itself.
Hope this gets you closer to an answer.

Use session values to save the sort parameters.

The quick answer in case someone is attempting to sort via solr-php-client:
$searchOptions = array('sort' => 'field_date desc');
Ditch the + sign that you would usually put on the URL. It took me a while as well to figure it out, I was encoding it and putting it all over the place...
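The reason the + has to go: the client URL-encodes each parameter value itself (the code in the question calls http_build_query), so a space already travels as + on the wire, while a literal + is escaped to %2B and reaches Solr as part of the field name. The same effect can be reproduced with Java's standard URLEncoder; `SortParamEncoding` and `encode` are names chosen only for this illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SortParamEncoding {
    // Form-encodes a single parameter value as UTF-8.
    static String encode(String raw) {
        try {
            return URLEncoder.encode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A plain space is what the client expects; it becomes '+' in the URL.
        System.out.println("sort=" + encode("mod_date desc")); // prints sort=mod_date+desc

        // A hand-written '+' gets escaped, so the server no longer sees a space.
        System.out.println("sort=" + encode("mod_date+desc")); // prints sort=mod_date%2Bdesc
    }
}
```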

It's possible it's related to the json.nl=map parameter. When the response is set to JSON with wt=json and json.nl=map, facets are not sorted as expected with the facet.sort or f.<field_name>.facet.sort=count|index options.
e.g. with facet.sort=count and wt=json only, I get:
"dc_coverage": [
"United States",
5,
"19th century",
1,
"20th century",
1,
"Detroit (Mich.)",
1,
"Pennsylvania",
1,
"United States--Michigan--Detroit",
1,
"United States--Washington, D.C.",
1
]
But with facet.sort=count, wt=json, and json.nl=map as an option, you can see the sorting is lost:
"dc_coverage": {
"19th century": 1,
"20th century": 1,
"Detroit (Mich.)": 1,
"Pennsylvania": 1,
"United States": 5,
"United States--Michigan--Detroit": 1,
"United States--Washington, D.C.": 1
}
There is more information here about formatting the JSON response when using json.nl=map: https://cwiki.apache.org/confluence/display/solr/Response+Writers#ResponseWriters-JSONResponseWriter
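If the count order matters to the consumer, one option is to keep the default flat name/count list and rebuild an insertion-ordered map on the client. The sketch below assumes the flat list has already been parsed from JSON into a List of alternating names and numbers; `FacetOrderDemo` and `fromFlatList` are hypothetical names. It also shows how a key-sorted map loses the count order, which is similar to what the json.nl=map output above looks like.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FacetOrderDemo {
    // Rebuilds name -> count pairs from the flat alternating list,
    // preserving the order in which the server sent them.
    static Map<String, Integer> fromFlatList(List<Object> flat) {
        Map<String, Integer> ordered = new LinkedHashMap<>();
        for (int i = 0; i + 1 < flat.size(); i += 2) {
            String name = (String) flat.get(i);
            int count = ((Number) flat.get(i + 1)).intValue();
            ordered.put(name, count);
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<Object> flat = Arrays.<Object>asList(
                "United States", 5, "19th century", 1, "Pennsylvania", 1);

        Map<String, Integer> ordered = fromFlatList(flat);
        System.out.println(ordered);
        // prints {United States=5, 19th century=1, Pennsylvania=1}

        // A map sorted by key no longer reflects facet.sort=count.
        System.out.println(new TreeMap<>(ordered));
        // prints {19th century=1, Pennsylvania=1, United States=5}
    }
}
```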
