Reverse search in Hibernate Search


I'm using Hibernate Search (which is built on Lucene) to search data that I have indexed in a directory. It works fine, but I need to do a reverse search. By reverse search I mean that I have a list of queries stored in my database, and each time a Data object is created I need to check which of these queries match it. The goal is to alert users when a new Data object matches a query they created. So I need to index the single Data object that has just been created and see which queries in my list return it as a result.
I've seen Lucene's MemoryIndex class, which creates an index in memory, so I could do something like this example for every query in the list (though iterating over a Java list of queries would not be very efficient):
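// Build the one-document index once, then run every stored query against it
MemoryIndex index = new MemoryIndex();
// Add all fields
index.addField("myField", "myFieldData", analyzer);
...
// The parser is only needed if the queries are stored as strings
QueryParser parser = new QueryParser("myField", analyzer);
// Iterating over my List<Query>: "query" is the current element
float score = index.search(query);
if (score > 0.0f) {
    System.out.println("it's a match");
} else {
    System.out.println("no match found");
}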
The problem is that this Data class has several Hibernate Search annotations (@Field, @IndexedEmbedded, ...) which indicate how its fields should be indexed, so when I invoke the index() method on the FullTextEntityManager instance, it uses this information to index the object in the directory. Is there a similar way to index it in memory using this information?
Is there a more efficient way of doing this reverse search?

Just index the new object (if you use automatic indexing, you don't have to do anything besides committing the current transaction). Then retrieve the stored queries you want to run and wrap each one in a boolean query that combines the stored query with the id of the new object. Something like this:
...
BooleanQuery query = new BooleanQuery();
query.add(storedQuery, BooleanClause.Occur.MUST);
// TermQuery takes a Term; id must be the indexed (string) form of the object's id
query.add(new TermQuery(new Term(ProjectionConstants.ID, id)), BooleanClause.Occur.MUST);
...
If you get a result you know the query matched.

Since MemoryIndex is a completely separate component that doesn't extend or implement Lucene's Directory or IndexReader, I don't think there's a way to plug it into the Hibernate Search annotations. I'm guessing that if you choose to use MemoryIndex, you'll need to write your own addField() calls that mirror what you're doing in the annotations.
How many queries are we talking about here? Depending on how many there are, you might be able to get away with just running the queries on the main index that Hibernate maintains, making sure to constrain the search to the document ID you just added. Or, for every document that's added, create a one-document in-memory index using RAMDirectory and run the queries through that.
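To make the "run every stored query against the one new document" idea concrete, here is a minimal library-free sketch in plain Java. It is only an illustration of the matching loop, not Hibernate Search or Lucene code: StoredQuery, fieldContains and matchingQueries are hypothetical names, and a stored query is modeled as a simple predicate over a map of field values instead of a Lucene Query.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Predicate;

// Library-free illustration of reverse search: many stored queries, one new document.
// All names here are hypothetical; none of this is Hibernate Search or Lucene API.
class ReverseSearchSketch {

    // A stored query: an id plus a condition over the document's fields.
    static final class StoredQuery {
        final String id;
        final Predicate<Map<String, String>> condition;

        StoredQuery(String id, Predicate<Map<String, String>> condition) {
            this.id = id;
            this.condition = condition;
        }
    }

    // Condition that holds when the given field contains the term (case-insensitive).
    static Predicate<Map<String, String>> fieldContains(String field, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        return doc -> {
            String value = doc.get(field);
            return value != null && value.toLowerCase(Locale.ROOT).contains(needle);
        };
    }

    // Returns the ids of all stored queries that match the single new document.
    static List<String> matchingQueries(Map<String, String> doc, List<StoredQuery> queries) {
        List<String> matched = new ArrayList<>();
        for (StoredQuery q : queries) {
            if (q.condition.test(doc)) {
                matched.add(q.id);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", "Senior Java developer");
        doc.put("city", "Madrid");

        List<StoredQuery> queries = new ArrayList<>();
        queries.add(new StoredQuery("q1", fieldContains("title", "java")));
        queries.add(new StoredQuery("q2", fieldContains("city", "paris")));
        queries.add(new StoredQuery("q3",
                fieldContains("title", "developer").and(fieldContains("city", "madrid"))));

        // Prints [q1, q3]: the owners of q1 and q3 would be alerted.
        System.out.println(matchingQueries(doc, queries));
    }
}
```

With a real index, the predicate test would be replaced by running each stored query against the one-document index, as described in the answers above; the loop over stored queries stays the same.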

Related

How to query DynamoDB and return the first matching result instead of querying the whole collection?

We have some old code that queries DynamoDB to find a list of matching records.
Sample code below:
final DynamoDBQueryExpression<MyObject> queryExp = new DynamoDBQueryExpression<MyObject>()
        .withHashKeyValues(myObject)
        .withIndexName(indexName)
        .withScanIndexForward(false)
        .withConsistentRead(true)
        .withLimit(rowsPerPage);
final PaginatedQueryList<MyObject> ruleInstanceList = dynamoDBMapper.query(MyObject.class, queryExp);
This is a slow operation, since the query returns a list of matching MyObject items, and I noticed that all we use it for is to check whether the list is empty.
So what I want is to query for just the first element, or to use a different type of query that simply confirms the count is greater than 0. All I need to verify is that the record exists, so that I can reduce the latency.
How can I achieve this?

The documentation for getLimit() indicates:
Note that when calling DynamoDBMapper.query, multiple requests are made to DynamoDB if needed to retrieve the entire result set. Setting this will limit the number of items retrieved by each request, NOT the total number of results that will be retrieved. Use DynamoDBMapper.queryPage to retrieve a single page of items from DynamoDB.
To limit the number of results, use queryPage() instead of query(), and apply withLimit(1) to your query expression.

Bulk read Couchbase documents

I want to asynchronously read a number of documents from a Couchbase bucket. This is my code:
JsonDocument student = bucketStudent.get(studentID);
The problem is that for a large data file with many studentIDs, it takes a long time to get all the documents, because get() is called once per studentID. Is it possible to pass a list of studentIDs as input and get back a list of students, instead of fetching a single document for each studentID?

If you are running a query node, you can use N1QL for this. Your query would look like this:
SELECT * FROM myBucket USE KEYS ["key1", "key2", "key3"]
In practice you would probably pass in the array of strings as a parameter, like this:
SELECT * FROM myBucket USE KEYS ?
You will need a primary index for your bucket, or queries like this won't work.

AFAIK the Couchbase SDK does not have a native function for a bulk get operation.
The Node.js SDK has a getMulti method, but it basically iterates over an array and fires get() for each element.
I've found in my applications that the key-value approach is still faster than SELECT * on a primary index, but the N1QL query is remarkably close (on Couchbase 5.x).
Just a quick tip: if you have A LOT of ids to fetch and you decide to go with N1QL queries, try to split that list into smaller chunks. It speeds up the queries, and you can manage your errors better and avoid some nasty timeouts.
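The chunking tip can be sketched with a small helper like the one below. This is a plain-Java illustration only: KeyChunker and chunk are hypothetical names, and the actual query call is deliberately left out. Each returned sublist would be passed as the parameter of one USE KEYS query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: splits a long list of document ids into smaller batches.
class KeyChunker {

    // Splits ids into consecutive sublists of at most chunkSize elements.
    static <T> List<List<T>> chunk(List<T> ids, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, ids.size());
            // Copy the view so each chunk is independent of the source list.
            chunks.add(new ArrayList<>(ids.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("s1", "s2", "s3", "s4", "s5");
        // Prints [[s1, s2], [s3, s4], [s5]]
        System.out.println(chunk(ids, 2));
    }
}
```

Handling each chunk separately also means a timeout or error affects only that batch, which can be retried on its own.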

Retrieving multiple documents by their document IDs is not supported by default in the Couchbase Java SDK. To achieve that, you'll need to use an N1QL query like the one below:
SELECT S.* FROM Student S USE KEYS ["StudentID1", "StudentID2", "StudentID3"]
which returns an array of documents with the given IDs. Construct the query with com.couchbase.client.java.query.N1qlQuery, and execute it in either of the following ways.
If you're using Spring's CouchbaseTemplate, you can use:
List<T> findByN1QL(com.couchbase.client.java.query.N1qlQuery n1ql,
        Class<T> entityClass)
If you're using Couchbase's Java SDK, you can use:
N1qlQueryResult query(N1qlQuery query)
Note that you'll need an index on your bucket to run N1QL queries.

It is possible now. The Java SDK provides the capability to do a multi-get. It comes in two flavours:
Async bulk get
N1QL query (does not work with binary documents)
The Couchbase documentation suggests using the async bulk get, but that has an additional dependency on the reactive Java client; see the official documentation.
There are several tutorials explaining the usage.
Here is how a sample get would look with Java SDK 3.2:
List<String> docsToFetch = Arrays.asList("airline_112", "airline_1191", "airline_1203");
Map<String, GetResult> successfulResults = new ConcurrentHashMap<>();
Map<String, Throwable> erroredResults = new ConcurrentHashMap<>();
Flux.fromIterable(docsToFetch)
    .flatMap(key -> reactiveCollection.get(key)
        // Record the failure and continue with the remaining keys
        .onErrorResume(e -> {
            erroredResults.put(key, e);
            return Mono.empty();
        })
        .doOnNext(getResult -> successfulResults.put(key, getResult)))
    // blockLast() also completes normally when every get failed (empty Flux),
    // whereas last().block() would throw NoSuchElementException in that case
    .blockLast();

Sort the values by Date in mongodb

I am new to MongoDB and I am trying to sort all my records by date. I have records from mixed sources and I am trying to sort each source separately. For some records I didn't set dateCreated when writing them to the db. Later I found out and added dateCreated to the records in the db. Say I have a total of 4000 records: the first 1000 don't have dateCreated, and the latest 3000 have that field. I am trying to get the most recently updated record using the dateCreated field. Here is my code.
db.person.find({"source":"Naukri"}&{dateCreated:{$exists:true}}).sort({dateCreated: 1}).limit(10)
This code returns some results (from those 1000 records) where I can't see the dateCreated field at all. Moreover, if I change the sort to {dateCreated: -1}, I get results from some other source, not Naukri.
So I need help with these cases:
How do I sort by dateCreated to get the most recently updated record, filtered by source as well?
I am using the Java API to get the records from Mongo. I'd be grateful if someone could show me how to run the same query from Java too.
Hope my question is clear. Thanks in advance.

From the documentation you will read (and you will, won't you? Nod yes), you will find that the first argument to the find command is what is called a query document. In this document you specify a list of fields and conditions, comma-separated, which is the equivalent of an AND condition in a declarative syntax such as SQL.
The problem with your query is that it was not valid: & is JavaScript's bitwise AND operator, not a way to combine two conditions, so the filter was never applied as you intended. The correct syntax would be as follows:
db.person.find({"source":"Naukri", dateCreated:{$exists:true}})
.sort({dateCreated: -1})
.limit(10)
So now this will filter by the value provided for "source" and by documents where the "dateCreated" field exists, meaning the field is present in the document (note that $exists: true also matches a field whose value is null).
I recommend looking at the links below. The first two cover structuring MongoDB queries and the find method and its arguments. All of the functionality translates to every language implementation.
As for the Java API, there are different approaches depending on which you are comfortable with. The API provides a BasicDBObject class, which is more or less equivalent to the JSON document notation and works somewhat like a hashmap. For something closer to the shell methods and to the style of dynamic languages, there is the QueryBuilder class, for which the last two links give examples and information. These allow chaining to make your query more readable.
There are many examples on Stack Overflow alone. I suggest you take a look.
http://docs.mongodb.org/manual/tutorial/query-documents/
http://docs.mongodb.org/manual/reference/method/db.collection.find/
How to do this MongoDB query using java?
http://api.mongodb.org/java/2.2/com/mongodb/QueryBuilder.html

Your query is not correct. Update it as follows:
db.person.find({"source":"Naukri", dateCreated:{$exists:true}}).sort({dateCreated: 1}).limit(10)
In Java, you can do it as follows:
Mongo mongo = ...
DB db = mongo.getDB("yourDbName");
DBCollection coll = db.getCollection("person");
DBObject query = new BasicDBObject();
query.put("source", "Naukri");
// Only match documents where the dateCreated field is present
query.put("dateCreated", new BasicDBObject("$exists", true));
// 1 sorts ascending (oldest first); use -1 to get the latest records first
DBCursor cur = coll.find(query).sort(new BasicDBObject("dateCreated", 1)).limit(10);
while (cur.hasNext()) {
    DBObject obj = cur.next();
    // Get data from the result object here
}

Can lucene only sort and search for nothing?

I want to list the latest 10 rows, ordered by id DESC:
Sort sort = new Sort(new SortField[]{new SortField("id",SortField.INT,true)});
TopDocs topDocs=indexSearch.search(null,null,10,sort);//no need Query,only sort
...
I got a 500 exception because the Query parameter is null.
What is the best way to implement this?
BTW: the id field is a NumericField, written using:
new NumericField("id",Integer.MAX_VALUE,Field.Store.YES,true)

You should use the MatchAllDocsQuery for that.
A Lucene Query is a peculiar object: it isn't only the specification of the query semantics, but also the implementation of the most efficient execution strategy for each particular query type. That's why there must be a special Query even for this "no-op" case.

BTW: if you want to search the latest X rows, it's better to add a new date field holding the time the doc was added to the repository than to rely on the counter (id in your case).
Think about what happens if you update an existing doc or you reach Integer.MAX_VALUE.

Identify existence of keywords in document from list

I want to create a tag list for a Lucene document based on a pre-determined list.
So, if we have a document with the text
Looking for a Java programmer with experience in Lucene
and we have the keyword list (about 1000 items)
java, php, lucene, c# [...]
I want to identify that the keywords Java and Lucene exist in the document.
Just doing a java OR php OR lucene will not work because then I will not know which keyword generated the hit.
Any suggestions on how to implement this in Lucene?

I assume that you have one or more indexed fields, and you want to build your tag cloud based on the intersection of your keywords and the indexed terms for a document.
Your problem is very similar to highlighting, so the same ideas apply. You can either:
re-analyze the stored fields of your Lucene document, or
use term vectors for fast access to the terms of your documents' fields.
Note that if you want to use term vectors, you need to enable them at index time (see the Field.TermVector.YES documentation and the Field constructor).
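As a rough illustration of the first option (re-analyzing the text and intersecting its tokens with the keyword list), here is a minimal plain-Java sketch. It does not use Lucene: KeywordTagger and tags are hypothetical names, and the naive lowercase-and-split step merely stands in for a real Lucene analyzer, which may tokenize differently.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical, Lucene-free sketch: find which keywords from a fixed list occur in a text.
class KeywordTagger {

    // Returns the keywords that occur as whole tokens in the text, in order of first appearance.
    // The keywords are expected to be lowercase already.
    static Set<String> tags(String text, Set<String> keywords) {
        Set<String> found = new LinkedHashSet<>();
        // Naive analysis: lowercase, then split on anything that is not a letter, digit, '#' or '+'.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}#+]+")) {
            if (keywords.contains(token)) {
                found.add(token);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Set<String> keywords = new HashSet<>(Arrays.asList("java", "php", "lucene", "c#"));
        String text = "Looking for a Java programmer with experience in Lucene";
        // Prints [java, lucene]
        System.out.println(tags(text, keywords));
    }
}
```

Because each token is a single set lookup, the cost depends on the length of the document rather than on the size of the keyword list, which is what makes this workable for about 1000 keywords.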

Yes, this works
FullTextSession fts = Search.getFullTextSession(getSessionFactory().getCurrentSession());
Query q = fts.getSearchFactory().buildQueryBuilder()
        .forEntity(Offer.class).get()
        .keyword()
        .onField("id")
        .matching(myId)
        .createQuery();
Object[] dId = (Object[]) fts.createFullTextQuery(q, Offer.class)
        .setProjection(ProjectionConstants.DOCUMENT_ID)
        .uniqueResult();
if (dId != null) {
    IndexReader indexReader = fts.getSearchFactory().getIndexReaderAccessor().open(Offer.class);
    TermFreqVector freq = indexReader.getTermFreqVector((Integer) dId[0], "description");
}
Remember to index the field with TermVector.YES in the Hibernate Search annotation for that field.
