Lucene full text joined with database criteria - java

I have an app with case records stored in a Derby database, and I am using Lucene to full-text index case notes and descriptions. The full-text content is relatively static, but some database fields can change daily on many records, so updating Lucene from the database is not a good option.
What I want to do is allow the user to run a full-text query combined with some SQL criteria. For example: all cases that contain the words "water" and "melon" (the full-text portion), that were edited in the last 2 days, and whose "importance" flag is set to "medium" (the SQL portion). Both the full-text query and the SQL portion could be much more complex.
This involves a "join" (actually an "AND") of the full-text results with the DB results. I can either run the full-text search and check each hit against the DB criteria, or vice versa, depending on which side yields the smaller number of records. Either way, this is obviously a slow process.
Are there better/faster solutions?

We have done something like this. We used a PostgreSQL database, but as you said, returning all the matching IDs from the database and checking them against the index is too slow.
In our case, we only needed to update flags for a document. Fortunately, we could afford a somewhat expensive update operation on the database. So we decided to store a blob entry for each flag containing all matching document IDs as a serialized ArrayList or something similar. When searching, we only needed to retrieve one entry from the database, which was fast enough even for a million IDs.
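A minimal sketch of that approach, assuming a flag_ids table with flag and ids columns (all names here are hypothetical) and plain JDBC with Java serialization:

import java.io.*;
import java.sql.*;
import java.util.ArrayList;

public class FlagIdStore {

    // Serialize the full list of matching document IDs and store it under the flag's key.
    public void saveIds(Connection con, String flag, ArrayList<Long> ids) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(ids);
        }
        try (PreparedStatement ps = con.prepareStatement(
                "UPDATE flag_ids SET ids = ? WHERE flag = ?")) {
            ps.setBytes(1, bos.toByteArray());
            ps.setString(2, flag);
            ps.executeUpdate();
        }
    }

    // A single-row fetch; deserializing even ~1M longs beats a million per-row checks.
    @SuppressWarnings("unchecked")
    public ArrayList<Long> loadIds(Connection con, String flag) throws Exception {
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT ids FROM flag_ids WHERE flag = ?")) {
            ps.setString(1, flag);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) return new ArrayList<>();
                try (ObjectInputStream ois = new ObjectInputStream(
                        new ByteArrayInputStream(rs.getBytes(1)))) {
                    return (ArrayList<Long>) ois.readObject();
                }
            }
        }
    }
}

The loaded list can then be intersected in memory (e.g. via a HashSet) with the IDs coming out of the Lucene search, instead of issuing one database check per hit.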

Related

Historical data search in SQL database

I have a use case to search a normalized SQL database for given criteria in a historical database of more than a few million records. Using a stored procedure to join the normalized tables solves the search, but performance is very slow.
Is there an alternative where we take the data into memory and perform the search there?
I would like to know an approach to solve this problem.
You could set up Elasticsearch, which will cache frequently executed searches.
Use Apache Solr, which is able to handle big data with faceted search.
https://lucene.apache.org/solr/

Google App-Engine Datastore is extremely slow

I need help in understanding why the below code is taking 3 to 4 seconds.
UPDATE: Use case for my application is to get the activity feed of a person since last login. This feed could contain updates from friends or some new items outside of his network that he may find interesting. The Activity table stores all such activities and when a user logs in, I run a query on the GAE-DataStore to return above activities. My application supports infinite scrolling too, hence I need the cursor feature of GAE. At a given time, I get around 32 items but the activities table could have millions of rows (as it contains data from all the users).
Currently the Activity table is small and contains only 25 records, and the Java code below reads only 3 records from it.
Each record in the Activity table has 4 UUID fields.
I cannot imagine how the query would behave if the table contained millions of rows and result contained 100s of rows.
Is there something wrong with the code I have below?
(I am using Objectify and App Engine cursors)
// Imports assumed from the App Engine SDK and Objectify 4+:
import com.google.appengine.api.datastore.Cursor;
import com.google.appengine.api.datastore.Query.Filter;
import com.google.appengine.api.datastore.Query.FilterOperator;
import com.google.appengine.api.datastore.Query.FilterPredicate;
import com.google.appengine.api.datastore.QueryResultIterator;
import com.googlecode.objectify.cmd.Query;
import static com.googlecode.objectify.ObjectifyService.ofy;

Filter filter = new FilterPredicate("creatorID", FilterOperator.EQUAL, userId);
Query<Activity> query = ofy().load().type(Activity.class).filter(filter);
query = query.startAt(Cursor.fromWebSafeString(previousCursorString));
QueryResultIterator<Activity> itr = query.iterator();
while (itr.hasNext()) {
    Activity a = itr.next();
    System.out.println(a);
}
I have gone through Google App Engine Application Extremely slow and verified that response time improves if I keep refreshing my page (which calls the above code). However, the improvement is only ~30%.
Compare this with any other database, where the response time for such tiny data would be in milliseconds, not hundreds of milliseconds.
Am I wrong to expect regular-database performance from the GAE Datastore?
I do not want to turn on memcache just yet, as I want to improve this layer without caching first.
Not exactly sure what your query is supposed to do, but it doesn't look like it requires a cursor. In my humble opinion, the only valid use case for cursor queries is a paginated query for data with a limited count of result rows. Since your query does not have a limit, I don't see why you would want to use a cursor at all.
If you need millions of results, you're probably doing ad-hoc analysis of the data (no human could ever interpret millions of raw data rows), and you might be better off using BigQuery instead of the App Engine Datastore. I'm just guessing here, but for normal front-end apps you rarely need millions of rows in a result, only a few (maybe hundreds at times) which you filter from the total available rows.
Another thing:
Are you sure that it is the query that takes long? It might just as well be the wrapper around the query. Since you are using cursors, you have to re-issue the query until there are no more results, and the handling of this could be costly.
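One hedged way to check (reusing userId and the imports from the snippet in the question; the measurement points are only illustrative) is to time query setup and result iteration separately:

long t0 = System.nanoTime();
QueryResultIterator<Activity> itr = ofy().load().type(Activity.class)
        .filter("creatorID", userId)   // Objectify shorthand for the FilterPredicate form
        .iterator();
long t1 = System.nanoTime();           // time spent building/issuing the query
int count = 0;
while (itr.hasNext()) {
    itr.next();
    count++;
}
long t2 = System.nanoTime();           // time spent actually pulling results
System.out.println("setup: " + (t1 - t0) / 1_000_000 + " ms, iteration ("
        + count + " rows): " + (t2 - t1) / 1_000_000 + " ms");

Since Datastore queries are lazy, most of the RPC cost will typically show up in the iteration phase; if setup dominates instead, the wrapper is your problem.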
Lastly:
Are you testing on App Engine itself or on the local development server? The devserver obviously cannot simulate a cloud and thus could be slower (or faster) than the real thing at times. The devserver also does not know about instance warm-up times when your query spawns new instances.
Speaking of cloud: the thing about cloud databases is not that they have the best performance for very little data, but that they scale and perform consistently whether you have a couple of hundred or a couple of billion rows.
Edit:
After performing a retrieval operation, the application can obtain a cursor, which is an opaque base64-encoded string marking the index position of the last result retrieved.
[...]
The cursor's position is defined as the location in the result list after the last result returned. A cursor is not a relative position in the list (it's not an offset); it's a marker to which the Datastore can jump when starting an index scan for results. If the results for a query change between uses of a cursor, the query notices only changes that occur in results after the cursor. If a new result appears before the cursor's position for the query, it will not be returned when the results after the cursor are fetched.
(Datastore Queries)
These two statements make me believe that query performance should be consistent with or without cursors.
Here are some more things you might want to check:
How do you register your entity classes with objectify?
What does your actual test code look like? I'd like to see how and where you measure.
Can you share a comparison between cursor query and query without cursors?
Improvement with multiple requests could be the result of Objectify's integrated caching. You might want to disable caching for Datastore performance tests.
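For the last point, a hedged sketch (the cache(false) call exists in Objectify 4+; adjust to your version):

// Bypass Objectify's session and memcache caches for this load,
// so the measurement reflects raw Datastore latency.
List<Activity> fresh = ofy().cache(false)
        .load()
        .type(Activity.class)
        .filter("creatorID", userId)
        .list();

Also check whether the Activity entity carries the @Cache annotation; that is what enables the global, memcache-backed cache in the first place.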

Database search with keywords using JPA

I'm doing college work where I have to search by keywords. My entity is called Position and I'm using MySQL. The fields that I need to search are:
    - date
    - positionCode
    - title
    - location
    - status
    - company
    - tecnoArea
I need to search for the same word in all of these fields. To this end, I used the Criteria API to create a dynamic query. It is the same word for several fields, and it should get the maximum possible results. Do you have any advice on how to optimize the search on the database? Should I do several queries?
EDIT
I will use an OR constraint.
If you need to find the keyword at any position within the data, you will need to use LIKE with wildcards, e.g. title LIKE '%manager%'. Since date and positionCode (presumably a numeric type) are not likely to contain the keyword, I would omit these columns from the search for a small performance gain.
Your query is going to need a serial read, which means that all rows in the table will be brought into main memory to evaluate the result set. Given that a serial read is going to happen anyway, I do not think there is much you can do to optimize a query that searches multiple columns.
I am not familiar with the Criteria API for creating dynamic queries, but in other systems dynamic queries are non-optimal: they must be parsed and evaluated every time they are run, and most query optimizers cannot use statistics for cost-based optimization the way they can with explicitly defined SQL.
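For reference, a minimal sketch of the Criteria API version of this, as an OR of LIKE predicates (it assumes a Position entity with the String fields from the question and is untested):

import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.criteria.*;

public List<Position> searchByKeyword(EntityManager em, String keyword) {
    CriteriaBuilder cb = em.getCriteriaBuilder();
    CriteriaQuery<Position> cq = cb.createQuery(Position.class);
    Root<Position> pos = cq.from(Position.class);

    // Case-insensitive contains-match; the %...% wildcards force the serial read noted above.
    String pattern = "%" + keyword.toLowerCase() + "%";
    Predicate match = cb.or(
            cb.like(cb.lower(pos.<String>get("title")), pattern),
            cb.like(cb.lower(pos.<String>get("location")), pattern),
            cb.like(cb.lower(pos.<String>get("status")), pattern),
            cb.like(cb.lower(pos.<String>get("company")), pattern),
            cb.like(cb.lower(pos.<String>get("tecnoArea")), pattern));
    // date and positionCode are deliberately left out, as suggested above.

    return em.createQuery(cq.where(match)).getResultList();
}

A single query like this is usually preferable to several queries, since the table is scanned once rather than once per field.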
Not sure what your database is.
If it is Oracle, you can use Oracle Text.
The link below might be useful:
http://swiss-army-development.blogspot.com/2012/02/keyword-search-via-oracle-text.html

Mapping Lucene search results with a relational database

I have an application which holds a list of documents. These documents are indexed using Lucene. I can search on keywords of the documents: I loop over the TopDocs and get the ID field (of each Lucene doc), which relates to the ID column in my relational database. From all these IDs, I create a list.
After building the list of IDs, I run a database query executing the following SELECT statement (JPA):
SELECT d FROM Document d WHERE d.id IN (##list of ID's retrieved from Lucene##)
This list of documents is then sent to the view (GUI).
But some documents are private and should not be in the list. Therefore, we have some extra conditions in the SELECT query to do security checks:
SELECT d FROM Document d WHERE d.id IN (##list of ID's retrieved from Lucene##)
AND d.rule1 = :foo
AND d.rule2 = :bar
But now I'm wondering: I'm using the speed of Lucene to quickly search documents, but I still have to do the SELECT query, so I'm losing performance there :-( ...
Does Lucene have some component which does this mapping for you? Or are there any best practices for this issue? How do big projects map Lucene results to the relational database, given that the view has to render the results?
Many thanks!
Jochen
Some suggestions:
In Lucene, you can use a filter to narrow down the search results according to your rules (see the sketch after these suggestions).
Store the primary key or a unique key (an ID, a serial number, etc.) in Lucene. Then your relational database can do unique-key lookups, which are very fast.
Lucene can act as storage for your documents too. If applicable in your case, you can just retrieve the individual documents' content from Lucene and don't need to go to your relational database at all.
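A hedged sketch of the first suggestion (keywordQuery and searcher are assumed to already exist, the rule fields are assumed to be indexed alongside the ID, and in Lucene 5+ the old Filter class is replaced by BooleanClause.Occur.FILTER):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

// Combine the user's keyword query with non-scoring filter clauses for the security rules.
BooleanQuery filtered = new BooleanQuery.Builder()
        .add(keywordQuery, BooleanClause.Occur.MUST)                      // the full-text part
        .add(new TermQuery(new Term("rule1", "foo")), BooleanClause.Occur.FILTER)
        .add(new TermQuery(new Term("rule2", "bar")), BooleanClause.Occur.FILTER)
        .build();
TopDocs hits = searcher.search(filtered, 50);
// Every returned ID already satisfies the rules, so the follow-up
// SQL IN query no longer needs the extra AND conditions.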
Why don't you use Lucene to index the relevant table columns in the database as well? That way you can do everything in one Lucene query.
If this is a big issue, maybe it's worth looking at ManifoldCF, which supports document-level security and might fit your needs.

Does using LIMIT in a query via JDBC have any effect on performance?

If we use the LIMIT clause in a query which also has an ORDER BY clause, and execute the query via JDBC, will there be any effect on performance? (using a MySQL database)
Example:
SELECT modelName FROM Cars ORDER BY manuDate DESC LIMIT 1
I read in one of the threads on this forum that, by default, a set number of rows is fetched at a time. How can I find the default fetch size?
I want only one record. Originally, I was doing it as follows:
SQL Query:
SELECT modelName FROM Cars ORDER BY manuDate DESC
In the Java code, I was extracting it as follows:
if (resultSet.next()) {
    // do something here.
}
Definitely, the LIMIT 1 will have a positive effect on performance. Instead of the entire data set of matches (well, depending on the default fetch size) being returned from the DB server to the Java code, only one row will be returned. This saves a lot of network bandwidth and Java memory usage.
Always delegate constraints like LIMIT, ORDER BY, WHERE, etc. to the SQL language instead of applying them on the Java side. The DB will do it much better than your Java code ever can (if the table is properly indexed, of course). You should try to write the SQL query so that it returns exactly the information you need.
The only disadvantage of writing DB-specific SQL queries is that the SQL language is not entirely portable among different DB servers, which would require you to change the SQL queries every time you change DB server. But in the real world it is very rare to switch to a completely different DB make anyway. Externalizing the SQL strings to XML or properties files should help a lot in any case.
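A minimal JDBC sketch of this advice (url, user, and pass are placeholders; the table and column names are taken from the question):

import java.sql.*;

try (Connection con = DriverManager.getConnection(url, user, pass);
     PreparedStatement ps = con.prepareStatement(
             "SELECT modelName FROM Cars ORDER BY manuDate DESC LIMIT 1")) {
    ps.setMaxRows(1);                   // belt and braces: also cap the rows on the JDBC side
    try (ResultSet rs = ps.executeQuery()) {
        if (rs.next()) {
            String newestModel = rs.getString("modelName");
            // exactly one row crosses the wire instead of the whole ordered table
        }
    }
}

The setMaxRows(1) call is optional here; the LIMIT 1 already does the real work on the server.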
There are two ways the LIMIT could speed things up:
by producing less data, which means less data gets sent over the wire and processed by the JDBC client
by potentially having MySQL itself look at fewer rows
The second one of those depends on how MySQL can produce the ordering. If you don't have an index on manuDate, MySQL will have to fetch all the rows from Cars, then order them, then give you the first one. But if there's an index on manuDate, MySQL can just look at the first entry in that index, fetch the appropriate row, and that's it. (If the index also contains modelName, MySQL doesn't even need to fetch the row after it looks at the index -- it's a covering index.)
With all that said, watch out! If manuDate isn't unique, the ordering is only partially deterministic (the order for all rows with the same manuDate is undefined), and your LIMIT 1 therefore doesn't have a single correct answer. For instance, if you switch storage engines, you might start getting different results.
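Two follow-ups in plain SQL, illustrating the two points above (assuming Cars has a unique id column; the index name is made up):

-- covering index: lets MySQL answer the ORDER BY ... LIMIT 1 from the index alone
CREATE INDEX idx_cars_manudate ON Cars (manuDate, modelName);

-- a tiebreaker column makes LIMIT 1 deterministic when manuDate is not unique
SELECT modelName FROM Cars ORDER BY manuDate DESC, id DESC LIMIT 1;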
