I have an application which holds a list of documents. These documents are
indexed using Lucene.
I can search on keywords of the documents. I loop the TopDocs and get the
ID field (of each Lucene doc), which corresponds to the ID column in my
relational database. From all these IDs, I build a list.
After building the list of IDs, I run a database query which executes
the following SELECT statement (JPA):
SELECT d FROM Document d WHERE d.id IN (##list of IDs retrieved from Lucene##)
This list of documents is sent to the view (GUI).
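A minimal sketch of that ID-collection step (the hit IDs are mocked here; in the real application they come from looping the TopDocs, and the class name IdCollector is illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class IdCollector {

    // Turns the ID field values read from the search hits into a
    // duplicate-free list, preserving ranking order, ready for the
    // JPA IN (...) parameter.
    public static List<Long> collectIds(List<String> hitIdFields) {
        Set<Long> ids = new LinkedHashSet<>();
        for (String field : hitIdFields) {
            ids.add(Long.valueOf(field));
        }
        return new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        // Simulated hit IDs; the real ones come from the Lucene TopDocs loop
        List<String> hits = Arrays.asList("42", "7", "42", "19");
        System.out.println(collectIds(hits)); // [42, 7, 19]
    }
}
```

Using a LinkedHashSet keeps the Lucene ranking order while dropping duplicate hits, so the IN list stays small.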
But some documents are private and should not be in the list. Therefore,
we have some extra conditions in the SELECT query to do the security
checks:
SELECT d FROM Document d WHERE d.id IN (##list of IDs retrieved from Lucene##)
AND d.rule1 = foo
AND d.rule2 = bar
But now I'm wondering: I'm using the speed of Lucene to search documents
quickly, but I still have to run the SELECT query. So I'm losing
performance on this one :-( ...
Does Lucene have some component which does this mapping for you? Or are
there any best practices on this issue? How do big projects map the Lucene
results to the relational database, given that the view has to render the
results?
Many thanks!
Jochen
Some suggestions:
In Lucene, you can use a Filter to narrow down the search result according to your rules.
Store the primary key or another unique key (an ID, a serial number, etc.) in Lucene. Your relational database can then do unique-key lookups, which is very fast.
Lucene can act as storage for your documents too. If that is applicable in your case, you can retrieve each document's content directly from Lucene and don't need to go to your relational database at all.
Why don't you use Lucene to index the table in the database? That way you can do everything in one Lucene query.
If this is a big issue, it may be worth looking at ManifoldCF, which supports document-level security and might fit your needs.
Related
I'm doing college work where I have to search by keywords. My entity is called Position and I'm using MySQL. The fields that I need to search are:
- date
- positionCode
- title
- location
- status
- company
- tecnoArea
I need to search for the same word in all of these fields. To this end, I used the Criteria API to create a dynamic query. It is the same word for several fields, and it should return the maximum possible number of results. Do you have any advice on how to optimize the search on the database? Should I do several queries?
EDIT
I will use an OR constraint.
If you need to find the keyword at any position within the data, you will need to use LIKE with wildcards, e.g. title LIKE '%manager%'. Since date and positionCode (presumably a numeric type) are unlikely to contain the keyword, I would omit these columns from the search for a small performance gain.
Your query is going to need a serial read, which means that all rows in the table must be brought into main memory to evaluate and produce the result set. Given that a serial read is going to happen anyway, I do not think there is much you can do to optimize a query that searches multiple columns.
I am not familiar with the Criteria API for creating dynamic queries, but dynamic queries in other systems are non-optimal: they must be parsed and evaluated every time they are run, and most query optimizers cannot make use of statistics for cost-based optimization the way they can with explicitly defined SQL.
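The OR-chain of LIKE predicates can be assembled with placeholders so the keyword is bound safely via a prepared statement; a minimal sketch (the class name KeywordSearch and the prepared-statement style are illustrative, with the column list taken from the question):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class KeywordSearch {

    // Text columns worth searching; date and positionCode are left out
    // since they are unlikely to contain the keyword.
    static final List<String> TEXT_COLUMNS =
        Arrays.asList("title", "location", "status", "company", "tecnoArea");

    // Builds the OR-chain of LIKE predicates with '?' placeholders so the
    // keyword (wrapped in %...%) can be bound once per column.
    public static String buildWhere() {
        return TEXT_COLUMNS.stream()
            .map(column -> column + " LIKE ?")
            .collect(Collectors.joining(" OR "));
    }

    public static void main(String[] args) {
        System.out.println(buildWhere());
        // title LIKE ? OR location LIKE ? OR status LIKE ? OR company LIKE ? OR tecnoArea LIKE ?
    }
}
```

With JPA's Criteria API the same shape is expressed with CriteriaBuilder.or over several like() predicates, but the generated SQL ends up equivalent to this.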
I'm not sure what your database is.
If it is Oracle, you can use Oracle Text.
The link below might be useful:
http://swiss-army-development.blogspot.com/2012/02/keyword-search-via-oracle-text.html
I have an app with case records stored in a Derby database, and I am using Lucene to full-text index case notes and descriptions. The full-text content is relatively static, but some database fields can change daily on many records, so updating Lucene from the database is not a good option.
What I want to do is allow the user to do a full-text query along with some SQL criteria. For example: all cases that have the words "water" and "melon" (the full-text portion) that were edited in the last 2 days, and their "importance" flag is set to "medium" (the SQL portion). (the full-text query could be much more complex, and similarly for the SQL portion).
This involves a "join" (actually an AND) of the full-text results with the DB results. I can either run the full-text search and check each record against the DB criteria, or vice versa, depending on whether the full-text or the SQL criteria yield the smaller number of records. This is obviously a slow process.
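In code, that in-memory AND looks roughly like this (the ID sets and the class name ResultJoin are illustrative; the real sets would come from Lucene and Derby):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class ResultJoin {

    // "AND" of the two result lists: iterate the smaller set and probe the
    // larger one, so the cost is proportional to the smaller result list.
    public static Set<Long> intersect(Set<Long> fullTextIds, Set<Long> dbIds) {
        Set<Long> small = fullTextIds.size() <= dbIds.size() ? fullTextIds : dbIds;
        Set<Long> large = (small == fullTextIds) ? dbIds : fullTextIds;
        Set<Long> result = new HashSet<>();
        for (Long id : small) {
            if (large.contains(id)) {
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<Long> fullText = new HashSet<>(Arrays.asList(1L, 2L, 3L, 5L));
        Set<Long> db = new HashSet<>(Arrays.asList(2L, 5L, 9L));
        System.out.println(new TreeSet<>(intersect(fullText, db))); // [2, 5]
    }
}
```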
Are there better/faster solutions?
We have done something like this. We used a Postgres database, but as you said, returning all the matching IDs from the database to check against the index is too slow.
In our case, we only needed to update flags for a document. Fortunately, we could afford a somewhat expensive update operation on the database. So we decided to have a blob entry for each flag containing all matching document IDs as a serialized ArrayList or something similar. When searching, we only needed to retrieve one entry from the database, which was fast enough even for a million IDs.
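A minimal sketch of that blob round-trip using standard Java serialization (the class name FlagBlob and the sample IDs are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;

public class FlagBlob {

    // Serialize the matching document IDs into a byte[] that can be stored
    // in a single blob column, one row per flag.
    public static byte[] toBlob(ArrayList<Long> ids) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(ids);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deserialize the ID list back from the blob column value.
    @SuppressWarnings("unchecked")
    public static ArrayList<Long> fromBlob(byte[] blob) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(blob))) {
            return (ArrayList<Long>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ArrayList<Long> ids = new ArrayList<>(Arrays.asList(10L, 20L, 30L));
        System.out.println(fromBlob(toBlob(ids))); // [10, 20, 30]
    }
}
```

The trade-off is exactly as described: the flag update rewrites one whole blob, but a search needs only a single row fetch.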
I wonder what the best way is to connect the Datastore and the Search API.
What I'm looking for is that whenever I create some entity (e.g. a product), this product is added to a search index. On update the index should be updated as well, and when deleting the product - you guessed it - the product should be removed from the search index.
When searching for a product I want to do a full-text search on the product index, but instead of the documents I need the real entities. Probably I will need to first search using the index, and then do a second call to the datastore?
What worries me most is keeping the datastore and search index in synch.
And of course, going through both the search index and the datastore will not only be cumbersome, but I feel it might also cause pain in terms of pagination.
I wonder if some people have already "connected" the datastore and search api this way, what the results have been, and whether there are best practices available. The App Engine docs don't say much in this area.
In order to use the Search API, you need to organize your searchable data into documents, and then structure them into an index by using the Index class. Thus, for the time being you need to do exactly what you describe: keep the searchable documents in sync with your datastore entities.
I have a requirement to implement a contact database. This contact database is special in that the user should be able to dynamically (at runtime) add properties he/she wants to track about the contact. Some of these properties are of type string, others are numbers or dates. Some of the properties have pre-defined values, others are free fields, etc. The user also wants to be able to query such a structure quickly and easily. The database needs to handle 500,000 contacts easily, each having around 10 properties.
This leads to a dynamic property model with a Contact class holding dynamic properties:
class Contact {
    private Map<DynamicProperty, Collection<DynamicValue>> propertiesAndValues;
    // other useful methods
}
The question is how I can store such a structure in "some database" - it does not have to be an RDBMS - so that I can easily express queries such as:
Get all contacts whose name starts with Martin and who are from a company of size 5000 or less, ordered by the time the contact was inserted into the database, with only the first 100 results (to provide pagination), where each of these criteria corresponds to a dynamic property.
I need:
filtering - equals, partial match (and greater/less than for integers and dates), and maybe aggregation - but that is not necessary at this point
sorting
pagination
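Whatever store is chosen, these three requirements compose in the same way; a plain in-memory sketch under assumed names (ContactQuery, the property keys "name" and "companySize", and the sample data are all illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ContactQuery {

    static class Contact {
        final Map<String, Object> props; // dynamic properties by name
        final long insertedAt;           // insertion time, used for ordering
        Contact(Map<String, Object> props, long insertedAt) {
            this.props = props;
            this.insertedAt = insertedAt;
        }
    }

    // Filter by name prefix and maximum company size, order by insertion
    // time, then apply offset/limit pagination.
    public static List<Contact> query(List<Contact> all, String namePrefix,
                                      int maxCompanySize, int offset, int limit) {
        return all.stream()
            .filter(c -> c.props.get("name") instanceof String
                    && ((String) c.props.get("name")).startsWith(namePrefix))
            .filter(c -> c.props.get("companySize") instanceof Integer
                    && (Integer) c.props.get("companySize") <= maxCompanySize)
            .sorted(Comparator.comparingLong(c -> c.insertedAt))
            .skip(offset)
            .limit(limit)
            .collect(Collectors.toList());
    }

    static Contact make(String name, int companySize, long insertedAt) {
        Map<String, Object> props = new HashMap<>();
        props.put("name", name);
        props.put("companySize", companySize);
        return new Contact(props, insertedAt);
    }

    static List<Contact> sample() {
        return Arrays.asList(
            make("Martina", 9000, 1), // excluded: company too big
            make("Martin", 400, 2),
            make("Bob", 10, 3));      // excluded: wrong name prefix
    }

    public static void main(String[] args) {
        for (Contact c : query(sample(), "Martin", 5000, 0, 100)) {
            System.out.println(c.props.get("name")); // Martin
        }
    }
}
```

Of course, 500,000 contacts should not be filtered in application memory like this; the point is only that any candidate store (RDBMS, Lucene, document DB) must support these three operations natively.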
I was considering an RDBMS, but this leads more or less to the following structure, which is quite hard to query and tends to be slow for this amount of data:
contact(id serial pk,....);
dynamic_property(dp_id serial pk, ...);
--only one of the value columns is non-empty
dynamic_property_value(dpv_id serial pk, dynamic_property_fk int, value_integer int, date_value timestamp, text_value text);
contact_properties(pav_id serial pk, contact_id_fk int, dynamic_property_fk int);
property_and_its_value(pav_id_fk int, dpv_id int);
I consider following options:
store the contacts in an RDBMS and use Lucene for querying - is there anything that would help with this?
store the dynamic properties as XML in the RDBMS and use its XPath support - unfortunately this seems to be pretty slow for 500,000 contacts
use another database - MongoDB or Jackrabbit - to store this information
Which way would you go and why?
Wikipedia has a great entry on Entity-Attribute-Value modeling which is a data modeling technique for representing entities with arbitrary properties. It's typically used for clinical data, but might apply to your situation as well.
Have you considered using Lucene for your querying needs? You could probably get away with using just Lucene and storing all your data in the index, although I wouldn't recommend using Lucene as your only persistence store.
Alternatively, you could use Lucene along with a RDBMS and take advantage of something like Compass.
You could try another kind of database, such as CouchDB, which is a document-oriented DB and is distributed.
If you want a dumb solution, you could add some 50 columns to your contacts table, like STRING_COLUMN1, STRING_COLUMN2 ... up to 10, and DATE_COLUMN1 ... DATE_COLUMN10, plus another DESCRIPTION column. So if a row has a name, which is a string, then STRING_COLUMN1 stores the value of the name, and the DESCRIPTION column value would be "STRING_COLUMN1-NAME". Querying can be a bit tricky in this case. I know many purists laugh at this, but I have seen a similar requirement solved this way in one of the apps :)
I have a classifieds website, with approx 30 categories of classifieds.
I am at the stage where I have to build the MySQL tables and index them with SOLR.
Each row in a table has around 15 fields...
I am looking for performance!
I wonder which of these two methods works best:
1- Have one MySQL table for each category, meaning 30 tables, and multiple indexes in SOLR. This would mean that if the user only wants to search one specific category, only that table/index is searched, thus (I think) gaining performance. However, if the user searches ALL categories at once, all tables/indexes would have to be searched.
2- Have one and only one MySQL table, and only one index in SOLR.
Thanks
Assuming that all of the different types of classifieds have the same structure, I would do the following:
Store the text in a single table, along with another field for category (and other fields for whatever other information is associated with a category).
In Solr, build an index that has a text field, a category field, and a PK field. The text and category fields would be indexed but not stored, and the PK field (storing the primary key corresponding to your MySQL table) would be stored but not indexed.
Allow the user to do two kinds of searches: one with just text, and one with text and category. For the latter, the category should be an exact match. The Solr search will return a list of PKs which will allow you to then retrieve documents from MySQL.
You will not see much of a performance improvement by splitting your index into 30 indices, because Solr/Lucene is already very efficient at finding data via its inverted index. Specifying the category name is sufficient.