Which method should I go with; Indexing MySQL db with SOLR - java

I have a classifieds website, with approx 30 categories of classifieds.
I am at the stage where I have to build MySQL tables and index them with SOLR.
Each row in a table has around 15 fields...
I am looking for performance!
I wonder which of these two methods works best:
1- Have one MySQL table for each category, meaning 30 tables, and then have multiple indexes in SOLR ( This would mean that if the user only wants to search in one specific category, then that table/index is searched, thus gaining performance (I think). However, if the user searches ALL categories at once, then all tables/indexes would have to be searched. )
2- Have one and only one MySQL table, and only one index in SOLR.
Thanks

Assuming that all of the different types of classifieds have the same structure, I would do the following:
Store the text in a single table, along with another field for category (and other fields for whatever other information is associated with a category).
In Solr, build an index that has a text field, a category field, and a PK field. The text and category fields would be indexed but not stored, and the PK field (storing the primary key corresponding to your MySQL table) would be stored but not indexed.
Allow the user to do two kinds of searches: one with just text, and one with text and category. For the latter, the category should be an exact match. The Solr search will return a list of PKs which will allow you to then retrieve documents from MySQL.
You will not see much of a performance improvement by splitting your index up into 30 indices, because Solr/Lucene is already very efficient at finding data via its inverted indices. Specifying the category name is sufficient.
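As a rough illustration of the single-index approach, here is a minimal sketch of how the two kinds of search could differ only by one filter parameter. The helper class and the field names text, category and id are assumptions made for this example; q, fq and fl are the standard Solr request parameters for the main query, a filter query and the list of returned fields.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds the parameter string for a Solr select request.
// The field names "text", "category" and "id" are assumptions for this sketch.
class ClassifiedQuery {

    /** category may be null or empty to search all categories at once. */
    static String buildParams(String userText, String category) {
        StringBuilder sb = new StringBuilder();
        // Note: userText is not escaped for Solr query syntax in this sketch
        sb.append("q=").append(encode("text:(" + userText + ")"));
        if (category != null && !category.isEmpty()) {
            // fq narrows the result set to an exact category match
            sb.append("&fq=").append(encode("category:\"" + category + "\""));
        }
        // Only the stored primary key comes back; the rows are then read from MySQL
        sb.append("&fl=id");
        return sb.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }
}
```

Calling ClassifiedQuery.buildParams("red bike", null) searches every category, while passing "cars" as the second argument adds the category filter to the same index.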


Hibernate query to fetch records taking much time

I am trying to retrieve a set of records from a table. The query I am using is:
select * from EmployeeUpdates eu where eu.updateid>0 and eu.department = 'EEE'
The table EmployeeUpdates has around 20 million records. 'updateid' is the primary key, and there are currently no records in the table with the department 'EEE'. But the query takes so long that the web-service call times out.
Currently we have an index only on the column 'updateid'. 'department' is a newly added column for which we are expecting 'EEE' records.
What changes can I make to retrieve the results faster?
First off, your SQL isn't valid; it looks like you're missing an 'and' between the two conditions.
I'm guessing that all the update IDs are positive, and since updateid is the primary key they're unique, so I suspect eu.updateid>0 matches every row. This means it's technically not a table space scan but an index-based scan, although if that scan still has all 20 million rows to check after matching the index, you might as well have a table space scan. The only thing you can really do is add an index on the department column. Depending on what this data is, you could move the departments to a separate table with a numeric primary key and store that key as a foreign key on the eu table. The query would then scan through the departments and fetch the updates associated with the matching one, rather than searching every single update for a specific department.
I think you should look into using a Table-per-subclass mapping (more here: http://docs.jboss.org/hibernate/orm/3.3/reference/en-US/html/inheritance.html#inheritance-tablepersubclass-discriminator). You can make the department the discriminator, and then you'd have EEEEmployeeUpdates and ECEmployeeUpdates classes. Your query could then change to query just EEEEmployeeUpdates.

Generate xml dynamically in java using data from a table in database

Here is the problem that I have been stuck on for the past few days. I am looking for guidance and approaches on how to handle it. Hints and suggestions are welcome.
So here is the problem. The database has a table "group" with two columns: group_id and parent_group_id. group_id is the primary key of the table. All entries in this table represent groups or sub-groups. If one adds a sub-group from the front end, an entry is inserted into the group table with an auto-generated group_id, which MySQL generates. The parent_group_id corresponds to the group_id of the group to which the sub-group was added, so in essence it acts like a foreign key to the group_id column. My task is to generate XML in Java from the data in the group table, and this is where I am stuck. I know a recursive function needs to be written, but I can't figure out how to dynamically create the nodes and fill in the data from the DB at the same time. The final XML needs to be sent as JSON data to the front end.
A group can have n sub-groups and the hierarchy can go on. For example, say Vehicle is the root node with group_id = 1. It can have cars and bikes as sub-groups, so the parent_group_id will be 1 for both car and bike, and their group ids will be, say, 2 and 3 respectively.
P.S.: This is the first time I am posting here, having used this site for the past year. Please let me know if any more info is needed or whether you are able to comprehend my problem.
If you split the task into two, it will be more manageable.
Here are some useful links on querying hierarchical data in relational databases and specifically in MySQL:
What are the options for storing hierarchical data in a relational database?
http://mikehillyer.com/articles/managing-hierarchical-data-in-mysql/
http://www.slideshare.net/billkarwin/models-for-hierarchical-data
http://en.wikipedia.org/wiki/Common_table_expressions#Common_table_expression
As long as you have the query result properly sorted, you will be able to traverse it recursively, building the XML tree step by step.
I was able to solve it by using recursive functions :). I loaded all the data using the entity class and then iterated over it with recursive functions to build the tree-like structure. I didn't try the SQL way.
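For readers who want to see the recursive approach spelled out, here is a minimal sketch and not the poster's actual code. It assumes the rows have already been loaded as {group_id, parent_group_id} pairs and that root groups carry a parent_group_id of 0 (the real table may use NULL instead). The class name and the element name group are made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: turns flat {group_id, parent_group_id} rows into nested XML.
class GroupTreeXml {

    /** Each int[] is {group_id, parent_group_id}; parent 0 marks a root (assumption). */
    static String toXml(List<int[]> rows) {
        // Index the children by parent id so every recursion step is a map lookup
        Map<Integer, List<Integer>> childrenByParent = new LinkedHashMap<>();
        for (int[] row : rows) {
            childrenByParent.computeIfAbsent(row[1], k -> new ArrayList<>()).add(row[0]);
        }
        StringBuilder xml = new StringBuilder();
        for (int rootId : childrenByParent.getOrDefault(0, new ArrayList<>())) {
            appendGroup(rootId, childrenByParent, xml);
        }
        return xml.toString();
    }

    // Writes one group element and recurses into its sub-groups.
    // Assumes the data contains no cycles; a cycle would recurse forever.
    private static void appendGroup(int id, Map<Integer, List<Integer>> childrenByParent,
                                    StringBuilder xml) {
        List<Integer> children = childrenByParent.get(id);
        if (children == null) {
            xml.append("<group id=\"").append(id).append("\"/>");
            return;
        }
        xml.append("<group id=\"").append(id).append("\">");
        for (int childId : children) {
            appendGroup(childId, childrenByParent, xml);
        }
        xml.append("</group>");
    }
}
```

With the Vehicle example from the question (rows {1,0}, {2,1}, {3,1}), the group with id 1 becomes the outer element and groups 2 and 3 are nested inside it.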

Mapping Lucene search results with a relational database

I have an application which holds a list of documents. These documents are
indexed using Lucene.
I can search on keywords of the documents. I loop the TopDocs and get the
ID field (of each Lucene doc) which is related to the ID column in my
relational database. From all these ID's, I create a list.
After building the list of ID's, I make a database query which is executing
the following SELECT statement (JPA):
SELECT d FROM Document d WHERE d.id IN (##list of ID's retrieved from Lucene##)
This list of documents is sent to the view (GUI).
But, some documents are private and should not be in the list. Therefore,
we have some extra statements in the SELECT query to do some security
checks:
SELECT d FROM Document d WHERE d.id IN (##list of ID's retrieved from Lucene##)
AND d.rule1 = foo
AND d.rule2 = bar
But now I'm wondering: I'm using the speed of Lucene to quickly search
documents, but I still have to do the SELECT query. So I'm losing
performance on this one :-( ...
Does Lucene have some component which does this mapping for you? Or are
there any best practices on this issue? How do big projects map the Lucene
results to the relational database, given that the view should be rendering
the results?
Many thanks!
Jochen
Some suggestions:
In Lucene, you can use a Filter to narrow down the search result according to your rules.
Store the primary key or a unique key (an ID, a serial number, etc.) in Lucene. Then, your relational database can make unique key lookups and make things very fast.
Lucene can act as storage of your documents too. If applicable in your case, you just retrieve the individual documents' content from Lucene and don't need to go to your relational database.
Why don't you use Lucene to index the table in the database? That way you can do everything in one Lucene query.
If this is a big issue, maybe it's worth looking at ManifoldCF, which supports document-level security and might fit your needs.
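One detail worth keeping in mind when combining the ID list from Lucene with a database lookup: an IN query does not guarantee that the rows come back in the order of the ID list, so the relevance ranking has to be restored afterwards. Below is a minimal sketch of that step, assuming the database rows have already been loaded into a map keyed by ID. The class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: puts database rows back into the order the search returned.
class RankedMerge {

    /**
     * rankedIds: IDs in the order the search engine ranked them.
     * rowsById:  rows the database returned for those IDs, with private
     *            documents already excluded by the security conditions.
     */
    static <T> List<T> inSearchOrder(List<Long> rankedIds, Map<Long, T> rowsById) {
        List<T> ordered = new ArrayList<>();
        for (Long id : rankedIds) {
            T row = rowsById.get(id);
            if (row != null) { // missing means filtered out or no longer in the database
                ordered.add(row);
            }
        }
        return ordered;
    }
}
```

The security check stays in the database query here; the helper only drops the IDs that did not survive it and keeps the ranking for the rest.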

How to identify duplicate items gathered from multiple feeds and link to them in a Database

I have a database storing details of products which are taken from many sites and gathered through the individual sites' APIs. When I call the feed, the details are stored in a database table.
The problem I'm having is that because the exact same product is listed on many sites by the seller I end up having duplicate items in my database, and then when I display them on a web page there are many duplicates.
The problem is that the item doesn't have any obvious unique identifier, it has specific details of the item (of which there could be many), and then a description of the item from the seller.
What I would like is for the item to show up once, and then give the user details of where else the item is listed.
How would I identify the duplicates that have come in, without slowing down the entire database? How would I also then pick one advert from all the duplicates, and then store what other sites the advert is displayed on?
Thanks for any help.
The problem is two-fold, and both parts are on your side. When you figure out how to deal with that, writing the code into a program (Java or SQL) will be easy. I'll name them first and then identify the solutions.
For some unknown reason, you have assumed that collecting product descriptions from multiple sites will not collect the same product.
You are used to the common and nonsensical Id column, which is fine when you are working with spreadsheets prototyping functionality; but it is nowhere near what is required for a database or Development-level functionality. Your users (or boss) have naturally expected database capability from the database, and you did not provide any. (And no, it does not require fuzzy string logic or magic of any kind.)
Solution
This is a condensed version of the IDEF1X Standard for modelling Relational Databases; the portion re Identifiers.
You need to think in database terms, and think about the database tables you need to perform your function, which means you are not allowed to use an auto-increment Id column. That column gives a spreadsheet a RowId, but it does not imply anything about the content of the table, or the columns that identify a product.
And you cannot simply rip data off another website; you need to think about what your website requires for products. What does your company understand a product to be, and how does it identify a product?
Identify all the columns and datatypes for the columns.
Identify which columns are mandatory and which are optional.
Identify which are strong Identifiers. Eg. Manufacturer and Model; the short Product Name, not the long Description (or maybe for your company, the long description is an Identifier). Work with your users, and work that out.
You will find you actually have a small cluster of tables around Product, such as Manufacturer, ProductType, perhaps Vendor, etc.
Organise those tables, and Normalise them, so that you are not duplicating data.
Make sure you treat those Identifiers with a bit of respect. Choose which will be unique. Those are Candidate Keys. You need at least one per table, and there will be more than one in Product. All the Identifiers that will be searched on will need to be indexed (Unique or not). Note that Unique Indices cannot be Nullable, so you cannot choose an optional column.
What makes a single Unique Identifier for Product may not be a single column. That's ok, we can evaluate multiple columns for keys in databases; they are called Compound Keys.
Take the best, most stable (one which will not change) Unique Identifier, one of the Candidate Keys, and make that the Primary Key.
If, and only if, the Unique Identifier, the Primary Key, which may be a Compound Key, is very long, and therefore unsuitable for a Primary Key, which is migrated to the child tables, then add a Surrogate Key. That will be the Id column. Note that that is an additional column and additional Index. It is not a substitute for the Identifiers of Product, the Candidate Keys; they cannot be removed.
So far we have a Product database on your company's side of the web that is meaningful to it. Now we are in a position to evaluate products from the other side of the web; and when we do, we have a framework on our side that is strong, against which we can measure the rubbish that we get from the other side of the web.
Feeds
You need a WebSite table to manage the feeds.
There will be an Associative table (many-to-many) between Product and WebSite. Let's call it ProductSite. It will contain only our ProductId, and the WebSiteCode. It may contain Price. The contents are valid for a single feed cycle.
Load each feed into a staging database or schema, an incoming ProductIn table, maybe one per source website. This is just the flat file from the external source. Add a column IsValid and set the Default to true.
Then write some SQL that compares that ProductIn table, with its loose and floppy contents, with our Product table with its strong Identifiers.
The way I would do it is in several waves of separate checks, each marking the rows that fail by setting IsValid to false. At the end, insert the IsValid rows into our ProductSite.
You might be lucky, and get away with an optimistic approach. That is, as long as you find a match on a few important columns, the match is valid. (reverse the Default and update of the IsValid boolean).
This is the proc that will require some back-and-forth work until it settles down. That is why you need to work with your users re the Identifiers. The goal is to exclude no external products, but your starting point will exclude many. That will include going back to our Product table and improving the content (values in the rows) of the Identifiers, and other relevant columns that you use to identify matching rows.
Repeat for each WebSite.
Now populate our website from our Product table, using information that we are confident about, and show which sites have the product for sale from ProductSite.
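To make the matching step a little more concrete, here is a minimal in-memory sketch of one validation wave. It is only an illustration: the answer describes doing this in SQL against staging tables, whereas this example uses plain Java collections. It also assumes, purely for the example, that Manufacturer and Model together form the compound key, and that feed rows arrive as {siteCode, manufacturer, model} arrays. All class and method names are invented.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: matches incoming feed rows against our own product keys.
class FeedMatcher {

    /** Compound key from two identifier columns (an assumed choice of Identifiers). */
    static String key(String manufacturer, String model) {
        return normalise(manufacturer) + "|" + normalise(model);
    }

    // Trims, collapses whitespace and lower-cases so trivial differences do not block a match
    private static String normalise(String s) {
        return s.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    /**
     * productKeys: keys of the rows in our Product table.
     * feedRows:    each String[] is {siteCode, manufacturer, model}.
     * rejected:    receives the rows that fail the check (IsValid = false).
     * Returns product key -> site codes, i.e. the content for ProductSite.
     */
    static Map<String, Set<String>> match(Set<String> productKeys,
                                          List<String[]> feedRows,
                                          List<String[]> rejected) {
        Map<String, Set<String>> productSite = new LinkedHashMap<>();
        for (String[] row : feedRows) {
            String k = key(row[1], row[2]);
            if (productKeys.contains(k)) {
                productSite.computeIfAbsent(k, x -> new TreeSet<>()).add(row[0]);
            } else {
                rejected.add(row);
            }
        }
        return productSite;
    }
}
```

Each product then appears once, with the set of sites that list it, and the rejected rows show where the Identifiers in our own Product table still need work.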
I don't think this is a code or database problem (yet). You say:
The problem is that the item doesn't have any obvious unique identifier
You need to work out what that uniqueness is before you can ask a computer to do that for you. It sounds like you need some sort of fuzzy string similarity algorithm.
Some examples of data that you consider duplicates might help.

persisting dynamic properties and query

I have a requirement to implement a contact database. This contact database is special in that the user should be able to dynamically (at runtime) add properties he/she wants to track about a contact. Some of these properties are strings, others are numbers or dates. Some of the properties have pre-defined values, others are free fields, etc. The user also wants to be able to query such a structure fast and easily. The database needs to easily handle 500 000 contacts, each having around 10 properties.
This leads to a dynamic property model with a Contact class that has dynamic properties.
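import java.util.Collection;
import java.util.Map;

class Contact {
    // each dynamic property maps to the values stored for it on this contact
    private Map<DynamicProperty, Collection<DynamicValue>> propertiesAndValues;
    // other useful methods
}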
The question is how can I store such a structure in "some database" - it does not have to be RDBMS so that I can easily express queries such as
Get all contacts whose name starts with Martin, they are from Company of size 5000 or less, order by time when this contact was inserted in a database, only first 100 results (provide pagination), where each of these segments correspond to a dynamic property.
I need:
filtering - equal, partial equal, (bigger, smaller for integers, dates) and maybe aggregation - but it is not necessary at this point
sorting
pagination
I was considering an RDBMS, but this leads more or less to the following structure, which is quite hard to query and tends to be slow for this amount of data
contact(id serial pk,....);
dynamic_property(dp_id serial pk, ...);
--only one of the values is not empty
dynamic_property_value(dpv_id serial pk, dynamic_property_fk int, value_integer int, date_value timestamp, text_value text);
contact_properties(pav_id serial pk, contact_id_fk int, dynamic_property_fk int);
property_and_its_value(pav_id_fk int, dpv_id int);
I consider following options:
store contacts in RDBMS and use Lucene for querying - is there anything that would help with this?
store dynamic properties as XML in an RDBMS and use XPath support - unfortunately it seems to be pretty slow for 500000 contacts
use another database - MongoDB or Jackrabbit to store this information
Which way would you go and why?
Wikipedia has a great entry on Entity-Attribute-Value modeling which is a data modeling technique for representing entities with arbitrary properties. It's typically used for clinical data, but might apply to your situation as well.
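To give a feel for how querying an Entity-Attribute-Value layout works, here is a small in-memory sketch. It is only an illustration of the idea, not a storage recommendation: each row holds one (contact, attribute, value) triple, and each filter condition produces a set of contact ids that must be intersected with the others. The class, field and attribute names are made up for the example.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;

// Hypothetical sketch of an entity-attribute-value layout held in memory.
class EavSketch {

    /** One row of the EAV table: a single property value of a single contact. */
    static final class Row {
        final long contactId;
        final String attribute;
        final Object value; // String, Number or date, depending on the property

        Row(long contactId, String attribute, Object value) {
            this.contactId = contactId;
            this.attribute = attribute;
            this.value = value;
        }
    }

    /** Ids of the contacts whose given attribute has a value accepted by the condition. */
    static Set<Long> matching(List<Row> rows, String attribute, Predicate<Object> condition) {
        Set<Long> ids = new TreeSet<>();
        for (Row row : rows) {
            if (row.attribute.equals(attribute) && condition.test(row.value)) {
                ids.add(row.contactId);
            }
        }
        return ids;
    }
}
```

A query such as "name starts with Martin and company size is 5000 or less" calls matching once per condition and combines the results with retainAll. In a relational EAV schema, each such condition typically becomes an additional join or subquery on the value table, which is part of why these queries get harder to write as conditions are added.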
Have you considered using Lucene for your querying needs? You could probably get away with just using Lucene and store all your data in the index. Although I wouldn't recommend using Lucene as your only persistence store.
Alternatively, you could use Lucene along with a RDBMS and take advantage of something like Compass.
You could try other kinds of databases, like CouchDB, which is a distributed, document-oriented database.
If you want a dumb solution, for your contacts table you could add some 50 columns like STRING_COLUMN1, STRING_COLUMN2... up to 10, DATE_COLUMN1..DATE_COLUMN10. You have another DESCRIPTION column. So if a row has a name which is a string, then STRING_COLUMN1 stores the value of your name and the DESCRIPTION column value would be "STRING_COLUMN1-NAME". In this case querying can be a bit tricky. I know many purists laugh at this, but I have seen a similar requirement solved this way in one of the apps :)
