Dynamically create search criteria and results using Java and sql database - java

I am currently working with a Java based web application (JSF) backed by Hibernate that has a variety of different search pages for different areas.
A search page contains a search fields section, which a user can customize the search fields that they are interested in. There are a range of different search field types that can be added (exact text, starts with, contains, multi-select list boxes, comma separated values, and many more). Search fields are not required to be filled in and are ignored, where as some other search fields require a different search field to have a value for this search field to work.
We currently use a custom search object per area that is specific to that area and has hard coded getter and setter search fields.
public interface Search {
SearchFieldType getSearchPropertyOne();
void setSearchPropertyOne(SearchFieldType searchPropertyOne);
AnotherSearchFieldType getSearchPropertyTwo();
void setSearchPropertyTwo(AnotherSearchFieldType searchPropertyTwo);
...
}
In this example, SearchFieldType and AnotherSearchFieldType represent different search types like a TextSearchField or a NumericSearchField which has a search type (Starts with, Contains, etc.) or (Greater Than, Equals, Less Than, etc.) respectively and a search value that they can enter or leave empty (to ignore the search field).
We use this search object to prepare a Criteria object
The search results section is a table that can also be customized by the user to contain only columns of the result object that they are interested in. Most columns can be ordered ascending or descending.
We back our results in a Result object per result which also hard codes the columns that can be displayed. This table is backed by hibernate annotations, but we are trying to use flat data instead of allowing other hibernate backed objects to minimize lazy joining data.
#Entity(table = "result_view")
public interface Result {
#Column(name = "result_field_one")
Long getResultFieldOne();
void setResultFieldOne(Long resultFieldOne);
#Column(name = "result_field_two")
String getResultFieldTwo();
void setResultFieldTwo(String resultFieldTwo);
...
}
The search page is backed by a view in our database which handles the joining to all the tables needed for every possible outcome. This view has gotten pretty massive and we take a huge performance hit for every search, even when a user only really wants to search on one field and display a few columns because we have upwards of thirty search field options and thirty different columns they can display and this is all backed by the one view.
On top of this, users request new search fields and columns all the time that they would like added to the page. We end up having to alter the search and result objects as well as the backing view to make these changes.
We are trying to look into this matter and find alternatives to this. One approach mentioned was to create different views that we dynamically choose based on the fields searched on or displayed in the results table. The different views might join different columns and we pick and choose which view we need for any given search.
I'm trying to think about the problem a different way. I think it might be better to not use a view and instead dynamically join tables we need based on what search fields and result columns are requested. I also feel that the search and result objects should not contain hard coded getters/setters and should instead be a collection of search fields and a collection (or map) of result columns. I have yet to completely flesh out my idea.
Would hibernate still be a valid solution to this issue? I wouldn't want to have to create a Result object used in a hibernate criteria since they result columns can be different. Both search fields and/or result columns might require joining tables.
Is there a framework I could use that might help solve the problem? I've been trying to look for something, and the closest thing I have found is SqlBuilder.
Has anyone else solved a similar problem dynamically?
I would prefer not to reinvent the wheel if a solution already exists.
I apologize that this ended up as a wall of text. This is my first stackoverflow post, and I wanted to make sure I thoroughly defined my problem.
Thanks in advance for your answers!

I don't fully understand the problem. But JPA Criteria API seems very flexible, which can be used to build query based on user-submitted filtering conditions.

Related

Hibernate pagination with ____ToMany mapping

I'm writing this on the fly on my phone, so forgive the crappy code samples.
I have entities with a manytomany relationship:
#JoinTable(name="foo", #JoinColum="...", #InverseJoinColumn="...")
#ManyToMany
List list = new ArrayList();
I want their data to be retrieved in a paginated way.
I know about setFirstResult and setMaxResults. Is there a way to use this with the mapping? As in, I retrieve the object and get the list filled with contents equal to the amount of records for a single page, with the appropriate offset.
I guess I'm just unclear of the best way to do this. I could just manually use hibernate criteria to have the effect, but I feel thats missing the API. I have this mapping, I want to see if there's a way to use it in a paginated way.
PS. If this is impractical, just say. Also, if it is, can I still use the mapping to add new entries to the join table. As in, if the entity is a persisted entity in the DB, but I haven't fetched the manytomany list, can I add something new to it and when its persisted with cascade all it'll be added to the join table without clearing the other entries?
The type of the relationship between entities that are part of your query isn't that important. There are a couple of ways to tackle this.
If your database supports the LIMIT keyword in it's queries, you would be able to use it to get data sets, assuming you sort your data. Note that if your data changes while your user is navigating between pages, you might see some duplication or miss some records. You'll be stuck having to rewrite if your database changes to one that doesn't have the LIMIT keyword.
If you need to freeze the data at the point of the original query you need to use a 3rd party framework or write your own to fetch a list of Ids for your query then split up that list and fetch by id in a subset for pagination. This is more reliable can be made to work for any database.
Displaytag is a data paging framework I've used and that I therefore can tell you works well for large datasets. It's also one of the older solutions for this problem and is not part of an extended framework.
http://displaytag.sourceforge.net/11/tut_externalSortAndPage.html
Table sorter is another one I came across. This one uses JQuery and fetches the entire data set in one query, so strictly speaking it doesn't meet your "fetches the data in a paginated way" criteria. (This might not be appropriate for large sets).
http://tablesorter.com/docs/
This tutorial might be helpful:
http://theopentutorials.com/examples/java-ee/jsp/pagination-in-servlet-and-jsp/
If you're already using a framework take a look at whether that framework has tackled pagination:
Spring MVC provides a data pager
http://blog.fawnanddoug.com/2012/05/pagination-with-spring-mvc-spring-data.html
GWT provides a data pager:
http://www.gwtproject.org/javadoc/latest/com/google/gwt/user/cellview/client/SimplePager.html
The following refrences might be helpful too:
JDBC Pagination
which also points to:
http://java.avdiel.com/Tutorials/JDBCPaging.html

Database Search with key words using jpa

I'm doing college work where I have to search by keywords. My entity is called Position and I'm using MySQL. The fields that I need to search are:
    - date
    - positionCode
    - title
    - location
    - status
    - company
    - tecnoArea
I need to search the same word in all of these fields. To this end, I used criteria API to create a dynamic query. It is the same word for several fields and it should get the maximum possible results. Do you have any advice about how to optimize the search on the database. Should I do several queries?
EDIT
I will use an OR constraint.
If you will need to find the key word at any position within the data you will need to use LIKE with wildcards, eg. title LIKE '%manager%'. Since date and positionCode (presumably a numeric type) are not likely to contain the key word, to achieve a very small performance gain, I would omit searching these columns for the key word. Your query is going to need to do a serial read, which means that all rows in the table will need to be brought into main memory to evaluate and retrieve the result set of your query. Given a serial read is going to happen anyway, I do not think there is too much you can do to optimize the query when searching multiple columns. I am not familiar with the "criteria api to create dynamic queries", but using dynamic queries in other systems is non-optimal - they must be parsed and evaluated every time the are run and most query optimize-rs cannot make use of the statistics for cost-based optimization to improve performance like they can with explicitly defined SQL.
Not sure what your database is.
If it is Oracle, you can use Oracle text.
The below link might be useful :
http://swiss-army-development.blogspot.com/2012/02/keyword-search-via-oracle-text.html

Google Search API: Loading all datastore data into document builder for full text search on all records

I have been following the tutorial regarding the Google Search API at https://developers.google.com/appengine/docs/java/search/overview. The information I have found is very clear on how to build the document and load it into an index. What I am not sure of is how to load the datastore data into the document.
What am trying to achieve is a simple %LIKE% query on a few fields. For example, I am working on a music library. If the user types in "glory", then I would like to use the Search API to return all entities with "glory" somewhere in the title. I have implemented the "starts with" work around by adding the search text to "\uFFFD", however, I find this insufficient. My users will be very novice, and it would also be helpful if they didn't have to pick a field as in a traditional search. So full text search seems the solution.
Here are my questions:
Should each record in my datastore be a document? Or all the records into one document? I have a pretty well fixed datastore size of only 1000 records. Could anyone provide an example of the correct method?
I would like to return the entire datastore entity (it's only 8 fields) as an Iterable of the type of my entity. Do we specify each field we need to return? The example just says:
for (ScoredDocument scoredDocument : results) {
// process scoredDocument
}
Does anybody have an example of what comes out of the stored document? Is it exactly what we put in or must you identify each field again? Or an example of processes a ScoredDocument returning a datastore entities?
If anybody could help fill in these blanks for me, I would appreciate it.
Thank you for looking at this with me.
What am trying to achieve is a simple %LIKE% query on a few fields
In order to achieve this you need to "tokenize" your records name, GAE provides FULL TEXT SEARCH so in order for you to get partial matches you need to add partial matches for every record so:
If your record's name is "Glory" you should add the tokens for "G","Gl","Glo","Glor","y","ry","ory","lory"
Here's a very basic implementation I use to provide partial search results (only for "starts with" not implementing "end with")
public void addField(String name, String value, boolean tokenize) {
addField(Field.newBuilder().setName(name).setText(value));
if (tokenize) {
for (int i = startTokenIndex ; i < value.length() ;i++) {
addField(Field.newBuilder().setName("token" + (lastTokenIndex++))
.setText(value.substring(0, i)));
}
}
}
Should each record in my datastore be a document?
Yes. you could even match the document ID with the entity's datastore ID for quick matching. (or you can just add it as a separate field)
I would like to return the entire datastore entity (it's only 8
fields) as an Iterable of the type of my entity. Do we specify each
field we need to return?
You need to store your entity's ID in your document, that way when your search returns a set of documents you just retrieve all entities with those IDs.
Does anybody have an example of what comes out of the stored document?
Is it exactly what we put in or must you identify each field again? Or
an example of processes a ScoredDocument returning a datastore
entities?
Documents return all fields you stored in them, plus a lot of data like scoring, id, etc. The "processing" in your case would consist of getting the entity id form the Document.
If you are certain your records wont grow above 1000 you could virtually store everything within your search index. Just bear in mind the index is not designed for that and will face some serious limitations when scaling, which the datastore obviously doesn't.

How to identify duplicate items gathered from multiple feeds and link to them in a Database

I have a Database storing details of products which are taken from many sites, and gathered through the individual sites API's. When I call the feed, the details are stored in a database table.
The problem I'm having is that because the exact same product is listed on many sites by the seller I end up having duplicate items in my database, and then when I display them on a web page there are many duplicates.
The problem is that the item doesn't have any obvious unique identifier, it has specific details of the item (of which there could be many), and then a description of the item from the seller.
What I would like is for the item to show up once, and then give the user details of where else the item is listed.
How would I identify the duplicates that have come in, without slowing down the entire database? How would I also then pick one advert from all the duplicates, and then store what other sites the advert is displayed on.
Thanks for any help.
The problem is two-fold, and both are on your side. When you figure out how to deal with that, writing the code into a program (Java or SQL will be easy). I'll name them first and then identify the solutions.
For some unknown reason, you have assumed that collecting product descriptions from mulitple sites will not collect the same product.
You are used to the common and nonsensical Id column, which is fine when you are working with spreadsheets prototyping functionality; but it is nowhere near what is required for a database or Development-level functionality. Your users (or boss) have naturally expected database capability from the database, and you did not provide any. (And no, it does not require fuzzy string logic or magic of any kind.)
Solution
This is a condensed version of the IDEF1X Standard for modelling Relational Databases; the portion re Identifiers.
You need to think in database terms, and think about the database tables you need to perform your function, which means you are not allowed to use an auto-increment Id column. That column gives a spreadsheet a RowId, but it does not imply anything about the content of the table, or the columns that identify a product.
And you cannot simply rip data off another website, you need to think about what your website requires for products. What does your company understand a product to be, and how does it identify a product ?
Identify all the columns and datatypes for the columns.
Identify which columns are mandatory and which are optional.
Identify which are strong Identifiers. Eg. Manufacturer and Model; the short Product Name, not the long Description (or may be for your company, the long description is an Identifier). Work with your users, and work that out.
You will find you actually have a small cluster of tables around Product, such as Manufacturer, ProductType, perhaps Vendor, etc.
Organise those tables, and Normalise them, so that you are not duplicating data.
Make sure you treat those Identifiers with a bit of respect. Choose which will be unique. Those are Candidate Keys. You need at least one per table, and there will be more than one in Product. All the Identifiers that will be searched on will need to be indexed (Unique or not). Note that Unique Indices cannot be Nullable, so you cannot choose an optional column.
What makes a single Unique Identifier for Product may not be a single column. That's ok, we can evaluate multiple columns for keys in databases; they are called Compound Keys.
Take the best, most stable (one which will not change) Unique Identifier, one of the Candidate Keys, and make that the Primary Key.
If, and only if, the Unique Identifier, the Primary Key, which may be a Compound Key, is very long, and therefore unsuitable for a Primary Key, which is migrated to the child tables, then add a Surrogate Key. That will be the Id column. Note that that is an additional column and additional Index. It is not a substitute for the Identifiers of Product, the Candidate Keys; they cannot be removed.
So far we have a Product database on your companies side of the web, that is meaningful to it. Now we are in a position to evaluate products from the other side of the web; and when we do, we have a framework on our side that is strong, against which we can measure the rubbish that we get from the other side of the web.
Feeds
You need a WebSite table to manage the feeds.
There will be an Associative table (many-to-many) between Product and WebSite. Let's call it ProductSite. It will contain only our ProductId, and the WebSiteCode. It may containPrice`. The contents are valid for a single feed cycle.
Load each feed into a staging database or schema, an incoming ProductIn table, maybe one per source website. This is just the flat file from the external source. Add a column IsValid and set the Default to true.
Then write some SQL that compares that ProductIn table, with its loose and floppy contents, with our Product table with its strong Identifiers.
The way I would do it is, several waves of separate checks, each marking the rows that fail, with IsValid to false. At the end Insert the IsValid rows into our ProductSite.
You might be lucky, and get away with an optimistic approach. That is, as long as you find a match on a few important columns, the match is valid. (reverse the Default and update of the IsValid boolean).
This is the proc that will require some back-and-forth work, until it settles down. That is why you need to work with your users re the Indentifiers. The goal is to exclude no external products, but your starting point will exclude many. That will include going back to our Product table and improving the content (values in the rows) of the Identifiers, and other relevant columns that you use to identify matching rows.
Repeat for each WebSite.
Now populate our website from our Product table, using information that we are confident about, and show which sites have the product for sale from ProductSite.
I don't think this is a code or database problem (yet). You say:
The problem is that the item doesn't have any obvious unique identifier
You need to work out what that uniqeness is before you can ask a computer to do that for you. It sounds like you need some sort of fuzzy, string similarity algorithm.
Some examples of data that you consider duplicates might help.

persisting dynamic properties and query

I have a requirement to implement contact database. This contact database is special in a way that user should be able to dynamically (on runtime) add properties he/she wants to track about the contact. Some of these properties are of type string, other numbers and dates. Some of the properties have pre-defined values, others are free fields etc.. User wants to be also able to query such structure fast and easily. The database needs to handle easily 500 000 contacts each having around 10 properties.
It leads to dynamic property model having Contact class with dynamic properties.
class Contact{
private Map<DynamicProperty, Collection<DynamicValue> values> propertiesAndValues;
//other userfull methods
}
The question is how can I store such a structure in "some database" - it does not have to be RDBMS so that I can easily express queries such as
Get all contacts whose name starts with Martin, they are from Company of size 5000 or less, order by time when this contact was inserted in a database, only first 100 results (provide pagination), where each of these segments correspond to a dynamic property.
I need:
filtering - equal, partial equal, (bigger, smaller for integers, dates) and maybe aggregation - but it is not necessary at this point
sorting
pagination
I was considering RDBMS, but this leads more less to this structure which is quite hard to query and it tends to be slow for this amount of data
contact(id serial pk,....);
dynamic_property(dp_id serial pk, ...);
--only one of the values is not empty
dynamic_property_value(dpv_id serial pk, dynamic_property_fk int, value_integer int, date_value timestamp, text_value text);
contact_properties(pav_id serial pk, contact_id_fk int, dynamic_propert_fk int);
property_and_its_value(pav_id_fk int, dpv_id int);
I consider following options:
store contacts in RDBMS and use Lucene for querying - is there anything that would help with this?
Store dynamic properties as XML and store it to rdbms and use xpath support - unfortunatelly it seems to be pretty slow for 500000 contacts
use another database - Mango DB or Jackrabbit to store this information
Which way would you go and why?
Wikipedia has a great entry on Entity-Attribute-Value modeling which is a data modeling technique for representing entities with arbitrary properties. It's typically used for clinical data, but might apply to your situation as well.
Have you considered using Lucene for your querying needs? You could probably get away with just using Lucene and store all your data in the index. Although I wouldn't recommend using Lucene as your only persistence store.
Alternatively, you could use Lucene along with a RDBMS and take advantage of something like Compass.
You could try other kind of databases like CouchDB which is a document oriented db and is distributed
If you want a dumb solution, for your contacts table you could add some 50 columns like STRING_COLUMN1, STRING_COLUMN2... upto 10, DATE_COLUMN1..DATE_COLUMN10. You have another DESCRIPTION column. So if a row has a name which is a string then STRING_COLUMN1 stores the value of your name and the DESCRIPTION column value would be "STRING_COLUMN1-NAME". In this case querying can be a bit tricky. I know many purists laugh at this, but I have seen a similar requirement solved this way in one of the apps :)

Categories