Insert unique entities to the sqlite DB - java

There will be up to 100k entities in an SQLite DB with the following structure:
ID (Numeric, PK)
KEY (Varchar, Unique/PK)
Other fields, mostly varchars
I have a list of about 100-1k entities. I want to add to the DB only those entities whose KEY is not yet present in the DB, and show the list of added ones.
Example
As a relevant example you may consider something like this: a library with books as entities. Each Book has a globally unique ISBN number (KEY) and a unique id (ID) in the library catalog.
Some person brings a set of books to the library. The library checks the books by ISBN in the catalog, takes the 'new' books and shows the person the list of taken books.
Some thoughts how it can be achieved:
1) select all KEYs from the DB, put them into a [hash]set, and in a loop verify that the KEY of each new entity does not exist in the set.
2) like #1, but instead of selecting all KEYs, select only the KEYs that are present both in the DB and in the list
3) in loop check existence of entity with additional select query
4) enable constraints in the DB, check existence by catching exceptions
All of them have their own disadvantages, I believe. Can you suggest something better?
For now I'm asking mostly about 'best practices'. I believe any of these approaches will work for my case without huge performance issues (no actual tests for now, I'm just in the analysis phase), but how should it be done better?
Code will be in Java, I plan to use simple DAOs with JDBC, but if someone suggests Hibernate as an alternative approach, I will reconsider my thoughts.

Your suggested solutions are all valid and possible, although #1 and #4 are rarely good ideas. #2 and #3 are solutions that are often used. To choose between the two you need to answer the following questions:
What will be the typical use case for your application? (sometimes you just get 1 new book, you will always get hundreds of books at once, mixed)
Based on what metrics will this code be "judged", what are you trying to optimize for? (readability/maintainability, performance, etc)
Based on the answers to these questions above you can either pick #2 or #3.
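As an illustration of option #2 with plain JDBC, here is a minimal sketch. The table name `entity`, the column name `entity_key` and the batch size of 500 are assumptions made for this example, not part of the question. The batching is there because SQLite limits the number of bound parameters per statement (999 by default in older builds), and the incoming list may hold up to 1k keys.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class NewEntityFilter {
    // Hypothetical batch size, chosen to stay below SQLite's bound-parameter limit.
    static final int BATCH = 500;

    // Builds "SELECT entity_key FROM entity WHERE entity_key IN (?, ?, ...)" with n placeholders.
    static String inQuery(int n) {
        StringBuilder sql = new StringBuilder("SELECT entity_key FROM entity WHERE entity_key IN (");
        for (int i = 0; i < n; i++) {
            sql.append(i == 0 ? "?" : ", ?");
        }
        return sql.append(")").toString();
    }

    // Option #2: fetch only those incoming keys that already exist in the DB.
    static Set<String> existingKeys(Connection con, List<String> keys) throws SQLException {
        Set<String> found = new HashSet<>();
        for (int from = 0; from < keys.size(); from += BATCH) {
            List<String> batch = keys.subList(from, Math.min(from + BATCH, keys.size()));
            try (PreparedStatement ps = con.prepareStatement(inQuery(batch.size()))) {
                for (int i = 0; i < batch.size(); i++) {
                    ps.setString(i + 1, batch.get(i));
                }
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        found.add(rs.getString(1));
                    }
                }
            }
        }
        return found;
    }

    // Keeps incoming keys that are not in the DB yet, preserving order
    // and dropping duplicates within the incoming list itself.
    static List<String> selectNew(List<String> incoming, Set<String> existing) {
        Set<String> seen = new HashSet<>(existing);
        List<String> fresh = new ArrayList<>();
        for (String key : incoming) {
            if (seen.add(key)) {
                fresh.add(key);
            }
        }
        return fresh;
    }
}
```

The result of `selectNew` is both the list to insert and the list to show to the user. It is probably wise to run the select and the inserts in one transaction and to keep the unique constraint on the key column as a safety net.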


Java - Google App Engine - modelling graph structures in Google Datastore

Google App Engine offers the Google Datastore as the only NoSQL database (I think it is based on BigTable).
In my application I have a social-like data structure and I want to model it as I would do in a graph database. My application must save heterogeneous objects (users,files,...) and relationships among them (such as user1 OWNS file2, user2 FOLLOWS user3, and so on).
I'm looking for a good way to model this typical situation, and I thought of two families of solutions:
List-based solutions: Any object contains a list of other related objects and the object presence in the list is itself the relationship (as Google said in the JDO part https://developers.google.com/appengine/docs/java/datastore/jdo/relationships).
Graph-based solution: Both nodes and relationships are objects. The objects exist independently from the relationships while each relationship contain a reference to the two (or more) connected objects.
What are strong and weak points of these two approaches?
About approach 1: This is the simplest approach one can think of, and it is also presented in the official documentation, but:
Each directed relationship makes the object record grow: are there any limitations on the number of possible relationships, given for instance by the object size limit?
Is that a JDO feature, or does the datastore structure itself allow this approach to be implemented naturally?
The relationship search time will increase with the list size; is this solution suitable for a large number (millions) of relationships?
About approach 2: Each relationship can have a higher level of characterization (it is an object and it can have properties). And I think memory size is not a Google problem, but:
Each relationship requires its own record, so the search time for each related pair will increase as the total number of relationships increases. Is this suitable for a large number of relationships (millions, billions)? I.e. does Google have good tricks to search among records if they are well structured? Or will I soon be in a situation in which, if I want to find a friend of User1 called User4, I have to wait seconds?
On the other side each object doesn't increase in dimension as new relationships are added.
Could you help me find other important points about the two approaches, so that I can choose the best model?
First, the search time in the Datastore does not depend on the number of entities that you store, only on the number of entities that you retrieve. Therefore, if you need to find one relationship object out of a billion, it will take the same time as if you had just one object.
Second, the list approach has a serious limitation called "exploding indexes". You will have to index the property that contains a list to make it searchable. If you ever use a query that references more than just this property, you will run into this issue - google it to understand the implications.
Third, the list approach is much more expensive. Every time you add a new relationship, you will rewrite the entire entity at considerable writing cost. The reading costs will be higher too if you cannot use keys-only queries. With the object approach you can use keys-only queries to find relationships, and such queries are now free.
UPDATE:
If your relationships are directed, you may consider making Relationship entities children of User entities, and using an Object id as an id for a Relationship entity as well. Then your Relationship entity will have no properties at all, which is probably the most cost-efficient solution. You will be able to retrieve all objects owned by a user using keys-only ancestor queries.
I have an AppEngine application and I use both approaches. Which is better depends on two things: the practical limits of how many relationships there can be and how often the relationships change.
NOTE 1: My answer is based on experience with Objectify and heavy use of caching. Mileage may vary with other approaches.
NOTE 2: I've used the term 'id' instead of the proper DataStore term 'name' here. Name would have been confusing and id matches objectify terms better.
Consider users linked to the schools they've attended and vice versa. In this case, you would do both. Link the users to schools with a variation of the 'List' method. Store the list of school ids the user attended as a UserSchoolLinks entity with a different type/kind but with the same id as the user. For example, if the user's id = '6h30n' store a UserSchoolLinks object with id '6h30n'. Load this single entity by key lookup any time you need to get the list of schools for a user.
However, do not do the reverse for the users that attended a school. For that relationship, insert a link entity. Use a combination of the school's id and the user's id for the id of the link entity. Store both id's in the entity as separate properties. For example, the SchoolUserLink for user '6h30n' attending school 'g3g0a3' gets id 'g3g0a3~6h30n' and contains the fields: school=g3g0a3 and user=6h30n. Use a query on the school property to get all the SchoolUserLinks for a school.
Here's why:
Users will see their schools frequently but change them rarely. Using this approach, the user's schools will be cached and won't have to be fetched every time they hit their profile.
Since you will be getting the user's schools via a key lookup, you won't be using a query. Therefore, you won't have to deal with eventual consistency for the user's schools.
Schools may have many users that attended them. By storing this relationship as link entities, we avoid creating a huge single object.
The users that attended a school will change a lot. This way we don't have to write a single, large entity frequently.
By using the id of the User entity as the id for the UserSchoolLinks entity we can fetch the links knowing just the id of the user.
By combining the school id and the user id as the id for the SchoolUserLink, we can do a key lookup to see if a user and school are linked. Once again, no need to worry about eventual consistency for that.
By including the user id as a property of the SchoolUserLink we don't need to parse the SchoolUserLink object to get the id of the user. We can also use this field to check consistency between both directions and have a fallback in case somehow people are attending hundreds of schools.
Downsides:
1. This approach violates the DRY principle. Seems like the least of evils here.
2. We still have to use a query to get the users who attended a school. That means dealing with eventual consistency.
Don't forget to update the UserSchoolLinks entity and add/remove the SchoolUserLink entity in a single transaction.
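The id scheme for the link entity can be sketched in plain Java. This is only an in-memory stand-in for the datastore (a map keyed by the link id), not Objectify or Datastore API code; the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SchoolUserLinks {
    // In-memory stand-in for the datastore: link id -> {school, user}.
    private final Map<String, String[]> linksById = new LinkedHashMap<>();

    // Composite id as described above: school id, '~', user id.
    static String linkId(String schoolId, String userId) {
        return schoolId + "~" + userId;
    }

    // Stores both ids as separate fields next to the composite id.
    void link(String schoolId, String userId) {
        linksById.put(linkId(schoolId, userId), new String[] {schoolId, userId});
    }

    // Equivalent of a key lookup: no query is needed to test one pair.
    boolean isLinked(String schoolId, String userId) {
        return linksById.containsKey(linkId(schoolId, userId));
    }

    // Equivalent of the query on the school property.
    List<String> usersOf(String schoolId) {
        List<String> users = new ArrayList<>();
        for (String[] link : linksById.values()) {
            if (link[0].equals(schoolId)) {
                users.add(link[1]);
            }
        }
        return users;
    }
}
```

In the real datastore, `isLinked` corresponds to the strongly consistent key lookup and `usersOf` to the eventually consistent query mentioned in the downsides.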
Your question is quite complex, but I will try to explain the best solution (I will answer in Python, but the same can be done in Java).
class User(db.User):
    followers = db.StringListProperty()
Adding a follower is simple:
user = User.get(key)
user.followers.append(str(followerKey))
This allows fast queries for both the followers and the followed:
User.all().filter('followers', followerKey) # -> followed
This query is costly in I/O, so you can make it faster, at the price of more complexity and more I/O writes:
class User(db.User):
    followers = db.StringListProperty()
    follows = db.StringListProperty()
However, this complicates changes: deleting a User requires updating follows as well, so you need 2 writes.
You can also store relationships as separate entities, but that is the worst scenario since it is more complex than the second example with followers and follows. Keep in mind that an entity can be up to 1 MB; this is usually not a limit, but it can be.

relationship and build database

For an exercise I need to build something like this:
For a course I need to create a review that is made up out of certain reviewlines and feedbackscores.
This review object (unique instance) needs to be filled in by a list of customers.
Depending on the course the review is for, the review will change (e.g. for one course the number of reviewlines and feedbackscores will change). Each customer can be enrolled in more than one course and each review is specific to him.
Now how should I model the relationship between the "review" object (unique instance) and "customer" if I want to use JPA to save all this to the db?
A customer can have more than one review he/she needs to fill in.
A certain review object needs to be filled in by many customers (but this is a review object with a certain build [reviewlines and feedbackscores]) and is unique for each of them.
Maybe I am making it too complex, but what is the best way to build this?
Try the following:
I think it's covered all your design points.
I am trying to read between the lines of your comments, and I think you want to implement a system where you capture a number of 'rules' for the Review (I'm guessing, but examples may be that reviews can be up to n lines, there must be at least m CustomerReviews before the Review gains a degree of quality). If this is indeed the case, I have created a ReviewTemplate class:
ReviewTemplate would have attributes/columns for each value you would need. These attributes/columns are duplicated on Review
Populate ReviewTemplate with a number of rows, then create a row in Course and link it to one ReviewTemplate
When a Course needs a Review, copy the fields from the ReviewTemplate into the Review
In Java, implement the business rules for Review using the copied values - not the values on ReviewTemplate.
Why copy the values? Well, I bet that at some point, users want to edit the ReviewTemplate table. If so, what happens to the Review objects using the edited ReviewTemplates? Does the modified value on ReviewTemplate somehow invalidate past Reviews and break your business logic? No, because you copied the rule values to Review and so past Reviews will not change.
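The effect of copying can be shown with a small plain-Java sketch (no JPA involved). The rule `maxLines` is a hypothetical example of a rule value, and the class shapes are assumptions for illustration only.

```java
class ReviewTemplate {
    int maxLines; // hypothetical rule value that users may edit later

    ReviewTemplate(int maxLines) {
        this.maxLines = maxLines;
    }
}

class Review {
    // Copied once at creation time and never read from the template again.
    final int maxLines;

    Review(ReviewTemplate template) {
        this.maxLines = template.maxLines;
    }

    // Business rule evaluated against the copied value.
    boolean accepts(int lineCount) {
        return lineCount <= maxLines;
    }
}
```

A Review created before an edit of the template keeps its old rule value, while a Review created afterwards picks up the new one.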
EDIT: Answers to specific questions
How do you see the duplicating? I can create an entity ReviewTemplate with the specified attributes. In this entity there will be a relationship with reviewlines and feedbackscores.
I see each ReviewTemplate as holding prototypical values for a particular 'type' of Review, which just might include a default reviewLine (but that might not make sense) and a default feedbackScore. When you create the Review, you would do the following:
Instantiate the Review and populate with values from ReviewTemplate
Instantiate as many CustomerReview objects as you need, linking them to the relevant Customer objects (I infer this step from your previous comments. It might also make sense to omit this step until a Customer voluntarily elects to review a Course)
(If appropriate) Populate the CustomerReview attribute feedbackScore with the default value from ReviewTemplate
Instantiate CustomerReviewLine records as appropriate
If you follow this approach, you do not need to add a relationship between ReviewTemplate and CustomerReviewLines.
When I e.g. state that customers 1 to 4 need to fill in the review 4 specific "objects" need to be created that will hold the information and also 4 sets of the needed reviewlines and feedbackscores need to be created so they all can hold the information.
Absolutely.
I just don't know how to implement this in a JPA structure so the information is held in the db ... ?
JPA allows you to attack the problem in many ways, but the best practice is to manually create both the DB schema and the Java classes (eg see https://stackoverflow.com/a/2585763/1395668). Therefore, for each entity in the diagram, you need to:
Write SQL DDL statements to create the table, columns, primary key and foreign keys, and
Write a Java class annotated with @Entity. Within the class, you will also need to annotate the id (primary key) with @Id and the relationships with @OneToMany or @ManyToOne (there are additional parameters in the annotations to set as well).
Now, on the JPA side, you can do things like:
ReviewTemplate template = course.getReviewTemplate(); //assuming the variable course
Review review = new Review();
review.setCourse(course);
review.setRuleOne(template.getRuleOne());
// Copy other properties here
EntityManager em = // get the entity manager here
em.persist(review);
// Assume a set or list of customers
for (Customer customer : customers) {
    CustomerReview cr = new CustomerReview();
    cr.setReview(review);
    cr.setCustomer(customer);
    cr.setFeedbackScore(template.getDefaultFeedbackScore());
    // set other CustomerReview properties here
    em.persist(cr);
    // You can create CustomerReviewLine here as well
}
If written inside a standard EJB Session Bean, this will all be nicely transacted, and you will have all your new records committed into the DB.
EDIT 2: Additional question
(I'm assuming that the second comment completely supersedes the first)
So when I create a reviewtemplate and I link it to a bunch of customers I write the template to the db and create a bunch of reviews based on the template but linked to the specific customer and with his own unique reviewlines and feedbackscores. As I see it now, the reviewline (more a question or description) is the same for each review (of a template); it is only the score that changes between the customers
I finally think I understand ReviewLine. I had thought it was a place where the Customer enters lines of text that comprise the CustomerReview. I now believe that ReviewLine is a specific question that the Customer is asked, and for which the Customer provides a feedbackScore.
With this understanding, here is an updated ER/Class diagram.
Note that there are some significant changes - there are several more tables:
ReviewLineTemplate provides a place for template questions to be stored on a ReviewTemplate
When a Review is instantiated/inserted (which is a copy of a specific ReviewTemplate), the ReviewLineTemplates are copied as ReviewLines. The copy operation allows two important features:
On creation, a Review and its ReviewLines can be customized without affecting the ReviewTemplate or ReviewLineTemplate
Over time, the ReviewTemplate and ReviewLineTemplate can be updated, edited and continually improved, without changing the questions that the Customer has already answered. If CustomerFeedbackScore were linked to ReviewLineTemplate directly, then editing the ReviewLineTemplate would change the question that the Customer has answered, silently invalidating the feedbackScore.
FeedbackScore has been moved to a join-table between ReviewLine and CustomerReview.
Note that this model is fully normalised, which makes it more 'correct' but harder to build a GUI for. A common 'optimization' might be to introduce:
10 (say) columns on ReviewTemplate and Review called reviewLine1 through reviewLine10.
10 (say) columns on CustomerReview called feedbackScore1 through feedbackScore10.
Remove the ReviewLineTemplate, ReviewLine and CustomerReviewLine tables
Doing so is not normalised, and may introduce a set of other problems. YMMV
The structure of data always depends on the requirements, and there never exists a "one-and-only" solution. So, do you need maximised atomicity or a high performance data system?
The fastest and easiest solution would be not using a database, but hash tables. In your case, you could have something like 3 hash tables for customer, review, and probably another one for the n:n relationship. Or if you're using a database, you could just store an array of the review-primary-keys in one field in the customer table.
However, we all learn in school to do atomicity, so let's do that (I just write the primary/foreign keys!):
Customer (unique_ID, ...)
Review (unique_ID, ...)
Customer_Review (customer_ID, review_ID, ...) --> n:n-relationship
The Customer_Review describes the n:n-relationship between customers and reviews. But if there is only one customer per review possible, you'll do that like this:
Customer (unique_ID, ...)
Review (pk: unique_ID, fk: customer_ID, ...) --> 1:n-relationship
However, I suggest you need to learn ERM as a good starting point: http://en.wikipedia.org/wiki/Entity_relationship_model
You need a ManyToMany relation :
One customer -> several reviews.
One review -> several customers.
So you will have 3 tables in your database schema : Customer, review and a junction table with the customer ID and the review ID.
See Wikipedia : Many to Many
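To illustrate what the junction table does, here is a minimal in-memory sketch in plain Java. It is not JPA code; the class name and the integer ids are assumptions for this example. Each element of the set plays the role of one row (customer ID, review ID), and the set itself behaves like a compound primary key.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class CustomerReviewTable {
    // One entry per junction row (customerId, reviewId); duplicates are rejected.
    private final Set<Map.Entry<Integer, Integer>> rows = new HashSet<>();

    // Returns false if this pair is already linked.
    boolean add(int customerId, int reviewId) {
        return rows.add(Map.entry(customerId, reviewId));
    }

    // One customer -> several reviews.
    Set<Integer> reviewsOf(int customerId) {
        Set<Integer> result = new TreeSet<>();
        for (Map.Entry<Integer, Integer> row : rows) {
            if (row.getKey() == customerId) {
                result.add(row.getValue());
            }
        }
        return result;
    }

    // One review -> several customers.
    Set<Integer> customersOf(int reviewId) {
        Set<Integer> result = new TreeSet<>();
        for (Map.Entry<Integer, Integer> row : rows) {
            if (row.getValue() == reviewId) {
                result.add(row.getKey());
            }
        }
        return result;
    }
}
```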

Multiple link between 2 tables columns...bad design approach?

Hello
I'm developing a webapp and I'm about to design the database, and I came across this question.
Is it a bad design to have more than 1 link between 2 tables?
The picture i have posted is a very quick and small example just to make it clearer.
If I want to display all the offers, I would also like to show the products they are related to. I could retrieve the product name by creating a product instance from the product id field in the offer object, but that would require more queries and more typing. So I was thinking of including the product name directly in the offer, so that I can simply retrieve all offers and, if needed, look up the related product in the DB by its product id.
Would you consider this a bad approach?
I have been looking around for cases like mine but I have only found approaches with 1 connection between tables (with unique ids)
Thank you
This is data denormalization. Don't do it (in most cases). Design the tables correctly, let the database do the correct work with the correct queries. It will be much easier to maintain and work with over time.
Use the ID in the offer table to lookup the product name in the products table.
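A small plain-Java sketch of the same idea, with two maps standing in for the two tables (the names are assumptions for this example). The offer stores only the product id, so a renamed product shows up correctly in every offer without touching the offers.

```java
import java.util.HashMap;
import java.util.Map;

class Catalog {
    // Stand-in for the products table: product id -> product name.
    final Map<Integer, String> productNames = new HashMap<>();
    // Stand-in for the offers table: offer id -> product id (no copy of the name).
    final Map<Integer, Integer> offerProduct = new HashMap<>();

    // Equivalent of joining offer to product on the product id.
    String productNameForOffer(int offerId) {
        return productNames.get(offerProduct.get(offerId));
    }
}
```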
Yes, this would be bad.
Removing the redundant name would be proper normalization. Just link on the id; that will be the best way.
In general there is no limit to the number of relationships (links) between two tables, but each relationship should have a unique meaning. If, in your example, Product Name and Product ID are both candidate keys and each name always has the same ID then you should definitely not have two PK/FK relationships between these tables.
@Joe is right. Normalization is the best approach to take with database design. The reason being so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships.

How to identify duplicate items gathered from multiple feeds and link to them in a Database

I have a Database storing details of products which are taken from many sites, and gathered through the individual sites API's. When I call the feed, the details are stored in a database table.
The problem I'm having is that because the exact same product is listed on many sites by the seller I end up having duplicate items in my database, and then when I display them on a web page there are many duplicates.
The problem is that the item doesn't have any obvious unique identifier, it has specific details of the item (of which there could be many), and then a description of the item from the seller.
What I would like is for the item to show up once, and then give the user details of where else the item is listed.
How would I identify the duplicates that have come in, without slowing down the entire database? How would I also then pick one advert from all the duplicates, and then store what other sites the advert is displayed on.
Thanks for any help.
The problem is two-fold, and both parts are on your side. Once you figure out how to deal with them, writing the code into a program (Java or SQL) will be easy. I'll name them first and then identify the solutions.
For some unknown reason, you have assumed that collecting product descriptions from multiple sites will not collect the same product.
You are used to the common and nonsensical Id column, which is fine when you are working with spreadsheets prototyping functionality; but it is nowhere near what is required for a database or Development-level functionality. Your users (or boss) have naturally expected database capability from the database, and you did not provide any. (And no, it does not require fuzzy string logic or magic of any kind.)
Solution
This is a condensed version of the IDEF1X Standard for modelling Relational Databases; the portion re Identifiers.
You need to think in database terms, and think about the database tables you need to perform your function, which means you are not allowed to use an auto-increment Id column. That column gives a spreadsheet a RowId, but it does not imply anything about the content of the table, or the columns that identify a product.
And you cannot simply rip data off another website, you need to think about what your website requires for products. What does your company understand a product to be, and how does it identify a product ?
Identify all the columns and datatypes for the columns.
Identify which columns are mandatory and which are optional.
Identify which are strong Identifiers. Eg. Manufacturer and Model; the short Product Name, not the long Description (or may be for your company, the long description is an Identifier). Work with your users, and work that out.
You will find you actually have a small cluster of tables around Product, such as Manufacturer, ProductType, perhaps Vendor, etc.
Organise those tables, and Normalise them, so that you are not duplicating data.
Make sure you treat those Identifiers with a bit of respect. Choose which will be unique. Those are Candidate Keys. You need at least one per table, and there will be more than one in Product. All the Identifiers that will be searched on will need to be indexed (Unique or not). Note that Unique Indices cannot be Nullable, so you cannot choose an optional column.
What makes a single Unique Identifier for Product may not be a single column. That's ok, we can evaluate multiple columns for keys in databases; they are called Compound Keys.
Take the best, most stable (one which will not change) Unique Identifier, one of the Candidate Keys, and make that the Primary Key.
If, and only if, the Unique Identifier, the Primary Key, which may be a Compound Key, is very long, and therefore unsuitable for a Primary Key, which is migrated to the child tables, then add a Surrogate Key. That will be the Id column. Note that that is an additional column and additional Index. It is not a substitute for the Identifiers of Product, the Candidate Keys; they cannot be removed.
So far we have a Product database on your company's side of the web that is meaningful to it. Now we are in a position to evaluate products from the other side of the web; and when we do, we have a framework on our side that is strong, against which we can measure the rubbish that we get from the other side of the web.
Feeds
You need a WebSite table to manage the feeds.
There will be an Associative table (many-to-many) between Product and WebSite. Let's call it ProductSite. It will contain only our ProductId and the WebSiteCode. It may contain Price. The contents are valid for a single feed cycle.
Load each feed into a staging database or schema, an incoming ProductIn table, maybe one per source website. This is just the flat file from the external source. Add a column IsValid and set the Default to true.
Then write some SQL that compares that ProductIn table, with its loose and floppy contents, with our Product table with its strong Identifiers.
The way I would do it is several waves of separate checks, each marking the rows that fail by setting IsValid to false. At the end, insert the rows that are still IsValid into our ProductSite.
You might be lucky, and get away with an optimistic approach. That is, as long as you find a match on a few important columns, the match is valid. (reverse the Default and update of the IsValid boolean).
This is the proc that will require some back-and-forth work, until it settles down. That is why you need to work with your users re the Identifiers. The goal is to exclude no external products, but your starting point will exclude many. That will include going back to our Product table and improving the content (values in the rows) of the Identifiers, and other relevant columns that you use to identify matching rows.
Repeat for each WebSite.
Now populate our website from our Product table, using information that we are confident about, and show which sites have the product for sale from ProductSite.
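The matching step is described above in terms of SQL; the same idea can be sketched in Java for illustration. This sketch assumes that Manufacturer and Model together form the compound Identifier of Product and that each feed row carries those two columns. Both are assumptions for the example, and the normalisation rule (trim, lower-case, collapse whitespace) is deliberately simple.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class FeedMatcher {
    // Compound key from the two assumed Identifier columns, normalised for comparison.
    static String key(String manufacturer, String model) {
        return norm(manufacturer) + "|" + norm(model);
    }

    private static String norm(String s) {
        return s.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    // products: compound key -> our ProductId. Each feed row is {manufacturer, model}.
    // Returns the distinct ProductIds found in the feed, i.e. the rows for ProductSite.
    // Feed rows without a match are simply skipped (the IsValid = false case).
    static List<Integer> match(Map<String, Integer> products, List<String[]> feedRows) {
        List<Integer> matched = new ArrayList<>();
        for (String[] row : feedRows) {
            Integer productId = products.get(key(row[0], row[1]));
            if (productId != null && !matched.contains(productId)) {
                matched.add(productId);
            }
        }
        return matched;
    }
}
```

Running `match` once per WebSite gives the ProductId list for that site's ProductSite rows; duplicates within a feed collapse onto a single ProductId.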
I don't think this is a code or database problem (yet). You say:
The problem is that the item doesn't have any obvious unique identifier
You need to work out what that uniqueness is before you can ask a computer to do that for you. It sounds like you need some sort of fuzzy, string similarity algorithm.
Some examples of data that you consider duplicates might help.

persisting dynamic properties and query

I have a requirement to implement contact database. This contact database is special in a way that user should be able to dynamically (on runtime) add properties he/she wants to track about the contact. Some of these properties are of type string, other numbers and dates. Some of the properties have pre-defined values, others are free fields etc.. User wants to be also able to query such structure fast and easily. The database needs to handle easily 500 000 contacts each having around 10 properties.
It leads to dynamic property model having Contact class with dynamic properties.
class Contact{
    private Map<DynamicProperty, Collection<DynamicValue>> propertiesAndValues;
    // other useful methods
}
The question is how can I store such a structure in "some database" - it does not have to be RDBMS so that I can easily express queries such as
Get all contacts whose name starts with Martin, they are from Company of size 5000 or less, order by time when this contact was inserted in a database, only first 100 results (provide pagination), where each of these segments correspond to a dynamic property.
I need:
filtering - equal, partial equal, (bigger, smaller for integers, dates) and maybe aggregation - but it is not necessary at this point
sorting
pagination
I was considering an RDBMS, but this leads more or less to the structure below, which is quite hard to query and tends to be slow for this amount of data
contact(id serial pk,....);
dynamic_property(dp_id serial pk, ...);
--only one of the values is not empty
dynamic_property_value(dpv_id serial pk, dynamic_property_fk int, value_integer int, date_value timestamp, text_value text);
contact_properties(pav_id serial pk, contact_id_fk int, dynamic_property_fk int);
property_and_its_value(pav_id_fk int, dpv_id int);
I consider following options:
store contacts in RDBMS and use Lucene for querying - is there anything that would help with this?
Store dynamic properties as XML in an RDBMS and use its XPath support - unfortunately it seems to be pretty slow for 500000 contacts
use another database - MongoDB or Jackrabbit to store this information
Which way would you go and why?
Wikipedia has a great entry on Entity-Attribute-Value modeling which is a data modeling technique for representing entities with arbitrary properties. It's typically used for clinical data, but might apply to your situation as well.
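To make the Entity-Attribute-Value idea concrete, here is a minimal in-memory sketch in Java. Each row is one (contact id, attribute, value) triple; the class and method names are made up for this example and it says nothing about performance at 500 000 contacts.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class EavStore {
    // One row per (contactId, attribute, value) triple.
    private final List<Object[]> rows = new ArrayList<>();

    // Adds a property to a contact at runtime; no schema change is needed.
    void put(long contactId, String attribute, Object value) {
        rows.add(new Object[] {contactId, attribute, value});
    }

    // Pivots the triples of one contact back into an attribute -> value map.
    Map<String, Object> load(long contactId) {
        Map<String, Object> contact = new LinkedHashMap<>();
        for (Object[] row : rows) {
            if ((Long) row[0] == contactId) {
                contact.put((String) row[1], row[2]);
            }
        }
        return contact;
    }

    // Ids of contacts whose string-valued attribute starts with the given prefix.
    List<Long> idsWhereStartsWith(String attribute, String prefix) {
        List<Long> ids = new ArrayList<>();
        for (Object[] row : rows) {
            if (attribute.equals(row[1]) && row[2] instanceof String
                    && ((String) row[2]).startsWith(prefix)) {
                ids.add((Long) row[0]);
            }
        }
        return ids;
    }
}
```

The sketch also shows the usual cost of EAV: every filter on a property is a scan (or join) over the triples, and combining several filters means intersecting the id lists.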
Have you considered using Lucene for your querying needs? You could probably get away with just using Lucene and store all your data in the index. Although I wouldn't recommend using Lucene as your only persistence store.
Alternatively, you could use Lucene along with a RDBMS and take advantage of something like Compass.
You could try other kind of databases like CouchDB which is a document oriented db and is distributed
If you want a dumb solution, for your contacts table you could add some 50 columns like STRING_COLUMN1, STRING_COLUMN2... up to 10, DATE_COLUMN1..DATE_COLUMN10. You have another DESCRIPTION column. So if a row has a name which is a string, then STRING_COLUMN1 stores the value of the name and the DESCRIPTION column value would be "STRING_COLUMN1-NAME". In this case querying can be a bit tricky. I know many purists laugh at this, but I have seen a similar requirement solved this way in one of the apps :)
