I have a requirement to implement a contact database. It is special in that the user should be able to add, dynamically (at runtime), the properties he or she wants to track about a contact. Some of these properties are strings, others are numbers or dates. Some have predefined values, others are free-form fields, etc. The user also wants to be able to query this structure quickly and easily. The database needs to comfortably handle 500 000 contacts, each with around 10 properties.
This leads to a dynamic property model: a Contact class with dynamic properties.
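import java.util.Collection;
import java.util.Map;

class Contact {
    // Each dynamic property maps to the values this contact holds for it.
    private Map<DynamicProperty, Collection<DynamicValue>> propertiesAndValues;
    // other useful methods
}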
The question is how to store such a structure in "some database" (it does not have to be an RDBMS) so that I can easily express queries such as:
Get all contacts whose name starts with Martin and who are from a company of size 5000 or less, ordered by the time the contact was inserted into the database, returning only the first 100 results (with pagination), where each of these criteria corresponds to a dynamic property.
I need:
filtering: equal, partial match, greater than or less than (for integers and dates), and maybe aggregation, although that is not necessary at this point
sorting
pagination
I was considering an RDBMS, but that leads more or less to the structure below, which is quite hard to query and tends to be slow for this amount of data:
contact(id serial pk,....);
dynamic_property(dp_id serial pk, ...);
--only one of the values is not empty
dynamic_property_value(dpv_id serial pk, dynamic_property_fk int, value_integer int, date_value timestamp, text_value text);
contact_properties(pav_id serial pk, contact_id_fk int, dynamic_property_fk int);
property_and_its_value(pav_id_fk int, dpv_id int);
I am considering the following options:
Store contacts in an RDBMS and use Lucene for querying. Is there anything that would help with this?
Store dynamic properties as XML in the RDBMS and use its XPath support. Unfortunately, this seems to be pretty slow for 500 000 contacts.
Use another database, such as MongoDB or Jackrabbit, to store this information.
Which way would you go and why?
Wikipedia has a great entry on Entity-Attribute-Value modeling which is a data modeling technique for representing entities with arbitrary properties. It's typically used for clinical data, but might apply to your situation as well.
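To make the EAV idea concrete, here is a minimal in-memory Java sketch, not a database implementation. Each row is an (entity, attribute, value) triple. The class name EavSketch, the attribute names name, companySize and insertedAt, and the search signature are all assumptions made for illustration. It shows why the query from the question needs a pivot step before it can filter, sort and paginate.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory illustration of Entity-Attribute-Value rows.
final class EavSketch {

    // One EAV row: (entity id, attribute name, value).
    static final class Row {
        final long entityId;
        final String attribute;
        final Object value;

        Row(long entityId, String attribute, Object value) {
            this.entityId = entityId;
            this.attribute = attribute;
            this.value = value;
        }
    }

    // Pivot the EAV rows back into one attribute map per entity.
    static Map<Long, Map<String, Object>> pivot(List<Row> rows) {
        Map<Long, Map<String, Object>> byEntity = new LinkedHashMap<>();
        for (Row r : rows) {
            byEntity.computeIfAbsent(r.entityId, id -> new LinkedHashMap<>())
                    .put(r.attribute, r.value);
        }
        return byEntity;
    }

    // Name starts with prefix, company size <= maxSize, ordered by insertion
    // time; returns the entity ids of one zero-based page.
    static List<Long> search(List<Row> rows, String prefix, int maxSize,
                             int page, int pageSize) {
        List<Map.Entry<Long, Map<String, Object>>> hits = new ArrayList<>();
        for (Map.Entry<Long, Map<String, Object>> e : pivot(rows).entrySet()) {
            Object name = e.getValue().get("name");
            Object size = e.getValue().get("companySize");
            Object inserted = e.getValue().get("insertedAt");
            if (name instanceof String && ((String) name).startsWith(prefix)
                    && size instanceof Integer && (Integer) size <= maxSize
                    && inserted instanceof Long) {
                hits.add(e);
            }
        }
        hits.sort(Comparator.comparingLong(
                e -> (Long) e.getValue().get("insertedAt")));

        int from = Math.min(page * pageSize, hits.size());
        int to = Math.min(from + pageSize, hits.size());
        List<Long> ids = new ArrayList<>();
        for (Map.Entry<Long, Map<String, Object>> e : hits.subList(from, to)) {
            ids.add(e.getKey());
        }
        return ids;
    }
}
```

In a relational EAV schema the pivot step typically becomes one join (or subquery) per filtered attribute, which is where much of the query complexity comes from.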
Have you considered using Lucene for your querying needs? You could probably get away with just using Lucene and storing all your data in the index, although I wouldn't recommend using Lucene as your only persistence store.
Alternatively, you could use Lucene along with an RDBMS and take advantage of something like Compass.
You could try another kind of database, such as CouchDB, which is a distributed, document-oriented database.
If you want a dumb solution, you could add some 50 generic columns to your contacts table, such as STRING_COLUMN1 to STRING_COLUMN10, DATE_COLUMN1 to DATE_COLUMN10, and so on, plus a DESCRIPTION column. If a row has a name, which is a string, then STRING_COLUMN1 stores the name and the DESCRIPTION column records "STRING_COLUMN1-NAME". Querying can be a bit tricky in this case. I know many purists laugh at this, but I have seen a similar requirement solved this way in one of the apps :)
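The tricky part of querying that layout is translating a property name into its physical column. A minimal Java sketch of that translation follows. The class GenericColumnMapping, its method names, and the ';' separator between several "COLUMN-PROPERTY" entries are assumptions of this sketch, not something the approach above prescribes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for the generic-columns layout: parses DESCRIPTION
// metadata and translates a property name to the column that stores it.
final class GenericColumnMapping {
    private final Map<String, String> columnByProperty = new HashMap<>();

    // description looks like "STRING_COLUMN1-NAME;DATE_COLUMN1-BIRTHDAY".
    // The ';' separator is an assumption made for this sketch.
    GenericColumnMapping(String description) {
        for (String entry : description.split(";")) {
            int dash = entry.indexOf('-');
            if (dash > 0) {
                String column = entry.substring(0, dash).trim();
                String property = entry.substring(dash + 1).trim();
                columnByProperty.put(property, column);
            }
        }
    }

    // Returns the physical column for a logical property name.
    String columnFor(String property) {
        String column = columnByProperty.get(property);
        if (column == null) {
            throw new IllegalArgumentException("Unknown property: " + property);
        }
        return column;
    }

    // Builds a parameterized predicate; the value is bound separately,
    // for example through a JDBC PreparedStatement.
    String equalsPredicate(String property) {
        return columnFor(property) + " = ?";
    }
}
```

Because column names come only from the parsed metadata and values stay as bind parameters, the generated predicate never contains user-supplied text.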
I'm working on a project with an existing Cassandra database.
The schema looks like this:
partition key (big int)   clustering key1 (timestamp)   data (text)
=====================================================================
1                         2021-03-10 11:54:00.000       {a:"somedata", b:2, ...}
My question is: Is there any advantage to storing data in a JSON string?
Will it save some space?
So far I have only found disadvantages:
You cannot (easily) add or drop columns at runtime, since the application could overwrite the JSON string column.
Parsing the JSON string is currently the performance bottleneck.
No, there is no real advantage to storing JSON as a string in Cassandra unless the underlying data in the JSON is really schema-less. It will also not save space; in fact it will use more, because each item has to store a key and a value instead of just the value.
If you can, I would recommend mapping the keys to CQL columns so you can store the values natively and accessing the data is more flexible. Cheers!
Erick is spot-on-correct with his answer.
The only thing I'd add is that storing JSON blobs in a single column makes updates (even more) problematic. If you update a single JSON property, the whole column gets rewritten. Also, the original JSON blob is still there, just "obsoleted", until compaction runs. The only time storing a JSON blob in a single column makes any sense is if the properties don't change.
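A rough way to see the rewrite cost is to count the bytes an update has to write in each layout. The Java sketch below is only an illustration: the class BlobVersusColumns and its naive serializer are hypothetical, and the numbers ignore the per-cell metadata (such as write timestamps) that Cassandra stores in either case.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical comparison of payload bytes written per update.
final class BlobVersusColumns {

    // Naive JSON-like serialization for size comparison only (no escaping).
    static String toJson(Map<String, String> props) {
        StringBuilder sb = new StringBuilder("{");
        for (Map.Entry<String, String> e : props.entrySet()) {
            if (sb.length() > 1) {
                sb.append(',');
            }
            sb.append('"').append(e.getKey()).append("\":\"")
              .append(e.getValue()).append('"');
        }
        return sb.append('}').toString();
    }

    // Single JSON column: changing one property rewrites the whole blob.
    static int bytesWrittenAsBlob(Map<String, String> props,
                                  String key, String newValue) {
        Map<String, String> updated = new LinkedHashMap<>(props);
        updated.put(key, newValue);
        return toJson(updated).getBytes(StandardCharsets.UTF_8).length;
    }

    // One column per key: only the changed value is written.
    static int bytesWrittenAsColumn(String newValue) {
        return newValue.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

The gap grows with the number of properties in the blob, since every unchanged key and value is written again on each update.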
And I agree, mapping the keys to CQL columns is a much better option.
I don't disagree with the excellent and already accepted answer by @erick-ramirez.
However, there is often a good case to be made for using frozen UDTs instead of separate columns for related data that is only ever going to be set and retrieved at the same time and will not be specifically filtered on as part of your query.
The "frozen" part is important: it means less work for Cassandra, but it also means that you rewrite the whole value on each update.
This can give a large performance boost compared with a large number of columns. The nice ScyllaDB people have a great post on that:
If You Care About Performance, Employ User Defined Types
(I know Scylla DB is not exactly Cassandra but I've seen multiple articles that say the same thing about Cassandra)
One downside is that you add work to the application layer and sometimes mapping complex UDTs to your Java types will be interesting.
I am wondering how I would store my custom network level in a MySQL table. I could make four columns, 'level', 'exp', 'expreq' and 'total'. Only this will take up four columns, and as I am storing name, rank and other data in the same table it will be too many columns in the end. Are there better ways? Should I make another table?
In a relational data model, and for extensibility, you should put this in a separate table, so that the master table points to a detail table where you can have as many attributes as you need.
BUT
This has an obvious impact on memory when the table becomes large. In addition, this approach is often replaced by a less normalized version of the tables that introduces concepts like "Custom Fields".
OR
If it were me, and this table is going to be accessed from a programming language, I would store the values in JSON format in a very simple table and let the program take on the processing overhead.
I have a "variable" structure to store in a database table. By "variable" I mean a sequence of field/value pairs in which the "kind" of field determines the value type. I don't know the exact field order, and I don't know how many times fields can repeat. Sometimes a group of fields will repeat several times (it is a fiscal model).
Additional requirement: I need to map this variable data onto web page forms, handling some CRUD work.
JQuery-ui, Struts 2, Hibernate. Preferred DBMS: MySQL.
The solutions I thought of:
1. Vertical table. I could have some performance issues, which I could resolve with materialized views that "pivot" the rows into columns when I need to process data in bulk. I have not gone far in this direction, as it seems to be very expensive to develop.
2. LOB fields. Pack my columns into one of those, perhaps with a "mapping" table to decode each column. My idea is to pull out the searchable fields as "real" columns, leaving only the less interesting bulk of the data in the LOB so that it does not cause performance problems.
3. Or better, 2a: use XML inside the LOB field. This could make it more comfortable to pack and unpack the data, especially when mapping data to a web form.
What do you think? Also, is there some way to create automatic views from XML fields? Or, better, to map such data to a web form? I suspect Hibernate Tools won't work in any of the cases I described.
I hope I have been clear, it's still a bit confusing even to me :)
Your option 1 is the Entity-Attribute-Value antipattern.
See my answer to Product table, many kinds of product, each product has many parameters and my blog post EAV FAIL for alternatives and some reasons why EAV is wrong, at least for a relational database (I cover EAV in my book, SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming).
Also read this article about how a similar structure nearly doomed a company: Bad CaRMa.
Your options 2 & 3 are similar to what is described in How FriendFeed uses MySQL to store schema-less data. I don't know of any automatic way for an ORM to maintain that structure for you. You do have the chore of keeping your inverted index tables in sync with your LOB data.
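The "keep the index in sync" chore can be sketched in Java with two in-memory maps standing in for the LOB table and one index table. This is only an illustration of the bookkeeping; the class BlobStoreWithIndex and its methods are hypothetical, and a real implementation would perform both writes against the database.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for "entity LOB table + one index table".
final class BlobStoreWithIndex {
    // Entity table: id -> opaque property bag (the LOB).
    private final Map<String, Map<String, String>> entities = new HashMap<>();
    // Index table for one searchable field: field value -> entity ids.
    private final Map<String, Set<String>> index = new HashMap<>();
    private final String indexedField;

    BlobStoreWithIndex(String indexedField) {
        this.indexedField = indexedField;
    }

    // Every write to the LOB must also fix up the index entry.
    void put(String id, Map<String, String> properties) {
        Map<String, String> previous = entities.put(id, new HashMap<>(properties));
        if (previous != null) {
            removeFromIndex(id, previous.get(indexedField));
        }
        String value = properties.get(indexedField);
        if (value != null) {
            index.computeIfAbsent(value, v -> new HashSet<>()).add(id);
        }
    }

    // Lookups go through the index and never scan the LOBs.
    Set<String> findByIndexedField(String value) {
        Set<String> ids = index.getOrDefault(value, Collections.emptySet());
        return Collections.unmodifiableSet(ids);
    }

    private void removeFromIndex(String id, String oldValue) {
        if (oldValue == null) {
            return;
        }
        Set<String> ids = index.get(oldValue);
        if (ids != null) {
            ids.remove(id);
            if (ids.isEmpty()) {
                index.remove(oldValue);
            }
        }
    }
}
```

If the stale index entry were not removed on update, a lookup by the old value would keep returning the entity, which is exactly the inconsistency the answer warns about.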
I'm getting introduced to serialization and ran into some problems when pairing it with LinkedList.
Suppose I have the following table:
CREATE TABLE JAVA_OBJECTS (
ID BIGINT NOT NULL UNIQUE AUTO_INCREMENT,
OBJ_NAME VARCHAR(50),
OBJ_VALUE BLOB
);
And I'm planning to store 3 object types, so the table may look like this:
ID OBJ_NAME OBJ_VALUE
============================
1 Class1 BLOB
2 Class2 BLOB
3 Class1 BLOB
4 Class3 BLOB
5 Class3 BLOB
And I'll use 3 different LinkedLists to manage these objects.
I've been able to implement LoadFromTable() and StoreIntoTable(Class1 obj1).
My question is: if I change an attribute of a Class2 object in LinkedList<Class2>, how do I effect the change in the DB for this individual item? Also take into account that the order of the elements in the LinkedList may change.
Thanks : )
* EDIT
Yes, I understand that I'll have to delete or update a row in my DB table. But how do I keep track of WHICH row to update? I'm only storing the objects in the List, not their respective IDs in the table.
You'll have to store their IDs in the objects you are storing. However, I would suggest not trying to roll your own ORM system, and instead using something like Hibernate.
If you change an attribute of an object, or the order of the items, you will have to delete that row and insert the updated list again.
How do i effect the change in the DB for this individual item?
I hope I get you right. The SQL UPDATE and DELETE statements allow you to add a WHERE clause in which you choose the ID of the row to update.
e.g.
UPDATE JAVA_OBJECTS SET OBJ_NAME ="new name" WHERE ID = 2
EDIT:
To prevent problems with your IDs you could wrap your object:
class Wrapper {
int dbId;
Object obj;
}
And add them, instead of the 'naked' objects, to your LinkedList.
You can use the AUTO_INCREMENT attribute for your table and then use the mysql_insert_id() function to retrieve the ID assigned to the row added or updated by the last INSERT/UPDATE statement. Along with this, maintain a map (e.g. a HashMap) from the Java object to the ID. Using this map you can keep track of which row to delete or update.
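The object-to-ID map can be sketched as below. This is a hypothetical helper, not a full persistence layer: the class RowIdRegistry and its methods are invented for illustration, and it uses an IdentityHashMap rather than a plain HashMap so that lookups keep working after an object's attributes (and therefore possibly its hashCode) change.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical registry that remembers which JAVA_OBJECTS row each
// in-memory object was loaded from or stored into.
final class RowIdRegistry {
    // The statement that would be run with the serialized object and the
    // row ID once an object has changed (shown for context only).
    static final String UPDATE_SQL =
            "UPDATE JAVA_OBJECTS SET OBJ_VALUE = ? WHERE ID = ?";

    // Keyed by object identity, so mutating an object does not lose its ID.
    private final Map<Object, Long> rowIds = new IdentityHashMap<>();

    // Call after loading a row, or after an INSERT once the generated
    // key is known (with JDBC, typically via Statement.getGeneratedKeys()).
    void register(Object obj, long rowId) {
        rowIds.put(obj, rowId);
    }

    // Returns the row to update, regardless of the object's list position.
    long rowIdOf(Object obj) {
        Long id = rowIds.get(obj);
        if (id == null) {
            throw new IllegalStateException("Object was never stored");
        }
        return id;
    }

    // Call after deleting the row so the map does not leak entries.
    void forget(Object obj) {
        rowIds.remove(obj);
    }
}
```

Since the registry is independent of the LinkedList, reordering the list has no effect on which row an object maps to.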
Edit: See the answer to this question as well.
I think the real problem here is that you mix and match different levels of abstraction. When you store serialized Java objects in a relational database as BLOBs, you have to consider several drawbacks:
You lose interoperability. Applications written in languages other than Java are not able to read the data back. Even other Java applications have to have the class files of the serialized classes on their classpath.
Changing the class definitions of the stored classes will end up in a maintenance nightmare.
You give up the advantages of a relational database. Serialization hides the actual data from the database, so the database is presented only with a black box. You are unable to execute any meaningful query against the real data. All you have is the ID and a block of bytes.
You have to implement low-level data handling yourself. The database is made to handle your data effectively, but serialization hinders it from doing its job. So you are on your own, and you are running into that problem right now.
So in most cases you benefit from separation of concerns and from using the right tool for the job.
Here are some suggestions:
Separate the internal data handling inside your application from persistent storage. Design your database schema so that the built-in database features can handle the data efficiently. With a relational database like MySQL you can choose from different technologies, such as plain JDBC, object-relational mappers like JPA, or simple mappers like MyBatis. Separation here means not contaminating the database with implementation-specific concerns.
For example, if your Java application has a List of Person instances and each Person consists of a name and an age, then you would represent that list in a relational database as a table with a VARCHAR field for the name, a numeric field for the age, and maybe a third field for a unique key. Then the database is able to do what it does best: managing large amounts of data.
Inside your application you typically separate the persistent layer from the rest of your program containing the code to communicate with the database.
In some use cases a relational database may not be the appropriate tool. In a single-user desktop application with a small set of data, it may be best to simply serialize your Person list into a plain file and read it back at the next startup.
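For that small desktop case, a minimal Java sketch using standard Java serialization could look like this. The class PersonFileStore and its save and load methods are hypothetical names; Person follows the name-and-age example used above.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical file-based store for a small list of Person objects.
final class PersonFileStore {

    static final class Person implements Serializable {
        // Explicit version so compatible class changes do not break old files.
        private static final long serialVersionUID = 1L;
        final String name;
        final int age;

        Person(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    // Writes the whole list to the file, replacing previous content.
    static void save(Path file, List<Person> people) throws IOException {
        try (ObjectOutputStream out =
                     new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(new ArrayList<>(people));
        }
    }

    // Reads the list back, for example at the next startup.
    @SuppressWarnings("unchecked")
    static List<Person> load(Path file)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                     new ObjectInputStream(Files.newInputStream(file))) {
            return (List<Person>) in.readObject();
        }
    }
}
```

The drawbacks listed earlier still apply: only Java code with the Person class on its classpath can read this file back.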
But there are other ways to persist your data. Maybe some kind of object-oriented database is the right tool. In particular, I have experience with Fast Objects. Put simply, it is serialization on steroids. There is no need for a layer like JPA or JDBC between your application and your database. You are able to store the class instances directly in the database. But unlike the relational database with its BLOB field, the OODB knows your classes and the actual data and can benefit from that.
Another alternative may be JDBM or Berkeley DB.
So separation of concerns and choosing the right persistence strategy (and using it the right way) is a key concern for the success of your project. But doing it right is hard even for experienced developers.
I have a Database storing details of products which are taken from many sites, and gathered through the individual sites API's. When I call the feed, the details are stored in a database table.
The problem I'm having is that because the exact same product is listed on many sites by the seller I end up having duplicate items in my database, and then when I display them on a web page there are many duplicates.
The problem is that the item doesn't have any obvious unique identifier, it has specific details of the item (of which there could be many), and then a description of the item from the seller.
What I would like is for the item to show up once, and then give the user details of where else the item is listed.
How would I identify the duplicates that have come in, without slowing down the entire database? How would I also then pick one advert from all the duplicates, and then store what other sites the advert is displayed on.
Thanks for any help.
The problem is two-fold, and both parts are on your side. Once you figure out how to deal with them, writing the code into a program (Java or SQL) will be easy. I'll name them first and then identify the solutions.
For some unknown reason, you have assumed that collecting product descriptions from multiple sites will not collect the same product.
You are used to the common and nonsensical Id column, which is fine when you are working with spreadsheets prototyping functionality; but it is nowhere near what is required for a database or Development-level functionality. Your users (or boss) have naturally expected database capability from the database, and you did not provide any. (And no, it does not require fuzzy string logic or magic of any kind.)
Solution
This is a condensed version of the IDEF1X Standard for modelling Relational Databases; the portion re Identifiers.
You need to think in database terms, and think about the database tables you need to perform your function, which means you are not allowed to use an auto-increment Id column. That column gives a spreadsheet a RowId, but it does not imply anything about the content of the table, or the columns that identify a product.
And you cannot simply rip data off another website, you need to think about what your website requires for products. What does your company understand a product to be, and how does it identify a product ?
Identify all the columns and datatypes for the columns.
Identify which columns are mandatory and which are optional.
Identify which are strong Identifiers. Eg. Manufacturer and Model; the short Product Name, not the long Description (or may be for your company, the long description is an Identifier). Work with your users, and work that out.
You will find you actually have a small cluster of tables around Product, such as Manufacturer, ProductType, perhaps Vendor, etc.
Organise those tables, and Normalise them, so that you are not duplicating data.
Make sure you treat those Identifiers with a bit of respect. Choose which will be unique. Those are Candidate Keys. You need at least one per table, and there will be more than one in Product. All the Identifiers that will be searched on will need to be indexed (Unique or not). Note that Unique Indices cannot be Nullable, so you cannot choose an optional column.
What makes a single Unique Identifier for Product may not be a single column. That's ok, we can evaluate multiple columns for keys in databases; they are called Compound Keys.
Take the best, most stable (one which will not change) Unique Identifier, one of the Candidate Keys, and make that the Primary Key.
If, and only if, the Unique Identifier, the Primary Key, which may be a Compound Key, is very long, and therefore unsuitable for a Primary Key, which is migrated to the child tables, then add a Surrogate Key. That will be the Id column. Note that that is an additional column and additional Index. It is not a substitute for the Identifiers of Product, the Candidate Keys; they cannot be removed.
So far we have a Product database on your company's side of the web that is meaningful to it. Now we are in a position to evaluate products from the other side of the web; and when we do, we have a framework on our side that is strong, against which we can measure the rubbish that we get from the other side of the web.
Feeds
You need a WebSite table to manage the feeds.
There will be an Associative table (many-to-many) between Product and WebSite. Let's call it ProductSite. It will contain only our ProductId and the WebSiteCode. It may contain Price. The contents are valid for a single feed cycle.
Load each feed into a staging database or schema, an incoming ProductIn table, maybe one per source website. This is just the flat file from the external source. Add a column IsValid and set the Default to true.
Then write some SQL that compares that ProductIn table, with its loose and floppy contents, with our Product table with its strong Identifiers.
The way I would do it is in several waves of separate checks, each marking the rows that fail by setting IsValid to false. At the end, insert the IsValid rows into our ProductSite.
You might be lucky, and get away with an optimistic approach. That is, as long as you find a match on a few important columns, the match is valid. (reverse the Default and update of the IsValid boolean).
This is the proc that will require some back-and-forth work until it settles down. That is why you need to work with your users re the Identifiers. The goal is to exclude no external products, but your starting point will exclude many. That will include going back to our Product table and improving the content (values in the rows) of the Identifiers, and other relevant columns that you use to identify matching rows.
Repeat for each WebSite.
Now populate our website from our Product table, using information that we are confident about, and show which sites have the product for sale from ProductSite.
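The feed-matching steps above can be sketched in Java as follows. This is an in-memory illustration under stated assumptions, not the SQL proc itself: the class FeedMatcher, the choice of Manufacturer plus Model as the compound key, and the normalization rule (trim, collapse whitespace, lower-case) are all hypothetical and would be settled with your users.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical matcher: compares staged feed rows (ProductIn) against our
// Product identifiers and fills ProductSite for the rows that pass.
final class FeedMatcher {

    // One row of the staging table; IsValid defaults to true.
    static final class FeedRow {
        final String webSiteCode;
        final String manufacturer;
        final String model;
        boolean isValid = true;

        FeedRow(String webSiteCode, String manufacturer, String model) {
            this.webSiteCode = webSiteCode;
            this.manufacturer = manufacturer;
            this.model = model;
        }
    }

    private static String normalize(String s) {
        return s == null
                ? ""
                : s.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    // Compound key built from the assumed strong Identifiers.
    static String key(String manufacturer, String model) {
        return normalize(manufacturer) + "|" + normalize(model);
    }

    // productIdByKey stands in for our Product table.
    // Returns ProductSite: ProductId -> WebSiteCodes that list the product.
    static Map<Integer, Set<String>> match(Map<String, Integer> productIdByKey,
                                           List<FeedRow> productIn) {
        Map<Integer, Set<String>> productSite = new HashMap<>();
        for (FeedRow row : productIn) {
            // Wave 1: mandatory Identifier columns must be present.
            if (normalize(row.manufacturer).isEmpty()
                    || normalize(row.model).isEmpty()) {
                row.isValid = false;
                continue;
            }
            // Wave 2: the compound key must match a Product we know.
            Integer productId =
                    productIdByKey.get(key(row.manufacturer, row.model));
            if (productId == null) {
                row.isValid = false;
                continue;
            }
            productSite.computeIfAbsent(productId, id -> new TreeSet<>())
                       .add(row.webSiteCode);
        }
        return productSite;
    }
}
```

Rows left with isValid set to false are the ones to review with your users when tuning the Identifiers.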
I don't think this is a code or database problem (yet). You say:
The problem is that the item doesn't have any obvious unique identifier
You need to work out what that uniqueness is before you can ask a computer to do that for you. It sounds like you need some sort of fuzzy string similarity algorithm.
Some examples of data that you consider duplicates might help.
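As one example of a fuzzy string similarity measure, here is a small Java sketch of Levenshtein edit distance with a normalized similarity score. It is only one possible choice, picked for illustration; the class name StringSimilarity and the normalization by the longer string's length are assumptions, and whether it suits your data depends on what your duplicates actually look like.

```java
// Hypothetical helper: Levenshtein distance and a 0..1 similarity score.
final class StringSimilarity {

    // Minimum number of single-character insertions, deletions and
    // substitutions needed to turn a into b (two-row dynamic programming).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // 1.0 means identical, 0.0 means nothing in common at any position.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0;
        }
        return 1.0 - (double) levenshtein(a, b) / longest;
    }
}
```

Two descriptions could then be treated as candidate duplicates when their similarity exceeds a threshold you choose from real examples.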