Single data column vs multiple columns in Cassandra


I'm working on a project with an existing Cassandra database.
The schema looks like this:
partition key (bigint) | clustering key1 (timestamp) | data (text)
1 | 2021-03-10 11:54:00.000 | {a:"somedata", b:2, ...}
My question is: is there any advantage to storing data in a JSON string?
Will it save some space?
So far I have only found disadvantages:
You cannot (easily) add or drop columns at runtime, since the application could overwrite the JSON string column.
Parsing the JSON string is currently the performance bottleneck.

No, there is no real advantage to storing JSON as a string in Cassandra unless the underlying data in the JSON is really schema-less. It will also not save space; in fact it uses more, because each item has to store a key plus a value instead of just the value.
If you can, I would recommend mapping the keys to CQL columns so you can store the values natively, which also makes accessing the data more flexible. Cheers!
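To make the "map the keys to CQL columns" suggestion concrete, here is a minimal sketch that builds a CREATE TABLE statement with one native column per former JSON key. The table name readings, the key column names id and ts, and the column types for a and b are assumptions made for illustration; only the bigint partition key, the timestamp clustering key and the JSON keys a and b come from the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class CqlSchemaSketch {
    // Builds a CREATE TABLE statement with one native column per former JSON key.
    // "id" and "ts" are hypothetical names for the existing partition and clustering keys.
    static String createTable(String table, Map<String, String> dataColumns) {
        StringJoiner cols = new StringJoiner(", ");
        cols.add("id bigint");
        cols.add("ts timestamp");
        for (Map.Entry<String, String> e : dataColumns.entrySet()) {
            cols.add(e.getKey() + " " + e.getValue());
        }
        cols.add("PRIMARY KEY ((id), ts)");
        return "CREATE TABLE " + table + " (" + cols + ")";
    }

    public static void main(String[] args) {
        // Column types are guesses based on the sample row in the question.
        Map<String, String> cols = new LinkedHashMap<>();
        cols.put("a", "text");
        cols.put("b", "int");
        System.out.println(createTable("readings", cols));
    }
}
```

With columns like these, a single value can be read or written on its own (for example SELECT b FROM readings WHERE id = ? AND ts = ?) without parsing a JSON string in the application.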

Erick is spot-on-correct with his answer.
The only thing I'd add is that storing JSON blobs in a single column makes updates (even more) problematic. If you update a single JSON property, the whole column gets rewritten. Also, the original JSON blob is still there, just "obsoleted" until compaction runs. The only time that storing a JSON blob in a single column makes any sense is if the properties don't change.
And I agree, mapping the keys to CQL columns is a much better option.

I don't disagree with the excellent and already accepted answer by @erick-ramirez.
However, there is often a good case to be made for using frozen UDTs instead of separate columns for related data that is only ever going to be set and retrieved at the same time and will not be specifically filtered as part of your query.
The "frozen" part is important, as it means less work for Cassandra, but it does mean that you rewrite the whole value on each update.
This can give a large performance boost over a large number of columns. The nice ScyllaDB people have a great post on that:
If You Care About Performance, Employ User Defined Types
(I know ScyllaDB is not exactly Cassandra, but I've seen multiple articles that say the same thing about Cassandra.)
One downside is that you add work to the application layer, and sometimes mapping complex UDTs to your Java types will be interesting.
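As a rough sketch of what this could look like, the class below pairs CQL for a frozen UDT with an immutable Java value type. The type name reading, the table name readings, the column names and the field types are all assumptions for illustration, and no driver mapping is shown. The immutable class mirrors the frozen semantics: changing one field yields a complete new value that has to be written back as a whole.

```java
final class Reading {
    // CQL for a hypothetical UDT and a table that stores it frozen (all names are illustrative).
    static final String CREATE_TYPE = "CREATE TYPE reading (a text, b int)";
    static final String CREATE_TABLE =
            "CREATE TABLE readings (id bigint, ts timestamp, data frozen<reading>, "
            + "PRIMARY KEY ((id), ts))";

    final String a;
    final int b;

    Reading(String a, int b) {
        this.a = a;
        this.b = b;
    }

    // A frozen UDT cannot be updated field by field, so a change
    // produces a whole new value that replaces the stored one.
    Reading withB(int newB) {
        return new Reading(a, newB);
    }
}
```

For example, new Reading("somedata", 2).withB(3) leaves the original object untouched and returns the value that would be written to the data column.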

Related

Most efficient way to store unused chunk of data in PostgreSQL

The table has a few regular columns plus data for 100+ additional columns, which only needs to be stored for later export to other systems.
This data (unlike the first few columns) does not need to be indexed, filtered or manipulated in any way. No queries inspect it.
The only requirement is that the application layer can retrieve the whole row, including this otherwise unused payload, and deserialize it for conversion to an external format.
One idea was to serialize the whole class into this field, but we later realized that this adds a tremendous size overhead (because of the additional Java class metadata).
So it's simple key-value data (the key set is static, as the relational model suggests).
What is the right way and data type to store this additional unused data in PostgreSQL in terms of DB performance (50+ TB storage)? Perhaps it's worth omitting the keys and storing only the values as an array (since the keys are static), then looking values up by index at the application layer after deserialization (since DB performance comes first)?

a_horse_with_no_name, thanks a lot, but jsonb is a really tricky data type.
In terms of the number of bytes required for a single tuple that contains jsonb, one must always keep in mind the size of the key names in the JSON format.
So if someone wants to reinvent the wheel and store long key names as single-byte indexes, it will decrease the overall tuple size,
but it isn't better than storing all the data as ordinary relational table fields, because TOAST applies in both cases.
Another way is to use the EXTERNAL storage method for a single jsonb field.
In that case PostgreSQL will keep more tuples in cache, since there is no need to keep the whole jsonb value in memory.
Anyway, I ended up with a combination of protobuf + zlib in a bytea field (since there is no need to query the data in the bytea field in our system):
Uber research for protobuf + zlib
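The general shape of "values only, compressed, stored as bytes" can be sketched with the JDK alone. This is not the protobuf setup described above: DataOutputStream stands in for protobuf as the serializer, and the key names in KEYS are hypothetical. The keys live only in application code, the values are written in a fixed order, and the result is zlib-compressed into a byte array suitable for a bytea column.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

final class CompactRowCodec {
    // Hypothetical static key order, kept in application code and never stored per row.
    static final String[] KEYS = {"name", "city", "comment"};

    // Writes only the values (in KEYS order) and compresses them with zlib.
    // Note: writeUTF limits each value to 65535 encoded bytes.
    static byte[] encode(String[] values) {
        if (values.length != KEYS.length) {
            throw new IllegalArgumentException("expected " + KEYS.length + " values");
        }
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new DeflaterOutputStream(buffer))) {
            for (String value : values) {
                out.writeUTF(value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Decompresses the stored bytes and reads the values back in KEYS order.
    static String[] decode(byte[] stored) {
        String[] values = new String[KEYS.length];
        try (DataInputStream in = new DataInputStream(
                new InflaterInputStream(new ByteArrayInputStream(stored)))) {
            for (int i = 0; i < values.length; i++) {
                values[i] = in.readUTF();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return values;
    }
}
```

The value for a given key is then decode(bytes)[i], where i is the position of that key in KEYS. The trade-off is that the key order becomes part of the storage format, so adding or reordering keys needs a versioning scheme, which is something protobuf field numbers handle for you.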

Get all documents from Couchbase bucket

I am writing Couchbase DAO using Java API. I store all documents for one entity in particular bucket. I wonder what is the best way to get all documents from this bucket?
Thanks in advance!

First: do you plan to store each entity type in its own bucket? That will probably not work in the long run, unless you plan to never have more than 10 entity types in total. Buckets are not made to organize data like that: they are meant to store a variety of different types of data.
Second: do you really want to get all data from a bucket? That seems like a very uncommon use case. It's almost like asking "how do I query all data from all tables in a relational database".
That being said, I could imagine a very specialized situation where you'd want to do this. So, you could:
Create a PRIMARY index and execute a N1QL query like SELECT * FROM mybucket;
Create a very simple map/reduce view index of the data.
Both of these things can be done with the Java SDK.

Variable structure in a DB table to be read in a Web form

I have a "variable" structure to store in a database table. By "variable" I mean a sequence of field/value pairs in which the "kind" of field determines the value type; I don't know the exact field order and I don't know how many times fields can repeat. Sometimes groups of fields will repeat several times (it is a fiscal model).
Additional requirement: I should map this variable data onto web page forms, handling some CRUD work.
Stack: jQuery UI, Struts 2, Hibernate. Preferred DBMS: MySQL.
The solutions I thought of:
1. Vertical table. I could have some performance issues, which I could resolve with materialized views that "pivot" the rows into columns when I need to process data in bulk. I have not gone far in this direction, as it seems to be very expensive to develop.
2. LOB fields. Pack my columns into one of those, perhaps with a "mapping" table to decode each column. My idea is to pull out searchable fields as "real" columns, leaving only the less interesting bulk of the data in the LOB so as not to cause performance problems.
3. Or better, 2a: use XML inside the LOB field. This could make it easier to pack and unpack the data, especially when mapping it to a web form.
What do you think? Also, is there some way to create automatic views from XML fields? Or a better way to map such data to a web form? I suspect Hibernate Tools won't work in any of the cases I described.
I hope I have been clear; it's still a bit confusing even to me :)

Your option 1 is the Entity-Attribute-Value antipattern.
See my answer to Product table, many kinds of product, each product has many parameters and my blog post EAV FAIL for alternatives and some reasons why EAV is wrong, at least for a relational database (I cover EAV in my book, SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming).
Also read this article about how a similar structure nearly doomed a company: Bad CaRMa.
Your options 2 & 3 are similar to what is described in How FriendFeed uses MySQL to store schema-less data. I don't know of any automatic way for an ORM to maintain that structure for you. You do have the chore of keeping your inverted index tables in sync with your LOB data.

persisting dynamic properties and query

I have a requirement to implement a contact database. This contact database is special in that the user should be able to dynamically (at runtime) add properties he/she wants to track about a contact. Some of these properties are strings, others are numbers or dates. Some of the properties have predefined values, others are free-form fields, etc. The user also wants to be able to query such a structure quickly and easily. The database needs to easily handle 500,000 contacts, each having around 10 properties.
This leads to a dynamic property model, with a Contact class that has dynamic properties.
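import java.util.Collection;
import java.util.Map;
class Contact {
    // each dynamic property maps to the values recorded for this contact
    private Map<DynamicProperty, Collection<DynamicValue>> propertiesAndValues;
    // other useful methods
}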
The question is how I can store such a structure in "some database" (it does not have to be an RDBMS) so that I can easily express queries such as:
Get all contacts whose name starts with Martin and who are from a company of size 5000 or less, ordered by the time the contact was inserted into the database, first 100 results only (with pagination), where each of these criteria corresponds to a dynamic property.
I need:
filtering - equal, partial equal, (bigger, smaller for integers, dates) and maybe aggregation - but it is not necessary at this point
sorting
pagination
I was considering an RDBMS, but this leads more or less to the following structure, which is quite hard to query and tends to be slow for this amount of data:
contact(id serial pk,....);
dynamic_property(dp_id serial pk, ...);
--only one of the values is not empty
dynamic_property_value(dpv_id serial pk, dynamic_property_fk int, value_integer int, date_value timestamp, text_value text);
contact_properties(pav_id serial pk, contact_id_fk int, dynamic_property_fk int);
property_and_its_value(pav_id_fk int, dpv_id int);
I am considering the following options:
Store contacts in an RDBMS and use Lucene for querying. Is there anything that would help with this?
Store dynamic properties as XML in an RDBMS and use its XPath support. Unfortunately, this seems to be pretty slow for 500,000 contacts.
Use another database, such as MongoDB or Jackrabbit, to store this information.
Which way would you go and why?

Wikipedia has a great entry on Entity-Attribute-Value modeling which is a data modeling technique for representing entities with arbitrary properties. It's typically used for clinical data, but might apply to your situation as well.

Have you considered using Lucene for your querying needs? You could probably get away with just using Lucene and store all your data in the index. Although I wouldn't recommend using Lucene as your only persistence store.
Alternatively, you could use Lucene along with a RDBMS and take advantage of something like Compass.

You could try another kind of database, such as CouchDB, which is a distributed, document-oriented database.
If you want a dumb solution, you could add some 50 columns to your contacts table, like STRING_COLUMN1, STRING_COLUMN2 ... up to STRING_COLUMN10, DATE_COLUMN1 ... DATE_COLUMN10, plus another DESCRIPTION column. So if a row has a name, which is a string, then STRING_COLUMN1 stores the value of the name and the DESCRIPTION column value would be "STRING_COLUMN1-NAME". In this case querying can be a bit tricky. I know many purists laugh at this, but I have seen a similar requirement solved this way in one of the apps :)

Which serialization format for key/value pairs is best indexable in RDBMS?

I have a certain object type that is stored in a database. This type now gets additional information associated with it which will differ in structure among instances. Although for groups of instances the information will be identically structured, the structure will only be known at runtime and will change over time.
I decided to just add a blob field to the table and store the key/value pairs there in some serialized format. From your experience, what format is most advisable?
In the context of my application, the storage space for this is secondary. There's one particular operation that I want to be fast, which is looking up the correct instance for a given set of key / value pairs (so it's a kind of variable-field composite key). I guess that means, is there a format that plays particularly well with typical database indexing?
Additionally, I might be interested in looking for a set of instances that share the same set of keys (an adhoc "class", if you wish).
I'm writing this in Java and I'm storing in various types of SQL databases. I've got JSON, GPB and native Java serialization on my radar, favouring the cross-language formats. I can think of two basic strategies:
store the set of values in the table and add a foreign key to a separate table that contains the set of keys
store the key/value pairs in the table

If your goal is to take advantage of database indexes, storing the unstructured data in a BLOB is not going to be effective. BLOBs are essentially opaque from the RDBMS's perspective.
I gather from your description that the unstructured part of the data takes the form of an arbitrary set of key-value pairs associated with the object, right? Well, if the types of all keys are the same (e.g. they're all strings), I'd recommend simply creating a child table with (at least) three columns: the key, the value, and a foreign key to the parent object's row in its table. Since the keys will then be stored in the database as a regular column, they can be indexed effectively. The index should also include the foreign key to the parent table.
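To illustrate why the child-table layout supports the lookup the question asks for, here is a small in-memory sketch. The map stands in for a database index on (key, value), and finding the instances for a set of pairs becomes an intersection of index hits. The class and method names are hypothetical, and a real implementation would issue SQL against the child table rather than use Java collections.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class KeyValueIndexSketch {
    // Stands in for an index on (key, value) of the child table:
    // each pair maps to the ids of the parent rows that carry it.
    private final Map<List<String>, Set<Long>> index = new HashMap<>();

    // Equivalent to inserting one child row (parentId, key, value).
    void put(long parentId, String key, String value) {
        index.computeIfAbsent(Arrays.asList(key, value), k -> new HashSet<>()).add(parentId);
    }

    // Returns the parents that have every requested pair (intersection of index hits).
    Set<Long> find(Map<String, String> wanted) {
        Set<Long> result = null;
        for (Map.Entry<String, String> e : wanted.entrySet()) {
            Set<Long> hits = index.getOrDefault(
                    Arrays.asList(e.getKey(), e.getValue()), Collections.<Long>emptySet());
            if (result == null) {
                result = new HashSet<>(hits);
            } else {
                result.retainAll(hits);
            }
        }
        return result == null ? Collections.<Long>emptySet() : result;
    }
}
```

Note that find returns every parent that has at least the requested pairs; if the pairs must match an instance's full key set exactly, an additional check on the number of pairs per parent is needed.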
A completely different approach would be to look at a "schemaless" database engine like CouchDB, which is specifically designed to deal with unstructured data. I have zero experience with such systems and I don't know how well the rest of your application would lend itself to this alternative storage strategy, but it might be worth looking into.

Not really an answer to your question, but have you considered looking at the Java Edition of BerkeleyDB? Duplicate keys and serialized values can be stored with this (fast) engine.
