Most efficient way to store unused chunk of data in PostgreSQL - java

There are a few regular columns in a table, plus data for roughly 100+ more columns that only needs to be stored for later export to other systems.
This data (besides the first few columns mentioned) doesn't need to be indexed, filtered, or manipulated in any way. There are no queries that inspect this data at all.
The only operation is that the application layer retrieves the whole row (together with this otherwise unused payload) and deserializes it for further conversion into an external format.
There was an idea to serialize the whole class into this field, but we later realized that this adds tremendous overhead in data size (because of the additional Java class metadata).
So it's simple key-value data (the key set is static, as the relational model suggests).
What is the right way, and the right data type, to store this additional unused data in PostgreSQL in terms of DB performance (50+ TB of storage)? Perhaps it's worth omitting the keys and storing only the values as an array (since the keys are static), then looking up values by index after deserialization at the application layer (since DB performance comes first)?

a_horse_with_no_name, thanks a lot, but jsonb is a really tricky data type.
In terms of the number of bytes required for a single tuple containing jsonb, one must always keep in mind the size of the key names in the JSON document.
So if someone wants to reinvent the wheel and store long key names as single-byte indexes, it will decrease the overall tuple size,
but it isn't better than storing all the data as ordinary relational table fields, because the TOAST algorithm applies in both cases.
Another option is to use the EXTERNAL storage method for the single jsonb field.
In that case PostgreSQL can keep more tuples in cache, since there is no need to hold the whole jsonb value in memory.
Anyway, I ended up with a combination of protobuf + zlib in a bytea column (since there is no need to query the data inside the bytea field in our system):
Uber research for protobuf + zlib
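For illustration, a minimal sketch of that approach (the table name export_data, the payload column, and the byte[] coming from a generated protobuf message's toByteArray() are assumptions for this example, not from the original post):

import java.io.ByteArrayOutputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.zip.Deflater;

public class PayloadWriter {

    // Compress the serialized protobuf bytes with zlib
    // (java.util.zip.Deflater produces the zlib format by default).
    static byte[] compress(byte[] protoBytes) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(protoBytes);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream(protoBytes.length);
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // protoBytes is the output of a generated protobuf message's toByteArray();
    // the compressed result goes into a bytea column as-is.
    static void store(Connection conn, long id, byte[] protoBytes) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO export_data (id, payload) VALUES (?, ?)")) {
            ps.setLong(1, id);
            ps.setBytes(2, compress(protoBytes));
            ps.executeUpdate();
        }
    }
}

Since the payload is already compressed before it reaches the database, it can also make sense to set the column's storage to EXTERNAL (ALTER TABLE ... ALTER COLUMN ... SET STORAGE EXTERNAL) so TOAST stores it out of line without trying to compress it again.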

Related

Single data column vs multiple columns in Cassandra

I'm working on a project with an existing cassandra database.
The schema looks like this:
partition key (big int) | clustering key1 (timestamp) | data (text)
1                       | 2021-03-10 11:54:00.000     | {a:"somedata", b:2, ...}
My question is: Is there any advantage storing data in a json string?
Will it save some space?
Until now I discovered disadvantages only:
You cannot (easily) add/drop columns at runtime, since the application could overwrite the JSON string column.
Parsing the json string is currently the bottleneck regarding performance.
No, there is no real advantage to storing JSON as a string in Cassandra unless the underlying data in the JSON is really schema-less. It will also not save space; in fact it will use more, because each item has to carry its key as well as its value instead of just storing the value.
If you can, I would recommend mapping the keys to CQL columns so you can store the values natively and accessing the data is more flexible. Cheers!
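As a rough sketch of what mapping the keys to CQL columns could look like from Java with the DataStax driver (the keyspace, table, and column names here are invented for illustration):

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;

import java.time.Instant;

public class NativeColumnsExample {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().withKeyspace("demo").build()) {
            // Instead of one text column holding {a:"somedata", b:2, ...},
            // each JSON key becomes its own typed column.
            PreparedStatement insert = session.prepare(
                "INSERT INTO readings (pk, ck, a, b) VALUES (?, ?, ?, ?)");
            session.execute(insert.bind(1L, Instant.now(), "somedata", 2));
        }
    }
}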
Erick is spot-on-correct with his answer.
The only thing I'd add, would be that storing JSON blobs in a single column makes updates (even more) problematic. If you update a single JSON property, the whole column gets rewritten. Also the original JSON blob is still there...just "obsoleted" until compaction runs. The only time that storing a JSON blob in a single column makes any sense, is if the properties don't change.
And I agree, mapping the keys to CQL columns is a much better option.
I don't disagree with the excellent and already accepted answer by #erick-ramirez.
However there is often a good case to be made for using frozen UDTs instead of separate columns for related data that is only ever going to be set and retrieved at the same time and will not be specifically filtered as part of your query.
The "frozen" part is important as it means less work for cassandra but does mean that you rewrite the whole value each update.
This can have a large performance boost over a large number of columns. The nice ScyllaDB people have a great post on that:
If You Care About Performance, Employ User Defined Types
(I know Scylla DB is not exactly Cassandra but I've seen multiple articles that say the same thing about Cassandra)
One downside is that you add work to the application layer and sometimes mapping complex UDTs to your Java types will be interesting.
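A sketch of the frozen-UDT layout being described, executed as plain CQL from Java (the keyspace, type, table, and field names are made up for this example):

import com.datastax.oss.driver.api.core.CqlSession;

public class FrozenUdtExample {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().withKeyspace("demo").build()) {
            // The UDT groups the related fields; "frozen" means the whole value
            // is written and read as a single cell.
            session.execute(
                "CREATE TYPE IF NOT EXISTS payload (a text, b int)");
            session.execute(
                "CREATE TABLE IF NOT EXISTS readings_udt ("
              + "  pk bigint, ck timestamp, data frozen<payload>,"
              + "  PRIMARY KEY (pk, ck))");
        }
    }
}

Because the UDT is frozen, an update replaces the whole data cell, which is exactly the trade-off mentioned above.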

Java write object to MySQL - each field or serialized byte array?

I'm trying to save an instance of a Serializable Java class to a MySQL database.
As far as I know I have two options:
Create a table that contains all fields as columns and re-create the object from the field-data saved in the database.
Serialize the instance -> get the byte array -> save the byte array to the database -> re-create the instance from the byte array.
My Question: Which way works faster and which way needs less space?
And my second question: How would I easily write and get the byte array from the MySQL database using JDBC?
Saving the serialized byte array would maybe save space as you wouldn't have all of the meta data associated with table columns, headers, etc... Having said that, I don't know that there would be any noticeable difference in speed or storage space by saving the individual fields in columns in the database versus one column with the object bytes. If you're going to serialize it and save it, you might as well save it to a file and not use a database at all. Not to mention, as your objects and model change, loading older versions could be problematic. Maintainability would be a nightmare as well.
Personally, I'd never save a serialized byte array of an object in a database unless there was a very specific business case or reason to do so. I'd just create the table, columns, and persist it that way using JDBC or your favorite persistence framework (like Hibernate). Saving it as a serialized byte array only seems to limit what you can do with the data. If you don't want to create the database, tables, columns, etc... then consider just serializing it and writing to a file. That would probably save some space and time as you wouldn't have to maintain a database server. Granted, the more objects you have, the more files, and the harder it would be to search and query that data.
TL;DR: I'd just create the database tables for the data you're trying to save. I don't see any noticeable benefits from saving it in a database as a serialized byte array.
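To answer the "how" part of the question anyway, here is a minimal JDBC sketch of the byte-array route, assuming a hypothetical table saved_objects(id BIGINT, data BLOB):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class BlobDao {

    // Serialize the object and write it into the BLOB column.
    static void save(Connection conn, long id, Serializable obj) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
            oos.writeObject(obj);
        }
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO saved_objects (id, data) VALUES (?, ?)")) {
            ps.setLong(1, id);
            ps.setBytes(2, bytes.toByteArray());
            ps.executeUpdate();
        }
    }

    // Read the BLOB column back and deserialize it into an object.
    static Object load(Connection conn, long id) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT data FROM saved_objects WHERE id = ?")) {
            ps.setLong(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) return null;
                try (ObjectInputStream ois = new ObjectInputStream(
                        new ByteArrayInputStream(rs.getBytes("data")))) {
                    return ois.readObject();
                }
            }
        }
    }
}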

What data structure to use for big data

I have an excel sheet with a million rows. Each row has 100 columns.
Each row represents an instance of a class with 100 attributes, and the columns values are the values for these attributes.
What data structure is the most optimal to use here to store the million instances of data?
Thanks
It really depends on how you need to access this data and what you want to optimize for – like, space vs. speed.
If you want to optimize for space, well, you could just serialize and compress the data, but that would likely be useless if you need to read/manipulate the data.
If you access by index, the simplest thing is an array of arrays.
If you instead use an array of objects, where each object holds your 100 attributes, you have a better way to structure your code (encapsulation!)
If you need to query/search the data, it really depends on the kind of queries. You may want to have a look at BST data structures...
One million rows with 100 values, where each value uses 8 bytes of memory, is only 800 MB, which will easily fit into the memory of most PCs, especially if they are 64-bit. Try to make the type of each column as compact as possible.
A more efficient way of storing the data is by column, i.e. you have an array for each column with a primitive data type. I suspect you don't even need to do this.
If you have many more rows, e.g. billions, you can use off-heap memory, i.e. memory-mapped files and direct memory. This can efficiently store more data than you have main memory while keeping your heap relatively small (e.g. 100s of GB off-heap with 1 GB in heap).
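A rough sketch of that column-oriented idea for the in-memory case (the column names and types are assumptions about the spreadsheet, since the question doesn't specify them):

// One primitive array per column instead of a million row objects:
// roughly 8 bytes per numeric value, with no per-object header or
// reference overhead.
public class ColumnStore {
    static final int ROWS = 1_000_000;

    final double[] price = new double[ROWS];   // hypothetical numeric column
    final int[]    count = new int[ROWS];      // hypothetical numeric column
    final String[] label = new String[ROWS];   // hypothetical text column
    // ... one array per remaining column

    double priceAt(int row) {   // access a cell by row index
        return price[row];
    }
}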
If you want to store all the data in memory, you can use one of the implementations of Table from Guava, typically ArrayTable for dense tables or HashBasedTable if most cells are expected to be empty. Otherwise, a database (probably with some cache system like Ehcache or Terracotta) would be a better shot.
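For example, a small sketch with HashBasedTable, where the row key is the row index, the column key is the attribute name, and the cell holds the value (the names are illustrative):

import com.google.common.collect.HashBasedTable;
import com.google.common.collect.Table;

public class GuavaTableExample {
    public static void main(String[] args) {
        // row key = row index, column key = attribute name, value = cell value
        Table<Integer, String, Object> sheet = HashBasedTable.create();
        sheet.put(0, "name", "Alice");
        sheet.put(0, "age", 30);

        Object age = sheet.get(0, "age");   // read a single cell
        System.out.println(sheet.row(0));   // all values of row 0 as a Map
        System.out.println(age);
    }
}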
Your best option would be to store them in a table in an actual database, like Postgres etc. These are optimised to work for what you are talking about!
For that kind of data I would prefer using a MySQL database, because it is fast and can handle a large dataset like that.
The best option would be to use a database that can store a large amount of data and is fast enough for quick access, like Oracle, MSSQL, MySQL, or any other database that is fast and can hold a large amount of data.
If you really have a million rows or more with 100 values each, I doubt it will all fit into your memory... or is there a special reason for it? For example poor performance using a database?
Since you want random access, I'd use a persistence provider like Hibernate and whichever database you like (for example MySQL).
But be aware that the way you use the persistence provider has a great impact on performance. For example, you should use batch inserts (which are incompatible with autogenerated IDs).
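As a concrete illustration of the batch-insert point, here is a plain-JDBC sketch; with Hibernate the same idea is usually achieved via the hibernate.jdbc.batch_size setting plus periodic flush()/clear(). The URL, table, and column names are made up:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchInsertExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/demo", "user", "password")) {
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO spreadsheet_rows (id, col1, col2) VALUES (?, ?, ?)")) {
                for (int i = 0; i < 1_000_000; i++) {
                    ps.setInt(1, i);
                    ps.setDouble(2, Math.random());
                    ps.setString(3, "value-" + i);
                    ps.addBatch();
                    if (i % 10_000 == 0) {
                        ps.executeBatch();   // send the batch in chunks
                    }
                }
                ps.executeBatch();           // flush the remainder
            }
            conn.commit();
        }
    }
}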

persisting dynamic properties and query

I have a requirement to implement contact database. This contact database is special in a way that user should be able to dynamically (on runtime) add properties he/she wants to track about the contact. Some of these properties are of type string, other numbers and dates. Some of the properties have pre-defined values, others are free fields etc.. User wants to be also able to query such structure fast and easily. The database needs to handle easily 500 000 contacts each having around 10 properties.
This leads to a dynamic property model with a Contact class having dynamic properties.
class Contact {
    private Map<DynamicProperty, Collection<DynamicValue>> propertiesAndValues;
    // other useful methods
}
The question is how I can store such a structure in "some database" - it does not have to be an RDBMS - so that I can easily express queries such as:
Get all contacts whose name starts with Martin, who are from a Company of size 5000 or less, ordered by the time when the contact was inserted into the database, returning only the first 100 results (to provide pagination), where each of these criteria corresponds to a dynamic property.
I need:
filtering - equal, partial equal, (bigger, smaller for integers, dates) and maybe aggregation - but it is not necessary at this point
sorting
pagination
I was considering an RDBMS, but that leads more or less to the following structure, which is quite hard to query and tends to be slow for this amount of data:
contact(id serial pk,....);
dynamic_property(dp_id serial pk, ...);
--only one of the values is not empty
dynamic_property_value(dpv_id serial pk, dynamic_property_fk int, value_integer int, date_value timestamp, text_value text);
contact_properties(pav_id serial pk, contact_id_fk int, dynamic_propert_fk int);
property_and_its_value(pav_id_fk int, dpv_id int);
I consider following options:
store contacts in RDBMS and use Lucene for querying - is there anything that would help with this?
Store dynamic properties as XML in the RDBMS and use its XPath support - unfortunately it seems to be pretty slow for 500,000 contacts
use another database - MongoDB or Jackrabbit - to store this information
Which way would you go and why?
Wikipedia has a great entry on Entity-Attribute-Value modeling which is a data modeling technique for representing entities with arbitrary properties. It's typically used for clinical data, but might apply to your situation as well.
Have you considered using Lucene for your querying needs? You could probably get away with just using Lucene and store all your data in the index. Although I wouldn't recommend using Lucene as your only persistence store.
Alternatively, you could use Lucene along with a RDBMS and take advantage of something like Compass.
You could try another kind of database, like CouchDB, which is a document-oriented DB and is distributed.
If you want a dumb solution, for your contacts table you could add some 50 columns like STRING_COLUMN1, STRING_COLUMN2... up to STRING_COLUMN10, and DATE_COLUMN1..DATE_COLUMN10. You'd have another DESCRIPTION column. So if a row has a name, which is a string, then STRING_COLUMN1 stores the value of the name and the DESCRIPTION column value would be "STRING_COLUMN1-NAME". In this case querying can be a bit tricky. I know many purists laugh at this, but I have seen a similar requirement solved this way in one of the apps :)

Which serialization format for key/value pairs is best indexable in RDBMS?

I have a certain object type that is stored in a database. This type now gets additional information associated with it which will differ in structure among instances. Although for groups of instances the information will be identically structured, the structure will only be known at runtime and will change over time.
I decided to just add a blob field to the table and store the key/value pairs there in some serialized format. From your experience, what format is most advisable?
In the context of my application, the storage space for this is secondary. There's one particular operation that I want to be fast, which is looking up the correct instance for a given set of key / value pairs (so it's a kind of variable-field composite key). I guess that means, is there a format that plays particularly well with typical database indexing?
Additionally, I might be interested in looking for a set of instances that share the same set of keys (an adhoc "class", if you wish).
I'm writing this in Java and I'm storing in various types of SQL databases. I've got JSON, GPB and native Java serialization on my radar, favouring the cross-language formats. I can think of two basic strategies:
store the set of values in the table and add a foreign key to a separate table that contains the set of keys
store the key/value pairs in the table
If your goal is to take advantage of database indexes, storing the unstructured data in a BLOB is not going to be effective. BLOBs are essentially opaque from the RDBMS's perspective.
I gather from your description that the unstructured part of the data takes the form of an arbitrary set of key-value pairs associated with the object, right? Well, if the types of all keys are the same (e.g. they're all strings), I'd recommend simply creating a child table with (at least) three columns: the key, the value, and a foreign key to the parent object's row in its table. Since the keys will then be stored in the database as a regular column, they can be indexed effectively. The index should also include the foreign key to the parent table.
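For illustration, a sketch of that child-table layout created via JDBC (the table, column, and index names are made up; "objects" stands for the parent table described above):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class KeyValueSchema {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/demo");
             Statement st = conn.createStatement()) {
            // One row per key/value pair, linked to the owning object.
            st.execute("CREATE TABLE object_attributes ("
                     + "  object_id  BIGINT NOT NULL REFERENCES objects(id),"
                     + "  attr_key   VARCHAR(255) NOT NULL,"
                     + "  attr_value VARCHAR(255) NOT NULL,"
                     + "  PRIMARY KEY (object_id, attr_key))");
            // Index that supports looking up objects by key/value combinations.
            st.execute("CREATE INDEX attr_lookup_idx "
                     + "ON object_attributes (attr_key, attr_value, object_id)");
        }
    }
}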
A completely different approach would be to look at a "schemaless" database engine like CouchDB, which is specifically designed to deal with unstructured data. I have zero experience with such systems and I don't know how well the rest of your application would lend itself to this alternative storage strategy, but it might be worth looking into.
Not really an answer to your question, but have you considered looking at the Java Edition of Berkeley DB? Duplicate keys and serialized values can be stored with this (fast) engine.
