How to store and sort a very big map? - java

I need to read a large text file, parse each line, and store the parsed content in a map that maps a String key to objects I create. For large files this map consumes memory quickly. I only need the map because I have to sort the entries by key before writing them to an output file; if not for the sorting, I wouldn't need to keep all key-value pairs in memory.
I searched around; some people suggested map-reduce and others suggested a database. In particular, Berkeley DB was mentioned as a good choice. Is it straightforward to sort a large set of key-value pairs with Berkeley DB in Java, and is it convenient to use?
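For reference, here is a minimal sketch of the kind of usage I have in mind, assuming Berkeley DB Java Edition (com.sleepycat.je); the environment directory, database name, and the way the value is serialized are placeholders. Keys are kept in sorted byte order on disk, so a cursor scan already produces sorted output without holding everything in the heap:

    import com.sleepycat.je.Cursor;
    import com.sleepycat.je.Database;
    import com.sleepycat.je.DatabaseConfig;
    import com.sleepycat.je.DatabaseEntry;
    import com.sleepycat.je.Environment;
    import com.sleepycat.je.EnvironmentConfig;
    import com.sleepycat.je.LockMode;
    import com.sleepycat.je.OperationStatus;

    import java.io.File;
    import java.nio.charset.StandardCharsets;

    public class SortedStoreSketch {
        public static void main(String[] args) throws Exception {
            File dir = new File("bdb-env");   // placeholder directory; must exist
            dir.mkdirs();

            EnvironmentConfig envConfig = new EnvironmentConfig();
            envConfig.setAllowCreate(true);
            Environment env = new Environment(dir, envConfig);

            DatabaseConfig dbConfig = new DatabaseConfig();
            dbConfig.setAllowCreate(true);
            Database db = env.openDatabase(null, "parsedLines", dbConfig);

            // Writing: entries go to disk; the B-tree keeps keys in sorted (unsigned byte) order.
            DatabaseEntry key = new DatabaseEntry("someKey".getBytes(StandardCharsets.UTF_8));
            DatabaseEntry value = new DatabaseEntry("serialized object bytes".getBytes(StandardCharsets.UTF_8));
            db.put(null, key, value);

            // Reading back: the cursor walks the keys in order, so this loop is already "sorted output".
            Cursor cursor = db.openCursor(null, null);
            try {
                DatabaseEntry k = new DatabaseEntry();
                DatabaseEntry v = new DatabaseEntry();
                while (cursor.getNext(k, v, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
                    System.out.println(new String(k.getData(), StandardCharsets.UTF_8));
                }
            } finally {
                cursor.close();
            }
            db.close();
            env.close();
        }
    }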

Related

Single data column vs multiple columns in Cassandra

I'm working on a project with an existing Cassandra database.
The schema looks like this:
partition key (bigint) | clustering key1 (timestamp)  | data (text)
1                      | 2021-03-10 11:54:00.000      | {a:"somedata", b:2, ...}
My question is: is there any advantage to storing the data in a JSON string?
Will it save some space?
So far I have only found disadvantages:
You cannot (easily) add or drop columns at runtime, since the application could overwrite the JSON string column.
Parsing the JSON string is currently the performance bottleneck.
No, there is no real advantage to storing JSON as a string in Cassandra unless the underlying data in the JSON is really schema-less. It will also not save space; in fact it uses more, because each item has to store a key+value instead of just the value.
If you can, I would recommend mapping the keys to CQL columns so you can store the values natively and access the data more flexibly. Cheers!
Erick is spot-on-correct with his answer.
The only thing I'd add is that storing JSON blobs in a single column makes updates (even more) problematic. If you update a single JSON property, the whole column gets rewritten. Also, the original JSON blob is still there... just "obsoleted" until compaction runs. The only time storing a JSON blob in a single column makes any sense is if the properties don't change.
And I agree, mapping the keys to CQL columns is a much better option.
I don't disagree with the excellent and already accepted answer by @erick-ramirez.
However, there is often a good case to be made for using frozen UDTs instead of separate columns for related data that is only ever going to be set and retrieved at the same time and will not be specifically filtered on as part of your query.
The "frozen" part is important, as it means less work for Cassandra, but it does mean that you rewrite the whole value on each update.
This can give a large performance boost over using a large number of columns. The nice ScyllaDB people have a great post on that:
If You Care About Performance, Employ User Defined Types
(I know ScyllaDB is not exactly Cassandra, but I've seen multiple articles that say the same thing about Cassandra.)
One downside is that you add work to the application layer and sometimes mapping complex UDTs to your Java types will be interesting.

Most efficient way to store an unused chunk of data in PostgreSQL

There are a few regular columns in the table, plus data for roughly 100+ further columns that only needs to be stored for later export to other sources.
This data (besides the first few columns mentioned) doesn't need to be indexed, filtered, or manipulated in any way. No queries ever inspect it.
The only thing that happens is that the application layer retrieves the whole row, including this otherwise unused payload, and deserializes it for conversion to an external format.
There was an idea to serialize the whole class into this field, but we later realized that this adds tremendous overhead to the data size (because of the additional Java class metadata).
So it's simple key-value data (the key set is static, as the relational model suggests).
What is the right way, and the right data type, to store this additional unused data in PostgreSQL in terms of DB performance (50+ TB of storage)? Perhaps it's worth omitting the keys and storing only the values as an array (since the keys are static), then looking values up by index after deserialization at the application layer (since DB performance comes first)?
a_horse_with_no_name, thanks a lot, but jsonb is a really tricky data type.
When it comes to the number of bytes required for a single tuple containing jsonb, one must always keep in mind the size of the key names in the JSON.
So even if someone wants to reinvent the wheel and store long key names as single-byte indexes, it will decrease the overall tuple size,
but it isn't better than storing the data as ordinary relational table columns, because the TOAST algorithm applies in both cases.
Another option is to use the EXTERNAL storage method for the jsonb column.
In that case PostgreSQL can keep more tuples in its cache, since there is no need to keep the whole jsonb value in memory.
Anyway, I ended up with a combination of protobuf + zlib in a bytea column (since there is no need to query the data inside the bytea column in our system):
Uber research for protobuf + zlib
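For what it's worth, here is a minimal sketch of that last approach, assuming the protobuf message has already been serialized to a byte[] elsewhere (the generated protobuf classes are omitted); the table name, column names, and connection details are made up:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.zip.DeflaterOutputStream;

    public class CompressedByteaWriter {

        // zlib-compress an already protobuf-serialized payload.
        static byte[] zlibCompress(byte[] payload) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (DeflaterOutputStream dos = new DeflaterOutputStream(bos)) {
                dos.write(payload);
            }
            return bos.toByteArray();
        }

        // Stores the compressed payload in a bytea column next to the indexed key columns.
        // Table and column names (events, id, extra_data) are illustrative only.
        static void store(Connection conn, long id, byte[] payload) throws IOException, SQLException {
            String sql = "INSERT INTO events (id, extra_data) VALUES (?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, id);
                ps.setBytes(2, zlibCompress(payload));
                ps.executeUpdate();
            }
        }

        public static void main(String[] args) throws Exception {
            byte[] protobufBytes = "pretend this came from MyMessage.toByteArray()".getBytes();
            // Hypothetical connection details.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/mydb", "user", "password")) {
                store(conn, 1L, protobufBytes);
            }
        }
    }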

Removing duplicates in Java on large-scale data

I have the following issue.
I'm connecting to some service using an API and getting the data as an InputStream.
The goal is to save the data after removing duplicate lines.
Duplication is defined by columns 10, 15 and 22.
I'm fetching the data using several threads.
Currently I first save the data into a CSV file and then remove duplicates, but I want to do it while I'm reading the data.
The volume of data is about 10 million records.
I have limited memory that I can use: the machine has 32 GB of memory, but I am limited since other applications are using it too.
I read here about using hash maps, but I'm not sure I have enough memory for that.
Does anyone have a suggestion for how to solve this issue?
A HashMap will use at least as much memory as your raw data. Therefore, it is probably not feasible for the size of your data set (however, you should check that, because if it is feasible, it's the easiest option).
What I would do is write the data to a file or database, compute a hash value for the fields to be deduplicated, and store the hash values in memory with a suitable reference to the file (e.g. the byte offset of where the original value sits in the written file). The reference should of course be as small as possible.
When you hit a hash match, look up the original value and check whether it is identical (as hashes of different values may collide).
The question now is how many duplicates you expect. If you expect few matches, I would choose a cheap-write, expensive-read solution, i.e. dumping everything linearly into a flat file and reading back from that file.
If you expect many matches, it's probably the other way round, i.e. having an indexed file or set of files, or even a database (make sure it's a database where write operations are not too expensive).
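A minimal sketch of that hash-plus-file-reference idea, assuming the dedup key is columns 10, 15 and 22 joined into a single String and that only the keys (not the full records) are spilled to disk; class and file names are made up:

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class HashOffsetDeduper {

        // In-memory part: hash of the dedup key -> offsets of the stored keys in the spill file.
        private final Map<Integer, List<Long>> seen = new HashMap<>();
        private final RandomAccessFile spill;

        public HashOffsetDeduper(File spillFile) throws IOException {
            this.spill = new RandomAccessFile(spillFile, "rw");
        }

        // Returns true the first time a key (e.g. columns 10, 15 and 22 joined with a separator)
        // is seen, false for duplicates. Synchronized because several threads feed records in.
        public synchronized boolean firstTimeSeen(String key) throws IOException {
            int h = key.hashCode();
            List<Long> offsets = seen.get(h);
            if (offsets != null) {
                // Possible duplicate: read the stored keys back and compare, since hashes can collide.
                for (long off : offsets) {
                    spill.seek(off);
                    if (key.equals(spill.readUTF())) {
                        return false;
                    }
                }
            }
            // New key: append it to the spill file and remember its offset.
            long off = spill.length();
            spill.seek(off);
            spill.writeUTF(key);
            seen.computeIfAbsent(h, k -> new ArrayList<>()).add(off);
            return true;
        }
    }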
The solution depends on how big the data in columns 10, 15 and 22 is.
Assuming it's not too big (say, around 1 KB), you can actually implement an in-memory solution.
Implement a Key class to store the values from columns 10, 15 and 22, and carefully implement its equals and hashCode methods. (You could also use a plain List of the three values as the key instead, since List already defines equals and hashCode.)
Create a Set which will contain the keys of all records you have read.
For each record you read, check whether its key is already in that set. If it is, skip the record; if not, write the record to the output and add the key to the set. Make sure you access the set in a thread-safe manner (see the sketch below).
In the worst case you'll need (number of records) * (size of key) memory. For 10,000,000 records and the assumed <1 KB per key, this should work out to around 10 GB.
If the key size is still too large, you'll probably need a database to store the set of keys.
Another option would be to store hashes of the keys instead of the full keys. This requires much less memory, but you may get hash collisions. These lead to "false positives", i.e. records flagged as duplicates which aren't actually duplicates. To completely avoid this you'll need a database.
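Here is a minimal sketch of the Key class + concurrent set variant described above; column extraction and output writing are left out, and the names are illustrative:

    import java.util.Objects;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    public class InMemoryDeduper {

        // Value object for the three dedup columns; equals/hashCode define what "duplicate" means.
        static final class RecordKey {
            final String col10, col15, col22;

            RecordKey(String col10, String col15, String col22) {
                this.col10 = col10;
                this.col15 = col15;
                this.col22 = col22;
            }

            @Override public boolean equals(Object o) {
                if (!(o instanceof RecordKey)) return false;
                RecordKey k = (RecordKey) o;
                return col10.equals(k.col10) && col15.equals(k.col15) && col22.equals(k.col22);
            }

            @Override public int hashCode() {
                return Objects.hash(col10, col15, col22);
            }
        }

        // Concurrent set shared by all reader threads; Set.add() returns false for duplicates.
        private final Set<RecordKey> seen = ConcurrentHashMap.newKeySet();

        // Returns true exactly once per distinct (col10, col15, col22) combination.
        public boolean firstTimeSeen(String col10, String col15, String col22) {
            return seen.add(new RecordKey(col10, col15, col22));
        }
    }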
You can use a concurrent set, e.g. ConcurrentHashMap.newKeySet() (there is no ConcurrentHashSet class in the JDK). A set automatically rejects duplicate elements, and this one is thread-safe.

Fast search through an N*M data structure?

I have the data number[M][N]; it is read in from a stream, so I can put it into whatever data structure I want.
I have to search through it many times using different pairs of short values, i.e. I need to get the row numbers whose values in two columns match a given pair.
I could create an additional array and use binary search on it to find positions in the input data, something like an index in a database, but is there a standard library to solve a task like this?
You can put it into more than one data structure if the searching warrants this. You could have the data in a HashMap, TreeMap, and another Map that would have the key-value mapping the other way around (if that makes sense in your case).
What's the data like, and how do you need to search it?
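To make that concrete, here is a minimal sketch of one such lookup structure, assuming the values really are shorts and the two columns to search on are known when the index is built (class and method names are made up):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class RowIndex {

        // Packed (colA, colB) pair -> row numbers having that pair.
        private final Map<Integer, List<Integer>> index = new HashMap<>();
        private final int colA;
        private final int colB;

        // Build the index once over the two chosen columns.
        public RowIndex(short[][] data, int colA, int colB) {
            this.colA = colA;
            this.colB = colB;
            for (int row = 0; row < data.length; row++) {
                int key = pack(data[row][colA], data[row][colB]);
                index.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
            }
        }

        // Row numbers whose values in (colA, colB) equal the given pair; O(1) on average per lookup.
        public List<Integer> rowsFor(short a, short b) {
            return index.getOrDefault(pack(a, b), Collections.emptyList());
        }

        // Pack two 16-bit shorts into one int so the pair can be used as a single map key.
        private static int pack(short a, short b) {
            return ((a & 0xFFFF) << 16) | (b & 0xFFFF);
        }
    }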

Which serialization format for key/value pairs is best indexable in RDBMS?

I have a certain object type that is stored in a database. This type now gets additional information associated with it which will differ in structure among instances. Although for groups of instances the information will be identically structured, the structure will only be known at runtime and will change over time.
I decided to just add a blob field to the table and store the key/value pairs there in some serialized format. From your experience, what format is most advisable?
In the context of my application, the storage space for this is secondary. There's one particular operation that I want to be fast, which is looking up the correct instance for a given set of key / value pairs (so it's a kind of variable-field composite key). I guess that means, is there a format that plays particularly well with typical database indexing?
Additionally, I might be interested in looking for a set of instances that share the same set of keys (an adhoc "class", if you wish).
I'm writing this in Java and I'm storing in various types of SQL databases. I've got JSON, GPB and native Java serialization on my radar, favouring the cross-language formats. I can think of two basic strategies:
store the set of values in the table and add a foreign key to a separate table that contains the set of keys
store the key/value pairs in the table
If your goal is to take advantage of database indexes, storing the unstructured data in a BLOB is not going to be effective. BLOBs are essentially opaque from the RDBMS's perspective.
I gather from your description that the unstructured part of the data takes the form of an arbitrary set of key-value pairs associated with the object, right? Well, if the types of all keys are the same (e.g. they're all strings), I'd recommend simply creating a child table with (at least) three columns: the key, the value, and a foreign key to the parent object's row in its table. Since the keys will then be stored in the database as a regular column, they can be indexed effectively. The index should also include the foreign key to the parent table.
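To illustrate the lookup this enables, here is a minimal JDBC sketch; the table name object_attributes, its columns, and the index layout are assumptions rather than an existing schema:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    public class AttributeLookup {

        // Finds object ids whose attribute rows contain every (key, value) pair in wanted.
        // Assumes a child table object_attributes(object_id, attr_key, attr_value) with an
        // index on (attr_key, attr_value, object_id). wanted must be non-empty.
        public static List<Long> findMatching(Connection conn, Map<String, String> wanted)
                throws SQLException {
            List<String> clauses = new ArrayList<>();
            for (int i = 0; i < wanted.size(); i++) {
                clauses.add("(attr_key = ? AND attr_value = ?)");
            }
            String sql = "SELECT object_id FROM object_attributes WHERE "
                    + String.join(" OR ", clauses)
                    + " GROUP BY object_id HAVING COUNT(DISTINCT attr_key) = ?";

            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                int i = 1;
                for (Map.Entry<String, String> e : wanted.entrySet()) {
                    ps.setString(i++, e.getKey());
                    ps.setString(i++, e.getValue());
                }
                ps.setInt(i, wanted.size());

                List<Long> ids = new ArrayList<>();
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        ids.add(rs.getLong(1));
                    }
                }
                return ids;
            }
        }
    }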
A completely different approach would be to look at a "schemaless" database engine like CouchDB, which is specifically designed to deal with unstructured data. I have zero experience with such systems and I don't know how well the rest of your application would lend itself to this alternative storage strategy, but it might be worth looking into.
Not really an answer to your question, but have you considered looking at the Java Edition of BerkeleyDB? Duplicate keys and serialized values can be stored with this (fast) engine.
