Reading an SSTable in Java

A Sorted Strings Table (SSTable) is a file of key/value string pairs, sorted by key.
What is not clear to me is which fields an SSTable entity (class) should have.
Should we store all the values?
It is clear that we keep the offset of each key within the file, but I still cannot work out how to store all of this.
I have read a good article on the subject and still don't fully understand: https://www.igvita.com/2012/02/06/sstable-and-log-structured-storage-leveldb/
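To make the question concrete, here is a minimal sketch of one possible layout, not a definitive design. It assumes the class keeps only the file path and a sorted key-to-offset index in memory, and leaves the values on disk. The class name SSTableSketch, its methods, and the record format (key and value written back to back with writeUTF) are all hypothetical choices made for illustration.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: keys and offsets live in memory, values stay on disk. */
class SSTableSketch {
    private final Path file;                                      // the data file
    private final TreeMap<String, Long> index = new TreeMap<>();  // key -> byte offset of its record

    private SSTableSketch(Path file) {
        this.file = file;
    }

    /**
     * Writes the entries in key order and records where each record starts.
     * Assumed format: writeUTF(key) then writeUTF(value), so each string is limited to 65535 bytes.
     */
    static SSTableSketch write(Path file, TreeMap<String, String> sortedEntries) throws IOException {
        SSTableSketch table = new SSTableSketch(file);
        try (RandomAccessFile out = new RandomAccessFile(file.toFile(), "rw")) {
            out.setLength(0); // start from an empty file
            for (Map.Entry<String, String> e : sortedEntries.entrySet()) {
                table.index.put(e.getKey(), out.getFilePointer());
                out.writeUTF(e.getKey());
                out.writeUTF(e.getValue());
            }
        }
        return table;
    }

    /** Rebuilds the in-memory index by scanning an existing file once. */
    static SSTableSketch open(Path file) throws IOException {
        SSTableSketch table = new SSTableSketch(file);
        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            long length = in.length();
            while (in.getFilePointer() < length) {
                long offset = in.getFilePointer();
                String key = in.readUTF();
                in.readUTF(); // read past the value without keeping it
                table.index.put(key, offset);
            }
        }
        return table;
    }

    /** Looks up the offset in memory, then reads only that record from disk. */
    String get(String key) throws IOException {
        Long offset = index.get(key);
        if (offset == null) {
            return null;
        }
        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            in.seek(offset);
            in.readUTF(); // skip the key
            return in.readUTF();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("sstable", ".dat");
        TreeMap<String, String> data = new TreeMap<>();
        data.put("banana", "yellow");
        data.put("apple", "red");
        SSTableSketch.write(file, data);
        SSTableSketch table = SSTableSketch.open(file);
        System.out.println(table.get("apple"));  // red
        System.out.println(table.get("cherry")); // null
        Files.delete(file);
    }
}
```

Under these assumptions the answer to "should we store all values?" would be no: only the keys and their offsets are held in memory. Whether to index every key or only some of them is a separate trade-off that this sketch does not explore.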

Related

Questions about Java's library map classes?

The answers are (2) and (4), but I'm not sure why. I don't have much of a foundation in these topics. Could someone please explain why these are the correct answers and why the others are incorrect?
Thank you
A HashMap is a data structure that consists of keys and values. Each value is stored in the HashMap under an associated key. A value can then be retrieved by looking it up with the same key you used when putting it in.
1
TreeMap and LinkedHashMap are different implementations of Map. A HashMap uses hashing to store its keys, whereas a TreeMap stores its keys in a balanced binary search tree (a red-black tree) and a LinkedHashMap keeps a hash table plus a linked list running through its entries. If you iterate over a HashMap, the keys are returned in an order determined by their hash codes, which is unpredictable in most cases, because that's how they are stored. The TreeMap, however, keeps all of its keys in a tree, so when you iterate over it you get the keys in sorted order. A LinkedHashMap keeps its keys in an ordered list, so an iterator returns them in the same order in which you inserted them.
2, 3, and 5
In a HashMap, values are looked up using their keys. If you had duplicate keys, the HashMap couldn't know which value to return. Therefore, every key in a HashMap must be unique, but the values do not have to be.
4
In a normal HashMap, the key is hashed and then inserted in the appropriate spot. With a TreeMap and a LinkedHashMap, you have the additional overhead of inserting the key into the tree or linked list which will take up additional time and memory.
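The points above can be checked with a short example. This is only an illustrative sketch: the class name MapOrderDemo and the helper keysAfterInsert are made up for the demonstration, and the keys are arbitrary.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MapOrderDemo {
    /** Inserts the same three keys into the given map and returns them in iteration order. */
    static List<String> keysAfterInsert(Map<String, Integer> map) {
        map.put("pear", 1);
        map.put("apple", 2);
        map.put("mango", 3);
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        // TreeMap iterates in sorted key order.
        System.out.println(keysAfterInsert(new TreeMap<>()));       // [apple, mango, pear]
        // LinkedHashMap iterates in insertion order.
        System.out.println(keysAfterInsert(new LinkedHashMap<>())); // [pear, apple, mango]
        // HashMap makes no promise about order, so do not rely on what this prints.
        System.out.println(keysAfterInsert(new HashMap<>()));

        // Keys are unique: putting an existing key replaces its value.
        Map<String, Integer> map = new HashMap<>();
        map.put("apple", 1);
        Integer previous = map.put("apple", 2);
        System.out.println(previous + " " + map.get("apple") + " " + map.size()); // 1 2 1
    }
}
```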

Memory-efficient way to store large List<Map<String,String>> where many map entries are identical

I'm looking for a memory-efficient way to store tabular data typically consisting of about 150000 rows x 200 columns.
The cell values are Strings with lengths somewhere in the range of 0-200 characters.
The data rows are initially generated by taking all possible combinations of rows from smaller tables. So while all rows are unique, the columns contain many copies of the same value.
The data is not read-only. Some of the columns (typically up to 20 of the 200) get updated with values that depend on the values of other columns. And new columns (also about 20 I'd expect) with computed values are going to be added to the table.
The existing legacy code heavily depends on the data being stored in a List of Map<String, String>s that map column name to cell value.
But the current implementation, an ArrayList<HashMap<String,String>>, is taking many gigabytes of memory.
I tried calling String.intern() on the keys and values that get inserted into the HashMap. That halved the memory footprint. But it still seems horribly inefficient to keep all those identical Map.Entrys around.
So I was wondering: Can you suggest a more memory-efficient data structure to somehow share the identical column values but that would allow me to keep the external List<Map<String, String>> interface the same?
We already have guava on the class path so using collections from guava is fine.
I have found GS-Collections to be much better suited for memory efficient Maps/Sets. They get around a lot of the overhead of storing map entry objects by using some clever tricks with arrays behind the scenes.

Java JSON object insertion order maintenance

I am using a Java JSON object to store some data. But when I printed it, I found that it stores the data randomly. For example, I stored data like this:
obj.put("key1","val1");
obj.put("key2","val2");
And when I printed it:
{"key2":"val2","key1":"val1"}
I googled it and found that a JSON object is an unordered set of key/value pairs, so it doesn't preserve the order of the data.
I need some help storing data in a JSON object in a way that keeps its order.
Arrays are ordered so use an array of key-value objects [ {key1: val1}, {key2: val2} ]

Java solution for hashing lines that contain variables in .csv

I have a file that represents a table recorded in .csv or a similar format. The table may include missing values.
I am looking for a solution (preferably in Java) that processes the file incrementally, without loading everything into memory, as the file can be huge. I need to identify duplicate records in the file, with the ability to specify which columns to exclude from consideration, and then produce output that groups those duplicate records. I would add an extra value at the end of each record with a group number and write the output in the same format (.csv), sorted by group number.
I hope an effective solution can be found with some hashing function. For example, read all lines and store a hash value with each line number, where the hash is calculated from the set of variables I provide as input.
Any ideas?
Ok, here is the paper that holds the key to the answer: P. Gopalan & J. Radhakrishnan "Finding duplicates in a data stream".
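Separately from the paper, the plain hashing idea described in the question can be sketched as follows. This is not the algorithm from that paper, just a minimal illustration under some loudly stated assumptions: cells never contain commas or quotes (so a simple split is enough), excluded columns are given as zero-based indexes, and the class and method names (DuplicateGrouper, assignGroups) are hypothetical. It streams the input line by line and keeps only one digest per distinct record in memory. Sorting the output by group number is left to a separate step, such as an external sort.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class DuplicateGrouper {
    /**
     * Copies each input line to out with a group number appended.
     * Lines that agree on every non-excluded column get the same group number.
     */
    static void assignGroups(BufferedReader reader, Set<Integer> excludedColumns, Writer out)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        Map<String, Integer> groupByDigest = new HashMap<>(); // one entry per distinct record
        String line;
        while ((line = reader.readLine()) != null) {
            // Assumption: no quoted or embedded commas. The -1 keeps trailing empty cells (missing values).
            String[] cells = line.split(",", -1);
            StringBuilder key = new StringBuilder();
            for (int i = 0; i < cells.length; i++) {
                if (!excludedColumns.contains(i)) {
                    // NUL separator, assumed never to appear inside a cell
                    key.append(cells[i]).append('\u0000');
                }
            }
            byte[] digest = sha256.digest(key.toString().getBytes(StandardCharsets.UTF_8));
            String digestText = Base64.getEncoder().encodeToString(digest);
            Integer group = groupByDigest.get(digestText);
            if (group == null) {
                group = groupByDigest.size() + 1; // next unused group number
                groupByDigest.put(digestText, group);
            }
            out.write(line + "," + group + "\n");
        }
    }

    public static void main(String[] args) throws Exception {
        String csv = "1,alice,NY\n2,bob,LA\n3,alice,NY\n";
        StringWriter out = new StringWriter();
        // Column 0 (an id) is excluded, so rows 1 and 3 count as duplicates.
        assignGroups(new BufferedReader(new StringReader(csv)), Collections.singleton(0), out);
        System.out.print(out);
        // 1,alice,NY,1
        // 2,bob,LA,2
        // 3,alice,NY,1
    }
}
```

Memory use here grows with the number of distinct records rather than with the file size, which may or may not be acceptable for a truly huge file.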

Old values in hash map being overwritten by new values?

I have one hash map. I'm storing 12 different key/value pairs in it.
The first 8 values are stored fine, but when I try to put the 9th value it overwrites an old value. The size still increases, though.
If I try to get the old values, I get nulls. I have also checked the hash map's table. Only 8 values are there. The old values have been overwritten.
Here the table has only 7 values but the size is 9. How is that possible?
What could I be doing wrong?
Make sure you use different keys. If you do, make sure equals and hashCode for your key class work as required, i.e. when two objects are equal, their hash codes must be the same. And of course, equals for different key values (or what you'd expect to be distinct keys) must return false.
If that doesn't help, post a minimal, yet complete (compilable) example that demonstrates your problem.
As for size=9 but only 7 values in the table: you are misunderstanding the internal workings of the HashMap. Not all values are stored in the top-level table. The table is more like a set of "buckets", each holding the entries whose hash codes map to that bucket's index. Each bucket holds a chain of linked entries, so what you are seeing in the table are just the first entries in each chain. The size is always correct, though, in terms of the total number of entries in the map.
As for entries overwriting each other, that happens only when you put an entry with a key that is identical (hashCode and equals) to an existing entry. So you are either adding with an existing key, or you are adding with null as the key (null is permissible as a key, but you can only have one entry with the key null).
Check your code, are you adding with null keys? If you are using instances of a custom class (one you created yourself) as key, have you implemented hashCode() and equals() according to the specifications (see http://download.oracle.com/javase/6/docs/api/java/lang/Object.html#hashCode%28%29)? Are you making sure that you are really using unique keys for all 12 put operations?
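As a rough illustration of the hashCode()/equals() point, here is a small sketch. The key classes GoodKey and IdentityKey are invented for the example and are not taken from the question. The first defines both methods over the same fields. The second overrides neither, so it falls back to the identity-based behaviour of Object.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class KeyContractDemo {
    /** Hypothetical key class: equals and hashCode are defined over the same fields. */
    static final class GoodKey {
        private final String name;
        private final int id;

        GoodKey(String name, int id) {
            this.name = name;
            this.id = id;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof GoodKey)) return false;
            GoodKey other = (GoodKey) o;
            return id == other.id && name.equals(other.name);
        }

        @Override
        public int hashCode() {
            return Objects.hash(name, id);
        }
    }

    /** Hypothetical key class that inherits identity-based equals and hashCode from Object. */
    static final class IdentityKey {
        final String name;

        IdentityKey(String name) {
            this.name = name;
        }
    }

    public static void main(String[] args) {
        Map<GoodKey, String> good = new HashMap<>();
        good.put(new GoodKey("a", 1), "first");
        good.put(new GoodKey("a", 1), "second"); // equal key: the value is overwritten, size stays 1
        System.out.println(good.size() + " " + good.get(new GoodKey("a", 1))); // 1 second

        Map<IdentityKey, String> identity = new HashMap<>();
        identity.put(new IdentityKey("a"), "first");
        // A different instance is a different key here, so the lookup finds nothing.
        System.out.println(identity.get(new IdentityKey("a"))); // null
    }
}
```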
