Sorting BIG Data XML file - java

I have an XML file that has a compressed size of about 100 GB (uncompressed 1 TB). This file contains about 100 million entries in the following way:
<root>
<entry>
<id>1234</id>
...
</entry>
<entry>
<id>1230</id>
...
</entry>
</root>
I would like to sort this file by id. What would be a good way to do so?
By the way, I can use a machine with 16 cores and 128 GB RAM.

You could think about using a streaming processor like Saxon (http://www.saxonica.com/html/documentation/sourcedocs/streaming/) and sorting with XSLT.
Another option is to store the data as key/value pairs in a database, order them with SQL, and then recreate the XML. You would be leveraging the power of the database to manage large amounts of data.
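A rough sketch of that route, assuming the xerial SQLite JDBC driver (the "jdbc:sqlite:..." URL) and that each entry's id and serialized XML have already been pulled out by a streaming parser:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class SortViaDatabase {
    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection("jdbc:sqlite:entries.db")) {
            try (Statement s = db.createStatement()) {
                s.execute("CREATE TABLE IF NOT EXISTS entry (id INTEGER, xml TEXT)");
                s.execute("CREATE INDEX IF NOT EXISTS idx_entry_id ON entry (id)");
            }
            // During the streaming pass: one row per <entry>.
            try (PreparedStatement ins =
                     db.prepareStatement("INSERT INTO entry (id, xml) VALUES (?, ?)")) {
                ins.setLong(1, 1234L);                      // id parsed from the entry
                ins.setString(2, "<entry>...</entry>");     // the serialized entry
                ins.executeUpdate();
            }
            // Afterwards: stream the rows back out in id order.
            try (Statement s = db.createStatement();
                 ResultSet rs = s.executeQuery("SELECT xml FROM entry ORDER BY id")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));    // wrap in <root>...</root> on output
                }
            }
        }
    }
}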
Similar question (not same): Sort multigigabyte xml file

Because the values (i.e. the ids) are natural numbers, a good algorithm to sort them is counting sort, which runs in Θ(n + k) time.
Suppose the values are in the range [1..k].
Counting sort, written out as a Java method (values assumed to lie in 1..k):
static int[] countingSort(int[] a, int k) {
    int n = a.length;
    int[] c = new int[k + 1];                        // C[1..k]: value counts
    int[] b = new int[n];                            // sorted output
    for (int i = 0; i < n; i++) c[a[i]]++;           // count each value
    for (int i = 2; i <= k; i++) c[i] += c[i - 1];   // prefix sums give final positions
    for (int i = n - 1; i >= 0; i--) {               // backwards pass keeps equal keys stable
        b[c[a[i]] - 1] = a[i];
        c[a[i]]--;
    }
    return b;
}
This algorithm is stable.
You can also use radix sort, which has the same asymptotic complexity.

At this stage it's useful to remember the techniques that people used to sort magnetic tapes or decks of punched cards, in the days when the data was much larger than available direct access memory. (I once watched a team of operators sort a quarter of a million cards - about 120 trays). You basically need a combination of streaming, merging, and splitting, which are all operations available in principle using XSLT 3.0. There are two processors available, Saxon-EE and Exselt, and neither is yet a 100% complete implementation, so you'll be constrained by the limitations of the products more than the spec.
My instinct would be to go for a digit-by-digit sort. You don't say how long the ids used as sort keys are. "Digits" here of course doesn't have to mean decimal digits, but assuming decimal for simplicity, the basic idea is that you first split the file into 10 buckets based on the last digit of the sort key, then process the buckets in sequence in that order, this time sorting by the penultimate digit, and carry on for as many digits as there are in the key: one pass over the complete file for each digit in the sort key.
If the ids are dense, then with 100m keys they are presumably about 8 digits long, so that would be 8 passes. If we assume a processing speed of 10Gb/min, which is probably the best you can get from off-the-shelf XML parsers, each pass over a 1Tb file is going to take about 2 hours, so 8 passes would be 16 hours. But it might be better to use, say, base 100, so that you split into 100 files on each pass; then you only need 4 passes.
The essential XSLT 3.0 code is:
<xsl:stream href="in.xml">
  <xsl:fork>
    <xsl:for-each-group select="record"
                        group-by="substring(key, $digit, 1)">
      <xsl:result-document href="temp{current-grouping-key()}">
        <xsl:sequence select="current-group()"/>
      </xsl:result-document>
    </xsl:for-each-group>
  </xsl:fork>
</xsl:stream>
Now the bad news: in Saxon-EE 9.7 this code probably isn't sufficiently optimised. Although in principle the items in each group could be streamed directly to the relevant serialised result-document, Saxon doesn't yet treat this case specially and will build each group in memory before processing it. I don't know if Exselt can do any better.
So is there an alternative? Well, perhaps we could try something like this:
Split the file into N files: that is, put the first X/N items into file 1, the next X/N into file 2, and so on.
Sort each file, conventionally in memory.
Do a streamed merge of the resulting files, using xsl:merge.
I think that would work in Saxon. The first step can be done using <xsl:for-each-group group-adjacent="(position()-1) idiv $N"> which is fully streamed in Saxon.
This is essentially a 3-pass solution, in that each item is parsed and serialized three times. I would go for splitting the 1Tb file into 100 10Gb files. Doing an in-memory XSLT sort of a 10Gb file is pushing it, but you've got some horsepower to play with. You could however run into Java addressing limits: arrays and strings have 1G limits, I think.
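For the merge step, here is a rough Java sketch of the streamed k-way merge, under my own assumption (not part of the answer) that each sorted chunk was written as one entry per line in the form id<TAB>serialized-entry; in practice you would drive this with xsl:merge or a StAX reader per chunk:
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

public class MergeSortedChunks {

    // One cursor per sorted chunk file; `line` is the chunk's current record.
    // (Readers are left open for brevity.)
    static final class Cursor {
        final BufferedReader reader;
        String line;                                   // null once the chunk is exhausted
        Cursor(Path p) throws IOException {
            reader = Files.newBufferedReader(p, StandardCharsets.UTF_8);
            line = reader.readLine();
        }
        long id() { return Long.parseLong(line.substring(0, line.indexOf('\t'))); }
        String entry() { return line.substring(line.indexOf('\t') + 1); }
        void advance() throws IOException { line = reader.readLine(); }
    }

    public static void main(String[] args) throws IOException {
        // Min-heap ordered by each chunk's current id.
        PriorityQueue<Cursor> heap = new PriorityQueue<>(Comparator.comparingLong(Cursor::id));
        for (String chunk : args) {
            Cursor c = new Cursor(Paths.get(chunk));
            if (c.line != null) heap.add(c);
        }
        try (Writer out = Files.newBufferedWriter(Paths.get("sorted.xml"), StandardCharsets.UTF_8)) {
            out.write("<root>\n");
            while (!heap.isEmpty()) {
                Cursor c = heap.poll();                // chunk holding the smallest current id
                out.write(c.entry());
                out.write("\n");
                c.advance();
                if (c.line != null) heap.add(c);       // re-insert keyed on its next id
            }
            out.write("</root>\n");
        }
    }
}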

Related

How to reduce memory usage for a HashMap<String, Integer> like data structure

Before explaining my problem, I should mention that I am not looking for a way to increase the Java heap memory; I strictly need to store these objects in memory.
I am working on storing a huge number (5-10 GB) of DNA sequences and their counts (Integer) in a hash table. The DNA sequences (with length 32 or less) consist of the chars 'A', 'C', 'G', 'T', and 'N' (undefined). As we know, when storing a large number of objects in memory, Java has poor space efficiency compared to lower-level languages like C and C++. So if I store these sequences as strings (it takes about 100 MB of memory for sequences of length ~30), I hit the out-of-memory error.
I tried representing the nucleic acids as 'A'=00, 'C'=01, 'G'=10, 'T'=11 and neglecting 'N' (because it ruins the char-to-2-bit transform as the 5th acid), then concatenating these 2-bit acids into a byte array. That brought some improvement, but unfortunately I see the error again after a couple of hours. I need a convenient solution, or at least a workaround, to handle this error. Thank you in advance.
This may be a weird idea, and it would require quite a lot of work, but here is what I would try:
You already pointed out two individual subproblems of your overall task:
the default HashMap implementation may be suboptimal for such large collection sizes
you need to store something else than strings
The map implementation
I would recommend writing a highly tailored hash map implementation for the Map<String, Long> interface. Internally you do not have to store strings. Unfortunately 5^32 > 2^64, so there is no way to pack your whole string into a single long; let's stick with two longs per key. You can do the string-to-long[2] conversion (and back) fairly efficiently on the fly whenever a string key is passed to your map implementation (use bit shifts etc.).
As for packing the values, here are some considerations:
For a key-value pair, a standard hash map needs an array of N bucket slots, where N is the current capacity, and once a bucket is found from the hash code it needs a linked list of key-value pairs to resolve keys that produce identical hash codes. For your specific case you could try to optimize it in the following way (a simplified sketch follows this list):
use a long[] of size 3N, where N is the capacity, to store both keys and values in one contiguous array
in this array, at locations 3 * (hashcode % N) and 3 * (hashcode % N) + 1, you store the long[2] representation of the first key that maps to this bucket (or of the only one); the slots are zero while the bucket is empty. At location 3 * (hashcode % N) + 2 you store the corresponding count
for all those cases where a different key results in the same hash code, and thus the same bucket, you store the data in a standard HashMap<Long2KeyWrapper, Long>. The idea is to keep the capacity of the array mentioned above (and resize it correspondingly) large enough that by far the largest part of the data sits in that contiguous array and not in the fallback hash map. This will dramatically reduce the storage overhead of the hash map
do not expand the capacity in N=2N steps; make smaller growth steps, e.g. 10-20%. This will cost performance while populating the map, but will keep your memory footprint under control
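A much simplified sketch of that layout (no resizing, and a small Java record standing in for the Long2KeyWrapper; both simplifications are mine, not part of the suggestion above):
import java.util.HashMap;
import java.util.Map;

public class PackedCountMap {
    record Key2(long hi, long lo) {}                  // fallback key for colliding entries

    private final int capacity;
    private final long[] slots;                       // 3 longs per bucket: key high, key low, count
    private final boolean[] used;                     // whether the bucket's inline slot is taken
    private final Map<Key2, Long> overflow = new HashMap<>();

    public PackedCountMap(int capacity) {
        this.capacity = capacity;
        this.slots = new long[3 * capacity];
        this.used = new boolean[capacity];
    }

    public void increment(long hi, long lo) {
        int b = bucket(hi, lo);
        if (!used[b]) {                               // empty bucket: claim the inline slot
            used[b] = true;
            slots[3 * b] = hi;
            slots[3 * b + 1] = lo;
            slots[3 * b + 2] = 1;
        } else if (slots[3 * b] == hi && slots[3 * b + 1] == lo) {
            slots[3 * b + 2]++;                       // same key already stored inline
        } else {
            overflow.merge(new Key2(hi, lo), 1L, Long::sum);   // collision: fall back
        }
    }

    public long count(long hi, long lo) {
        int b = bucket(hi, lo);
        if (used[b] && slots[3 * b] == hi && slots[3 * b + 1] == lo) return slots[3 * b + 2];
        return overflow.getOrDefault(new Key2(hi, lo), 0L);
    }

    private int bucket(long hi, long lo) {
        return (Long.hashCode(hi * 31 + lo) & 0x7fffffff) % capacity;
    }
}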
The keys
Given the inequality 5^32 > 2^64, your idea of encoding the five letters in bits seems to be the best I can think of right now. Use 3 bits per letter and, correspondingly, long[2].
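A minimal sketch of that packing: 3 bits per base, 21 bases per long, so up to 32 bases fit comfortably in a long[2]; code 0 is left free to mean "no base":
// Pack a DNA sequence of up to 32 characters into two longs, 3 bits per base.
static long[] packKey(String seq) {
    long[] key = new long[2];
    for (int i = 0; i < seq.length(); i++) {
        int code;
        switch (seq.charAt(i)) {
            case 'A': code = 1; break;
            case 'C': code = 2; break;
            case 'G': code = 3; break;
            case 'T': code = 4; break;
            case 'N': code = 5; break;
            default: throw new IllegalArgumentException("unexpected base: " + seq.charAt(i));
        }
        int word = i / 21;                 // 21 bases * 3 bits = 63 bits per long
        int shift = (i % 21) * 3;
        key[word] |= ((long) code) << shift;
    }
    return key;
}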
I recommend you look into the Trove4j Collections API; it offers Collections that hold primitives which will use less memory than their boxed, wrapper classes.
Specifically, you should check out their TObjectIntHashMap.
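For counting occurrences, usage looks roughly like this (assuming Trove 3.x, where adjustOrPutValue bumps an existing count or inserts the given one):
import gnu.trove.map.hash.TObjectIntHashMap;

public class TroveCountExample {
    public static void main(String[] args) {
        TObjectIntHashMap<String> counts = new TObjectIntHashMap<>();
        counts.adjustOrPutValue("ACGTACGT", 1, 1);   // +1 if present, otherwise insert 1
        counts.adjustOrPutValue("ACGTACGT", 1, 1);
        System.out.println(counts.get("ACGTACGT"));  // 2, as a primitive int (no boxing)
    }
}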
Also, I wouldn't recommend storing anything as a String or char[] until JDK 9 is released, as the backing char array of a String is UTF-16 encoded, using two bytes per char. JDK 9 introduces compact strings, which store Latin-1-compatible text with one byte per character.
If you're using on the order of ~10gb of data, or at least data with an in memory representation size of ~10gb, then you might need to think of ways to write the data you don't need at the moment to disk and load individual portions of your dataset into memory to work on them.
I had this exact problem a few years ago when I was conducting research with Monte Carlo simulations, so I wrote a Java data structure to solve it. You can clone/fork the source here: github.com/tylerparsons/surfdep
The library supports both MySQL and SQLite as the underlying database. If you don't have either, I'd recommend SQLite as it's much quicker to set up.
Full disclaimer: this is not the most efficient implementation, but it will handle very large datasets if you let it run for a few hours. I tested it successfully with matrices of up to 1 billion elements on my Windows laptop.

Data structure choice for ngrams upto length 5, when building count-based distributional model

I am building a distributional model (count based) from text. Basically, for each n-gram (a sequence of words) I have to store a count, and I need reasonably quick access to that count. For n=5, there are technically (10^4)^5 possible 5-grams even with a conservative vocabulary estimate of 10k words, which is far too many. But many combinations of these n-grams won't occur in the text, so a 5-dimensional-array kind of structure is out of consideration.
I built a trie where each word is a node, so this trie is really wide, with a maximum depth of 5. That gave me considerable memory savings, but I still run out of memory (64GB) after training on enough files. To be fair, I am not using any super-efficient Java practices here. Each node has a count and the index of its word as an int, and I then have a HashMap to store the children. I initially started with a list and tried to keep it sorted each time I added a child, but I was losing a lot of time there, so I moved to a HashMap. Even with a list, I would run out of memory after reading some more files.
So I guess I need to divide my task into parts and store each part to disk. But ultimately, when accessing the counts, I would need to merge these data structures. So I think the way forward is a disk-based solution, where I know which file to access for n-grams that start with something (some sort of ordering). As I see it, the problem with the trie is that it's not very efficient when it comes to merging: I would need to load two parts into memory to merge them, and that wouldn't really work.
What approach would you recommend? I looked into a HashMap encoding based structure for language models (like the one berkeleylm uses). But in their use case, they don't need to reconstruct the ngram, so they just hash it and store the hashvalue as context. I need to be able to access the context later.
Any suggestions? Is there any value in using a database? Can they do it without being in-memory?
I wouldn't use a HashMap; it's quite memory intensive. A simple sorted array should be better; you can then use binary search on it.
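For instance, assuming each word has already been mapped to an int id (as in your trie), the 5-grams can be stored as fixed-width records in one sorted int[] with a parallel array of counts, and lookups become a binary search over records:
// Sketch only: `keys` holds numNgrams records of N word ids each, sorted lexicographically.
final class SortedNgramCounts {
    static final int N = 5;
    private final int[] keys;     // length = numNgrams * N
    private final int[] counts;   // length = numNgrams

    SortedNgramCounts(int[] sortedKeys, int[] counts) {
        this.keys = sortedKeys;
        this.counts = counts;
    }

    // Returns the stored count for the given 5-gram, or 0 if it is absent.
    int count(int[] ngram) {
        int lo = 0, hi = counts.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = compare(mid, ngram);
            if (cmp < 0) lo = mid + 1;
            else if (cmp > 0) hi = mid - 1;
            else return counts[mid];
        }
        return 0;
    }

    private int compare(int record, int[] ngram) {
        int base = record * N;
        for (int i = 0; i < N; i++) {
            int c = Integer.compare(keys[base + i], ngram[i]);
            if (c != 0) return c;
        }
        return 0;
    }
}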
Maybe you could also try a binary prefix trie. First you create a single char string, for example by interleaving the letters of the words into a single string (I suppose you could also concatenate them, separated by a blank). This long string could then be stored in a binary trie. See CritBit1D for an example.
You could also use a multi-dimensional tree. Many trees are limited to 64-bit numbers, but you could turn the eight leading ASCII characters of every word into a 64-bit integer and then store that as a 5D key. That should be much more efficient than a 5D array. Multi-dimensional indexes include kd-trees, R-trees and quadtrees. The 5-gram count and the full 5-gram (including the remaining characters) can be stored separately in the VALUE associated with each 5D key.
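A small sketch of that 8-character packing (the helper name is mine; shorter words are zero-padded):
// Turn the first eight ASCII characters of a word into one 64-bit key.
static long leading8ToLong(String word) {
    long key = 0;
    for (int i = 0; i < 8; i++) {
        key <<= 8;
        if (i < word.length()) key |= (word.charAt(i) & 0x7F);   // keep it in the ASCII range
    }
    return key;
}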
If you are using Java you could try my very own tree. It's a prefix-sharing bitwise quadtree. It is very memory efficient, very well suited to larger datasets (1M entries upwards) and works natively with 'integer' rather than 'float'. It also has very good nearest neighbour search.

Huffman Code Decoder Encoder In Java Source Generation

I want to create a fast Huffman code decoder in Java and therefore thought about lookup tables. Since those tables consume memory, and we use Java code to navigate and access them, one could just as easily (or not) write a program/method that expresses the same table.
The problem with that approach is that I don't know what the best strategy is. I know it is a lot about what fits in the cache and about branch prediction. Also, the switch-case implementation, meaning the actual assembly, is beyond me. If I have an in-memory lookup table (or a hierarchy of them) I will be able to simply jump in and out, but I doubt that for my purposes such a table would fit in the cache.
Since I actually walk a tree, one could implement it as if/else statements requiring a certain number of comparisons, but each comparison would need additional binary operations.
So the following options exist:
General Algorithm using in Memory lookup tables
If/else representation of the decision tree
If/else representation with small switch statements to find the correct group of symbols (same bit-pattern length) (fewer if statements, might be more code).
Switch statement representation of the code
Writing and benchmarking is quite tricky so any initial thoughts would be great.
One additional problem that comes into play is the order of bits. The most significant bit always comes first, meaning it is stored in reverse order.
If your tree is A = 0, B = 10, C = 11, then to write BAC it would actually be 01 + 0 + 11 (plus means append).
So the code actually has to be written in reverse order. Using the if/else or switch approach for groups, this would not be a problem, since masking out the bits is simple and reversing a bit pattern is certainly possible, but it would lose the idea of getting the index within the group out of the mask, since in reverse bit order add and remove have different meanings and a simple lookup is not possible.
Reversing the bits is a costly operation (I use 4-bit lookup tables), not outweighing the performance penalty of the extra binary operations.
But reversing the bits on the fly is better suited for this and requires four operations per bit (shifting up, masking out, adding, and also shifting the input down). Since I read bits ahead, all those operations are done in registers, so they might take only a few cycles.
This way I can use switch, subtraction and if to find the right symbol group and also to return the symbols.
So finally I need advice. Since my codes are global for language processing, they can be hardwired (i.e. be in source).
I wonder what parser generators like ANTLR use to express those decisions. Since they also seem to switch or if/else based on the input symbol, it might give me a clue.
[Updates]
I found a simplification that avoids the reverse-bit problem but still adds a cost per group: I end up writing the bits in the order of the groups to traverse, so I do not need four modifications per bit, only per group (groups have different bit lengths).
For each group we have:
1. The value of the first element and the size (and therefore the value of the last element within that group).
Therefore for each group the algorithm looks like:
1. Read m bits and combine them with the value read so far.
2. Compare the value with the last value of that group; if it is smaller, it is within that group, if not, it is outside -> read the next group.
3. If it is inside the group, an array of values can be accessed, or a switch statement can be used.
This is totally generic and can be used without loops, making it efficient. Also, once the group has been detected, the bit length of the code is known, and the bits can be consumed from the source, since the code looks far ahead (reading from the stream).
[Update 2]
To access the actual value one could use a single big array of elements grouped by group. Since the probability decreases from group to group, it is very likely that a significant part fits in the L2 or L1 cache, speeding up access here.
Or one uses switch statements.
[Update 3]
Depending on the cases of a switch, the compiler generates either a tableswitch or a lookupswitch. The lookupswitch has a complexity of O(log n) and stores key/jump-offset pairs, which is not preferable. Therefore checking for groups is better suited to if/else.
The tableswitch itself uses only a table of jump offsets, and it only takes a subtract, compare, access and jmp to reach the destination; then it must execute a return of a constant value.
Therefore a table access looks more promising. Also, to avoid an unnecessary jump, each group might contain the logic to access and return its symbol table. Storing everything in one big table is promising, since it might be an int or short per symbol, and my codes often have only 1000 to 4000 symbols at most, so a short would actually do.
I will check whether a 1-pattern gives me the opportunity to store and access the masks in a better way, allowing binary search for the correct group instead of advancing in O(n), and it might even avoid any shift operations at all during processing.
I couldn't make sense of most of what you wrote in your (long) question, but there is a simple approach.
We'll start with a single table. Let's say your longest Huffman code is 15 bits. (In fact, deflate limits the size of its Huffman codes to 15 bits.) Then construct a table with 32768 entries, where each entry is the number of bits in the next code, and the symbol for that code. For codes less than 15 bits, there is more than one entry in the table for the same code. E.g. if the code is 0010110 (7 bits) for the symbol 'C', then all of the indexes of the table xxxxxxxx0010110 have the same thing. Those entries all have {7, 'C'}.
Then you get 15 bits from the stream and look up the next code in the table. You remove from the stream the number of bits given by that table entry, and use the entry's symbol. Now you get as many bits from the stream as you need to have 15 again, and repeat. So if you used 7 bits, then get 8 more to get back to 15 and look up the next code.
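A bare-bones Java sketch of that single-table scheme, assuming (as in the example above) that each code occupies the low-order bits of the 15-bit window; building the canonical codes and the bit reader are left out:
public class FlatHuffmanTable {
    static final int MAX_BITS = 15;
    // Each entry packs (symbol << 4) | codeLength; 0 means "no code here".
    final int[] table = new int[1 << MAX_BITS];

    // codes[s] is the code for symbol s, held in its low lengths[s] bits;
    // lengths[s] == 0 means symbol s is unused.
    FlatHuffmanTable(int[] codes, int[] lengths) {
        for (int sym = 0; sym < codes.length; sym++) {
            int len = lengths[sym];
            if (len == 0) continue;
            // Every index whose low `len` bits equal the code gets the same entry.
            for (int idx = codes[sym]; idx < table.length; idx += (1 << len)) {
                table[idx] = (sym << 4) | len;
            }
        }
    }

    // `window` holds the next 15 input bits with the next bit in its lowest position.
    // symbol() returns the decoded symbol; bitsUsed() says how many bits to drop.
    int symbol(int window)   { return table[window & ((1 << MAX_BITS) - 1)] >>> 4; }
    int bitsUsed(int window) { return table[window & ((1 << MAX_BITS) - 1)] & 0xF; }
}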
The next subtlety is that if your Huffman code changes often, you might end up spending more time filling up that large table for each new Huffman code than you spend actually decoding. To avoid that, you can make a two-level table which has, say, a 9-bit lookup (512 entries) for the first portion of the code. If the code is 9-bits or less, then you proceed as above. That will be the most common case, since shorter codes are more frequent (that being the whole point of Huffman coding). If the table entry says that there are 10 or more bits in the code (and you don't know yet how much more), then you consume the first nine bits and go to a second-level table for those initial nine bits pointed to by the entry in the first table, that has entries for the remaining six bits (64 entries). That resolves the remainder of the code and so tells you how many more bits to consume and what the symbol is. This approach can greatly reduce the time spent filling tables, and is very nearly as fast since short codes are more common. This is the approach used by inflate in zlib.
In the end it was quite simple. I support almost all solutions now. One can test every symbol group (same bit length), use a lookup table (10 bit + 10 bit + 10 bit: just tables of 10 bits, where symbol count + 1 is the reference to those tables), and generate Java (and, if needed, JavaScript; currently I use GWT to translate it).
I even use long reads and shift operations to reduce the accesses to the binary information. This way the code gets more efficient, since I only support a maximum bit size of 20 bits (so a table of a table), which makes 2^20 symbols and therefore at most a million.
For the ordering I use a generator for the bit masks that just uses shift operations, with no requirement to reverse bit orders or the like.
The table lookups can also be expressed in Java by storing the tables as arrays of arrays (it's interesting how big the Java files can get before the compiler complains).
Also I found it interesting that, since comparison expresses an ordering (a partial order, I guess), one can sort the symbols and, instead of mapping the symbols, map the comparison index. By comparing two indexes one can simply sort streams of codes without touching much else. By also storing the first one or two comparison indexes (16 or 32 bits), one can efficiently sort, and therefore binary-sort, compressed strings using the same Huffman code, which makes it ideal for compressing strings in a certain language.

How to compress many strings across a data structure?

I have a 500GB collection of XML documents that I'm indexing. I'm currently only able to index 6GB of this collection with 32GB of RAM.
My index structure is a HashMap<String, PatriciaTrie<String, Integer>>, where the first string represents a term and the second string is of the format filepath+XPath, with the final integer representing the number of occurrences.
I used a trie to exploit the shared prefixes and because I need the data sorted. It helped a little with compression, but it wasn't enough.
The total collection of filepath+XPath strings is somewhere between 1TB and 4TB within this data structure. I need to be able to compress this data structure entirely into memory. The target machine has 256GB RAM and 16 CPU cores. Less memory has multiple added benefits (such as reducing cold start time). Index time isn't such a big deal.
The XPaths represent about 250 total node types.
The approach I'm currently working on will build a Huffman table for each series of 2 tags, based on the tags that can possibly occur next. Often, this cuts the options down to about 4 or 5, which allows the XPath to be encoded into a much shorter bitstring, which can then be encoded as bytes.
The strings are typically 40-600 bytes (UTF-8), and I believe this should reduce everything after the filepath prefix (the first 40 characters, which are compressed by the trie) to at most 12 bytes for the structure (the deepest point in the tree is about 12 nodes deep, and each node takes at worst 1 char to represent) and 12 bytes for the indexes (variable-byte encoding, with very few elements containing indexes above 256), producing strings that are usually in the range of 40-64 bytes.
I think this is a good approach, but I think I may be missing something.
Is there a better approach for compressing this data structure or the data that goes into it?
How do people usually compress many strings across the same data structure?
Is there any existing solution that compresses many strings independently based on the whole collection?
After the strings are in the data structure like this, are there any good techniques for compressing the tries based on the structure shared between them?
I think your biggest problem here is that you're storing too much data for each term. You don't say how many unique terms you have or how many individual files, but I'll give some example numbers.
Say you have 200,000 unique terms across 200 different files. So every unique term carries the weight of at least one file path, or 40 bytes. And that's before you start indexing anything.
You should be able to compress this data into a table of filepath+Xpath strings, and a list of terms, each of which contains references to entries in that table. So, for example, you might have:
Path table:
index Path
1 file+xpath1
2 file+xpath2
3 file+xpath3
...
999 file+xpath999
Terms
term references
foo 1, 19, 27, 33, 297
bar 99, 864, 865
...
Now, your paths table is probably still way too large. The first thing you can do is build a files table and make the first part of each paths entry an index into the files table. So you end up with:
Files
1 file1.xml
2 file2.xml
...
999 file999.xml
And then your paths become:
1 1,xpathA
2 1,xpathB
3 2,xpathQ
...
If you need more compression after that, build a string table that contains the xpath terms, and your paths entries become a series of indexes into that table. You have to be careful here, though, because allocation overhead for arrays or lists is going to make short lists very expensive. If you go this route, then you'll want to encode the paths list as one big binary array, and index into it. For example:
Words list
1 the
2 quick
3 brown
4 fox
Paths
index path
0 1(index of file),2(quick),4(fox),-1(terminator)
4 3(index of file),3(brown),-1(terminator)
7 etc . . .
The Paths table is just a big array that would look like this:
1,2,4,-1,3,3,-1,...
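For concreteness, a small sketch of reading one path back out of that flat array (array and table names are mine; index 0 of `files` and `words` is unused because the example indexes start at 1):
// start = offset of the path in the flat array, e.g. 0 or 4 in the table above.
static String decodePath(int[] paths, int start, String[] files, String[] words) {
    StringBuilder sb = new StringBuilder(files[paths[start]]);   // first slot: file index
    for (int i = start + 1; paths[i] != -1; i++) {               // read until the -1 terminator
        sb.append('/').append(words[paths[i]]);
    }
    return sb.toString();
}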
This minimizes data storage cost because no string is ever stored more than once. All you have is string tables and references to those strings. The amount of space it takes will be something like:
Combined length of all file names
Combined length of all path segment terms
(number of paths) * (average path length) * (size of integer index)
(number of terms) * (average number of references per term) * (size of integer index)
Building this in memory might be possible. It's hard to say without knowing how many individual terms you have. You'll need dictionaries for the file names, the paths, and the individual path segments if you use the words list. But it can all be done in a single pass if you have the memory.
If you don't have enough memory for the whole tree while you're building, you can load the file names and maintain the paths table in memory. As you find each term in a file, write it to disk along with its path reference. You end up with a disk file that looks like:
term, path reference
term, path reference
...
Use an external sort program to sort by term, and then go through and combine duplicates. When you're done you end up with a file that contains:
File names table
Path segments table
Paths
terms
Lookup is really easy. Find the term, look up each reference in the paths table, and decode the path by indexing into the file names and path segments tables.
I used something like this a few years back and it worked quite well. You should be able to write a program that analyzes your data to come up with the numbers (unique paths, number of file names, average number of references per term, etc.). From there, you can easily determine if using this technique will work for you.

Way to store a large dictionary with low memory footprint + fast lookups (on Android)

I'm developing an Android word game app that needs a large (~250,000 word) dictionary available. I need:
reasonably fast look ups e.g. constant time preferable, need to do maybe 200 lookups a second on occasion to solve a word puzzle and maybe 20 lookups within 0.2 second more often to check words the user just spelled.
EDIT: Lookups are typically asking "Is this word in the dictionary?". I'd like to support up to two wildcards in the word as well, but this is easy enough by just generating all the possible letters the wildcards could have been and checking the generated words (i.e. 26 * 26 lookups for a word with two wildcards).
as it's a mobile app, using as little memory as possible and requiring only a small initial download for the dictionary data is top priority.
My first naive attempt used Java's HashMap class, which caused an out-of-memory exception. I've looked into using the SQLite databases available on Android, but this seems like overkill.
What's a good way to do what I need?
You can achieve your goals with more lowly approaches as well... if it's a word game then I suspect you are handling a 27-letter alphabet. So suppose an alphabet of not more than 32 letters, i.e. 5 bits per letter. You can then cram 12 letters (12 x 5 = 60 bits) into a single Java long using a trivial 5-bits-per-letter encoding.
This means that if you don't have words longer than 12 letters, you can just represent your dictionary as a set of Java longs. If you have 250,000 words, a trivial presentation of this set as a single sorted array of longs should take 250,000 words x 8 bytes/word = 2,000,000 bytes ~ 2MB of memory. Lookup is then by binary search, which should be very fast given the small size of the data set (fewer than 20 comparisons, as 2^20 takes you above one million).
If you have words longer than 12 letters, I would store the >12-letter words in another array, where one word would be represented by 2 concatenated Java longs in the obvious manner.
NOTE: the reason why this works, and is likely more space-efficient than a trie and at least very simple to implement, is that the dictionary is constant... search trees are good if you need to modify the data set, but if the data set is constant you can often get away with a simple binary search.
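A minimal sketch of that scheme, assuming lowercase a-z words of at most 12 letters (class and method names are mine):
import java.util.Arrays;

public class PackedDictionary {
    private final long[] words;              // sorted packed words, built once

    public PackedDictionary(String[] dictionary) {
        words = new long[dictionary.length];
        for (int i = 0; i < dictionary.length; i++) words[i] = pack(dictionary[i]);
        Arrays.sort(words);
    }

    public boolean contains(String word) {
        return word.length() <= 12 && Arrays.binarySearch(words, pack(word)) >= 0;
    }

    // 'a' -> 1, 'b' -> 2, ... so that 0 can mean "no letter"; 5 bits per position.
    private static long pack(String word) {
        long packed = 0;
        for (int i = 0; i < word.length(); i++) {
            packed = (packed << 5) | (word.charAt(i) - 'a' + 1);
        }
        return packed;
    }
}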
I am assuming that you want to check whether a given word belongs to the dictionary.
Have a look at the Bloom filter.
A Bloom filter can answer "does X belong to a predefined set" queries with very small storage requirements. If the answer to a query is yes, it has a small (and adjustable) probability of being wrong; if the answer is no, then the answer is guaranteed to be correct.
According to the Wikipedia article, you would need less than 4 MB of space for your dictionary of 250,000 words with a 1% error probability.
The Bloom filter will correctly answer "is in dictionary" if the word actually is contained in the dictionary. If the dictionary does not have the word, the Bloom filter may falsely answer "is in dictionary" with some small probability.
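For example, with Guava's BloomFilter (assuming pulling Guava into the app is acceptable):
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class BloomDictionary {
    public static void main(String[] args) {
        // 250,000 expected insertions at a 1% false-positive rate.
        BloomFilter<String> words =
                BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 250_000, 0.01);
        words.put("example");
        System.out.println(words.mightContain("example"));  // true
        System.out.println(words.mightContain("zzzzq"));    // false, or rarely a false positive
    }
}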
A very efficient way to store a dictionary is a Directed Acyclic Word Graph (DAWG).
Here are some links:
Directed Acyclic Word Graph or DAWG description with sourcecode
Construction of the CDAWG for a Trie
Implementation of directed acyclic word graph
You'll be wanting some sort of trie. Perhaps a ternary search trie would be good, I think. They give very fast look-up and low memory usage. This paper gives some more info about TSTs. It also talks about sorting, so not all of it will apply. This article might be a little more applicable. As the article says, TSTs "combine the time efficiency of digital tries with the space efficiency of binary search trees."
As this table shows, the look-up times are very comparable to using a hash table.
You could also use the Android NDK and do the structure in C or C++.
The devices that I worked on basically worked from a binary compressed file, with a topology that resembled the structure of a binary tree. At the leaves, you would have the Huffman-compressed text. Finding a node would involve skipping to various locations in the file, and then loading only the portion of the data really needed.
Very cool idea, as suggested by "Antti Huima": store the dictionary words packed into longs and then search using binary search.
