How to compress many strings across a data structure?


I have a 500GB collection of XML documents that I'm indexing. I'm currently only able to index 6GB of this collection with 32GB of RAM.
My index structure is a HashMap<String, PatriciaTrie<String, Integer>>, where the first string represents a term and the second string is of the format filepath+XPath, with the final integer representing the number of occurrences.
I used a trie to reduce the shared prefix and because I need the data sorted. It helped a little with compression, but it wasn't enough.
The total collection of filepath+XPath strings is somewhere between 1TB and 4TB within this data structure. I need to be able to compress this data structure entirely into memory. The target machine has 256GB RAM and 16 CPU cores. Less memory has multiple added benefits (such as reducing cold start time). Index time isn't such a big deal.
The XPaths represent about 250 total node types.
The approach I'm currently working on will build a Huffman table for each series of 2 tags, based on the tags that can possibly occur next. Often, this cuts the options down to about 4 or 5, which allows the XPath to be encoded into a much shorter bitstring, which can then be encoded as bytes.
The strings are typically 40-600 bytes (UTF-8), and I believe this should reduce everything after the filepath prefix (the first 40 characters, which are compressed by the trie) into at max 12 bytes (the deepest point on the tree is about 12 nodes deep, and each node is at worst 1 char to represent) for the structure and 12 bytes for the indexes (variable byte encoding, with very few elements containing indexes above 256), producing strings that are usually in the range 40-64 bytes.
I think this is a good approach, but I think I may be missing something.
Is there a better approach for compressing this data structure or the data that goes into it?
How do people usually compress many strings across the same data structure?
Is there any existing solution that compresses many strings independently based on the whole collection?
After the strings are in the data structure like this, are there any good techniques for compressing the tries based on the structure shared between them?

I think your biggest problem here is that you're storing too much data for each term. You don't say how many unique terms you have or how many individual files, but I'll give some example numbers.
Say you have 200,000 unique terms across 200 different files. So every unique term carries the weight of at least one file path, or 40 bytes. And that's before you start indexing anything.
You should be able to compress this data into a table of filepath+Xpath strings, and a list of terms, each of which contains references to entries in that table. So, for example, you might have:
Path table:
index  Path
1      file+xpath1
2      file+xpath2
3      file+xpath3
...
999    file+xpath999
Terms
term   references
foo    1, 19, 27, 33, 297
bar    99, 864, 865
...
Now, your paths table is probably still way too large. The first thing you can do is to build a files table and make the first part of the paths entry an index into the files table. So you end up with:
Files
1 file1.xml
2 file2.xml
...
999 file999.xml
And then your paths become:
1 1,xpathA
2 1,xpathB
3 2,xpathQ
...
If you need more compression after that, build a string table that contains the xpath terms, and your paths entries become a series of indexes into that table. You have to be careful here, though, because allocation overhead for arrays or lists is going to make short lists very expensive. If you go this route, then you'll want to encode the paths list as one big binary array, and index into it. For example:
Words list
1  the
2  quick
3  brown
4  fox
Paths
index  path
0      1(index of file),2(quick),4(fox),-1(terminator)
4      3(index of file),3(brown),-1(terminator)
7      etc . . .
The Paths table is just a big array that would look like this:
1,2,4,-1,3,3,-1,...
This minimizes data storage cost because no string is ever stored more than once. All you have is string tables and references to those strings. The amount of space it takes will be something like:
Combined length of all file names
Combined length of all path segment terms
(number of paths) * (average path length) * (size of integer index)
(number of terms) * (average number of references per term) * (size of integer index)
Building this in memory might be possible. It's hard to say without knowing how many individual terms you have. You'll need dictionaries for the file names, the paths, and the individual path segments if you use the words list. But it can all be done in a single pass if you have the memory.
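As a rough illustration of the layout described above, here is a minimal Java sketch. It assumes the caller has already split each XPath into segments; the names `PathTable`, `addPath` and `decode` are made up for this example, ids are 0-based rather than 1-based, the `/` separator in `decode` is an arbitrary choice, and deduplication of whole paths is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: every string is stored once; paths are runs of ints in one big array. */
final class PathTable {
    private static final int TERMINATOR = -1;

    private final Map<String, Integer> fileIds = new HashMap<>();
    private final List<String> files = new ArrayList<>();
    private final Map<String, Integer> segmentIds = new HashMap<>();
    private final List<String> segments = new ArrayList<>();

    // Layout: fileId, segmentId..., -1, fileId, segmentId..., -1, ...
    private int[] paths = new int[16];
    private int used = 0;

    /** Returns the id of s, adding it to the table on first sight. */
    private static int intern(String s, Map<String, Integer> ids, List<String> table) {
        Integer id = ids.get(s);
        if (id == null) {
            id = table.size();
            ids.put(s, id);
            table.add(s);
        }
        return id;
    }

    private void append(int value) {
        if (used == paths.length) {
            paths = Arrays.copyOf(paths, paths.length * 2);
        }
        paths[used++] = value;
    }

    /** Appends one path and returns its start offset, which is what a term would store. */
    int addPath(String file, String... xpathSegments) {
        int start = used;
        append(intern(file, fileIds, files));
        for (String seg : xpathSegments) {
            append(intern(seg, segmentIds, segments));
        }
        append(TERMINATOR);
        return start;
    }

    /** Rebuilds "file/seg1/seg2..." from an offset returned by addPath. */
    String decode(int ref) {
        StringBuilder sb = new StringBuilder(files.get(paths[ref]));
        for (int i = ref + 1; paths[i] != TERMINATOR; i++) {
            sb.append('/').append(segments.get(paths[i]));
        }
        return sb.toString();
    }
}
```

Each term would then hold a sorted `int[]` of the offsets returned by `addPath`. The `HashMap` dictionaries are only needed while building and can be dropped afterwards.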
If you don't have enough memory for the whole tree while you're building, you can load the file names and maintain the paths table in memory. As you find each term in a file, write it to disk along with its path reference. You end up with a disk file that looks like:
term, path reference
term, path reference
...
Use an external sort program to sort by term, and then go through and combine duplicates. When you're done you end up with a file that contains:
File names table
Path segments table
Paths
terms
Lookup is really easy. Find the term, look up each reference in the paths table, and decode the path by indexing into the file names and path segments tables.
I used something like this a few years back and it worked quite well. You should be able to write a program that analyzes your data to come up with the numbers (unique paths, number of file names, average number of references per term, etc.). From there, you can easily determine if using this technique will work for you.

Related

Storing a (string,integer) tuple more efficiently and apply binary search

Introduction
We store tuples (string,int) in a binary file. The string represents a word (no spaces nor numbers). In order to find a word, we apply binary search algorithm, since we know that all the tuples are sorted with respect to the word.
In order to store this, we use writeUTF for the string and writeInt for the integer. Other than that, let's assume for now there are no ways to distinguish between the start and the end of the tuple unless we know them in advance.
Problem
When we apply binary search, we get a position (i.e. (a+b)/2) in the file, which we can read using the methods of RandomAccessFile, i.e. we can read the byte at that place. However, since we may be in the middle of a word, we cannot know where this word starts or finishes.
Solution
Here're two possible solutions we came up with, however, we're trying to decide which one will be more space efficient/faster.
Method 1: Instead of storing the integer as a number, we thought of storing it as a string (using e.g. writeChars or writeUTF), because in that case we can insert a null character at the end of the tuple. That is, we can be sure that none of the methods used to serialize the data will use the null character, since the information we store (letters and digits) has higher ASCII values.
Method 2: We keep the same structure, but instead we separate each tuple with 6-8 (or less) bytes of random noise (same across the file). In this case, we assume that words have a low entropy, so it's very unlikely they will have any signs of randomness. Even if the integer may get 4 bytes that are exactly the same as those in the random noise, the additional two bytes that follow will not (with high probability).
Which of these methods would you recommend? Is there a better way to store this kind of information? Note, we cannot serialize the entire file and later de-serialize it into memory, since it's very big (and we are not allowed to).

I assume you're trying to optimize for speed & space (in that order).
I'd use a different layout, built from 2 files:
Integer + Index file
Each "record" is exactly 8 bytes long, the lower 4 are the integer value for the record, and the upper 4 bytes are an integer representing the offset for the record in the other file (the characters file).
Characters file
Contiguous file of characters (UTF-8 encoding or anything you choose). "Records" are not separated, not terminated in any way, simple 1 by 1 characters. For example, the records Good, Hello, Morning will look like GoodHelloMorning.
To iterate the dataset, you iterate the integer/index file with direct access (recordNum * 8 is the byte offset of the record), read the integer and the characters offset, plus the character offset of the next record (which is the 4 byte integer at recordNum * 8 + 12), then read the string from the characters file between the offsets you read from the index file. Done!
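A minimal sketch of this two-file layout, with in-memory buffers standing in for the two files so that it can be run directly. The class name `TwoFileIndex` and its methods are made up for illustration. It assumes the words are already sorted in `String.compareTo` order, and it uses the end of the characters data as the end offset of the last record, a detail the description above leaves open.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class TwoFileIndex {
    // 4-byte value followed by a 4-byte offset into the characters data
    private static final int RECORD_SIZE = 8;

    private final ByteBuffer index; // stands in for the integer/index file
    private final byte[] chars;     // stands in for the characters file
    private final int count;

    TwoFileIndex(String[] sortedWords, int[] values) {
        count = sortedWords.length;
        index = ByteBuffer.allocate(count * RECORD_SIZE);
        ByteArrayOutputStream blob = new ByteArrayOutputStream();
        for (int i = 0; i < count; i++) {
            index.putInt(values[i]);
            index.putInt(blob.size()); // byte offset where this word starts
            byte[] w = sortedWords[i].getBytes(StandardCharsets.UTF_8);
            blob.write(w, 0, w.length);
        }
        chars = blob.toByteArray();
    }

    /** Reads the word of record rec using its offset and the next record's offset. */
    String wordAt(int rec) {
        int start = index.getInt(rec * RECORD_SIZE + 4);
        int end = (rec + 1 < count)
                ? index.getInt((rec + 1) * RECORD_SIZE + 4)
                : chars.length;
        return new String(chars, start, end - start, StandardCharsets.UTF_8);
    }

    /** Binary search over the fixed-size records; returns null if the word is absent. */
    Integer find(String word) {
        int lo = 0;
        int hi = count - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = wordAt(mid).compareTo(word);
            if (cmp == 0) {
                return index.getInt(mid * RECORD_SIZE);
            }
            if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return null;
    }
}
```

With real files, the same arithmetic would apply to seek positions in two `RandomAccessFile` instances instead of array indexes.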

it's less than 200MB. Max 20 chars for a word.
So why bother? Unless you work on some severely restricted system, load everything into a Map<String, Integer> and get a few orders of magnitude speed up.
But let's say, I'm overlooking something and let's continue.
Method 1: Instead of storing the integer as a number, we thought to store it as a string (using eg. writeChars or writeUTF), because in that case, we can insert a null character
You don't have to as you said that your word contains no numbers. So you can always parse things like 0124some456word789 uniquely.
The efficiency depends on the distribution. You may win a factor of 4 (single digit numbers) or lose a factor of 2.5 (10-digit numbers). You could save something by using a higher base. But there's the storage for the string and it may dominate.
Method 2: We keep the same structure, but instead we separate each tuple with 6-8 (or less) bytes of random noise (same across the file).
This is too wasteful. Four zero bytes between records would do:
Find a sequence of at least four zeros.
Find the last zero.
That's the last separator byte.
Method 3: Using some hacks, you could ensure that the number contains no zero byte (either assuming that it doesn't use the whole range or representing it with five bytes). Then a single zero byte would do.
Method 4: As disk is organized in blocks, you should probably split your data into 4 KiB blocks. Then you can add some tiny header allowing you quick access to the data (start indexes for the 8th, 16th, etc. piece of data). The range between e.g., the 8th and 16th block should be scanned sequentially as it's both simpler and faster than binary search.

Sorting BIG Data XML file

I have an XML file that has a compressed size of about 100 GB (uncompressed 1 TB). This file contains about 100 million entries in the following way:
<root>
<entry>
<id>1234</id>
...
</entry>
<entry>
<id>1230</id>
...
</entry>
</root>
I would like to sort this file by id. What would be a good way to do so?
By the way, I can use a machine with 16 cores and 128 GB RAM.

You can think about using a streaming processor like Saxon http://www.saxonica.com/html/documentation/sourcedocs/streaming/ and sort using XSLT.
Another option may be to store the data as key, values in DB, order them using SQL and recreating the XML. You would be leveraging the power of DB to manage large amount of data.
Similar question (not same): Sort multigigabyte xml file

Because the values (i.e. the ids) are natural numbers, the best algorithm to sort them is Counting Sort, which runs in Θ(n + k) time.
Suppose the values are in range [1 .. k]
Counting Sort:
Temp: C[1..k]
Input: A[1..n]
Output: B[1..n]
CountingSort (A, B, k)
{
    for(i=1 to k) C[i]=0;
    for(i=1 to n) C[A[i]]++;
    for(i=2 to k) C[i]=C[i]+C[i-1];
    for(i=n downto 1)
    {
        B[C[A[i]]] = A[i];
        C[A[i]]--;
    }
}
This algorithm is Stable.
You can also use Radix Sort with the same order.
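For reference, the pseudocode above could be written in Java roughly as follows. This is only a sketch that sorts bare keys in the range [1, k] held in memory; the class name `CountingSortDemo` is made up, and for the XML file in the question each key would still have to carry its whole entry along.

```java
final class CountingSortDemo {
    /** Stable counting sort of keys in the range [1, k]; returns a new sorted array. */
    static int[] countingSort(int[] a, int k) {
        int[] c = new int[k + 1];
        // Count how often each key occurs.
        for (int v : a) {
            c[v]++;
        }
        // Prefix sums: c[v] becomes the number of keys <= v.
        for (int i = 2; i <= k; i++) {
            c[i] += c[i - 1];
        }
        int[] b = new int[a.length];
        // Walk backwards so equal keys keep their relative order.
        for (int i = a.length - 1; i >= 0; i--) {
            b[--c[a[i]]] = a[i];
        }
        return b;
    }
}
```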

At this stage it's useful to remember the techniques that people used to sort magnetic tapes or decks of punched cards, in the days when the data was much larger than available direct access memory. (I once watched a team of operators sort a quarter of a million cards - about 120 trays). You basically need a combination of streaming, merging, and splitting, which are all operations available in principle using XSLT 3.0. There are two processors available, Saxon-EE and Exselt, and neither is yet a 100% complete implementation, so you'll be constrained by the limitations of the products more than the spec.
My instinct would be to go for a digit-by-digit sort. You don't say how long the id's used as sort keys are. "Digits" here of course doesn't have to mean decimal digits, but assuming decimal for simplicity, the basic idea is that you first split the file into 10 buckets based on the last digit of the sort key, then you process the buckets in sequence based on this ordering, this time sorting by the penultimate digit, and carry on for as many digits as there are in the key: one pass of the complete file for each digit in the sort key.
If the id's are dense then presumably with 100m keys they are about 8 digits long, that would be 8 passes and if we assume a processing speed of 10 GB/min, which is probably the best you can get from off-the-shelf XML parsers, then each pass of a 1 TB file is going to take about 2 hours, so 8 passes would be 16 hours. But it might be better to use say base-100 so you split into 100 files on each pass, then you only have 4 passes.
The essential XSLT 3.0 code is:
<xsl:stream href="in.xml">
<xsl:fork>
<xsl:for-each-group select="record"
group-by="substring(key, $digit, 1)">
<xsl:result-document href="temp{current-grouping-key()}">
<xsl:sequence select="current-group()"/>
</xsl:result-document>
</xsl:for-each-group>
</xsl:fork>
</xsl:stream>
Now the bad news: in Saxon-EE 9.7 this code probably isn't sufficiently optimised. Although in principle the items in each group could be streamed directly to the relevant serialised result-document, Saxon doesn't yet treat this case specially and will build each group in memory before processing it. I don't know if Exselt can do any better.
So is there an alternative? Well, perhaps we could try something like this:
Split the file into N files: that is, put the first X/N items into file 1, the next X/N into file 2, and so on.
Sort each file, conventionally in memory.
Do a streamed merge of the resulting files, using xsl:merge.
I think that would work in Saxon. The first step can be done using <xsl:for-each-group group-adjacent="(position()-1) idiv $N"> which is fully streamed in Saxon.
This is essentially a 3-pass solution, in that each item is parsed and serialized three times. I would go for splitting the 1 TB file into 100 10 GB files. Doing an in-memory XSLT sort of a 10 GB file is pushing it, but you've got some horsepower to play with. You could however run into Java addressing limits: arrays and strings have 1G limits, I think.
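Outside XSLT, the merge step of such a split/sort/merge plan is a standard k-way merge. The Java sketch below shows the idea on in-memory lists of ids; the names `KWayMerge` and `Head` are made up for this example, and a real version would read entries from the sorted chunk files through streaming iterators and write them out instead of collecting them in a list.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

final class KWayMerge {
    /** Current smallest id of one sorted run, plus the iterator supplying the rest. */
    private static final class Head {
        final long id;
        final Iterator<Long> rest;

        Head(long id, Iterator<Long> rest) {
            this.id = id;
            this.rest = rest;
        }
    }

    /** Merges already sorted runs into one sorted list, holding one item per run in the heap. */
    static List<Long> merge(List<List<Long>> sortedRuns) {
        PriorityQueue<Head> heap = new PriorityQueue<>((x, y) -> Long.compare(x.id, y.id));
        for (List<Long> run : sortedRuns) {
            Iterator<Long> it = run.iterator();
            if (it.hasNext()) {
                heap.add(new Head(it.next(), it));
            }
        }
        List<Long> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Head h = heap.poll();
            out.add(h.id);
            // Refill the heap from the run the smallest item came from.
            if (h.rest.hasNext()) {
                heap.add(new Head(h.rest.next(), h.rest));
            }
        }
        return out;
    }
}
```

Memory use during the merge is proportional to the number of runs rather than to the total number of entries, which is what makes this step workable for data much larger than RAM.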

Data structure choice for ngrams upto length 5, when building count-based distributional model

I am building a distributional model (count based) from text. Basically for each ngram (a sequence of words), I have to store a count. I need reasonably quick access to the count. For n=5, technically all possible 5-grams are (10^4)^5 even if I assume a conservative estimate of 10k words, which is too high. But many combinations of these n-grams wouldn't exist in text, so a 5d array kind of structure is out of consideration.
I built a trie, where each word is a node. So this trie would be really wide, with max depth 5. That gave me considerable memory savings. But I still run out of memory (64GB) after I train on enough files. To be fair, I am not using any super efficient Java practices here. Each node has a count and the index of the word as an int. I then have a HashMap to store children. I initially started with a list. I tried to sort it each time I added a child, but I was losing a lot of time there, so I moved to a HashMap. Even with a list, I will run out of memory after reading some more files.
So I guess I need to divide my task into parts, store each part to disk. But ultimately, when accessing I would need to merge these data structures. So I think the way forward is a disk based solution, where I know which file to access for ngrams which start with something (some sort of ordering). As I see it, the problem with trie is it's not very efficient when I go around to merging it. I would need to load two parts into memory to merge. That wouldn't really work.
What approach would you recommend? I looked into a HashMap encoding based structure for language models (like the one berkeleylm uses). But in their use case, they don't need to reconstruct the ngram, so they just hash it and store the hashvalue as context. I need to be able to access the context later.
Any suggestions? Is there any value in using a database? Can they do it without being in-memory?

I wouldn't use HashMap, it's quite memory intensive, a simple sorted array should be better, you can then use binary search on it.
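One way to read the sorted-array suggestion is sketched below in Java. It assumes words have already been mapped to int ids and that the 5-grams have been sorted lexicographically (for example by an external sort); the class name `NgramCounts` and its layout are made up for illustration. Each distinct 5-gram then costs 24 bytes (five ints plus one count) with no per-object overhead.

```java
final class NgramCounts {
    private static final int N = 5;

    private final int[] grams;  // N word ids per n-gram, rows in lexicographic order
    private final int[] counts; // counts[i] belongs to row i of grams

    NgramCounts(int[] sortedGrams, int[] counts) {
        if (sortedGrams.length != counts.length * N) {
            throw new IllegalArgumentException("expected " + N + " ids per count");
        }
        this.grams = sortedGrams;
        this.counts = counts;
    }

    /** Lexicographic comparison of one stored row against the key. */
    private int compareRow(int row, int[] key) {
        int base = row * N;
        for (int j = 0; j < N; j++) {
            int c = Integer.compare(grams[base + j], key[j]);
            if (c != 0) {
                return c;
            }
        }
        return 0;
    }

    /** Binary search for the n-gram; returns 0 if it was never seen. */
    int count(int... key) {
        if (key.length != N) {
            throw new IllegalArgumentException("key must have " + N + " word ids");
        }
        int lo = 0;
        int hi = counts.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int c = compareRow(mid, key);
            if (c == 0) {
                return counts[mid];
            }
            if (c < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return 0;
    }
}
```

Because the ids are stored in full, the original n-gram can be reconstructed from any row, which a hash-only encoding would not allow.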
Maybe you could also try a binary prefix-trie. First you create a single char-string, for example by interleaving the letters of the words into a single string (I suppose you could also concatenate them, separated by a blank). This long String could then be stored in a binary trie. See CritBit1D for an example.
You could also use a multi-dimensional tree. Many trees are limited to 64-bit numbers, but you could turn the eight leading ASCII characters of every word into a 64-bit integer and then store that as a 5D key. That should be much more efficient than a 5D array. Multi-dimensional indexes include kd-trees, R-trees and quadtrees. The 5-gram count and the full 5-gram (including remaining characters) can be stored separately in the VALUE that can be associated with each 5D-KEY.
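Turning the eight leading ASCII characters of a word into a 64-bit number could look like the following Java sketch. The class and method names are made up; shorter words are padded with zero bytes, and non-ASCII characters are simply masked to 7 bits, which is an assumption of this example rather than part of the suggestion.

```java
final class WordKey {
    /** Packs the first eight characters of word into a long, most significant byte first. */
    static long leadingEight(String word) {
        long key = 0L;
        for (int i = 0; i < 8; i++) {
            // Pad short words with 0; keep only the 7-bit ASCII value otherwise.
            int c = i < word.length() ? (word.charAt(i) & 0x7F) : 0;
            key = (key << 8) | c;
        }
        return key;
    }
}
```

Since every byte is below 128, the result is never negative, and comparing two keys numerically gives the same order as comparing the first eight ASCII characters of the words.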
If you are using Java you could try my very own tree. It's a prefix-sharing bitwise quadtree. It is very memory efficient, very well suited to larger datasets (1M entries upwards) and works natively with 'integer' rather than 'float'. It also has very good nearest neighbour search.

Memory and speed efficient search on Strings

I have a bunch of Strings I'd like a fast lookup for. Each String is 22 chars long and is looked up by the first 12 only (the "key" so to say), the full set of Strings is recreated periodically. They are loaded from a file and refreshed when the file changes. I have to deal with too little available memory, other server processes on my VPS need it too and need it more.
How do I best store the Strings and search for them?
My current idea is to store them all one after another inside a char[] (to save RAM), and sort them for faster lookups (I figure the lookup is fastest if I have them presorted so I can use binary or interpolation search). But I'm not exactly sure how I should code it - if anyone is in the mood for a challenging puzzle: here it is...
Btw: It's probably ok to exceed the memory constraints for a while during the recreation / sorting, but it shouldn't be by much or for long.
Thanks!
Update
For the "I want to know specifics" crowd (correct me if I'm wrong in the Java details): The source files contain about 320 000 entries (all ANSI text), I really want to stay (WAY!) below 64 MB RAM usage and the data is only part of my program. Here's some information on sizes of Java types in memory.
My VPS is a 32bit OS, so...
one byte[], all concatenated = 12 + length bytes
one char[], all concatenated = 12 + length * 2 bytes
String = 32 + length * 2 bytes (is Object, has char[] + 3 int)
So I have to keep in memory:
~7 MB if all are stored in a byte[]
~14 MB if all are stored in a char[]
~25 MB if all are stored in a String[]
> 40 MB if they are stored in a HashTable / Map (for which I'd probably have to finetune the initial capacity)
A HashTable is not magical - it helps on insertion, but in principle it's just a very long array of String where the hashCode modulus capacity is used as an index, the data is stored in the next free position after the index and searched linearly if it's not found there on lookup. But for a Hashtable, I'd need the String itself and a substring of the first 12 chars for lookup. I don't want that (or am I missing something here?), sorry folks...

I would probably use a cache solution for that, may be even guava will do. Of course sort them, then binary search. Unfortunately I do not have the time for it :(

Sounds like a HashTable would be the right implementation for this situation.
Searching is done in constant time and refreshing could be done in linear time.
Java Data Structure Big-O (Warning PDF)

I coded a solution myself - but it's a little different than the question I posted because I could use information I didn't publish (I'll do better next time, sorry).
I'm just answering this because it's solved, I won't accept one of the other answers because they didn't really help with the memory constraints (and were a little short for my taste). They still got an upvote each, no hard feelings and thanks for taking the time!
I managed to push all of the info into two longs (with the key completely residing in the first one). The first 12 chars are an ISIN which can be compressed into a long because it only uses digits and capital letters, always starts with two capital letters and ends with a digit which can be reconstructed from the other chars. The product of all possible values leaves a little more than 3 bits to spare.
I store all entries from my source file in a long[] (packed ISIN first, other stuff in the second long) and sort them based on the first of two longs.
When I do a query by a key, I transform it to a long, do a binary search (which I'll maybe change to an interpolation search) and return the matching index. The different parts of the value are retrievable by said index - I get the second long from the array, unpack it and return the requested data.
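The packing idea can be sketched in Java as follows. This is not the original code: it assumes well-formed ISINs (two capital letters, nine characters from 0-9 or A-Z, one check digit), ignores the check digit instead of validating it, and keeps the keys in their own sorted `long[]` rather than interleaving key and value longs as described above. The names `IsinIndex`, `pack` and `find` are made up.

```java
import java.util.Arrays;

final class IsinIndex {
    /** Packs the first 11 characters: two letters in base 26, then nine of [0-9A-Z] in base 36. */
    static long pack(String isin) {
        long v = 0L;
        for (int i = 0; i < 11; i++) {
            char c = isin.charAt(i);
            if (i < 2) {
                v = v * 26 + (c - 'A');
            } else {
                v = v * 36 + (c <= '9' ? c - '0' : c - 'A' + 10);
            }
        }
        return v; // at most 26^2 * 36^9 - 1, which fits comfortably in a long
    }

    /** Returns the index of the ISIN in sortedKeys, or a negative number if absent. */
    static int find(long[] sortedKeys, String isin) {
        return Arrays.binarySearch(sortedKeys, pack(isin));
    }
}
```

The index returned by `find` would then be used to fetch the second long holding the remaining data.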
The result: RAM usage dropped from ~110 MB to < 50 MB including Jetty (btw - I used a HashTable before) and lookups are lightning fast.

HashSet of Strings taking up too much memory, suggestions...?

I am currently storing a list of words (around 120,000) in a HashSet, to check entered words against and see whether they are spelt correctly, just returning yes or no.
I was wondering if there is a way to do this which takes up less memory. Currently the 120,000 words take around 12 MB, while the actual file the words are read from is around 900 KB.
Any suggestions?
Thanks in advance

You could use a prefix tree or trie: http://en.wikipedia.org/wiki/Trie

Check out bloom filters or cuckoo hashing. Bloom filter or cuckoo hashing?
I am not sure if this is the answer for your question but worth looking into these alternatives. bloom filters are mainly used for spell checker kind of use cases.
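For a sense of what a Bloom filter involves, here is a small Java sketch. The class name and the way the hash values are derived from `String.hashCode()` are choices made for this example only, not a tuned implementation. Keep in mind that a Bloom filter can report a word as present when it is not (a false positive), but never the other way round.

```java
import java.util.BitSet;

final class BloomFilter {
    private final BitSet bits;
    private final int size;   // number of bits in the filter
    private final int hashes; // number of bit positions set per word

    BloomFilter(int size, int hashes) {
        if (size <= 0 || hashes <= 0) {
            throw new IllegalArgumentException("size and hashes must be positive");
        }
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    /** i-th bit position for s, derived from two hash values (double hashing). */
    private int slot(String s, int i) {
        int h1 = s.hashCode();
        int h2 = ((h1 * 0x9E3779B9) >>> 15) | 1; // second hash, forced to be odd
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String s) {
        for (int i = 0; i < hashes; i++) {
            bits.set(slot(s, i));
        }
    }

    /** False means definitely absent; true means probably present. */
    boolean mightContain(String s) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(slot(s, i))) {
                return false;
            }
        }
        return true;
    }
}
```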

HashSet is probably not the right structure for this. Use Trie instead.

This might be a bit late but using Google you can easily find the DAWG investigation and C code that I posted a while ago.
http://www.pathcom.com/~vadco/dawg.html
TWL06 - 178,691 words - fits into 494,676 Bytes
The downside of a compressed-shared-node structure is that it does not work as a hash function for the words in your list. That is to say, it will tell you if a word exists, but it will not return an index to related data for a word that does exist.
If you want the perfect and complete hash functionality, in a processor-cache sized structure, you are going to have to read, understand, and modify a data structure called the ADTDAWG. It will be slightly larger than a traditional DAWG, but it is faster and more useful.
http://www.pathcom.com/~vadco/adtdawg.html
All the very best,
JohnPaul Adamovsky

12MB to store 120,000 words is about 100 bytes per word. Probably at least 32 bytes of that is String overhead. If words average 10 letters and they are stored as 2-byte chars, that accounts for another 20 bytes. Then there is the reference to each String in your HashSet, which is probably another 4 bytes. The remaining 44 bytes is probably the HashSet entry and indexing overhead, or something I haven't considered above.
The easiest thing to go after is the overhead of the String objects themselves, which can take far more memory than is required to store the actual character data. So your main approach would be to develop a custom representation that avoids storing a separate object for each string. In the course of doing this, you can also get rid of the HashSet overhead, since all you really need is a simple word lookup, which can be done by a straightforward binary search on an array that will be part of your custom implementation.
You could create your custom implementation as an array of type int with one element for each word. Each of these int elements would be broken into sub-fields that contain a length and an offset that points into a separate backing array of type char. Put both of these into a class that manages them, and that supports public methods allowing you to retrieve and/or convert your data and individual characters given a string index and an optional character index, and to perform the simple searches on the list of words that are needed for your spell check feature.
If you have no more than 16777216 characters of underlying string data (e.g., 120,000 strings times an average length of 10 characters = 1.2 million chars), you can take the low-order 24 bits of each int and store the starting offset of each string into your backing array of char data, and take the high-order 8 bits of each int and store the size of the corresponding string there.
Your char data will have your erstwhile strings crammed together without any delimiters, relying entirely upon the int array to know where each string starts and ends.
Taking the above approach, your 120,000 words (at an average of 10 letters each) would require about 2,400,000 bytes of backing array data and 480,000 bytes of integer index data (120,000 x 4 bytes), for a total of 2,880,000 bytes, which is about a 75 percent savings over the present 12MB amount you have reported above.
The words in the arrays would be sorted alphabetically, and your lookup process could be a simple binary search on the int array (retrieving the corresponding words from the char array for each test), which should be very efficient.
If your words happen to be entirely ASCII data, you could save an additional 1,200,000 bytes by storing the backing data as bytes instead of as chars.
This could get more difficult if you needed to alter these strings. Apparently, in your case (spell checker), you don't need to (unless you want to support user additions to the list, which would be infrequent anyway, and so re-writing the char data and indexes to add or delete words might be acceptable).
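The representation described above might look roughly like this in Java. It is a sketch under the stated limits (words of at most 255 chars, at most 16777216 chars in total) and assumes the input array is already sorted in `String.compareTo` order; the class name `PackedWordList` is made up.

```java
final class PackedWordList {
    private final char[] chars; // all words back to back, no delimiters
    private final int[] index;  // high 8 bits: length, low 24 bits: start offset

    PackedWordList(String[] sortedWords) {
        StringBuilder sb = new StringBuilder();
        index = new int[sortedWords.length];
        for (int i = 0; i < sortedWords.length; i++) {
            String w = sortedWords[i];
            if (w.length() > 0xFF || sb.length() > 0xFFFFFF) {
                throw new IllegalArgumentException("word too long or too much data");
            }
            index[i] = (w.length() << 24) | sb.length();
            sb.append(w);
        }
        chars = sb.toString().toCharArray();
    }

    /** Compares the stored word at entry with word, like String.compareTo. */
    private int compare(int entry, String word) {
        int off = index[entry] & 0xFFFFFF;
        int len = index[entry] >>> 24;
        int n = Math.min(len, word.length());
        for (int i = 0; i < n; i++) {
            int d = chars[off + i] - word.charAt(i);
            if (d != 0) {
                return d;
            }
        }
        return len - word.length();
    }

    /** Binary search over the int index; no String objects are created. */
    boolean contains(String word) {
        int lo = 0;
        int hi = index.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int c = compare(mid, word);
            if (c == 0) {
                return true;
            }
            if (c < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return false;
    }
}
```

Switching `chars` to a `byte[]` for pure ASCII data would halve the backing array, as noted above.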

One way to save memory is to use a radix tree. This is better than a trie as the prefixes are not stored redundantly.
As your dictionary is fixed another way is to build a perfect hash function for it. Your hash set does not need buckets (and the associated overhead) as there cannot be collisions. Every implementation of a hash table/hash set that uses open addressing can be used for this (like google collection's ImmutableSet).

The problem is by design: Storing such a huge amount of words in a HashSet for spell-check-reasons isn't a good idea:
You can either use a spell-checker (example: http://softcorporation.com/products/spellcheck/ ), or you can buildup a "auto-wordcompletion" with a prefix tree ( description: http://en.wikipedia.org/wiki/Trie ).
There is no way to reduce memory-usage in this design.

You can also try a radix tree (Wiki, Implementation). It is somewhat like a trie but more memory efficient.
