Combining text- and bit-information in a file in Java?

Alright, so we need to store a list of words and their respective position in a much bigger text. We've been asked if it's more efficient to save the position represented as text or represented as bits (data streams in Java).
I think that a bitwise representation is best, since the text "1024" takes up 4*8 = 32 bits, while the same value needs only 11 bits when represented as a binary number.
The follow-up question is whether the index should be saved in one file or two. Here I thought, "perhaps you can't combine text and a bitwise representation in one file?", and that that's the reason you'd need two files.
So the question, first and foremost, is: can I store text information (the word) combined with bitwise information (its position) in one file?

Too vague in terms of what's really needed.
If you have up to a few million words + positions, don't even bother thinking about it. Store in whatever format is the simplest to implement; space would only be an issue if you need to send the data over a low-bandwidth network.
Then there is general data compression available: by just wrapping your Input/OutputStreams with a Deflater or GZIP stream (already built into the JRE) you will get reasonably good compression (50% or more for text). That easily beats what you can quickly write yourself. If you need better compression, there is XZ for Java (implements LZMA compression), open source.
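For illustration, a minimal sketch of that wrapping; the file name and record format here are placeholders:
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class GzipWriteSketch {
    public static void main(String[] args) throws IOException {
        // Anything written through the GZIPOutputStream wrapper is compressed transparently.
        try (OutputStream out = new GZIPOutputStream(new FileOutputStream("index.gz"))) {
            out.write("someword 1024\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
Reading it back is symmetric: wrap a FileInputStream in a GZIPInputStream.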
If you need random access, you're on the wrong track, you will want to carefully design the data layout for the access patterns and storage should be only of tertiary concern.

The number 1024 would take at least 2-4 bytes (so 16-32 bits), as you need to know where the number starts and ends, so it must have a fixed size. If your positions are very big, like 124058936, you would need to use 4 bytes per number (which would still be better than 9 bytes as a string representation).
Using binary files you'll need a way to know where the string starts and ends, too. You can do this by storing a byte with its length before it, and reading the string like this:
byte[] arr = new byte[in.readByte()]; // in.readByte() * 2 if the string is encoded as 16-bit chars
in.readFully(arr); // in is a DataInputStream / RandomAccessFile; readFully ensures the whole array is filled
String yourString = new String(arr, "US-ASCII");
The other possibility would be terminating your string with a null character (00), but you would need to create your own implementation for that, as no readers support it by default (AFAIK).
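For completeness, a possible write-side counterpart of that read snippet, as a sketch; the length-prefixed layout is assumed, and names like RecordWriter are illustrative only:
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class RecordWriter {
    // Writes one (word, position) record: 1-byte length, the word's ASCII bytes, then a 4-byte int.
    static void writeRecord(DataOutputStream out, String word, int position) throws IOException {
        byte[] wordBytes = word.getBytes(StandardCharsets.US_ASCII);
        out.writeByte(wordBytes.length); // length prefix; the word must fit in 255 bytes
        out.write(wordBytes);            // the word itself
        out.writeInt(position);          // fixed-size 4-byte position
    }

    public static void main(String[] args) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream("index.bin"))) {
            writeRecord(out, "hello", 1024);
        }
    }
}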
Now, is it really worth storing it as binary data? That really depends on how big your positions are, because the strings take the same number of bytes either way (in the text version they'd just be separated from their position with a space).
My recommendation is that you use the text version, as it will probably be easier to parse and more readable.
About using one or two files, it doesn't really matter. You can combine text and binary in the same file, and it would take the same space (though splitting it into two separate files will always take a bit more space, and might make it messier to manage).


Confusion around how byte array writing works in Java

Let's say I have a huge set of strings that I want to write into a file as efficiently as possible. I don't care if it's not human readable.
The first thing that came to my mind was to write the strings as raw bytes to a binary file. I tried using DataOutputStream and writing the byte array. However, when I open my file it is readable.
How does this work? Does it actually write binary under the hood and only my text editor is making it readable?
Is this the most efficient way to do this?
I’d use this for a project where performance is key so I’m looking for the fastest way to write to a file (no need to be human readable).
Thanks in advance.
Files are just a sack of bytes. They always are, even a txt file.
Characters don't really exist, in the sense that computers don't know what they are, not really. They just know numbers.
So, how does that work?
Welcome to the wonderful world of text encoding.
The job of storing, say, the string "Hello!" in a file requires converting the notion of H, e, l, l, o, and ! into bytes first, and then writing those bytes to a file.
In order to do that, you first need a lookup table; one that translates characters into numbers. Then, we have to convert those numbers to bytes, and then we can save them to a file.
A common encoding is US-ASCII. US-ASCII contains only 95 printable characters in its mapping: the 26 letters of the English alphabet in both lower and uppercase form, all digits, a few useful symbols such as !@#$%^&*( and space. That's it. US-ASCII simply has no 'mapping' for e.g. é or ☃ or even 😊.
All these characters are mapped to a number between 32 and 126, and so to put this in a text file, just write that number, as bytes can represent anything between 0 and 255, so it 'just fits' (in fact, the high bit is always 0).
But, it's 2021, and we have emoji, and we figured out a while ago that, as it turns out, there are languages out there that aren't English. Amazing, that.
So, the commonly used table is the Unicode table. This table represents a liiiiitle more than 95 characters. Nono, this table has a whopping, at time of writing, 143,859 characters in it. Holy moly batman, that's a ton.
Clearly, the numbers that these 143,859 glyphs are mapped to must, at the very least, fall between 0 and 143,859 (it's actually a larger number range; there are gaps for convenience and to leave room for future updates).
You could just state that each number is an int (between 0 and 2^31, 4 bytes total), and store each character as an int (so, Hello! would turn into a file on disk that is 24 bytes large).
But, a more common encoding is UTF-8. UTF-8 has the property that it stores ASCII-compatible characters as, well, ASCII, because those 95 characters have the same 'number translation' in Unicode as they do in ASCII, AND UTF-8 stores those numbers as just that one byte. UTF-8 stores each character in 1, 2, 3, or 4 bytes, depending on what character it is. It's a 'variable length encoding scheme'.
You can look up UTF-8 on e.g. Wikipedia if you want to know the deal.
For English-esque text, UTF-8 is extremely efficient, and no worse than ASCII (so there is no reason to use ASCII). You can do this in Java quite easily:
// Path, Files etc are from java.nio.file
Path p = Paths.get("mytextfile.txt");
Files.writeString(p, "Hello!");
That's all you need; the Files API defaults to UTF-8 (be aware that the old and mostly obsolete APIs such as FileWriter don't, and you should ALWAYS specify the charset for those! Or better yet, just don't use 'em and use java.nio.file instead).
Note that you can shove a unicode snowman or even emoji in there, it'll save fine.
There is no 'binary' variant. The file is bytes. If you open it in a text editor or run cat thatfile.txt, guess what? cat or your editor is reading in the bytes, taking a wild stab in the dark as to what encoding it might be, looking every decoded value up in the character table, and then farming out the work to the font rendering engine to show the characters again. It's just the editor doing you the favour of showing the file with bytes:
72, 101, 108, 108, 111, 33
as Hello! because that's a lot easier to read. Open that 'text file' with a hex editor and you'll see that it contains exactly that sequence of numbers I showed (well, in hex, that too is just a rendering convenience).
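If you want to see that for yourself, here's a tiny sketch that prints the UTF-8 (here identical to ASCII) byte values of "Hello!":
import java.nio.charset.StandardCharsets;

public class ShowBytes {
    public static void main(String[] args) {
        // Prints: 72 101 108 108 111 33
        for (byte b : "Hello!".getBytes(StandardCharsets.UTF_8)) {
            System.out.print(b + " ");
        }
    }
}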
Still, if you want to store it 'efficiently', the answer is trivial: use a compression algorithm. You can throw that data through e.g. a GZIPOutputStream, or use fancier compressors:
Path p = Paths.get("file.txt.gz");
try (OutputStream out = Files.newOutputStream(p);
     GZIPOutputStream zip = new GZIPOutputStream(out)) { // java.util.zip; matches the .gz extension
    String shakespeare = "type the complete works of shakespeare here";
    zip.write(shakespeare.getBytes(StandardCharsets.UTF_8));
}
You'll find that file.txt.gz will be considerably fewer bytes than the total character count of the combined works of shakespeare. Voila. Efficiency.
You can futz with your compression algorithm; there are many. Some are optimized for specific purposes, most fall on a tradeoff line between 'speed of compression' and 'efficiency of compression'. Many are configurable (compress better at the cost of running longer vs compress quickly, but it won't be quite as efficient). A few basic compression algorithms are baked into java, and for the fancier ones, well, a few have pure java implementations you can use.
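As one example of that speed-vs-ratio knob in the JRE's built-in compressor (a sketch; the file name and sample text are placeholders):
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

public class DeflaterLevelSketch {
    public static void main(String[] args) throws IOException {
        // BEST_COMPRESSION trades CPU time for a smaller file; BEST_SPEED does the opposite.
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try (DeflaterOutputStream out =
                 new DeflaterOutputStream(new FileOutputStream("data.deflate"), deflater)) {
            out.write("some text to compress".getBytes(StandardCharsets.UTF_8));
        }
    }
}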

What happens if a file doesn't end exactly at the last byte?

For example, if a file is 100 bits, it would be stored as 13 bytes. This means that the first 4 bits of the last byte are part of the file and the last 4 are not (useless padding).
So how is this prevented when reading a file using the FileInputStream.read() function in Java, or similar functions in other programming languages?
You'll notice, if you ever use assembly, that there's no way to actually read a specific bit. The smallest addressable unit of memory is a byte; memory addresses refer to a specific byte in memory. If you ever need a specific bit, you have to access it with bitwise operators like |, & and ^. So in this situation, if you store 100 bits, you're actually storing a minimum of 13 bytes, and the leftover bits just default to 0, so the results are the same.
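For example, a small sketch of pulling an individual bit out of a byte with shift and mask:
public class BitAccess {
    // Extracts bit n (0 = least significant) of a byte using shift and mask.
    static int bitAt(byte b, int n) {
        return (b >> n) & 1;
    }

    public static void main(String[] args) {
        byte value = 0b0101_0010;
        System.out.println(bitAt(value, 1)); // prints 1
        System.out.println(bitAt(value, 2)); // prints 0
    }
}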
Current file systems mostly store files that are an integral number of bytes, so the issue does not arise. You cannot write a file that is exactly 100 bits long. The reason for this is simple: the file metadata holds the length in bytes and not the length in bits.
This is a conscious design choice by the designers of the file system. They presumably chose this design because there is very little need for files that are an arbitrary number of bits long.
Those cases that do need a file to contain a non-integral number of bytes can (and need to) make their own arrangements. Perhaps the 100-bit case could insert a header that says, in effect, that only the first 100 bits of the following 13 bytes have useful data. This would of course need special handling, either in the application or in some library that handled that sort of file data.
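A minimal sketch of that header idea, assuming a hypothetical format in which a 4-byte count of meaningful bits precedes the byte-padded payload:
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class BitLengthFile {
    // Writes a 4-byte bit count, then the payload padded to whole bytes
    // (13 bytes for a 100-bit payload; the last 4 bits are padding).
    static void write(File f, byte[] payload, int bitCount) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(f))) {
            out.writeInt(bitCount);
            out.write(payload);
        }
    }

    // Reads back how many bits of the payload are meaningful.
    static int readBitCount(File f) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(f))) {
            return in.readInt();
        }
    }
}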
Comments about bit-lengthed files not being possible because of the size of a boolean, etc., seem to me to miss the point. Certainly disk storage granularity is not the issue: we can store a "100 byte" file on a device that can only handle units of 256 bytes - all it takes is for the file system to note that the file size is 100, not 256, even though 256 bytes are allocated to the file. It could equally well track that the size was 100 bits, if that were useful. And, of course, we'd need I/O syscalls that expressed the transfer length in bits. But that's not hard. The in-memory buffer would need to be slightly larger, because neither the language nor the OS allocates RAM in arbitrary bit-lengths, but that's not tied tightly to file size.

Storing a (string,integer) tuple more efficiently and apply binary search

Introduction
We store tuples (string,int) in a binary file. The string represents a word (no spaces nor numbers). In order to find a word, we apply binary search algorithm, since we know that all the tuples are sorted with respect to the word.
In order to store this, we use writeUTF for the string and writeInt for the integer. Other than that, let's assume for now there are no ways to distinguish between the start and the end of the tuple unless we know them in advance.
Problem
When we apply binary search, we get a position (i.e. (a+b)/2) in the file, which we can read using methods in RandomAccessFile, i.e. we can read the byte at that place. However, since we can be in the middle of a word, we cannot know where this word starts or finishes.
Solution
Here are two possible solutions we came up with; however, we're trying to decide which one will be more space-efficient and faster.
Method 1: Instead of storing the integer as a number, we thought to store it as a string (using e.g. writeChars or writeUTF), because in that case we can insert a null character at the end of the tuple. That is, we can be sure that none of the methods used to serialize the data will use the null character, since the information we store (letters and digits) has higher ASCII values.
Method 2: We keep the same structure, but instead we separate each tuple with 6-8 (or less) bytes of random noise (same across the file). In this case, we assume that words have a low entropy, so it's very unlikely they will have any signs of randomness. Even if the integer may get 4 bytes that are exactly the same as those in the random noise, the additional two bytes that follow will not (with high probability).
Which of these methods would you recommend? Is there a better way to store this kind of information. Note, we cannot serialize the entire file and later de-serialize it into memory, since it's very big (and we are not allowed to).
I assume you're trying to optimize for speed & space (in that order).
I'd use a different layout, built from 2 files:
Integer + index file
Each "record" is exactly 8 bytes long, the lower 4 are the integer value for the record, and the upper 4 bytes are an integer representing the offset for the record in the other file (the characters file).
Characters file
Contiguous file of characters (UTF-8 encoding or anything you choose). "Records" are not separated, not terminated in any way, simple 1 by 1 characters. For example, the records Good, Hello, Morning will look like GoodHelloMorning.
To iterate the dataset, you iterate the integer/index file with direct access (recordNum * 8 is the byte offset of the record), read the integer and the character offset, plus the character offset of the next record (which is the 4-byte integer at recordNum * 8 + 12), then read the string from the characters file between the two offsets. Done!
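A rough sketch of reading one record in that layout, assuming both files are opened as RandomAccessFile and the offsets are byte offsets into a UTF-8-encoded characters file (names are illustrative only):
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class TwoFileIndex {
    // Prints the word and integer of record 'recordNum' using the 8-byte-per-record index.
    static void printRecord(RandomAccessFile index, RandomAccessFile chars, int recordNum)
            throws IOException {
        index.seek(recordNum * 8L);
        int value = index.readInt();    // the integer stored with this record
        int start = index.readInt();    // offset of the word in the characters file
        int end;
        if (index.getFilePointer() + 8 <= index.length()) {
            index.skipBytes(4);         // skip the next record's integer value
            end = index.readInt();      // the next record's offset marks the end of this word
        } else {
            end = (int) chars.length(); // the last record runs to the end of the characters file
        }
        byte[] buf = new byte[end - start];
        chars.seek(start);
        chars.readFully(buf);
        System.out.println(new String(buf, StandardCharsets.UTF_8) + " -> " + value);
    }
}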
it's less than 200MB. Max 20 chars for a word.
So why bother? Unless you work on some severely restricted system, load everything into a Map<String, Integer> and get a few orders of magnitude speed up.
But let's say, I'm overlooking something and let's continue.
Method 1: Instead of storing the integer as a number, we thought to store it as a string (using eg. writeChars or writeUTF), because in that case, we can insert a null character
You don't have to, as you said that your words contain no digits. So you can always parse things like 0124some456word789 uniquely.
The efficiency depends on the distribution. You may win a factor of 4 (single digit numbers) or lose a factor of 2.5 (10-digit numbers). You could save something by using a higher base. But there's the storage for the string and it may dominate.
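To illustrate the parse claim above, a quick sketch that splits such a mixed stream with a regular expression (the sample data and word-then-number ordering are made up for the example):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParseSketch {
    public static void main(String[] args) {
        // Words contain no digits, so letter runs and digit runs alternate unambiguously.
        Matcher m = Pattern.compile("([a-zA-Z]+)(\\d+)").matcher("some124word456");
        while (m.find()) {
            System.out.println(m.group(1) + " -> " + m.group(2)); // some -> 124, word -> 456
        }
    }
}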
Method 2: We keep the same structure, but instead we separate each tuple with 6-8 (or less) bytes of random noise (same across the file).
This is too wasteful. Using four zero bytes between the records would do:
Find a sequence of at least four zeros.
Find the last zero.
That's the last separator byte.
Method 3: Using some hacks, you could ensure that the number contains no zero byte (either assuming that it doesn't use the whole range or representing it with five bytes). Then a single zero byte would do.
Method 4: As disk is organized in blocks, you should probably split your data into 4 KiB blocks. Then you can add a small header allowing quick access to the data (start indexes for the 8th, 16th, etc. piece of data). The range between e.g. the 8th and 16th entry should be scanned sequentially, as that's both simpler and faster than binary search.

Memory and speed efficient search on Strings

I have a bunch of Strings I'd like a fast lookup for. Each String is 22 chars long and is looked up by the first 12 only (the "key", so to say); the full set of Strings is recreated periodically. They are loaded from a file and refreshed when the file changes. I have to deal with very little available memory; other server processes on my VPS need it too, and need it more.
How do I best store the Strings and search for them?
My current idea is to store them all one after another inside a char[] (to save RAM), and sort them for faster lookups (I figure the lookup is fastest if I have them presorted so I can use binary or interpolation search). But I'm not exactly sure how I should code it - if anyone is in the mood for a challenging puzzle: here it is...
Btw: It's probably ok to exceed the memory constraints for a while during the recreation / sorting, but it shouldn't be by much or for long.
Thanks!
Update
For the "I want to know specifics" crowd (correct me if I'm wrong in the Java details): The source files contain about 320 000 entries (all ANSI text), I really want to stay (WAY!) below 64 MB RAM usage and the data is only part of my program. Here's some information on sizes of Java types in memory.
My VPS runs a 32-bit OS, so...
one byte[], all concatenated = 12 + length bytes
one char[], all concatenated = 12 + length * 2 bytes
String = 32 + length * 2 bytes (is Object, has char[] + 3 int)
So I have to keep in memory:
~7 MB if all are stored in a byte[]
~14 MB if all are stored in a char[]
~25 MB if all are stored in a String[]
> 40 MB if they are stored in a HashTable / Map (for which I'd probably have to finetune the initial capacity)
A HashTable is not magical - it helps on insertion, but in principle it's just a very long array of Strings where the hashCode modulo capacity is used as an index, the data is stored in the next free position after that index, and it is searched linearly if it's not found there on lookup. But for a Hashtable, I'd need the String itself plus a substring of the first 12 chars for lookup. I don't want that (or am I missing something here?), sorry folks...
I would probably use a cache solution for that, maybe even Guava will do. Of course sort them, then binary search. Unfortunately I do not have the time for it :(
Sounds like a HashTable would be the right implementation for this situation.
Searching is done in constant time and refreshing could be done in linear time.
Java Data Structure Big-O (Warning PDF)
I coded a solution myself - but it's a little different than the question I posted because I could use information I didn't publish (I'll do better next time, sorry).
I'm just answering this because it's solved, I won't accept one of the other answers because they didn't really help with the memory constraints (and were a little short for my taste). They still got an upvote each, no hard feelings and thanks for taking the time!
I managed to push all of the info into two longs (with the key completely residing in the first one). The first 12 chars are an ISIN which can be compressed into a long because it only uses digits and capital letters, always starts with two capital letters and ends with a digit which can be reconstructed from the other chars. The product of all possible values leaves a little more than 3 bits to spare.
I store all entries from my source file in a long[] (packed ISIN first, other stuff in the second long) and sort them based on the first of two longs.
When I do a query by a key, I transform it to a long, do a binary search (which I'll maybe change to an interpolation search) and return the matching index. The different parts of the value are retrievable by said index - I get the second long from the array, unpack it and return the requested data.
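A simplified sketch of that packing (it skips the author's extra trick of dropping the reconstructible check digit; plain base-36 over 12 digit/capital-letter characters already fits in a signed long, since 36^12 ≈ 4.7e18 < 2^63, and it preserves sort order):
public class IsinPacker {
    // Packs a 12-character key of digits and capital letters into a long by
    // treating it as a base-36 number. The digit mapping is monotone in the
    // character order, so sorting the longs matches sorting the strings.
    static long pack(String key) {
        long packed = 0;
        for (int i = 0; i < 12; i++) {
            char c = key.charAt(i);
            int digit = (c <= '9') ? c - '0' : c - 'A' + 10; // 0..35
            packed = packed * 36 + digit;
        }
        return packed;
    }

    public static void main(String[] args) {
        System.out.println(pack("AB1234567890")); // an example 12-character key
    }
}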
The result: RAM usage dropped from ~110 MB to < 50 MB including Jetty (btw - I used a HashTable before) and lookups are lightning fast.

HashSet of Strings taking up too much memory, suggestions...?

I am currently storing a list of words (around 120,000) in a HashSet, for the purpose of using it as a list to check entered words against to see if they are spelt correctly, and just returning yes or no.
I was wondering if there is a way to do this which takes up less memory. Currently the 120,000 words take around 12 MB; the actual file the words are read from is around 900 KB.
Any suggestions?
Thanks in advance
You could use a prefix tree or trie: http://en.wikipedia.org/wiki/Trie
Check out Bloom filters or cuckoo hashing. Bloom filter or cuckoo hashing?
I am not sure if this is the answer to your question, but these alternatives are worth looking into. Bloom filters are mainly used for spell-checker kinds of use cases.
A HashSet is probably not the right structure for this. Use a trie instead.
This might be a bit late but using Google you can easily find the DAWG investigation and C code that I posted a while ago.
http://www.pathcom.com/~vadco/dawg.html
TWL06 - 178,691 words - fits into 494,676 Bytes
The downside of a compressed-shared-node structure is that it does not work as a hash function for the words in your list. That is to say, it will tell you if a word exists, but it will not return an index to related data for a word that does exist.
If you want the perfect and complete hash functionality, in a processor-cache sized structure, you are going to have to read, understand, and modify a data structure called the ADTDAWG. It will be slightly larger than a traditional DAWG, but it is faster and more useful.
http://www.pathcom.com/~vadco/adtdawg.html
All the very best,
JohnPaul Adamovsky
12MB to store 120,000 words is about 100 bytes per word. Probably at least 32 bytes of that is String overhead. If words average 10 letters and they are stored as 2-byte chars, that accounts for another 20 bytes. Then there is the reference to each String in your HashSet, which is probably another 4 bytes. The remaining 44 bytes is probably the HashSet entry and indexing overhead, or something I haven't considered above.
The easiest thing to go after is the overhead of the String objects themselves, which can take far more memory than is required to store the actual character data. So your main approach would be to develop a custom representation that avoids storing a separate object for each string. In the course of doing this, you can also get rid of the HashSet overhead, since all you really need is a simple word lookup, which can be done by a straightforward binary search on an array that will be part of your custom implementation.
You could create your custom implementation as an array of type int with one element for each word. Each of these int elements would be broken into sub-fields that contain a length and an offset that points into a separate backing array of type char. Put both of these into a class that manages them, and that supports public methods allowing you to retrieve and/or convert your data and individual characters given a string index and an optional character index, and to perform the simple searches on the list of words that are needed for your spell check feature.
If you have no more than 16777216 characters of underlying string data (e.g., 120,000 strings times an average length of 10 characters = 1.2 million chars), you can take the low-order 24 bits of each int and store the starting offset of each string into your backing array of char data, and take the high-order 8 bits of each int and store the size of the corresponding string there.
Your char data will have your erstwhile strings crammed together without any delimiters, relying entirely upon the int array to know where each string starts and ends.
Taking the above approach, your 120,000 words (at an average of 10 letters each) would require about 2,400,000 bytes of backing array data and 480,000 bytes of integer index data (120,000 x 4 bytes), for a total of 2,880,000 bytes, which is about a 75 percent savings over the present 12MB amount you have reported above.
The words in the arrays would be sorted alphabetically, and your lookup process could be a simple binary search on the int array (retrieving the corresponding words from the char array for each test), which should be very efficient.
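A compact sketch of that layout, assuming an in-memory build from a pre-sorted word array (class and method names are illustrative only):
public class PackedWordList {
    private final int[] index;   // length << 24 | offset into the char data
    private final char[] data;   // all words concatenated, no delimiters

    PackedWordList(String[] sortedWords) {
        index = new int[sortedWords.length];
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < sortedWords.length; i++) {
            index[i] = (sortedWords[i].length() << 24) | sb.length();
            sb.append(sortedWords[i]);
        }
        data = sb.toString().toCharArray();
    }

    private String wordAt(int i) {
        int offset = index[i] & 0xFFFFFF; // low 24 bits: start offset
        int length = index[i] >>> 24;     // high 8 bits: word length
        return new String(data, offset, length);
    }

    // Binary search over the sorted, packed word list.
    boolean contains(String word) {
        int lo = 0, hi = index.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = wordAt(mid).compareTo(word);
            if (cmp == 0) return true;
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return false;
    }

    public static void main(String[] args) {
        String[] words = {"apple", "banana", "cherry"}; // must be pre-sorted
        PackedWordList list = new PackedWordList(words);
        System.out.println(list.contains("banana")); // true
        System.out.println(list.contains("grape"));  // false
    }
}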
If your words happen to be entirely ASCII data, you could save an additional 1,200,000 bytes by storing the backing data as bytes instead of as chars.
This could get more difficult if you needed to alter these strings. Apparently, in your case (spell checker), you don't need to (unless you want to support user additions to the list, which would be infrequent anyway, and so re-writing the char data and indexes to add or delete words might be acceptable).
One way to save memory is to use a radix tree. This is better than a trie, as the prefixes are not stored redundantly.
As your dictionary is fixed, another way is to build a perfect hash function for it. Your hash set does not need buckets (and the associated overhead) as there cannot be collisions. Every implementation of a hash table/hash set that uses open addressing can be used for this (like Google Collections' ImmutableSet).
The problem is by design: storing such a huge number of words in a HashSet for spell-checking isn't a good idea:
You can either use a spell-checker (example: http://softcorporation.com/products/spellcheck/ ), or you can build up auto-word-completion with a prefix tree (description: http://en.wikipedia.org/wiki/Trie ).
There is no way to reduce memory usage in this design.
You can also try a radix tree (Wiki, Implementation). This is somewhat like a trie but more memory-efficient.
