Let’s say I have a huge set of strings that I want to write into a file as efficiently as possible. I don’t care if it’s not human readable.
The first thing that came to my mind was to write the strings as raw bytes to a binary file. I tried using DataOutputStream to write the byte array. However, when I open my file, it is readable.
How does this work? Does it actually write binary under the hood and only my text editor is making it readable?
Is this the most efficient way to do this?
I’d use this for a project where performance is key so I’m looking for the fastest way to write to a file (no need to be human readable).
Thanks in advance.
Files are just a sack of bytes. They always are, even a txt file.
Characters don't really exist, in the sense that computers don't know what they are, not really. They just know numbers.
So, how does that work?
Welcome to the wonderful world of text encoding.
The job of storing, say, the string "Hello!" in a file requires converting the notion of H, e, l, l, o, and ! into bytes first, and then write those bytes into a file.
In order to do that, you first need a lookup table; one that translates characters into numbers. Then, we have to convert those numbers to bytes, and then we can save them to a file.
A common encoding is US-ASCII. US-ASCII contains only 95 printable characters in its mapping: the 26 letters of the English alphabet in both lower- and uppercase form, all digits, a few useful symbols such as !@#$%^&*(, and space. That's it. US-ASCII simply has no 'mapping' for e.g. é or ☃ or even 😊.
All these characters are mapped to numbers between 32 and 126, so to put them in a text file you just write those numbers; a byte can represent anything between 0 and 255, so it 'just fits' (in fact, the high bit is always 0).
But, it's 2021, and we have emoji and we figured out a while ago that as it turns out, there are languages out there that aren't english, amazing, that.
So, the commonly used table is the Unicode table. This table represents a liiiiitle more than 94 characters. No no, this table has a whopping, at time of writing, 143,859 characters. Holy moly batman, that's a ton.
Clearly, the numbers that these 143,859 glyphs are mapped to must, at the very least, fall between 0 and 143,859 (it's actually a larger number range; there are gaps for convenience and to leave room for future updates).
You could just state that each number is an int (4 bytes, covering 0 through 2^31 - 1) and store each character as an int (so Hello!, at 6 characters, would turn into a file on disk that is 24 bytes large).
But a more common encoding is UTF-8. UTF-8 has the property that it stores ASCII-compatible characters as, well, ASCII, because those characters have the same 'number translation' in Unicode as they do in ASCII, AND UTF-8 stores those numbers as just that single byte. UTF-8 stores each character in 1, 2, 3, or 4 bytes, depending on what character it is. It's a 'variable length encoding scheme'.
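To see the variable-length scheme in action, here's a quick sketch (plain Java, runnable as-is in jshell) checking how many bytes a few characters need in UTF-8:

```java
import java.nio.charset.StandardCharsets;

// each println shows the number of UTF-8 bytes for one character
System.out.println("A".getBytes(StandardCharsets.UTF_8).length);  // 1 - plain ASCII
System.out.println("é".getBytes(StandardCharsets.UTF_8).length);  // 2
System.out.println("☃".getBytes(StandardCharsets.UTF_8).length);  // 3
System.out.println("😊".getBytes(StandardCharsets.UTF_8).length); // 4 - outside the BMP
```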
You can look up UTF-8 on e.g. Wikipedia if you want the full details.
For English-esque text, UTF-8 is extremely efficient, and no worse than ASCII (so there is no reason to use ASCII). You can do this in Java quite easily:
// Path, Files etc are from java.nio.file
Path p = Paths.get("mytextfile.txt");
Files.writeString(p, "Hello!");
That's all you need; the Files API defaults to UTF_8 (be aware that the old and mostly obsolete APIs such as FileWriter don't, and you should ALWAYS specify charset encoding for those! Or better yet, just don't use em and use java.nio.file instead).
Note that you can shove a unicode snowman or even emoji in there, it'll save fine.
There is no 'binary' variant. The file is bytes. If you open it in a text editor or run cat thatfile.txt, guess what? cat or your editor is reading in the bytes, taking a wild stab in the dark as to what encoding it might be, looking every decoded value up in the character table, and then farming out the work to the font rendering engine to show the characters again. It's just the editor doing you the convenience of showing the file with bytes:
72, 101, 108, 108, 111, 33
as Hello! because that's a lot easier to read. Open that 'text file' with a hex editor and you'll see that it contains exactly that sequence of numbers I showed (well, in hex, that too is just a rendering convenience).
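You can reproduce that exact byte sequence from Java itself; a small sketch:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// the raw bytes your text editor is politely rendering as "Hello!"
byte[] bytes = "Hello!".getBytes(StandardCharsets.UTF_8);
System.out.println(Arrays.toString(bytes)); // [72, 101, 108, 108, 111, 33]
```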
Still, if you want to store it 'efficiently', the answer is trivial: use a compression algorithm. You can throw that data through e.g. a GZIPOutputStream (ZipOutputStream also exists, but it expects you to create zip entries first) or use fancier compressors:
Path p = Paths.get("file.txt.gz");
try (OutputStream out = Files.newOutputStream(p);
        GZIPOutputStream zip = new GZIPOutputStream(out)) { // java.util.zip
    String shakespeare = "type the complete works of shakespeare here";
    zip.write(shakespeare.getBytes(StandardCharsets.UTF_8));
}
You'll find that file.txt.gz will be considerably fewer bytes than the total character count of the combined works of Shakespeare. Voila. Efficiency.
You can futz with your compression algorithm; there are many. Some are optimized for specific purposes, most fall on a tradeoff line between 'speed of compression' and 'efficiency of compression'. Many are configurable (compress better at the cost of running longer vs compress quickly, but it won't be quite as efficient). A few basic compression algorithms are baked into java, and for the fancier ones, well, a few have pure java implementations you can use.
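As a sketch of the full compress/decompress round trip using the GZIP streams from java.util.zip (done in memory here; swap the byte-array streams for file streams to work on disk):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

String text = "to be or not to be, that is the question. ".repeat(100);

// compress into a byte array
ByteArrayOutputStream buf = new ByteArrayOutputStream();
try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
    gz.write(text.getBytes(StandardCharsets.UTF_8));
}
byte[] compressed = buf.toByteArray();

// decompress and check the round trip
String restored;
try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
    restored = new String(in.readAllBytes(), StandardCharsets.UTF_8);
}
System.out.println(compressed.length + " bytes vs " + text.length() + " characters");
```

Repetitive text like this compresses dramatically; real prose less so, but still substantially.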
Related
For example, if a file is 100 bits, it would be stored as 13 bytes. This means that the first 4 bits of the last byte are part of the file and the last 4 bits are not (useless padding).
So how is this handled when reading a file using the FileInputStream.read() method in Java, or similar functions in other programming languages?
You'll notice if you ever use assembly that there's no way to read a specific bit. The smallest addressable unit of memory is a byte; memory addresses refer to a specific byte in memory. To work with a specific bit, you have to use bitwise operators like |, &, and ^. So in this situation, if you store 100 bits, you're actually storing a minimum of 13 bytes, and the leftover bits just default to 0, so the results are the same.
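The same bitwise operators exist in Java; a minimal sketch of reading and changing a single bit inside a value:

```java
int value = 0b1011_0100; // 180

// read bit 5 (counting from 0 at the least significant end)
int bit5 = (value >> 5) & 1;         // 1
// set bit 0
int withBit0 = value | (1 << 0);     // 0b1011_0101
// clear bit 2
int withoutBit2 = value & ~(1 << 2); // 0b1011_0000
System.out.println(bit5 + " " + withBit0 + " " + withoutBit2);
```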
Current file systems mostly store files that are an integral number of bytes, so the issue does not arise. You cannot write a file that is exactly 100 bits long. The reason for this is simple: the file metadata holds the length in bytes and not the length in bits.
This is a conscious design choice by the designers of the file system. They presumably chose the design the way they did out of a consideration that there's very little need for files that are an arbitrary number of bits long.
Those cases that do need a file to contain a non-integral number of bytes can (and need to) make their own arrangements. Perhaps the 100-bit case could insert a header that says, in effect, that only the first 100 bits of the following 13 bytes have useful data. This would of course need special handling, either in the application or in some library that handled that sort of file data.
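A sketch of such a header scheme (a hypothetical format, not a standard one): write the payload length in bits as a 4-byte int, then the bits packed into whole bytes, zero-padding the last byte:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

int bitLength = 100;
byte[] packed = new byte[(bitLength + 7) / 8]; // 13 bytes hold 100 bits

// write: 4-byte bit-length header, then the packed payload
ByteArrayOutputStream buf = new ByteArrayOutputStream();
try (DataOutputStream out = new DataOutputStream(buf)) {
    out.writeInt(bitLength); // tells the reader how many bits are meaningful
    out.write(packed);
}

// read it back: the header says only the first 100 bits matter
DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
int bits = in.readInt();
byte[] data = in.readNBytes((bits + 7) / 8); // exactly the 13 payload bytes
```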
Comments about bit-lengthed files not being possible because of the size of a boolean, etc., seem to me to miss the point. Certainly disk storage granularity is not the issue: we can store a "100 byte" file on a device that can only handle units of 256 bytes - all it takes is for the file system to note that the file size is 100, not 256, even though 256 bytes are allocated to the file. It could equally well track that the size was 100 bits, if that were useful. And, of course, we'd need I/O syscalls that expressed the transfer length in bits. But that's not hard. The in-memory buffer would need to be slightly larger, because neither the language nor the OS allocates RAM in arbitrary bit-lengths, but that's not tied tightly to file size.
Introduction
We store tuples (string,int) in a binary file. The string represents a word (no spaces nor numbers). In order to find a word, we apply binary search algorithm, since we know that all the tuples are sorted with respect to the word.
In order to store this, we use writeUTF for the string and writeInt for the integer. Other than that, let's assume for now there are no ways to distinguish between the start and the end of the tuple unless we know them in advance.
Problem
When we apply binary search, we get a position (i.e. (a+b)/2) in the file, which we can read using methods in Random Access File, i.e. we can read the byte at that place. However, since we can be in the middle of the word, we cannot know where this words starts or finishes.
Solution
Here are two possible solutions we came up with; however, we're trying to decide which one will be more space-efficient/faster.
Method 1: Instead of storing the integer as a number, we thought to store it as a string (using e.g. writeChars or writeUTF), because in that case we can insert a null character at the end of the tuple. That is, we can be sure that none of the methods used to serialize the data will emit the null character, since the information we store (letters and digits) has higher ASCII values.
Method 2: We keep the same structure, but instead we separate each tuple with 6-8 (or less) bytes of random noise (same across the file). In this case, we assume that words have a low entropy, so it's very unlikely they will have any signs of randomness. Even if the integer may get 4 bytes that are exactly the same as those in the random noise, the additional two bytes that follow will not (with high probability).
Which of these methods would you recommend? Is there a better way to store this kind of information. Note, we cannot serialize the entire file and later de-serialize it into memory, since it's very big (and we are not allowed to).
I assume you're trying to optimize for speed & space (in that order).
I'd use a different layout, built from 2 files:
Integer + Index file
Each "record" is exactly 8 bytes long; the lower 4 bytes are the integer value for the record, and the upper 4 bytes are an integer representing the offset of the record in the other file (the characters file).
Characters file
Contiguous file of characters (UTF-8 encoding or anything you choose). "Records" are not separated, not terminated in any way, simple 1 by 1 characters. For example, the records Good, Hello, Morning will look like GoodHelloMorning.
To iterate the dataset, you iterate the integer/index file with direct access (recordNum * 8 is the byte offset of the record), read the integer and the characters offset, plus the character offset of the next record (which is the 4 byte integer at recordNum * 8 + 12), then read the string from the characters file between the offsets you read from the index file. Done!
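A minimal sketch of that lookup with RandomAccessFile, assuming the two files have exactly the layout described (each index record: 4-byte int value, then 4-byte character offset; the file names are up to you):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// reads record number recordNum's word from the characters file,
// using the offsets stored in the index file
static String readWord(RandomAccessFile index, RandomAccessFile words, int recordNum)
        throws IOException {
    index.seek(recordNum * 8L + 4);      // skip the int value, read this record's offset
    int start = index.readInt();
    int end;
    if ((recordNum + 1) * 8L + 8 <= index.length()) {
        index.seek(recordNum * 8L + 12); // the next record's offset field
        end = index.readInt();
    } else {
        end = (int) words.length();      // last record runs to end of file
    }
    byte[] buf = new byte[end - start];
    words.seek(start);
    words.readFully(buf);
    return new String(buf, StandardCharsets.UTF_8);
}
```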
it's less than 200MB. Max 20 chars for a word.
So why bother? Unless you work on some severely restricted system, load everything into a Map<String, Integer> and get a few orders of magnitude speed up.
But let's say, I'm overlooking something and let's continue.
Method 1: Instead of storing the integer as a number, we thought to store it as a string (using eg. writeChars or writeUTF), because in that case, we can insert a null character
You don't have to as you said that your word contains no numbers. So you can always parse things like 0124some456word789 uniquely.
The efficiency depends on the distribution. You may win a factor of 4 (single digit numbers) or lose a factor of 2.5 (10-digit numbers). You could save something by using a higher base. But there's the storage for the string and it may dominate.
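A sketch of that idea: because words contain no digits, splitting at every digit/non-digit boundary recovers the pieces unambiguously:

```java
import java.util.Arrays;

// split wherever a digit meets a non-digit (or vice versa)
String data = "0124some456word789";
String[] tokens = data.split("(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)");
System.out.println(Arrays.toString(tokens)); // [0124, some, 456, word, 789]
```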
Method 2: We keep the same structure, but instead we separate each tuple with 6-8 (or less) bytes of random noise (same across the file).
This is too wasteful. Using four zero bytes between records would do:
Find a sequence of at least four zeros.
Find the last zero.
That's the last separator byte.
Method 3: Using some hacks, you could ensure that the number contains no zero byte (either assuming that it doesn't use the whole range or representing it with five bytes). Then a single zero byte would do.
Method 4: As disk is organized in blocks, you should probably split your data into 4 KiB blocks. Then you can add a small header allowing quick access to the data (start indexes for the 8th, 16th, etc. piece of data). The range between e.g. the 8th and 16th piece should then be scanned sequentially, as that's both simpler and faster than binary search.
Alright, so we need to store a list of words and their respective position in a much bigger text. We've been asked if it's more efficient to save the position represented as text or represented as bits (data streams in Java).
I think that a bitwise representation is best, since the text "1024" takes up 4*8=32 bits while the number needs only 11 bits in binary.
The follow-up question is whether the index should be saved in one or two files. Here I thought "perhaps you can't combine text and bitwise representation in one file?" and that's the reason you'd need two files?
So the question, first and foremost, is: can I store text information (the word) combined with bitwise information (its position) in one file?
Too vague in terms of what's really needed.
If you have up to a few million words + positions, don't even bother thinking about it. Store in whatever format is the simplest to implement; space would only be an issue if you need to send the data over a low-bandwidth network.
Then there is general data compression available: by just wrapping your Input/OutputStreams with a deflater or gzip stream (already built into the JRE) you will get reasonably good compression (50% or more for text). That easily beats what you can quickly write yourself. If you need better compression, there is XZ for Java (implements LZMA compression), open source.
If you need random access, you're on the wrong track, you will want to carefully design the data layout for the access patterns and storage should be only of tertiary concern.
The number 1024 would take at least 2-4 bytes (so 16-32 bits), as you need to know where the number starts and ends, and so it must have a fixed size. If your positions are very big, like 124058936, you would need to use 4 bytes per number (which would be better than the 9 bytes of its string representation).
Using binary files you'll need a way to know where the string starts and ends, too. You can do this by storing a byte before it with its length, and reading the string like this:
byte[] arr = new byte[in.readByte()]; // in.readByte() * 2 if the string is encoded in 16 bits
in.readFully(arr); // in is a DataInputStream / RandomAccessFile; readFully fills the whole array
String yourString = new String(arr, "US-ASCII");
The other possibility would be terminating your string with a null character (0x00), but you would need to create your own implementation for that, as no readers support it by default (AFAIK).
Now, is it really worth storing it as binary data? That really depends on how big your positions are (the strings take the same number of bytes either way; in the text version they'd just be separated from their position with a space).
My recommendation is that you use the text version, as it will probably be easier to parse and more readable.
About using one or two files, it doesn't really matter. You can combine text and binary in the same file, and it would take the same space (though making it in two separated files will always take a bit more space, and it might make it more messy to edit).
I understand the structure of a java .class file, but when I want to interpret the raw hex data I get a bit lost.
This is a hex dump of a class file, excluding the header and constant pool.
I understand the header to be the magic number, minor_version and major_version. It seems the next value should be the access flags.
Which value would that be in this chart? 000000b0? I thought it would be a simple number not a hex value.
Which value is this_class, the index into the constant pool where the class details can be determined?
The 000000b0 is not part of the data. It's the offset (address) at which the following 16 bytes are located.
The two-digit hex numbers are the actual data. Read them from left to right. Each row is in two groups of eight, purely to assist in working out addresses etc.
So to answer your question indirectly, you can work out where the access flags are by simply counting past the number of bytes used by the magic number, minor version and major version. The access flags will come next. Likewise, to find any other values (such as this_class), you have to work out what their offset is and look at that location in the data.
You say that you expected a "simple number not a hex value", but that doesn't really make sense, as hex values are simple numbers. They're simply represented in base 16 instead of base 10. There are plenty of resources online that will teach you how to convert between the two.
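For example, converting between the two in Java (the flag constants in the comment come from the JVM class-file specification):

```java
// a hex value is just a number written in base 16
int accessFlags = Integer.parseInt("0021", 16);
System.out.println(accessFlags);             // 33
System.out.println(Integer.toHexString(33)); // 21
// 0x0021 == ACC_PUBLIC (0x0001) | ACC_SUPER (0x0020)
```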
I am invoking DB2 stored procedure created using COBOL from my java application.
input macro (type varchar):
01 SP1-INPUTS.
05 FIELD-1 PIC X(03).
05 FIELD-2 PIC S9(09) COMP.
05 FIELD-3 PIC S9(15)V9(02) COMP-3.
05 FIELD-3X REDEFINES FIELD-3 PIC X(09)
To test the stored procedure, I know only the value for FIELD-1. For the other fields, how many zeroes should I put to fill the packed portions? Please see the code which I wrote; I'm confused about passing dummy values.
String field1="abc";
String field2="000000000"; // 9 zeroes, correct?
String field3="00...0" // should I give 18 zeroes or 9 zeroes?
How many characters in total for the input macro?
COBOL does not have strings, it has fixed-length fields with no terminators.
So for FIELD-1 you have three "characters". A PICture of X can allow any of the 256 possible bit-values, but would typically contain human-readable values.
FIELD-2 is a binary field. You could consider it to be a four-byte integer.
However, the way it is defined there, COMP, with an S, it has a maximum value of 999,999,999 positive and a minimum value of 999,999,999 negative.
If it were defined as COMP-5, it could contain the full range for all bits in those four bytes. Note that compiler option TRUNC(BIN) would modify the behaviour of COMP to COMP-5 anyway, so you need to check with the Mainframe side what compile option is used for TRUNC (other values are STD and OPT).
On an IBM Mainframe a native binary field is Big Endian. On your local machine a native binary value would likely be Little Endian. For instance, the value 16909060 would be stored as X'01020304' on the mainframe.
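On the Java side, ByteBuffer defaults to big-endian, which conveniently matches the mainframe layout; a sketch:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

byte[] raw = {0x01, 0x02, 0x03, 0x04}; // four bytes as stored on the mainframe

// ByteBuffer reads big-endian by default
int bigEndian = ByteBuffer.wrap(raw).getInt();    // 16909060 (0x01020304)
int littleEndian = ByteBuffer.wrap(raw)
        .order(ByteOrder.LITTLE_ENDIAN).getInt(); // 67305985 (0x04030201)
System.out.println(bigEndian + " vs " + littleEndian);
```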
FIELD-3 is packed-decimal. It is nine bytes long, but each byte contains two decimal digits, except for the low-order (right-most) byte which contains one digit followed by a sign specifier (the PICture uses S, so the sign should be C for positive and D for negative, base-16). There is an implied decimal point (the V) and two decimal places.
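A sketch of decoding such a COMP-3 field in Java, following the nibble layout just described (this is a hypothetical helper, not a library call):

```java
import java.math.BigDecimal;
import java.math.BigInteger;

// decode a packed-decimal field: two digit nibbles per byte, except the last
// byte, which holds one digit plus the sign nibble (0xC positive, 0xD negative)
static BigDecimal decodePacked(byte[] field, int scale) {
    StringBuilder digits = new StringBuilder();
    for (int i = 0; i < field.length - 1; i++) {
        digits.append((field[i] >> 4) & 0xF).append(field[i] & 0xF);
    }
    int last = field[field.length - 1] & 0xFF;
    digits.append((last >> 4) & 0xF);
    BigDecimal value = new BigDecimal(new BigInteger(digits.toString()), scale);
    return (last & 0xF) == 0xD ? value.negate() : value;
}
```

For a PIC S9(15)V9(02) COMP-3 field you'd pass the 9 raw bytes and a scale of 2 (the implied decimal places).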
FIELD-3X is another X field. It could again contain any bit-pattern in each of the bytes. You'd need to know what the intended use of that field was (there's not even the slightest clue from the names, as with the others, which is a bad way to do things).
The design is wrong. It is completely nuts to send binary and packed-decimal fields from the Mainframe to somewhere else, or vice versa. If all the fields were defined as X or 9 without the USAGE (implied) of COMP or COMP-3, then you'd have a very easy ride.
There's a required full-stop/period missing on the final definition, but that's likely a paste error.
01 SP1J-INPUTS.
05 a-meaningful-name PIC X(03).
05 another-meaningful-name PIC S9(09)
SIGN LEADING SEPARATE.
05 a-third-meaningful-name PIC +9(15).9(02).
05 yet-annother-meaningful-name
REDEFINES a-third-meaningful-name
PIC X(19).
This shows two ways to deal with the sign, either with the SIGN clause, or by using a numeric-edited definition. The . in the numeric-edited definition is an actual decimal-point, rather than an implied one, with the V.
All the data is now "text" or "character" data, which should be easy for you to deal with. EBCDIC (IBM Mainframe encoding) to ASCII (your local machine encoding, likely) is easy and can be done at data level, not field level.
For your to-and-fro communication, the above would be much easier for you, much less error-prone and more easily auditable. Internally the COBOL program can easily convert to/from those for its own internal use.
If you don't get them to change your interface to "character" then you'll have all sorts of extra coding and testing to do, for no advantage.
An example of that layout with ABC, 123456789 (negative) and 123456789012345.67 (positive) would be
ABC-123456789+123456789012345.67
Note that in addition to no field-delimiters, there are no data/record delimiters either. No "null"s.
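With the character-only layout, parsing on the Java side becomes plain fixed-width substring work; a sketch against the example record above (field widths: 3, then sign + 9 digits, then sign + 15 digits, point, 2 digits):

```java
import java.math.BigDecimal;

String record = "ABC-123456789+123456789012345.67"; // 3 + 10 + 19 = 32 characters

String field1 = record.substring(0, 3);                   // "ABC"
int field2 = Integer.parseInt(record.substring(3, 13));   // -123456789
BigDecimal field3 = new BigDecimal(record.substring(13)); // 123456789012345.67
System.out.println(field1 + " / " + field2 + " / " + field3);
```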
There is an alternative to the actual decimal-point, which would be to provide a scaling factor. Further, you could "hard-code" the scaling in your program.
I assume the above data would be easy for you to both accept and create. Please try to get your interface changed. If they refuse, document with your boss the impact in extra code, and that the extra code is just to be able to "understand" the data before you even get to think about using it. Which is silly.
To create the easy-format data for you, the COBOL program needs to do this:
MOVE FIELD-1 TO a-meaningful-name
MOVE FIELD-2 TO another-meaningful-name
MOVE FIELD-3 TO a-third-meaningful-name
To receive the easy-format data from you, the COBOL program needs to do this:
MOVE a-meaningful-name TO FIELD-1
MOVE another-meaningful-name TO FIELD-2
MOVE a-third-meaningful-name TO FIELD-3
If the REDEFINES has a purpose, it would require specific code for the fourth field; what that code must do is difficult for me to guess, but not difficult to write once the actual need is known.
Nothing onerous, and vastly simpler than what you have to code otherwise.