We usually compress data with DeflaterOutputStream (or GZIPOutputStream) and decompress it with InflaterInputStream (or GZIPInputStream).
But Java has also had DeflaterInputStream and InflaterOutputStream since Java 1.6. What are the use cases for these two classes?
You can use them if you need to process data in its compressed (deflated) format, or if you have compressed data in memory that you need to decompress and output. This could come in handy if you need to store compressed data in some location that does not handle streams very well (such as a database field), or if you have obtained compressed data from a non-stream source and want to decompress it to a stream destination.
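For example, here is a minimal sketch, using in-memory byte arrays as stand-ins for a database column or some other non-stream source/destination:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.DeflaterInputStream;
import java.util.zip.InflaterOutputStream;

public class DeflaterInputStreamExample {
    public static void main(String[] args) throws Exception {
        byte[] original = "some data to store compressed".getBytes("UTF-8");

        // DeflaterInputStream: read *compressed* bytes from an uncompressed source,
        // e.g. to store them in a database BLOB column.
        ByteArrayOutputStream compressedBuffer = new ByteArrayOutputStream();
        try (DeflaterInputStream in = new DeflaterInputStream(new ByteArrayInputStream(original))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                compressedBuffer.write(chunk, 0, n);
            }
        }
        byte[] compressed = compressedBuffer.toByteArray();

        // InflaterOutputStream: write compressed bytes and have them decompressed
        // on the way out to the wrapped stream.
        ByteArrayOutputStream restored = new ByteArrayOutputStream();
        try (InflaterOutputStream out = new InflaterOutputStream(restored)) {
            out.write(compressed);
        }
        System.out.println(new String(restored.toByteArray(), "UTF-8"));
    }
}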
My problem is this: I have a Snappy-compressed Avro file of 2 GB with about 1000 Avro records stored on HDFS. I know I can write code to open this Avro file and print out each record. My question is: is there a way in Java to open this Avro file, iterate through each record, and output to a text file the start position and end position of each record within that file, so that I could have a function readRecord(startPosition, endPosition) that uses those positions to quickly read one specific record without iterating through the whole file?
I don't have time to provide you with an off-the-shelf implementation, but I think I can give you some hints.
Let's start with the Avro Specification: Object Container Files
Basically, an Avro file is a sequence of self-contained blocks containing one or more records (you can configure the block size, and a record will never be split across two blocks). At the beginning of each block you find:
A long indicating the count of objects in this block.
A long indicating the size in bytes of the serialized objects in the current block, after any codec is applied
The serialized objects. If a codec is specified, this is compressed by that codec.
The file's 16-byte sync marker.
The documentation explicitly states "Thus, each block's binary data can be efficiently extracted or skipped without deserializing the contents. The combination of block size, object counts, and sync markers enable detection of corrupt blocks and help ensure data integrity.".
You cannot directly seek to a specific record, but you can seek to a given block and then iterate over its objects. It is not exactly what you need, but it seems close enough. I believe you won't be able to do much better than that with Avro containers. You can still tweak the block size to bound the maximum number of iterations within a block. When compression is used, it is applied at block level, so it won't be an issue.
I believe that such a reader can be implemented using only the public Avro API (DataFileReader provides seek and sync methods, etc.).
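For illustration, here is a rough sketch along those lines, using what I believe is the public API (DataFileReader, GenericDatumReader; previousSync() to remember a block's position and seek() to jump back to it) and a hypothetical records.avro file; treat it as a starting point rather than tested code:

import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroBlockIndexSketch {
    public static void main(String[] args) throws Exception {
        File avroFile = new File("records.avro"); // hypothetical file name
        try (DataFileReader<GenericRecord> reader =
                new DataFileReader<>(avroFile, new GenericDatumReader<GenericRecord>())) {

            // First pass: for each record, note the sync position of the block it lives in.
            long blockStart = reader.previousSync();
            // ... iterate with reader.hasNext()/reader.next(), store blockStart per record,
            //     and refresh blockStart from reader.previousSync() as you go.

            // Later ("readRecord"): jump straight to the stored block and iterate only inside it.
            reader.seek(blockStart);
            while (reader.hasNext()) {
                GenericRecord record = reader.next();
                // stop once you hit the record you indexed within this block
                break; // placeholder: replace with your match condition
            }
        }
    }
}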
You could compress each record individually. This won't give you as good a compression ratio, but it gives you random access.
I suggest using a ZIP or JAR format:
Give each record a notional file name; it could be just a number.
Write the serialized data as the contents of that entry in the JAR.
When you want random access:
Open the JAR.
Look up the entry by name.
Read it and deserialize it.
This will compress the data in the most efficient manner possible for each entry.
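Here is a rough sketch of that approach using the standard java.util.zip classes (the record names and payloads are made up):

import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class RecordZipSketch {
    public static void main(String[] args) throws Exception {
        // Write: one individually compressed entry per record, named by record number.
        try (ZipOutputStream zip = new ZipOutputStream(new FileOutputStream("records.zip"))) {
            for (int i = 0; i < 3; i++) {
                zip.putNextEntry(new ZipEntry(String.valueOf(i)));
                zip.write(("serialized record " + i).getBytes("UTF-8")); // stand-in for real serialized bytes
                zip.closeEntry();
            }
        }

        // Random access: look up one entry by name and read only that record.
        try (ZipFile zip = new ZipFile("records.zip");
             InputStream in = zip.getInputStream(zip.getEntry("1"));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            System.out.println(out.toString("UTF-8"));
        }
    }
}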
I have a Java project with a huge set of XML files (>500). Reading these files at runtime leads to performance issues.
Is there an option to load all the XML files to RAM and read from there instead of the disk?
I know there are products like RamDisk but this one is a commercial tool.
Can I copy XML files to main memory and read from main memory using any existing Java API / libraries?
I would first try memory-mapped files, as provided by RandomAccessFile and FileChannel in the standard Java library. This way the OS will be able to keep frequently used file content in memory, effectively achieving what you want.
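A minimal sketch of that idea, assuming a hypothetical config.xml (the mapping is read-only and the OS page cache keeps hot pages in memory):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

public class MappedXmlSketch {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("config.xml", "r");
             FileChannel channel = raf.getChannel()) {
            // Map the whole file; nothing is copied onto the Java heap up front.
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            String xml = StandardCharsets.UTF_8.decode(buf).toString();
            // hand the string (or the buffer itself) to your XML parser
            System.out.println(xml.length());
        }
    }
}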
You can use an in-memory database to store the intermediate files (the XML files). This gives you the speed of RAM together with the convenience of a database.
For reference use the following links:
http://www.mcobject.com/in_memory_database
Usage of H2 as an in-memory database:
http://www.javatips.net/blog/2014/07/h2-in-memory-database-example
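For example, a minimal sketch of caching XML content in an H2 in-memory database (requires the H2 driver on the classpath; the table and column names here are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class H2InMemorySketch {
    public static void main(String[] args) throws Exception {
        // DB_CLOSE_DELAY=-1 keeps the in-memory database alive across connections.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:xmlcache;DB_CLOSE_DELAY=-1")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE xml_files (name VARCHAR(255) PRIMARY KEY, content CLOB)");
            }
            try (PreparedStatement insert =
                     conn.prepareStatement("INSERT INTO xml_files (name, content) VALUES (?, ?)")) {
                insert.setString(1, "example.xml");
                insert.setString(2, "<root><item/></root>");
                insert.executeUpdate();
            }
            try (PreparedStatement query =
                     conn.prepareStatement("SELECT content FROM xml_files WHERE name = ?")) {
                query.setString(1, "example.xml");
                try (ResultSet rs = query.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("content"));
                    }
                }
            }
        }
    }
}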
Use the java.io.RandomAccessFile class. It behaves like a large array of bytes stored in the file system, and instances support both reading and writing at arbitrary positions.
I would also suggest memory-mapping the file (MappedByteBuffer), so its contents are paged in directly from disk by the OS on demand instead of being copied wholesale onto the Java heap.
RandomAccessFile file = new RandomAccessFile("wiki.txt", "r");
FileChannel channel = file.getChannel();
// Map read-only to match the "r" open mode; READ_WRITE would fail on a read-only channel.
MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
And then you can read the buffer as usual.
Have you considered creating an object structure for these files and serializing it? Java object serialization and deserialization is much faster than parsing XML. This assumes, of course, that these 500 or so XML files don't get modified between reads.
Here is an article which talks about serializing and deserializing.
If the concern is loading file content into memory, then consider the ByteArrayInputStream and ByteArrayOutputStream classes, or even ByteBuffer; these can hold the bytes in memory.
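For example, a minimal sketch (the directory handling is a placeholder) that loads each XML file into a byte array once and serves ByteArrayInputStreams from memory afterwards:

import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.InputStream;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.Map;

public class InMemoryXmlCache {
    private final Map<String, byte[]> cache = new HashMap<>();

    public void loadDirectory(File dir) throws Exception {
        for (File f : dir.listFiles((d, name) -> name.endsWith(".xml"))) {
            cache.put(f.getName(), Files.readAllBytes(f.toPath()));
        }
    }

    public InputStream open(String name) {
        // no disk access after the initial load
        return new ByteArrayInputStream(cache.get(name));
    }
}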
Java object serialization/deserialization is not faster than XML writing and parsing in general. When large numbers of objects are involved Java serialization/deserialization can actually be very inefficient, because it tracks each individual object (so that repeated references aren't serialized more than once). This is great for networks of objects, but for simple tree structures it adds a lot of overhead with no gains.
Your best approach is probably to just use a fast technique for processing the XML, such as javax.xml.stream.XMLStreamReader. Unless the files are huge, 30-40 seconds to load the XML files is way out of line; you're probably processing the XML inefficiently, for example by loading everything into a DOM. You can also try reading multiple files in parallel (for example with Java 8 parallel streams).
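For illustration, a rough sketch combining javax.xml.stream.XMLStreamReader with Java 8 parallel streams (the directory path and the per-element handling are placeholders):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class FastXmlLoadingSketch {
    public static void main(String[] args) throws Exception {
        try (Stream<Path> files = Files.list(Paths.get("xml-dir"))) {
            files.parallel().forEach(path -> {
                try (InputStream in = Files.newInputStream(path)) {
                    // one factory per file keeps this thread-safe
                    XMLStreamReader reader = XMLInputFactory.newFactory().createXMLStreamReader(in);
                    while (reader.hasNext()) {
                        if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                            // pull out only the data you actually need here
                        }
                    }
                    reader.close();
                } catch (Exception e) {
                    throw new RuntimeException(path.toString(), e);
                }
            });
        }
    }
}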
It looks like your main issue is the large number of files and that RAM is not a problem. Can you confirm?
Could you do a preprocessing step where you append all these files into one big file, using some kind of separator? That way you can increase the block size of your reads and avoid the performance penalty of disk seeks.
Have you thought about compressing the XML files and reading them in compressed form? Compressed XML can be as little as 3-5% of the original size, or better. You can decompress a file when it needs to be visible to users and store it compressed again afterwards.
Here is a library I found that might help:
zip4j
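If you would rather stay with the standard library than add zip4j, a minimal GZIP sketch could look like this (the file names are placeholders):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class XmlGzipSketch {
    public static void compress(String src, String dest) throws Exception {
        try (InputStream in = new FileInputStream(src);
             OutputStream out = new GZIPOutputStream(new FileOutputStream(dest))) {
            copy(in, out);
        }
    }

    public static void decompress(String src, String dest) throws Exception {
        try (InputStream in = new GZIPInputStream(new FileInputStream(src));
             OutputStream out = new FileOutputStream(dest)) {
            copy(in, out);
        }
    }

    private static void copy(InputStream in, OutputStream out) throws Exception {
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
    }
}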
It all depends on whether you read the data more than once or not.
Assume we use some sort of Java-based RAM disk (in practice it would be some sort of buffer or byte array).
Further assume that processing the data takes less time than reading it. Since you have to read the data at least once anyway, it makes no difference whether you first read it from disk into memory and then process it from memory, or process it while reading.
If you read a file more than once, you could read all the files into memory first (there are various options: buffers, byte arrays, a custom FileSystem, ...).
In case processing takes longer than reading (which does not seem to be the case here), you could pre-fetch the files from disk on one thread and process the data from memory on another.
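A rough sketch of that pre-fetch idea, with a hypothetical xml-dir directory and a simple bounded queue between the two threads:

import java.io.File;
import java.nio.file.Files;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PrefetchSketch {
    public static void main(String[] args) throws Exception {
        File[] files = new File("xml-dir").listFiles();
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(16);

        // Reader thread: keeps pulling file contents off the disk ahead of time.
        Thread reader = new Thread(() -> {
            try {
                for (File f : files) {
                    queue.put(Files.readAllBytes(f.toPath()));
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        reader.start();

        // Processing thread (here: main): works purely from memory.
        for (int i = 0; i < files.length; i++) {
            byte[] content = queue.take();
            // ... parse/process content here
        }
        reader.join();
    }
}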
In my program, I am reading a series of text files from the disk. With each text file, I process out some data and store the results as JSON on the disk. In this design, each file has its own JSON file. In addition to this, I also store some of the data in a separate JSON file, which stores relevant data from multiple files. My problem is that the shared JSON grows larger and larger with every file parsed, and eventually uses too much memory. I am on a 32-bit machine and have 4 GB of RAM, and cannot increase the memory size of the Java VM anymore.
Another constraint to consider is that I often refer back to the old JSON. For instance, say I pull out ObjX from FileY. In pseudo code, the following happens (using Jackson for JSON serialization/deserialization):
// In the main method.
FileYJSON = parse(FileY);
ObjX = FileYJSON.get(some_key);
sharedJSON.add(ObjX);

// In the sharedJSON object
List objList;

function add(obj) {
    if (!objList.contains(obj))
        objList.add(obj);
}
The only thing I can think to do is use streaming JSON, but the problem is that I frequently need to access the JSON that came before, so I don't know that streaming will work. Also, my data types are not only strings, which prevents me from using Jackson's streaming capabilities (I believe). Does anyone know of a good solution?
If you're getting to the point where your data structures are so large that you're running out of memory, you'll have to start using something else. I would recommend that you use a database, which will significantly speed up data retrieval and storage. It will also make the limit of your data structure the size of your hard drive, instead of the size of your RAM.
Try this page for an introduction to Java and Databases.
I can't believe that you really need nearly 4GB RAM only for text files and JSON.
I see three possible solutions.
Switch to plain text if that's possible; it is not as memory hungry.
Just open and close the files as you need them. You can organize the files by a naming convention, for example the first two/three/... digits of their hashes, and open them on demand.
If you have that much data, you could switch to a database. That would save a lot of resources.
I would prefer option 3 if it's possible for you.
You could expose the data through an API and read the response body from it.
I'm trying to uncompress data that was compressed using the ZLIB library written by Jean-loup Gailly back in the 1990s. I think it is a popular library (I see a lot of programs that ship the zlib32.dll file it uses) so I hope someone will be familiar enough with it to help me. I am using the compress() function directly which from what I read uses rfc-1951 DEFLATE format.
Here is a segment of the code I am using to read some compressed data from a stream and uncompress it:
InputStream is = new ByteArrayInputStream(buf);
//GZIPInputStream gzis = new GZIPInputStream(is);
InflaterInputStream iis = new InflaterInputStream(is);
byte[] buf2 = new byte[uncompressedDataLength];
iis.read(buf2);
The iis.read(buf2) function throws an internal exception of "Data Format Error". I tried using GZIPInputStream also, but that also throws the same exception.
The "buf" variable is type byte[] and I have confirmed by debugging that it is the same as what my C program gets back from the ZLIB compress() function (the actual data comes from a server over TCP). "uncompressedDataLength" is the known size of the uncompressed data that was also provided by the C program (server).
Has anyone tried reading/writing data using this library and then reading/writing the same data on the Android using Java?
I did find a "pure Java port of ZLIB" referenced in a few places, and if I need to I can try that, but I would rather use the built-in/OS functions if possible.
The data formats deflate, zlib and gzip in play here are all related.
The base is the deflate compressed data format, defined in RFC 1951.
As it is often quite useless in its pure form, we usually use a wrapping format around it.
The gzip compressed data format (RFC 1952) is intended for compression of files. It consists of a header which has space for a file name and some attributes, a deflate data stream, and a CRC-32 check sum (4 bytes) at the end. (There is also support of multiple such files in one stream in the specification, but I think this isn't used as often.)
The zlib compressed data format, defined in RFC 1950: It consists of a smaller header (2 or 6 bytes), a deflate data stream, and an Adler-32 check sum (4 bytes) at the end. (The Adler-32 check sum is intended to be faster to calculate than the CRC-32 check sum used in gzip.) It is intended for compressed transmission of data inside some other protocols, or compressed storage inside other file formats. For example, it is used inside the PNG file format.
The zlib library supports all these formats. Java's java.util.zip is built on zlib (as part of the VM's implementation/native calls) and exposes access to them with several classes:
The Deflater and Inflater classes implement - depending on the nowrap argument to the constructor - either the zlib or the deflate data formats.
DeflaterOutputStream/DeflaterInputStream/InflaterInputStream/InflaterOutputStream build on a Deflater/Inflater. The documentation doesn't say clearly whether the default Inflater/Deflater implements zlib or deflate, but the source shows that it uses the default Deflater or Inflater constructor, which implements zlib.
GZIPOutputStream/GZIPInputStream implement, as the names say, the gzip format.
I had a look at the source code of zlib's compress function, and it does seem to use the zlib format, so your code should do the right thing. Make sure there is no missing data, and no additional data before or after the compressed block that is not part of it.
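Note also that a single iis.read(buf2) call is not guaranteed to fill the whole buffer. Here is a minimal sketch that decompresses zlib-format data with Inflater and loops until the output is complete (it assumes, as in your case, that the uncompressed length is already known):

import java.util.zip.Inflater;

public class ZlibDecompressSketch {
    public static byte[] decompress(byte[] compressed, int uncompressedLength) throws Exception {
        Inflater inflater = new Inflater(); // default constructor = zlib format (RFC 1950)
        inflater.setInput(compressed);
        byte[] result = new byte[uncompressedLength];
        int offset = 0;
        while (!inflater.finished() && offset < result.length) {
            int n = inflater.inflate(result, offset, result.length - offset);
            if (n == 0) {
                break; // truncated input or wrong format
            }
            offset += n;
        }
        inflater.end();
        return result;
    }
}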
Disclaimer: This is the state for Java SE, I suppose it is similar for Android, but I can't guarantee this.
The jzlib library you found (I suppose), which is a Java reimplementation of zlib, also implements all these data formats (gzip was added in the latest update). For interactive use (on the compressing side) it is preferable, since it allows some flushing actions which are not possible with java.util's classes (other than using some workaround like changing the compression level), and it also might be faster since it avoids native calls (which always have some overhead).
PS: The zip (or pkzip) file format is also related: It uses deflate internally for each file inside the archive.
I have to write code to upload/download a file to/from a remote machine. But when I upload the file, newlines are not preserved and some binary characters are inserted automatically. Also, I'm not able to save the file in its actual format; I have to save it as "filename.ser". I'm using Java's serialization/deserialization.
Thanks in advance.
How exactly are you transmitting the files? If you're using implementations of InputStream and OutputStream, they work on a byte-by-byte level so you should end up with a binary-equal output.
If you're using implementations of Reader and Writer, they convert the bytes to characters according to some character mapping, and then perform the reverse process when saving. Depending on the platform encodings of the various machines (and possibly other effects if you're not specifying the charset explicitly), you could well end up with differences in the binary file.
The fact that you mention newlines makes me think that you're using Readers to send strings (and possibly that you're stitching the strings back together yourself by manually adding newlines). If you want the files to be binary equal, then send them as a stream of bytes and store that stream verbatim. If you want them to be equal as strings in a given character set, then use Readers and Writers but specify the character set explicitly. If you want them to be transmitted as strings in the platform default set (not very useful), then accept that they're not going to be binary equal as files.
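For illustration, here is a minimal byte-for-byte copy between streams, which keeps the stored file binary-equal to the uploaded one (a sketch, not your actual transfer code):

import java.io.InputStream;
import java.io.OutputStream;

public class StreamCopy {
    public static void copy(InputStream in, OutputStream out) throws Exception {
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n); // bytes pass through untouched, no charset conversion
        }
        out.flush();
    }
}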
(Also, your question really doesn't provide much information to solve it. To me, it basically reads "I wrote some code to do X, and it doesn't work. Where did I go wrong?" You seem to assume that your code is correct by not listing it, but at the same time recognise that it's not...)