I'm currently using the Apache Commons Compress package to decompress BZip2 files. It uses about 60% of the overall heap and takes around 6 minutes to decompress about 500 files of 4-5 MB each.
My main problem is that I can't find anything to compare this performance against. I found AT4J, but implementing it as per the documentation leads to an ArrayIndexOutOfBoundsException while reading one of the files into the buffer. For the few files it did manage to process, the performance was pretty similar, and the fact that AT4J includes the compressor classes from Commons Compress to give 'an extra option' suggests this is expected.
Does anyone know of any other Java libraries for decompressing BZip2 files and, if so, how they compare to Apache's?
Thanks in advance.
This benchmark of different compression techniques suggests they got about 6 MB/s when decompressing BZip2:
https://tukaani.org/lzma/benchmarks.html
This suggests that your 2.2 GB of data should take about 6 minutes even with a native library.
If you want to speed this up, I suggest using multiple threads, or switching to gzip, which is much faster.
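If you try the multi-threaded route, a starting point might look like the following sketch. The directory names are placeholders, and it assumes each file is a plain .bz2 stream read with Commons Compress's BZip2CompressorInputStream (not a .tar.bz2 archive):

    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

    import java.io.*;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ParallelBzip2Decompress {
        public static void main(String[] args) throws Exception {
            File inputDir = new File("input");     // placeholder
            File outputDir = new File("output");   // placeholder
            outputDir.mkdirs();

            File[] files = inputDir.listFiles((dir, name) -> name.endsWith(".bz2"));

            // One worker per core; BZip2 decompression is CPU-bound.
            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());
            for (File file : files) {
                pool.submit(() -> {
                    File out = new File(outputDir,
                            file.getName().replaceFirst("\\.bz2$", ""));
                    try (InputStream in = new BZip2CompressorInputStream(
                                 new BufferedInputStream(new FileInputStream(file)));
                         OutputStream os = new BufferedOutputStream(new FileOutputStream(out))) {
                        byte[] buf = new byte[8192];
                        int n;
                        while ((n = in.read(buf)) != -1) {
                            os.write(buf, 0, n);
                        }
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }

The heavy lifting is still single-threaded per file, so the gain comes from keeping all cores busy across the 500 files rather than from making any one file faster.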
I have lots of questions about Java NIO. I have read many articles that dig deeper into the topic, but I still don't really understand in which respects NIO is quicker than classic IO.
Also, I have observed that downloading a 100 MB file with Java NIO code is at least 10 times faster than downloading it with Java IO code.
With that in mind, my question is:
Suppose I am downloading a file that is 1 KB. Will the NIO code still be ten times faster in that case?
Generally speaking, NIO is faster than classic Java IO because it reduces the amount of in-memory copying. However, a ten-fold improvement in speed is implausible, even for large files. And when we are talking about downloading files (rather than reading / writing them to disk), the performance is likely to be dominated by the bandwidth and end-to-end latency to the machine you are loading from.
Finally, you are likely to find that the relative speedup of NIO for small files will be even less ... because of the overheads of establishing network connections, sending requests, processing headers and so on.
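To illustrate where the copying difference comes from, here is a rough sketch of both approaches. The URL and target file are placeholders and this is not a real benchmark; for an actual download, the network usually dominates either version:

    import java.io.*;
    import java.net.URL;
    import java.nio.channels.Channels;
    import java.nio.channels.ReadableByteChannel;

    public class DownloadComparison {

        // Classic IO: copy through an intermediate byte[] buffer in the Java heap.
        static void downloadWithIo(URL url, File target) throws IOException {
            try (InputStream in = new BufferedInputStream(url.openStream());
                 OutputStream out = new BufferedOutputStream(new FileOutputStream(target))) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            }
        }

        // NIO: let the FileChannel pull from the source channel, which can avoid
        // some copying between buffers (the gain is usually modest, not 10x).
        static void downloadWithNio(URL url, File target) throws IOException {
            try (ReadableByteChannel src = Channels.newChannel(url.openStream());
                 FileOutputStream fos = new FileOutputStream(target)) {
                fos.getChannel().transferFrom(src, 0, Long.MAX_VALUE);
            }
        }
    }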
Note: I have seen similar questions, but they all refer to large files. This is for small amounts of data read and written constantly, and many files will be written to and read from at once, so performance will be an issue.
Currently I'm using a RandomAccessFile for an "account"; it's fast with basic I/O:
raf.write();
I have seen random access files used with file channels and buffered I/O. Which is the fastest (again, for small data), and could you please supply an example as evidence?
If you want correctness across multiple read/write processes, you are going to sacrifice performance either to non-buffered APIs like RandomAccessFile, or else to inter-process locking.
You can't validly compare to what you could achieve within a single process without contention.
You could investigate MappedByteBuffer, but be aware it brings its own problems in its wake.
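If you do look at MappedByteBuffer, something like this is the general shape; the file name, record size, and layout here are made up purely for illustration, and it does nothing about locking or visibility across processes:

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class AccountFile {
        private static final int RECORD_SIZE = 32;   // hypothetical fixed record size
        private final MappedByteBuffer buffer;

        public AccountFile(String path, int maxRecords) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile(path, "rw");
                 FileChannel ch = raf.getChannel()) {
                // The mapping stays valid after the channel is closed.
                buffer = ch.map(FileChannel.MapMode.READ_WRITE, 0,
                                (long) RECORD_SIZE * maxRecords);
            }
        }

        // Write a balance into a given account slot.
        public void writeBalance(int slot, long balance) {
            buffer.putLong(slot * RECORD_SIZE, balance);
        }

        public long readBalance(int slot) {
            return buffer.getLong(slot * RECORD_SIZE);
        }
    }

Among the problems it brings in its wake: you can't explicitly unmap the buffer, and two processes mapping the same file still need their own coordination.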
I personally would look into using a database. That's what they're for.
In my application I need to write a lot of data into a DB. To speed it up I was thinking about writing to sequential binary files in real time and then doing bulk inserts into the DB. There are various logging libraries that can be configured to create a new file every x seconds or MB, but they slow the system down significantly under heavy load, and they work with string messages.
Are there any performant libraries for binary files?
If I were you, I'd look into the possibility of using JDBC batch inserts. The relevant methods are PreparedStatement.addBatch() and Statement.executeBatch().
Here is a tutorial that discusses them: http://viralpatel.net/blogs/2012/03/batch-insert-in-java-jdbc.html
In my experience (with PostgreSQL), they are a lot faster than single inserts. It could well be the case that they'll be fast enough for your purposes.
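Roughly, the pattern looks like this sketch; the table name, columns, connection URL, and the batch size of 1000 are placeholders, not recommendations:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.List;

    public class BatchInsertExample {
        public static void insertAll(List<long[]> rows) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/mydb", "user", "password")) {
                // Committing once per batch instead of once per row is a big
                // part of the speedup.
                conn.setAutoCommit(false);
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO measurements (sensor_id, value) VALUES (?, ?)")) {
                    int count = 0;
                    for (long[] row : rows) {
                        ps.setLong(1, row[0]);
                        ps.setLong(2, row[1]);
                        ps.addBatch();
                        if (++count % 1000 == 0) {   // flush every 1000 rows
                            ps.executeBatch();
                            conn.commit();
                        }
                    }
                    ps.executeBatch();               // flush the remainder
                    conn.commit();
                }
            }
        }
    }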
Kafka is designed to be able to act as a unified platform for handling all the real-time data feeds.
https://kafka.apache.org/08/design.html
So for some research work, I need to analyze a ton of raw movement data (currently almost a gig of data, and growing) and spit out quantitative information and plots.
I wrote most of it using Groovy (with JFreeChart for charting) and when performance became an issue, I rewrote the core parts in Java.
The problem is that analysis and plotting takes about a minute, whereas loading all of the data takes about 5-10 minutes. As you can imagine, this gets really annoying when I want to make small changes to plots and see the output.
I have a couple of ideas for fixing this:

1. Load all of the data into a SQLite database.
Pros: It'll be fast, and I'll be able to run SQL to get aggregate data if I need to.
Cons: I have to write all that code. Also, for some of the plots I need access to each point of data, so with a couple hundred thousand files to load, some parts may still be slow.

2. Use Java RMI to return the object. All the data gets loaded into one root object, which, when serialized, is about 200 MB. I'm not sure how long it would take to transfer a 200 MB object over RMI (same client). I'd have to run the server and load all the data, but that's not a big deal.
Major pro: this should take the least amount of time to write.

3. Run a server that loads the data and executes a Groovy script on command within the server VM. Overall, this seems like the best idea (for implementation time vs. performance, as well as other long-term benefits).
What I'd like to know is: have other people tackled this kind of problem, and how?
Post-analysis (3/29/2011): A couple months after writing this question, I ended up having to learn R to run some statistics. Using R was far, far easier and faster for data analysis and aggregation than what I was doing.
Eventually, I ended up using Java to run preliminary aggregation, and then ran everything else in R. It was also much easier to make beautiful charts in R than with JFreeChart.
Databases are very scalable, if you are going to have massive amounts of data. In MS SQL we currently group/sum/filter about 30GB of data in 4 minutes (somewhere around 17 million records I think).
If the data is not going to grow very much, then I'd try out approach #2. You can make a simple test application that creates a 200-400 MB object with random data and test the performance of transferring it before deciding whether you want to go that route.
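For example, the test could be as simple as this sketch: build a large object graph of random data and time the serialization step. The record type and counts are made up, and real RMI adds network transfer on top of this, so treat it as a lower bound:

    import java.io.ByteArrayOutputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import java.util.ArrayList;
    import java.util.Random;

    public class SerializationSizeTest {
        static class DataPoint implements Serializable {
            double x, y, t;
            DataPoint(double x, double y, double t) { this.x = x; this.y = y; this.t = t; }
        }

        public static void main(String[] args) throws Exception {
            // Adjust the count to land in the 200-400 MB serialized range,
            // and run with a large heap (e.g. -Xmx2g).
            Random rnd = new Random();
            ArrayList<DataPoint> data = new ArrayList<>();
            for (int i = 0; i < 5_000_000; i++) {
                data.add(new DataPoint(rnd.nextDouble(), rnd.nextDouble(), rnd.nextDouble()));
            }

            long start = System.nanoTime();
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(data);
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("Serialized %d points to %d MB in %d ms%n",
                    data.size(), bos.size() / (1024 * 1024), elapsedMs);
        }
    }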
Before you make a decision, it's probably worth understanding what is going on with your JVM as well as your physical system resources.
There are several factors that could be at play here:
jvm heap size
garbage collection algorithms
how much physical memory you have
how you load the data - is it from a file that is fragmented all over the disk?
do you even need to load all of the data at once - can it be done in batches
if you are doing it in batches you can vary the batch size and see what happens
if your system has multiple cores perhaps you could look at using more than one thread at a time to process/load data (see the sketch at the end of this answer)
if using multiple cores already and disk I/O is the bottleneck, perhaps you could try loading from different disks at the same time
You should also look at http://java.sun.com/javase/technologies/hotspot/vmoptions.jsp if you aren't familiar with the settings for the VM.
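To make the batching plus threads idea above concrete, here is an illustrative sketch only; the directory name is a placeholder, and process() stands in for whatever parsing/aggregation your loading step actually does:

    import java.nio.file.*;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    public class BatchedLoader {
        public static void main(String[] args) throws Exception {
            int batchSize = 500;                                     // vary this and measure
            List<Path> files;
            try (Stream<Path> s = Files.list(Paths.get("data"))) {   // placeholder directory
                files = s.collect(Collectors.toList());
            }

            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());
            for (int i = 0; i < files.size(); i += batchSize) {
                List<Path> batch = files.subList(i, Math.min(i + batchSize, files.size()));
                pool.submit(() -> process(batch));
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }

        static void process(List<Path> batch) {
            // placeholder for the real parsing/aggregation work
        }
    }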
If your data has relational properties, there is nothing more natural than storing it in an SQL database. There you can solve your biggest problem, performance, at the cost of "just" writing the appropriate SQL code.
Seems very plain to me.
I'd look into analysis using R. It's a statistical language with graphing capabilities. It could put you ahead, especially if that's the kind of analysis you intend to do. Why write all that code?
I would recommend running a profiler to see what part of the loading process is taking the most time and if there's a possible quick win optimization. You can download an evaluation license of JProfiler or YourKit.
Ah, yes: large data structures in Java. Good luck with that, surviving "death by garbage collection" and all. What java seems to do best is wrapping a UI around some other processing engine, although it does free developers from most memory management tasks -- for a price. If it were me, I would most likely do the heavy crunching in Perl (having had to recode several chunks of a batch system in perl instead of java in a past job for performance reasons), then spit the results back to your existing graphing code.
However, given your suggested choices, you probably want to go with the SQL DB route. Just make sure that it really is faster for a few sample queries, watch the query-plan data and all that (assuming your system will log or interactively show such details)
Edit (to Jim Ferrans), re: Java big-N faster than Perl (comment below): the benchmarks you referenced are primarily little "arithmetic" loops, rather than something that does a few hundred MB of IO and stores it in a Map / %hash / Dictionary / associative array for later revisiting. Java I/O might have gotten better, but I suspect all the abstractness still makes it comparatively slow, and I know the GC is a killer. I haven't checked this lately; I don't process multi-GB data files on a daily basis at my current job like I used to.
Feeding the trolls (12/21): I measured Perl to be faster than Java for doing a bunch of sequential string processing. In fact, depending on which machine I used, Perl was between 3 and 25 times faster than Java for this kind of work (batch + string). Of course, the particular thrash-test I put together did not involve any numeric work, which I suspect Java would have done a bit better, nor did it involve caching a lot of data in a Map/hash, which I suspect Perl would have done a bit better. Note that Java did much better at using large numbers of threads, though.