Speed difference when opening `*.zip` files: `java.util.zip.*` vs. `org.apache.tools.zip.*` - java

The context
I needed to write a quick tool to directly extract ASCII data files from a collection of *.zip files into Matlab memory. The files are large, so I wanted to leave the storage alone.
Using the Matlab Java integration, this was relatively straightforward. H/T to this excellent answer; option 7 is easy to implement in Matlab using Java objects.
The strange timing result
The oddness appeared when I was trying to optimize for speed, which led to the following timing script.
zipPath = "\\full\path\to\zipfile.zip";
zipJavaFile = java.io.File(zipPath);
tic
zipFile_apache = org.apache.tools.zip.ZipFile(zipJavaFile);
toc %Typical time value: 2 sec. Typical range: 2 -- 8 seconds.
tic
zipFile_util = java.util.zip.ZipFile(zipJavaFile);
toc %Typical time value: 0.007 sec. Typical range: 0.002 -- 0.015 seconds.
The timing difference between the java.util.zip and org.apache.tools.zip libraries is very large: milliseconds vs. seconds.
For the current task, changing that one line of code improved the total data read time, spread across multiple files, from 200 s to 5 s.
The general trend persists if I rerun the timing script, in an attempt to prime any OS caching. It also persists if I change the order of the test cases.
Both toolsets lead to correct data extraction, with no noticeable difference in speed after the lines shown.
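For context, the extraction step itself (which showed no speed difference between the two libraries once the archive was open) can be sketched with java.util.zip roughly like this; the class name and entry handling are illustrative, not the exact tool described above:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZipTextReader {
    // Read one ASCII entry out of a zip archive without extracting it to disk.
    public static String readEntry(String zipPath, String entryName) throws IOException {
        try (ZipFile zip = new ZipFile(zipPath)) {
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) throw new IOException("no such entry: " + entryName);
            StringBuilder sb = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(zip.getInputStream(entry), StandardCharsets.US_ASCII))) {
                String line;
                while ((line = in.readLine()) != null) sb.append(line).append('\n');
            }
            return sb.toString();
        }
    }
}
```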
The question
I would have expected similar timing performance from the two Java zip toolsets, and I cannot find any other reference to a time difference this large.
I don't understand where the time difference comes from. In a slightly different task (e.g. one without two parallel libraries to compare), I might have been stuck with the slower performance because of this ignorance.
Why is there such a speed difference between these libraries, in this context?

Related

Find the most executed section of java source code

I am using NetBeans IDE 7.4. I want to find the lines of code where most of the running time is spent. I have heard a little about profilers, which can be used for thread monitoring and so on.
But I don't know (exactly!) how to find the section(s) of code that are most frequently executed in my program. I want to learn the mechanisms and facilities provided by the JVM for that, not just third-party packages (profilers, etc.).
You can profile the CPU with VisualVM, and you'll find which methods are CPU-consuming. You have to set up a filter (regex) to focus on your classes.
Suppose line L of your code, if removed, would cut the overall wall-clock time in its thread by 50%. Then if you dump the stacks of all running threads at a random time, locate your thread, and disregard all the stack levels that are not your code, there is a 50% chance that you will see line L among the remaining lines.
So if you do this 10 times, you will see line L about 5 times, give or take.
In fact, if any line of your code appears on more than one stack sample, and you can remove or bypass it, it is guaranteed to save you a healthy fraction of time.
What's more, this method (while crude) will find any speedup that profilers can find, and more that they can't.
The math behind it can be found here.
Example: A worker thread is spending 80% of its time doing I/O and allocating memory in the process of parsing XML to build a data structure. You can see that the XML comes from a data structure in a different piece of code in the same thread. It's big code - you wouldn't have known this without the samples pointing it out. You only have to take two or three samples before you see it twice. Just bypass the XML - 5x speedup.
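This manual-sampling approach needs nothing beyond the JDK itself. A rough sketch (the package filter, sample count, and sampling interval are illustrative choices, not fixed values):

```java
import java.util.HashMap;
import java.util.Map;

public class PoorMansSampler {
    // Take `samples` stack snapshots, `intervalMs` apart, and count how often
    // each topmost frame of our own code appears on some thread's stack.
    public static Map<String, Integer> sample(int samples, long intervalMs, String packagePrefix)
            throws InterruptedException {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            for (StackTraceElement[] stack : Thread.getAllStackTraces().values()) {
                for (StackTraceElement frame : stack) {
                    if (frame.getClassName().startsWith(packagePrefix)) {
                        String key = frame.getClassName() + "." + frame.getMethodName()
                                + ":" + frame.getLineNumber();
                        counts.merge(key, 1, Integer::sum);
                        break; // only the topmost own-code frame per thread
                    }
                }
            }
            Thread.sleep(intervalMs);
        }
        return counts;
    }

    public static void main(String[] args) throws InterruptedException {
        // A deliberately CPU-bound worker so the sampler has something to see.
        Thread worker = new Thread(PoorMansSampler::busyWork);
        worker.setDaemon(true);
        worker.start();
        Map<String, Integer> hot = sample(10, 50, "PoorMansSampler");
        hot.forEach((frame, n) -> System.out.println(n + "x " + frame));
    }

    private static void busyWork() {
        double x = 0;
        while (true) { x += Math.sin(x); }
    }
}
```

Lines that show up repeatedly in the printed counts are the ones worth looking at first.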

looking for an alternative to python for processing speed: 2D quantum particle

I have a program written in Python that accurately shows the time evolution of a quantum particle in both 1 and 2 dimensional wells. I am too lazy to post the entire thing online, but I will be happy to email the source to anyone willing to take a look.
My question is this: Is there a faster way? This thing should look like it's going crazy in its box, not calmly gliding around. When you run the program, choose "yes" on the realtime option to get a diagnostic of the performance. It runs at about 3 dt steps (on the order of 10^-6 to 10^-18 seconds) per actual real second. Needless to say, by the time this program shows me what has happened to the particle after 1 second of real time, I will be old and grey. Any suggestions?
If you are lucky, you might get a factor of 10 to 100 speed-up by changing language implementations or languages. But it sounds like you want many orders of magnitude faster performance. For that you would need:
a fundamental change in the algorithms you are using, and / or
using a computation platform with lots of hardware parallelism.
This kind of computational problem doesn't have simple solutions.

Microbenchmarking with Big Data

I'm currently working on my thesis project, designing a cache implementation to be used with a shortest-path graph algorithm. The graph algorithm has rather inconsistent runtimes, so it's too troublesome to benchmark the entire algorithm; I must concentrate on benchmarking the cache only.
The caches I need to benchmark are about a dozen or so implementations of a Map interface. These caches are designed to work well with a given access pattern (the order in which keys are queried from the algorithm above). However, in a given run of a "small" problem, there are a few hundred billion queries. I need to run almost all of them to be confident about the results of the benchmark.
I'm having conceptual problems about loading the data into memory. It's possible to create a query log, which would just be an on-disk ordered list of all the keys (they're 10-character string identifiers) that were queried in one run of the algorithm. This file is huge. The other thought I had would be to break the log up into chunks of 1-5 million queries, and benchmark in the following manner:
Load 1-5 million keys
Set start time to current time
Query them in order
Record elapsed time (current time - start time)
I'm unsure of what effects this will have with caching. How could I perform a warm-up period? Loading the file will likely clear any data that was in L1 or L2 caches for the last chunk. Also, what effect does maintaining a 1-5 million element string array have (and does even iterating it skew results)?
Keep in mind that the access pattern is important! For example, there are some hash tables with move-to-front heuristics, which re-orders the internal structure of the table. It would be incorrect to run a single chunk multiple times, or run chunks out of order. This makes warming CPU caches and HotSpot a bit more difficult (I could also keep a secondary dummy cache that's used for warming but not timing).
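The chunked scheme described above might be sketched as follows; this assumes, per the question, that the caches implement a Map interface keyed by 10-character strings and that the query log is one key per line (the chunk size and the String-valued cache are arbitrary illustration choices):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ChunkedCacheBenchmark {
    static final int CHUNK_SIZE = 1_000_000;

    // One pass over the query log in chunks: load keys into memory first,
    // then time only the queries, so disk I/O is excluded from the measurement.
    public static long run(Path queryLog, Map<String, String> cache) throws IOException {
        long totalNanos = 0;
        try (BufferedReader in = Files.newBufferedReader(queryLog)) {
            List<String> chunk = new ArrayList<>(CHUNK_SIZE);
            String key;
            while ((key = in.readLine()) != null) {
                chunk.add(key);
                if (chunk.size() == CHUNK_SIZE) {
                    totalNanos += timeChunk(chunk, cache);
                    chunk.clear();
                }
            }
            if (!chunk.isEmpty()) totalNanos += timeChunk(chunk, cache);
        }
        return totalNanos;
    }

    private static long timeChunk(List<String> keys, Map<String, String> cache) {
        long start = System.nanoTime();
        for (String k : keys) {
            cache.get(k); // preserves access order; move-to-front caches mutate here
        }
        return System.nanoTime() - start;
    }
}
```

A warm-up pass could reuse timeChunk against a secondary dummy cache before switching to the real instance, which preserves the single-run access-order constraint.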
What are good practices for microbenchmarks with a giant dataset?
If I understand the problem correctly, how about loading the query log on one machine (possibly in chunks, if you don't have enough memory) and streaming it to the machine running the benchmark across a dedicated network (a crossover cable, probably), so that you have minimal interference between the system under test and the test code/data?
Whatever solution you use, you should try multiple runs so you can assess repeatability - if you don't get reasonable repeatability then you can at least detect that your solution is unsuitable!
Updated: re: batching and timing - in practice, you'll probably end up with some form of fine-grained batching, at least, to get the data over the network efficiently. If your data falls into natural large 'groups' or stages then I would time those individually to check for anomalies, but would rely most strongly on an overall timing. I don't see much benefit from timing small batches of thousands (given that you are running through many millions).
Even if you run everything on one machine with a lot of RAM, it might be worth loading the data in one JVM and the code under test on another, so that garbage collection on the cache JVM is not (directly) affected by the large heap required to hold the query log.

Artificial generation of cpu load in Java

Is there a simple way to generate a constant CPU load in Java? Like generate CPU load at 60%.
Kind of late to the party, but just wanted to share that I created a small open-source client library called FakeLoad which can be used to generate different kinds of system load like CPU, memory, and disk I/O on the fly.
For instance, generating a CPU load of 60% for 30 seconds with FakeLoad can be done like this:
// Creation
FakeLoad fakeload = FakeLoads.create()
.lasting(30, TimeUnit.SECONDS)
.withCpu(60);
// Execution
FakeLoadExecutor executor = FakeLoadExecutors.newDefaultExecutor();
executor.execute(fakeload);
It doesn't provide perfect precision, but it is able to generate quite constant and accurate loads. It is available on Maven Central, so feel free to give it a try :)
I have not tested this, but it might roughly work: make your application work and sleep in the correct ratio. Something like (pseudocode):
load = 60
do forever
    time = current_system_time_ms() + load
    while (current_system_time_ms() < time)
        // just consume some time, for 60 ms
    end
    SLEEP(100 - load)  // sleep for 40 ms
end
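A minimal runnable Java version of that busy/sleep scheme might look like the following. Note that it is single-threaded, so on a multi-core machine it loads only one core; you would spawn one such loop per core for whole-machine load:

```java
public class CpuLoadGenerator {
    // Busy-spin for `loadPercent` ms out of every 100 ms window, then sleep
    // for the remainder. Loads a single core at roughly the requested level.
    public static void generate(int loadPercent, long durationMs) throws InterruptedException {
        long end = System.currentTimeMillis() + durationMs;
        while (System.currentTimeMillis() < end) {
            long busyUntil = System.currentTimeMillis() + loadPercent;
            while (System.currentTimeMillis() < busyUntil) {
                // burn CPU until the busy part of the window is over
            }
            Thread.sleep(100 - loadPercent);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        generate(60, 5_000); // roughly 60% load on one core for 5 seconds
    }
}
```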
Ok, you asked for a simple way, but ... :)
I think that's not even possible, because:
the way the JVM interprets code (provided it interprets code at all) is implementation-dependent
the compiler and the JVM may optimize code (in an implementation-dependent manner, of course), so you may end up running different code than the bytecodes in a given .class file.

On the fly calculation of Fourier transformation in Java

I want to write a program in Java that uses fast Fourier transformation.
The program reads data from sensors every 5 milliseconds and is supposed to do something with the data every 200 milliseconds, based on the data from the last five seconds.
Is there a good library in Java that provides a way to do Fourier transformation without recalculating all five seconds every time?
Hard real-time problems are not the proper application of Java. There are too many variables, such as garbage collection and threads not being guaranteed to run within a given interval, to make this possible. If "close enough" is acceptable, it will work. The timing performance of your software will also depend on the OS and hardware you are using, and on what other programs are running on the same box.
There is a Real-Time Java that does have a special API for the issues I mention above; you do not indicate that you are using it. It is also a different animal in a lot of respects from plain Java.
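If "close enough" is acceptable, the buffering half of the original question is straightforward: keep the last five seconds of samples (1000 samples at one per 5 ms) in a ring buffer and recompute the transform every 200 ms; a full FFT of 1000 points at that rate is cheap, so an incremental "sliding" DFT is rarely needed at this size. A sketch of the ring buffer (the FFT call itself is left to a library such as JTransforms):

```java
import java.util.Arrays;

public class SlidingWindow {
    // Ring buffer holding the last `size` samples.
    private final double[] buf;
    private int next = 0;
    private boolean full = false;

    public SlidingWindow(int size) { buf = new double[size]; }

    public void add(double sample) {
        buf[next] = sample;
        next = (next + 1) % buf.length;
        if (next == 0) full = true;
    }

    // Return the window in chronological order, ready to hand to an FFT
    // routine (e.g. JTransforms' DoubleFFT_1D.realForward).
    public double[] snapshot() {
        if (!full) return Arrays.copyOf(buf, next);
        double[] out = new double[buf.length];
        System.arraycopy(buf, next, out, 0, buf.length - next);
        System.arraycopy(buf, 0, out, buf.length - next, next);
        return out;
    }
}
```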
