I'm writing an application that needs to quickly deserialize millions of messages from a single file.
The application essentially reads one message from the file, does some work, and then throws the message away. Each message is composed of ~100 fields; not all of them are always parsed, but I need them all because the user of the application can decide which fields to work on.
At the moment the application consists of a loop in which each iteration simply executes a readDelimitedFrom() call.
Is there a way to optimize for this access pattern (splitting into multiple files, etc.)? In addition, because of the number of messages and the size of each message, I currently need to gzip the file (and it is fairly effective at reducing the size, since the field values are quite repetitive) - although this reduces performance.
If CPU time is your bottleneck (which is unlikely if you are loading directly from HDD with cold cache, but could be the case in other scenarios), then here are some ways you can improve throughput:
If possible, use C++ rather than Java, and reuse the same message object for each iteration of the loop. This reduces the amount of time spent on memory management, as the same memory will be reused each time.
Instead of using readDelimitedFrom(), construct a single CodedInputStream and use it to read multiple messages like so:
// Do this once:
CodedInputStream cis = CodedInputStream.newInstance(input);
// Then read each message like so:
int limit = cis.pushLimit(cis.readRawVarint32());
builder.mergeFrom(cis);
cis.popLimit(limit);
cis.resetSizeCounter();
(A similar approach works in C++.)
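Putting that together, a minimal Java sketch of the whole read loop might look like the following (MyMessage, messages.bin, and FastReader are hypothetical names; the builder is reused and cleared on each iteration to cut down on allocation):
import com.google.protobuf.CodedInputStream;
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

public class FastReader {
    public static void main(String[] args) throws Exception {
        try (InputStream input = new BufferedInputStream(new FileInputStream("messages.bin"))) {
            CodedInputStream cis = CodedInputStream.newInstance(input);
            MyMessage.Builder builder = MyMessage.newBuilder();  // hypothetical generated message type
            while (!cis.isAtEnd()) {
                int limit = cis.pushLimit(cis.readRawVarint32());
                builder.mergeFrom(cis);
                cis.popLimit(limit);
                cis.resetSizeCounter();  // reset the cumulative byte counter so the stream's size limit isn't hit
                // ... do your work with builder here ...
                builder.clear();         // reuse the same builder for the next message
            }
        }
    }
}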
Use Snappy or LZ4 compression rather than gzip. These algorithms still get reasonable compression ratios but are optimized for speed. (LZ4 is probably better, though Snappy was developed by Google with Protobufs in mind, so you might want to test both on your data set.)
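As a rough sketch of what swapping the compression layer could look like in Java (assuming the xerial snappy-java library; the file name and helper are hypothetical), you simply wrap the file stream before handing it to the parser:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.xerial.snappy.SnappyInputStream;

public class CompressedInput {
    // Hypothetical helper: returns a decompressing stream that can be passed to
    // CodedInputStream.newInstance(...) exactly like a plain FileInputStream would be.
    static InputStream open(String path) throws IOException {
        return new SnappyInputStream(new BufferedInputStream(new FileInputStream(path)));
    }
}
If LZ4 turns out to be faster on your data, the lz4-java equivalent (net.jpountz.lz4.LZ4BlockInputStream) can be dropped in the same way.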
Consider using Cap'n Proto rather than Protocol Buffers. (EDIT: Originally there was no Java version, but there is now capnproto-java, as well as implementations in many other languages.) In the languages it supports, it has been shown to be quite a bit faster. (Disclosure: I am the author of Cap'n Proto. I am also the author of Protocol Buffers v2, which is the version Google released open source.)
I expect that the majority of your CPU time is spent in garbage collection. I would look to replace the default garbage collector with one better suited to your use case of short-lived objects.
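For example (a hedged sketch only - the right collector and sizes depend on your JVM version and should be confirmed by measurement), a throughput-oriented collector with a large young generation lets short-lived message objects die cheaply in minor collections; my-app.jar and the heap sizes below are placeholders:
java -XX:+UseParallelGC -Xms4g -Xmx4g -Xmn2g -verbose:gc -jar my-app.jar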
If you do decide to write this in C++ - use an Arena to create the first message before parsing: https://developers.google.com/protocol-buffers/docs/reference/arenas
Regarding the dataflow model of computation, I'm doing a PoC to test a few concepts using Apache Beam with the direct-runner (and the Java SDK). I'm having trouble creating a pipeline which reads a "big" CSV file (about 1.25 GB) and dumps it into an output file without any particular transformation, like in the following code (I'm mainly concerned with testing IO bottlenecks using this dataflow/beam model because that's of primary importance for me):
// Example 1 reading and writing to a file
Pipeline pipeline = Pipeline.create();
PCollection<String> output = pipeline
.apply(TextIO.read().from("BIG_CSV_FILE"));
output.apply(
TextIO
.write()
.to("BIG_OUTPUT")
.withSuffix("csv").withNumShards(1));
pipeline.run();
The problem I'm having is that only smaller files work; when the big file is used, no output file is generated (but also no error/exception is shown, which makes debugging harder).
I'm aware that on the runners page of the apache-beam project (https://beam.apache.org/documentation/runners/direct/), it is explicitly stated under the memory considerations point:
Local execution is limited by the memory available in your local environment. It is highly recommended that you run your pipeline with
data sets small enough to fit in local memory. You can create a small
in-memory data set using a Create transform, or you can use a Read
transform to work with small local or remote files.
The above suggests I'm having a memory problem (but sadly that isn't stated explicitly on the console, so I'm just left wondering here). I'm also concerned by their suggestion that the dataset should fit into memory (why isn't it reading from the file in parts instead of fitting the whole file/dataset into memory?).
A second consideration I'd like to add to this conversation (in case this is indeed a memory problem): how basic is the implementation of the direct runner? I mean, it isn't hard to implement a piece of code that reads from a big file in chunks and also outputs to a new file (also in chunks), so that at no point in time does memory usage become a problem (because neither file is completely loaded into memory - only the current "chunk"). Even if the "direct-runner" is more of a prototyping runner to test semantics, would it be too much to expect it to deal nicely with huge files? - considering that this is a unified model built from the ground up to deal with streaming, where window size is arbitrary and huge data accumulation/aggregation before sinking it is a standard use-case.
So, more than a question, I'd deeply appreciate your feedback/comments regarding any of these points: have you noticed IO constraints using the direct-runner? Am I overlooking some aspect, or is the direct-runner really so naively implemented? Have you verified that by using a proper production runner like Flink/Spark/Google Cloud Dataflow, this constraint disappears?
I'll eventually test with other runners like the Flink or the Spark one, but it feels underwhelming that the direct-runner (even if it is intended only for prototyping purposes) is having trouble with this first test I'm running - considering the whole dataflow idea is based around ingesting, processing, grouping and distributing huge amounts of data under the umbrella of a unified batch/streaming model.
EDIT (to reflect Kenn's feedback):
Kenn, thanks for those valuable points and feedback; they have been of great help in pointing me towards relevant documentation. Following your suggestion, I found out by profiling the application that the problem is indeed a Java heap related one (which somehow is never shown on the normal console - and only seen in the profiler). Even though the file is "only" 1.25 GB in size, internal usage goes beyond 4 GB before the heap is dumped, suggesting the direct-runner isn't "working by chunks" but is indeed loading everything into memory (as their docs say).
Regarding your points:
1 - I believe that serialization and shuffling can very well still be achieved through a "chunk by chunk" implementation. Maybe I had a false expectation of what the direct-runner should be capable of, or I didn't fully grasp its intended reach; for now I'll refrain from doing non-functional tests while using the direct-runner.
2 - Regarding sharding: I believe that withNumShards controls the parallelism (and the number of output files) at the write stage (processing before that should still be fully parallel, and only at the time of writing will it use as many workers - and generate as many files - as explicitly provided). Two reasons to believe this: first, the CPU profiler always shows 8 busy "direct-runner-workers" - mirroring the number of logical cores that my PC has - regardless of whether I set 1 shard or N shards. The second reason is what I understand from the documentation here (https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/io/WriteFiles.html):
By default, every bundle in the input PCollection will be processed by
a FileBasedSink.WriteOperation, so the number of output will vary
based on runner behavior, though at least 1 output will always be
produced. The exact parallelism of the write stage can be controlled
using withNumShards(int), typically used to control how many files
are produced or to globally limit the number of workers connecting to
an external service. However, this option can often hurt performance:
it adds an additional GroupByKey to the pipeline.
One interesting thing here is that the "additional GroupByKey added to the pipeline" is kind of undesired in my use case (I only want the results in 1 file, without any regard for order or grouping),
so probably adding an extra "flatten files" step after the N sharded output files have been generated is a better approach (a sketch of that step follows the list below).
3 - your suggestion for profiling was spot on, thanks.
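For completeness, a minimal sketch of such a "flatten files" step outside of Beam, using only the standard java.nio.file API (the file names and the shard naming pattern are hypothetical):
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ConcatShards {
    public static void main(String[] args) throws IOException {
        // Collect the sharded output files written by TextIO (naming pattern assumed).
        List<Path> shards;
        try (Stream<Path> files = Files.list(Paths.get("."))) {
            shards = files
                    .filter(p -> p.getFileName().toString().startsWith("BIG_OUTPUT"))
                    .sorted()
                    .collect(Collectors.toList());
        }
        // Append each shard to a single merged file, streaming chunk by chunk.
        try (OutputStream out = Files.newOutputStream(Paths.get("BIG_OUTPUT_MERGED.csv"))) {
            for (Path shard : shards) {
                Files.copy(shard, out);  // streams the shard; never loads the whole file into memory
            }
        }
    }
}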
Final edit: the direct runner is not intended for performance testing, only prototyping and checking the well-formedness of the data. It doesn't have any mechanism for splitting and dividing work by partitions, and it handles everything in memory.
There are a few issues or possibilities. I will answer in priority order.
The direct runner is for testing with very small data. It is engineered for maximum quality assurance, with performance not much of a priority. For example:
it randomly shuffles data to make sure you are not depending on ordering that will not exist in production
it serializes and deserializes data after each step, to make sure the data will be transmitted correctly (production runners will avoid serialization as much as possible)
it checks whether you have mutated elements in forbidden ways, which would cause data loss in production
The data you are describing is not very big, and the DirectRunner can process it eventually in normal circumstances.
You have specified numShards(1) which explicitly eliminates all parallelism. It will cause all of the data to be combined and processed in a single thread, so it will be slower than it could be, even on the DirectRunner. In general, you will want to avoid artificially limiting parallelism.
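For illustration, a sketch of the same copy pipeline with runner-determined sharding (the file names are the placeholders from the question):
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.values.PCollection;

public class CopyCsv {
    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create();
        PCollection<String> lines = pipeline.apply(TextIO.read().from("BIG_CSV_FILE"));
        // No withNumShards(1): let the runner decide how many output files to write,
        // keeping the write stage parallel.
        lines.apply(TextIO.write().to("BIG_OUTPUT").withSuffix(".csv"));
        pipeline.run().waitUntilFinish();
    }
}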
If there is any out-of-memory error or other error preventing processing, you should see a log message. Otherwise, it will be helpful to look at profiling and CPU utilization to determine whether processing is active.
This question has been indirectly answered by Kenn Knowles above. The direct runner is not intended for performance testing, only prototyping and checking the well-formedness of the data. It doesn't have any mechanism for splitting and dividing work by partitions, and it handles every dataset in memory. Performance testing should be carried out using other runners (like the Flink runner) - those will provide data splitting and the type of infrastructure needed to deal with high IO bottlenecks.
UPDATE: adding to the point addressed by this question, there is a related question here: How to deal with (Apache Beam) high IO bottlenecks?
Whereas the question here revolves around figuring out whether the direct runner can deal with huge datasets (which we have already established is not possible), the link above points to a discussion of whether production runners (like Flink/Spark/Cloud Dataflow) can natively deal with huge datasets out of the box (the short answer is yes, but please check the link for a deeper discussion).
What options are there for processing large files quickly, multiple times?
I have a single file (min 1.5 GB, but can be upwards of 10-15 GB) that needs to be read multiple times - on the order of hundreds to thousands of times. The server has a large amount of RAM (64+ GB) and plenty of processors (24+).
The file will be sequential, read-only. Files are encrypted (sensitive data) on disk. I also use MessagePack to deserialize them into objects during the read process.
I cannot store the objects created from the file in memory - too large an expansion (a 1.5 GB file turns into a 35 GB in-memory object array). The file can't be stored as a single byte array (limited by Java's maximum array length of 2^31-1).
My initial thought is to use a memory mapped file, but that has its own set of limitations.
The idea is to get the file off the disk and into memory for processing.
The large volume of data is for a machine learning algorithm, that requires multiple reads. During the calculation of each file pass, there's a considerable amount of heap usage by the algorithm itself, which is unavoidable, hence the requirement to read it multiple times.
The problem you have here is that you cannot mmap() the whole file the way the system call of the same name can; the syscall can map up to 2^64 bytes, but FileChannel#map() cannot map more than 2^31-1 bytes, since the mapping is addressed through an int-indexed MappedByteBuffer.
However, what you can do is wrap a FileChannel into a class and create several "map ranges" covering the whole file.
I have done "nearly" such a thing, except more complicated: largetext. More complicated because I have to do the decoding process to boot, and the decoded text must be kept in memory, unlike in your case where you read bytes. Less complicated because I have a defined JDK interface to implement and you don't.
You can however use nearly the same technique using Guava and a RangeMap<Long, MappedByteBuffer>.
I implement CharSequence in this project above; I suggest that you implement a LargeByteMapping interface instead, from which you can read whatever parts you want; or, well, whatever suits you. Your main problem will be to define that interface. I suspect what CharSequence does is not what you want.
Meh, I may even have a go at it some day, largetext is quite exciting a project and this looks like the same kind of thing; except less complicated, ultimately!
One could even imagine a LargeByteMapping implementation where a factory would create such mappings with only a small part of the file in memory and the rest written to a file; and such an implementation would also use the principle of locality: the most recently queried part of the file would be kept in memory for faster access.
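A rough sketch of the "several map ranges" idea using Guava's RangeMap (the class name, chunk size, and file name are made up for illustration; the proposed LargeByteMapping interface is shown here as a simple class):
import com.google.common.collect.Range;
import com.google.common.collect.RangeMap;
import com.google.common.collect.TreeRangeMap;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Map;

public class LargeByteMapping {
    private static final long CHUNK = 1L << 30;  // 1 GiB per mapping (arbitrary choice, must stay below 2^31-1)
    private final RangeMap<Long, MappedByteBuffer> mappings = TreeRangeMap.create();

    public LargeByteMapping(FileChannel channel) throws IOException {
        long size = channel.size();
        for (long offset = 0; offset < size; offset += CHUNK) {
            long length = Math.min(CHUNK, size - offset);
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
            mappings.put(Range.closedOpen(offset, offset + length), buffer);
        }
    }

    public byte get(long position) {
        // Find the mapping covering this position and read relative to its base offset.
        Map.Entry<Range<Long>, MappedByteBuffer> entry = mappings.getEntry(position);
        long base = entry.getKey().lowerEndpoint();
        return entry.getValue().get((int) (position - base));
    }

    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get("huge.bin"), StandardOpenOption.READ)) {
            LargeByteMapping mapping = new LargeByteMapping(channel);
            System.out.println(mapping.get(0L));
        }
    }
}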
EDIT I feel some more explanation is needed here... A MappedByteBuffer will NOT EAT HEAP SPACE!!
It will eat address space only; it is nearly the equivalent of a ByteBuffer.allocateDirect(), except it is backed by a file.
And a very important distinction needs to be made here; all of the text above supposes that you are reading bytes, not characters!
Figure out how to structure the data. Get a good book about NoSQL and find the appropriate database (wide-column, graph, etc.) for your scenario. That's what I'd do. You'd not only have sophisticated query methods on your data, but you could also mangle the data using distributed map-reduce implementations doing whatever you want. Maybe that's what you want (you even dropped the big-data bomb).
How about creating "a dictionary" as the bridge between your program and the target file? Your program will call the dictionary, and the dictionary will refer you to the big fat file.
I need to parse (and transform and write) a large binary file (larger than memory) in Java. I also need to do so as efficiently as possible in a single thread. And, finally, the format being read is very structured, so it would be good to have some kind of parser library (so that the code is close to the complex specification).
The amount of lookahead needed for parsing should be small, if that matters.
So my questions are:
How important is NIO vs IO for a single-threaded, high-volume application?
Are there any good parser libraries for binary data?
How well do parsers support streaming transformations (I want to be able to stream the data being parsed to some output during parsing - I don't want to have to construct an entire parse tree in memory before writing things out)?
On the NIO front, my suspicion is that NIO isn't going to help much, as I am likely disk-limited (and since it's a single thread, there's no loss in simply blocking). Also, I suspect IO-based parsers are more common.
Let me try to explain if and how Preon addresses all of the concerns you mention:
I need to parse (and transform and write) a large binary file (larger
than memory) in Java.
That's exactly why Preon was created. You want to be able to process the entire file without loading it into memory entirely. Still, the programming model gives you a pointer to a data structure that appears to be entirely in memory. However, Preon will try to load data as lazily as it can.
To explain what that means, imagine that somewhere in your data structure, you have a collection of things that are encoded in a binary representation with a constant size; say that every element will be encoded in 20 bytes. Then Preon will first of all not load that collection in memory at all, and if you're grabbing data beyond that collection, it will never touch that region of your encoded representation at all. However, if you would pick the 300th element of that collection, it would (instead of decoding all elements up to the 300th element), calculate the offset for that element, and jump there immediately.
From the outside, it is as though you have a reference to a list that is fully populated. From the inside, it only goes out to grab an element of the list if you ask for it. (And forget about it immediately afterward, unless you instruct Preon to do things differently.)
I also need to do so as efficiently as possible in a single thread.
I'm not sure what you mean by efficiently. It could mean efficiently in terms of memory consumption, or efficiently in terms of disk IO, or perhaps you mean it should be really fast. I think it's fair to say that Preon aims to strike a balance between an easy programming model, memory use and a number of other concerns. If you really need to traverse all data in a sequential way, then perhaps there are ways that are more efficient in terms of computational resources, but I think that would come at the cost of "ease of programming".
And, finally, the format being read is very structured, so it would be
good to have some kind of parser library (so that the code is close to
the complex specification).
The way I implemented support for Java bytecode was to just read the bytecode specification and then map all of the structures mentioned in there directly to Java classes with annotations. I think Preon comes pretty close to what you're looking for.
You might also want to check out preon-emitter, since it allows you to generate annotated hexdumps (such as in this example of the hexdump of a Java class file) of your data, a capability that I haven't seen in any other library. (Hint: make sure you hover with your mouse over the hex numbers.)
The same goes for the documentation it generates. The aim has always been to make sure it creates documentation that could be posted to Wikipedia, just like that. It may not be perfect yet, but I'm not unhappy with what it's currently capable of doing. (For an example: this is the documentation generated for Java's class file specification.)
The amount of lookahead needed for parsing should be small, if that matters.
Okay, that's good. In fact, that's even vital for Preon. Preon doesn't support lookahead. It does support looking back, though. (That is, sometimes part of the encoding mechanism is driven by data that was read before. Preon allows you to declare dependencies that point back to data read before.)
Are there any good parser libraries for binary data?
Preon! ;-)
How well do parsers support streaming transformations (I want to be
able to stream the data being parsed to some output during parsing - I
don't want to have to construct an entire parse tree in memory before
writing things out)?
As I outlined above, Preon does not construct the entire data structure in memory before you can start processing it. So, in that sense, you're good. However, there is nothing in Preon supporting transformations as first-class citizens, and its support for encoding is limited.
On the nio front my suspicion is that nio isn't going to help much, as
I am likely disk limited (and since it's a single thread, there's no
loss in simply blocking). Also, I suspect io-based parsers are more
common.
Preon uses NIO, but only its support for memory-mapped files.
On NIO vs IO you are right; going with IO should be the right choice - less complexity, stream-oriented, etc.
For a binary parsing library - check out Preon.
Using a memory-mapped file you can read through it without worrying about your memory, and it's fast.
I think you are correct re NIO vs IO, unless you have little-endian data, as NIO can read little-endian natively.
I am not aware of any fast binary parsers, generally you want to call the NIO or IO directly.
Memory mapped files can help with writing from a single thread as you don't have to flush it as you write. (But it can be more cumbersome to use)
You can stream the data how you like; I don't foresee any problems.
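A small sketch of reading little-endian binary data through a memory-mapped file (the record layout and file name are hypothetical; note that a single map() call is limited to 2 GB, so larger files need several mappings):
import java.io.IOException;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedReader {
    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get("data.bin"), StandardOpenOption.READ)) {
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            buffer.order(ByteOrder.LITTLE_ENDIAN);  // NIO reads little-endian natively
            while (buffer.remaining() >= 12) {
                int id = buffer.getInt();           // hypothetical record: a 4-byte id...
                double value = buffer.getDouble();  // ...followed by an 8-byte value
                // ... stream each record to your output here ...
            }
        }
    }
}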
I am looking for a simple way to store and retrieve millions of xml files. Currently everything is done in a filesystem, which has some performance issues.
Our requirements are:
Ability to store millions of xml-files in a batch-process. XML files may be up to a few megs large, most in the 100KB-range.
Very fast random lookup by id (e.g. document URL)
Accessible by both Java and Perl
Available on the most important Linux-Distros and Windows
I did have a look at several NoSQL platforms (e.g. CouchDB, Riak and others), and while those systems look great, they seem almost like overkill:
No clustering required
No daemon ("service") required
No clever search functionality required
Having delved deeper into Riak, I have found Bitcask (see intro), which seems like exactly what I want. The basics described in the intro are really intriguing. But unfortunately there is no means to access a Bitcask repo via Java (or is there?).
So my question boils down to:
is the following assumption right: the Bitcask model (append-only writes, in-memory key management) is the right way to store/retrieve millions of documents
are there any viable alternatives to Bitcask available via Java? (Berkeley DB comes to mind...)
(for riak specialists) Is Riak much overhead implementation/management/resource wise compared to "naked" Bitcask?
I don't think that Bitcask is going to work well for your use-case. It looks like the Bitcask model is designed for use-cases where the size of each value is relatively small.
The problem is in Bitcask's data file merging process. This involves copying all of the live values from a number of "older data files" into the "merged data file". If you've got millions of values in the region of 100 KB each, this is an insane amount of data copying.
Note the above assumes that the XML documents are updated relatively frequently. If updates are rare and / or you can cope with a significant amount of space "waste", then merging may only need to be done rarely, or not at all.
Bitcask can be appropriate for this case (large values) depending on whether or not there is a great deal of overwriting. In particular, there is no reason to merge files unless there is a great deal of wasted space, which only occurs when new values arrive with the same key as old values.
Bitcask is particularly good for this batch load case as it will sequentially write the incoming data stream straight to disk. Lookups will take one seek in most cases, although the file cache will help you if there is any temporal locality.
I am not sure on the status of a Java version/wrapper.
I need to serialize a huge amount of data (around 2gigs) of small objects into a single file in order to be processed later by another Java process. Performance is kind of important. Can anyone suggest a good method to achieve this?
Have you taken a look at google's protocol buffers? Sounds like a use case for it.
I don't know why Java Serialization got voted down, it's a perfectly viable mechanism.
It's not clear from the original post, but is all 2G of data in the heap at the same time? Or are you dumping something else?
Out of the box, Serialization isn't the "perfect" solution, but if you implement Externalizable on your objects, Serialization can work just fine. Serialization's big expense is figuring out what to write and how to write it. By implementing Externalizable, you take those decisions out of its hands, thus gaining quite a boost in performance and a space saving.
While I/O is a primary cost of writing large amounts of data, the incidental costs of converting the data can also be very expensive. For example, you don't want to convert all of your numbers to text and then back again; better to store them in a more native format if possible. ObjectOutputStream/ObjectInputStream have methods to read/write the native types in Java.
If all of your data is designed to be loaded in to a single structure, you could simply do ObjectOutputStream.writeObject(yourBigDatastructure), after you've implemented Externalizable.
However, you could also iterate over your structure and call writeObject on the individual objects.
Either way, you're going to need some "objectToFile" routine, perhaps several. And that's effectively what Externalizable provides, as well as a framework to walk your structure.
The other issue, of course, is versioning, etc. But since you implement all of the serialization routines yourself, you have full control over that as well.
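A minimal sketch of the Externalizable approach (the class and its fields are made up for illustration; the point is that writeExternal/readExternal write the fields directly in a compact, native form):
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

public class Measurement implements Externalizable {
    private long timestamp;
    private double value;

    public Measurement() { }  // public no-arg constructor is required by Externalizable

    public Measurement(long timestamp, double value) {
        this.timestamp = timestamp;
        this.value = value;
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        // Write the fields directly in native binary form; no reflection, no field names.
        out.writeLong(timestamp);
        out.writeDouble(value);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        timestamp = in.readLong();
        value = in.readDouble();
    }
}
Writing is then just ObjectOutputStream.writeObject(...) per object, or on the enclosing structure, as described above.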
The simplest approach coming immediately to my mind is using a memory-mapped buffer from NIO (java.nio.MappedByteBuffer). Use a single buffer (approximately) corresponding to the size of one object and flush/append them to the output file when necessary. Memory-mapped buffers are very efficient.
Have you tried java serialization? You would write them out using an ObjectOutputStream and read 'em back in using an ObjectInputStream. Of course the classes would have to be Serializable. It would be the low effort solution and, because the objects are stored in binary, it would be compact and fast.
I developed JOAFIP as a database alternative.
Apache Avro might also be useful. It's designed to be language-independent and has bindings for the popular languages.
Check it out.
Protocol buffers: makes sense. Here's an excerpt from their wiki: http://code.google.com/apis/protocolbuffers/docs/javatutorial.html
Getting More Speed
By default, the protocol buffer compiler tries to generate smaller files by using reflection to implement most functionality (e.g. parsing and serialization). However, the compiler can also generate code optimized explicitly for your message types, often providing an order of magnitude performance boost, but also doubling the size of the code. If profiling shows that your application is spending a lot of time in the protocol buffer library, you should try changing the optimization mode. Simply add the following line to your .proto file:
option optimize_for = SPEED;
Re-run the protocol compiler, and it will generate extremely fast parsing, serialization, and other code.
You should probably consider a database solution - all databases do is optimize their information, and if you use Hibernate, you keep your object model as is and don't really even think about your DB (I believe that's why it's called Hibernate: just store your data off, then bring it back).
If performance is very important then you need to write it yourself. You should use a compact binary format, because with 2 GB the disk I/O operations are the dominant cost. If you use any human-readable format like XML, you inflate the data by a factor of 2 or more.
Depending on the data, it can speed things up if you compress the data on the fly with a low compression level.
Java serialization is a total no-go because on reading, Java checks every object to see whether it is a reference to an existing object.
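As an illustration of the "compact binary format" point above, a small sketch using only the JDK (the record layout and file name are made up): buffered data streams avoid both the text conversion overhead and Java serialization's per-object reference tracking.
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class CompactWriter {
    public static void main(String[] args) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream("records.bin")))) {
            for (int i = 0; i < 1_000_000; i++) {
                out.writeLong(i);          // hypothetical record: an 8-byte id...
                out.writeDouble(i * 0.5);  // ...followed by an 8-byte value; no field names, no tags
            }
        }
    }
}
Reading mirrors this with a DataInputStream wrapped around a BufferedInputStream.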