Background
There is a well-known tool called Wireshark. I've been using it for ages. It is great, but performance is the problem. Common usage scenario includes several data preparation steps in order to extract a data subset to be analyzed later. Without that step it takes minutes to do filtering (with big traces Wireshark is next to unusable).
The actual idea is to create a better solution, fast, parallel and efficient, to be used as a data aggregator/storage.
Requirements
The actual requirement is to use all power provided by modern hardware. I should say there is a room for different types of optimization and I hope I did a good job on upper layers, but technology is the main question right now. According to the current design there are several flavors of packet decoders (dissectors):
interactive decoders: decoding logic can be easily changed in runtime. Such approach can be quite useful for protocol developers -- decoding speed is not that critical, but flexibility and fast results are more important
embeddable decoders: can be used as a library.This type is supposed to have good performance and be flexible enough to use all available CPUs and cores
decoders as a service: can be accessed through a clean API. This type should provide best of the breed performance and efficiency
Results
My current solution is JVM-based decoders. The actual idea is to reuse the code, eliminate porting, etc, but still have good efficiency.
Interactive decoders: implemented on Groovy
Embeddable decoders: implemented on Java
Decoders as a service: Tomcat + optimizations + embeddable decoders wrapped into a servlet (binary in, XML out)
Problems to be solved
Groovy provides way to much power and everything, but lucks expressiveness in this particular case
Decoding protocol into a tree structure is a dead end -- too many resources are simply wasted
Memory consumption is somewhat hard to control. I did several optimizations but still not happy with profiling results
Tomcat with various bells and whistles still introduces to much overhead (mainly connection handling)
Am I doing right using JVM everywhere? Do you see any other good and elegant way to achieve the initial goal: get easy-to-write highly scalable and efficient protocol decoders?
The protocol, format of the results, etc are not fixed.
I've found several possible improvements:
Interactive decoders
Groovy expressiveness can be greatly improved, by extending Groovy syntax using
AST Transformations. So it would be possible to simplify decoders authoring still providing good performance. AST (stands for Abstract Syntax Tree) is a compile-time technique.
When the Groovy compiler compiles Groovy scripts and classes, at some
point in the process, the source code will end up being represented in
memory in the form of a Concrete Syntax Tree, then transformed into an
Abstract Syntax Tree. The purpose of AST Transformations is to let
developers hook into the compilation process to be able to modify the
AST before it is turned into bytecode that will be run by the JVM.
I do not want to reinvent the wheel introducing yet another language to define/describe a protocol structure (it is enough to have ASN.1). The idea is to simplify decoders development in order to provide some fast prototyping technique. Basically, some kind of DSL is to be introduced.
Further reading
Embeddable decoders
Java can introduce some additional overhead. There are several libraries to address that issue:
HPPC
Trove
Javolution
Commons-primitives
Frankly speaking I do not see any other option except Java for this layer.
Decoders as a service
No Java is needed on this layer. Finally I have a good option to go but price is quite high. GWan looks really good.
Some additional porting will be required, but it is definitely worth it.
This problem seems to share the same characteristic of many high-performance I/O implementation problems, which is that the number of memory copies dominates performance. The scatter-gather interface patterns for asynchronous I/O follow from this principle. With scatter-gather, blocks of memory are operated on in place. As long as the protocol decoders take block streams as input rather than byte streams, you'll have eliminated a lot of the performance overhead of moving memory around to preserve the byte stream abstraction. The byte stream is a very good abstraction for saving engineering time, but not so good for high-performance I/O.
In a related issue, I'd beware of the JVM just because of the basic type String. I can't say I'm familiar with how String is implemented in the JVM, but I do imagine that there's not a way of making a string out of a block list without doing a memory copy. On the other hand, a native kind of string that could, and which interoperated with the JVM String compatibly could be a way of splitting the difference.
The other aspect of this problem that seems relevant is that of formal languages. In the spirit of not copying blocks of memory, you also don't want to be scanning the same block of memory over and over. Since you want to make run-time changes, that means you probably don't want to use a precompiled state machine, but rather a recursive descent parser that can dispatch to an appropriate protocol interpreter at each level of descent. There are some complications involved when an outer layer does not specify the type of an inner layer. Those complications are worse when you don't even get the length of the inner content, since then you're relying on the inner content to be well formed to prevent runaway. Nevertheless, it's worth putting some attention into understand how many times a single block will be scanned.
Network traffic is growing (some analytics), so there will be a need to process more and more data per second.
The only way to achieve that is to use more CPU power, but CPU frequency is stable. Only number of cores is growing. It looks like the only way is to use available cores more efficiently and scale better.
Related
I was wondering how the serialization of MicroStream works in detail.
Since it is described as "Super-Fast" it has to rely on code-generation, right? Or is it based on reflections?
How would it perform in comparison to the Protobuf-Serialization, which relies on Code-generation that directly reads out of the java-fields and writes them into a bytebuffer and vice-versa.
Using reflections would drastically decrease the performance when serializing objects on a huge scale, wouldn't it?
I'm looking for a fast way to transmit and persist objects for a multiplayer-game and every millisecond counts. :)
Thanks in advance!
PS: Since I don't have enough reputation, I can not create the "microstream"-tag. https://microstream.one/
I am the lead developer of MicroStream.
(This is not an alias account. I really just created it. I'm reading on StackOverflow for 10 years or so but never had a reason to create an account. Until now.)
On every initialization, MicroStream analyzes the current runtime's versions of all required entity and value type classes and derives optimized metadata from them.
The same is done when encountering a class at runtime that was unknown so far.
The analysis is done per reflection, but since it is only done once for every handled class, the reflection performance cost is negligible.
The actual storing and loading or serialization and deserialization is done via optimized framework code based on the created metadata.
If a class layout changes, the type analysis creates a mapping from the field layout that the class' instances are stored in to that of the current class.
Automatically if possible (unambiguous changes or via some configurable heuristics), otherwise via a user-provided mapping. Performance stays the same since the JVM does not care if it (simplified speaking) copies a loaded value #3 to position #3 or to position #5. It's all in the metadata.
ByteBuffers are used, more precisely direct ByteBuffers, but only as an anchor for off-heap memory to work on via direct "Unsafe" low-level operations. If you are not familiar with "Unsafe" operations, a short and simple notion is: "It's as direct and fast as C++ code.". You can do anything you want very fast and close to memory, but you are also responsible for everything. For more details, google "sun.misc.Unsafe".
No code is generated. No byte code hacking, tacit replacement of instances by proxies or similar monkey business is used. On the technical level, it's just a Java library (including "Unsafe" usage), but with a lot of properly devised logic.
As a side note: reflection is not as slow as it is commonly considered to be. Not any more. It was, but it has been optimized pretty much in some past Java version(s?).
It's only slow if every operation has to do all the class analysis, field lookups, etc. anew (which an awful lot of frameworks seem to do because they are just badly written). If the fields are collected (set accessible, etc.) once and then cached, reflection is actually surprisingly fast.
Regarding the comparison to Protobuf-Serialization:
I can't say anything specific about it since I haven't used Protocol Buffers and I don't know how it works internally.
As usual with complex technologies, a truly meaningful comparison might be pretty difficult to do since different technologies have different optimization priorities and limitations.
Most serialization approaches give up referential consistency but only store "data" (i.e. if two objects reference a third, deserialization will create TWO instances of that third object.
Like this: A->C<-B ==serialization==> A->C1 B->C2.
This basically breaks/ruins/destroys object graphs and makes serialization of cyclic graphs impossible, since it creates and endlessly cascading replication. See JSON serialization, for example. Funny stuff.)
Even Brian Goetz' draft for a Java "Serialization 2.0" includes that limitation (see "Limitations" at http://cr.openjdk.java.net/~briangoetz/amber/serialization.html) (and another one which breaks the separation of concerns).
MicroStream does not have that limitation. It handles arbitrary object graphs properly without ruining their references.
Keeping referential consistency intact is by far not "trying to do too much", as he writes. It is "doing it properly". One just has to know how to do it properly. And it even is rather trivial if done correctly.
So, depending on how many limitations Protobuf-Serialization has ("pacts with the devil"), it might be hardly or even not at all comparable to MicroStream in general.
Of course, you can always create some performance comparison tests for your particular requirements and see which technology suits you best. Just make sure you are aware of the limitations a certain technology imposes on you (ruined referential consistency, forbidden types, required annotations, required default constructor / getters / setters, etc.).
MicroStream has none*.
(*) within reason: Serializing/storing system-internals (e.g. Thread) or non-entities (like lambdas or proxy instances) is, while technically possible, intentionally excluded.
Recently I see a lot of code in few projects using stream for filtering objects, like:
library.stream()
.map(book -> book.getAuthor())
.filter(author -> author.getAge() >= 50)
.map(Author::getSurname)
.map(String::toUpperCase)
.distinct()
.limit(15)
.collect(toList()));
Is there any advantages of using that instead of direct HQL/SQL query to the database returning already the filtered results.
Isn't the second aproach much faster?
If the data originally comes from a DB it is better to do the filtering in the DB rather than fetching everything and filtering locally.
First, Database management systems are good at filtering, it is part of their main job and they are therefore optimized for it. The filtering can also be sped up by using indexes.
Second, fetching and transmitting many records and to unmarshal the data into objects just to throw away a lot of them when doing local filtering is a waste of bandwidth and computing resources.
On a first glance: streams can be made to run in parallel; just by changing code to use parallelStream(). (disclaimer: of course it depends on the specific context if just changing the stream type will result in correct results; but yes, it can be that easy).
Then: streams "invite" to use lambda expressions. And those in turn lead to usage of invoke_dynamic bytecode instructions; sometimes gaining performance advantages compared to "old-school" kind of writing such code. (and to clarify the misunderstanding: invoke_dynamic is a property of lambdas, not streams!)
These would be reasons to prefer "stream" solutions nowadays (from a general point of view).
Beyond that: it really depends ... lets have a look at your example input. This looks like dealing with ordinary Java POJOs, that already reside in memory, within some sort of collection. Processing such objects in memory directly would definitely be faster than going to some off-process database to do work there!
But, of course: when the above calls, like book.getAuthor() would be doing a "deep dive" and actually talk to an underlying database; then chances are that "doing the whole thing in a single query" gives you better performance.
The first thing is to realize, that you can't tell from just this code, what statement is issued against the database. It might very well, that all the filtering, limiting and mapping is collected, and upon the invocation of collect all that information is used to construct a matching SQL statement (or whatever query language is used) and send to the database.
With this in mind there are many reasons why streamlike APIs are used.
It is hip. Streams and lambdas are still rather new to most java developers, so they feel cool when they use it.
If something like in the first paragraph is used it actually creates a nice DSL to construct your query statements. Scalas Slick and .Net LINQ where early examples I know about, although I assume somebody build something like it in LISP long before I was born.
The streams might be reactive streams and encapsulate a non-blocking API. While these APIs are really nice because they don't force you to block resources like threads while you are waiting for results. Using them requires either tons of callbacks or using a much nicer stream based API to process the results.
They are nicer to read the imperative code. Maybe the processing done in the stream can't [easily/by the author] be done with SQL. So the alternatives aren't SQL vs Java (or what ever language you are using), but imperative Java or "functional" Java. The later often reads nicer.
So there are good reasons to use such an API.
With all that said: It is, in almost all cases, a bad idea to do any sorting/filtering and the like in your application, when you can offload it to the database. The only exception I can currently think of is when you can skip the whole roundtrip to the database, because you already have the result locally (e.g. in a cache).
Well, your question should ideally be - Is it better to do reduction / filtering operations in the DB or fetch all records and do it in Java using Streams?
The answer isn't straightforward and any stats that give a "concrete" answer will not generalize to all cases.
The operations you are talking about are better done in the DB itself, because that is what DBs are designed for, very fast handling of data. Of course usually in case of relational databases, there will be some "book-keeping and locks" being used to ensure that independent transactions don't end up making the data inconsistent, but even with that, DBs do a pretty good job in filtering data, especially large data sets.
One case where I would prefer filtering data in Java code rather than in DB would be if you need to filter different features from the same data. For example, right now you are getting only the Author's surname. If you wanted to get all books written by the author, ages of authors, children of author, place of birth etc. Then it makes sense to get only one "read-only" copy from the DB and use parallel streams to get different information from the same data set.
Unless measured and proven for a specific scenario either could be good or equally bad. The reason you usually want to take these kind of queries to the database is because (among other things):
DB can handle much larger data then your java process
Queries in a database can be indexed (making them much faster)
On the other hand, if your data is small, using a Stream the way you did is effective. Writing such a Stream pipeline is very readable (once you talk Streams good enough).
Hibernate and other ORMs are usually way more useful for writing entities rather than reading, because they allow developers to offload ordering of specific writes to framework that almost never will "get that wrong".
Now, for reading and reporting, on the other hand (and considering we are talking DB here) an SQL query is likely to be better because there will not be any frameworks in-between, and you will be able to tune query performance in terms of database that will be invoking this query rather than in terms of your framework of choice, which gives more flexibility to how that tuning can be done, sort of.
Is there a method where I can iterate a Collection and only retrieve just a subset of attributes without loading/unloading the each of the full object to cache? 'Cos it seems like a waste to load/unload the WHOLE (possibly big) object when I need only some attribute(s), especially if the objects are big. It might cause unnecessary cache conflicts when loading such unnecessary data, right?
When I meant to 'load to cache' I mean to 'process' that object via the processor. So there would be objects of ex: 10 attributes. In the iterating loop I only use 1 of those. In such a scenario, I think its a waste to load all the other 9 attributes to the processor from the memory. Isn't there a solution to only extract the attributes without loading the full object?
Also, does something like Google's Guava solve the problem internally?
THANK YOU!
It's not usually the first place to look, but it's not certainly impossible that you're running into cache sharing problems. If you're really convinced (from realistic profiling or analysis of hardware counters) that this is a bottleneck worth addressing, you might consider altering your data structures to use parallel arrays of primitives (akin to column-based database storage in some DB architectures). e.g. one 'column' as a float[], another as a short[], a third as a String[], all indexed by the same identifier. This structure allows you to 'query' individual columns without loading into cache any columns that aren't currently needed.
I have some low-level algorithmic code that would really benefit from C's struct. I ran some microbenchmarks on various alternatives and found that parallel arrays was the most effective option for my algorithms (that may or may not apply to your own).
Note that a parallel-array structure will be considerably more complex to maintain and mutate than using Objects in java.util collections. So I'll reiterate - I'd only take this approach after you've convinced yourself that the benefit will be worth the pain.
There is no way in Java to manage loading to processor caches, and there is no way to change how the JVM works with objects, so the answer is no.
Java is not a low-level language and hides such details from the programmer.
The JVM will decide how much of the object it loads. It might load the whole object as some kind of read-ahead optimization, or load only the fields you actually access, or analyze the code during JIT compilation and do a combination of both.
Also, how large do you worry your objects are? I have rarely seen classes with more than a few fields, so I would not consider that big.
I am a bit new to profiling applications for improving performance. I have selected YourKit as my profiler. There is no doubt that YourKit provides very interesting statistics. Where I am getting stuck is what to do with these statistics.
For instance, Consider a method that operates on a JAXB POJO. The method iterates through the POJO to access a tag/element that is deeply nested inside the XML. This requires 4 layers of for loops to get to the element/tag as shown below :
List<Bundle> bundles = null;
List<Item> items = null;
for(Info info : data) {
bundles = info.getBundles();
for(Bundle bundle : bundles) {
items = bundle.getItems();
//.. more loops like this till we get to the required element
}
}
YourKit tells me that the above code is a 'hot-spot' and 80 objects are getting garbage collected for each call to the method containing this code. The above code is just an example and not the only part where I am getting stuck. Most of the times I have no clue about what to do with the information given by the profiler. What can I possibly do to reduce the number of temporary objects in the above code? Are there any well defined principles for imporoving the performance of an application? What statistics to look for when profiling an application and what implications does each kind of statistics have?
Edit :
The main objective for profiling the application is to increase the throughput and response time. The current throughput is only 10 percent of the required throughput!
Focus on the statistics relevant to your performance goal. You are interested in minimal response time, so look at how much each method contributes to response time, and focus on those that contribute much (for single threaded processing, that's simply elapsed time during method call, summed over all invocation of that method). I am not sure what YourKit defines as hot spots (check the docs), but it's probably the methods with highest cummulative elapsed time, so hot spots are a good thing to look at. In constrast, object allocation has no direct impact on response time, and is irrelavant in your case (unless you have identified that the garbage collector contributes a significant proportion of cpu time, which it usually doesn't).
I absolutely agree with the given answers.
I would like to add that considering your concrete example, you actually can make an improvement by using xpath api to access the specific location in the XML.
In situations where you don't need to actually iterate the entire DOM, this should be your first choice since it is declarative and hence more expressive and less error prone.
It would often give you superior performance as well (For very complex queries it may not be the case, but you seem to have a simple scenario).
A way to improve the loop would be to change your schema and essentially flatten the model, of course this depends on whether you can change the schema. This way the generated Java will not require 4 layers of looping. Of course at the end of the day you need to ask yourself is the code really a problem - 80 objects are getting GCed so? Is your application running slow? Are you experiencing memory issues? Remember premature optimization is the root of all evil!
Profiling and optimization is a complex beast and depends on may things (Java version, 32 vs 64 bit os, etc...). Furthermore the optimization might not always require code changes, for example you could resolve problems by changing your GC policy on the JVM - for example there are GC policies that are more effective in situations where your code is creating many small objects that need to be GCed frequently. If you had specifics maybe it would be easier to help you however your question seems too broad. In fact there are many books written on topic which might be worth a read.
I am working in a highly distributed environment. A lot of network access and a lot of db access.
I have some classes that are send over and over the network, and are serialized and de-serialized.
Most of the classes are quite simple in their nature, like :
class A{
long a;
long b;
}
And some are more complex (Compound - Collections).
There are some people in the company I work that claim that all the classes should implement Externalizable rather than Serializable , and that would make a major impact on the performance of the application.
Although the impact on the performance is very difficult to measure, since the application is so big and so distributed and not fully ready, I can't really simulate a full load right now.
So maybe some of you know some interesting article that would reveal anything to me. Or maybe you can share some thoughts.
My basic intuition was that is would not make any difference serializing and deserializing simple classes (like the one above) over the network/db, lets say when the IO process of the whole app are around 10%. ( I mean 90% of the time the system is doing other stuff than IO )
My basic intuition was that is would not make any difference serializing and deserializing simple classes (like the one above) over the network/db, lets say when the IO process of the whole app are around 10%. ( I mean 90% of the time the system is doing other stuff than IO )
Your intuition sounds reasonable. But what exactly is taking 10% of the time? Is it just the serialization / deserialization? Or does the 10% include the real (clock) time to do the I/O?
EDIT
If you have actual profiling measurements to back up your "10% to 15%" clock time doing serialization + deserialization + I/O, then logic tells you that the maximum performance improvement you can get will be less than that. If you can separate the I/O from the serialization / deserialization, you can refine that upper bound. My guess is that the actual improvement will be less than 5%.
I suggest that you create a small benchmark to send and receive one of your data types using serialization and externalization and see what percentage difference it actually makes.
It must be said that there is a (relatively) significant overhead in generic serialization versus optimally implemented externalization. A lot of this due to the general properties of serialization.
There is the overhead of marshaling / unmarshaling the type descriptors for each class used in the object being transmitted.
There is the overhead of adding each marshaled object to a hash table so that the serialization faithfully records cycles, etc.
However, serialization / deserialization is only a small part of the total I/O overheads, and these are only a small part of your application.
This is a pretty good website comparing many different Java serialization mechanisms.
http://github.com/eishay/jvm-serializers/wiki
I would ask them to come up with some measurements to support their claims. Then everybody will have a basis for a rational discussion. At present you don't have one. Note that it is those with the claims that should produce the supporting evidence: don't get sucked into being responsible for proving them wrong.
Java serialization is flexible and standard, but its not designed to be fast, especially for simple objects. If you want speed I suggest you try hessian or protobuf. These can be 5x faster for simple objects. Or youc an write a custom serializer which can be as much as 10x faster.
For us, custom serialization is the way to go. We let Java do what it does well, or at least good enough, for free, and provide custom support for what it does badly. This is a lot less code than full Externalizable support.
I can't imagine in what case custom serialization couldn't be done but using Externalizable could (referring to Roman's comment on Peter L's answer). Specifically, I'm referring to, e.g., implementation of writeObject/readObject.
Generic serialization can be fast
see http://java-is-the-new-c.blogspot.de/2013/10/still-using-externalizable-to-get.html
It depends on your concrete system if serialization performance is significant. I have seen systems which gained a lot of performance by speeding up serialization. It's not only about CPU but also about latencies. E.g. if a distributed system does a lot of blocking request/responses (requestor waiting for result), serializations adds up to overall request response time which can be significant as its (1) encode request (2) decode request (3) encode response (4) decode response. So 4 (de-)serialization's happing per request/response