I am sending large arrays (>100 elements) to my Java backend every second.
Spring (Jackson) is converting this array and mapping it to a local String[].
Alternatively, I could map it to a single String value if that gives better performance.
Is deserializing in such scenarios a major time-consuming activity, or not a big deal? If it isn't, when does it become a big deal?
So, basically, I am trying to understand the difference between mapping to a String vs. a String[] for big values like an array of 100 elements.
FWIW, internally it is using the Jackson parser. And I have to scale this to support concurrent users sending such serialized array data.
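For concreteness, here is a rough sketch of the two mappings I am comparing (controller, endpoint, and parameter names are just illustrative):

import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class IngestController {

    // Option A: Jackson parses the JSON array and binds it to a String[].
    @PostMapping("/ingest-array")
    public void ingestArray(@RequestBody String[] values) {
        // process values ...
    }

    // Option B: the raw body is bound as a single String; no element-level
    // parsing happens here, but the data stays unparsed until I need it.
    @PostMapping("/ingest-raw")
    public void ingestRaw(@RequestBody String rawJson) {
        // store or forward rawJson ...
    }
}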
Is deserializing in such scenarios a major time-consuming activity or not a big deal?
That's a vague question, so it deserves a vague answer - it depends.
In some cases it could be insignificant; in others it could be a major bottleneck. It depends on your application and how often you perform this serialization/deserialization.
It's a bit like asking whether $1 is a lot. Paying $1 for something important once a year, when you're already spending $1,000,000 on other things, is insignificant.
Paying $1 every second when the rest of your expenses are $1 per day is probably a lot.
If you want to improve the performance of your application, you should start by measuring. See what takes too much time and/or resources, and optimize that.
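If you just want a rough sense of the deserialization cost before reaching for a profiler, a quick-and-dirty timing sketch might look like this (not a rigorous benchmark; JIT warm-up and GC can skew it, and the 100-element payload here is made up):

import com.fasterxml.jackson.databind.ObjectMapper;

public class DeserializationTiming {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        // Build a made-up 100-element JSON array, similar in shape to the payload.
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < 100; i++) {
            sb.append("\"value-").append(i).append("\"");
            if (i < 99) sb.append(',');
        }
        String json = sb.append(']').toString();

        // Warm up the JIT first, then time a batch and take the average.
        for (int i = 0; i < 10_000; i++) mapper.readValue(json, String[].class);
        long start = System.nanoTime();
        for (int i = 0; i < 10_000; i++) mapper.readValue(json, String[].class);
        System.out.println("avg per parse: "
                + (System.nanoTime() - start) / 10_000 + " ns");
    }
}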
I am trying to index a record set of 5 billion rows, or even more, using Lucene.
Does the indexing time increase exponentially as the record set grows?
My initial indexing of 10 million records happened very quickly, but when I tried to index more than 100 million records, it took more time than I expected relative to the 10-million-record indexing time.
Is it because it is indexing against more documents, and hence the time increases exponentially? Or what could be the reason behind this behavior, and is there any way to optimize it? (Please note, currently all fields in all the documents are of type StringField; will changing it to IntField help me in this direction?)
My second question is how search performance will be once 5 billion records are indexed. Any ideas on that?
Let me know if you need more information from my end on this.
Our current use case seems somewhat similar to yours: 1.6 billion rows, most fields are exact matches, periodic addition of files/rows, regular searching. Our initial indexing is not distributed or parallelized in any way, currently, and takes around 9 hours. I only offer that number to give you a very vague sense of what your indexing experience may be.
To try and answer your questions:
Our indexing time does not grow exponentially with the number of rows already indexed, though it does slow down very gradually. For us, perhaps 20% slower by the end, though it could also be specific to our data.
If you are experiencing significant slow-down, I support femtoRgon's suggestion that you profile to see what's eating the time. Lucene has never been the slowest/weakest component in our system.
Yes, you can write to your index in parallel, and you may see improved throughput. Whether it helps or not depends on where your bottlenecks are, of course. Consider using Solr - it may ease your efforts here.
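To sketch what parallel writing might look like: a single IndexWriter is safe to share across threads. The paths, analyzer, thread count, and document fields below are placeholders, and the constructor style assumes a Lucene 5+-era API, so treat this as a rough outline rather than our actual code:

import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class ParallelIndexing {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/tmp/index")), config)) {
            ExecutorService pool = Executors.newFixedThreadPool(4);
            for (int t = 0; t < 4; t++) {
                final int shard = t;
                pool.submit(() -> {
                    for (int i = 0; i < 1_000; i++) {
                        Document doc = new Document();
                        doc.add(new StringField("id", shard + "-" + i, Field.Store.YES));
                        writer.addDocument(doc);   // safe to call from multiple threads
                    }
                    return null;   // real code should check the returned Futures for errors
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            writer.commit();
        }
    }
}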
We use a mixture of StringField, LongField, and TextField. It seems unlikely that the type of field is causing your slowdown on its own.
These answers are all anecdotal, but perhaps they'll be of some use to you.
This page is now quite dated, but if you exhaust all your other options, it may provide hints about which levers you can pull to tweak performance: How to make indexing faster
Have you profiled to see what is actually causing your performance issues? You could find that something unexpected is eating up all that time. When I profiled a similar performance issue that I thought was caused by Lucene, it turned out the problem was mostly string concatenation.
As to whether you should use StringField or IntField (or TextField, or whatever), you should determine that based on what is in the field and how you are going to search it. If you might want to search the field as a range of numeric values, it should be an IntField, not a StringField. By the way, StringField indexes the entire value as a single term and skips analysis, so it is also the wrong field type for full text, for which you should use a TextField. Basically, using StringField for everything seems very much like a bad code smell to me, and it could cause performance issues at index time, but I would expect the much larger problems to appear when you start trying to search.
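To illustrate the distinction, here is a small sketch of picking a field type per use. It uses the Lucene 4.x/5.x-era classes since the question mentions StringField and IntField, and the field names and values are made up:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class FieldChoices {
    static Document buildDoc() {
        Document doc = new Document();
        // Exact-match key: indexed as a single term, not analyzed.
        doc.add(new StringField("isbn", "978-3-16-148410-0", Field.Store.YES));
        // Numeric value you may want to range-search or sort on.
        doc.add(new IntField("pageCount", 412, Field.Store.YES));
        // Free text: analyzed into tokens for full-text search.
        doc.add(new TextField("body", "the quick brown fox ...", Field.Store.NO));
        return doc;
    }
}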
As far as "how will the search performance be with 5 billion values", that's far too vague a question to even attempt to answer. No idea. Try it and see.
It is extremely difficult to illustrate the complexity of frameworks (Hibernate, Spring, apache-commons, ...).
The only thing I could think of was to compare the file sizes of the jar libraries, or even better, the number of classes contained in the jar files.
Of course this is not a mathematically sound measure of complexity. But at least it should make clear that some frameworks are lightweight compared to others.
Of course it would take quite some time to calculate these statistics. In an attempt to save time, I was wondering if perhaps somebody has done so already?
EDIT:
Yes, there are a lot of tools to calculate the complexity of individual methods and classes. But this question is about third party jar files.
Also please note that 40% of the phrases in my original question stress the fact that everybody is well aware that complexity is hard to measure, and that file size and number of classes may indeed not be sufficient. So it is not necessary to elaborate on this any further.
There are tools out there that can measure the complexity of code. However, this is more of a psychological question, as you cannot mathematically define the term 'complex code'. And obviously, giving the same piece of code to two random people will get you very different answers.
In general, the issue with complexity arises from the fact that a human brain cannot process more than a certain number of lines of code simultaneously (actually functional pieces, but normal lines of code should roughly correspond to that). The exact number of lines that one can hold and understand in memory at the same time of course varies based on many factors (including time of day, day of the week, and the status of your coffee machine) and therefore completely depends on the audience. However, the fewer lines of code you have to keep in your 'internal memory register' for one task, the better, so this should be the general factor when trying to determine the complexity of an API.
There is, however, a pitfall with this way of calculating complexity: many APIs offer you a fast way of solving a problem (an easy entry level), but this solution later turns out to cause several very complex coding decisions that overall make your code very difficult to understand. In contrast, other APIs require you to do a very complex setup that is hard to understand at first, but the rest of your code will be extremely easy because of that initial setup.
Therefore, a good way of measuring API complexity is to define a task that is representative and big enough to solve with that API, and then measure the average number of simultaneous lines of code one has to keep in mind to implement that task. And once you're done, please publish the result in a scientific paper of your choice. ;)
I have this list of TCP/UDP port numbers and their string descriptions:
http://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers
Now this is in the form of a HashMap with the port number as the key and the string description as the value. It might not be that big, but I have to look up the port description in real time as packets arrive, and as you can imagine, this requires efficient retrieval, otherwise it slows down the processing considerably.
Initially I thought of implementing a huge switch/case or if/else-if chain, but that seemed too shabby, so I came up with this HashMap.
Now I want to know: does Java have something like a caching mechanism to speed things up when the queries are mostly the same? Mostly the queried ports will be 80, 443, 23, 22, etc., and packets for other service types will arrive only rarely.
My Options:
Should I make a couple of else-if checks at the start for the most common types and then fall back to this HashMap if nothing was found earlier?
Should I continue with this HashMap doing the search for me?
Should I switch to some other, cleverer way of doing this?
Please suggest.
Have you measured how long this takes? I suspect that a lookup in a hash map with a reasonable number of buckets is going to be negligible compared to whatever else you're doing.
As always with this sort of question, it's well worth measuring the supposed performance issue before working on it. Premature optimisation is the root of all evil, as they say.
it slows down the processing considerably.
A lookup in a HashMap typically takes about 50 ns. Given that reading data from a socket typically takes 10,000-20,000 ns, I suspect this isn't the problem you think it is.
If you want really fast lookup use an array as this can be faster.
String[] portToName = new String[65536];
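For example, it might be populated and used roughly like this (the entries and the lookup are just illustrative):

String[] portToName = new String[65536];
portToName[22]  = "SSH";
portToName[23]  = "Telnet";
portToName[80]  = "HTTP";
portToName[443] = "HTTPS";

int port = 443;                                    // port number pulled from the packet
String description = portToName[port];             // O(1): no hashing, no Integer boxing
if (description == null) description = "unknown";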
A HashMap has expected O(1) access time for get operations (assuming a reasonable hash function and load factor; the worst case is slower). The way you're doing it right now is perfectly reasonable.
Maintaining an if/else-if structure would be error-prone and useless in terms of speedup (for a large list it would actually be worse, with O(n) asymptotic time).
I have a List of Strings I need to store locally (assume the list can hold between 10 and 100 items). I want to know whether I should write the list out to a flat file database or use serialization to flatten the object containing the list. Which is more expensive (CPU-wise)? What are the conditions that make one more expensive than the other?
Thanks!!
Especially since they are Strings, just write them out one per line to a file. Simple, fast, and far easier to test.
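With java.nio that can be as little as the following sketch (the file name is a placeholder, and it assumes the strings contain no embedded newlines):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class SaveStrings {
    public static void main(String[] args) throws IOException {
        List<String> items = Arrays.asList("alpha", "beta", "gamma");

        // Write one string per line.
        Files.write(Paths.get("items.txt"), items);

        // Read them back the same way.
        List<String> readBack = Files.readAllLines(Paths.get("items.txt"));
        System.out.println(readBack);
    }
}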
I have a List of Strings I need to store locally (assume the list can hold between 10 and 100 items).
Assuming that the total length of the strings is small (e.g. less than 10K), the user-space CPU time used to do the saving is likely to be a few milliseconds using either serialization or a flat file. In other words, it will be so fast that the user won't notice the difference.
You should be looking at the other reasons for choosing between the two alternatives (and others):
How easy is it to write the code.
How many extra dependencies does the alternative pull in.
Human readability / editability of the saved data file ... in case you need to do this.
How easy / hard it would be to change the "schema" of the stuff saved to file ... in case you need to do this.
Whether you can update one string without rewriting the whole file ... if this is relevant.
Support for other things such as atomic update, transactions, complex queries, etc ... if these are relevant.
And if, despite what I said above, you still want to know which will be faster (and by how much), then benchmark it. The real world performance will depend on factors that you haven't specified.
Here are a couple of important references on how to write a Java benchmark so that it gives meaningful results.
How NOT to write a Java micro-benchmark
Robust Java benchmarking, Part 1: Issues.
Robust Java benchmarking, Part 2: Statistics and solutions
And you can experiment to answer this part of your question:
What are the conditions that make one more expensive than the other?
(See above)
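If it helps to see the serialization alternative concretely, it might look roughly like this (the file name is a placeholder, and a real comparison against the flat-file approach should use a proper benchmark harness, per the links above):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Arrays;

public class SerializedSave {
    public static void main(String[] args) throws Exception {
        ArrayList<String> items = new ArrayList<>(Arrays.asList("alpha", "beta", "gamma"));

        // Write: the whole list (plus class metadata) goes into one binary file.
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream("items.ser"))) {
            out.writeObject(items);
        }

        // Read back: needs a cast, and the file is not human-readable.
        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream("items.ser"))) {
            @SuppressWarnings("unchecked")
            ArrayList<String> readBack = (ArrayList<String>) in.readObject();
            System.out.println(readBack);
        }
    }
}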
I am not sure about the exact expense, but the object representation often contains a whole lot of metadata (and structure), which can result in a much bigger size than the original intended data. An example of this is storing an XML structure in a DOM object - it takes about 4X the memory of the original data.
Based on the above, I think serializing it as an object might be more expensive. You may also want to consider how the end product will be consumed. If you want the produced file to be human-readable, you will have to write the String data out as plain text rather than a serialized object.
Could you please suggest to me (a novice in Android/Java) what's the most efficient way to deal with relatively large amounts of data?
I need to compute some stuff for each of 1000-5000 elements of, say, a big data type (x1, y1, z1 - double; flag1...flagn - boolean; desc1...descn - string) quite often (once a second), which is why I want to do it as fast as possible.
What way would be the best? Declaring a multidimensional array, producing a separate array for each field (x1[i], y1[i], ...), a special class, some sort of JavaBean? Which one is the most efficient in terms of speed etc.? Which is the most common way to deal with this sort of thing in Java?
Many thanks in advance!
Nick, you've asked a very general question. I'll do my best to answer it, but please be aware that if you want anything more specific, you're going to need to narrow your question down a bit.
Some back-of-the-envelope calculations show that for an array of 5000 doubles you'll use 8 bytes * 5000 = 40,000 bytes, or roughly 40 kB of memory. This isn't too bad, as memory on most Android devices is on the order of megabytes or even gigabytes. A good ol' ArrayList should do just fine for storing this data. You could probably make things a little faster by specifying the ArrayList's capacity when you construct it. That way the ArrayList doesn't have to dynamically expand every time you add more data to it.
Word of caution, though. Since we are on a memory-restricted device, what could potentially happen is that if you generate a lot of these ArrayLists in rapid succession, you might start triggering the garbage collector a lot. This could cause your app to slow down (the whole device, actually). If you're really going to be generating lots of data, then don't store it all in memory. Store it on disk, where you'll have plenty of room and won't be triggering the garbage collector all the time.
I think that the efficiency with which you write the computation you need to do on each element is far more important than the data structure you use to store it. The difference between using an array per field and an array of objects (each an instance of a class containing all the fields) should be practically negligible. Use whatever data structures you feel most comfortable with and focus on writing efficient algorithms.
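To make the suggestions above concrete, here is a rough sketch using one plain class per element and a pre-sized list. The field names loosely mirror the question's x/y/z/flag/desc, and the fill loop is made up:

import java.util.ArrayList;
import java.util.List;

public class Samples {
    // One plain class per element is usually easier to work with than parallel arrays.
    static class Sample {
        double x, y, z;
        boolean flag;
        String desc;
    }

    public static void main(String[] args) {
        // Pre-size the list so it never has to grow while being filled.
        List<Sample> samples = new ArrayList<>(5000);
        for (int i = 0; i < 5000; i++) {
            Sample s = new Sample();
            s.x = i * 0.1;
            s.y = i * 0.2;
            s.z = i * 0.3;
            s.flag = (i % 2 == 0);
            s.desc = "sample-" + i;
            samples.add(s);
        }
        System.out.println(samples.size() + " elements");
    }
}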