I have a huge data file, and I only need specific data from it; later on, I will be using this data frequently.
So which of these two methods would be more efficient:
save this data in global variables (maybe LinkedList) and use them every time I need
save them in a file, and read the file every time I need the data
I should mention that this data could be a huge number of integers.
Which of the two ways mentioned would give better performance with respect to speed and memory?
If the file I/O overhead is not an issue for you: Save them in a file and create an index file mapping keys to file positions so you do not have to read your huge file.
If the data fits in your RAM and you want to be able to access it quickly, go with the first approach (keeping it in memory, so no index file is needed), but read the data into memory at startup or when it is first needed.
As long as it fits in memory, working with memory is surely orders of magnitude faster. But do not use LinkedList - it has a huge overhead. And do not use any standard Collection at all, since that means boxing and blows up the memory usage by a factor of 3 at least.
You could use int[] or a specialized collection for primitive types.
I'd recommend using a file via java.nio.IntBuffer. This way the data reside primarily on the disk but get mapped into memory too.
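A minimal sketch of that idea (the file name is a placeholder, and the file is assumed to contain raw 4-byte ints):

import java.io.RandomAccessFile;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;

public class MappedInts {
    public static void main(String[] args) throws Exception {
        // "data.bin" is a placeholder for your file of raw 4-byte ints.
        try (RandomAccessFile raf = new RandomAccessFile("data.bin", "r");
             FileChannel channel = raf.getChannel()) {
            // Map the whole file; a single mapping must stay below 2 GB.
            IntBuffer ints = channel
                    .map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
                    .asIntBuffer();
            System.out.println("first value: " + ints.get(0)); // random access by index
        }
    }
}

The OS pages the data in and out on demand, so only the parts you actually touch compete for RAM.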
Probably the first one.
But there really isn't enough information there to answer you properly.
Firstly, a linked list is fine if you only ever traverse it in order. However, if you need random access to it (5th element, then 100th, then 12th, then 45th...), it's lousy, and you'd be better off with an ArrayList or something. Secondly, if you're storing lots of ints in one of the standard Java collections, each int will be boxed, which may add a performance and memory overhead.
Then you haven't said what 'huge' means. Thousands? Millions?
So, yeah, you need to say what kind of numbers you're dealing with, and what the access patterns are likely to be. And is the 'filtering' step a one-off--or is it done quite frequently?
It depends on the system spec. If you are designing your app for one machine, the task is simple; otherwise you should take into account the memory and/or disk space limits on the client's computer.
I don't think you can compare the performance of these two approaches directly, as each one has its own benefits and drawbacks. I'm certain there are algorithms you could investigate further, connected with reading only part of a file into memory, or creating a cache (when you read a number from the file, store it in memory, so the next time you need it, it is already there).
Related
I am in the middle of a Java project which will be using a 'big dictionary' of words. By 'dictionary' I mean certain numbers (int) assigned to Strings. And by 'big' I mean a file of the order of 100 MB. The first solution that I came up with is probably the simplest possible. At initialization I read in the whole file and create a large HashMap which will be later used to look strings up.
Is there an efficient way to do it without the need of reading the whole file at initialization? Perhaps not, but what if the file is really large, let's say in the order of the RAM available? So basically I'm looking for a way to look things up efficiently in a large dictionary stored in memory.
Thanks for the answers so far; as a result, I've realised I could be more specific in my question. As you've probably guessed, the application is to do with text mining, in particular representing text in the form of a sparse vector (although some had other inventive ideas :)). So what is critical for usage is to be able to look strings up in the dictionary, obtain their keys as fast as possible. The initial overhead of 'reading' the dictionary file or indexing it into a database is not as important, as long as the string look-up time is optimized. Again, let's assume that the dictionary size is big, comparable to the size of the RAM available.
Consider ChronicleMap (https://github.com/OpenHFT/Chronicle-Map) in a non-replicated mode. It is an off-heap Java Map implementation, or, from another point of view, a superlightweight NoSQL key-value store.
What it offers for your task out of the box:
Persistence to disk via memory-mapped files (see the comment by Michał Kosmulski)
Lazy load (disk pages are loaded only on demand) -> fast startup
If your data volume is larger than the available memory, the operating system will unmap rarely used pages automatically.
Several JVMs can use the same map, because off-heap memory is shared at the OS level. Useful if you do the processing within a map-reduce-like framework, e.g. Hadoop.
Strings are stored in UTF-8 form -> ~50% memory savings if strings are mostly ASCII (as maaartinus noted)
int or long values take just 4 (or 8) bytes, as if you had a primitive-specialized map implementation.
Very little per-entry memory overhead, much less than in standard HashMap and ConcurrentHashMap
Good configurable concurrency via lock striping, if you already need it, or are going to parallelize text processing in the future.
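For reference, a rough sketch of what setting it up might look like (based on the Chronicle-Map 3.x builder API; the entry count, average key and file name are placeholders you would tune):

import java.io.File;
import java.io.IOException;
import net.openhft.chronicle.map.ChronicleMap;

public class DictionaryStore {
    public static void main(String[] args) throws IOException {
        // Sketch only: sizes and names below are assumptions, not recommendations.
        try (ChronicleMap<String, Integer> dict = ChronicleMap
                .of(String.class, Integer.class)
                .entries(10_000_000)            // expected number of distinct words
                .averageKey("representative")   // a typical key, used for sizing
                .createPersistedTo(new File("dictionary.dat"))) {
            dict.put("hello", 42);
            System.out.println(dict.get("hello"));
        }
    }
}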
At the point where your data structure is a few hundred MB, on the order of your RAM, you're better off not building the data structure at run-time, but rather using a database which supports indexing (which most do these days). Indexing is going to be one of the only ways you can ensure the fastest retrieval of text once your file gets so large and you're running up against the -Xmx setting of your JVM. This is because if your file is as large as, or much larger than, your maximum heap setting, you're inevitably going to crash your JVM.
As for having to read the whole file at initialization: you're going to have to do this eventually so that you can efficiently search and analyze the text in your code. If you know that you're only going to be searching a certain portion of your file at a time, you can implement lazy loading. If not, you might as well bite the bullet and load your entire file into the DB at the beginning. You can implement parallelism in this process if there are other parts of your code execution that don't depend on it.
Please let me know if you have any questions!
As stated in a comment, a Trie will save you a lot of memory.
You should also consider using bytes instead of chars as this saves you a factor of 2 for plain ASCII text or when using your national charset as long as it has no more than 256 different letters.
At first glance, combining this low-level optimization with tries makes no sense, as with tries the node size is dominated by the pointers. But there is a way, if you want to go low-level.
So what is critical for usage is to be able to look strings up in the dictionary, obtain their keys as fast as possible.
Then forget any database, as they're damn slow when compared to HashMaps.
If it doesn't fit into memory, the cheapest solution is usually to get more of it. Otherwise, consider loading only the most common words and doing something slower for the others (e.g., a memory mapped file).
I was asked to point to a good trie implementation, especially off-heap. I'm not aware of any.
Assuming the OP needs no mutability, especially no mutability of keys, it all looks very simple.
I guess the whole dictionary could be easily packed into a single ByteBuffer. Assuming mostly ASCII and with some bit hacking, an arrow would need 1 byte per arrow label character and 1-5 bytes for the child pointer. The child pointer would be relative (i.e., the difference between the current node and the child), which would make most of them fit into a single byte when stored in a base-128 encoding.
I can only guess the total memory consumption, but I'd say something like <4 bytes per word. The above compression would slow the lookup down, but still nowhere near what a single disk access needs.
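To make the base-128 child-pointer idea concrete, here is a small sketch (my own illustration, not part of any trie library) of writing and reading such variable-length relative offsets; offsets up to 127 take a single byte:

import java.nio.ByteBuffer;

public class VarInt {
    // Write a non-negative offset using 7 bits per byte;
    // the high bit of each byte means "more bytes follow".
    static void writeVarInt(ByteBuffer out, int value) {
        while ((value & ~0x7F) != 0) {
            out.put((byte) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.put((byte) value);
    }

    static int readVarInt(ByteBuffer in) {
        int value = 0, shift = 0, b;
        do {
            b = in.get() & 0xFF;
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        writeVarInt(buf, 300);                  // takes 2 bytes instead of 4
        buf.flip();
        System.out.println(readVarInt(buf));    // prints 300
    }
}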
It sounds too big to store in memory. Either store it in a relational database (easy, and with an index on the hash, fast), or a NoSQL solution, like Solr (small learning curve, very fast).
Although NoSQL is very fast, if you really want to tweak performance, and there are entries that are far more frequently looked up than others, consider using a limited size cache to hold the most recently used (say) 10000 lookups.
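A minimal sketch of such a bounded cache, using LinkedHashMap's access order (the capacity of 10000 is the figure mentioned above):

import java.util.LinkedHashMap;
import java.util.Map;

// Keeps only the N most recently accessed lookups in memory.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true);   // accessOrder = true
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}

// Usage sketch: cache the 10000 hottest dictionary lookups.
// LruCache<String, Integer> cache = new LruCache<String, Integer>(10000);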
What options are there for processing large files quickly, multiple times?
I have a single file (min 1.5 GB, but can be upwards of 10-15 GB) that needs to be read multiple times - on the order of hundreds to thousands of times. The server has a large amount of RAM (64+ GB) and plenty of processors (24+).
The file will be sequential, read-only. Files are encrypted (sensitive data) on disk. I also use MessagePack to deserialize them into objects during the read process.
I cannot store the objects created from the file in memory - too large an expansion (a 1.5 GB file turns into a 35 GB in-memory object array). The file can't be stored as a single byte array either (limited by Java's maximum array length of 2^31-1).
My initial thought is to use a memory mapped file, but that has its own set of limitations.
The idea is to get the file off the disk and into memory for processing.
The large volume of data is for a machine learning algorithm, that requires multiple reads. During the calculation of each file pass, there's a considerable amount of heap usage by the algorithm itself, which is unavoidable, hence the requirement to read it multiple times.
The problem you have here is that you cannot mmap() the way the system call of the same name does; the syscall can map up to 2^64 bytes, whereas FileChannel#map() cannot map more than 2^31-1 bytes in a single mapping.
However, what you can do is wrap a FileChannel into a class and create several "map ranges" covering the whole file.
I have done "nearly" such a thing, except more complicated: largetext. More complicated because I have to do the decoding process to boot, and the text which is loaded must be kept in memory, unlike you, who read bytes. Less complicated because I have a defined JDK interface to implement and you don't.
You can however use nearly the same technique using Guava and a RangeMap<Long, MappedByteBuffer>.
I implement CharSequence in this project above; I suggest that you implement a LargeByteMapping interface instead, from which you can read whatever parts you want; or, well, whatever suits you. Your main problem will be to define that interface. I suspect what CharSequence does is not what you want.
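To illustrate the technique, a rough sketch of the RangeMap idea (not the largetext code itself; the 1 GB chunk size and the class shape are just assumptions):

import com.google.common.collect.Range;
import com.google.common.collect.RangeMap;
import com.google.common.collect.TreeRangeMap;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Map;

// Covers a large file with fixed-size mappings and looks bytes up by absolute offset.
public class LargeByteMapping {
    private static final long CHUNK = 1L << 30;   // 1 GB per mapping, below the 2 GB limit
    private final RangeMap<Long, MappedByteBuffer> chunks = TreeRangeMap.create();

    public LargeByteMapping(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            for (long pos = 0; pos < size; pos += CHUNK) {
                long len = Math.min(CHUNK, size - pos);
                chunks.put(Range.closedOpen(pos, pos + len),
                           channel.map(FileChannel.MapMode.READ_ONLY, pos, len));
            }
        }
    }

    public byte byteAt(long offset) {
        Map.Entry<Range<Long>, MappedByteBuffer> entry = chunks.getEntry(offset);
        return entry.getValue().get((int) (offset - entry.getKey().lowerEndpoint()));
    }
}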
Meh, I may even have a go at it some day, largetext is quite exciting a project and this looks like the same kind of thing; except less complicated, ultimately!
One could even imagine a LargeByteMapping implementation where a factory would create such mappings with only a small part of the file in memory and the rest on disk; such an implementation would also exploit the principle of locality: the most recently queried part of the file would be kept in memory for faster access.
EDIT I feel some more explanation is needed here... A MappedByteBuffer will NOT EAT HEAP SPACE!!
It will eat address space only; it is nearly the equivalent of a ByteBuffer.allocateDirect(), except it is backed by a file.
And a very important distinction needs to be made here; all of the text above supposes that you are reading bytes, not characters!
Figure out how to structure the data. Get a good book about NoSQL and find the appropriate database (wide-column, graph, etc.) for your scenario. That's what I'd do. You'd not only have sophisticated query methods on your data, but could also mangle the data using distributed map-reduce implementations, doing whatever you want. Maybe that's what you want (you even dropped the big-data bomb).
How about creating "a dictionary" as the bridge between your program and the target file? Your program will call the dictionary, and the dictionary will refer you to the big fat file.
I'm looking for the fastest approach, in Java, to store ~1 billion records of ~250 bytes each (storage will happen only once) and then being able to read it multiple times in a non-sequential order.
The source records are being generated into simple java value objects and I would like to read them back in the same format.
For now my best guess is to store these objects, using a fast serialization library such as Kryo, in a flat file and then to use a Java FileChannel to make direct random-access reads of the records at specific positions in the file (when storing the data, I will keep a hashmap (also to be saved on disk) with the position in the file of each record, so that I know where to read it).
Also, there is no need to optimize disk space. My key concern is to optimize read performance, while having a reasonable write performance (that, again, will happen only once).
One last detail: while the records are all of the same type (same Java value object), their size (in bytes) is variable (e.g. they contain strings).
Is there any better approach than what I mentioned above? Any hint or suggestion would be greatly appreciated!
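For clarity, here is a rough sketch of the write/read scheme I have in mind (class and field names are just placeholders, and the Kryo serialization step is left out; 'bytes' stands for one already-serialized record):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

public class RecordStore {
    private final Map<Long, Long> positions = new HashMap<>(); // record id -> file offset
    private final FileChannel channel;

    public RecordStore(Path file) throws IOException {
        channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
    }

    // Append one serialized record, remembering where it starts.
    public void write(long recordId, byte[] bytes) throws IOException {
        long offset = channel.size();
        positions.put(recordId, offset);
        ByteBuffer buf = ByteBuffer.allocate(4 + bytes.length);
        buf.putInt(bytes.length);   // length prefix
        buf.put(bytes);             // payload
        buf.flip();
        channel.write(buf, offset);
    }

    // Positioned read of one record, in any order.
    public byte[] read(long recordId) throws IOException {
        long offset = positions.get(recordId);
        ByteBuffer lenBuf = ByteBuffer.allocate(4);
        channel.read(lenBuf, offset);
        lenBuf.flip();
        byte[] bytes = new byte[lenBuf.getInt()];
        channel.read(ByteBuffer.wrap(bytes), offset + 4);
        return bytes;
    }
}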
Many thanks,
Thomas
You can use Apache Lucene; it will take care of everything you have mentioned above :)
It is super fast; you can search results more quickly than ever.
Apache Lucene persists objects in files and indexes them. We have used it in a couple of apps and it is super fast.
You could just use an embedded Derby database. It's written in Java, and you can run it embedded within your process, so there is no overhead of inter-process or networked communication. It will store the data and allow you to query it, handling all the complexity and indexing for you.
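As a rough sketch of what embedded use looks like (the table, columns and data are made up; Derby's embedded driver is picked up from the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class DerbyExample {
    public static void main(String[] args) throws Exception {
        // "recordsdb" is created as a local directory inside your process.
        try (Connection conn = DriverManager.getConnection("jdbc:derby:recordsdb;create=true")) {
            try (PreparedStatement create = conn.prepareStatement(
                    "CREATE TABLE records (id BIGINT PRIMARY KEY, payload VARCHAR(512))")) {
                create.execute();   // first run only
            }
            try (PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO records VALUES (?, ?)")) {
                insert.setLong(1, 1L);
                insert.setString(2, "example payload");
                insert.executeUpdate();
            }
            try (PreparedStatement query = conn.prepareStatement(
                    "SELECT payload FROM records WHERE id = ?")) {
                query.setLong(1, 1L);
                try (ResultSet rs = query.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("payload"));
                    }
                }
            }
        }
    }
}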
I have a List of Strings I need to store locally (assume the list can run between 10 items to 100 items). I want to know if I should write the lists into a Flat database or use Serialization to flatten the object containing the list? Which is more expensive (CPU-wise)? What are the conditions that make one more expensive than the other?
Thanks!!
Especially since they are Strings, just write them out one per line to a file. Simple, fast, and far easier to test.
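Something like this (a sketch; the file name is a placeholder, and it assumes no string contains a newline):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class SaveStrings {
    public static void main(String[] args) throws Exception {
        List<String> strings = Arrays.asList("alpha", "beta", "gamma"); // your list
        Path file = Paths.get("strings.txt");                           // placeholder file name
        Files.write(file, strings, StandardCharsets.UTF_8);             // one string per line
        List<String> loaded = Files.readAllLines(file, StandardCharsets.UTF_8);
        System.out.println(loaded);
    }
}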
I have a List of Strings I need to store locally (assume the list can run between 10 items to 100 items).
Assuming that the total length of the strings is small (e.g. less than 10K), the user-space CPU time used to do the saving is likely to be a few milliseconds using either serialization or a flat file. In other words, it will be so fast that the user won't notice the difference.
You should be looking at the other reasons for choosing between the two alternatives (and others):
How easy is it to write the code.
How many extra dependencies does the alternative pull in.
Human readability / editability of the saved data file ... in case you need to do this.
How easy / hard it would be to change the "schema" of the stuff saved to file ... in case you need to do this.
Whether you can update one string without rewriting the whole file ... if this is relevant.
Support for other things such as atomic update, transactions, complex queries, etc ... if these are relevant.
And if, despite what I said above, you still want to know which will be faster (and by how much), then benchmark it. The real world performance will depend on factors that you haven't specified.
Here are a couple of important references on how to write a Java benchmark so that it gives meaningful results.
How NOT to write a Java micro-benchmark
Robust Java benchmarking, Part 1: Issues.
Robust Java benchmarking, Part 2: Statistics and solutions
And you can experiment to answer this part of your question:
What are the conditions that make one more expensive than the other?
(See above)
I am not sure about the exact expense, but the serialized object representation often contains a whole lot of metadata (and structure), which can result in a much bigger output than the original data. An example of this is when you store an XML structure in a DOM object: it takes about 4x the memory of the original data.
Based on the above, I think serializing as an object might be more expensive. You may also want to consider how the end product will be consumed: if you want the produced file to be human-readable, you will have to write the String data out as readable text anyway.
My engine is executing 1,000,000 simulations on X deals. During each simulation, for each deal, a specific condition may be verified. In this case, I store the value (which is a double) into an array. Each deal will have its own list of values (i.e. these values are independent from one deal to another).
At the end of all the simulations, for each deal, I run an algorithm on its List<Double> to get some outputs. Unfortunately, this algorithm requires the complete list of these values, and thus I am not able to modify my algorithm to calculate the outputs "on the fly", i.e. during the simulations.
In "normal" conditions (i.e. X is low, and the condition is verified less than 10% of the time), the calculation completes correctly, even if this could be improved.
My problem occurs when I have many deals (for example X = 30) and almost all of my simulations verify my specific condition (let's say 90% of simulations). Just to store the values, I need about 900,000 * 30 * 64 bits of memory (about 216 MB). One of my future requirements is to be able to run 5,000,000 simulations...
So I can't continue with my current way of storing the values. For the moment, I used a "simple" structure of Map<String, List<Double>>, where the key is the ID of the element, and List<Double> the list of values.
So my question is how can I enhance this specific part of my application in order to reduce the memory usage during the simulations?
Also another important note is that for the final calculation, my List<Double> (or whatever structure I will be using) must be ordered. So if the solution to my previous question also provide a structure that order the new inserted element (such as a SortedMap), it will be really great!
I am using Java 1.6.
Edit 1
My engine is executing some financial calculations indeed, and in my case, all deals are related. This means that I cannot run my calculations on the first deal, get the output, clean the List<Double>, and then move to the second deal, and so on.
Of course, as a temporary solution, we will increase the memory allocated to the engine, but it's not the solution I am expecting ;)
Edit 2
Regarding the algorithm itself. I can't give the exact algorithm here, but here are some hints:
We must work on a sorted List<Double>. I will then calculate an index (which is calculated against a given parameter and the size of the List itself). Then, I finally return the index-th value of this List.
public static double algo(double input, List<Double> sortedList) {
if (someSpecificCases) {
return 0;
}
// Calculate the index value, using input and also size of the sortedList...
double index = ...;
// Specific case where I return the first item of my list.
if (index == 1) {
return sortedList.get(0);
}
// Specific case where I return the last item of my list.
if (index == sortedList.size()) {
return sortedList.get(sortedList.size() - 1);
}
// Here, I need the index-th value of my list...
double val = sortedList.get((int) index);
double finalValue = someBasicCalculations(val);
return finalValue;
}
I hope it will help to have such information now...
Edit 3
Currently, I will not consider any hardware modification (too long and complicated here :( ). The solution of increasing the memory will be done, but it's just a quick fix.
I was thinking of a solution that use a temporary file: Until a certain threshold (for example 100,000), my List<Double> stores new values in memory. When the size of List<Double> reaches this threshold, I append this list in the temporary file (one file per deal).
Something like that:
public void addNewValue(double v) {
if (list.size() == 100000) {
appendListInFile();
list.clear();
}
list.add(v);
}
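For the appendListInFile() part, I imagine something like this (using java.io.DataOutputStream and java.io.FileOutputStream; tempFileForDeal would be a per-deal temporary File field; just a sketch, in Java 6 style since I am on 1.6):

// Hypothetical helper: append the buffered values to this deal's temporary file
// as raw 8-byte doubles (readable back later with DataInputStream.readDouble()).
private void appendListInFile() throws IOException {
    DataOutputStream out = new DataOutputStream(
            new FileOutputStream(tempFileForDeal, true));   // true = append mode
    try {
        for (Double v : list) {
            out.writeDouble(v);
        }
    } finally {
        out.close();
    }
}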
At the end of the whole calculation, for each deal, I will reconstruct the complete List<Double> from what I have in memory and also in the temporary file. Then, I run my algorithm. I clean the values for this deal, and move to the second deal (I can do that now, as all the simulations are now finished).
What do you think of such solution? Do you think it is acceptable?
Of course I will lose some time to read and write my values in an external file, but I think this can be acceptable, no?
Your problem is algorithmic and you are looking for a "reduction in strength" optimization.
Unfortunately, you've been too coy in the problem description and say "Unfortunately, this algorithm requires the complete list of these values...", which is dubious. The simulation run has already passed a predicate, which in itself tells you something about the sets that pass through the sieve.
I expect the data that meets the criteria has a low information content and therefore is amenable to substantial compression.
Without further information, we really can't help you more.
You mentioned that the "engine" is not connected to a database, but have you considered using a database to store the lists of elements? Possibly an embedded DB such as SQLite?
If you used int or even short instead of string for the key field of your Map, that might save some memory.
If you need a collection object that guarantees order, then consider a Queue or a Stack instead of the List you are currently using.
Possibly think of a way to run deals sequentially, as Dommer and Alan have already suggested.
I hope that was of some help!
EDIT:
Your comment about only having 30 keys is a good point.
In that case, since you have to calculate all your deals at the same time, have you considered serializing your Lists to disk (e.g. as XML)?
Or even just writing a text file to disk for each List, then after the deals are calculated, loading one file/List at a time to verify that List of conditions?
Of course the disadvantage is slow file I/O, but this would reduce your server's memory requirement.
Can you get away with using floats instead of doubles? That would save you about 100 MB.
Just to clarify, do you need ALL of the information in memory at once? It sounds like you are doing financial simulations (maybe credit risk?). Say you are running 30 deals: do you need to store all of the values in memory? Or can you run the first deal (~900,000 * 64 bits), then discard the list of doubles (serialize it to disk or something) and then proceed with the next? I thought this might be okay, as you say the deals are independent of one another.
Apologies if this sounds patronising; I'm just trying to get a proper idea of the problem.
The flippant answer is to get a bunch more memory. Sun JVMs can (almost happily) handle multi-gigabyte heaps, and if it's a batch job then longer GC pauses might not be a massive issue.
If you decide that this is not a sane solution, the first thing to attempt would be to write a custom list-like collection that stores primitive doubles instead of wrapper Double objects. This will save the per-object overhead you pay for each Double wrapper. I think the Apache Commons Collections project had primitive collection implementations; these might be a starting point.
Another level would be to maintain the list of doubles in an NIO Buffer off-heap. This has the advantage that the space used for the data is not considered in GC runs, and could in theory lead you down the road of managing the data structure in a memory-mapped file.
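A minimal sketch of the off-heap idea (the capacity is a placeholder; 900,000 doubles per deal would need about 7 MB per buffer):

import java.nio.ByteBuffer;
import java.nio.DoubleBuffer;

public class OffHeapDoubles {
    public static void main(String[] args) {
        int maxValues = 900000;
        // 8 bytes per double, allocated outside the Java heap (not scanned by GC).
        DoubleBuffer values = ByteBuffer
                .allocateDirect(maxValues * 8)
                .asDoubleBuffer();

        values.put(3.14);   // append values as the simulations produce them
        values.put(2.71);

        values.flip();      // switch from writing to reading
        while (values.hasRemaining()) {
            System.out.println(values.get());
        }
    }
}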
From your description, it appears you will not be able to easily improve your memory usage. The size of a double is fixed, and if you need to retain all results until your final processing, you will not be able to reduce the size of that data.
If you need to reduce your memory usage, but can accept a longer run time, you could replace the Map<String, List<Double>> with a List<Double> and only process a single deal at a time.
If you have to have all the values from all the deals, your only option is to increase your available memory. Your calculation of the memory usage is based on just the size of a value and the number of values. Without a way to decrease the number of values you need, no data structure will be able to help you, you just need to increase your available memory.
From what you tell us, it sounds like you need 10^6 x 30 processors (i.e. the number of simulations multiplied by the number of deals), each with a few KB of RAM. Perhaps, though, you don't have that many processors -- do you have 30, each of which has sufficient memory for the simulations for one deal?
Seriously: parallelise your program and buy an 8-core computer with 32GB RAM (or 16-core w 64GB or ...). You are going to have to do this sooner or later, might as well do it now.
There was a technique I read about a while ago where you would write the data to disk and only read/write the chunk you need. Of course this describes virtual memory, but the difference here is that the programmer controls the flow and location rather than the OS. The advantage is that the OS is only allocated so much virtual memory to use, whereas you have access to the whole HD.
Or an easier option is just to increase your swap/paged memory, which I think would be silly but would help in your case.
After a quick google it seems like this function might help you if you are running on Windows:
http://msdn.microsoft.com/en-us/library/aa366537(VS.85).aspx
You say you need access to all the values, but you cannot possibly operate on all of them at once? Can you serialize the data so that you can store it in a single file, with each record set apart either by some delimiter, a key value, or simply the byte count? Keep a byte counter either way.
Let that be a "circular file" composed of a left file and a right file operating like opposing stacks. As data is popped (read) off the left file, it is processed and pushed (written) into the right file. If your next operation requires a previously processed value, reverse the direction of the file transfer. Think of your algorithm as residing at the read/write head of your hard drive. You have access as you would with a list, just using different methods and at much reduced speed. The speed hit will be significant, but if you can optimize your sequence of serialization so that the most likely accessed data is at the top of the file in order of use, and possibly put the left and right files on different physical drives and your page file on a third drive, you will benefit from increased hard disk performance due to sequential and simultaneous reads and writes.
Of course it's a bit harder than it sounds. Each change of direction requires finalizing both files. Logically, something like:
if (data currently flows left to right) { send EOF to right_file; left_file = left_file - right_file; }
Practically, you would want to leave all data in place where it physically resides on the drive and just manipulate the beginning and ending addresses for the files in the master file table, literally operating like a pair of hard disk stacks. This will be a much slower, more complicated process than simply adding more memory, but very much more efficient than separate files and all the overhead of 1 file per record * millions of records. Or just put all your data into a database.
FWIW, this idea just came to me. I've never actually done it or even heard of it being done, but I imagine someone must have thought of it before me. If not, please let me know. I could really use the credit on my resume.
One solution would be to format the doubles as strings and then add them to a (fast) key-value store that orders entries by design.
Then you would only have to read sequentially from the store.
Here is a store that 'naturally' sorts entries as they are inserted.
And they boast that they are doing it at the rate of 100 million entries per second (searching is almost twice as fast):
http://forum.gwan.com/index.php?p=/discussion/comment/897/#Comment_897
With an API of only 3 calls, it should be easy to test.
A fourth call will provide range-based searches.