I am pretty sure a similar discussion has already taken place here, but I want to present the exact problem I am facing along with my proposed solutions. Then I want to hear from you which approach would be better, or how I can improve my logic.
PROBLEM
I have a huge file containing lines in the format <weight>,<some_name>. What I have to do is add up the weights of all the objects that have the same name. The problem is:
I don't know how frequently a given some_name appears in the file; it could occur only once, or every one of the millions of lines could have it
The file is not ordered
I am reading with a FileInputStream (Java-specific, but it doesn't really matter)
SOLUTION 1: Assuming I have plenty of RAM, what I plan to do is read the file line by line and use the name as the key in my hash map. If the key is already there, add the weight to it; otherwise insert it. It will cost me m in RAM (m = number of lines in the file), but overall processing would be fast.
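Something like this minimal sketch of Solution 1 (the file name and the UTF-8 charset are placeholders, and it assumes every line really is <weight>,<some_name> with no header):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

public class WeightSummer {
    public static void main(String[] args) throws IOException {
        Map<String, Double> totals = new HashMap<String, Double>();
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream("input.txt"), "UTF-8"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                int comma = line.indexOf(',');
                double weight = Double.parseDouble(line.substring(0, comma));
                String name = line.substring(comma + 1);
                // Sum into the existing entry, or insert a new one
                Double current = totals.get(name);
                totals.put(name, current == null ? weight : current + weight);
            }
        } finally {
            reader.close();
        }
        for (Map.Entry<String, Double> e : totals.entrySet()) {
            System.out.println(e.getValue() + "," + e.getKey());
        }
    }
}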
SOLUTION 2: Assuming I don't have much RAM, I plan to work in batches: read the first 10,000 lines into a hashtable, sum them up, and dump the result to a file. Do the same for the rest of the file. Once the whole file has been processed, I start reading the processed files and repeat the process to sum everything up.
What do you guys suggest here ?
Besides your suggestions, can I read the file in parallel? I have access to a FileInputStream here; can I work with it to make reading the file more efficient?
The second approach is not going to help you: in order to produce the final output, you need a sufficient amount of RAM to hold all the keys from the file, along with a single number per key representing the running total. Whether you get there in one big step or in several iterations of 10K rows at a time does not change the footprint you will need at the end.
What would help is partitioning the keys in some way, e.g. by the first character of the key. If the name starts with a letter, process the file 26 times: the first time taking only the weights for keys starting with 'A' and ignoring all other keys, the second time taking only the 'B's, and so on. This will let you end up with 26 result files that do not intersect.
Another valid approach would be using an external sorting algorithm to transform an unordered file to an ordered one. This would let you walk the ordered file, calculate totals as you go, and write them to an output, even without the need for an in-memory table.
As far as optimizing the I/O goes, I would recommend using the newBufferedReader(Path path, Charset c) method of the java.nio.file.Files class: it gives you a BufferedReader that is optimized for reading efficiency.
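For example, a minimal usage sketch (the path and charset are placeholders; requires Java 7+ for java.nio.file):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadLines {
    public static void main(String[] args) throws IOException {
        // try-with-resources closes the reader even if an exception occurs
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get("huge-file.txt"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // process each line here
            }
        }
    }
}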
Is the file static when you do this computation? If so, then you could disk sort the file based on the name and add up the consecutive entries.
So I've got these huge text files that are filled with a single comma-delimited record per line. I need a way to process the files line by line, removing lines that meet certain criteria. Some of the removals are easy, such as one of the fields being less than a certain length. The hardest criterion is that these lines all have timestamps: many records are identical except for their timestamps, and I have to remove all but one of any records that are identical and within 15 seconds of one another.
So I'm wondering if others can come up with the best approach for this. I did come up with a small program in Java that accomplishes the task, using JodaTime for the timestamp handling, which makes it really easy. However, the way I initially coded the program was running into OutOfMemoryError (heap space) errors. I refactored the code a bit and it seemed OK for the most part, but I still believe it has some memory issues, as once in a while the program just seems to hang. It also seems to take way too long. I'm not sure if this is a memory leak issue, a poor coding issue, or something else entirely. And yes, I tried increasing the heap size significantly but was still having issues.
I will say that the program needs to be in either Perl or Java. I might be able to make a python script work too but I'm not overly familiar with python. As I said, the timestamp stuff is easiest (to me) in Java because of the JodaTime library. I'm not sure how I'd accomplish the timestamp stuff in Perl. But I'm up for learning and using whatever would work best.
I will also add that the files being read in vary tremendously in size, but some big ones are around 100 MB with something like 1.3 million records.
My code essentially reads in all the records and puts them into a Hashmap with the keys being a specific subset of the data from a record that similar records would share. So a subset of the record not including the timestamps which would be different. This way you'd end up with some number of records with identical data but that occurred at different times. (So completely identical minus the timestamps).
The value of each key then, is a Set of all records that have the same subset of data. Then I simply iterate through the Hashmap, taking each set and iterating through it. I take the first record and compare its times to all the rest to see if they're within 15 seconds. If so the record is removed. Once that set is finished it's written out to a file until all the records have been gone through. Hopefully that makes sense.
This works but clearly the way I'm doing it is too memory intensive. Anyone have any ideas on a better way to do it? Or, a way I can do this in Perl would actually be good because trying to insert the Java program into the current implementation has caused a number of other headaches. Though perhaps that's just because of my memory issues and poor coding.
Finally, I'm not asking someone to write the program for me. Pseudo code is fine. Though if you have ideas for Perl I could use more specifics. The main thing I'm not sure how to do in Perl is the time comparison stuff. I've looked a little into Perl libraries but haven't seen anything like JodaTime (though I haven't looked much). Any thoughts or suggestions are appreciated. Thank you.
Reading all the rows in is not ideal, because you need to store the whole lot in memory.
Instead you could read line by line, writing out the records that you want to keep as you go. You could keep a cache of the rows you've hit previously, bounded to be within 15 seconds of the current line. In very rough pseudo-code, for every line you'd read:
var line = ReadLine()
DiscardAnythingInCacheOlderThan(line.Date().Minus(15 seconds));
if (!cache.ContainsSomethingMatchingCriteria()) {
    // it's a line we want to keep
    WriteLine(line);
}
UpdateCache(line); // make sure we store this line so we don't write it out again.
As pointed out, this assumes that the lines are in time stamp order. If they aren't, then I'd just use UNIX sort to make it so they are, as that'll quite merrily handle extremely large files.
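If you want that sketch concrete in Java, here is a rough version using JodaTime (the timestamp being the first comma-separated field, its format, and the file names are all assumptions; adapt them to your record layout). It also assumes the input is already sorted by timestamp:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

import org.joda.time.DateTime;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;

public class DedupWithin15Seconds {

    // Assumed layout: the timestamp is the first comma-separated field.
    private static final DateTimeFormatter FMT =
            DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss");

    public static void main(String[] args) throws IOException {
        // Cache of "record without timestamp" -> timestamp of the copy we kept.
        // Insertion order roughly follows timestamp order because the input is sorted.
        Map<String, DateTime> cache = new LinkedHashMap<String, DateTime>();

        BufferedReader in = new BufferedReader(new FileReader("sorted-input.txt"));
        PrintWriter out = new PrintWriter(new FileWriter("output.txt"));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                int firstComma = line.indexOf(',');
                DateTime ts = FMT.parseDateTime(line.substring(0, firstComma));
                String key = line.substring(firstComma + 1); // everything but the timestamp

                // Evict cache entries more than 15 seconds older than the current line.
                DateTime cutoff = ts.minusSeconds(15);
                Iterator<Map.Entry<String, DateTime>> it = cache.entrySet().iterator();
                while (it.hasNext() && it.next().getValue().isBefore(cutoff)) {
                    it.remove();
                }

                if (!cache.containsKey(key)) {
                    out.println(line); // keep this record
                }
                cache.put(key, ts);    // remember it either way
            }
        } finally {
            in.close();
            out.close();
        }
    }
}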
You might read the file and output just the line numbers to be deleted (to be sorted and used in a separate pass.) Your hash map could then contain just the minimum data needed plus the line number. This could save a lot of memory if the data needed is small compared to the line size.
I have a big file with rows like ID|VALUE that I need to process in one pass.
If an ID repeats, the line must be ignored.
How can I do this check efficiently?
added:
The ID is a long (8 bytes). I need a solution that uses a minimum of memory.
Thanks for the help, guys. I was able to increase the heap space and use a Set now.
You can store the data in a TLongObjectHashMap or use a TLongHashSet. These classes store primitive-based information efficiently.
5 million long values will use < 60 MB in a TLongHashSet; a TLongObjectHashMap will also store your values efficiently.
To find out more about these classes
http://www.google.co.uk/search?q=TLongHashSet
http://www.google.co.uk/search?q=TLongObjectHashMap
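For example, a minimal sketch with GNU Trove (the package name is for Trove 3.x; the file name and the ID|VALUE split are assumptions):

import gnu.trove.set.hash.TLongHashSet;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class DedupByLongId {
    public static void main(String[] args) throws IOException {
        TLongHashSet seen = new TLongHashSet();

        BufferedReader in = new BufferedReader(new FileReader("input.txt"));
        PrintWriter out = new PrintWriter(new FileWriter("deduped.txt"));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                long id = Long.parseLong(line.substring(0, line.indexOf('|')));
                // add() returns false if the id was already present
                if (seen.add(id)) {
                    out.println(line);
                }
            }
        } finally {
            in.close();
            out.close();
        }
    }
}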
You'll have to store the IDs somewhere anyway in order to detect duplicates. Here I'd use a HashSet<String> and its contains method.
You have to read the entire file, one line at a time. You have to keep a Set of IDs and compare the incoming one to the values already in the Set. If a value appears, skip that line.
You wrote the use case yourself; there's no magic here.
This looks like a typical database task to me. If you have a database used in your app, you could utilize that to do your task. Create a table with a UNIQUE INTEGER field and start adding rows; you'll get an exception on the duplicated IDs. The database engine will take care of cursor windowing and caching so it fits in your memory budget. Then just drop that table when you're done.
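A rough sketch of that idea with plain JDBC (the H2 URL is just a placeholder for whatever embedded database you use, and the exact exception raised on a duplicate key varies by driver):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

public class DedupViaDatabase {
    public static void main(String[] args) throws SQLException {
        // Placeholder URL: any embedded database (H2, Derby, SQLite...) works the same way
        Connection conn = DriverManager.getConnection("jdbc:h2:./dedup");
        Statement ddl = conn.createStatement();
        ddl.execute("CREATE TABLE seen_ids (id BIGINT PRIMARY KEY)");

        PreparedStatement insert = conn.prepareStatement("INSERT INTO seen_ids (id) VALUES (?)");
        long[] ids = {42L, 7L, 42L}; // stand-in for the IDs parsed from the file
        for (long id : ids) {
            try {
                insert.setLong(1, id);
                insert.executeUpdate();
                // first time we see this id: keep the line
            } catch (SQLException duplicate) {
                // unique constraint violated: id already seen, skip the line
            }
        }
        ddl.execute("DROP TABLE seen_ids");
        conn.close();
    }
}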
There are two basic solutions:
First, as suggested by duffymo and Andreas_D above, you can store all the values in a Set. This gives you O(n) time complexity and O(n) memory usage.
Second, if O(n) memory is too much, you can do it in O(1) memory by sacrificing speed. For each line in the file, read all other lines before it and discard if the ID appears before the current line.
What about probabilistic algorithms?
The Bloom filter ... is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not.
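For example, a sketch with Guava's BloomFilter (keep the caveat in mind: a false positive would wrongly drop a non-duplicate line, so either accept that small error rate or confirm hits against an exact store; the sizing numbers are assumptions):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class BloomDedupSketch {
    public static void main(String[] args) {
        // Expect up to 10 million ids with a 1% false-positive rate (roughly 11-12 MB)
        BloomFilter<Long> seen = BloomFilter.create(Funnels.longFunnel(), 10_000_000, 0.01);

        long[] ids = {42L, 7L, 42L}; // stand-in for ids parsed from the file
        for (long id : ids) {
            if (seen.mightContain(id)) {
                // probably a duplicate (small chance of a false positive): skip the line
                continue;
            }
            seen.put(id);
            // ... keep / process the line ...
        }
    }
}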
My engine executes 1,000,000 simulations on X deals. During each simulation, for each deal, a specific condition may be verified. In that case, I store the value (which is a double) in an array. Each deal has its own list of values (i.e. these values are independent from one deal to another).
At the end of all the simulations, for each deal, I run an algorithm on its List<Double> to get some outputs. Unfortunately, this algorithm requires the complete list of values, and thus I am not able to modify my algorithm to calculate the outputs "on the fly", i.e. during the simulations.
In "normal" conditions (i.e. X is low and the condition is verified less than 10% of the time), the calculation completes correctly, even if it could be improved.
My problem occurs when I have many deals (for example X = 30) and almost all of my simulations verify the specific condition (let's say 90% of simulations). Just to store the values, I then need about 900,000 * 30 * 64 bits of memory (about 216 MB). One of my future requirements is to be able to run 5,000,000 simulations...
So I can't continue with my current way of storing the values. For the moment, I use a "simple" structure of Map<String, List<Double>>, where the key is the ID of the element and List<Double> is the list of values.
So my question is how can I enhance this specific part of my application in order to reduce the memory usage during the simulations?
Also another important note is that for the final calculation, my List<Double> (or whatever structure I will be using) must be ordered. So if the solution to my previous question also provide a structure that order the new inserted element (such as a SortedMap), it will be really great!
I am using Java 1.6.
Edit 1
My engine is executing some financial calculations indeed, and in my case, all deals are related. This means that I cannot run my calculations on the first deal, get the output, clean the List<Double>, and then move to the second deal, and so on.
Of course, as a temporary solution, we will increase the memory allocated to the engine, but it's not the solution I am expecting ;)
Edit 2
Regarding the algorithm itself. I can't give the exact algorithm here, but here are some hints:
We must work on a sorted List<Double>. I will then calculate an index (which is calculated against a given parameter and the size of the List itself). Then, I finally return the index-th value of this List.
public static double algo(double input, List<Double> sortedList) {
    if (someSpecificCases) {
        return 0;
    }

    // Calculate the index value, using input and also size of the sortedList...
    double index = ...;

    // Specific case where I return the first item of my list.
    if (index == 1) {
        return sortedList.get(0);
    }

    // Specific case where I return the last item of my list.
    if (index == sortedList.size()) {
        return sortedList.get(sortedList.size() - 1);
    }

    // Here, I need the index-th value of my list...
    double val = sortedList.get((int) index);

    double finalValue = someBasicCalculations(val);
    return finalValue;
}
I hope it will help to have such information now...
Edit 3
Currently, I will not consider any hardware modification (too long and complicated here :( ). The solution of increasing the memory will be done, but it's just a quick fix.
I was thinking of a solution that use a temporary file: Until a certain threshold (for example 100,000), my List<Double> stores new values in memory. When the size of List<Double> reaches this threshold, I append this list in the temporary file (one file per deal).
Something like that:
public void addNewValue(double v) {
    if (list.size() == 100000) {
        appendListInFile();
        list.clear();
    }
    list.add(v);
}
At the end of the whole calculation, for each deal, I will reconstruct the complete List<Double> from what I have in memory and also in the temporary file. Then, I run my algorithm. I clean the values for this deal, and move to the second deal (I can do that now, as all the simulations are now finished).
What do you think of such solution? Do you think it is acceptable?
Of course I will lose some time to read and write my values in an external file, but I think this can be acceptable, no?
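For what it's worth, here is a sketch of what appendListInFile() and the read-back might look like, using DataOutputStream/DataInputStream so each value costs exactly 8 bytes on disk (the class name, file naming, and the threshold are placeholders):

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class SpillableDoubleList {
    private static final int THRESHOLD = 100000;

    private final List<Double> buffer = new ArrayList<Double>();
    private final File spillFile;

    public SpillableDoubleList(String dealId) {
        this.spillFile = new File("deal-" + dealId + ".bin");
    }

    public void addNewValue(double v) throws IOException {
        if (buffer.size() == THRESHOLD) {
            appendListInFile();
            buffer.clear();
        }
        buffer.add(v);
    }

    // Append the in-memory values to the deal's temporary file (8 bytes per value).
    private void appendListInFile() throws IOException {
        DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(spillFile, true)));
        try {
            for (double d : buffer) {
                out.writeDouble(d);
            }
        } finally {
            out.close();
        }
    }

    // Reconstruct the complete list (spilled values plus what is still in memory).
    public List<Double> readAll() throws IOException {
        List<Double> all = new ArrayList<Double>();
        if (spillFile.exists()) {
            DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(spillFile)));
            try {
                while (true) {
                    all.add(in.readDouble());
                }
            } catch (EOFException endOfFile) {
                // reached the end of the spill file
            } finally {
                in.close();
            }
        }
        all.addAll(buffer);
        return all;
    }
}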
Your problem is algorithmic and you are looking for a "reduction in strength" optimization.
Unfortunately, you've been too coy in the problem description and say "Unfortunately, this algorithm requires the complete list of these values...", which is dubious. The simulation run has already passed a predicate, which in itself tells you something about the sets that pass through the sieve.
I expect the data that meets the criteria has a low information content and therefore is amenable to substantial compression.
Without further information, we really can't help you more.
You mentioned that the "engine" is not connected to a database, but have you considered using a database to store the lists of elements? Possibly an embedded DB such as SQLite?
If you used int or even short instead of string for the key field of your Map, that might save some memory.
If you need a collection object that guarantees order, then consider a Queue or a Stack instead of your List that you are currently using.
Possibly think of a way to run deals sequentially, as Dommer and Alan have already suggested.
I hope that was of some help!
EDIT:
Your comment about only having 30 keys is a good point.
In that case, since you have to calculate all your deals at the same time, then have you considered serializing your Lists to disk (i.e. XML)?
Or even just writing a text file to disk for each List, then after the deals are calculated, loading one file/List at a time to verify that List of conditions?
Of course the disadvantage is slow file I/O, but this would reduce your server's memory requirement.
Can you get away with using floats instead of doubles? That would save you about 100 MB.
Just to clarify, do you need ALL of the information in memory at once? It sounds like you are doing financial simulations (maybe credit risk?). Say you are running 30 deals, do you need to store all of the values in memory? Or can you run the first deal (~900,000 * 64bits), then discard the list of double (serialize it to disk or something) and then proceed with the next? I thought this might be okay as you say the deals are independent of one another.
Apologies if this sounds patronising; I'm just trying to get a proper idea of the problem.
The flippant answer is to get a bunch more memory. Sun JVMs can (almost happily) handle multi-gigabyte heaps, and if it's a batch job then longer GC pauses might not be a massive issue.
You may decide that this is not a sane solution. The first thing to attempt would be to write a custom list-like collection that stores primitive doubles instead of Double object wrappers. This will save the per-object overhead you pay for each Double wrapper. I think the Apache Commons Collections project has primitive collection implementations; these might be a starting point.
Another level would be to maintain the list of doubles in an NIO buffer off-heap. This has the advantage that the space used for the data is not considered during GC runs, and it could in theory lead you down the road of managing the data structure in a memory-mapped file.
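A minimal sketch of the off-heap idea, using a direct ByteBuffer viewed as a DoubleBuffer (the capacity is an assumption; growing the buffer or backing it with a memory-mapped file is left out):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;

public class OffHeapDoubles {
    public static void main(String[] args) {
        // Room for 1,000,000 doubles (8 MB) allocated outside the Java heap
        DoubleBuffer values = ByteBuffer
                .allocateDirect(1_000_000 * 8)
                .order(ByteOrder.nativeOrder())
                .asDoubleBuffer();

        values.put(3.14); // append values during the simulations
        values.put(2.71);

        values.flip();    // switch from writing to reading
        while (values.hasRemaining()) {
            double d = values.get();
            System.out.println(d);
        }
    }
}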
From your description, it appears you will not be able to easily improve your memory usage. The size of a double is fixed, and if you need to retain all results until your final processing, you will not be able to reduce the size of that data.
If you need to reduce your memory usage, but can accept a longer run time, you could replace the Map<String, List<Double>> with a List<Double> and only process a single deal at a time.
If you have to have all the values from all the deals, your only option is to increase your available memory. Your calculation of the memory usage is based on just the size of a value and the number of values. Without a way to decrease the number of values you need, no data structure will be able to help you, you just need to increase your available memory.
From what you tell us it sounds like you need 10^6 x 30 processors (ie number of simulations multiplied by number of deals) each with a few K RAM. Perhaps, though, you don't have that many processors -- do you have 30 each of which has sufficient memory for the simulations for one deal ?
Seriously: parallelise your program and buy an 8-core computer with 32GB RAM (or 16-core w 64GB or ...). You are going to have to do this sooner or later, might as well do it now.
There was a technique I read about a while ago where you write the data to disk and only read/write the chunk you need at the time. Of course this describes virtual memory, but the difference here is that the programmer controls the flow and location rather than the OS. The advantage is that the OS only allocates so much virtual memory for you, whereas you have access to the whole hard drive.
Or an easier option is just to increase your swap/paged memory, which I think would be silly but would help in your case.
After a quick google it seems like this function might help you if you are running on Windows:
http://msdn.microsoft.com/en-us/library/aa366537(VS.85).aspx
You say you need access to all the values, but you cannot possibly operate on all of them at once? Can you serialize the data so that you can store it in a single file, with each record set separated by some delimiter, a key value, or simply a byte count? Keep a byte counter either way.
Let that be a "circular file" composed of a left file and a right file operating like opposing stacks. As data is popped (read) off the left file, it is processed and pushed (written) into the right file. If your next operation requires a previously processed value, reverse the direction of the file transfer. Think of your algorithm as residing at the read/write head of your hard drive. You have access just as you would with a list, only using different methods and at much reduced speed.
The speed hit will be significant, but if you can optimize your sequence of serialization so that the most likely accessed data is at the top of the file in order of use, and possibly put the left and right files on different physical drives and your page file on a third drive, you will benefit from increased hard disk performance due to sequential and simultaneous reads and writes. Of course it's a bit harder than it sounds: each change of direction requires finalizing both files. Logically, something like:
if (current data flow is left to right) { send EOF to right_file; left_file = left_file - right_file; }
Practically, you would want to leave all data in place where it physically resides on the drive and just manipulate the beginning and ending addresses of the files in the master file table, literally operating like a pair of hard-disk stacks. This will be a much slower, more complicated process than simply adding more memory, but much more efficient than separate files and all the overhead of one file per record times millions of records. Or just put all your data into a database.
FWIW, this idea just came to me. I've never actually done it or even heard of it being done, but I imagine someone must have thought of it before me. If not, please let me know; I could really use the credit on my resume.
One solution would be to format the doubles as strings and then add them to a (fast) key-value store which is ordered by design.
Then you would only have to read sequentially from the store.
Here is a store that 'naturally' sorts entries as they are inserted.
And they boast that they are doing it at the rate of 100 million entries per second (searching is almost twice as fast):
http://forum.gwan.com/index.php?p=/discussion/comment/897/#Comment_897
With an API of only 3 calls, it should be easy to test.
A fourth call will provide range-based searches.
Points:
We process thousands of flat files in a day, concurrently.
Memory constraint is a major issue.
We use thread for each file process.
We don't sort by columns. Each line (record) in the file is treated as one column.
Can't Do:
We cannot use unix/linux's sort commands.
We cannot use any database system no matter how light they can be.
Now, we cannot just load everything into a collection and use its sort mechanism: it will eat up all the memory and the program is going to hit a heap error (OutOfMemoryError).
In that situation, how would you sort the records/lines in a file?
It looks like what you are looking for is external sorting.
Basically, you sort small chunks of data first, write it back to the disk and then iterate over those to sort all.
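A compact sketch of that idea in Java (Java 8 for the lambda comparator): sort fixed-size chunks into temp files, then k-way merge them with a PriorityQueue. The chunk size and file handling are placeholders, and it assumes plain lexicographic line ordering:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalSort {
    private static final int CHUNK_SIZE = 100_000; // lines per in-memory chunk

    public static void sort(String inputFile, String outputFile) throws IOException {
        // Phase 1: read chunks, sort each in memory, write each to its own temp file.
        List<File> chunks = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(inputFile))) {
            List<String> lines = new ArrayList<>(CHUNK_SIZE);
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
                if (lines.size() == CHUNK_SIZE) {
                    chunks.add(writeSortedChunk(lines));
                    lines.clear();
                }
            }
            if (!lines.isEmpty()) {
                chunks.add(writeSortedChunk(lines));
            }
        }

        // Phase 2: k-way merge. The priority queue holds one "current line" per chunk.
        PriorityQueue<ChunkReader> queue = new PriorityQueue<>(
                (a, b) -> a.current.compareTo(b.current));
        for (File chunk : chunks) {
            ChunkReader reader = new ChunkReader(chunk);
            if (reader.current != null) {
                queue.add(reader);
            } else {
                reader.close();
            }
        }
        try (PrintWriter out = new PrintWriter(new FileWriter(outputFile))) {
            while (!queue.isEmpty()) {
                ChunkReader smallest = queue.poll();
                out.println(smallest.current);
                if (smallest.advance()) {
                    queue.add(smallest); // re-insert with its next line
                } else {
                    smallest.close();
                }
            }
        }
    }

    private static File writeSortedChunk(List<String> lines) throws IOException {
        Collections.sort(lines);
        File temp = File.createTempFile("chunk", ".txt");
        temp.deleteOnExit();
        try (PrintWriter out = new PrintWriter(new FileWriter(temp))) {
            for (String l : lines) {
                out.println(l);
            }
        }
        return temp;
    }

    private static final class ChunkReader {
        final BufferedReader reader;
        String current;

        ChunkReader(File file) throws IOException {
            reader = new BufferedReader(new FileReader(file));
            current = reader.readLine();
        }

        boolean advance() throws IOException {
            current = reader.readLine();
            return current != null;
        }

        void close() throws IOException {
            reader.close();
        }
    }
}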
As others mentioned, you can process this in steps.
I would like to explain it in my own words (it differs on point 3):
Read the file sequentially, process N records at a time in memory (N is arbitrary, depending on your memory constraint and the number T of temporary files that you want).
Sort the N records in memory, write them to a temp file. Loop on T until you are done.
Open all the T temp files at the same time, but read only one record per file (with buffers, of course). For each of these T records, find the smallest, write it to the final file, and advance only in that file.
Advantages:
The memory consumption is as low as you want.
You only do double the disk accesses compared to an everything-in-memory policy. Not bad! :-)
Example with numbers:
Original file with 1 million records.
Choose to have 100 temp files, so read and sort 10,000 records at a time, and drop these into their own temp file.
Open the 100 temp files at the same time, and read the first record of each into memory.
Compare those first records, write the smallest, and advance that temp file.
Loop on step 5, one million times.
EDITED
You mentioned a multi-threaded application, so I wonder ...
As we have seen from these discussions, using less memory costs performance, with a dramatic factor in this case. So I would also suggest using only one thread to process one sort at a time, rather than running it as a multi-threaded application.
If you process ten threads, each with a tenth of the available memory, your performance will be miserable, much less than a tenth of the initial time. If you use only one thread, queue the 9 other demands, and process them in turn, your global performance will be much better and you will finish the ten tasks much faster.
After reading this response :
Sort a file with huge volume of data given memory constraint
I suggest you consider this distribution sort. It could be a huge gain in your context.
The improvement over my proposal is that you don't need to open all the temp files at once, you only open one of them. It saves your day! :-)
You can read the files in smaller parts, sort these, and write them to temporary files. Then you read two of them sequentially and merge them into a bigger temporary file, and so on. If only one is left, you have your file sorted. Basically that's the merge sort algorithm performed on external files. It scales quite well to arbitrarily large files but causes some extra file I/O.
Edit: If you have some knowledge about the likely variance of the lines in your files, you can employ a more efficient algorithm (distribution sort). Simplified: you read the original file once and write each line to a temporary file that takes only lines with the same first character (or a certain range of first characters). Then you iterate over all the (now small) temporary files in ascending order, sort them in memory, and append them directly to the output file. If a temporary file turns out to be too big for sorting in memory, you can repeat the same process for it based on the 2nd character in the lines, and so on. So if your first partitioning was good enough to produce small enough files, you will have only 100% I/O overhead regardless of how large the file is, but in the worst case it can become much more than with the performance-wise stable merge sort.
In spite of your restriction, I would use the embedded database SQLite. Like you, I work weekly with 10-15 million flat-file lines, and it is very, very fast to import and generate sorted data, and you only need a small, free executable (sqlite3.exe). For example, once you download the .exe file, in a command prompt you can do this:
C:> sqlite3.exe dbLines.db
sqlite> create table tabLines(line varchar(5000));
sqlite> create index idx1 on tabLines(line);
sqlite> .separator '\r\n'
sqlite> .import 'FileToImport' TabLines
then:
sqlite> select * from tabLines order by line;
or save to a file:
sqlite> .output out.txt
sqlite> select * from tabLines order by line;
sqlite> .output stdout
I would spin up an EC2 cluster and run Hadoop's MergeSort.
Edit: not sure how much detail you would like, or on what. EC2 is Amazon's Elastic Compute Cloud - it lets you rent virtual servers by the hour at low cost. Here is their website.
Hadoop is an open-source MapReduce framework designed for parallel processing of large data sets. A job is a good candidate for MapReduce when it can be split into subsets that can be processed individually and then merged together, usually by sorting on keys (ie the divide-and-conquer strategy). Here is its website.
As mentioned by the other posters, external sorting is also a good strategy. I think the way I would decide between the two depends on the size of the data and speed requirements. A single machine is likely going to be limited to processing a single file at a time (since you will be using up available memory). So look into something like EC2 only if you need to process files faster than that.
You could use the following divide-and-conquer strategy:
Create a function H() that can assign each record in the input file a number. For a record r2 that will be sorted behind a record r1 it must return a larger number for r2 than for r1. Use this function to partition all the records into separate files that will fit into memory so you can sort them. Once you have done that you can just concatenate the sorted files to get one large sorted file.
Suppose you have this input file where each line represents a record
Alan Smith
Jon Doe
Bill Murray
Johnny Cash
Let's just build H() so that it uses the first letter of the record. You might get up to 26 files, but in this example you will just get 3:
<file1>
Alan Smith
<file2>
Bill Murray
<file10>
Jon Doe
Johnny Cash
Now you can sort each individual file, which would swap "Jon Doe" and "Johnny Cash" in <file10>. Now, if you just concatenate the 3 files, you'll have a sorted version of the input.
Note that you divide first and only conquer (sort) later. However, you make sure to do the partitioning in a way that the resulting parts you need to sort don't overlap, which makes merging the result much simpler.
The method by which you implement the partitioning function H() depends very much on the nature of your input data. Once you have that part figured out the rest should be a breeze.
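For instance, here is a sketch that partitions by the upper-cased first character as H() (the file names are placeholders, and it assumes the records fit the simple one-name-per-line example above):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

public class PartitionByFirstLetter {

    // Example H(): bucket records by the upper-cased first character.
    static char h(String record) {
        return Character.toUpperCase(record.charAt(0));
    }

    public static void main(String[] args) throws IOException {
        Map<Character, PrintWriter> buckets = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader("input.txt"))) {
            String record;
            while ((record = in.readLine()) != null) {
                char bucket = h(record);
                PrintWriter out = buckets.get(bucket);
                if (out == null) {
                    out = new PrintWriter(new FileWriter("bucket-" + bucket + ".txt"));
                    buckets.put(bucket, out);
                }
                out.println(record);
            }
        } finally {
            for (PrintWriter out : buckets.values()) {
                out.close();
            }
        }
        // Each bucket file should now fit in memory: sort each one,
        // then concatenate bucket-A, bucket-B, ... to get the sorted output.
    }
}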
If your restriction is only to not use an external database system, you could try an embedded database (e.g. Apache Derby). That way, you get all the advantages of a database without any external infrastructure dependencies.
Here is a way to do it without heavy use of sorting in-side Java and without using DB.
Assumptions: you have 1 TB of space, and the files contain or start with a unique number but are unsorted.
Divide the files N times.
Read those N files one by one, and create one file for each line/number.
Name that file with the corresponding number. While naming, keep a counter updated to store the least count.
Now you already have a root folder of files that can be sorted by name, or you can pause your program to give yourself time to fire a command on your OS to sort the files by name. You can do it programmatically too.
Now you have a folder with files sorted by name. Using the counter, start taking each file one by one, append its number to your OUTPUT file, and close it.
When you are done you will have a large file with sorted numbers.
I know you mentioned not using a database no matter how light... so maybe this is not an option. But what about HSQLDB in memory: submit the data, sort it with a query, purge it. Just a thought.
You can use a SQLite file DB, load the data into the DB, and then let it sort and return the results for you.
Advantages: No need to worry about writing the best sorting algorithm.
Disadvantage: You will need disk space, slower processing.
https://sites.google.com/site/arjunwebworld/Home/programming/sorting-large-data-files
You can do it with only two temp files - source and destination - and as little memory as you want.
On first step your source is the original file, on last step the destination is the result file.
On each iteration:
read from the source file into a sliding buffer a chunk of data half the size of the buffer;
sort the whole buffer
write the first half of the buffer to the destination file.
shift the second half of the buffer to the beginning and repeat
Keep a boolean flag that says whether you had to move some records in the current iteration.
If the flag remains false, your file is sorted.
If it's raised, repeat the process using the destination file as a source.
Max number of iterations: (file size)/(buffer size)*2
You could download GNU sort for Windows: http://gnuwin32.sourceforge.net/packages/coreutils.htm Even if that uses too much memory, it can merge smaller sorted files as well. It automatically uses temp files.
There's also the sort command that comes with Windows, inside cmd.exe. Both of these commands can specify the character column to sort by.
File-sorting software for big files: https://github.com/lianzhoutw/filesort/ .
It is based on the file merge sort algorithm.
If you can move forward/backward in a file (seek), and rewrite parts of the file, then you should use bubble sort.
You will have to scan lines in the file, keeping only 2 rows in memory at a time, and swap them if they are not in the right order. Repeat the process until there are no more rows to swap.