I have a file with 100,000 lines, and each line is a list of 1000 space-separated integers (ranging from 0 to 1,000,000). Now I need to make an API which, given two inputs a and b, tells me whether both numbers are present on the same line of the file with b coming after a in terms of index. The total size of the file is ~700 MB.
Since it is an API I cannot read every time from file by creating a stream, as I have to take care of response time and disk reads are slow. And I cannot load everything in memory since the file is too big.
Any suggestions on what is an optimal way?
Note - I created an API by loading everything into memory and building a hashmap of number -> set of lines it belongs to, and then searched that. It works for smaller files, but when I try to start the server with the larger file, the server does not start. (I am new to Java too; can anyone tell me where to find the logs that explain why it is not starting? I am just running java -jar $DIR/target/test.jar in my bash script.)
I think you have a lot of numbers here (100M), and if you want to keep them all in memory you should be prepared to use several GB of RAM. The good news is that the highest number is 1M, so a lot of numbers repeat.
I would probably represent the file as a graph. Each node holds a number (0 to 1,000,000), so you have about 1 million nodes, indexed for O(1) access (the nodes could easily be implemented as cells of an array). Node X is then connected to node Y if Y appears to the right of X in any line of the file.
The solution involves checking whether two nodes in the graph are connected. I'm not an expert here, but I would implement a DFS-like algorithm, taking care to avoid cycles. Because of that, the search will touch at most about 1 million nodes, which keeps the complexity low.
About space: each line should produce 999 connections, which (multiplied by 100k lines) is almost 100 million connections. If each connection takes 4 bytes (you can improve on this, since 20 bits are enough to store 1 million), then you need 400 MB of memory for the connections.
So with 400 MB of RAM you can make your API answer very fast.
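A minimal sketch of the graph idea above, assuming the lines have already been parsed into int arrays. The class name `NumberGraph` and its methods are hypothetical, not part of the original answer. It links only neighbouring numbers (the 999 connections per line counted above), so a search can walk from one line into another and report a pair that never shares a line; a positive answer is best treated as a candidate to verify.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class NumberGraph {
    // next.get(x) lists every number seen immediately to the right of x.
    private final List<List<Integer>> next = new ArrayList<>();

    NumberGraph(int maxValue) {
        for (int i = 0; i <= maxValue; i++) {
            next.add(new ArrayList<>());
        }
    }

    // Adds one edge per pair of neighbours in the line.
    void addLine(int[] line) {
        for (int i = 0; i + 1 < line.length; i++) {
            next.get(line[i]).add(line[i + 1]);
        }
    }

    // Iterative depth-first search; the seen[] array prevents looping on cycles.
    boolean reachable(int a, int b) {
        boolean[] seen = new boolean[next.size()];
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(a);
        seen[a] = true;
        while (!stack.isEmpty()) {
            int current = stack.pop();
            for (int n : next.get(current)) {
                if (n == b) {
                    return true;
                }
                if (!seen[n]) {
                    seen[n] = true;
                    stack.push(n);
                }
            }
        }
        return false;
    }
}
```

With the lines `1 2 3` and `3 4`, `reachable(1, 3)` is true and `reachable(3, 1)` is false, but `reachable(1, 4)` is also true even though 1 and 4 never share a line.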
Related
Problem
[Here follows a description of what the app should do and under which constraints]
I want a data structure that can check whether a string exists in a 250,000-word list, while using only a modest amount of RAM and keeping the time it takes to load the data structure into RAM small (let's say 0-8 seconds). Finding a word should also be quick (let's say 0 to 0.5 seconds), but RAM usage is more important. It should also be possible to create multiple games (more on what this game is about under the title "Use") without needing significantly more memory.
It would also be highly valuable to know which words start with a string, but not enough so to sacrifice load-time by many seconds.
Use
It is for an Android offline game. Limited ram is available. The maximum amount of ram an Application can use according to this post is between 16-32mb ram depending on the device. My empty Android Application already uses about 17mb (using Memory Monitor in Android Studio). My android device caps the ram usage off at 26mb, leaving me at about 8mb of free space for my whole Activity.
Options I tried
They all seem doomed in different ways.
Hashmap - Read all words into a hash-map object.
1.1 initialize speed: slow; reading every word into the hash-map takes 23 seconds.
1.2 ram usage: uses significant amount of ram, although I forgot how much exactly.
1.3 search speed: Finding if a word existed in the list was quick of course.
1.4 narrowing down on possible words (optional): slow, needs to go through the whole hash-map and delete them one by one. Also, because it uses deletion, multiple games can't be played using the same instance of the hash-map. Adding more games would take too much memory, making narrowing down on the possible words therefore impossible.
Trie - Implement a RadixTree &
You can see my implementation here.
2.1 initialize speed: slow; reading every word into the RadixTree takes 47 seconds.
2.2 ram usage: uses significant amount of ram, so much that Android is suspending threads a couple of times.
2.3 search speed: Finding if a word existed in the list was quick.
2.4 narrowing down on possible words (optional): Ultra fast since only a reference to a node in the tree is needed to then find all possible words as its children. You can play a lot of games with narrowing down the possible words since an extra game requires only a reference to a node in the tree!
Scanner - Go through the word-file sequentially
3.1 initialize speed: none.
3.2 ram usage: none.
3.3 search speed: about 20 seconds.
3.4 narrowing down on possible words (optional): can't be done realistically.
simple code:
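// Assumes wordFile is a java.util.Scanner opened on the word-list file.
String word;
String wordToFind = "example";
boolean foundWord = false;
while (wordFile.hasNextLine()) {
    word = wordFile.nextLine();
    if (word.equals(wordToFind)) {
        foundWord = true;
        break;
    }
}
wordFile.close(); // close the same Scanner that was read from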
Options I thought of:
Long-binary-search-tree: Converting the word-list to a list of longs then reading these in and doing a binary search on them.
1.1 initialize speed: probably the same as a hash-map or a little less, at about 20 seconds. However, I hope calling Arrays.sort() does not take too much time; no idea as of yet.
1.2 ram usage: if you only account for words of 12 letters or fewer with a 26 letter alphabet, you need 5 bits (2^5 = 32) to encode each letter, so a word fits in 12 * 5 = 60 bits, i.e. a single 64-bit long. An array of longs would then need 250,000 * 8 bytes = around 2 MB, which is not too much.
1.3 search speed: Arrays.binarySearch()
1.4 narrowing down on possible words (optional): Narrowing down on possible words could be possible but I am not sure how. According to a comment on this post.
Hashmap with storage - Creating a hashfunction that maps a word to an index number of the word-list file. Then accessing the file at this specific location and look from here to find if a word exists. You can make use of the ordering of the alphabet to determine if you can still find the word since the word-list is in natural order.
2.1 initialize speed: not needed (since I need to put every word at the right index beforehand.)
2.2 ram usage: none.
2.3 search speed: fast.
2.4 narrowing down on possible words (optional): not possible.
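The long-encoding option could look roughly like the sketch below. It assumes lowercase a-z words of at most 12 letters; the class `WordCodec` and all its method names are made up for illustration. Letters are packed left-aligned (5 bits each, 0 as padding), so numeric order matches alphabetical order and a prefix corresponds to one contiguous range of codes, which is one possible way to do the "narrowing down".

```java
import java.util.Arrays;

class WordCodec {
    static final int MAX_LEN = 12; // 12 letters * 5 bits = 60 bits, fits in a long

    // Packs a word into a long, left-aligned, using 1..26 for a..z and 0 as padding.
    static long encode(String word) {
        if (word.length() > MAX_LEN) {
            throw new IllegalArgumentException("too long: " + word);
        }
        long code = 0;
        for (int i = 0; i < MAX_LEN; i++) {
            int value = 0;
            if (i < word.length()) {
                char c = word.charAt(i);
                if (c < 'a' || c > 'z') {
                    throw new IllegalArgumentException("unsupported letter in: " + word);
                }
                value = c - 'a' + 1;
            }
            code = (code << 5) | value;
        }
        return code;
    }

    static long[] buildIndex(String[] words) {
        long[] codes = new long[words.length];
        for (int i = 0; i < words.length; i++) {
            codes[i] = encode(words[i]);
        }
        Arrays.sort(codes);
        return codes;
    }

    static boolean contains(long[] index, String word) {
        return Arrays.binarySearch(index, encode(word)) >= 0;
    }

    // Counts the words that start with the prefix: they form one contiguous code range.
    static int countWithPrefix(long[] index, String prefix) {
        long low = encode(prefix);
        long high = low | ((1L << (5 * (MAX_LEN - prefix.length()))) - 1);
        return lowerBound(index, high + 1) - lowerBound(index, low);
    }

    // Index of the first element that is >= key.
    private static int lowerBound(long[] a, long key) {
        int lo = 0;
        int hi = a.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (a[mid] < key) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }
}
```

Each game would then only need to remember a prefix (or a pair of array bounds) rather than its own copy of the list.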
Specific questions I have
Are the options I have thought of in the "Options I have thought of" section viable options or are there things I missed yet which would make them not possible to implement?
Are there options I have not thought of which are better/equal in performance?
End remarks
I have been stuck at this for about a week now. So any new ideas are more than welcome. If any of my assumption above are incorrect I would also be pleased to hear about them.
I made this post this way so others could learn from them as well, either by seeing my mistakes or seeing what does work in the answers.
This sounds like an ideal use for a bloom filter. If you're willing to allow the risk of something being falsely considered a word, you can condense your wordlist into an amount of memory as small or as large as you're willing to make it.
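As a rough illustration of the Bloom filter idea, here is a small sketch built on `java.util.BitSet`. The class name, the choice of two base hashes (`String.hashCode()` plus an FNV-1a style hash) and the way they are combined are assumptions made for this example, not a tuned design. The filter never misses a word that was added, but it can report a word that was never added.

```java
import java.util.BitSet;

class WordBloomFilter {
    private final BitSet bits;
    private final int size;      // number of bits in the filter
    private final int hashCount; // number of bit positions set per word

    WordBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    void add(String word) {
        int h1 = word.hashCode();
        int h2 = secondHash(word);
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(h1, h2, i));
        }
    }

    // false means "definitely not in the list"; true means "probably in the list".
    boolean mightContain(String word) {
        int h1 = word.hashCode();
        int h2 = secondHash(word);
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(h1, h2, i))) {
                return false;
            }
        }
        return true;
    }

    // Derives the i-th bit position from the two base hashes.
    private int index(int h1, int h2, int i) {
        return Math.floorMod(h1 + i * h2, size);
    }

    // FNV-1a style hash over the characters, forced odd so it is never zero.
    private static int secondHash(String word) {
        int h = 0x811C9DC5;
        for (int i = 0; i < word.length(); i++) {
            h ^= word.charAt(i);
            h *= 0x01000193;
        }
        return h | 1;
    }
}
```

Making `size` larger lowers the false-positive rate at the cost of memory, which is the trade-off the answer describes.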
I had this same issue and ended up going with an "on-disk" trie. That is, I encode the data structure into a single file using byte offsets instead of pointers (packing the nodes in reverse order, with the "root" node being the last written).
It is fast to load by simply reading the file into a byte array, with trie traversal using offset values the same way it would pointers.
My 200K word set fits in 1.7 MB (uncompressed) with a 4 byte value in each word terminating node.
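The layout described above might be sketched as follows. The node format is an assumption made for illustration (1 flag byte, 1 child-count byte, then 5 bytes per child: a letter and a 4-byte offset), it assumes single-byte letters, and it stores a terminal flag rather than the 4-byte value mentioned above. Children are written before their parent, so the root is written last and its offset is appended as a 4-byte trailer.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.TreeMap;

class PackedTrie {
    // Temporary pointer-based node, used only while building.
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        boolean terminal;
    }

    private final ByteBuffer data;
    private final int root;

    // The same constructor would work on bytes read back from a file.
    private PackedTrie(byte[] bytes) {
        data = ByteBuffer.wrap(bytes);
        root = data.getInt(bytes.length - 4); // trailer holds the root offset
    }

    static PackedTrie build(String[] words) {
        Node rootNode = new Node();
        for (String w : words) {
            Node n = rootNode;
            for (char c : w.toCharArray()) {
                n = n.children.computeIfAbsent(c, k -> new Node());
            }
            n.terminal = true;
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int rootOffset = write(rootNode, out);
        writeInt(out, rootOffset);
        return new PackedTrie(out.toByteArray());
    }

    // Writes the children first, then this node; returns this node's offset.
    private static int write(Node node, ByteArrayOutputStream out) {
        int n = node.children.size();
        char[] labels = new char[n];
        int[] offsets = new int[n];
        int i = 0;
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            labels[i] = e.getKey();
            offsets[i] = write(e.getValue(), out);
            i++;
        }
        int start = out.size();
        out.write(node.terminal ? 1 : 0);
        out.write(n);
        for (int j = 0; j < n; j++) {
            out.write(labels[j]); // low 8 bits only: single-byte letters assumed
            writeInt(out, offsets[j]);
        }
        return start;
    }

    private static void writeInt(ByteArrayOutputStream out, int v) {
        out.write(v >>> 24);
        out.write(v >>> 16);
        out.write(v >>> 8);
        out.write(v);
    }

    boolean contains(String word) {
        int node = root;
        for (int i = 0; i < word.length(); i++) {
            node = child(node, word.charAt(i));
            if (node < 0) {
                return false;
            }
        }
        return data.get(node) == 1;
    }

    // Follows a byte offset exactly where a pointer would be followed.
    private int child(int node, char c) {
        int count = data.get(node + 1) & 0xFF;
        for (int j = 0; j < count; j++) {
            int entry = node + 2 + 5 * j;
            if ((data.get(entry) & 0xFF) == c) {
                return data.getInt(entry + 1);
            }
        }
        return -1;
    }
}
```

In a real app the byte array would be produced once at build time, shipped as a file, and loaded with a single read.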
Perhaps I'm doing this the wrong way:
I have a 4GB (33million lines of text) file, where each line has a string in it.
I'm trying to create a trie -> The algorithm works.
The problem is that Node.js has a process memory limit of 1.4GB, so the moment I process 5.5 million lines, it crashes.
To get around this, I tried the following:
Instead of 1 Trie, I create many Tries, each having a range of the alphabet.
For example:
aTrie ---> all words starting with a
bTrie ---> all words starting with b...
etc...
But the problem is, I still can't keep all the objects in memory while reading the file, so each time I read a line, I load / unload a trie from disk. When there is a change I delete the old file, and write the updated trie from memory to disk.
This is SUPER SLOW! Even on my macbook pro with SSD.
I've considered writing this in Java, but then the problem of converting JAVA objects to json comes up (same problem with using C++ etc).
Any suggestions ?
You may extend memory size limit that the node process uses by specifying the option below;
ps: size in mb's.
node --max_old_space_size=4096
for more options please see:
https://github.com/thlorenz/v8-flags/blob/master/flags-0.11.md
Instead of using 26 Tries you could use a hash function to create an arbitrary number of sub-Tries. This way, the amount of data you have to read from disk is limited to the size of your sub-Trie that you determine. In addition, you could cache the recently used sub-Tries in memory and then persist the changes to disk asynchronously in the background if IO is still a problem.
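A small sketch of those two pieces, written in Java purely for illustration (the same idea carries over to Node.js): a hash-based shard number and a least-recently-used cache of loaded sub-tries. The class name `ShardCache` and the method `shardFor` are hypothetical, and the step that persists an evicted sub-trie to disk is left as a comment.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ShardCache<V> extends LinkedHashMap<Integer, V> {
    private final int capacity;

    ShardCache(int capacity) {
        super(16, 0.75f, true); // true = iterate in access order, which gives LRU behaviour
        this.capacity = capacity;
    }

    // Maps a word to one of shardCount sub-tries; the same word always gets the same shard.
    static int shardFor(String word, int shardCount) {
        return Math.floorMod(word.hashCode(), shardCount);
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Integer, V> eldest) {
        // A real version would write eldest.getValue() back to disk here before dropping it.
        return size() > capacity;
    }
}
```

The number of shards controls how much has to be read from disk on a cache miss; the cache capacity controls how much memory is used.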
I am pretty sure a similar discussion has already taken place here, but I want to present the exact problem I am facing along with possible solutions from my side. Then I want to hear from you what a better approach would be, or how I can improve my logic.
PROBLEM
I have a huge file which contains lines. Each line is in following format <weight>,<some_name>. Now what I have to do is to add the weight of all the objects which has same name. The problem is
I don't know how frequent some_name exist in the file. it could appear only once or all of the millions could be it
It is not ordered
I am using File Stream (java specific, but it doesn't matter)
SOLUTION 1: Assuming that I have a lot of RAM, what I am planning to do is read the file line by line and use the name as the key in my hash_map. If the name is already there, add the weight to its sum; otherwise insert it. It will cost me m RAM (m = number of lines in the file), but overall processing would be fast.
SOLUTION 2: Assuming that I don't have a lot of RAM, I am going to work in batches. Read the first 10,000 lines into a hashtable, sum them up and dump the result into a file. Do the same for the rest of the file. Once done processing the file, I will start reading the processed files and repeat this process to sum everything up.
What do you guys suggest here ?
Beside your suggestions, Can I do parallel file reading of the file ? I have access to FileInputStream here, Can i work with fileInputStream to make reading of file more efficient ?
The second approach is not going to help you: in order to produce the final output, you need sufficient amount of RAM to hold all keys from the file, along with a single Integer representing the count. Whether you're going to get to it in one big step or by several iterations of 10K rows at a time does not change the footprint that you would need at the end.
What would help is partitioning the keys in some way, e.g. by the first character of the key. If the name starts in a letter, process the file 26 times, the first time taking only the weights for keys starting in 'A' and ignoring all other keys, the second time taking only 'B's, and so on. This will let you end up with 26 files that do not intersect.
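One pass of that partitioning might look like the sketch below. It assumes integer weights and the `<weight>,<some_name>` layout from the question; the class and method names are invented for the example. Calling it once per letter keeps only that letter's names in memory at a time.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

class WeightSums {
    // Sums weights only for names whose first letter matches 'first' (case-insensitive).
    static Map<String, Long> sumPartition(BufferedReader in, char first) throws IOException {
        Map<String, Long> totals = new HashMap<>();
        String line;
        while ((line = in.readLine()) != null) {
            int comma = line.indexOf(',');
            if (comma < 0 || comma + 1 >= line.length()) {
                continue; // skip malformed lines
            }
            String name = line.substring(comma + 1);
            if (Character.toUpperCase(name.charAt(0)) != first) {
                continue; // belongs to another partition
            }
            long weight = Long.parseLong(line.substring(0, comma).trim());
            totals.merge(name, weight, Long::sum); // insert, or add to the existing sum
        }
        return totals;
    }
}
```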
Another valid approach would be using an external sorting algorithm to transform an unordered file to an ordered one. This would let you walk the ordered file, calculate totals as you go, and write them to an output, even without the need for an in-memory table.
As far as optimizing the I/O goes, I would recommend using the newBufferedReader(Path path,Charset c) method of the java.nio.file.Files class: it gives you a BufferedReader that is optimized for reading efficiency.
Is the file static when you do this computation? If so, then you could disk sort the file based on the name and add up the consecutive entries.
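Once the file is sorted by name, adding up consecutive entries needs only a running total. A sketch, again assuming integer weights and `<weight>,<some_name>` lines, with hypothetical class and method names:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;

class SortedSums {
    // Input must already be sorted by name; writes one "<total>,<name>" line per name.
    static void sumConsecutive(BufferedReader sortedByName, Writer out) throws IOException {
        String current = null;
        long total = 0;
        String line;
        while ((line = sortedByName.readLine()) != null) {
            int comma = line.indexOf(',');
            if (comma < 0) {
                continue; // skip malformed lines
            }
            String name = line.substring(comma + 1);
            long weight = Long.parseLong(line.substring(0, comma).trim());
            if (!name.equals(current)) {
                if (current != null) {
                    out.write(total + "," + current + "\n"); // previous name is complete
                }
                current = name;
                total = 0;
            }
            total += weight;
        }
        if (current != null) {
            out.write(total + "," + current + "\n"); // flush the last name
        }
    }
}
```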
Given a harddrive with 120GB, 100 of which are filled with the strings of length 256 and 2 GB Ram how do I sort those strings in Java most efficiently?
How long will it take?
A1. You probably want to implement some form of merge-sort.
A2: Longer than it would if you had 256GB RAM on your machine.
Edit: stung by criticism, I quote from Wikipedia's article on merge sort:
Merge sort is so inherently sequential that it is practical to run it using slow tape drives as input and output devices. It requires very little memory, and the memory required does not depend on the number of data elements.
For the same reason it is also useful for sorting data on disk that is too large to fit entirely into primary memory. On tape drives that can run both backwards and forwards, merge passes can be run in both directions, avoiding rewind time.
Here is how I'd do it:
Phase 1 is to split the 100Gb into 50 partitions of 2Gb, read each of the 50 partitions into memory, sort using quicksort, and write out. You want the sorted partitions at the top end of the disc.
Phase 2 is to then merge the 50 sorted partitions. This is the tricky bit because you don't have enough space on the disc to store the partitions AND the final sorted output. So ...
Do a 50-way merge to fill the first 20Gb at the bottom end of disc.
Slide the remaining data in the 50 partitions to the top to make another 20Gb of free space contiguous with the end of the first 20Gb.
Repeat steps 1. and 2. until completed.
This does a lot of disc IO, but you can make use of your 2Gb of memory for buffering in the copying and merging steps to get data throughput by minimizing the number of disc seeks, and do large data transfers.
EDIT - @meriton has proposed a clever way to reduce copying. Instead of sliding, he suggests that the partitions be sorted into reverse order and read backwards in the merge phase. This would allow the algorithm to release disc space used by partitions (phase 2, step 2) by simply truncating the partition files.
The potential downsides of this are increased disk fragmentation, and loss of performance due to reading the partitions backwards. (On the latter point, reading a file backwards on Linux / UNIX requires more syscalls, and the FS implementation may not be able to do "read-ahead" in the reverse direction.)
Finally, I'd like to point out that any theoretical predictions of the time taken by this algorithm (and others) are largely guesswork. The behaviour of these algorithms on a real JVM + real OS + real discs is just too complex for "back of the envelope" calculations to give reliable answers. A proper treatment would require actual implementation, tuning and benchmarking.
I am basically repeating Krystian's answer, but elaborating:
Yes you need to do this more-or-less in place, since you have little RAM available. But naive in-place sorts would be a disaster here just due to the cost of moving strings around.
Rather than actually move strings around, just keep track of which strings should swap with which others and actually move them, once, at the end, to their final spot. That is, if you had 1000 strings, make an array of 1000 ints. array[i] is the location where string i should end up. If array[17] == 133 at the end, it means string 17 should end up in the spot for string 133. array[i] == i for all i to start. Swapping strings, then, is just a matter of swapping two ints.
Then, any in-place algorithm like quicksort works pretty well.
The running time is surely dominated by the final move of the strings. Assuming each one moves, you're moving around about 100GB of data in reasonably-sized writes. I might assume the drive / controller / OS can move about 100MB/sec for you. So, 1000 seconds or so? A little under 20 minutes?
But does it fit in memory? You have 100GB of strings, each of which is 256 bytes. How many strings? 100 * 2^30 / 2^8, or about 419M strings. You need 419M ints, each is 4 bytes, or about 1.7GB. Voila, fits in your 2GB.
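The bookkeeping described above can be sketched on a small in-memory array standing in for the strings on disk. The class and method names are hypothetical, and the boxed `Integer[]` is used only for brevity; at 419M entries a primitive `int[]` with a hand-written sort would be needed to stay within the memory estimate.

```java
import java.util.Arrays;

class IndexSort {
    // order[k] = index of the record that belongs at sorted position k.
    static Integer[] sortedOrder(String[] records) {
        Integer[] order = new Integer[records.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Only the small integers move during the sort; the records stay where they are.
        Arrays.sort(order, (x, y) -> records[x].compareTo(records[y]));
        return order;
    }

    // destination[i] = final position of record i, i.e. the array described above.
    static int[] destinations(Integer[] order) {
        int[] destination = new int[order.length];
        for (int k = 0; k < order.length; k++) {
            destination[order[k]] = k;
        }
        return destination;
    }
}
```

For the records `pear`, `apple`, `fig` the destinations are `[2, 0, 1]`: `pear` moves to the last slot, `apple` to the first, `fig` to the middle.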
Sounds like a task that calls for External sorting method. Volume 3 of "The Art of Computer Programming" contains a section with extensive discussion of external sorting methods.
I think you should use BogoSort. You might have to modify the algorithm a bit to allow for inplace sorting, but that shouldn't be too hard. :)
You should use a trie (aka: a prefix tree): to build a tree-like structure that allows you to easily walk through your strings in an ordered manner by comparing their prefixes. In fact, you don't need to store it in memory. You can build the trie as a tree of directories on your file system (obviously, not the one which the data is coming from).
AFAIK, merge-sort requires as much free space as you have data. This may be a requirement for any external sort that avoids random access, though I'm not sure of this.
Points:
We process thousands of flat files in a day, concurrently.
Memory constraint is a major issue.
We use thread for each file process.
We don't sort by columns. Each line (record) in the file is treated as one column.
Can't Do:
We cannot use unix/linux's sort commands.
We cannot use any database system no matter how light they can be.
Now, we cannot just load everything in a collection and use the sort mechanism. It will eat up all the memory and the program is gonna get a heap error.
In that situation, how would you sort the records/lines in a file?
It looks like what you are looking for is external sorting.
Basically, you sort small chunks of data first, write it back to the disk and then iterate over those to sort all.
As others mentioned, you can process in steps.
I would like to explain this in my own words (it differs on point 3):
Read the file sequentially, process N records at a time in memory (N is arbitrary, depending on your memory constraint and the number T of temporary files that you want).
Sort the N records in memory, write them to a temp file. Loop on T until you are done.
Open all the T temp files at the same time, but read only one record per file. (Of course, with buffers). For each of these T records, find the smaller, write it to the final file, and advance only in that file.
Advantages:
The memory consumption is as low as you want.
You only do double the disk accesses compared to an everything-in-memory policy. Not bad! :-)
Example with numbers:
Original file with 1 million records.
Choose to have 100 temp files, so read and sort 10 000 records at a time, and drop these in their own temp file.
Open the 100 temp file at a time, read the first record in memory.
Compare the first records, write the smaller and advance this temp file.
Loop on the previous step, one million times.
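The steps above can be sketched as follows: sorted chunks are written to temp files, then merged by always taking the smallest current line across the open files. The class name `ExternalSort`, the `chunkSize` parameter (N in the description) and the use of a priority queue to find the smallest record are choices made for this example.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class ExternalSort {
    static void sort(Path input, Path output, int chunkSize) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> buffer = new ArrayList<>(chunkSize);
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == chunkSize) {
                    chunks.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                chunks.add(writeSortedChunk(buffer));
            }
        }
        merge(chunks, output);
        for (Path chunk : chunks) {
            Files.delete(chunk);
        }
    }

    // Sorts at most chunkSize records in memory and writes them to a temp file.
    private static Path writeSortedChunk(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path tmp = Files.createTempFile("chunk", ".txt");
        Files.write(tmp, buffer, StandardCharsets.UTF_8);
        return tmp;
    }

    // The current line of one temp file, plus the reader it came from.
    private static final class Head {
        final String line;
        final BufferedReader reader;

        Head(String line, BufferedReader reader) {
            this.line = line;
            this.reader = reader;
        }
    }

    private static void merge(List<Path> chunks, Path output) throws IOException {
        PriorityQueue<Head> heap = new PriorityQueue<>((a, b) -> a.line.compareTo(b.line));
        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path chunk : chunks) {
                BufferedReader r = Files.newBufferedReader(chunk, StandardCharsets.UTF_8);
                readers.add(r);
                String first = r.readLine();
                if (first != null) {
                    heap.add(new Head(first, r));
                }
            }
            while (!heap.isEmpty()) {
                Head smallest = heap.poll();
                out.write(smallest.line);
                out.newLine();
                String next = smallest.reader.readLine(); // advance only in that file
                if (next != null) {
                    heap.add(new Head(next, smallest.reader));
                }
            }
        } finally {
            for (BufferedReader r : readers) {
                r.close();
            }
        }
    }
}
```

Memory use is bounded by `chunkSize` records during the first phase and by one buffered record per temp file during the merge.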
EDITED
You mentioned a multi-threaded application, so I wonder ...
As we have seen from these discussions, using less memory gives less performance, by a dramatic factor in this case. So I would also suggest using only one thread to process one sort at a time, rather than a multi-threaded application.
If you run ten threads, each with a tenth of the available memory, your performance will be miserable, far worse than a tenth of the original speed. If you use only one thread, queue the 9 other requests and process them in turn, your global performance will be much better and you will finish the ten tasks much faster.
After reading this response :
Sort a file with huge volume of data given memory constraint
I suggest you consider this distribution sort. It could be huge gain in your context.
The improvement over my proposal is that you don't need to open all the temp files at once, you only open one of them. It saves your day! :-)
You can read the files in smaller parts, sort these and write them to temporary files. Then you read two of them sequentially and merge them into a bigger temporary file, and so on. When there is only one left, you have your file sorted. Basically that's the merge sort algorithm performed on external files. It scales quite well with arbitrarily large files but causes some extra file I/O.
Edit: If you have some knowledge about the likely variance of the lines in your files, you can employ a more efficient algorithm (distribution sort). Simplified, you would read the original file once and write each line to a temporary file that takes only lines with the same first char (or a certain range of first chars). Then you iterate over all the (now small) temporary files in ascending order, sort them in memory and append them directly to the output file. If a temporary file turns out to be too big for sorting in memory, you can repeat the same process for it based on the 2nd char in the lines, and so on. So if your first partitioning was good enough to produce small enough files, you will have only 100% I/O overhead regardless of how large the file is, but in the worst case it can become much more than with the performance-wise stable merge sort.
In spite of your restriction, I would use embedded database SQLITE3. Like yourself, I work weekly with 10-15 millions of flat file lines and it is very, very fast to import and generate sorted data, and you only need a little free of charge executable (sqlite3.exe). For example: Once you download the .exe file, in a command prompt you can do this:
C:> sqlite3.exe dbLines.db
sqlite> create table tabLines(line varchar(5000));
sqlite> create index idx1 on tabLines(line);
sqlite> .separator '\r\n'
sqlite> .import 'FileToImport' TabLines
then:
sqlite> select * from tabLines order by line;
or save to a file:
sqlite> .output out.txt
sqlite> select * from tabLines order by line;
sqlite> .output stdout
I would spin up an EC2 cluster and run Hadoop's MergeSort.
Edit: not sure how much detail you would like, or on what. EC2 is Amazon's Elastic Compute Cloud - it lets you rent virtual servers by the hour at low cost. Here is their website.
Hadoop is an open-source MapReduce framework designed for parallel processing of large data sets. A job is a good candidate for MapReduce when it can be split into subsets that can be processed individually and then merged together, usually by sorting on keys (ie the divide-and-conquer strategy). Here is its website.
As mentioned by the other posters, external sorting is also a good strategy. I think the way I would decide between the two depends on the size of the data and speed requirements. A single machine is likely going to be limited to processing a single file at a time (since you will be using up available memory). So look into something like EC2 only if you need to process files faster than that.
You could use the following divide-and-conquer strategy:
Create a function H() that assigns each record in the input file a number. For a record r2 that will be sorted behind a record r1, it must never return a smaller number for r2 than for r1 (records that get the same number end up in the same file). Use this function to partition all the records into separate files that each fit into memory so you can sort them. Once you have done that, you can just concatenate the sorted files to get one large sorted file.
Suppose you have this input file where each line represents a record
Alan Smith
Jon Doe
Bill Murray
Johnny Cash
Let's just build H() so that it uses the first letter in the record, so you might get up to 26 files, but in this example you will just get 3:
<file1>
Alan Smith
<file2>
Bill Murray
<file10>
Jon Doe
Johnny Cash
Now you can sort each individual file. Which would swap "Jon Doe" and "Johnny Cash" in <file10>. Now, if you just concatenate the 3 files you'll have a sorted version of the input.
Note that you divide first and only conquer (sort) later. However, you make sure to do the partitioning in a way that the resulting parts which you need to sort don't overlap which will make merging the result much simpler.
The method by which you implement the partitioning function H() depends very much on the nature of your input data. Once you have that part figured out the rest should be a breeze.
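A minimal sketch of this strategy, using in-memory lists as stand-ins for the partition files. The class name and the choice of the first character as H() are assumptions for illustration; real data may need a different H() to keep every partition small enough.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DistributionSort {
    // Example H(): the first character. It never decreases as records get larger.
    static int h(String record) {
        return record.isEmpty() ? -1 : record.charAt(0);
    }

    static List<String> sort(List<String> records) {
        // Divide: one bucket per H() value; the TreeMap keeps buckets in ascending order.
        Map<Integer, List<String>> buckets = new TreeMap<>();
        for (String record : records) {
            buckets.computeIfAbsent(h(record), k -> new ArrayList<>()).add(record);
        }
        // Conquer: sort each bucket on its own, then concatenate.
        List<String> sorted = new ArrayList<>(records.size());
        for (List<String> bucket : buckets.values()) {
            Collections.sort(bucket);
            sorted.addAll(bucket);
        }
        return sorted;
    }
}
```

Run on the four names from the example, this returns Alan Smith, Bill Murray, Johnny Cash, Jon Doe.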
If your restriction is only to not use an external database system, you could try an embedded database (e.g. Apache Derby). That way, you get all the advantages of a database without any external infrastructure dependencies.
Here is a way to do it without heavy use of sorting inside Java and without using a DB.
Assumptions : You have 1TB space and files contain or start with unique number, but are unsorted
Divide the files N times.
Read those N files one by one, and create one file for each line/number
Name that file with the corresponding number. While naming, keep a counter updated to store the lowest number seen.
Now you can already have the root folder of files marked for sorting by name or pause your program to give you the time to fire command on your OS to sort the files by names. You can do it programmatically too.
Now you have a folder with files sorted with their name, using the counter start taking each file one by one, put numbers in your OUTPUT file, close it.
When you are done you will have a large file with sorted numbers.
I know you mentioned not using a database no matter how light... so, maybe this is not an option. But, what about hsqldb in memory... submit it, sort it by query, purge it. Just a thought.
You can use an SQLite file db, load the data into the db and then let it sort and return the results for you.
Advantages: No need to worry about writing the best sorting algorithm.
Disadvantage: You will need disk space, slower processing.
https://sites.google.com/site/arjunwebworld/Home/programming/sorting-large-data-files
You can do it with only two temp files - source and destination - and as little memory as you want.
On first step your source is the original file, on last step the destination is the result file.
On each iteration:
read from the source file into a sliding buffer a chunk of data half size of the buffer;
sort the whole buffer
write to the destination file the first half of the buffer.
shift the second half of the buffer to the beginning and repeat
Keep a boolean flag that says whether you had to move some records in current iteration.
If the flag remains false, your file is sorted.
If it's raised, repeat the process using the destination file as a source.
Max number of iterations: (file size)/(buffer size)*2
You could download gnu sort for windows: http://gnuwin32.sourceforge.net/packages/coreutils.htm Even if that uses too much memory, it can merge smaller sorted files as well. It automatically uses temp files.
There's also the sort that comes with windows within cmd.exe. Both of these commands can specify the character column to sort by.
File sort software for big file https://github.com/lianzhoutw/filesort/ .
It is based on file merge sort algorithm.
If you can move forward/backward in a file (seek), and rewrite parts of the file, then you should use bubble sort.
You will have to scan the lines in the file, keeping only 2 rows in memory at a time, and swap them if they are not in the right order. Repeat the process until there are no lines left to swap.