I have a huge file composed of ~800M rows (60 GB). Rows can be duplicated and consist of an id and a value. For example:
id1 valueA
id1 valueB
id2 valueA
id3 valueC
id3 valueA
id3 valueC
Note: unlike in the example, the ids are not ordered or grouped in the real file.
I want to aggregate rows by key, like this:
id1 valueA,valueB
id2 valueA
id3 valueC,valueA
There are 5000 possible values.
The file doesn't fit in memory so I can't use simple Java Collections.
Also, most of the lines are single (like id2 in the example) and those should be written directly to the output file.
For this reason my first solution was to iterate over the file twice:
In the first iteration I store two structures containing only ids, no values:
single-value ids (S1)
multiple-value ids (S2)
In the second iteration, after discarding the single-value ids (S1) from memory, I can write the single id-value pairs directly to the output file by checking that their id is not in the multiple-value set (S2).
The problem is that I cannot even finish the first iteration because of memory limits.
I know that the problem could be approached in several ways (key-value store, map-reduce, external sort).
My question is: which method would be the best fit and fastest to implement? It is a one-off process, and I would prefer to use Java methods (not an external sort tool).
As already said (that's quick!), merge-sort is one approach. Concretely, sort locally by id, say, every 1 million lines. Then save the locally sorted lines into smaller files. And then repetitively merge up the smaller, sorted files in pairs into one big sorted file. You can do aggregation when you merge up the smaller files.
The intuition is that, when you merge up 2 sorted lists, you maintain 2 pointers, one for each list, and sort as you go. You don't need to load the complete lists. This allows you to buffer-in big files and buffer-out the merged results immediately.
Here is sample code to sort in-memory and output to a file:
private void sortAndSave(List<String> lines, Path fileOut) throws IOException {
Collections.sort(lines, comparator);
Files.write(fileOut, lines);
}
Here is sample code to sort locally and save the results into smaller files:
// Sort once we collect 1000000 lines
final int cutoff = 1000000;
final List<String> lines = new ArrayList<>();
int fileCount = 0;
try (BufferedReader reader = Files.newBufferedReader(fileIn, charset)) {
    String line = reader.readLine();
    while (line != null) {
        lines.add(line);
        if (lines.size() >= cutoff) {
            fileCount++;
            sortAndSave(lines, Paths.get("fileOut" + fileCount));
            lines.clear();
        }
        line = reader.readLine();
    }
    if (lines.size() > 0) {
        fileCount++;
        sortAndSave(lines, Paths.get("fileOut" + fileCount));
    }
}
Here is sample code to merge sort 2 files:
try (BufferedReader reader1 = Files.newBufferedReader(file1, charset);
     BufferedReader reader2 = Files.newBufferedReader(file2, charset);
     BufferedWriter writer = Files.newBufferedWriter(fileOut, charset)) {
    String line1 = reader1.readLine();
    String line2 = reader2.readLine();
    while (line1 != null && line2 != null) {
        if (comparator.compare(line1, line2) <= 0) {
            // line1 sorts first, so write it out first
            writer.write(line1);
            writer.newLine();
            line1 = reader1.readLine();
        } else {
            writer.write(line2);
            writer.newLine();
            line2 = reader2.readLine();
        }
    }
    // drain whichever file still has lines left
    while (line1 != null) {
        writer.write(line1);
        writer.newLine();
        line1 = reader1.readLine();
    }
    while (line2 != null) {
        writer.write(line2);
        writer.newLine();
        line2 = reader2.readLine();
    }
}
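The aggregation mentioned above can be folded into the final merge, or done as one extra streaming pass over the fully sorted file. Here is a rough sketch of the latter, assuming the final sorted file is called sorted.txt, the output is aggregated.txt, and id and value are separated by a single space (the file names and the separator are my assumptions; the same charset variable as above is reused):
// Sketch: stream an id-sorted file and collapse consecutive rows
// with the same id into "id value1,value2,..." lines.
try (BufferedReader reader = Files.newBufferedReader(Paths.get("sorted.txt"), charset);
     BufferedWriter writer = Files.newBufferedWriter(Paths.get("aggregated.txt"), charset)) {
    String currentId = null;
    StringBuilder values = new StringBuilder();
    String line;
    while ((line = reader.readLine()) != null) {
        int sep = line.indexOf(' ');              // assumes "id value" rows
        String id = line.substring(0, sep);
        String value = line.substring(sep + 1);
        if (id.equals(currentId)) {
            values.append(',').append(value);     // same id as previous line: accumulate
        } else {
            if (currentId != null) {              // flush the previous group
                writer.write(currentId + " " + values);
                writer.newLine();
            }
            currentId = id;
            values.setLength(0);
            values.append(value);
        }
    }
    if (currentId != null) {                      // flush the last group
        writer.write(currentId + " " + values);
        writer.newLine();
    }
}
If duplicate values per id should be dropped (as in the example output in the question), a small per-id Set of values can be added to the loop.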
When dealing with such a large amount of data, we need to think outside the box and buffer the entire thing.
First: how does this already work elsewhere?
Let's say I have a 4 GB video and I'm trying to load it into my video player. The player basically needs to perform two main operations:
buffering - 'splitting' the video into chunks and reading one chunk at a time (the buffer)
streaming - displaying the result (the video) in my software (the player)
Why? Because it would be impossible to load everything into memory at once (and we don't even really need to: at any given moment the user observes a portion of the video from the buffer, which is a portion of the entire file).
Second: how can this help us?
We can do the same thing for large files:
split the main file into smaller files (each file contains X rows, where X is the 'buffer' size)
load each small file into Java and group it
save the result to a new file
After this process we have many small files that contain information like this:
id1 valueA,valueB
id2 valueA
id3 valueC,valueA
So each grouped file contains fewer rows than the original small file it was derived from.
We can now merge them back, load the result into Java, and re-group everything.
If the process fails (the merged file is still too big), we can merge the small grouped files into several intermediate grouped files instead, and repeat the process.
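A rough sketch of the "load a chunk and group it" step described above, assuming a plain HashMap for the in-memory grouping, an "id value" row format, and hypothetical chunk file names:
// Sketch: group one chunk (a small file of "id value" rows) in memory
// and write the grouped result to a new file.
Map<String, StringBuilder> groups = new HashMap<>();
try (BufferedReader reader = Files.newBufferedReader(Paths.get("chunk1.txt"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        int sep = line.indexOf(' ');              // assumes "id value" rows
        String id = line.substring(0, sep);
        String value = line.substring(sep + 1);
        StringBuilder sb = groups.get(id);
        if (sb == null) {
            groups.put(id, new StringBuilder(value));
        } else {
            sb.append(',').append(value);
        }
    }
}
try (BufferedWriter writer = Files.newBufferedWriter(Paths.get("chunk1.grouped.txt"))) {
    for (Map.Entry<String, StringBuilder> e : groups.entrySet()) {
        writer.write(e.getKey() + " " + e.getValue());
        writer.newLine();
    }
}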
The file doesn't fit in memory so I can't use simple Java Collections. Also, most of the lines are single (like id2 in the example) and those should be written directly to the output file.
My solution would be to use a BufferedReader to read your large file (realistically the only way to stream it).
Store the key-value pairs in Redis (if you are on a Linux environment) or MongoDB (if you are on Windows).
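A rough sketch of the Redis variant, using the Jedis client (my choice of client; the "id value" line format, file name, and local Redis instance are also assumptions). Storing each id's values as a Redis set deduplicates them for free:
// Sketch: push each id's values into a Redis set while streaming the file,
// then read one set back and join its members with commas.
try (Jedis jedis = new Jedis("localhost", 6379);
     BufferedReader reader = new BufferedReader(new FileReader("input.txt"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        int sep = line.indexOf(' ');              // assumes "id value" rows
        jedis.sadd(line.substring(0, sep), line.substring(sep + 1));
    }
    // Example lookup for one id; iterating all ids for the output file
    // would use SCAN rather than KEYS on a data set this large.
    String aggregated = String.join(",", jedis.smembers("id3"));
}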
Related
I'm processing 2 CSV files, checking for common entries and saving them into a new CSV file. However, the comparison is taking a lot of time. My approach is to first read all the data from the files into ArrayLists, then use a parallelStream over the main list; for each line I do a comparison against the other list and append the common entries to a StringBuilder, which is then saved to the new CSV file.
Below is my code for this.
allReconFileLines.parallelStream().forEach(baseLine -> {
String[] baseLineSplitted = baseLine.split(",|,,");
if (baseLineSplitted != null && baseLineSplitted.length >= 13 && baseLineSplitted[13].trim().equalsIgnoreCase("#N/A")) {
for (int i = 0; i < allCompleteFileLines.size(); i++) {
String complteFileLine = allCompleteFileLines.get(i);
String[] reconLineSplitted = complteFileLine.split(",|,,");
if (reconLineSplitted != null && reconLineSplitted[3].replaceAll("^\"|\"$", "").trim().equals(baseLineSplitted[3].replaceAll("^\"|\"$", "").trim())) {
//pw.write(complteFileLine);
matchedLines.append(complteFileLine);
break;
}
}
}
});
pw.write(matchedLines.toString());
Currently it is taking hours to process. How can I make it quicker?
Read the keys of one file into e.g. a HashSet, and then as you're reading the second file, for each line check if it's in the set and if so write it out. This way you only need enough memory to keep the keys of one file.
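A rough sketch of that idea, reusing the asker's variable names, writer (pw) and column positions (index 3 for the key, index 13 for the '#N/A' flag) as given in the question; once the nested scan is gone, no parallelStream is needed:
// Sketch: collect the wanted keys once, then do O(1) set lookups
// instead of scanning the whole second list for every line.
Set<String> wantedKeys = new HashSet<>();
for (String baseLine : allReconFileLines) {
    String[] cols = baseLine.split(",|,,");
    if (cols.length > 13 && cols[13].trim().equalsIgnoreCase("#N/A")) {
        wantedKeys.add(cols[3].replaceAll("^\"|\"$", "").trim());
    }
}
StringBuilder matched = new StringBuilder();
for (String completeLine : allCompleteFileLines) {
    String[] cols = completeLine.split(",|,,");
    if (cols.length > 3 && wantedKeys.contains(cols[3].replaceAll("^\"|\"$", "").trim())) {
        matched.append(completeLine).append(System.lineSeparator());
    }
}
pw.write(matched.toString());   // pw is the asker's existing writer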
I have a CSV file of nearly 2 million rows with 3 columns (item, rating, user). I am able to transfer the data into a 2D String array or list. However, my issue arises when I am trying to parse through the arrays to create CSV files from because the application stops and I do not know how long I am expected to wait for the program to finish running.
Basically, my end goal is to be able to parse through large CSV file, create a matrix in which each distinct item represents a row and each distinct user represents a column with the rating being at the intersection of the user and item. With this matrix, I then create a cosine similarity matrix with the rows and columns represented by items with their cosine similarity being at the intersection of the two distinct items.
I already know how to create CSV files, but my issue falls within the large loop structures when creating other arrays for the purposes of comparison.
Is there a better way to be able to process and calculate large amounts of data so that my application doesn't freeze?
My current program does the following:
1. Take large CSV file
2. Parse through large CSV file
3. Create 2D array resembling original CSV file
4. Create list of distinct items (each distinct item being represented by an index number)
5. Create list of distinct users (each distinct user being represented by an index number)
6. Create 2D array with row indexes representing items and column indexes representing users, resulting in array[row][column] = rating
7. Calculate the cosine similarity between pairs of items (rows of the matrix)
8. Create 2D array with both row and column indexes representing items, resulting in array[row][column] = cosine similarity
I noticed that my program freezes when it reaches steps 4 and 5
If I remove steps 4 and 5, it will still freeze at step 6
I have attached that portion of my code
FileInputStream stream = null;
Scanner scanner = null;
try{
stream = new FileInputStream(fileName);
scanner = new Scanner(stream, "UTF-8");
while (scanner.hasNextLine()){
String line = scanner.nextLine();
if (!line.equals("")){
String[] elems = line.split(",");
if (itemList.isEmpty()){
itemList.add(elems[0]);
}
else{
if (!itemList.contains(elems[0]))
itemList.add(elems[0]);
}
if (nameList.isEmpty()){
nameList.add(elems[2]);
}
else{
if (!nameList.contains(elems[2]))
nameList.add(elems[2]);
}
for (int i = 0; i < elems.length; i++){
if (i == 1){
if (elems[1].equals("")){
list.add("0");
}
else{
list.add(elems[1]);
}
}
else{
list.add(elems[i]);
}
}
}
}
if (scanner.ioException() != null){
throw scanner.ioException();
}
}
catch (IOException e){
System.out.println(e);
}
finally{
try{
if (stream != null){
stream.close();
}
}
catch (IOException e){
System.out.println(e);
}
if (scanner != null){
scanner.close();
}
}
You can try setting -Xms and -Xmx. If you're using default values, it's possible you just need more memory allocated to the JVM.
In addition to that, you could modify your code so it doesn't treat everything as String. For the score column (which is presumably numeric), you should be able to parse that as a numeric value and store that instead of the string representation. Why? Strings use a lot more memory than numeric values. Even an empty string uses 40 bytes, whereas a single numeric value can be as little as one byte.
If a single byte could work (numeric range is -128 to 127), then you could replace ~80MB memory usage with ~2MB. Even using int (4 bytes) would be a huge improvement over String. If there are any other numeric (or boolean) values present in the data, you could make further reductions.
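A rough sketch of that idea, assuming ratings are small integers that fit into a byte, that an empty rating is treated as 0 as in the asker's code, and that the array size can be pre-allocated from the known ~2M row count (all assumptions):
// Sketch: keep ratings as one byte each instead of one String object each.
byte[] ratings = new byte[2_000_000];          // ~2 MB instead of tens of MB of Strings
int count = 0;                                 // grow with Arrays.copyOf if the row count is unknown
try (BufferedReader reader = new BufferedReader(new FileReader(fileName))) {
    String line;
    while ((line = reader.readLine()) != null) {
        if (line.isEmpty()) {
            continue;
        }
        String[] elems = line.split(",");
        String r = elems[1].trim();
        // empty rating column treated as 0, as in the original code
        ratings[count++] = r.isEmpty() ? (byte) 0 : Byte.parseByte(r);
    }
}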
Is there a smart way to create a 'JSON-like' structure of String - Float pairs, 'key' not needed as data will be grabbed randomly - although an incremented key from 0-n might aid random retrieval of associated data. Due to the size of data set (10k pairs of values), I need this to be saved out to an external file type.
The reason is how my data will be compiled. To save someone entering data into an array manually, the items will be Excel-based, saved out to CSV and parsed by a temporary Java program into a file format (for example JSON) which can be added to my project's resources folder. I can then retrieve data from this set without my application having to load a huge array into memory at application creation. I could quite easily parse the CSV to 'fill up' an array (or similar) at run-time, but I fear that on a mobile device the memory overhead would be significant.
I have reviewed the answers to: Suitable Java data structure for parsing large data file and Data structure options for efficiently storing sets of integer pairs on disk? and have not been able to draw a definitive conclusion.
I have tried saving to a .JSON file, however I am not sure I can request a random entry, plus this seems quite cumbersome for holding such a simple structure. Is a TreeMap or Hashtable where I should be focusing my search?
To provide some context to my query: my application will be running on Android, and needs to reference a definition (approx. 500-character String) and a conversion factor (a Float). I need to retrieve a random data entry. The user may only make 2 or 3 requests during a session, so I see no point in loading a 10k-element array into memory. Query: will modern Android phones easily munch through this type of query, and is it perhaps only an issue if I am parsing millions of entries at run-time?
I am open to using SQLite to hold my data if this will provide the functionality required. Please note that the data set must be derived from an easily exportable file format from Excel (CSV, TXT etc.).
Any advice you can give me would be much appreciated.
Here's one possible design that requires a minimal memory footprint while providing fast access:
Start with a data file of comma-separated or tab-separated values so you have line breaks between your data pairs.
Keep an array of long values holding the byte offsets of the line starts in the data file. Once you know where each line starts, you can use InputStream.skip() to advance to the desired line. This leverages the fact that skip() is typically quite a bit faster than read() for InputStreams.
You would have some setup code that would run at initialization time to index the lines.
An enhancement would be to only index every nth line so that the array is smaller. So if n is 100 and you're accessing line 1003, you take the 10th index to skip to line 1000, then read past two more lines to get to line 1003. This allows you to tune the size of the array to use less memory.
I thought this was an interesting problem, so I put together some code to test my idea. It uses a sample 4MB CSV file that I downloaded from some big data website that has about 36K lines of data. Most of the lines are longer than 100 chars.
Here's code snippet for the setup phase:
long start = SystemClock.elapsedRealtime();
int lineCount = 0;
try (InputStream in = getResources().openRawResource(R.raw.fl_insurance_sample)) {
int index = 0;
int charCount = 0;
int cIn;
while ((cIn = in.read()) != -1) {
charCount++;
char ch = (char) cIn; // this was for debugging
if (ch == '\n' || ch == '\r') {
lineCount++;
if (lineCount % MULTIPLE == 0) {
index = lineCount / MULTIPLE;
if (index == mLines.length) {
mLines = Arrays.copyOf(mLines, mLines.length + 100);
}
mLines[index] = charCount;
}
}
}
mLines = Arrays.copyOf(mLines, index+1);
} catch (IOException e) {
Log.e(TAG, "error reading raw resource", e);
}
long elapsed = SystemClock.elapsedRealtime() - start;
I discovered my data file was actually separated by carriage returns rather than line feeds. It must have been created on an Apple computer. Hence the test for '\r' as well as '\n'.
Here's a snippet from the code to access the line:
long start = SystemClock.elapsedRealtime();
int ch;
int line = Integer.parseInt(editText.getText().toString().trim());
if (line < 1 || line >= mLines.length) {
    mTextView.setText("invalid line: " + line);
    return;
}
line--;
int index = (line / MULTIPLE);
in.skip(mLines[index]);
int rem = line % MULTIPLE;
while (rem > 0) {
ch = in.read();
if (ch == -1) {
return; // readLine will fail
} else if (ch == '\n' || ch == '\r') {
rem--;
}
}
BufferedReader reader = new BufferedReader(new InputStreamReader(in));
String text = reader.readLine();
long elapsed = SystemClock.elapsedRealtime() - start;
My test program used an EditText so that I could input the line number.
So to give you some idea of performance, the first phase averaged around 1600ms to read through the entire file. I used a MULTIPLE value of 10. Accessing the last record in the file averaged about 30ms.
To get down to 30ms access with only a 29312-byte memory footprint is pretty good, I think.
You can see the sample project on GitHub.
I have a dense symmetric matrix of size about 30000 X 30000 that contains distances between strings. Since the distance is symmetric, the upper triangle of the matrix is stored in a tab-separated 3-column file of the form
stringA<tab>stringB<tab>distance
I am using HashMap and org.javatuples.Pair to create a map to quickly look up distances for given pairs of string as follows:
import org.javatuples.Pair;
HashMap<Pair<String,String>,Double> pairScores = new HashMap<Pair<String,String>,Double>();
BufferedReader bufferedReader = new BufferedReader(new FileReader("data.txt"));
String line = null;
while((line = bufferedReader.readLine()) != null) {
String [] parts = line.split("\t");
String d1 = parts[0];
String d2 = parts[1];
Double score = Double.parseDouble(parts[2]);
Pair<String,String> p12 = new Pair<String,String>(d1,d2);
Pair<String,String> p21 = new Pair<String,String>(d2,d1);
pairScores.put(p12, score);
pairScores.put(p21, score);
}
data.txt is very big (~400M lines) and the process eventually slows down to a crawl with most time being spent in java.util.HashMap.put.
I don't think there should be (m)any hash code collisions on the pairs, but I might be wrong. How can I verify this? Is it enough to simply look at how unique p12.hashCode() and p21.hashCode() are?
If there are no collisions, what else could be causing the slowdown?
Is there a better way to construct this matrix for quick lookup?
I am now using Guava's Table<Integer, Integer, Double>, after also realizing that my strings are unique enough that I could use their hashes, instead of the strings themselves, as keys to reduce memory requirements. The creation of the table runs in reasonable time; however, there are issues with serializing and deserializing the resulting objects: I ran into out-of-memory errors even after the move from String to Integer. It seems to be working after I decided not to store both a-b and b-a pairs, but I might be balancing on the edge of what my machine can handle.
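A rough sketch of that layout, assuming Guava's HashBasedTable and String.hashCode() as the integer key (d1, d2 and score refer to the variables in the loop above); note that a hash collision between different strings would silently merge their entries:
import com.google.common.collect.HashBasedTable;
import com.google.common.collect.Table;

Table<Integer, Integer, Double> pairScores = HashBasedTable.create();

// inside the read loop, instead of two HashMap puts:
pairScores.put(d1.hashCode(), d2.hashCode(), score);          // store only the a-b orientation

// at lookup time, try both orientations since b-a is no longer stored:
Double s = pairScores.get(d1.hashCode(), d2.hashCode());
if (s == null) {
    s = pairScores.get(d2.hashCode(), d1.hashCode());
}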
What I am trying to do is assign a number value to a group of zip codes and then use that number to do a calculation later on in the program. What I have now is
if (zip == 93726 || zip == 93725 || zip == 92144) {
return 100;
}
else if (zip == 94550 || zip == 34599 || zip == 77375) {
return 150;
}
and then I take the variable and use it in a calculation. Assigning the number to the variable and the calculation all work, but what I have run into is that apparently Android only allows a method to contain so much code, and I have run out of room using just if-else statements. My question is: what would be a better way to go about this?
I am not trying to assign a city to each zip, because I have seen from other posters that there are services that do that.
a. You can either use a
switch (zip)
{
case 93726: case 93725: case 92144: return 100;
case 94550: case 34599: case 77375: return 150;
}
-
b. Or create a HashMap<Integer,Integer> and fill it with 'zip to value' entries, which should give you much better performance if you have that many cases.
Map<Integer,Integer> m = new HashMap<>();
m.put(93726,100);
later you could call
return m.get(zip);
-
c. If your zip count is in the tens of thousands and you want to work all in memory, then you should consider just holding a hundred-thousand int sized array:
int[] arr=new int[100000];
and filling it with the data:
arr[93726]=100;
.
.
.
You should probably use String constants for your ZIP codes since:
in some places, some start with 0
in some places, they may contain letters as well as numbers
If you are using an Object (either String or Integer) for your constants, I have often used Collection.contains(zip) as a lazy shortcut for the condition in each if statement. That collection contains all the constants for that condition, and you would probably use an implementation geared towards fast lookup, such as HashSet. Keep in mind that if you use a HashMap solution as suggested elsewhere, your keys will be Integer objects too, so you will do hashing on the keys in either case; you just won't need to store the result values in the collection-based suggestion.
I suspect that for a large collection of constants, hashing may turn out to be faster than having to work through the large number of == conditions in the if statement until you get to the right condition. (It may help a bit if the most-used constants come first, and in the first if statement...)
On a completely different note (i.e. strategy instead of code), you should see if you could group your ZIPs. What I mean is, for example, that if you know that all (or most) ZIPs of the forms "923xx" and "924xx" result in a return of 250, you could potentially shorten your conditionals considerably. E.g. zip.startsWith("923") || zip.startsWith("924") for String ZIPs, or (zip / 100) == 923 || (zip / 100) == 924 for int.
A small number of more specific exceptions to the groups can still be handled, you just need to have the more specific conditionals before the more general conditionals.
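A rough sketch combining both suggestions, using the constants from the question plus a hypothetical 923xx/924xx group worth a made-up 250 points:
// Sketch: one Set per return value, checked before a prefix-based group.
private static final Set<String> ZIPS_100 = new HashSet<>(Arrays.asList("93726", "93725", "92144"));
private static final Set<String> ZIPS_150 = new HashSet<>(Arrays.asList("94550", "34599", "77375"));

static int pointsFor(String zip) {
    if (ZIPS_100.contains(zip)) return 100;
    if (ZIPS_150.contains(zip)) return 150;
    if (zip.startsWith("923") || zip.startsWith("924")) return 250;  // hypothetical group
    return 0;  // hypothetical default
}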
Use declarative data, especially as the ZIP codes might get updated, extended, or corrected.
Using, for instance, a zip.properties file:
points = 100, 150
zip100 = 93726, 93725, 92144
zip150 = 94550, 34599, 77375,\
88123, 12324, 23424
And
Map<Integer, Integer> zipToPoints = new HashMap<>();
If you have ZIP codes with leading zeroes, it may be better to use String, or take care to parse them with base 10 (a leading zero means base 8, octal, to some parsing methods).
Whether such a line-count limitation really exists I do not know, but the small extra coding effort is worth it to have everything as data.
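A rough sketch of loading that zip.properties layout into the map (the key naming scheme is the one defined above; the file path is my assumption):
// Sketch: read zip.properties and build the zip -> points map from it.
Properties props = new Properties();
try (InputStream in = new FileInputStream("zip.properties")) {
    props.load(in);
}
Map<Integer, Integer> zipToPoints = new HashMap<>();
for (String p : props.getProperty("points").split("\\s*,\\s*")) {
    int points = Integer.parseInt(p.trim());
    // e.g. the key "zip100" lists all ZIP codes worth 100 points
    for (String zip : props.getProperty("zip" + points).split("\\s*,\\s*")) {
        zipToPoints.put(Integer.parseInt(zip.trim(), 10), points);   // base 10, as noted above
    }
}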
Have a map of the zip codes and just return the value.
Map<Integer,Integer> zipValues = new HashMap<Integer,Integer>();
zipValues.put(93726,100);
.
.
And so on. Or you can read from a properties file and populate the map.
Then, instead of using the ifs:
return zipValues.get(zipCode);
So, say zipCode = 93726: it will return 100.
Cheers.
You could create a Map which maps each zip code to an integer. For example:
Map<Integer, Integer> zipCodeMap = new HashMap<>();
zipCodeMap.put(93726, 100);
And later you can retrieve values from there.
Integer value = zipCodeMap.get(93726);
Now you still have to map each zipcode to a value. I would not do that in Java code but rather use a database or read from a text file (csv for example). This depends mostly on your requirements.
Example csv file:
93726, 100
93727, 100
93728, 150
Reading from a csv file:
InputStream is = getClass().getResourceAsStream("/data.csv");
BufferedReader reader = new BufferedReader(new InputStreamReader(is));
try {
String line;
while ((line = reader.readLine()) != null) {
String[] row = line.split(",");
int zip = Integer.parseInt(row[0].trim());
int value = Integer.parseInt(row[1].trim());
zipCodeMap.put(zip, value);
}
}
catch (IOException ex) {
// handle exceptions here
}
finally {
try {
is.close();
}
catch (IOException e) {
// handle exceptions here
}
}