Java: Compare and update large files

Here's my problem statement. I have 3 large text files (each <5GB) each with 100 million lines (records). I want to compare files 1 and 2 and update the value in file 3. A single record in each file would look like this:
File 1
PrimaryValue1|OldValue1
File 2
PrimaryValue1|NewValue1
File 3
Field1Value|Field2Value|Field3Value|OldValue1|Field5Value....|Field100Value
All I have for every record is the OldValue1, which is unique for every record. Now, using the 'PrimaryValue1', I need to get the NewValue1 corresponding to the OldValue1 from files 1 & 2, and then write this NewValue1 into file3 in place of OldValue1. Both OldValue and NewValue are unique for every record.
I understand that if I read all the files into memory, then I will be able to compare and replace the values. Since this could be memory-intensive, I would like to know if there are better approaches to handle this scenario.

File1 and File2 define the replacements that you have to apply to File3.
I'd do the following steps:
Read File2 into a HashMap newValues.
Read File1, and for every entry:
Look up the primary value in newValues.
If found, put the old value / new value pair into a HashMap replacement.
Delete the entry from newValues (I guess, there are no duplicates).
Read File3, and for every entry :
Do the replacements from the replacement Map.
Write the resulting line to the output.
You need memory for the replacements, but the main file can be processed sequentially, needing just one record buffer.
If you get an OutOfMemoryError, this will probably happen while building the replacement Map, before you start working on the main File3. You can then modify the program to work on manageable chunks of File2. The rest of the algorithm, sketched below, need not change.
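Here is a minimal sketch of those steps in Java, assuming '|' as the delimiter and that an old value may appear in any field of File3. The file names are placeholders, and error handling and the chunking fallback are omitted:

import java.io.*;
import java.util.HashMap;
import java.util.Map;

public class ReplaceOldValues {
    public static void main(String[] args) throws IOException {
        // Step 1: File2 maps primary value -> new value
        Map<String, String> newValues = new HashMap<>();
        try (BufferedReader r = new BufferedReader(new FileReader("file2.txt"))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] f = line.split("\\|", 2);
                newValues.put(f[0], f[1]);
            }
        }

        // Step 2: File1 joins old value -> new value via the primary value
        Map<String, String> replacement = new HashMap<>();
        try (BufferedReader r = new BufferedReader(new FileReader("file1.txt"))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] f = line.split("\\|", 2);
                String newValue = newValues.remove(f[0]);   // look up and drop in one step
                if (newValue != null) {
                    replacement.put(f[1], newValue);        // old value -> new value
                }
            }
        }

        // Step 3: stream File3, replacing any field that matches an old value
        try (BufferedReader r = new BufferedReader(new FileReader("file3.txt"));
             PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("file3_updated.txt")))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] fields = line.split("\\|", -1);
                for (int i = 0; i < fields.length; i++) {
                    String replaced = replacement.get(fields[i]);
                    if (replaced != null) {
                        fields[i] = replaced;
                    }
                }
                out.println(String.join("|", fields));
            }
        }
    }
}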

Related

Filling a file that needs values from 1000 other files - Java

Suppose you have this .csv that we'll name "toComplete":
[Date,stock1, stock2, ...., stockn]
[30-jun-2015,"NA", "NA", ...., "NA"]
....
[30-Jun-1994,"NA","NA",....,"NA"]
with n = 1000 and number of rows = 5000. Each row is for a different date. That's kind of a big file and I'm not really used to it.
My goal is to fill the "NA" with values I'll take from other .csv files.
In fact, I have 1 file (still a .csv) for each stock. This means I have 1000 stock files plus my file "toComplete".
Here is what the stock files look like:
[Date, value1, value2]
[27-Jun-2015, v1, v2]
....
[14-Feb-2013,z1,z2]
There are fewer dates in each stock's file than in the "toComplete" file, and each date in a stock's file is necessarily in "toComplete"'s file.
My question is: what is the best way to fill my file "toComplete"? I tried reading it line by line, but this is very slow: for every line of "toComplete" I read the 1000 stock files to complete it. I think there are better solutions but I can't see them.
EDIT :
For example, to replace the "NA" in the second row and second column of "toComplete", I need to open my file stock1 and read it line by line to find the value of value1 corresponding to the date of the second row in "toComplete".
I hope it makes more sense now.
EDIT2 :
Dates are edited. For a lot of stocks, I won't have values. In this example, we only have dates from 14-Feb-2013 to 27-Jun-2015, which means some "NA" values will remain at the end (but that's not a problem).
I know in which files to search because my files are named stock1.csv, stock2.csv, ... I put them in a unique directory so I can use .list() method.
So you have 1000 "price history" CSV files for certain stocks, each containing up to 5000 days of price history, and you want to combine the data from those files into one CSV file where each line starts with a date and the rest of the entries on the line are the up to 1000 different stock prices for that historical day? Back-of-the-napkin calculations indicate the final file would contain on the order of 100 MB of data (less than 20 bytes per stock price means less than 20 KB per line, times 5k lines). There should still be enough room in a 256/512 MB JVM to read the data you want to keep from those 1000 files into a Map where the keys are the dates and the value for each key is another Map with 1000 stock symbol keys and their 1000 stock values. Then write out your final file by iterating the Map(s).
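A rough sketch of that Map-of-Maps approach, assuming the stock files are named stock1.csv, stock2.csv, ... in a single directory, each has a one-line header, and the first value column is the one to copy into "toComplete". The file names, paths, and column positions are assumptions, not details taken from the question:

import java.io.*;
import java.util.*;

public class FillToComplete {
    public static void main(String[] args) throws IOException {
        File dir = new File("stocks");                       // assumed directory of stock*.csv files
        // date -> (stock file name -> value1)
        Map<String, Map<String, String>> byDate = new HashMap<>();

        for (File stockFile : dir.listFiles((d, name) -> name.startsWith("stock"))) {
            try (BufferedReader r = new BufferedReader(new FileReader(stockFile))) {
                r.readLine();                                // skip the header row
                String line;
                while ((line = r.readLine()) != null) {
                    String[] cols = line.split(",");
                    byDate.computeIfAbsent(cols[0].trim(), k -> new HashMap<>())
                          .put(stockFile.getName(), cols[1].trim());
                }
            }
        }

        // Rewrite "toComplete": keep the header, then fill each NA from the map
        try (BufferedReader r = new BufferedReader(new FileReader("toComplete.csv"));
             PrintWriter out = new PrintWriter(new FileWriter("toComplete_filled.csv"))) {
            String[] header = r.readLine().split(",");
            out.println(String.join(",", header));
            String line;
            while ((line = r.readLine()) != null) {
                String[] cols = line.split(",");
                Map<String, String> values = byDate.getOrDefault(cols[0].trim(), Map.of());
                for (int i = 1; i < cols.length; i++) {
                    // header[i] is assumed to match the stock file name minus ".csv"
                    cols[i] = values.getOrDefault(header[i].trim() + ".csv", "\"NA\"");
                }
                out.println(String.join(",", cols));
            }
        }
    }
}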

How map reduce reads from the input file to get keys and values

I am trying to implement shortest path using map reduce, and this is my input file (key and value):
Source Node <Destination node,Weight>
1 <2,3>
1 <3,1>
2 <2,1>
2 <3,4>
and so on. I know that at run time the input file is picked up from HDFS using something like $HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR jar Assignment3.jar InputMatrix.txt in a bash script submitted to the cluster. But I don't understand how the mapper gets the key and value: do I need to tokenize the input file to get the key and weights? I am thinking of getting the least associated weight, so my reducer gets something like [1,<2,3>, 1,<3,1>] and loops over the associated weights to find the least value, which in this case is 1 for key 1. But I don't understand how, at runtime, the keys are made available to the mapper, and how the parsing is done to get them (in the above input file the keys are separated from the values by tabs ("\t"), and the values themselves are ","-separated).
It depends on the input format, but if you are using the standard input format (TextInputFormat), it processes the file line by line. The input for the map task is then the byte offset in the file as the key and the line of your input file as the value. You can split the line on the tab character yourself. An alternative would be to use a key-value input format (KeyValueTextInputFormat), which splits key and value on the tab for you.
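For illustration, a minimal mapper sketch under that assumption (default TextInputFormat, key and value separated by a tab). The class name and the choice of Text output types are placeholders, not anything prescribed by the question:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ShortestPathMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Split "1\t<2,3>" into the source node and the <destination,weight> pair
        String[] parts = line.toString().split("\t");
        if (parts.length == 2) {
            context.write(new Text(parts[0]), new Text(parts[1]));
        }
    }
}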

How can I improve performance of string processing with less memory?

I'm implementing this in Java.
Symbol file Store data file
1\item1 10\storename1
10\item20 15\storename6
11\item6 15\storename9
15\item14 1\storename250
5\item5 1\storename15
The user will search store names using wildcards like storename?
My job is to search the store names and produce a full string using symbol data. For example:
item20-storename1
item14-storename6
item14-storename9
My approach is:
reading the store data file line by line
if any line contains matching search string (like storename?), I will push that line to an intermediate store result file
I will also copy the itemno of a matching storename into an arraylist (like 10,15)
when this arraylist size%100==0 then I will remove duplicate item no's using hashset, reducing arraylist size significantly
when arraylist size >1000
sort that list using Collections.sort(itemno_arraylist)
open symbol file & start reading line by line
for each line Collections.binarySearch(itemno_arraylist, itemno)
if matching then push result to an intermediate symbol result file
continue with step 1 until EOF of the store data file
...
After all of this I would combine the two result files (symbol result file & store result file) to present the actual list of strings.
This approach works, but it consumes a lot of CPU time and main memory.
I want to know a better solution with reduced CPU time (currently 2 min) & memory (currently 80MB). There are many collection classes available in Java. Which one would give a more efficient solution for this kind of huge string processing problem?
Any thoughts on this kind of string-processing problem, particularly in Java, would be great and helpful.
Note: Both files would be nearly a million lines long.
Replace the two flat files with an embedded database (there are plenty of them; I have used SQLite and Db4O in the past): problem solved.
So you need to replace 10\storename1 with item20-storename1 because the symbol file contains 10\item20. The obvious solution is to load the symbol file into a Map:
String[] tokens = symbolFile.readLine().split("\\\\");
map.put(tokens[0], tokens[1]);
Then read the store file line by line and replace:
String[] tokens = storeFile.readLine().split("\\\\");
output.println(map.get(tokens[0]) + '-' + tokens[1]);
This is the fastest method, though it still uses a lot of memory for the map. You can reduce the memory by storing the map in a database, but this would increase the time significantly.
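Putting those fragments together, a self-contained sketch might look like this, assuming both files use '\' as the separator and that every item number in the store file also appears in the symbol file; the file names are placeholders:

import java.io.*;
import java.util.HashMap;
import java.util.Map;

public class SymbolStoreJoin {
    public static void main(String[] args) throws IOException {
        Map<String, String> itemByNo = new HashMap<>();
        try (BufferedReader symbolFile = new BufferedReader(new FileReader("symbol.txt"))) {
            String line;
            while ((line = symbolFile.readLine()) != null) {
                String[] tokens = line.split("\\\\");     // "10\item20" -> ["10", "item20"]
                itemByNo.put(tokens[0], tokens[1]);
            }
        }
        try (BufferedReader storeFile = new BufferedReader(new FileReader("store.txt"));
             PrintWriter output = new PrintWriter(new FileWriter("result.txt"))) {
            String line;
            while ((line = storeFile.readLine()) != null) {
                String[] tokens = line.split("\\\\");     // "10\storename1" -> ["10", "storename1"]
                output.println(itemByNo.get(tokens[0]) + '-' + tokens[1]);
            }
        }
    }
}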
If your input data file is not changing frequently, then parse the file once and put the data into a List of a custom class, e.g. FileStoreRecord, mapping each record in the file. Define an equals method on your custom class. Perform all subsequent steps over the List; e.g. for search, you can call the contains method, passing the search string in the form of the custom object FileStoreRecord.
If the file changes after some time, you may want to refresh the List after a certain interval, or keep track of the list creation time and compare it against the file's last-modified timestamp before using it; if different, recreate the list. Another way to manage the file check would be to have a thread continuously polling the file for updates and, the moment it is updated, triggering a refresh of the list.
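For illustration, a minimal version of such a custom class; the field names are assumptions:

import java.util.Objects;

public class FileStoreRecord {
    final String itemNo;
    final String storeName;

    FileStoreRecord(String itemNo, String storeName) {
        this.itemNo = itemNo;
        this.storeName = storeName;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof FileStoreRecord)) return false;
        FileStoreRecord other = (FileStoreRecord) o;
        return Objects.equals(itemNo, other.itemNo)
            && Objects.equals(storeName, other.storeName);
    }

    @Override
    public int hashCode() {
        return Objects.hash(itemNo, storeName);
    }
}

A List<FileStoreRecord> can then answer queries such as records.contains(new FileStoreRecord("10", "storename1")).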
Is there any limitation that prevents you from using a Map?
You can add the items to a Map and then search easily.
1 million records means 1M * recordSize of memory, so it should not be a problem.
Map<Integer, Item> itemMap = new HashMap<>();
...
Item item= itemMap.get(store.getItemNo());
But the best solution would still be to use a database.

java solution for hashing lines that contain variables in .csv

I have a file that represents a table recorded in .csv or a similar format. The table may include missing values.
I am looking for a solution (preferably in Java) that would process my file in an incremental manner, without loading everything into memory, as my file can be huge. I need to identify duplicate records in my file, being able to specify which columns I want to exclude from consideration, and then produce an output grouping those duplicate records. I would add an additional value at the end with a group number and output in the same format (.csv), sorted by group number.
I hope an effective solution can be found with some hashing function. For example, reading all lines and storing a hash value with each line number, the hash being calculated based on the set of variables I provide as input.
Any ideas?
Ok, here is the paper that holds the key to the answer: P. Gopalan & J. Radhakrishnan "Finding duplicates in a data stream".
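If holding one key per distinct record in memory is acceptable (memory grows with the number of distinct keys, not with whole lines), a minimal sketch of the column-based grouping idea from the question could look like this. The file names and excluded column indexes are placeholders, the concatenated key string is used directly instead of a numeric hash to avoid collisions, and the final sort by group number is left out:

import java.io.*;
import java.util.*;

public class GroupDuplicates {
    public static void main(String[] args) throws IOException {
        Set<Integer> excluded = Set.of(2, 5);          // columns to ignore (0-based), placeholder values
        Map<String, Integer> groupByKey = new HashMap<>();

        try (BufferedReader in = new BufferedReader(new FileReader("input.csv"));
             PrintWriter out = new PrintWriter(new FileWriter("output.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split(",", -1);   // assumes plain commas, no quoted fields
                StringBuilder key = new StringBuilder();
                for (int i = 0; i < cols.length; i++) {
                    if (!excluded.contains(i)) {
                        key.append(cols[i]).append('\u0001'); // separator unlikely to appear in data
                    }
                }
                // first occurrence of a key gets the next group number; duplicates reuse it
                int group = groupByKey.computeIfAbsent(key.toString(), k -> groupByKey.size() + 1);
                out.println(line + "," + group);
            }
        }
    }
}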

Read two files (text) and compare for common values and output the string?

Question: I have two files, one with a list of serial number, item, price, and location, and the other file has items. I would like to compare the two files and print out the number of times the items are repeated in file1, along with the serial number.
Text1 file will have
Text2 file will have
Output should be
So file1 is not formatted in a proper order, while file2 is in order (line by line).
Since you have no apparent code or effort put into this, I'll only hint/guide you to some tools you can use.
For parsing strings: http://docs.oracle.com/javase/6/docs/api/java/lang/String.html
For reading in from a file: http://www.roseindia.net/java/beginners/java-read-file-line-by-line.shtml
And I would recommend reading file #2 first and saving those values to an arraylist, perhaps, so you can iterate through them later on when you do your searching.
Okay, my approach to this would be:
Read file1 and file2 into strings
"Split" the string from file1, as well as file2, on "," if that is what is being used
Check for the item in every 3rd token, so the iteration advances by 3 each time (you might need to sort both of these if they are not in order)
If found, store it in an Array, ArrayList, etc. Go back to step 3 if more items are present; else stop
Even though your file1 is not well formatted, its content has some patterns which you can use to read it successfully.
For each item, it has all the information (i.e. serial number, name, price, location), but not in a fixed order. So you have to pay attention to and use the following patterns while you read each item from file1:
Serial number is always a plain integer.
Price has that $ and . character.
Location is 2-character long, all capital.
And the name is a string that does not match any of the above.
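A small sketch of that pattern-based classification, assuming file1 is comma-separated and each record carries exactly those four tokens in some order; the regular expressions and class name are illustrative assumptions:

import java.util.regex.Pattern;

public class RecordParser {
    static final Pattern SERIAL = Pattern.compile("\\d+");            // plain integer
    static final Pattern PRICE = Pattern.compile("\\$\\d+\\.\\d{2}");  // has $ and .
    static final Pattern LOCATION = Pattern.compile("[A-Z]{2}");        // 2 capital letters

    public static void classify(String rawRecord) {
        for (String token : rawRecord.split(",")) {
            String t = token.trim();
            if (SERIAL.matcher(t).matches()) {
                System.out.println("serial number: " + t);
            } else if (PRICE.matcher(t).matches()) {
                System.out.println("price: " + t);
            } else if (LOCATION.matcher(t).matches()) {
                System.out.println("location: " + t);
            } else {
                System.out.println("item name: " + t);   // anything that matches none of the above
            }
        }
    }
}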
Such problems are not best solved by monolithic Java code. If you don't have a tool constraint, then the recommended way to solve it is to import the data from file1 into a database table and then run queries from your program to fetch whatever information you like. You can easily select serial numbers based on items and group them for counts based on location.
This approach will ensure that you can keep up with changing requirements and if your files are huge you will have good performance.
I hope you are well versed with SQL and DB tools, so I have not posted any details on them.
Use regex.
Step one, split the text at [\d,] and store the results in a map.
Step two, read in the word from the second file; say it's "pen".
Step three, do a regex search for "pen" on each string within the map.
Step four, if the above returns true, do something like ([A-Z][A-Z],) on each string within the map.
