I have a file that represents a table recorded in .csv or a similar format. The table may include missing values.
I am looking for a solution (preferably in Java) that would process my file incrementally, without loading everything into memory, as the file can be huge. I need to identify duplicate records in the file, being able to specify which columns to exclude from consideration, and then produce an output that groups those duplicate records. I would add an additional value at the end with a group number and write the output in the same format (.csv), sorted by group number.
I hope an effective solution can be found with some hashing function: for example, reading all lines and storing a hash value with each line number, the hash being calculated from the set of columns I provide as input, roughly as sketched below.
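A minimal first-pass sketch of that idea (assuming Java 9+, comma-separated columns, and a 0-based set of excluded column indices; a hash collision would still require verifying the colliding lines, and a second pass would append the group numbers and sort):

import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DuplicateIndexer {
    public static void main(String[] args) throws Exception {
        Set<Integer> excludedColumns = Set.of(2);        // hypothetical: ignore column 2
        Map<Integer, List<Long>> hashToLines = new HashMap<>();

        // Only the hash -> line-number index is kept in memory, not the lines themselves.
        try (BufferedReader in = Files.newBufferedReader(Paths.get("input.csv"), StandardCharsets.UTF_8)) {
            String line;
            long lineNo = 0;
            while ((line = in.readLine()) != null) {
                lineNo++;
                String[] cols = line.split(",", -1);
                StringBuilder key = new StringBuilder();
                for (int i = 0; i < cols.length; i++) {
                    if (!excludedColumns.contains(i)) {
                        key.append(cols[i]).append('\u0001');  // separator unlikely to occur in the data
                    }
                }
                hashToLines.computeIfAbsent(key.toString().hashCode(), h -> new ArrayList<>())
                           .add(lineNo);
            }
        }

        // Entries with more than one line number are candidate duplicate groups.
        hashToLines.values().stream()
                   .filter(lines -> lines.size() > 1)
                   .forEach(System.out::println);
    }
}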
Any ideas?
Ok, here is the paper that holds the key to the answer: P. Gopalan & J. Radhakrishnan "Finding duplicates in a data stream".
Here is what I want to do. Now I have some text files like this:
<page>
<url>xxx.example.com</url>
<title>xxx</title>
<content>abcdef</content>
</page>
<page>
<url>yyy.example.com</url>
<title>yyy</title>
<content>abcdef</content>
</page>
...
I want to read the file splits in the mapper and convert them to key-value pairs, where each value is the content of one <page> tag.
My problem is about the key. I could use the URLs as keys because they are globally unique. However, due to the context of my job, I want to generate a globally unique number as the key for each key-value pair. I know this somewhat works against the horizontal scalability of Hadoop, but is there any solution to this?
If you're going to process such files with MapReduce, I'd take the following strategy:
Use the general text input format, line by line. As a result, each file goes to a different mapper task.
In the mapper, build a loop that reads subsequent lines via context.nextKeyValue() instead of being called once per line.
Feed the lines to some syntax analyzer (maybe it's enough to read 6 non-empty lines, maybe you will use something like libxml), which will finally give you a number of objects.
If you intend to pass the objects that you read to the reducer, you need to wrap them in something that implements the Writable interface.
To form keys I'd use the UUID implementation in java.util.UUID. Something like:
UUID key = UUID.randomUUID();
It's enough if you're not generating billions of records per second and your job does not take 100 years. :-)
Just note: the UUID should probably be encoded in an ImmutableBytesWritable, which is useful for such things.
That's all; then context.write(key, object) (the key comes first in the Context.write signature).
OK, your reducer (if any) and output format are another story. You will definitely need an output format to store your objects if you don't convert them to something like Text during the map phase.
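A rough sketch of such a mapper (only an illustration, assuming the new mapreduce API and that the page text is emitted as plain Text rather than a custom Writable; the class name is a placeholder):

import java.io.IOException;
import java.util.UUID;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Reads lines in a loop via context.nextKeyValue() instead of one map() call per line,
// accumulates one <page>...</page> block, and emits it under a random UUID key.
public class PageMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        StringBuilder page = new StringBuilder();
        while (context.nextKeyValue()) {
            String line = context.getCurrentValue().toString().trim();
            if (line.isEmpty()) {
                continue;                                   // skip blank separator lines
            }
            page.append(line).append('\n');
            if (line.equals("</page>")) {
                // One complete page collected: key it with a UUID and emit it.
                context.write(new Text(UUID.randomUUID().toString()),
                              new Text(page.toString()));
                page.setLength(0);
            }
        }
        cleanup(context);
    }
}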
Not sure if this answers your question directly, but I am taking advantage of the input file format.
You might use NLineInputFormat and set N = 6, as each record encompasses 6 lines:
<page>
<url>xxx.example.com</url>
<title>xxx</title>
<content>abcdef</content>
</page>
.
With each record, the mapper would get the offset position in the file. This offset would be unique for each record.
PS: This would work only if the schema is fixed. I am doubtful whether it would work properly across multiple input text files.
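A minimal driver-side sketch of that configuration (just an illustration; the class name and paths are assumptions). Note that within each 6-line split the mapper is still invoked once per line, keyed by that line's byte offset, so the offset of the record's first line is the value to keep as its unique id:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageRecordDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "page-records");
        job.setJarByClass(PageRecordDriver.class);

        // Each input split carries exactly 6 lines, i.e. one <page> record.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 6);
        NLineInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}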
This question already has answers here: "Parsing one terabyte of text and efficiently counting the number of occurrences of each word" (16 answers). Closed 10 years ago.
I have a huge text file (larger than the available RAM). I need to count the frequency of all words and output each word and its frequency count into a new file. The result should be sorted in descending order of frequency count.
My Approach:
Sort the given file - external sort
Count the frequency of each word sequentially, store the count in another file (along with the word)
Sort the output file based on frequency count - external sort.
I want to know if there are better approaches. I have heard of disk-based hash tables and B+ trees, but have never tried them before.
Note: I have seen similar questions asked on SO, but none of them address the issue of data larger than memory.
Edit: Based on the comments, agreed that in practice a dictionary should fit in the memory of today's computers. But let's take a hypothetical dictionary of words that is huge enough not to fit in memory.
I would go with a map reduce approach:
Distribute your text file across nodes, assuming the text on each node fits into RAM.
Calculate each word's frequency within the node (using hash tables).
Collect the results on a master node and combine them all.
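In Hadoop terms this is essentially the standard word-count shape; a rough sketch (class names are placeholders, and the reducer can double as a combiner for the per-node step):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit (word, 1) for every token in this node's share of the text.
public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer (and combiner): sum the per-node counts for each word.
class TokenCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}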
All unique words probably fit in memory so I'd use this approach:
Create a dictionary (HashMap<String, Integer>).
Read the huge text file line by line.
Add new words into the dictionary and set value to 1.
Add 1 to the value of existing words.
After you've parsed the entire huge file:
Sort the dictionary by frequency.
Write, to a new file, the sorted dictionary with words and frequency.
Remember, though, to convert the words to either lowercase or uppercase.
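A minimal sketch of that approach (assuming Java 8+, tokenization on non-word characters, tab-separated output, and input/output paths passed on the command line):

import java.io.BufferedReader;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class WordFrequency {
    public static void main(String[] args) throws Exception {
        Map<String, Integer> counts = new HashMap<>();

        // Read the huge file line by line; only the dictionary is kept in memory.
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                for (String word : line.toLowerCase().split("\\W+")) {
                    if (!word.isEmpty()) {
                        counts.merge(word, 1, Integer::sum);
                    }
                }
            }
        }

        // Sort by descending frequency and write "word<TAB>count" to the output file.
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get(args[1]), StandardCharsets.UTF_8))) {
            counts.entrySet().stream()
                  .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                  .forEach(e -> out.println(e.getKey() + "\t" + e.getValue()));
        }
    }
}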
The best way to achieve it would be to read the file line by line and store the words in a Multimap (e.g. Guava). If this map exceeds your memory, you could try using a key-value store (e.g. Berkeley DB JE or MapDB). These key-value stores work similarly to a map, but they store their values on disk. I used MapDB for a similar problem and it was blazing fast.
If the list of unique words and their frequencies fits in memory (not the whole file, just the unique words), you can use a hash table and read the file sequentially (without storing it).
You can then sort the entries of the hash table by the number of occurrences.
I'm implementing this in Java.
Symbol file    Store data file
1\item1        10\storename1
10\item20      15\storename6
11\item6       15\storename9
15\item14      1\storename250
5\item5        1\storename15
The user will search store names using wildcards like storename?
My job is to search the store names and produce a full string using symbol data. For example:
item20-storename1
item14-storename6
item14-storename9
My approach is:
reading the store data file line by line
if a line contains the matching search string (like storename?), I push that line to an intermediate store result file
I also copy the item number of each matching store name into an ArrayList (e.g. 10, 15)
whenever the ArrayList size is a multiple of 100, I remove duplicate item numbers using a HashSet, reducing the ArrayList size significantly
when the ArrayList size exceeds 1000:
sort that list using Collections.sort(itemno_arraylist)
open the symbol file and start reading it line by line
for each line, call Collections.binarySearch(itemno_arraylist, itemno)
if it matches, push the result to an intermediate symbol result file
continue from step 1 until EOF of the store data file
...
After all of this, I combine the two result files (the symbol result file and the store result file) to produce the actual list of strings.
This approach works, but it consumes a lot of CPU time and main memory.
I want to know a better solution with reduced CPU time (currently 2 min) and memory (currently 80 MB). There are many collection classes available in Java. Which one would give a more efficient solution for this kind of huge string-processing problem?
Any thoughts on this kind of string-processing problem, particularly in Java, would be great and helpful.
Note: Both files would be nearly a million lines long.
Replace the two flat files with an embedded database (there are plenty of them; I have used SQLite and Db4O in the past): problem solved.
So you need to replace 10\storename1 with item20-storename1 because the symbol file contains 10\item20. The obvious solution is to load the symbol file into a Map:
String[] tokens = symbolFile.readLine().split("\\\\");  // "\\\\" matches a single literal backslash
map.put(tokens[0], tokens[1]);
Then read the store file line by line and replace:
String[] tokens = storeFile.readLine().split("\\\\");
output.println(map.get(tokens[0]) + '-' + tokens[1]);
This is the fastest method, though it still uses a lot of memory for the map. You can reduce the memory by storing the map in a database, but this would increase the running time significantly.
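Put together, a self-contained version of that two-pass idea might look like this (only a sketch; the file names are placeholders and each line is assumed to contain exactly one backslash separator):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

public class SymbolJoin {
    public static void main(String[] args) throws Exception {
        Map<String, String> map = new HashMap<>();

        // Pass 1: load the symbol file, e.g. "10\item20" -> key "10", value "item20".
        try (BufferedReader symbolFile = new BufferedReader(new FileReader("symbol.txt"))) {
            String line;
            while ((line = symbolFile.readLine()) != null) {
                String[] tokens = line.split("\\\\");   // split on the literal backslash
                map.put(tokens[0], tokens[1]);
            }
        }

        // Pass 2: stream the store data file and emit "item20-storename1" style lines.
        try (BufferedReader storeFile = new BufferedReader(new FileReader("store.txt"));
             PrintWriter output = new PrintWriter("result.txt")) {
            String line;
            while ((line = storeFile.readLine()) != null) {
                String[] tokens = line.split("\\\\");
                output.println(map.get(tokens[0]) + "-" + tokens[1]);
            }
        }
    }
}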
If your input data file does not change frequently, then parse the file once and put the data into a List of a custom class, e.g. FileStoreRecord, that maps a record in the file. Define an equals method on your custom class. Perform all subsequent steps over the List; e.g. for searching, you can call the contains method, passing the search string in the form of a FileStoreRecord object.
If the file changes over time, you may want to refresh the List at a certain interval, or keep track of the list's creation time and compare it against the file's update timestamp before using it; if they differ, recreate the list. Another way to manage the file check could be to have a thread continuously poll for file updates and, the moment the file is updated, notify the application to refresh the list.
Is there any limitation on using a Map?
You can add the items to a Map and then search them easily.
One million records means 1M * recordSize of memory, so it should not be a problem.
Map<Integer, Item> itemMap = new HashMap<>();
...
Item item = itemMap.get(store.getItemNo());
But the best solution would be to use a database.
Question: I have two files, one with a list of serial number, item, price, and location, and the other file has items. I would like to compare the two files and print out the number of times the items are repeated in file1, along with the serial numbers.
Text1 file will have
Text2 file will have
Output should be
So file1 is not formatted in a proper order, while file2 is in order (line by line).
Since you have no apparent code or effort put into this, I'll only hint/guide you to some tools you can use.
For parsing strings: http://docs.oracle.com/javase/6/docs/api/java/lang/String.html
For reading in from a file: http://www.roseindia.net/java/beginners/java-read-file-line-by-line.shtml
And I would recommend reading file #2 first and saving those values to an ArrayList, perhaps, so you can iterate through them later on when you do your searching.
Okay, my approach to this would be:
Read file1 and file2 into strings.
"Split" the string from file1 as well as file2 on "," if that is the delimiter being used.
Check for the item in every 3rd token, so the iteration advances by 3 each time (you might need to sort both if they are not in order).
If found, store it in an array, ArrayList, etc. Go back to step 3 if more items are present; otherwise stop.
Even though your file1 is not well formatted, its content has some patterns which you can use to read it successfully.
Each item has all the information (i.e. serial number, name, price, location), but not in a fixed order. So you have to pay attention to and use the following patterns while you read each item from file1:
The serial number is always a plain integer.
The price contains the $ and . characters.
The location is 2 characters long, all capitals.
And the name is a string that matches none of the above.
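A hedged sketch of how those patterns might translate into code (it assumes comma-separated fields and a $12.50-style price; the class and field names are made up):

public class RecordParser {
    // Classify the comma-separated tokens of one file1 record using the patterns above.
    public static String[] parse(String line) {
        String serial = null, name = null, price = null, location = null;
        for (String raw : line.split(",")) {
            String t = raw.trim();
            if (t.matches("\\d+")) {
                serial = t;                      // plain integer -> serial number
            } else if (t.matches("\\$\\d+\\.\\d+")) {
                price = t;                       // contains $ and . -> price
            } else if (t.matches("[A-Z]{2}")) {
                location = t;                    // two capital letters -> location
            } else {
                name = t;                        // anything else -> item name
            }
        }
        return new String[] {serial, name, price, location};
    }
}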
Such problems are not best solved by monolithic Java code. If you don't have a tooling constraint, the recommended way to solve this is to import the data from file1 into a database table and then run queries from your program to fetch whatever information you like. You can easily select serial numbers based on items and group them for counts based on location.
This approach will ensure that you can keep up with changing requirements, and if your files are huge you will still get good performance.
I hope you are well versed in SQL and DB tools, so I have not posted any details on them.
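As a hedged illustration of the kind of query that buys you (the table and column names, the SQLite JDBC driver, and the searched item "pen" are all assumptions):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ItemCountQuery {
    public static void main(String[] args) throws Exception {
        // Assumes file1 was imported into a table "records" with columns
        // serial_no, item, price, location, and that sqlite-jdbc is on the classpath.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:records.db");
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT location, COUNT(*) AS occurrences, GROUP_CONCAT(serial_no) AS serials "
                 + "FROM records WHERE item = ? GROUP BY location")) {
            ps.setString(1, "pen");   // the item being searched for
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("location") + ": "
                            + rs.getInt("occurrences") + " times, serials " + rs.getString("serials"));
                }
            }
        }
    }
}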
Use regex.
Step one: scan the input, splitting on [\d,], and store the results in a map.
Step two: read a word from the second file; say it's "pen".
Step three: do a regex search for "pen" on each string in the map.
Step four: if the above matches, apply something like ([A-Z][A-Z],) to each matching string in the map.
I have read a related question here: link text
It was suggested there to work with a giant file and then use RandomAccessFile.
My problem is that the matrix (consisting of "0"s and "1"s, not sparse) could be really huge. For example, a row size could be 10^10000. I need an efficient way to store such a matrix. Also, I need to work with such a file (if I store my matrix in it) in the following way:
Say I have a giant file which contains sequences of numbers. Numbers within a sequence are separated by "," (the first number gives the row number, the remaining numbers give the positions in the matrix where "1"s appear). Sequences are separated by the symbol "|". In addition, there is a symbol "||" which divides all of the sequences into two groups. (That is a view of two matrices. Maybe it is not efficient, but I don't know a better way. Do you have any ideas? =) ) I have to read, for example, 100 numbers from each row of the first group (extract a submatrix) and use them to determine which rows I need to read from the second group.
So I need the function seek(). Would it work with such a giant file?
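For context, this is roughly the kind of access I have in mind (just a sketch; the file name and offset are made up, and seek() takes a long byte offset, so the file size itself is not the limit):

import java.io.RandomAccessFile;

public class SeekDemo {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("matrix.bin", "r")) {
            raf.seek(5_000_000_000L);        // jump to the 5-billionth byte without reading what precedes it
            int firstByte = raf.read();      // read from that position onwards
            System.out.println(firstByte);
        }
    }
}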
I am a newbie. Maybe there are some efficient ways to store and read such data?
There are about 10^80 atoms in the observable universe. So even if you could store one bit in each atom, you would need about 10^9920 universes the same size as ours. That's just to store one row.
How many rows were you considering? You will need 10^9920 universes per row.
Hopefully you mean 10,000 entries and not 10^10000. Then you could use the BitSet class to store it all in RAM (or you could use something like Hadoop).
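For the 10,000-entry case, a single BitSet row is tiny; a hypothetical sketch:

import java.util.BitSet;

public class BitSetRowDemo {
    public static void main(String[] args) {
        int rowSize = 10_000;                // assumed row length of 10,000 entries
        BitSet row = new BitSet(rowSize);    // roughly 1.25 KB of memory per row
        row.set(42);                         // mark column 42 as "1"
        System.out.println(row.get(42));     // read the entry back -> true
    }
}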