I want to implement a method to merge two huge files (each file contains a JsonObject per row) on a common value.
The first file is like this:
{
"Age": "34",
"EmailHash": "2dfa19bf5dc5826c1fe54c2c049a1ff1",
"Id": 3,
...
}
and the second:
{
"LastActivityDate": "2012-10-14T12:17:48.077",
"ParentId": 34,
"OwnerUserId": 3,
}
I have implemented a method that reads the first file and takes its first JsonObject, then takes that object's Id; if the second file contains a row with the same Id (OwnerUserId == Id), it appends that second JsonObject to the first file, otherwise it writes the non-matching row to another file that contains only the rows that don't match the first file. This way, if the first JsonObject has 10 matches, the second row of the first file no longer has to seek through those rows.
The method works fine, but it is too slow.
I have already tried loading the data into MongoDB and querying the DB, but that is slow too.
Is there another way to process the two files?
What you're doing simply must be damn slow. If you don't have the memory for all the JSON objects, then try to store the data as plain Java objects, as that way you surely need much less.
And there's a simple way that needs even less memory and only n passes, where n is the ratio of required memory to available memory.
On the i-th pass, consider only objects with id % n == i and ignore all the others. This way the memory consumption drops by nearly a factor of n, assuming the ids are nicely distributed modulo n.
If this assumption doesn't hold, use f(id) % n instead, where f is some hash function (feel free to ask if you need it).
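A minimal sketch of that n-pass idea, assuming one JSON object per line, the Id/OwnerUserId fields from the question, and Gson (2.8.6 or newer) for parsing; the file names and the way a match is written out are placeholders:

import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class ModuloMergeSketch {
    // n = number of passes; pick it so one partition of the first file fits in memory
    static void merge(Path firstFile, Path secondFile, Path outFile, int n) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(outFile, StandardCharsets.UTF_8)) {
            for (int pass = 0; pass < n; pass++) {
                // load only the objects of the first file whose Id falls into this partition
                Map<Long, JsonObject> partition = new HashMap<>();
                try (BufferedReader in = Files.newBufferedReader(firstFile, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        JsonObject obj = JsonParser.parseString(line).getAsJsonObject();
                        long id = obj.get("Id").getAsLong();
                        if (id % n == pass) {           // ids assumed non-negative
                            partition.put(id, obj);
                        }
                    }
                }
                // stream the second file and join it against the in-memory partition
                try (BufferedReader in = Files.newBufferedReader(secondFile, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        JsonObject obj = JsonParser.parseString(line).getAsJsonObject();
                        long ownerId = obj.get("OwnerUserId").getAsLong();
                        if (ownerId % n != pass) {
                            continue;                   // handled in another pass
                        }
                        JsonObject match = partition.get(ownerId);
                        if (match != null) {
                            out.write(match.toString() + " " + obj.toString());
                            out.newLine();
                        }
                    }
                }
            }
        }
    }
}

The second file is streamed n times, but each object is only ever resolved through a hash lookup, never by scanning the whole other file.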
I have solved it using a temporary DB.
I created an index on the key I want to merge on; this way I can query the DB and the response is very fast.
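For reference, a rough sketch of the temporary-DB idea, assuming an embedded SQLite database through the xerial sqlite-jdbc driver; the table and column names are made up for illustration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class TempDbJoinSketch {
    public static void main(String[] args) throws SQLException {
        try (Connection c = DriverManager.getConnection("jdbc:sqlite:merge-temp.db");
             Statement st = c.createStatement()) {
            st.execute("CREATE TABLE posts (ownerUserId INTEGER, json TEXT)");
            // ... bulk-insert the rows of the second file here, one per JSON object ...
            // the index on the merge key is what makes the lookups fast
            st.execute("CREATE INDEX idx_owner ON posts (ownerUserId)");

            // while streaming the first file, look up matches by the indexed key
            try (PreparedStatement ps = c.prepareStatement("SELECT json FROM posts WHERE ownerUserId = ?")) {
                ps.setLong(1, 3); // the Id taken from the current object of the first file
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("json"));
                    }
                }
            }
        }
    }
}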
Related
I have a .JSON file that has content as:
"Name":"something"
"A":10
"B": 12
"Name":"something else"
"A":5
"B":9
....
I want to read this file and then find which of these objects has the largest number of parts (the sum of A+B). What would be a good approach to that? I thought about reading the JSON data into a Map, then going through each object of the Map, computing its A+B total, and storing that in another linked list such as LinkedList<String,Integer> (where String would be the name, and Integer the sum of A+B). After that I would sort the LinkedList and find out which one has the largest A+B. Would this be a good solution?
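Since only the maximum is needed, the sort (and the LinkedList) can be skipped entirely; tracking the best record while iterating is enough. A sketch, assuming each JSON object has already been parsed into a simple holder class (Part and its fields are placeholders for whatever the parser produces):

import java.util.List;

class Part {
    String name;
    int a;
    int b;
}

public class MaxPartsSketch {
    // returns the name of the record with the largest A + B
    static String maxParts(List<Part> parts) {
        String bestName = null;
        long bestSum = Long.MIN_VALUE;
        for (Part p : parts) {
            long sum = (long) p.a + p.b;   // long guards against overflow
            if (sum > bestSum) {
                bestSum = sum;
                bestName = p.name;
            }
        }
        return bestName;
    }
}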
I have two or more .csv files which have the following data:
//CSV#1
Actor.id, Actor.DisplayName, Published, Target.id, Target.ObjectType
1, Test, 2014-04-03, 2, page
//CSV#2
Actor.id, Actor.DisplayName, Published, Object.id
2, Testing, 2014-04-04, 3
Desired Output file:
//CSV#Output
Actor.id, Actor.DisplayName, Published, Target.id, Target.ObjectType, Object.id
1, Test, 2014-04-03, 2, page,
2, Testing, 2014-04-04, , , 3
In case some of you wonder: the "." in the header is just additional information in the .csv file and shouldn't be treated as a separator (the "." results from the conversion of a JSON file to CSV, respecting the nesting level of the JSON data).
My problem is that I have not found any solution so far which accepts different column counts.
Is there a clean way to achieve this? I don't have any code yet, but I thought the following would work:
Read two or more files and add each row to a HashMap<Integer,String> // Integer = lineNumber, String = data, so that each file gets its own HashMap
Iterate through all indices and add the data to a new HashMap.
Why I think this idea is not so good:
If the header and the row data from file 1 differ from file 2 (etc.), the order won't be kept right.
I think this is what might result if I did the suggested thing:
//CSV#Suggested
Actor.id, Actor.DisplayName, Published, Target.id, Target.ObjectType, Object.id
1, Test, 2014-04-03, 2, page //wrong, because one "," is missing
2, Testing, 2014-04-04, 3 // wrong, because the 3 does not belong to Target.id. Furthermore the empty values won't be considered.
Is there a handy way I can merge the data of two or more files without(!) knowing how many elements the header contains?
This isn't the only answer but hopefully it can point you in a good direction. Merging is hard, you're going to have to give it some rules and you need to decide what those rules are. Usually you can break it down to a handful of criteria and then go from there.
I wrote a "database" to deal with situations like this a while back:
https://github.com/danielbchapman/groups
It is basically just a Map<Integer, Map<Integer, Map<String, String>>>, which isn't all that complicated. What I'd recommend is that you read each row into a structure similar to:
(Set One) -> Map<Column, Data>
(Set Two) -> Map<Column, Data>
A Bidi map (as suggested in the comments) will make your lookups faster but carries some pitfalls if you have duplicate values.
Once you have these structures, your lookup can be as simple as:
public List<Data> process(Data one, Data two) //pseudo code
{
    List<Data> result = new ArrayList<>();
    for(Row row : one)
    {
        Id id = row.getId();
        Row additional = two.lookup(id);
        if(additional != null)
            merge(row, additional);
        result.add(row);          // keep the row whether or not it had a match
    }
    return result;
}
public void merge(Row a, Row b)
{
//Your logic here.... either mutating or returning a copy.
}
Nowhere in this solution am I worried about the columns as this is just acting on the raw data-types. You can easily remap all the column names either by storing them each time you do a lookup or by recreating them at output.
The reason I linked my project is that I'm pretty sure I have a few methods in there (such as outputting column names, etc...) that might save you considerable time/point you in the right direction.
I do a lot of TSV processing in my line of work and maps are my best friends.
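To make the column handling concrete, here is a rough sketch of the remap-at-output idea: build the union of all headers first, then write every row against that union, leaving blanks for columns a file doesn't have. Representing each row as a header-to-value map is an assumption, and CSV quoting is ignored:

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

public class CsvUnionMergeSketch {
    // files: one list of header->value maps per input file
    static List<String> mergeRows(List<List<Map<String, String>>> files) {
        // union of all headers, in first-seen order
        LinkedHashSet<String> headers = new LinkedHashSet<>();
        for (List<Map<String, String>> file : files) {
            for (Map<String, String> row : file) {
                headers.addAll(row.keySet());
            }
        }
        List<String> out = new ArrayList<>();
        out.add(String.join(", ", headers));
        for (List<Map<String, String>> file : files) {
            for (Map<String, String> row : file) {
                List<String> cells = new ArrayList<>();
                for (String h : headers) {
                    cells.add(row.getOrDefault(h, "")); // blank for missing columns
                }
                out.add(String.join(", ", cells));
            }
        }
        return out;
    }
}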
Suppose I'm into Big Data (as in bioinformatics), and I've chosen to analyze it in Java using the wonderful Collections Map-Reduce framework on HPC. How can I work with datasets of more than 2^31 - 1 items? For example,
final List<Gene> genome = getHugeData();
profit.log(genome.parallelStream().collect(magic));
Wrap your data so it consists of many chunks -- once you exceed 2^31 - 1 items you move on to the next one. A sketch:
class Wrapper {
    private List<List<Gene>> chunks;

    Gene get(long id) {
        // each chunk holds at most Integer.MAX_VALUE elements
        int chunkId = (int) (id / Integer.MAX_VALUE);
        int itemId = (int) (id % Integer.MAX_VALUE);
        List<Gene> chunk = chunks.get(chunkId);
        return chunk.get(itemId);
    }
}
In this case you have multiple problems. How big is your data?
The simplest solution is to use another structure such as a LinkedList (only if you are interested in sequential access) or a HashMap, which may have a high insertion cost. A LinkedList does not allow any random access whatsoever: if you want to access the 5th element, you have to access all 4 previous elements first as well.
Here is another thought:
Let us assume that each gene has an id number (long). You can use an index structure such as a B+-tree and index your data using the tree. The index does not have to be stored on disk; it can remain in memory, and it does not have much overhead either. You can find many implementations of it online.
Another solution would be to create a container class which would contain either other container classes or Genes. In order to achieve that, both should implement an interface called, e.g., Containable. That way both classes, Gene and Container, are Containable(s). Once a container reaches its maximum size it can be inserted into another container, and so on. You can create multiple levels that way.
I would suggest you look online (e.g. Wikipedia) for the B+-tree if you are not familiar with it.
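As for the container idea, a bare-bones sketch (the names Containable, Container and Gene come from the answer; everything else, such as the size limit, is made up):

import java.util.ArrayList;
import java.util.List;

interface Containable { }

class Gene implements Containable {
    // gene data ...
}

class Container implements Containable {
    private final int maxSize;
    private final List<Containable> children = new ArrayList<>();

    Container(int maxSize) {
        this.maxSize = maxSize;
    }

    // returns false when full; the caller then wraps this container in a new parent container
    boolean add(Containable c) {
        if (children.size() >= maxSize) {
            return false;
        }
        return children.add(c);
    }
}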
An array with 2^31 object references alone would consume about 17 GB of memory...
You should store the data in a database.
I'm implementing this in Java.
Symbol file:
1\item1
10\item20
11\item6
15\item14
5\item5

Store data file:
10\storename1
15\storename6
15\storename9
1\storename250
1\storename15
The user will search store names using wildcards like storename?
My job is to search the store names and produce a full string using symbol data. For example:
item20-storename1
item14-storename6
item14-storename9
My approach is:
reading the store data file line by line
if any line contains a matching store name (like storename?), I push that line to an intermediate store result file
I also copy the item no of the matching store name into an ArrayList (like 10, 15)
when this ArrayList's size % 100 == 0, I remove duplicate item no's using a HashSet, reducing the ArrayList size significantly
when the ArrayList size > 1000:
sort the list using Collections.sort(itemno_arraylist)
open the symbol file and start reading it line by line
for each line, call Collections.binarySearch(itemno_arraylist, itemno)
if it matches, push the result to an intermediate symbol result file
continue with step 1 until EOF of the store data file
...
After all of this I would combine the two result files (symbol result file & store result file) to produce the actual list of strings.
This approach works, but it consumes a lot of CPU time and main memory.
I want to know a better solution with reduced CPU time (currently 2 min) & memory (currently 80 MB). There are many collection classes available in Java. Which one would give the most efficient solution for this kind of huge string processing problem?
Any thoughts on this kind of string processing problem, particularly in Java, would be great and helpful.
Note: Both files are nearly a million lines long.
Replace the two flat files with an embedded database (there are plenty of them; I have used SQLite and Db4O in the past): problem solved.
So you need to replace 10\storename1 with item20-storename1 because the symbol file contains 10\item20. The obvious solution is to load the symbol file into a Map:
String[] tokens = symbolFile.readLine().split("\\\\");
map.put(tokens[0], tokens[1]);
Then read the store file line by line and replace:
String[] tokens = storeFile.readLine().split("\\\\");
output.println(map.get(tokens[0]) + '-' + tokens[1]);
This is the fastest method, though it still uses a lot of memory for the map. You can reduce the memory by storing the map in a database, but this would increase the running time significantly.
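Putting the two fragments together, a complete sketch might look like this; the file names and the output format are assumptions based on the question:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

public class SymbolStoreMergeSketch {
    public static void main(String[] args) throws IOException {
        // 1) load the symbol file into a map: "10\item20" -> key "10", value "item20"
        Map<String, String> symbols = new HashMap<>();
        try (BufferedReader symbolFile = new BufferedReader(new FileReader("symbol.txt"))) {
            String line;
            while ((line = symbolFile.readLine()) != null) {
                String[] tokens = line.split("\\\\");
                symbols.put(tokens[0], tokens[1]);
            }
        }
        // 2) stream the store data file and emit lines like "item20-storename1"
        try (BufferedReader storeFile = new BufferedReader(new FileReader("store.txt"));
             PrintWriter output = new PrintWriter(new FileWriter("merged.txt"))) {
            String line;
            while ((line = storeFile.readLine()) != null) {
                String[] tokens = line.split("\\\\");
                String item = symbols.get(tokens[0]);
                if (item != null) {
                    output.println(item + '-' + tokens[1]);
                }
            }
        }
    }
}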
If your input data file is not changing frequently, then parse the file once and put the data into a List of a custom class, e.g. FileStoreRecord, that maps a record in the file. Define an equals method on your custom class. Perform all further steps over the List; e.g. for search you can call contains, passing the search string in the form of the custom object FileStoreRecord.
If the file changes after some time, you may want to refresh the List after a certain interval, or keep track of the list creation time and compare it against the file's update timestamp before using it; if they differ, recreate the list. Another way to manage the file check could be to have a thread continuously polling the file for updates and, the moment it is updated, notifying the application to refresh the list.
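A minimal sketch of such a custom class, with equals (and the matching hashCode) defined; the field names are assumptions:

import java.util.Objects;

public class FileStoreRecord {
    private final String itemNo;
    private final String storeName;

    public FileStoreRecord(String itemNo, String storeName) {
        this.itemNo = itemNo;
        this.storeName = storeName;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof FileStoreRecord)) return false;
        FileStoreRecord other = (FileStoreRecord) o;
        return Objects.equals(itemNo, other.itemNo)
                && Objects.equals(storeName, other.storeName);
    }

    @Override
    public int hashCode() {
        return Objects.hash(itemNo, storeName);
    }
}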
Is there any limitation to using a Map?
You can add the items to a Map and then search them easily.
1 million records mean 1M * recordSize, so it should not be a problem.
Map<Integer, Item> itemMap = new HashMap<>();
...
Item item= itemMap.get(store.getItemNo());
But the best solution would be a database.
I have read a related question here: link text
It was suggested there to work with a giant file and then use RandomAccessFile.
My problem is that a matrix (consisting of "0"s and "1"s, not sparse) could be really huge. For example, a row size could be 10^10000. I need an efficient way to store such a matrix. Also, I need to work with such a file (if I were to store my matrix in it) in the following way:
Say I have a giant file which contains sequences of numbers. The numbers in a sequence are separated by "," (the first number gives the row number, the remaining numbers give the places in the matrix where "1"s stand). Sequences are separated by the symbol "|". In addition, there is a symbol "||" which divides all of the sequences into two groups. (That is a view of two matrices. Maybe it is not efficient, but I don't know a better way. Do you have any ideas? =) ) I have to read, for example, 100 numbers from each row of the first group (extract a submatrix) and determine from them which rows I need to read from the second group.
So I need the function seek(). Would it work with such a giant file?
I am a newbie. Maybe there are some efficient ways to store and read such data?
There are about 10^80 atoms in the observable universe. So even if you could store one bit in each atom, you would need about 10^9920 universes about the same size as ours. That's just to store one row.
How many rows were you considering? You will need 10^9920 universes per row.
Hopefully you mean 10 000 entries and not 10^10000. Then you could use the BitSet class to store everything in RAM (or you could use something like Hadoop).
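If a row really is on the order of 10 000 entries, a BitSet per row is tiny (roughly rowLength / 8 bytes). A quick sketch:

import java.util.BitSet;

public class BitRowSketch {
    public static void main(String[] args) {
        int rowLength = 10_000;          // assumed row size
        BitSet row = new BitSet(rowLength);

        // mark the positions where the matrix holds a "1"
        row.set(3);
        row.set(42);
        row.set(9_999);

        System.out.println(row.get(42)); // true
        System.out.println(row.get(7));  // false
    }
}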