Merge CSV files with dynamic headers in Java

I have two or more .csv files which have the following data:
//CSV#1
Actor.id, Actor.DisplayName, Published, Target.id, Target.ObjectType
1, Test, 2014-04-03, 2, page
//CSV#2
Actor.id, Actor.DisplayName, Published, Object.id
2, Testing, 2014-04-04, 3
Desired Output file:
//CSV#Output
Actor.id, Actor.DisplayName, Published, Target.id, Target.ObjectType, Object.id
1, Test, 2014-04-03, 2, page,
2, Testing, 2014-04-04, , , 3
In case some of you are wondering: the "." in the header is just additional information in the .csv file and shouldn't be treated as a separator (the "." results from the conversion of a JSON file to CSV, respecting the nesting level of the JSON data).
My problem is that I have not found any solution so far which accepts different column counts.
Is there a clean way to achieve this? I don't have any code yet, but I thought the following would work:
Read two or more files and add each row to a HashMap<Integer,String> //Integer = lineNumber, String = data, so that each file gets its own HashMap
Iterate through all indices and add the data to a new HashMap.
Why I think this thought is not so good:
If the header and the row data of file 1 differ from those of file 2 (etc.), the column order won't be preserved correctly.
I think this is what might result if I do the suggested thing:
//CSV#Suggested
Actor.id, Actor.DisplayName, Published, Target.id, Target.ObjectType, Object.id
1, Test, 2014-04-03, 2, page //wrong, because one "," is missing
2, Testing, 2014-04-04, 3 // wrong, because the 3 does not belong to Target.id. Furthermore the empty values won't be considered.
Is there a handy way I can merge the data of two or more files without(!) knowing how many elements the header contains?

This isn't the only answer, but hopefully it can point you in a good direction. Merging is hard; you're going to have to give it some rules, and you need to decide what those rules are. Usually you can break it down into a handful of criteria and then go from there.
I wrote a "database" to deal with situations like this a while back:
https://github.com/danielbchapman/groups
It is basically just a Map<Integer, Map<Integer, Map<String, String>>>, which isn't all that complicated. What I'd recommend is that you read each row into a structure similar to:
(Set One) -> Map<Column, Data>
(Set Two) -> Map<Column, Data>
A Bidi map (as suggested in the comments) will make your lookups faster but carries some pitfalls if you have duplicate values.
Once you have these structures, your lookup can be as simple as:
public List<Data> process(Data one, Data two) // pseudo code
{
    List<Data> result = new ArrayList<>();
    for (Row row : one)
    {
        Id id = row.getId();
        Row additional = two.lookup(id);
        if (additional != null)
            merge(row, additional);
        result.add(row);
    }
    return result;
}
public void merge(Row a, Row b)
{
    // Your logic here... either mutating or returning a copy.
}
Nowhere in this solution am I worried about the columns as this is just acting on the raw data-types. You can easily remap all the column names either by storing them each time you do a lookup or by recreating them at output.
The reason I linked my project is that I'm pretty sure I have a few methods in there (such as outputting column names etc.) that might save you considerable time/point you in the right direction.
I do a lot of TSV processing in my line of work and maps are my best friends.
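To make the header handling concrete, here is a minimal, self-contained sketch of the header-union idea the question describes: read every file into per-row maps keyed by that file's own header, then write every row against the union of all headers. The file names and the naive split-on-comma parsing (no quoting support) are assumptions for illustration, not the library linked above:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.*;

public class CsvUnionMerge {
    public static void main(String[] args) throws IOException {
        List<String> files = Arrays.asList("csv1.csv", "csv2.csv"); // hypothetical input files
        LinkedHashSet<String> allHeaders = new LinkedHashSet<>();   // union of headers, in first-seen order
        List<Map<String, String>> rows = new ArrayList<>();         // each row keyed by its own file's header

        for (String file : files) {
            List<String> lines = Files.readAllLines(Paths.get(file));
            String[] header = lines.get(0).split("\\s*,\\s*");
            allHeaders.addAll(Arrays.asList(header));
            for (String line : lines.subList(1, lines.size())) {
                String[] values = line.split("\\s*,\\s*", -1);
                Map<String, String> row = new LinkedHashMap<>();
                for (int i = 0; i < header.length && i < values.length; i++) {
                    row.put(header[i], values[i]);
                }
                rows.add(row);
            }
        }

        // Write the merged file: every row is padded to the full header set.
        StringBuilder out = new StringBuilder(String.join(", ", allHeaders)).append("\n");
        for (Map<String, String> row : rows) {
            List<String> cells = new ArrayList<>();
            for (String h : allHeaders) {
                cells.add(row.getOrDefault(h, ""));
            }
            out.append(String.join(", ", cells)).append("\n");
        }
        Files.write(Paths.get("merged.csv"), out.toString().getBytes());
    }
}
Run against the two example files above, this would produce the desired output, with empty cells for columns a row's source file didn't have.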

Related

Writing and reading Java objects to CSV that have Lists as attributes

I have a List<Person> that I want to write to a CSV file and then read again. My problem is that the Person class has a List<Insurance> attribute. Therefore the line length will depend on how many objects are in the List<Insurance> attribute. A Person can have zero or multiple insurances, and these are stored in a List in the specific Person. Is there a reasonable way of doing this?
Like this:
public class Person {
    private int insuranceNR;
    private String firstName;
    private String lastName;
    private List<Insurance> insurances;
}
I have failed to find questions like mine, but if there are please redirect me. Thanks.
CSV is not the best choice here because it is supposed to have a fixed number of columns for effective parsing. You should use JSON or XML.
However, if you still want to use CSV, you must ensure there is only one List attribute in the class (future modifications would break the consistency), and that it is written at the end of the row.
Something like this:
1, Bill, Gates, Ins1, Ins2, Ins3
2, Donald, Trump
3, Elon, Musk, Ins4
4, Jeff, Bezos, Ins5, Ins6, Ins7, Ins8, Ins9
In your code, only consider the first three elements as fixed, and iterate over the remaining ones accordingly.
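As a rough illustration of that layout, here is a sketch of a write/read round trip. The minimal Person and Insurance stand-ins (Insurance reduced to a single name field) and the absence of quoting/escaping are simplifying assumptions:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PersonCsv {

    // Minimal stand-ins for the question's classes (assuming Insurance has just a name).
    static class Insurance {
        final String name;
        Insurance(String name) { this.name = name; }
    }

    static class Person {
        final int insuranceNR;
        final String firstName;
        final String lastName;
        final List<Insurance> insurances;
        Person(int nr, String first, String last, List<Insurance> insurances) {
            this.insuranceNR = nr;
            this.firstName = first;
            this.lastName = last;
            this.insurances = insurances;
        }
    }

    // Write one Person per line: fixed columns first, then a variable-length insurance tail.
    static String toCsvLine(Person p) {
        List<String> cells = new ArrayList<>(Arrays.asList(
                String.valueOf(p.insuranceNR), p.firstName, p.lastName));
        for (Insurance ins : p.insurances) {
            cells.add(ins.name);
        }
        return String.join(", ", cells);
    }

    // Read it back: the first three cells are fixed, everything after is an insurance.
    static Person fromCsvLine(String line) {
        String[] cells = line.split("\\s*,\\s*");
        List<Insurance> insurances = Arrays.stream(cells)
                .skip(3)
                .map(Insurance::new)
                .collect(Collectors.toList());
        return new Person(Integer.parseInt(cells[0]), cells[1], cells[2], insurances);
    }
}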
Here is a reference to a problem similar to yours:
Using CsvBeanReader to read a CSV file with a variable number of columns

Merge two JSON file through a value in java

I want to implement a method to merge two huge files (each file contains a JsonObject per row) through a common value.
The first file is like this:
{
"Age": "34",
"EmailHash": "2dfa19bf5dc5826c1fe54c2c049a1ff1",
"Id": 3,
...
}
and the second:
{
"LastActivityDate": "2012-10-14T12:17:48.077",
"ParentId": 34,
"OwnerUserId": 3,
}
I have implemented a method that reads the first file and takes the first JsonObject, then takes its Id; if the second file contains a row with the same Id (OwnerUserId == Id), it appends that second JsonObject to the first file, otherwise it writes the row to another file containing only the rows that don't match the first file. This way, if the first JsonObject has 10 matches, the next row of the first file doesn't have to scan those rows again.
The method works fine, but it is too slow.
I have already tried loading the data into MongoDB and querying the DB, but that is slow too.
Is there another way to process the two files?
What you're doing simply must be damn slow. If you don't have the memory for all the JSON objects, then try to store the data as normal Java objects, as that way you surely need much less.
And there's a simple approach needing even less memory and only n passes, where n is the ratio of required memory to available memory.
On the i-th pass consider only objects with id % n == i and ignore all the others. This way the memory consumption is reduced by nearly a factor of n, assuming the ids are nicely distributed modulo n.
If this assumption doesn't hold, use f(id) % n instead, where f is some hash function (feel free to ask if you need one).
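For illustration, a minimal sketch of that multi-pass idea, assuming one JSON object per line, the field names from the question, and the org.json library for parsing; how a matched pair is written out is only a placeholder here:
import java.io.*;
import java.util.HashMap;
import java.util.Map;
import org.json.JSONObject;

public class ModuloPassMerge {

    // Merge file2's objects into file1's objects on Id == OwnerUserId, in n passes.
    static void merge(File file1, File file2, File out, int n) throws IOException {
        try (PrintWriter writer = new PrintWriter(new BufferedWriter(new FileWriter(out)))) {
            for (int pass = 0; pass < n; pass++) {
                // Load only the slice of file1 whose Id falls into this pass.
                Map<Integer, JSONObject> slice = new HashMap<>();
                try (BufferedReader r = new BufferedReader(new FileReader(file1))) {
                    String line;
                    while ((line = r.readLine()) != null) {
                        JSONObject obj = new JSONObject(line);
                        int id = obj.getInt("Id");
                        if (Math.floorMod(id, n) == pass) {
                            slice.put(id, obj);
                        }
                    }
                }
                // Stream file2 once per pass and pair up matching objects.
                try (BufferedReader r = new BufferedReader(new FileReader(file2))) {
                    String line;
                    while ((line = r.readLine()) != null) {
                        JSONObject obj = new JSONObject(line);
                        JSONObject owner = slice.get(obj.getInt("OwnerUserId"));
                        if (owner != null) {
                            writer.println(owner.toString() + " " + obj.toString());
                        }
                    }
                }
            }
        }
    }
}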
I solved it using a temporary DB.
I created an index on the key I want to merge on; this way I can query the DB and the response is very fast.

Looking for a table-like data structure

I have 2 sets of data.
Let's say one is people, the other is groups.
A person can be in multiple groups, while a group can have multiple people.
My operations will basically be CRUD on group and people.
As well as a method that makes sure a list of people are in different groups (which gets called a lot).
Right now I'm thinking of making a table of binary 0s and 1s, with the people along one axis and the groups along the other.
I can perform the method in O(n) time by adding each person's list of binaries and comparing the sum with the bitwise "or" of those lists (they are equal exactly when no group is shared).
E.g.
Group A B C D
ppl1 1 0 0 1
ppl2 0 1 1 0
ppl3 0 0 1 0
ppl4 0 1 0 0
check (ppl1, ppl2) = (1001 + 0110) == (1001 | 0110)
                   = 1111 == 1111
                   = true
check (ppl2, ppl3) = (0110 + 0010) == (0110 | 0010)
                   = 1000 == 0110
                   = false
I'm wondering if there is a data structure that does something similar already so I don't have to write my own and maintain O(n) runtime.
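For reference, the check described above maps fairly directly onto java.util.BitSet, where "these people share no group" becomes "their bit sets don't intersect"; the class and method names below are illustrative:
import java.util.BitSet;
import java.util.List;

public class GroupMembership {
    // One BitSet per person; bit g is set if the person is in group g.
    // Returns true if no two people in the list share a group.
    static boolean allInDifferentGroups(List<BitSet> people) {
        BitSet seen = new BitSet();
        for (BitSet memberships : people) {
            if (seen.intersects(memberships)) {
                return false;          // some group already had a member from this list
            }
            seen.or(memberships);      // accumulate the groups seen so far
        }
        return true;
    }
}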
I don't know all of the details of your problem, but my gut instinct is that you may be overthinking things here. How many objects are you planning on storing in this data structure? If you have really large amounts of data to store here, I would recommend that you use an actual database instead of a data structure. The type of operations you are describing here are classical examples of things that relational databases are good at. MySQL and PostgreSQL are examples of large-scale relational databases that could do this sort of thing in their sleep. If you'd like something lighter-weight, SQLite would probably be of interest.
If you do not have large amounts of data that you need to store in this data structure, I'd recommend keeping it simple, and only optimizing it when you are sure that it won't be fast enough for what you need to do. As a first shot, I'd just recommend using java's built in List interface to store your people and a Map to store groups. You could do something like this:
// Use a list to keep track of People
List<Person> myPeople = new ArrayList<Person>();
Person steve = new Person("Steve");
myPeople.add(steve);
myPeople.add(new Person("Bob"));
// Use a Map to track Groups
Map<String, List<Person>> groups = new HashMap<String, List<Person>>();
groups.put("Everybody", myPeople);
groups.put("Developers", Arrays.asList(steve));
// Does a group contain everybody?
groups.get("Everybody").containsAll(myPeople); // returns true
groups.get("Developers").containsAll(myPeople); // returns false
This definitely isn't the fastest option available, but if you do not have a huge number of People to keep track of, you probably won't even notice any performance issues. If you do have some special conditions that would make the speed of using regular Lists and Maps unfeasible, please post them and we can make suggestions based on those.
EDIT:
After reading your comments, it appears that I misread your issue on the first run through. It looks like you're not so much interested in mapping groups to people, but instead mapping people to groups. What you probably want is something more like this:
Map<Person, List<String>> associations = new HashMap<Person, List<String>>();
Person steve = new Person("Steve");
Person ed = new Person("Ed");
associations.put(steve, Arrays.asList("Everybody", "Developers"));
associations.put(ed, Arrays.asList("Everybody"));
// This is the tricky part
boolean sharesGroups = checkForSharedGroups(associations, Arrays.asList(steve, ed));
So how do you implement the checkForSharedGroups method? In your case, since the numbers surrounding this are pretty low, I'd just try out the naive method and go from there.
public boolean checkForSharedGroups(
        Map<Person, List<String>> associations,
        List<Person> peopleToCheck) {
    List<String> groupsThatHaveMembers = new ArrayList<String>();
    for (Person p : peopleToCheck) {
        List<String> groups = associations.get(p);
        for (String s : groups) {
            if (groupsThatHaveMembers.contains(s)) {
                // We've already seen this group, so we can return
                return false;
            } else {
                groupsThatHaveMembers.add(s);
            }
        }
    }
    // If we've made it to this point, nobody shares any groups.
    return true;
}
This method probably doesn't have great performance on large datasets, but it is very easy to understand. Because it's encapsulated in its own method, it should also be easy to update if it turns out you need better performance. If you do need to increase performance, I would look at overriding the equals method (and hashCode) of Person, which would make lookups in the associations map faster. From there you could also look at a custom type instead of String for groups, also with an overridden equals method. This would considerably speed up the contains method used above.
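If you do go down that route, here is a minimal sketch of what overriding equality on Person might look like, assuming a person is identified by name (the field is illustrative); hashCode has to be overridden consistently for HashMap lookups to benefit:
import java.util.Objects;

public class Person {
    private final String name;

    public Person(String name) {
        this.name = name;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Person)) return false;
        return Objects.equals(name, ((Person) o).name);
    }

    @Override
    public int hashCode() {
        // Must be consistent with equals() so HashMap lookups land in the right bucket.
        return Objects.hash(name);
    }
}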
The reason why I'm not too concerned about performance is that the numbers you've mentioned aren't really that big as far as algorithms are concerned. Because this method returns as soon as it finds two matching groups, in the very worst case you will call ArrayList.contains a number of times equal to the number of groups that exist. In the very best case scenario, it only needs to be called twice. Performance will likely only be an issue if you call checkForSharedGroups very, very often, in which case you might be better off finding a way to call it less often instead of optimizing the method itself.
Have you considered a hash table? If you know all of the keys you'll be using, it's possible to use a perfect hash function, which will allow you to achieve constant-time lookups.
How about having two separate entities for People and Group? Inside People, have a set of Group, and vice versa.
class People {
    Set<Group> groups;
    // API for addGroup, getGroup
}
class Group {
    Set<People> people;
    // API for addPeople, getPeople
}
check(People p1, People p2):
1) Call getGroup on both p1 and p2.
2) Check the size of both sets.
3) Iterate over the smaller set, and check whether each group is present in the other set (of groups).
Now, you can basically store the People objects in any data structure. Preferably a linked list if the size is not fixed, otherwise an array.
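A rough sketch of that check using the two entities above; using Collections.disjoint for the "no common group" test is my substitution for the manual iteration described in steps 1-3:
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

class Group {
    Set<People> people = new HashSet<>();
}

class People {
    Set<Group> groups = new HashSet<>();

    // True if the two people have no group in common.
    static boolean check(People p1, People p2) {
        // Collections.disjoint returns true if the two collections share no element.
        return Collections.disjoint(p1.groups, p2.groups);
    }
}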

How to test generated Strings where order does not matter?

How do I unit test generated strings where the end order is fairly flexible? Let's say I'm trying to test some code that prints out generated SQL that comes from key-value pairs. However, the exact order of many of the fragments does not matter.
For example:
SELECT
*
FROM
Cats
WHERE
fur = 'fluffy'
OR
colour = 'white'
is functionally identical to
SELECT
*
FROM
Cats
WHERE
colour = 'white'
OR
fur = 'fluffy'
It doesn't matter in which order the condition clauses get generated, but it does matter that they follow the WHERE clause. The order is also hard to predict, since the ordering of pairs when looping through the entrySet() of a HashMap is not predictable. Sorting the keys would solve this, but introduces a runtime penalty for no (or negative) business value.
How do I unit test the generation of such strings without over-specifying the order?
I thought about using a regex, but I could not think of how to write one that says something like: "SELECT * FROM Cats WHERE", followed by one of {"fur = 'fluffy'", "colour = 'white'"}, followed by "OR", followed by one of {"fur = 'fluffy'", "colour = 'white'"} ... and not the one used last time.
NB: I'm not actually doing this with SQL, it just made for an easier way to frame the problem.
I see a few different options:
If you can live with a modest runtime penalty, LinkedHashMap keeps insertion order.
If you want to solve this completely without changing your implementation, in your example I don't see why you should have to do something more complicated than checking that every fragment appears in the code, and that they appear after the WHERE. Pseudo-code:
Map<String, String> parametersAndValues = { "fur": "fluffy", "colour": "white" };
String generatedSql = generateSql(parametersAndValues);
int whereIndex = generatedSql.indexOf("WHERE");
for (String key, value : parametersAndValues) {
    String fragment = String.format("%s = '%s'", key, value);
    assertThat(generatedSql, containsString(fragment));
    assertThat(whereIndex, is(lessThan(generatedSql.indexOf(fragment))));
}
But we can do it even simpler than that. Since you don't actually have to test this with a large set of parameters - for most implementations there are only three important quantities, "none, one, or many" - it's actually feasible to test it against all possible values:
String variation1 = "SELECT ... WHERE fur = 'fluffy' OR colour = 'white'";
String variation2 = "SELECT ... WHERE colour = 'white' OR fur = 'fluffy'";
assertThat(generatedSql, anyOf(is(variation1), is(variation2)));
Edit: To avoid writing all possible variations by hand (which gets rather tedious if you have more than two or three items as there are n! ways to combine n items), you could have a look at the algorithm for generating all possible permutations of a sequence and do something like this:
List<List<String>> permutations = allPermutationsOf("fur = 'fluffy'",
        "colour = 'white'", "scars = 'numerous'", "disposition = 'malignant'");
List<String> allSqlVariations = new ArrayList<>(permutations.size());
for (List<String> permutation : permutations) {
    allSqlVariations.add("SELECT ... WHERE " + join(permutation, " OR "));
}
assertThat(generatedSql, isIn(allSqlVariations));
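The allPermutationsOf and join helpers are not spelled out above; one straightforward recursive sketch of them (names match the snippet, implementation is mine) could be:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Permutations {
    // All orderings of the given items, built by picking each remaining item in turn.
    static List<List<String>> allPermutationsOf(String... items) {
        List<List<String>> result = new ArrayList<>();
        permute(new ArrayList<>(Arrays.asList(items)), new ArrayList<>(), result);
        return result;
    }

    private static void permute(List<String> remaining, List<String> current,
                                List<List<String>> result) {
        if (remaining.isEmpty()) {
            result.add(new ArrayList<>(current));
            return;
        }
        for (int i = 0; i < remaining.size(); i++) {
            String picked = remaining.remove(i);
            current.add(picked);
            permute(remaining, current, result);
            current.remove(current.size() - 1);
            remaining.add(i, picked);
        }
    }

    // Join fragments with a separator, e.g. join(["a", "b"], " OR ") -> "a OR b".
    static String join(List<String> parts, String separator) {
        return String.join(separator, parts);
    }
}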
Well, one option would be to somehow parse the SQL, extract the list of fields, and check that everything is OK, disregarding the order of the fields. However, this is going to be rather ugly: if done right, you have to implement a complete SQL parser (obviously overkill); if you do it quick-and-dirty using a regex or similar, you risk that the test will break on minor changes to the generated SQL.
Instead, I'd propose to use a combination of unit and integration testing:
Have a unit test that tests the code which supplies the list of fields for building the SQL. I.e., have a method Map getRestrictions() which you can easily unit-test.
Have an integration test for the SQL generation as a whole, which runs against a real database (maybe some embedded DB like the H2 database, which you can start just for the test).
That way, you unit-test the actual values supplied to the SQL, and you integration-test that you are really creating the right SQL.
Note: In my opinion this is an example of "integration code", which cannot be usefully unit-tested. The problem is that the code does not produce a real, testable result by itself. Rather, its purpose is to interface with a database (by sending it SQL), which produces the result. In other words, the code does the right thing not if it produces some specific SQL string, but if it drives the database to do the right thing. Therefore, this code can be meaningfully tested only with the database, i.e. in an integration test.
First, use a LinkedHashMap instead of a regular HashMap. It shouldn't introduce any noticeable performance degradation. Rather than sorting, it retains insertion ordering.
Second, insert the pairs into the map in a well understood manner. Perhaps you are getting the data from a table, and adding an ordering index is unacceptable. But perhaps the database can be ordered by primary key or something.
Combined, those two changes should give you predictable results.
Alternatively, compare actual vs. expected using something smarter than string equals. Perhaps a regex to scrape out all the pairs that have been injected into the actual SQL query?
The best I have come up with so far is to use some library during testing (such as PowerMockito) to replace the HashMap with a SortedMap like TreeMap. That way the order will be fixed for the tests. However, this only works if the map isn't built in the same code that generates the string.

How to store matrix information in MySQL?

I'm working on an application that analyzes music similarity. In order to do that I process audio data and store the results in txt files. For each audio file I create two files: one containing 16 values (each value can look like this: 2.7000023942731723) and the other containing 16 rows, each row containing 16 values like the one previously shown.
I'd like to store the contents of these two files in a table of my MySQL database.
My table looks like:
Name varchar(100)
Author varchar (100)
In order to add the content of those two files I think I need to use the BLOB data type:
file1 blob
file2 blob
My question is: how should I store this info in the database? I'm working in Java, where I have a double array containing the 16 values (for file1) and a matrix containing the file2 info. Should I process the values as strings and add them to the columns in my database?
Thanks
Hope I don't get negative-repped into oblivion with this crazy answer, but I am trying to think outside the box. My first question is: how are you processing this data after a potential query? If I were doing something similar, I would likely use something like MATLAB or Octave, which have a specific notation for representing matrices. It is basically a bunch of comma- and semicolon-delimited text with square brackets at the right spots. I would store just a string that my mathematics software or module can parse natively. After all, it doesn't sound like you want to do some kind of query based on a data point.
I think you need to normalize a schema like this if you intend to keep it in a relational database.
Sounds like you have a matrix table that has a one-to-many relationship with its files.
If you insist on one denormalized table, one way to do it would be to store the name of the file, its author, the name of its matrix, and its row and column position in the named matrix that owns it.
Please clarify one thing: Is this a matrix in the linear algebra sense? A mathematical entity?
If yes, and you only use the matrix in its entirety, then maybe you can store it in a single column as a blob. That still forces you to serialize and deserialize to a string or blob every time it goes into and comes out of the database.
Do you need to query the data (say for all the values that are bigger than 2.7) or just store it (you always load the whole file from the database)?
Given the information in the comment, I would save the files in a BLOB or TEXT column, as said in other answers. You don't even need a line delimiter, since you can use index arithmetic (division and modulus by the row length) on the flat list of values to recover each value's row and column in the matrix.
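As a rough sketch of that idea, assuming the 16x16 size from the question and a comma delimiter; index / 16 recovers the row and index % 16 the column when reading the flat string back:
import java.util.StringJoiner;

public class MatrixText {
    // Flatten a 16x16 matrix into one comma-separated string (no line delimiter needed).
    static String toText(double[][] matrix) {
        StringJoiner joiner = new StringJoiner(",");
        for (double[] row : matrix) {
            for (double value : row) {
                joiner.add(Double.toString(value));
            }
        }
        return joiner.toString();
    }

    // Rebuild the matrix: index / 16 is the row, index % 16 is the column.
    static double[][] fromText(String text) {
        String[] cells = text.split(",");
        double[][] matrix = new double[16][16];
        for (int i = 0; i < cells.length; i++) {
            matrix[i / 16][i % 16] = Double.parseDouble(cells[i]);
        }
        return matrix;
    }
}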
I think the problem that dedalo is facing is that he's working with arrays (I assume one is jagged, one is multi-dimensional) and he wants to serialize these to a blob.
But, arrays aren't directly serializable so he's asking how to go about doing this.
The simplest way to go about it would be to loop through the array and build a string as Dave suggested and store the string. This would allow you to view the contents from the value in the database instead of deserializing the data whenever you need to inspect it, as duffymo points out.
If you'd like to know how to serialize the array into BLOB...(this just seems like overkill)
You are able to serialize one-dimensional arrays and jagged arrays, e.g.:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class Test {
    public static void main(String[] args) throws Exception {
        // Serialize an int[]
        ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("test.ser"));
        out.writeObject(new int[] {0, 1, 2, 3, 4, 5, 6, 7, 8, 9});
        out.flush();
        out.close();
        // Deserialize the int[]
        ObjectInputStream in = new ObjectInputStream(new FileInputStream("test.ser"));
        int[] array = (int[]) in.readObject();
        in.close();
        // Print out contents of deserialized int[]
        System.out.println("It is " + (array instanceof Serializable) + " that int[] implements Serializable");
        System.out.print("Deserialized array: " + array[0]);
        for (int i = 1; i < array.length; i++) {
            System.out.print(", " + array[i]);
        }
        System.out.println();
    }
}
As for what data type to store it as in MySQL, there are only four blob types to choose from:
The four BLOB types are TINYBLOB, BLOB, MEDIUMBLOB, and LONGBLOB
Choosing the best one depends on the size of the serialized object. I'd imagine BLOB would be good enough.
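To connect this to the MySQL side, here is a minimal sketch of writing the serialized bytes into BLOB columns over JDBC; the table and column names (songs, name, author, file1, file2) and the connection URL are made up for illustration:
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class StoreMatrix {
    // Serialize any serializable object (e.g. double[] or double[][]) to bytes.
    static byte[] toBytes(Object value) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        double[] vector = new double[16];       // contents of file1
        double[][] matrix = new double[16][16]; // contents of file2

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost/music", "user", "password");
             PreparedStatement stmt = conn.prepareStatement(
                     "INSERT INTO songs (name, author, file1, file2) VALUES (?, ?, ?, ?)")) {
            stmt.setString(1, "Some song");
            stmt.setString(2, "Some author");
            stmt.setBytes(3, toBytes(vector));
            stmt.setBytes(4, toBytes(matrix));
            stmt.executeUpdate();
        }
    }
}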
