Sorting of 2 or more massive resultsets? - java

I need to be able to sort multiple intermediate result sets and write them to a file in sorted order. The sort is based on a single column/key value. Each result set record is a list of values (like a record in a table).
The intermediate result sets are obtained by querying entirely different databases.
The intermediate result sets are already sorted on some key (or column). They need to be combined and sorted again on the same key (or column) before being written to a file.
Since these result sets can be massive (order of MBs), this cannot be done in memory.
My solution, broadly:
Use a hash map and a random access file. Since the result sets are already sorted, when retrieving them I will store the sorted column values as keys in a HashMap. The value in the HashMap will be an address in the random access file where every record associated with that column value is stored.
Any ideas?

Have a pointer into every set, initially pointing to the first entry
Then choose the next result from the set, that offers the lowest entry
Write this entry to the file and increment the corresponding pointer
This approach has basically no overhead, and time is O(n) for a fixed number of sets. (it's Merge-Sort, btw)
Edit
To clarify: It's the merge part of merge sort.
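For more than two sets, the "choose the lowest entry" step can use a heap instead of scanning every pointer. What follows is a minimal sketch, assuming each result set can be exposed as an Iterator over already-sorted records. KWayMergeSketch, Cursor and merge are names made up for this illustration, and the sink stands in for whatever writes to the file. With k sources and n records in total it does O(n log k) comparisons and keeps only k records in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Consumer;

class KWayMergeSketch {

    // One "pointer" into a sorted source: its current record plus the rest of the source.
    private static final class Cursor<T> {
        T head;
        final Iterator<T> rest;

        Cursor(T head, Iterator<T> rest) {
            this.head = head;
            this.rest = rest;
        }
    }

    // Merges already-sorted sources into the sink in sorted order.
    static <T> void merge(List<Iterator<T>> sources,
                          Comparator<? super T> order,
                          Consumer<? super T> sink) {
        Comparator<Cursor<T>> byHead = (x, y) -> order.compare(x.head, y.head);
        PriorityQueue<Cursor<T>> heap = new PriorityQueue<>(byHead);
        for (Iterator<T> it : sources) {
            if (it.hasNext()) {
                heap.add(new Cursor<>(it.next(), it)); // pointer starts at the first entry
            }
        }
        while (!heap.isEmpty()) {
            Cursor<T> lowest = heap.poll();        // source offering the lowest entry
            sink.accept(lowest.head);              // write it out
            if (lowest.rest.hasNext()) {
                lowest.head = lowest.rest.next();  // advance that pointer
                heap.add(lowest);
            }
        }
    }

    public static void main(String[] args) {
        List<Iterator<Integer>> sources = new ArrayList<>();
        sources.add(Arrays.asList(1, 4, 9).iterator());
        sources.add(Arrays.asList(2, 3, 10).iterator());
        // In the real case the sink would append the record to the output file.
        merge(sources, Comparator.<Integer>naturalOrder(), v -> System.out.println(v));
    }
}
```

For JDBC, each Iterator would be a thin wrapper around one ResultSet that reads a row on next(); that wrapper is left out here.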

If you've got 2 pre-sorted result sets, you should be able to iterate them concurrently while writing the output file. You just need to compare the current row in each set:
Simple example (not ready for copy-and-paste use!):
ResultSet a, b;
// fetch a and b
// next() returns false once a set is exhausted (or if it was empty to begin with)
boolean hasA = a.next();
boolean hasB = b.next();
while (hasA || hasB) {
    if (!hasA) {
        writeToFile(b);
        hasB = b.next();
    } else if (!hasB) {
        writeToFile(a);
        hasA = a.next();
    } else {
        int valueA = a.getInt("SORT_PROPERTY");
        int valueB = b.getInt("SORT_PROPERTY");
        if (valueA < valueB) {
            writeToFile(a);
            hasA = a.next();
        } else {
            writeToFile(b);
            hasB = b.next();
        }
    }
}

Sounds like you are looking for an implementation of the Balance Line algorithm.

Related

Java Stream - Retrieving repeated records from CSV

I searched the site and didn't find anything similar. I'm a newbie at using Java streams, but I understand that a stream is a replacement for a loop. However, I would like to know if there is a way to filter a CSV file using a stream, as shown below, where only the repeated records are included in the result, grouped by the Center field.
Initial CSV file
Final result
In addition, the same pair cannot appear in the final result inversely, as shown in the table below:
This shouldn't happen
Is there a way to do it using stream and grouping at the same time, since theoretically, two loops would be needed to perform the task?
Thanks in advance.
You can do it in one pass as a stream with O(n) efficiency:
class PersonKey {
    // have a field for every column that is used to detect duplicates
    String center, name, mother, birthdate;
    public PersonKey(String line) {
        // implement String constructor
    }
    // implement equals and hashCode using all fields
}
List<String> lines; // the input
Set<PersonKey> seen = new HashSet<>();
// keeps each line whose key was already seen, i.e. the second and later occurrences
List<String> repeated = lines.stream()
    .filter(p -> !seen.add(new PersonKey(p)))
    .distinct()
    .collect(Collectors.toList());
The trick here is that a HashSet has constant time operations and its add() method returns false if the value being added is already in the set, true otherwise.
What I understood from your examples is that you consider an entry a duplicate if all the attributes have the same value except the ID. You can use anyMatch for this:
list.stream().filter(x ->
list.stream().anyMatch(y -> isDuplicate(x, y))).collect(Collectors.toList())
So what does the isDuplicate(x,y) do?
This returns a boolean. You can check whether all the entries have same value except the id in this method:
private boolean isDuplicate(CsvEntry x, CsvEntry y) {
return !x.getId().equals(y.getId())
&& x.getName().equals(y.getName())
&& x.getMother().equals(y.getMother())
&& x.getBirth().equals(y.getBirth());
}
I've assumed you've taken all the entries as String. Change the checks according to the type. This will give you the duplicate entries with their corresponding ID
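Since the question also asks for the repeated records to be grouped, here is a sketch of a different route than the two answers above: group the lines by everything except the ID and keep only groups with more than one member. The column layout (ID first, then Center, name, mother, birth date, comma-separated) is an assumption, because the CSV itself is not shown; RepeatedRecordsSketch, keyOf and repeated are made-up names.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class RepeatedRecordsSketch {

    // Key = every column except the first one (assumed to be the ID).
    static List<String> keyOf(String line) {
        String[] cols = line.split(",", -1);
        return Arrays.asList(cols).subList(1, cols.length);
    }

    // Maps each key that occurs more than once to all lines sharing it.
    static Map<List<String>, List<String>> repeated(List<String> lines) {
        Map<List<String>, List<String>> groups = lines.stream()
                .collect(Collectors.groupingBy(
                        RepeatedRecordsSketch::keyOf,
                        LinkedHashMap::new,      // keeps first-seen order of the keys
                        Collectors.toList()));
        groups.values().removeIf(group -> group.size() < 2); // drop records that are not repeated
        return groups;
    }

    public static void main(String[] args) {
        // Made-up sample rows, only to show the call.
        List<String> lines = Arrays.asList(
                "1,A,Ann,Eve,2000-01-01",
                "2,A,Bob,Sue,1999-05-05",
                "3,A,Ann,Eve,2000-01-01");
        repeated(lines).forEach((key, group) -> System.out.println(key + " -> " + group));
    }
}
```

Because every set of matching records ends up in exactly one group, the same pair cannot show up a second time in reverse order. This is a single pass plus hashing, unlike the nested anyMatch, which compares every entry with every other entry.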

Comparing Keys in a Hashmap

I have a test.csv file that is formatted as:
Home,Owner,Lat,Long
5th Street,John,5.6765,-6.56464564
7th Street,Bob,7.75,-4.4534564
9th Street,Kyle,4.64,-9.566467364
10th Street,Jim,14.234,-2.5667564
I have a hashmap that reads a file that contains the same header contents such as the CSV, just a different format, with no accompanying data.
In example:
Map<Integer, String> container = new HashMap<>();
where,
Key, Value
[0][NULL]
[1][Owner]
[2][Lat]
[3][NULL]
I have also created a second hash map that:
BufferedReader reader = new BufferedReader (new FileReader("test.csv"));
CSVParser parser = new CSVParser(reader, CSVFormat.DEFAULT);
Boolean headerParsed = false;
CSVRecord headerRecord = null;
int i;
Map<String,String> value = new HashMap<>();
for (final CSVRecord record : parser) {
if (!headerParsed = false) {
headerRecord = record;
headerParsed = true;
}
for (i =0; i< record.size(); i++) {
value.put (headerRecord.get(0), record.get(0));
}
}
I want to read and compare the hashmap, if the container map has a value that is in the value map, then I put that value in to a corresponding object.
example object
public DataSet (//args) {
this.home
this.owner
this.lat
this.longitude
}
I want to create a function that compares the two hash maps and, whenever a value map key is equal to a container map key, sets the corresponding value in the object. Something really simple that also handles the setting efficiently.
Please note: I made the CSV header and the rows finite. In real life, the CSV could have x number of fields (Home, Owner, Lat, Long, houseType, houseColor, etc.) and n number of values associated with those fields.
First off, your approach to this problem is unnecessarily long. From what I see, all you are trying to do is this:
Select two columns from a CSV file and add them to a data structure. I highlighted the word two because in a map, you have a key and a value. One column becomes the key, and the other becomes the value.
What you should do instead:
Import the names of columns you wish to add to the data structure into two strings. (You may read them from a file).
Iterate over the CSV file using the CSVParser class that you did.
Store the value corresponding to the first desired column in a string, repeat with the value corresponding to the second desired column, and push them both into a DataSet object, and push the DataSet object into a List<DataSet>.
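The steps above can be sketched roughly as follows. To stay self-contained this sketch splits lines on commas instead of using CSVParser, and it stores each row as a map from header name to cell rather than in a DataSet, so it also copes with more than two wanted columns. ColumnSelectSketch and select are made-up names, and the naive split does not handle quoted fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ColumnSelectSketch {

    // Returns one map per data row, containing only the wanted columns.
    static List<Map<String, String>> select(List<String> csvLines, Set<String> wanted) {
        List<Map<String, String>> rows = new ArrayList<>();
        if (csvLines.isEmpty()) {
            return rows;
        }
        String[] header = csvLines.get(0).split(",", -1); // first line holds the column names
        for (String line : csvLines.subList(1, csvLines.size())) {
            String[] cells = line.split(",", -1);
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < header.length && i < cells.length; i++) {
                if (wanted.contains(header[i])) {         // skip columns nobody asked for
                    row.put(header[i], cells[i]);
                }
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> csv = Arrays.asList(
                "Home,Owner,Lat,Long",
                "5th Street,John,5.6765,-6.56464564",
                "7th Street,Bob,7.75,-4.4534564");
        Set<String> wanted = new HashSet<>(Arrays.asList("Owner", "Lat"));
        System.out.println(select(csv, wanted));
    }
}
```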
If you prefer to stick to your way of solving the problem:
Basically, the empty file is supposed to hold just the headers (column names), and that's why you named the corresponding hash map container. The second file is supposed to contain the values, and hence you named the corresponding hash map value.
First off, where you say
if (!headerParsed = false) {
headerRecord = record;
headerParsed = true;
}
you probably mean to say
if (!headerParsed) {
headerRecord = record;
headerParsed = true;
}
and where you say
for (i =0; i< record.size(); i++) {
value.put(headerRecord.get(0), record.get(0));
}
you probably mean
for (i =0; i< record.size(); i++) {
value.put(headerRecord.get(i), record.get(i));
}
i.e. You iterate over one record and store the value corresponding to each column.
Now I haven't tried this code on my desktop, but since the for loop also iterates over Home and Long, it will store those columns too: value.put("Home", "5th Street") does not raise an error, because a HashMap simply accepts the entry. So you should add an extra check before calling value.put. Wrap it inside an if conditional and check whether headerRecord.get(i) exists in the container hash map.
for (i = 0; i < record.size(); i++) {
    // container maps column index -> header name, so look the header up among its values
    if (container.containsValue(headerRecord.get(i))) {
        value.put(headerRecord.get(i), record.get(i));
    }
}
Now the thing is that the data structure itself depends on which values from the container hash map you want to store. It could be Home and Lat, or Owner and Long. So we are stuck. How about you create a data structure like below:
class DataSet {
    String val1;
    String val2;
}
Also, note that this DataSet is only for storing ONE row. For storing information from multiple rows, you need to create a List of DataSet.
Lastly, the container file contains ALL the column names. Not all these columns will be stored in the Data Set (i.e. You chose to NULL Home and Long. You could have chosen to NULL Owner and Lat), hence the header file is not what you need to make this decision.
If you think about it, just iterate over the values hash map and store the first value in string val1 and the second value in val2.
List<DataSet> myList = new ArrayList<>();
for (Map.Entry<String, String> pair : value.entrySet()) {
    DataSet row = new DataSet(); // a fresh object for every entry
    row.val1 = pair.getKey();
    row.val2 = pair.getValue();
    myList.add(row);
}
I hope this helps.

Best way to implement a Reverse MultiMap

My Java8 program has several stages:
A CSV file is parsed. The CSV file looks like this:
123,[Foo:true; Bar:true; Foobar:false; Barfoo:false]
456,[Foobar:true; Barfoo:false; Foo:false; Bar:false]
789,[Foobar:true; Barfoo:false; Foo:false]
where 123, 456 and 789 are the unique identifiers of each data structure. One data structure is represented by the column identifiers Foo, Bar, Foobar and Barfoo, where each boolean indicates whether the column is a key column.
While the CSV file is parsed, a data structure must be created for each line.
Later, at runtime, which needs to be fast, the following will happen:
An ArrayList<String> containing column data is given. Data needs to be added to a specific data structure. (I do know the unique identifier 123).
Say: the ArrayList<String> needs to be added to 123: 1-> foo, 2->bar, 3-> foobar, 4 ->barfoo.
Say: another ArrayList<String> needs to be added to 456: 1-> foobar, 2-> barfoo
Say: another ArrayList<String> needs to be added to 789: 1-> foobar, 2-> null, 3-> foo
The tasks that the datastructure needs to provide are the following:
add(ArrayList<String>) : void
remove(ArrayList<String>) : boolean (if successful)
contains(ArrayList<String>) : boolean
get(ArrayList<String>) : ArrayList<String>
Notes:
The combination of all keys inside one data structure (123) is unique. Meaning: if 123 contains an entry with foo,bar,foobar,barfoo, another entry with foo,bar followed by any other values will not be allowed. Another entry with foo1,bar,foobar,barfoo is allowed, and so is an entry with foo1,bar1,foobar,barfoo.
It won't happen while parsing that a non-key column comes before a key column (one marked true). This will not happen:
[Foobar:true;Barfoo:false;Foo:true;Bar:false]
It won't happen at runtime that a column marked as a key gets no data. This will not happen: an ArrayList<String> added to 123 whose data looks like this: 1->foo, 2->null, 3->foobar.
I tried: storing at each datastructure-class two arrays. One with the Column Names, and one with Numbers of the Columns, which are keys. At runtime the key-indicating array will be processed to get all key values (at the first add example above it would be foo and bar) and they will be concatenated. (to a String "foo,bar"). This is a new key for a second HashMap<String,ArrayList<String>> where the value (ArrayList<String>) contains the data of all columns (foo,bar,foobar,barfoo).
I have a getKeyString method:
String getKeyString(ArrayList<String> keys, ArrayList<Integer> keyPos) throws Exception {
    // keyPos is ordered, so its last entry is the largest index and must fit into keys
    if (keyPos.get(keyPos.size()-1) >= keys.size())
        throw new Exception();
    String collect = keyPos.stream().map(i -> keys.get(i))
        .map(string -> {
            try {
                if (string.equals("null")) // happens not very often, ~1 time in 1,000
                    return "";
            }
            catch (NullPointerException e) { // happens even less, 1 in 100,000
                return "";
            }
            return string;
        })
        .collect(Collectors.joining(","));
    if (collect.length() < keyPos.size())
        throw new Exception("results in an empty key: ");
    return collect;
}
and the addDataListEntry looks quite similar to this:
HashMap<String, ArrayList<String>> dataLists = new HashMap<>();
ArrayList<Integer> keyPos = new ArrayList<>();
...
public void addDataListEntry(ArrayList<String> values) {
// will overwrite Entry if it already exists
try {
this.dataLists.put(getKeyString(values, keyPos), values);
} catch (Exception e) {
logger.info(e.getMessage());
}
}
This does actually work, but it is really slow, since the key (foo,bar) needs to be created at every data structure access.
Which combination of HashMaps, Lists, Sets, (even Google Guava) is the best to make it as fast as possible?
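One possible direction, offered as a sketch and not as a measured answer: since the key columns always come first, the key can be the leading part of the value list itself, used directly as the HashMap key (lists implement equals and hashCode element by element). KeyPrefixStoreSketch and its methods are made-up names, and keyCount is assumed to be known from parsing. This still builds a key object on every access, so whether it beats the joined string would have to be measured.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class KeyPrefixStoreSketch {

    private final int keyCount; // number of leading key columns (assumed known from the CSV line)
    private final Map<List<String>, List<String>> rows = new HashMap<>();

    KeyPrefixStoreSketch(int keyCount) {
        this.keyCount = keyCount;
    }

    // Copies the leading key columns so that later changes to 'values' cannot corrupt the key.
    private List<String> keyOf(List<String> values) {
        if (values.size() < keyCount) {
            throw new IllegalArgumentException("missing key columns");
        }
        return new ArrayList<>(values.subList(0, keyCount));
    }

    // Overwrites an existing entry with the same key, like the original addDataListEntry.
    void add(List<String> values) {
        rows.put(keyOf(values), values);
    }

    boolean remove(List<String> values) {
        return rows.remove(keyOf(values)) != null;
    }

    boolean contains(List<String> values) {
        return rows.containsKey(keyOf(values));
    }

    List<String> get(List<String> values) {
        return rows.get(keyOf(values));
    }

    public static void main(String[] args) {
        KeyPrefixStoreSketch store123 = new KeyPrefixStoreSketch(2); // Foo and Bar are keys
        store123.add(Arrays.asList("foo", "bar", "foobar", "barfoo"));
        System.out.println(store123.contains(Arrays.asList("foo", "bar", "x", "y")));
    }
}
```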

Create a sorted by value Map before putting any data in it

I know answer to this question has been provided in many variants but I couldn't find it for my specific query.
I want to have a map which is sorted by values and I need to get it created before I put data into it. I came up with below code to create it
private Map<String, Integer> mapUserScore = new ConcurrentSkipListMap<>(new Comparator<String>() {
@Override
public int compare(String o1, String o2) {
int i1=mapUserScore.get(o2);
int i2=mapUserScore.get(o1);
if(mapUserScore.get(o2)!=null && mapUserScore.get(o1)!=null){
int compare = mapUserScore.get(o2)-(mapUserScore.get(o1));
if(compare==0)compare=-1;
return compare;
}else
return 0;
}
});
So basically I want entries in map sorted by integer values in descending order so that highest scorers are on top.
However upon doing this when the first key-value pair is inserted, the program exits with below exception
Exception in thread "Thread-0" java.lang.StackOverflowError
at java.util.concurrent.ConcurrentSkipListMap.comparable(ConcurrentSkipListMap.java:658)
at java.util.concurrent.ConcurrentSkipListMap.doGet(ConcurrentSkipListMap.java:821)
at java.util.concurrent.ConcurrentSkipListMap.get(ConcurrentSkipListMap.java:1626)
Upon tracing, I found that line int i1=mapUserScore.get(o2) results in this exception.
Can anyone please help me to understand what could be the reason of stackoverflow here?
I am thinking that because before any item is stored in the map, code is trying to obtain it by using get() method to sort it and hence it goes into some recursive calls and results in exception.
If I understand correctly, you would like to be able to get the score associated with a name quickly (hence the need for a Map), and you would like to be able to iterate through the name-score pairs with the highest scores first.
I would just use a HashMap<String, NameScore> (where the key is the name and the value is the name-score pair). This would give you O(1) lookups. And when you need the name-score pairs sorted by score, create a new ArrayList<NameScore> from the values() of the map, sort it, and return it.
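A minimal sketch of that idea, simplified to a HashMap<String, Integer> whose entries play the role of the name-score pairs; ScoreBoardSketch and ranking are made-up names. Lookups stay O(1), and sorting happens only when a ranking is requested.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ScoreBoardSketch {

    private final Map<String, Integer> scores = new HashMap<>();

    void put(String name, int score) {
        scores.put(name, score);
    }

    Integer get(String name) {
        return scores.get(name); // plain hash lookup, no comparator involved
    }

    // Copies the entries and sorts the copy: highest score first, ties ordered by name.
    List<Map.Entry<String, Integer>> ranking() {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(scores.entrySet());
        entries.sort((x, y) -> {
            int byScore = Integer.compare(y.getValue(), x.getValue());
            return byScore != 0 ? byScore : x.getKey().compareTo(y.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        ScoreBoardSketch board = new ScoreBoardSketch();
        board.put("ann", 5);
        board.put("bob", 9);
        System.out.println(board.ranking());
    }
}
```

The sort costs O(n log n) per call, which is fine if rankings are requested rarely compared to score updates.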
The get() method uses the comparator to find the value. You can't use get() in the comparator or you will get a stack overflow.
A simple workaround is to include the score in the key and sort on that instead.
class NameScore implements Comparable<NameScore> {
    String name;
    int score;
    public int compareTo(NameScore o) { // highest score first, then name
        int c = Integer.compare(o.score, score);
        return c != 0 ? c : name.compareTo(o.name);
    }
}
BTW: When the comparator returns 0, this means the key is a duplicate, which is dropped. Unless you want only one name per score, you need to compare on the score AND the name.

How to "join" Hashtables in Java?

I have two strings:
A { 1,2,3,4,5,6 }
B { 6,7,8,9,10,11 }
It doesn't really matter what the numbers in the strings are. The user then picks what to join:
A hashjoin A.a1 = B.b5 B
I think I put A into a hashtable with A.a1 as the key and then iterate through B? The keys will be whatever the user wants to join on, and the data will be what's in the strings.
Are you sure you're trying to join hashtables? Perhaps you have the wrong data structure?
Look into java.util.Set (and java.util.HashSet). If you want the items that are in both tables, then it's a simple Set operation like so:
Collection A = new ...
...fill the A up...
Collection B = new ...
...fill the B up...
Set join = new HashSet();
join.addAll(A);
join.retainAll(B);
If you mean something more like a SQL table join, then the output will depend on what type of join you mean to perform, and what the equals sign means in this case. Note you'll have to write a Pair class (which you should make more descriptive than Pair for your exact case), and A and B need to be declared as Collection<Number> for the loops below to compile.
For a cross join (every A paired with every B):
List<Pair> pairs = new ArrayList<>();
for (Number numberA : A) {
    for (Number numberB : B) {
        pairs.add(new Pair(numberA, numberB));
    }
}
For a cross join filtered by a where clause (in effect an inner join):
List<Pair> pairs = new ArrayList<>();
for (Number numberA : A) {
    for (Number numberB : B) {
        if (check the condition of the where clause here) {
            pairs.add(new Pair(numberA, numberB));
        }
    }
}
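The nested loops compare every pair. When the condition is a plain equality, the approach described in the question (hash one input on its join column, then iterate over the other) avoids that. A rough sketch, assuming each row is a List<String> and the join columns are given as 0-based indexes; HashJoinSketch and join are made-up names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HashJoinSketch {

    // Joins rows of a and b where a[colA] equals b[colB]; each result row is the A row followed by the B row.
    static List<List<String>> join(List<List<String>> a, int colA,
                                   List<List<String>> b, int colB) {
        // Build phase: index every A row by its join value (several rows may share one value).
        Map<String, List<List<String>>> index = new HashMap<>();
        for (List<String> rowA : a) {
            index.computeIfAbsent(rowA.get(colA), k -> new ArrayList<>()).add(rowA);
        }
        // Probe phase: look up each B row's join value in the index.
        List<List<String>> out = new ArrayList<>();
        for (List<String> rowB : b) {
            List<List<String>> matches = index.get(rowB.get(colB));
            if (matches == null) {
                continue; // no A row with this value
            }
            for (List<String> rowA : matches) {
                List<String> joined = new ArrayList<>(rowA);
                joined.addAll(rowB);
                out.add(joined);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up rows, only to show the call.
        List<List<String>> a = Arrays.asList(Arrays.asList("1", "2"), Arrays.asList("3", "4"));
        List<List<String>> b = Arrays.asList(Arrays.asList("x", "1"), Arrays.asList("y", "5"));
        System.out.println(join(a, 0, b, 1));
    }
}
```

It usually pays to build the index on the smaller input, since that is the one held in memory.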
That's about as good an answer that can be given under the circumstances, as your question isn't very specific. If these general answers don't help you out, then you'll need to explain your question in more detail to get a more detailed answer.
--- First Edit, after some clarification ---
Ok, so it's an SQL-like equi-join.
Hashtables are Maps, which means they have an element in one "domain" which can be used to look up an element in another "domain". In a hash table, the first domain is the set of keys, and the second domain is the set of values. Think of it as a bunch of labels and a bunch of items. If the equi-join is to be performed, it must join like elements. That means it will either join one key to another key, or it will join one value to another value.
For keys:
Hashtable A = ...
Hashtable B = ...
Set keyJoin = new HashSet();
keyJoin.addAll(A.keySet());
keyJoin.retainAll(B.keySet());
For values:
Hashtable A = ...
Hashtable B = ...
Set valueJoin = new HashSet();
valueJoin.addAll(A.values());
valueJoin.retainAll(B.values());
It doesn't make sense to join the hashtables themselves; because, one "matching" value may live in both hashtables but be referenced by two different keys. Likewise, one "matching" key found in two different hashtables might not refer to the same value.
Your question doesn't make much sense. A hashtable (or hashmap), stores data as keys and values. You've said nothing about which of those values should be keys, and which should be values.
