I'm currently writing a program that takes in two CSVs - one containing database keys (and other information irrelevant to the current issue), the other being an asset manifest. The program checks the database key from the first CSV, queries an online database to retrieve the asset key, then gets the asset status from the second CSV. (This is a workaround to a stupid API issue.)
My problem is that while the CSV that is being iterated over is relatively short - only about 300 lines long usually - the other is an asset manifest that is easily 10000 lines long (and sorted, though not by the key I can obtain from the first CSV). I obviously don't want to iterate over the entire asset manifest for every single input line, since that will take roughly 10 eternities.
I'm a fairly inexperienced programmer, so I only know of sorting/searching algorithms in general, and I don't know which one would be the right fit here. What algorithm would be the most efficient? Is there a way to "batch-query" the manifest for all of the assets listed in the input CSV that would be faster than searching the manifest individually for each key? Or should I use a tree or hashtable or something else I've seen mentioned in other SE threads? I don't know anything about the performance implications of any of these.
I can format the manifest as needed when it's input (it's just copy-pasted into a GUI), so I guess I could iterate over the entire manifest when it's input and make a hashtable of key:line pairs and then search that? Or I could turn it into a 2D array and just search the specified index? Those are all I can think of.
Problem is, I don't know how much time computer operations like that take, and if that would just waste time or actually improve performance.
P.S. I'm currently using Java for this since it's all I know, but if another language would be faster then I'm all ears.
The simple solution is to create a HashMap: iterate through one of the files and add each line to the HashMap (with the corresponding key and value), then iterate through the other file and check whether the HashMap contains each key; if it does, put the data into a second HashMap, and after the iteration return that second HashMap. Building the map is a single pass over the manifest and each lookup is O(1) on average, so the whole job is roughly O(n + m) rather than O(n * m) for a nested scan.
Imagine we have a test1.csv file with columns key,name,family and content like below:
5000,ehsan,tashkhisi
2,ali,lllll
3,amel,lllll
1,azio,skkk
And a test2.csv file with columns key,status like below:
1000,status1
1,status2
5000,status3
4000,status4
4001,status1
4002,status3
5,status1
We want to have output like this:
1 -> status2
5000 -> status3
Simple code will be like below:
Java 8 Stream:
private static Map<String, String> findDataInTwoFilesJava8() throws IOException {
    Map<String, String> map =
            Files.lines(Paths.get("/tmp/test2.csv")).map(a -> a.split(","))
                 .collect(Collectors.toMap(a -> a[0], a -> a[1]));
    return Files.lines(Paths.get("/tmp/test1.csv")).map(a -> a.split(","))
                .filter(a -> map.containsKey(a[0]))
                .collect(Collectors.toMap(a -> a[0], a -> map.get(a[0])));
}
Simple Java:
private static Map<String, String> findDataInTwoFiles() throws IOException {
    String line;
    Map<String, String> map = new HashMap<>();
    // read the manifest once and index it by key
    try (BufferedReader br = new BufferedReader(new FileReader("/tmp/test2.csv"))) {
        while ((line = br.readLine()) != null) {
            String[] lineData = line.split(",");
            map.put(lineData[0], lineData[1]);
        }
    }

    // then look up each key from the other file
    Map<String, String> resultMap = new HashMap<>();
    try (BufferedReader br = new BufferedReader(new FileReader("/tmp/test1.csv"))) {
        while ((line = br.readLine()) != null) {
            String key = line.split(",")[0];
            if (map.containsKey(key))
                resultMap.put(key, map.get(key));
        }
    }
    return resultMap;
}
I have a HashMap. I wrote a method for reading files and printing them, but I just realized that a HashMap does not tolerate duplicate keys in the file, so I need to do something about it, e.g. saving the same key but with some kind of character appended to the end (like an underscore _, so the keys differ from each other). I can't come up with a solution (maybe I could catch an exception, or just write an if block). Could you please help me?
public static HashMap<String, String> hashmapReader(File test3) {
    HashMap<String, String> data = new HashMap<>();
    try (BufferedReader hmReader = new BufferedReader(new FileReader(test3))) {
        String line;
        while ((line = hmReader.readLine()) != null) {
            String[] columns = line.split("\t");
            String key = columns[0];
            String value = columns[1];
            data.put(key, value);
        }
    } catch (Exception e) {
        System.out.println("Something went wrong");
    }
    return data;
}
You can add a check on the key to see whether it already exists in the HashMap data.
To do this you can use the get(key) method of the HashMap class, which returns null if the key doesn't exist:
if (data.get(key) != null)
    key = key + "_";
data.put(key, value); // store the value under the (possibly renamed) key
If it already exists (get didn't return null), you can change its name by adding a character at the end, e.g. "_" as you said.
EDIT: The answer above mine pointed out something I had missed: "What if there are more than 2 identical keys?"
For this reason, I recommend following that solution instead of mine.
To achieve what you actually ask for:
Before your put line:
while (data.containsKey(key)) key += "_";
data.put(key, value);
This will keep on checking the map to see if key already exists, and if it does, it adds the _ to the end, and tries again.
You can do these two lines in one:
while (data.putIfAbsent(key, value) != null) key += "_";
This does basically the same, but it just avoids having to look up twice in the case that the key isn't found (and thus the value should be inserted).
However, consider whether this is actually the best thing to do: how will you then look up things by "key" if you've essentially made up the keys while reading them?
You can keep multiple values per key by using a value type which stores multiple values, e.g. List<String>.
HashMap<String, List<String>> data = new HashMap<>();
// ...
data.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
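To read the values back out later, a lookup could then look like this (just a sketch; getOrDefault saves a null check when the key was never seen):
List<String> values = data.getOrDefault(key, Collections.emptyList());
for (String v : values) {
    System.out.println(key + " -> " + v);
}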
So I've finished a program that I built together with another person, and I understand what every line of code does except for one part. This is the code:
Set<String> set1 = firstWordGroup.getWordCountsMap().keySet();
Map<String, Integer> stringIntegerMap1 = set1.stream().collect(HashMap::new,
(hashMap, s) -> hashMap.put(s, s.length()), HashMap::putAll);
stringIntegerMap1.forEach((key,value) ->System.out.println(key + " : "+value));
Some background info:
getWordCountsMap is a method that looks like this:
public HashMap getWordCountsMap() {
    HashMap<String, Integer> myHashMap = new HashMap<String, Integer>();
    for (String word : this.getWordArray()) {
        if (myHashMap.keySet().contains(word)) {
            myHashMap.put(word, myHashMap.get(word) + 1);
        } else {
            myHashMap.put(word, 1);
        }
    }
    return myHashMap;
}
firstWordGroup is an object whose constructor stores a string of words.
If anybody could explain exactly what this block of code does and how it does it then that would be helpful, thanks.
P.S: I'm not sure if supplying the whole program to reproduce the code is relevant so if you think it is, just leave a comment saying so and I will edit the question so you can reproduce the program.
getWordCountsMap() returns a map where the key is a word and the value is how many times the word occurred in the array.
Set<String> set1 = firstWordGroup.getWordCountsMap().keySet();
The keySet() method returns just the keys of the map, so now you have a set of the words but have lost the occurrence counts.
Map<String, Integer> stringIntegerMap1 =
set1.stream()
.collect(HashMap::new,
(hashMap, s) -> hashMap.put(s, s.length()),
HashMap::putAll)
This uses Java 8 streams to iterate through the set of words (originally the keys of the count map) and collect them into a new hash map, where the key is the word (as before) and the value is the length of the word (whereas originally the value was the word count).
What I'm not quite understanding is the final HashMap::putAll, which would seem to take the hash map just populated and re-add all its entries (really a no-op, since the keys would just be replaced); I believe it acts as the combiner used to merge partial maps if the stream is processed in pieces. I haven't whipped open my IDE to put in the code and test it (which, if you haven't yourself, I would recommend); I'm just not interested enough to do so because it's not my problem!
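For reference, here is the same collect call from the question with the three collector arguments spelled out and labelled (just a restatement, not new behavior):
Map<String, Integer> stringIntegerMap1 = set1.stream().collect(
        HashMap::new,                                // supplier: creates the empty result map
        (hashMap, s) -> hashMap.put(s, s.length()),  // accumulator: folds each word into the map as word -> length
        HashMap::putAll);                            // combiner: merges partial maps (only relevant for parallel processing)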
stringIntegerMap1.forEach((key,value) ->System.out.println(key + " : "+value));
In essence, this is a cleaner way to iterate through the entries in the map created, printing out the word and length for each.
After working through this and thinking about it, I have a feeling I'm doing your homework for you. The real way to figure this out is to break things down and step through them in your IDE's debugger, seeing what each step does.
Set<String> set1 = firstWordGroup.getWordCountsMap().keySet();
This line calls getWordCountsMap, which returns a map from words to their counts. It then ignores the counts and just takes the words as a set. Note this could be achieved in much simpler ways.
Map<String, Integer> stringIntegerMap1 = set1.stream()
.collect(HashMap::new, (hashMap, s) -> hashMap.put(s, s.length()), HashMap::putAll);
This converts the set of words to a stream and then collects it. The collector starts by creating a hash map and then, for each word, adds a mapping from the word to its length. If multiple maps are created (as is allowed in streams, for example when run in parallel) then they are combined using putAll. Again, there are much simpler and clearer ways to achieve this.
stringIntegerMap1.forEach((key,value) ->System.out.println(key + " : "+value));
This line iterates through all entries in the map and prints out the key and value.
All this code could have been achieved with:
Arrays.stream(getWordArray())
.distinct().map(w -> w + ":" + w.length()).forEach(System.out::println);
This command converts the words to a stream, removes duplicates, maps the words to the output string then prints them.
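If the goal is to end up with the word-to-length map itself rather than just printing it, a Collectors.toMap version would be a more conventional equivalent (a sketch, not taken from the original code):
Map<String, Integer> wordLengths = Arrays.stream(getWordArray())
        .distinct()                                         // keySet() had already removed duplicates, so mirror that
        .collect(Collectors.toMap(w -> w, String::length)); // key = the word, value = its length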
I have a csv file that consists of data in this format:
id, name, surname, morecolumns
5, John, Lok, more
2, John2, Lok2, more
1, John3, Lok3, more
etc..
I want to sort my csv file using the id as key and store the sorted results in another file.
Here is what I've done so far in order to create JavaPairs of (id, rest_of_line):
SparkConf conf = new SparkConf().setAppName.....;
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> file = sc.textFile("inputfile.csv");
// extract the header line and filter it out of the data
String header = file.first();
JavaRDD<String> lines = file.filter(s -> !s.equals(header));
// create JavaPairs
JavaPairRDD<Integer, String> pairRdd = lines.mapToPair(
new PairFunction<String, Integer, String>() {
public Tuple2<Integer, String> call(final String line) {
String str = line.split(",", 2)[0];
String str2 = line.split(",", 2)[1];
int id = Integer.parseInt(str);
return new Tuple2(id, str2);
}
});
// sort and save the output
pairRdd.sortByKey(true, 1);
pairRdd.coalesce(1).saveAsTextFile("sorted.csv");
This works in cases where I have small files. However, when I am using bigger files, the output is not sorted properly. I think this happens because the sort takes place on different nodes, so when the results from all the nodes are merged, the output isn't in the expected order.
So, the question is: how can I sort my csv file using the id as key and store the sorted results in another file?
The method coalesce is probably the one to blame, as it apparently does not contractually guarantee the ordering of the resulting RDD (see Which operations preserve RDD order?). So if you avoid that coalesce, the resulting output files will be ordered.
As you want a single csv file, you could fetch the result files from whatever file system you're using, taking care to preserve their order, and merge them. For example, if you're using HDFS (as stated by #PinoSan) this can be done using the command hdfs dfs -getmerge <hdfs-output-dir> <local-file.csv>.
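For illustration, a minimal sketch of the save step with the coalesce removed; note that sortByKey returns a new RDD, so its result has to be captured rather than discarded:
// sortByKey returns a new, sorted RDD; saving that RDD (without coalesce) keeps the order
JavaPairRDD<Integer, String> sorted = pairRdd.sortByKey(true);
sorted.saveAsTextFile("sorted.csv"); // produces part files that are globally ordered when read back in sequence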
As pointed out by #mauriciojost, you should not do the coalesce.
Instead, a better way to do this is pairRdd.sortByKey(true, pairRdd.getNumPartitions()).saveAsTextFile(path), so that the maximum possible work is carried out on the partitions that hold the data.
I need to save pairs of dependent Strings (action and parameter) into a file, a hashtable/map, or an array, depending on what is the best solution for speed and memory.
My application iterates through a large number of forms on a website, and I want to skip a form if the combination (String action, String parameter) has already been tested and therefore saved. I think an array would be too slow if I have more than thousands of different action/parameter tuples. I'm not experienced enough to choose the right method for this. I tried a Hashtable but it does not work:
Hashtable<String, String> ht = new Hashtable<String, String>();
if (ht.containsKey(action) && ht.get(action).contains(parameter)) {
    System.out.println("Tupel already exists");
    continue;
} else
    ht.put(action, parameter);
If action and parameter will always be a 1-to-1 mapping (an action will only ever have one parameter), then your basic premise should be fine (though I'd recommend HashMap over Hashtable, as it's faster and supports null keys).
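For that 1-to-1 case, a minimal sketch of the check (seen is a made-up variable name here; comparing with equals rather than contains avoids matching on substrings):
Map<String, String> seen = new HashMap<>(); // created once, outside the loop over forms

if (parameter.equals(seen.get(action))) { // get() returns null for unknown actions, equals() handles that
    System.out.println("Tupel already exists");
    // skip this form
} else {
    seen.put(action, parameter);
}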
If you will have many parameters for a given action, then you want Map<String, Set<String>> - where action is the key and each action is then associated with a set of parameters.
Declare it like this:
Map<String, Set<String>> map = new HashMap<>();
Use it like this:
Set<String> parameterSet = map.get(action); // look up the parameterSet
if ((parameterSet != null) && (parameterSet.contains(parameter))) { // if it exists and contains the parameter
    System.out.println("Tupel already exists");
} else { // pair doesn't exist
    if (parameterSet == null) { // create parameterSet if needed
        parameterSet = new HashSet<String>();
        map.put(action, parameterSet);
    }
    parameterSet.add(parameter); // and add your parameter
}
As for the rest of your code and other things that may not be working:
I'm not sure what your use of continue is for in your original code; it's hard to tell without the rest of the method.
I'm assuming the creation of your hashtable is separated from the usage - if you're recreating it each time, then you'll definitely have problems.
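For instance, a rough sketch of what "separated from the usage" means; the loop, the Form type and its accessors are made up purely for illustration:
Map<String, Set<String>> seen = new HashMap<>(); // create the map once, before the crawl loop

for (Form form : formsToTest) {                  // hypothetical loop over the forms on the site
    String action = form.getAction();            // hypothetical accessors
    String parameter = form.getParameter();

    Set<String> params = seen.computeIfAbsent(action, a -> new HashSet<>());
    if (!params.add(parameter)) { // add() returns false if this (action, parameter) pair was already recorded
        continue;                 // already tested, skip it
    }
    // ... test the form ...
}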
I created a HashMap to store a text file with the columns of information. I compared the key to a specific name and stored the values of the HashMap into an ArrayList. When I try to println my ArrayList, it only outputs the last value and leaves out all the other values that match that key.
This isn't my entire code, just the two loops that read in the text file, store it into the HashMap and then into the ArrayList. I know it has something to do with my loops.
Did some editing and got it to output, but all my values are displayed multiple times.
My output looks like this.
North America:
[ Anguilla, Anguilla, Antigua and Barbuda, Antigua and Barbuda, Antigua and Barbuda, Aruba, Aruba, Aruba,
HashMap<String, String> both = new HashMap<String, String>();
ArrayList<String> sort = new ArrayList<String>();
//ArrayList<String> sort2 = new ArrayList<String>();
// We need a try catch block so we can handle any potential IO errors
try {
try {
inputStream = new BufferedReader(new FileReader(filePath));
String lineContent = null;
// Loop will iterate over each line within the file.
// It will stop when no new lines are found.
while ((lineContent = inputStream.readLine()) != null) {
String column[]= lineContent.split(",");
both.put(column[0], column[1]);
Set set = both.entrySet();
//Get an iterator
Iterator i = set.iterator();
// Display elements
while(i.hasNext()) {
Map.Entry me = (Map.Entry)i.next();
if(me.getKey().equals("North America"))
{
String value= (String) me.getValue();
sort.add(value);
}
}
}
System.out.println("North America:");
System.out.println(sort);
System.out.println("\n");
}
Map keys need to be unique. Your code is working according to spec.
If you need to have many values for a key, you may use
Map<Key, List<T>>
where T here is String (and you aren't limited to List; you can use any collection).
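A small sketch of that idea, reusing the names from the question (both, column, lineContent):
HashMap<String, List<String>> both = new HashMap<>(); // value type changed to a List

// inside the read loop:
String[] column = lineContent.split(",");
both.computeIfAbsent(column[0], k -> new ArrayList<>()).add(column[1]);

// after the loop: every value stored for "North America", in file order
List<String> northAmerica = both.getOrDefault("North America", Collections.emptyList());
System.out.println("North America:");
System.out.println(northAmerica);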
Some things seem wrong with your code:
you are iterating over the Map's entrySet to get just one value; you could just use the following code:
if (both.containsKey("North America"))
    sort.add(both.get("North America"));
it seems that you can have "North America" more than once in your input file, but you are storing it in a Map, so each time you store a new value for "North America" in your Map, it overwrites the current value
I don't know what the type of sort is, but what is printed by System.out.print(sort); depends on the toString() implementation of that type, and the fact that you use print() instead of println() may also create problems depending on how you run your program (some shells may not display the last line, for instance).
If you want more help, you may want to provide us with the following things :
sample of the input file
declaration of sort
sample of output
what you want to obtain.