I have a test.csv file that is formatted as:
Home,Owner,Lat,Long
5th Street,John,5.6765,-6.56464564
7th Street,Bob,7.75,-4.4534564
9th Street,Kyle,4.64,-9.566467364
10th Street,Jim,14.234,-2.5667564
I also have a hash map that is filled from a second file. That file contains the same header names as the CSV, in a different format and with no accompanying data.
For example:
Map<Integer, String> container = new HashMap<>();
where,
Key, Value
[0][NULL]
[1][Owner]
[2][Lat]
[3][NULL]
I have also created a second hash map that is filled from the CSV file:
BufferedReader reader = new BufferedReader (new FileReader("test.csv"));
CSVParser parser = new CSVParser(reader, CSVFormat.DEFAULT);
Boolean headerParsed = false;
CSVRecord headerRecord = null;
int i;
Map<String,String> value = new HashMap<>();
for (final CSVRecord record : parser) {
if (!headerParsed = false) {
headerRecord = record;
headerParsed = true;
}
for (i =0; i< record.size(); i++) {
value.put (headerRecord.get(0), record.get(0));
}
}
I want to read and compare the two hash maps: if the container map has a value that also appears in the value map, I put the associated data into the corresponding object.
Example object:
public DataSet (//args) {
this.home
this.owner
this.lat
this.longitude
}
I want to create a function that sets the data inside the object when the hash maps are compared: when a value map key matches an entry of the container map, the associated value is set on the object. Something really simple that is also efficient at handling the setting.
Please note: I made the CSV header and the rows finite here. In real life, the CSV could have x fields (Home, Owner, Lat, Long, houseType, houseColor, etc.) and n rows of values associated with those fields.
First off, your approach to this problem is unnecessarily long. From what I see, all you are trying to do is this:
Select two columns from a CSV file, and add them to a data structure. I stress the word two because a map has a key and a value: one column becomes the key, and the other becomes the value.
What you should do instead:
Import the names of the columns you wish to add to the data structure into two strings. (You may read them from a file.)
Iterate over the CSV file using the CSVParser class, as you already did.
Store the value of the first desired column in a string, do the same for the second desired column, put both into a DataSet object, and add that DataSet object to a List<DataSet>.
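The steps above can be sketched roughly as follows. This is only an illustration under a few assumptions: the class name ColumnSelector and the method select are made up for this example, the lines are already in memory, and a plain split(",") stands in for CSVParser, so quoted fields containing commas are not handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of any library.
final class ColumnSelector {

    /**
     * Keeps only the wanted columns of every data row.
     * lines.get(0) is expected to be the header line.
     */
    static List<Map<String, String>> select(List<String> lines, Set<String> wanted) {
        List<Map<String, String>> rows = new ArrayList<>();
        if (lines.isEmpty()) {
            return rows;
        }
        String[] header = lines.get(0).split(",");
        for (String line : lines.subList(1, lines.size())) {
            String[] cells = line.split(",", -1); // -1 keeps trailing empty cells
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < header.length && i < cells.length; i++) {
                // Store the cell only if its column name was requested.
                if (wanted.contains(header[i])) {
                    row.put(header[i], cells[i]);
                }
            }
            rows.add(row);
        }
        return rows;
    }
}
```

Because each row is kept as a map from column name to value, the same code works when more than two columns are requested.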
If you prefer to stick to your way of solving the problem:
Basically, the empty file is supposed to hold just the headers (column names), and that's why you named the corresponding hash map containers. The second file is supposed to contain the values and hence you named the corresponding hash map values.
First off, where you say
if (!headerParsed = false) {
headerRecord = record;
headerParsed = true;
}
you probably mean to say
if (!headerParsed) {
headerRecord = record;
headerParsed = true;
}
and where you say
for (i =0; i< record.size(); i++) {
value.put(headerRecord.get(0), record.get(0));
}
you probably mean
for (i =0; i< record.size(); i++) {
value.put(headerRecord.get(i), record.get(i));
}
i.e. You iterate over one record and store the value corresponding to each column.
Now I haven't tried this code on my desktop, but note that the for loop also iterates over Home and Long. HashMap.put will not raise an error for them (value.put("Home", "5th Street") simply stores the pair), so the columns you do not want end up in the map too. To avoid that, wrap the call in an if conditional that checks whether headerRecord.get(i) exists in the container hash map. The header names are the values of that map, so use containsValue:
for (i = 0; i < record.size(); i++) {
if (container.containsValue(headerRecord.get(i))) {
value.put(headerRecord.get(i), record.get(i));
}
}
Now thing is, that the data structure itself depends on which values from the containers hash map you want to store. It could be Home and Lat, or Owner and Long. So we are stuck. How about you create a data structure like below:
class DataSet {
String val1;
String val2;
}
Also, note that this DataSet is only for storing ONE row. For storing information from multiple rows, you need to create a Linked List of DataSet.
Lastly, the container file contains ALL the column names. Not all these columns will be stored in the Data Set (i.e. You chose to NULL Home and Long. You could have chosen to NULL Owner and Lat), hence the header file is not what you need to make this decision.
If you think about it, just iterate over the values hash map and store the first value in string val1 and the second value in val2.
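List<DataSet> myList = new ArrayList<>();
// value is the Map<String, String> that was filled from the CSV above
for (Map.Entry<String, String> pair : value.entrySet()) {
DataSet row = new DataSet(); // a fresh object per entry, otherwise every list element would be the same instance
row.val1 = pair.getKey();
row.val2 = pair.getValue();
myList.add(row);
}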
I hope this helps.
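If the goal is to set named fields on the object rather than generic val1/val2 slots, one possible sketch is a lookup table from header name to a small setter function. The field names follow the DataSet from the question; the SETTERS table and the fromRow method are assumptions made for this example, and headers that have no setter are silently ignored.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

final class DataSet {
    String home;
    String owner;
    String lat;
    String longitude;

    // Maps a CSV header name to the code that stores a value in the matching field.
    private static final Map<String, BiConsumer<DataSet, String>> SETTERS = new HashMap<>();
    static {
        SETTERS.put("Home", (d, v) -> d.home = v);
        SETTERS.put("Owner", (d, v) -> d.owner = v);
        SETTERS.put("Lat", (d, v) -> d.lat = v);
        SETTERS.put("Long", (d, v) -> d.longitude = v);
    }

    /** Builds a DataSet from one row given as header name -> cell value. */
    static DataSet fromRow(Map<String, String> row) {
        DataSet d = new DataSet();
        for (Map.Entry<String, String> e : row.entrySet()) {
            BiConsumer<DataSet, String> setter = SETTERS.get(e.getKey());
            if (setter != null) { // unknown headers are skipped
                setter.accept(d, e.getValue());
            }
        }
        return d;
    }
}
```

Supporting a new column such as houseType then means adding one field and one SETTERS entry; the loop itself stays unchanged.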
Related
My Java8 program has several stages:
A CSV file is parsed. The CSV file looks like this:
123,[Foo:true; Bar:true; Foobar:false; Barfoo:false]
456,[Foobar:true; Barfoo:false; Foo:false; Bar:false]
789,[Foobar:true; Barfoo:false; Foo:false]
where 123, 456 and 789 are the unique identifiers of each datastructure. One datastructure is represented by the column identifiers Foo, Bar, Foobar and Barfoo, where each boolean indicates whether the column is a key column.
While the CSV file is parsed, a datastructure must be created for each line.
Later, at runtime, which needs to be fast, the following will happen:
An ArrayList<String> containing column data is given. The data needs to be added to a specific datastructure. (I do know the unique identifier, e.g. 123.)
Say: the ArrayList<String> needs to be added to 123: 1-> foo, 2->bar, 3-> foobar, 4 ->barfoo.
Say: another ArrayList<String> needs to be added to 456: 1-> foobar, 2-> barfoo
Say: another ArrayList<String> needs to be added to 789: 1-> foobar, 2-> null, 3-> foo
The tasks that the datastructure needs to provide are the following:
add(ArrayList<String>) : void
remove(ArrayList<String>) : boolean (true if successful)
contains(ArrayList<String>) : boolean
get(ArrayList<String>) : ArrayList<String>
Notes:
The combination of all key values inside one datastructure (e.g. 123) is unique. Meaning: if 123 already holds an entry foo,bar,foobar,barfoo, another entry with foo,bar in the key columns is not allowed, whatever its other columns contain. An entry with foo1,bar,foobar,barfoo is allowed, and so is an entry with foo1,bar1,foobar,barfoo.
It won't happen while parsing that a column that is not a key (false) comes in front of a key column (true). This will not happen:
[Foobar:true;Barfoo:false;Foo:true;Bar:false]
It won't happen at runtime that a column marked as a key gets no data. This will not happen: an ArrayList<String> added to 123 whose data looks like this: 1->foo, 2->null, 3->foobar.
I tried: storing at each datastructure-class two arrays. One with the Column Names, and one with Numbers of the Columns, which are keys. At runtime the key-indicating array will be processed to get all key values (at the first add example above it would be foo and bar) and they will be concatenated. (to a String "foo,bar"). This is a new key for a second HashMap<String,ArrayList<String>> where the value (ArrayList<String>) contains the data of all columns (foo,bar,foobar,barfoo).
I have a getKeyString method:
String getKeyString(ArrayList<String> keys, ArrayList<Integer> keyPos) throws Exception {
if (keyPos.get(keyPos.size()-1) >= keys.size()) //if the last (largest) entry of the ordered list keyPos is not a valid index into keys
throw new Exception();
String collect = keyPos.stream().map(i -> keys.get(i))
.map(string ->{
try{
if(string.equals("null")) // happens not very often, ~1time in 1,000
return "";
}
catch(NullPointerException e) { //happens even less 1 in 100,000
return "";
}
return string;
})
.collect(Collectors.joining(","));
if(collect.length()<keyPos.size())
throw new Exception("results in an empty key: ");
return collect;
}
and the addDataListEntry method looks quite similar to this:
HashMap<String, ArrayList<String>> dataLists = new HashMap<>();
ArrayList<Integer> keyPos = new ArrayList<>();
...
public void addDataListEntry(ArrayList<String> values) {
// will overwrite Entry if it already exists
try {
this.dataLists.put(getKeyString(values, keyPos), values);
} catch (Exception e) {
logger.info(e.getMessage());
}
}
This does actually work, but it is really slow, since the key (foo,bar) needs to be created on every access to the datastructure.
Which combination of HashMaps, Lists, Sets, (even Google Guava) is the best to make it as fast as possible?
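For comparison, here is a minimal sketch of one alternative: use the list of key values itself as the map key instead of a concatenated String. It relies on the note above that key columns always come first, so the key is simply a prefix of the row. The class name KeyedStore and the keyCount parameter are assumptions made for this example, the "null"-to-empty-string handling of getKeyString is not reproduced, and whether this is actually faster would have to be measured.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical container for the rows of one datastructure (e.g. 123).
final class KeyedStore {
    private final int keyCount; // the first keyCount columns form the key
    private final Map<List<String>, List<String>> rows = new HashMap<>();

    KeyedStore(int keyCount) {
        this.keyCount = keyCount;
    }

    // Copies the key prefix so the map key cannot change with the caller's list.
    private List<String> keyOf(List<String> values) {
        if (values.size() < keyCount) {
            throw new IllegalArgumentException("row has fewer columns than keys");
        }
        return new ArrayList<>(values.subList(0, keyCount));
    }

    /** Adds a row; an existing row with the same key is overwritten. */
    void add(List<String> values) {
        rows.put(keyOf(values), new ArrayList<>(values));
    }

    boolean remove(List<String> values) {
        return rows.remove(keyOf(values)) != null;
    }

    boolean contains(List<String> values) {
        return rows.containsKey(keyOf(values));
    }

    /** Returns the stored row with the same key columns, or null. */
    List<String> get(List<String> values) {
        return rows.get(keyOf(values));
    }
}
```

List.equals and List.hashCode compare element by element, so two rows with the same key columns map to the same entry.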
I am trying to write a Java program that loads the data (from a tab delimited DAT file) and determines the average amount in Euros (EUR), grouped by Country and Credit Rating.
I have 2 questions:
What is the best way to load the data into a data structure after splitting each line into an array?
How do I approach providing group-by functionality in Java?
Update: I have given it a first try, and this is how the implementation looks. It feels like there is room for improvement.
/**
* @param rows - Each row as a bean
* This method will group objects together based on Country/City and Credit Rating
*/
static void groupObjectsTogether(List<CompanyData> rows) {
Map<String, List<CompanyData>> map = new HashMap<String, List<CompanyData>>();
for(CompanyData companyData : rows){
String key;
if(companyData.getCountry() == null || companyData.getCountry().trim().isEmpty()){ // null check first, otherwise trim() throws a NullPointerException
key = companyData.getCity()+":"+companyData.getCreditRating(); //use city+creditRating as key
}else{
key = companyData.getCountry()+":"+companyData.getCreditRating(); //use country+creditRating as key
}
if(map.get(key) == null){
map.put(key, new ArrayList<CompanyData>());
}
map.get(key).add(companyData);
}
processGroupedRowsAndPrint(map);
}
It all depends on the amount of data and on the performance (CPU vs. memory) of the machine. If the amount of data is not significant (fewer than millions of records or columns) and the number of columns is fixed, then you may simply put all the data in arrays using
String[] row = rowStr.split(";");
which splits each row using ; as the delimiter (for a tab-delimited file, use "\t" instead). Then you can achieve the grouping functionality using a HashMap, i.e.:
ArrayList<String[]> rowAr = new ArrayList<String[]>();
HashMap<String,ArrayList<Integer>> map = new HashMap<String,ArrayList<Integer>>();
int index = 0;
for (String rowStr: rows) {
String[] row = rowStr.split(";");
rowAr.add(row);
String companyCode = row[0];
//create the index list for this key on first use,
//then append the index of the current row
map.computeIfAbsent(companyCode, k -> new ArrayList<Integer>()).add(index);
index++;
}
Sorry for any syntax or other simple errors above (I do not have any tools in hand to verify if there is not any stupid mistake).
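For the average itself, a stream-based sketch could look like the following. It assumes Java 8 or later and that every amount has already been converted to EUR. The Row class and its field names are made up for this example and stand in for the CompanyData bean from the question.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class GroupAverage {

    // Hypothetical stand-in for the CompanyData bean.
    static final class Row {
        final String country;
        final String rating;
        final double amountEur;

        Row(String country, String rating, double amountEur) {
            this.country = country;
            this.rating = rating;
            this.amountEur = amountEur;
        }
    }

    /** Average amount per "country:rating" key, sorted by key. */
    static Map<String, Double> averageByCountryAndRating(List<Row> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                r -> r.country + ":" + r.rating,              // grouping key
                TreeMap::new,                                 // sorted result map
                Collectors.averagingDouble(r -> r.amountEur)));
    }
}
```

groupingBy builds the groups and averagingDouble reduces each group in the same pass, so no intermediate list per key is needed.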
I created a HashMap to store a text file with the columns of information. I compared the key to a specific name and stored the values of the HashMap into an ArrayList. When I try to println my ArrayList, it only outputs the last value and leaves out all the other values that match that key.
This isn't my entire code just my two loops that read in the text file, stores into the HashMap and then into the ArrayList. I know it has something to do with my loops.
Did some editing and got it to output, but all my values are displayed multiple times.
My output looks like this.
North America:
[ Anguilla, Anguilla, Antigua and Barbuda, Antigua and Barbuda, Antigua and Barbuda, Aruba, Aruba, Aruba,
HashMap<String, String> both = new HashMap<String, String>();
ArrayList<String> sort = new ArrayList<String>();
//ArrayList<String> sort2 = new ArrayList<String>();
// We need a try catch block so we can handle any potential IO errors
try {
try {
inputStream = new BufferedReader(new FileReader(filePath));
String lineContent = null;
// Loop will iterate over each line within the file.
// It will stop when no new lines are found.
while ((lineContent = inputStream.readLine()) != null) {
String column[]= lineContent.split(",");
both.put(column[0], column[1]);
Set set = both.entrySet();
//Get an iterator
Iterator i = set.iterator();
// Display elements
while(i.hasNext()) {
Map.Entry me = (Map.Entry)i.next();
if(me.getKey().equals("North America"))
{
String value= (String) me.getValue();
sort.add(value);
}
}
}
System.out.println("North America:");
System.out.println(sort);
System.out.println("\n");
}
Map keys need to be unique, so your code is working according to spec: each put for an existing key replaces the previous value.
If you need many values for one key, you may use
Map<K, List<T>>
where T is String in your case (and it does not have to be a List, you can use any collection).
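A minimal sketch of that idea, assuming each input line looks like "key,value" as in the question's split(","). The class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class Grouping {

    /** Collects every second-column value under its first-column key. */
    static Map<String, List<String>> groupByFirstColumn(List<String> lines) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String line : lines) {
            String[] column = line.split(",");
            if (column.length < 2) {
                continue; // skip lines without a value column
            }
            // Create the list on first use of the key, then append to it.
            grouped.computeIfAbsent(column[0].trim(), k -> new ArrayList<>())
                   .add(column[1].trim());
        }
        return grouped;
    }
}
```

After the file has been read once, grouped.get("North America") holds every matching value exactly once, so no inner loop over the entry set is required.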
Some things seem wrong with your code:
You are iterating over the Map's entry set to get just one value. You could simply use the following code:
if (both.containsKey("North America"))
sort.add(both.get("North America"));
It seems that "North America" can appear more than once in your input file, but you are storing it in a Map, so each time you put a new value for "North America", it overwrites the current value.
I don't know what the type of sort is, but what System.out.print(sort); prints depends on the toString() implementation of that type. Using print() instead of println() may also cause problems depending on how you run your program (some shells may not print the last value, for instance).
If you want more help, you may want to provide us with the following things :
sample of the input file
declaration of sort
sample of output
what you want to obtain.
My first CSV file looks like this with header included (header is included only at the top not after every entry):
NAME,SURNAME,AGE
Fred,Krueger,Unknown
.... n records
My second file might look like this:
NAME,MIDDLENAME,SURNAME,AGE
Jason,Noname,Scarry,16
.... n records with this header template
The merged file should look like this:
NAME,SURNAME,AGE,MIDDLENAME
Fred,Krueger,Unknown,
Jason,Scarry,16,Noname
....
Basically if headers don't match, all new header titles (columns) should be added after original header and their values according to that order.
Update
Above CSV were made smaller so I can illustrate what I want to achieve, in reality CSV files are generated one step before this (merge) and can be up to 100 columns
How can I do this?
I'd create a model for the 'bigger' format (a simple class with four fields, plus a collection for instances of this class) and implement two parsers, one for the first format and one for the second. Create records for all rows of both CSV files and implement a writer to output the CSV in the correct format. In brief:
public void convert(File output, File...input) {
List<Record> records = new ArrayList<Record>();
for (File file:input) {
if (isThreeColumnFormat(file)) { // hypothetical helper that inspects the header line of this file
records.addAll(ThreeColumnFormatParser.parse(file));
} else {
records.addAll(FourColumnFormatParser.parse(file));
}
}
CsvWriter.write(output, records);
}
From your comment I see that you have a lot of different CSV formats with some common columns.
You could define the model for any row in the various csv files like this:
public class Record {
Object id; // some sort of unique identifier
Map<String, String> values = new HashMap<String, String>(); // all key/values of a single row
public Record(Object id) {this.id=id;}
public void put(String key, String value){
values.put(key, value);
}
public String get(String key) { // returns null if this row has no such column
return values.get(key);
}
}
For parsing any file you would first read the header and add the column headers to a global keystore (will be needed later on for outputting), then create records for all rows, like:
//...
List<Record> records = new ArrayList<Record>();
for (File file : getAllFiles()) {
List<String> keys = getColumnsHeaders(file); // hypothetical helper: reads the header line
KeyStore.addAll(keys); // the store is a Set
int row = 0;
for (String line : getDataLines(file)) { // hypothetical helper: all lines after the header
String[] values = line.split(DELIMITER);
Record record = new Record(file.getName() + row++); // as an example for id
for (int i = 0; i < values.length; i++) {
record.put(keys.get(i), values[i]);
}
records.add(record);
}
}
// ...
Now the keystore has all used column header names and we can iterate over the collection of all records, get all values for all keys (and get null if the file for this record didn't use the key), assemble the csv lines and write everything to a new file.
Read in the header of the first file and create a list of the column names. Now read the header of the second file and add any column names that don't exist already in the list to the end of the list. Now you have your columns in the order that you want and you can write this to the new file first.
Next I would parse each file and for each row I would create a Map of column name to value. Once the row is parsed you could then iterate over the new list of column names and pull the values from the map and write them immediately to the new file. If the value is null don't print anything (just a comma, if required).
There might be more efficient solutions available, but I think this meets the requirements you set out.
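A rough sketch of those two passes, assuming the files are small enough to be held as lists of lines and contain no quoted fields (a plain split(",") is used). The class name CsvMerger and the in-memory representation are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CsvMerger {

    /** Each inner list is one file: header line first, then data lines. */
    static List<String> merge(List<List<String>> files) {
        // Pass 1: collect all column names in first-seen order.
        Set<String> columns = new LinkedHashSet<>();
        for (List<String> file : files) {
            if (!file.isEmpty()) {
                for (String name : file.get(0).split(",")) {
                    columns.add(name);
                }
            }
        }
        List<String> out = new ArrayList<>();
        out.add(String.join(",", columns));

        // Pass 2: map each row to name -> value, then emit it in merged order.
        for (List<String> file : files) {
            if (file.isEmpty()) {
                continue;
            }
            String[] header = file.get(0).split(",");
            for (String line : file.subList(1, file.size())) {
                String[] cells = line.split(",", -1); // keep trailing empty cells
                Map<String, String> row = new HashMap<>();
                for (int i = 0; i < header.length && i < cells.length; i++) {
                    row.put(header[i], cells[i]);
                }
                List<String> merged = new ArrayList<>();
                for (String column : columns) {
                    merged.add(row.getOrDefault(column, "")); // missing column -> empty cell
                }
                out.add(String.join(",", merged));
            }
        }
        return out;
    }
}
```

For files that do not fit in memory, the second pass could write each merged line straight to the output instead of collecting it in a list.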
Try this:
http://ondra.zizka.cz/stranky/programovani/ruzne/querying-transforming-csv-using-sql.texy
crunch input.csv output.csv "SELECT AVG(duration) AS durAvg FROM (SELECT * FROM indata ORDER BY duration LIMIT 2 OFFSET 6)"
I need to be able to sort multiple intermediate result sets and write them to a file in sorted order. The sort is based on a single column/key value. Each result set record is a list of values (like a record in a table).
The intermediate result sets are obtained by querying entirely different databases.
The intermediate result sets are already sorted on some key (or column). They need to be combined and sorted again on the same key (or column) before being written to a file.
Since these result sets can be massive (on the order of MBs), this cannot be done in memory.
My solution, broadly:
Use a hash map and a random access file. Since the result sets are already sorted, I will store the sorted column values as keys in a hash map while retrieving the result sets. The value in the hash map will be an address in the random access file where every record associated with that column value is stored.
Any ideas ?
Keep a pointer into every set, initially pointing to its first entry.
Then pick the set whose current entry is the lowest.
Write this entry to the file and advance the corresponding pointer.
Repeat until every set is exhausted.
This approach has basically no memory overhead, and for a fixed number of sets the time is O(n). (It's merge sort, by the way.)
Edit
To clarify: It's the merge part of merge sort.
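For more than two sets, the "pick the lowest current entry" step can be done with a priority queue. The sketch below uses Iterator<Integer> as a stand-in for a sorted result set and an in-memory list as a stand-in for the output file. Both are assumptions for illustration, as are the class and method names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

final class SortedMerge {

    // Current entry of one source plus the rest of that source.
    private static final class Head {
        final int value;
        final Iterator<Integer> rest;

        Head(int value, Iterator<Integer> rest) {
            this.value = value;
            this.rest = rest;
        }
    }

    /** Merges already-sorted sources into one sorted list. */
    static List<Integer> merge(List<Iterator<Integer>> sources) {
        PriorityQueue<Head> heads =
                new PriorityQueue<>(Comparator.comparingInt((Head h) -> h.value));
        for (Iterator<Integer> it : sources) {
            if (it.hasNext()) {
                heads.add(new Head(it.next(), it)); // pointer to the first entry
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!heads.isEmpty()) {
            Head smallest = heads.poll();           // lowest current entry
            out.add(smallest.value);                // stands in for "write to file"
            if (smallest.rest.hasNext()) {          // advance that source's pointer
                heads.add(new Head(smallest.rest.next(), smallest.rest));
            }
        }
        return out;
    }
}
```

Only one entry per source is held in memory at any time, which is what makes this suitable for result sets that do not fit in memory.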
If you've got 2 pre-sorted result sets, you should be able to iterate them concurrently while writing the output file. You just need to compare the current row in each set:
Simple example (not ready for copy-and-paste use!):
ResultSet a,b;
//fetch a and b
a.first();
b.first();
while (!a.isAfterLast() || !b.isAfterLast()) {
if (a.isAfterLast()) {
writeToFile(b);
b.next();
}
else if (b.isAfterLast()) {
writeToFile(a);
a.next();
} else {
int valueA = a.getInt("SORT_PROPERTY");
int valueB = b.getInt("SORT_PROPERTY");
if (valueA < valueB) {
writeToFile(a);
a.next();
} else {
writeToFile(b);
b.next();
}
}
}
Sounds like you are looking for an implementation of the Balance Line algorithm.