I have a CSV in this format:
"Account Name","Full Name","Customer System Name","Sales Rep"
"0x7a69","Mike Smith","0x7a69","Tim Greaves"
"0x7a69","John Taylor","0x7a69","Brian Anthony"
"Apple","Steve Jobs","apple","Anthony Michael"
"Apple","Steve Jobs","apple","Brian Anthony"
"Apple","Tim Cook","apple","Tim Greaves"
...
I would like to parse this CSV (using Java) so that it becomes:
"Account Name","Full Name","Customer System Name","Sales Rep"
"0x7a69","Mike Smith, John Taylor","0x7a69","Tim Greaves, Brian Anthony"
"Apple","Steve Jobs, Tim Cook","apple","Anthony Michael, Brian Anthony, Tim Greaves"
Essentially I just want to condense the CSV so that there is one entry per account/company name.
Here is what I have so far:
String csvFile = "something.csv";
String line = "";
String cvsSplitBy = ",";
List<String> accountList = new ArrayList<String>();
List<String> nameList = new ArrayList<String>();
List<String> systemNameList = new ArrayList<String>();
List<String> salesList = new ArrayList<String>();
try (BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
    while ((line = br.readLine()) != null) {
        // use comma as separator
        String[] csv = line.split(cvsSplitBy);
        accountList.add(csv[0]);
        nameList.add(csv[1]);
        systemNameList.add(csv[2]);
        salesList.add(csv[3]);
    }
} catch (IOException e) {
    e.printStackTrace();
}
So I was thinking of adding them all to their own lists, then looping through all of the lists and comparing the values, but I can't wrap my head around how that would work. Any tips or words of advice are much appreciated. Thanks!
By analyzing your requirements you can get a better idea of the data structures to use. Since you need to map keys (account/company) to values (name/rep) I would start with a HashMap. Since you want to condense the values to remove duplicates you'll probably want to use a Set.
I would have a Map<Key, Data> with
public class Key {
    private String account;
    private String companyName;
    // Getters/setters/equals/hashCode
}
public class Data {
    private Key key;
    private Set<String> names = new HashSet<>();
    private Set<String> reps = new HashSet<>();

    public void addName(String name) {
        names.add(name);
    }

    public void addRep(String rep) {
        reps.add(rep);
    }
    // Additional getters/setters/equals/hashCode
}
Once you have your data structures in place, you can do the following to populate the data from your CSV and output it to its own CSV (in pseudocode):
Loop each line in CSV
    Build Key from account/company
    Try to get Data from Map
    If Data not found
        Create new Data with Key and put key -> data mapping in Map
    Add name and rep to Data
Loop values in Map
    Output to CSV
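A minimal sketch of that pseudocode in Java. For brevity it keys on the account name alone and uses a trimmed-down Data holder rather than the full Key/Data classes above; computeIfAbsent covers the "if not found, create" step. Class and method names here are just illustrative:

```java
import java.util.*;

public class CondenseCsv {
    // Trimmed-down stand-in for the Data class sketched above
    static class Data {
        final String account, systemName;
        final Set<String> names = new LinkedHashSet<>(); // Set removes duplicates, Linked keeps order
        final Set<String> reps = new LinkedHashSet<>();
        Data(String account, String systemName) {
            this.account = account;
            this.systemName = systemName;
        }
    }

    // Condense rows (already split into columns) into one Data entry per account
    static Map<String, Data> condense(List<String[]> rows) {
        Map<String, Data> byAccount = new LinkedHashMap<>();
        for (String[] row : rows) {
            // "Try to get data from Map; if not found, create and put" in one call
            Data d = byAccount.computeIfAbsent(row[0], k -> new Data(row[0], row[2]));
            d.names.add(row[1]);
            d.reps.add(row[3]);
        }
        return byAccount;
    }

    // One condensed CSV line per account
    static String toCsvLine(Data d) {
        return "\"" + d.account + "\",\"" + String.join(", ", d.names) + "\",\""
                + d.systemName + "\",\"" + String.join(", ", d.reps) + "\"";
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[]{"0x7a69", "Mike Smith", "0x7a69", "Tim Greaves"},
                new String[]{"0x7a69", "John Taylor", "0x7a69", "Brian Anthony"});
        for (Data d : condense(rows).values())
            System.out.println(toCsvLine(d));
        // prints: "0x7a69","Mike Smith, John Taylor","0x7a69","Tim Greaves, Brian Anthony"
    }
}
```

Proper CSV parsing (quoted fields, embedded commas) is left out here; a library like OpenCSV handles that part.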
Well, I would probably create a class, let's say "Account", with the attributes "accountName", "fullName", "customerSystemName", and "salesRep". Then I would define an empty ArrayList of type Account and loop over the lines read. For every line I would create a new object of this class, set the corresponding attributes, and add the object to the list. But before creating the object I would iterate over the already existing objects in the list to see whether there is one that already has this company name. If so, instead of creating a new object, I would just update the fullName and salesRep attributes of the existing one by appending the new values, separated by commas.
I hope this helps :)
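A rough sketch of that approach (class and method names are illustrative only; note that the inner loop scans the whole list for every line, so this gets slow for large files):

```java
import java.util.*;

public class AccountMerge {
    // Simple holder matching the attributes described above
    static class Account {
        String accountName, fullName, customerSystemName, salesRep;
        Account(String a, String f, String c, String s) {
            accountName = a; fullName = f; customerSystemName = c; salesRep = s;
        }
    }

    static List<Account> merge(List<String[]> rows) {
        List<Account> accounts = new ArrayList<>();
        for (String[] r : rows) {
            // Linear scan for an existing entry with this account name
            Account existing = null;
            for (Account a : accounts) {
                if (a.accountName.equals(r[0])) { existing = a; break; }
            }
            if (existing == null) {
                accounts.add(new Account(r[0], r[1], r[2], r[3]));
            } else {
                // Append new names/reps, skipping values already present
                if (!existing.fullName.contains(r[1])) existing.fullName += ", " + r[1];
                if (!existing.salesRep.contains(r[3])) existing.salesRep += ", " + r[3];
            }
        }
        return accounts;
    }

    public static void main(String[] args) {
        List<Account> out = merge(Arrays.asList(
                new String[]{"Apple", "Steve Jobs", "apple", "Anthony Michael"},
                new String[]{"Apple", "Tim Cook", "apple", "Tim Greaves"}));
        System.out.println(out.get(0).fullName); // Steve Jobs, Tim Cook
    }
}
```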
I am asked to create a word vector space from a csv file, so I need to extract the words and their vectors (the size is 57) into a dictionary so that I can reuse it for my future operations.
My csv format is giving me a lot of problems because it's basically text with a key and doubles, all separated by spaces, and I haven't been able to separate the string and double parts correctly until now.
So do you have any idea how to parse this file into a dictionary which contains (key, vector) entries?
Thanks a lot.
Here is a sample of the csv file:
key1 4.0966564 7.963437 -2.1844673 1.9319566 -0.04495791 2.454401 3.1006012 -0.3813638 1.567303 -2.2067556 3.44506744 -4.382278 4.1457844 2.342756 -2.7707205 3.5015 2.5717492 -2.6846366...
key2 -3.968007 0.86151505 0.06163538 1.918614 0.34340435 -1.5178788 1.3857365 0.230331 0.7025755 -2.6575062 -0.7426953 3.1636698 2.8441591 0.4522623 3.3907628 2.425691 -1.2052362....
...
This data structure is called a multi-map: a key can have multiple values.
You can find implementations in libraries (Guava's Multimap, for example).
If you'd rather not have the dependency, and wish to write your own, it might look like this:
public class MultiMap {
    private Map<String, List<Double>> multi = new HashMap<>();

    public void put(String key, Double newValue) {
        if (newValue != null) {
            List<Double> values = (this.multi.containsKey(key) ? this.multi.get(key) : new ArrayList<>());
            values.add(newValue);
            this.multi.put(key, values);
        }
    }
}
It's possible to use generics, but I'm too lazy to bother right now. This example is correct for your narrow use case.
Split each line into tokens at the regex "\\s+". The first token is the key; iterate over all the others to add them to the multi-map.
You can do something like this:
String line = "key1 4.0966564 7.963437";
String[] parts = line.split(" ");
String key = parts[0];
ArrayList<Double> values = new ArrayList<Double>();
for (int i = 1; i < parts.length; i++) {
    String doubleAsString = parts[i];
    values.add(Double.valueOf(doubleAsString));
}
And then add these elements to your map.
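Wrapped up as a whole-file parse into the dictionary you asked for (the class and method names are just illustrative):

```java
import java.util.*;

public class WordVectors {
    // Build the dictionary: each key mapped to its full vector of doubles
    static Map<String, List<Double>> parse(List<String> lines) {
        Map<String, List<Double>> vectors = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+"); // "\\s+" also survives double spaces
            List<Double> values = new ArrayList<>();
            for (int i = 1; i < parts.length; i++)
                values.add(Double.valueOf(parts[i]));
            vectors.put(parts[0], values);
        }
        return vectors;
    }

    public static void main(String[] args) {
        Map<String, List<Double>> v = parse(Arrays.asList("key1 4.0966564 7.963437"));
        System.out.println(v.get("key1")); // [4.0966564, 7.963437]
    }
}
```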
What I am trying to do is build a collection of UserObjects from an ArrayList<String> that I've read from a BufferedReader
UserObject simply consists of these fields:
int UserId
ArrayList<Integer> AssociatesId
My current code is using a BufferedReader to read in file.edgelist and building an ArrayList<String> which has entries of this format: "1 1200"
I am splitting each string into a String[] on its whitespace, building a new UserObject with UserId = 1, and initializing a new ArrayList<Integer> that collects the second element of every line that has the same UserId.
My problem is that file.edgelist has around 20,000,000 entries and while the BufferedReader takes under 10 seconds to read the file, it takes forever to build the collection of UserObjects. In fact, I haven't even gotten to the end of the file because it takes so long. I can confirm that I am successfully building these entries as I've run the code in debug and dropped an occasional breakpoint to find that the UserId is increasing and the UserObject's AssociatesId collections contain data.
Is there a quicker and/or better way to build this collection?
This is currently my code:
private ArrayList<UserObject> tempUsers;

public Utilities() {
    tempUsers = new ArrayList<UserObject>();
}

// Reads the file through a BufferedReader and returns an ArrayList of strings formatted like "1 1200"
public ArrayList<String> ReadFile() {
    BufferedReader reader = null;
    ArrayList<String> userStr = new ArrayList<String>();
    try {
        File file = new File("file.edgelist");
        reader = new BufferedReader(new FileReader(file));
        String line;
        while ((line = reader.readLine()) != null) {
            userStr.add(line);
        }
        return userStr;
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            if (reader != null) {
                reader.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    return null;
}
// Where the problem actually lies
public ArrayList<UserObject> BuildUsers(ArrayList<String> userStrings) {
    for (String s : userStrings) {
        String[] ids = s.split("\\s+");
        UserObject exist = getUser(Integer.parseInt(ids[0]));
        if (exist == null) { // builds a new UserObject if it doesn't exist in tempUsers
            UserObject newUser = new UserObject(Integer.parseInt(ids[0]));
            newUser.associate(Integer.parseInt(ids[1]));
            tempUsers.add(newUser);
        } else { // otherwise adds the "associate" id to the UserObject's AssociatesId collection
            exist.associate(Integer.parseInt(ids[1]));
        }
    }
    return tempUsers;
}

// Helper method that uses a Stream to find and return an existing UserObject
private UserObject getUser(int id) {
    if (tempUsers.isEmpty()) return null;
    try {
        return tempUsers.stream().filter(t -> t.equals(new UserObject(id))).findFirst().get();
    } catch (NoSuchElementException ex) {
        return null;
    }
}
Every time you call getUser, you iterate through the whole list to check whether the given user exists. This is very inefficient as the list grows (linear complexity per lookup in the worst case). You might want to replace it with a HashMap (where lookup has constant complexity).

private Map<Integer, UserObject> tempUsers = new HashMap<>();

// Helper method that returns the existing UserObject, or null if there is none
private UserObject getUser(int id) {
    return tempUsers.get(id);
}

Moreover, creating the intermediate ArrayList<String> userStr with 20,000,000 entries is completely unnecessary and wastes lots of memory. You should create UserObject instances as you read lines from the reader.
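Combining both suggestions, a sketch of reading and building the map in one pass might look like this (the UserObject below is a minimal stand-in for your class):

```java
import java.io.*;
import java.util.*;

public class UserLoader {
    // Minimal stand-in for the asker's UserObject
    static class UserObject {
        final int userId;
        final List<Integer> associatesId = new ArrayList<>();
        UserObject(int id) { this.userId = id; }
        void associate(int id) { associatesId.add(id); }
    }

    // Build users while reading: one pass over the file, constant-time lookup per line
    static Map<Integer, UserObject> load(BufferedReader reader) throws IOException {
        Map<Integer, UserObject> users = new HashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            String[] ids = line.split("\\s+");
            int id = Integer.parseInt(ids[0]);
            // computeIfAbsent replaces the linear getUser() scan
            users.computeIfAbsent(id, UserObject::new)
                 .associate(Integer.parseInt(ids[1]));
        }
        return users;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader r = new BufferedReader(new StringReader("1 1200\n1 1300\n2 1400"));
        Map<Integer, UserObject> users = load(r);
        System.out.println(users.get(1).associatesId); // [1200, 1300]
    }
}
```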
Wow, you are just wasting memory and performance there.
First, don't load the entire file into memory as a List<String>. That is just a total waste of memory. Load the file directly into UserObject objects.
Next, don't store them as List<UserObject> and perform a sequential search for object by id. That's just .... sllloooooooooowwwww....
You should store them in a Map<Integer, UserObject> for fast access by id.
Actually, you don't even need UserObject. From what you've said, you just need a Map<Integer, List<Integer>>, which is also called a MultiMap. It's simple enough to do yourself, or you can find third-party libraries with MultiMap implementations.
Also, don't use split() if you know each line will contain exactly one space. Use indexOf() and substring().
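A sketch combining those two suggestions, a Map<Integer, List<Integer>> multi-map filled via indexOf()/substring() (names here are illustrative):

```java
import java.util.*;

public class EdgeList {
    // One user id mapped to many associate ids: a multi-map
    static Map<Integer, List<Integer>> users = new HashMap<>();

    // Parse a line like "1 1200" with indexOf()/substring() instead of split()
    static void addEdge(String line) {
        int space = line.indexOf(' ');
        int userId = Integer.parseInt(line.substring(0, space));
        int associateId = Integer.parseInt(line.substring(space + 1).trim());
        users.computeIfAbsent(userId, k -> new ArrayList<>()).add(associateId);
    }

    public static void main(String[] args) {
        addEdge("1 1200");
        addEdge("1 1300");
        System.out.println(users.get(1)); // [1200, 1300]
    }
}
```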
Your code fits the definition of a "pipeline", and thus could benefit enormously from a more judicious usage of the Streams API. For example, you don't need to read the whole file into memory; just use Files.lines to get a Stream<String> with every line in the file. Furthermore, you could do your parsing like:
// Where the problem actually lies
public ArrayList<UserObject> BuildUsers(Stream<String> userStrings) {
    java.util.Map<Integer, java.util.Optional<UserObject>> users = userStrings // Stream<String>
        .map(str -> str.split("\\s+"))               // Stream<String[]>
        .map(ids -> {
            UserObject newUser = new UserObject(Integer.parseInt(ids[0]));
            newUser.associate(Integer.parseInt(ids[1]));
            return newUser;
        })                                           // Stream<UserObject>, all new (maybe with duplicated ids)
        .collect(Collectors.groupingBy(
            uObj -> uObj.getId(),                    // whatever returns the "ids[0]" value
            java.util.HashMap::new,
            Collectors.reducing((uo1, uo2) -> {
                // This lambda "merges" uo2 into uo1
                uo2.getAssociates().forEach(uo1::associate);
                return uo1;
            })));
    return users.values().stream()
        .map(java.util.Optional::get)                // reducing without an identity wraps each result in Optional
        .collect(Collectors.toCollection(ArrayList::new));
}
Where I've made up the "getId" and "getAssociates" functions in UserObject to return the values that came originally from the elements of the ids array. This function first splits each line into a String array, then parses each of those 2-element arrays into new UserObject instances. The final collectors perform two functions:
Grouping by the Id property, so you would get a Map<Integer,List<UserObject>> with all the UserObjects with the same primary id.
Reducing (squashing) the several UserObject instances with the same primary id into a single instance (per Collectors.reducing), so that in the end you actually get one UserObject per id. The function passed to reducing takes two UserObject instances and returns one that contains the associate IDs of both of its "parents".
Finally, since apparently you want an ArrayList with the values, the code just takes them from the map and dumps them into the desired container type.
I have a map:
private HashMap<String, CompactDisc> database;
Every CompactDisc object has an artist, and I want to have a user enter a string and search through the hash-map and print out ALL values containing the string.
So if I searched "Jackson", I would get both The Jackson 5 and Michael Jackson (assuming they are in the map).
Iterate through the values of the HashMap and check if the CompactDisc's artist name contains the specified string.
for (CompactDisc cd : database.values()) {
    if (cd.getArtist().contains(searchString)) {
        System.out.println(cd.getArtist());
    }
}
assuming CompactDisc has a getArtist() method that returns a String, and searchString is the string specified by the user.
Commons Collections provides a nice way to filter a Collection of objects:
// Set up data
String searchString = "Jackson";
HashMap<String, CompactDisc> database = new HashMap<String, CompactDisc>();
database.put("key1", new CompactDisc("Jackson 5"));
database.put("key2", new CompactDisc("Michael Jackson"));
database.put("key3", new CompactDisc("Random artist"));

// Copy the values so in-place filtering doesn't remove entries from the map itself
List<CompactDisc> values = new ArrayList<CompactDisc>(database.values());
CollectionUtils.filter(values, new Predicate<CompactDisc>() {
    public boolean evaluate(CompactDisc obj) {
        return obj.getArtist().contains(searchString);
    }
});
Given searchString = "Jackson", after invoking .filter the values list will contain only CompactDiscs with artists containing "Jackson".
I am trying to get some values from a config file. I have a lot of keys and want to get only certain values. These values have keys starting with the same initial name, with a slight variation towards the end.
Can someone help me quickly?
Assuming that when you say key you mean value (as in the values in an array):
final String PREFIX = "yourPrefix";
for (String value : valueList) {
    if (value.startsWith(PREFIX)) {
        // do whatever...
    }
}
Here is the link to the Java doc:
http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#startsWith(java.lang.String)
I am assuming you are scanning the config file for Strings that have similar prefixes. Why not scan them into groups instead of scanning them all into one HashMap? If you already know the prefixes, try creating an ArrayList for each one and, while scanning, add each token to the list matching its prefix.
StringTokenizer s = new StringTokenizer("Configuration File : Server_intenties = keyId_11503, keyId_11903 : Server_passcodes = keyCode_1678, keyCode_9893", " ");
ArrayList<String> keyCode = new ArrayList<String>();
ArrayList<String> keyId = new ArrayList<String>();
while (s.hasMoreTokens()) {
    String key = s.nextToken();
    if (key.contains("keyId")) {
        keyId.add(key);
    }
    if (key.contains("keyCode")) {
        keyCode.add(key);
    }
}
System.out.println(keyCode);
System.out.println(keyId);
I created a HashMap to store a text file with the columns of information. I compared the key to a specific name and stored the values of the HashMap into an ArrayList. When I try to println my ArrayList, it only outputs the last value and leaves out all the other values that match that key.
This isn't my entire code just my two loops that read in the text file, stores into the HashMap and then into the ArrayList. I know it has something to do with my loops.
Did some editing and got it to output, but all my values are displayed multiple times.
My output looks like this.
North America:
[ Anguilla, Anguilla, Antigua and Barbuda, Antigua and Barbuda, Antigua and Barbuda, Aruba, Aruba, Aruba,
HashMap<String, String> both = new HashMap<String, String>();
ArrayList<String> sort = new ArrayList<String>();
//ArrayList<String> sort2 = new ArrayList<String>();
// We need a try catch block so we can handle any potential IO errors
try {
    try {
        inputStream = new BufferedReader(new FileReader(filePath));
        String lineContent = null;
        // Loop will iterate over each line within the file.
        // It will stop when no new lines are found.
        while ((lineContent = inputStream.readLine()) != null) {
            String column[] = lineContent.split(",");
            both.put(column[0], column[1]);
            Set set = both.entrySet();
            // Get an iterator
            Iterator i = set.iterator();
            // Display elements
            while (i.hasNext()) {
                Map.Entry me = (Map.Entry) i.next();
                if (me.getKey().equals("North America")) {
                    String value = (String) me.getValue();
                    sort.add(value);
                }
            }
        }
        System.out.println("North America:");
        System.out.println(sort);
        System.out.println("\n");
    }
Map keys need to be unique. Your code is working according to spec.
If you need to have many values for a key, you may use
Map<Key, List<T>>
where T is String in your case (and instead of a List you can use any collection).
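A minimal sketch of that idea, using computeIfAbsent to create the list the first time a key appears (class and method names are illustrative):

```java
import java.util.*;

public class ContinentMap {
    // One key, many values: Map<String, List<String>> instead of Map<String, String>
    static Map<String, List<String>> both = new HashMap<>();

    static void put(String continent, String country) {
        // computeIfAbsent creates the list on first use, then every value is appended
        both.computeIfAbsent(continent, k -> new ArrayList<>()).add(country);
    }

    public static void main(String[] args) {
        put("North America", "Anguilla");
        put("North America", "Aruba");
        System.out.println(both.get("North America")); // [Anguilla, Aruba]
    }
}
```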
Some things seem wrong with your code:
You are iterating over the Map's entrySet to get just one value. You could just use the following code:
if (both.containsKey("North America"))
    sort.add(both.get("North America"));
It seems that "North America" can appear more than once in your input file, but you are storing it in a Map, so each time you store a new value for "North America" in your Map, it overwrites the current value.
I don't know what the type of sort is, but what is printed by System.out.println(sort); depends on the toString() implementation of this type, and using print() instead of println() may also create problems depending on how you run your program (some shells may not print the last line, for instance).
If you want more help, you may want to provide us with the following things:
sample of the input file
declaration of sort
sample of output
what you want to obtain.