Manipulating CSV file with a non standardized content - java

I have a CSV file with non-standardized content. It looks something like this:
John, 001
01/01/2015, hamburger
02/01/2015, pizza
03/01/2015, ice cream
Mary, 002
01/01/2015, hamburger
02/01/2015, pizza
John, 003
04/01/2015, chocolate
Now, what I'm trying to do is write some logic in Java to separate them. I would like "John, 001" to be the header, and all the rows under John, up to Mary, to belong to John.
Will this be possible? Or should I just do it manually?
Edit:
For the input, even though it is not standardized, a noticeable pattern is that the rows that do not have names always start with a date.
My output goal is a Java object that I can eventually store in the database in the format below.
Name, hamburger, pizza, ice cream, chocolate
John, 01/01/2015, 02/01/2015, 03/01/2015, NA
Mary, 01/01/2015, 02/01/2015, NA, NA
John, NA, NA, NA, 04/01/2015

You could just read the file into a list
List<String> lines = Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8);
Afterwards, iterate over the list and split each line on the wanted delimiter (",").
Now you could just use if-else or switch blocks to check for specific entries.
List<DataObject> objects = new ArrayList<>();
DataObject dataObject = null;
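for(String s : lines) {
    // Split on the comma and swallow surrounding whitespace ("John, 001" -> "John", "001")
    String [] splitLine = s.trim().split("\\s*,\\s*");
    // Backslashes must be doubled inside a Java string literal
    if(splitLine[0].matches("(\\d{2}/){2}\\d{4}")) {
        // We found a date row
        if(dataObject != null && splitLine.length == 2) {
            String date = splitLine[0];
            String dish = splitLine[1];
            dataObject.add(date, dish);
        } else {
            // Handle error
        }
    } else if(splitLine.length == 2) {
        // We can create a new data object
        if(dataObject != null) {
            objects.add(dataObject);
        }
        String name = splitLine[0];
        String id = splitLine[1];
        dataObject = new DataObject(name, id);
    } else {
        // Handle error
    }
}
// The last data object has no following header row to flush it, so add it here
if(dataObject != null) {
    objects.add(dataObject);
}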
Now you can sort them into your specific categories.
Edit: Changed the loop and added a regex (which may not be optimal) for matching the date strings and using them to decide whether to add them to the last data object.
The DataObject class can contain data structures holding the dates/dishes. When the CSV is parsed you can iterate over the objects List and do whatever you want. I hope this answer helps :)
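The DataObject class is left open above, so here is one possible shape for it, as a minimal sketch. The field names and the dateFor helper are my own assumptions, not something the question prescribes; the only thing the loop above needs is the two-argument constructor and add(date, dish).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical holder for one header row plus the detail rows that follow it.
class DataObject {
    private final String name;
    private final String id;
    // dish -> date, kept in the order the dishes were read
    private final Map<String, String> datesByDish = new LinkedHashMap<>();

    DataObject(String name, String id) {
        this.name = name;
        this.id = id;
    }

    // Records that the given dish was eaten on the given date.
    void add(String date, String dish) {
        datesByDish.put(dish, date);
    }

    String getName() { return name; }

    String getId() { return id; }

    // Returns the date recorded for a dish, or "NA" when the dish never appeared.
    String dateFor(String dish) {
        return datesByDish.getOrDefault(dish, "NA");
    }
}
```

With this, building the "NA"-padded output row for a person is just a loop over all known dishes calling dateFor.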

If I have understood correctly, the specs are:
input is text, one record per line (fields are comma delimited)
2 kinds of records:
headers consisting of a name and a number (the number is ignored)
actual records consisting of a date and a meal
output should contain:
one header containing the constant Name, and the meals in order of occurrence
one record per name, consisting of the name and the dates corresponding to the meals - an absent field will have the constant string NA
we assume that, for a given name, we will never get the same meal in two different input records.
The algorithm in pseudo code:
Data structures:
one list of struct< string name, hash< int meal index, date > > for the names: base
one list of strings for the meals: meals
Code:
name = null
iname = -1
Loop per input lines {
    if first field is date {
        if name == null {
            throw Exception("incorrect structure");
        }
        meal = second field
        index = index of meal in meals
        if not found {
            index = len(meals);
            add meal at end of list meals
        }
        base[iname].hash[index] = date
    }
    else {
        name = first field
        iname += 1
        add a new struct { name, empty hash } at end of list base
    }
}
close input file
open output file
// headers
print "Name"
for meal in meals {
    print ",", meal
}
print newline
for (i=0; i<=iname; i++) {
    print base[i].name
    for (j=0; j<len(meals); j++) {
        // the hash is keyed by meal index, not by meal name
        if j in base[i].hash.keys {
            print ",", base[i].hash[j]
        }
        else {
            print ",NA"
        }
    }
    print newline
}
close output file
Just code it in correct Java and come back here if you have any problem.
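For reference, here is roughly how the pseudo code could look in Java. Treat it as a sketch under a few assumptions of mine: the class and method names (MealPivot, pivot, Person) are made up, dates are recognized with a dd/mm/yyyy regex, and the result is returned as a list of lines instead of being written to a file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MealPivot {
    // One output row: a name plus the date recorded for each meal index.
    static final class Person {
        final String name;
        final Map<Integer, String> dateByMealIndex = new HashMap<>();
        Person(String name) { this.name = name; }
    }

    // Turns header/detail lines into one "Name, date, date, ..." line per header.
    static List<String> pivot(List<String> lines) {
        List<Person> base = new ArrayList<>();
        List<String> meals = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) continue; // ignore blank lines
            String[] fields = line.split(",");
            if (fields.length < 2) {
                throw new IllegalArgumentException("incorrect structure: " + line);
            }
            String first = fields[0].trim();
            String second = fields[1].trim();
            if (first.matches("\\d{2}/\\d{2}/\\d{4}")) {
                if (base.isEmpty()) {
                    throw new IllegalArgumentException("incorrect structure: date before any name");
                }
                int index = meals.indexOf(second);
                if (index < 0) { // first time this meal shows up
                    index = meals.size();
                    meals.add(second);
                }
                base.get(base.size() - 1).dateByMealIndex.put(index, first);
            } else {
                base.add(new Person(first)); // header row, the number is ignored
            }
        }
        List<String> out = new ArrayList<>();
        out.add("Name, " + String.join(", ", meals));
        for (Person p : base) {
            StringBuilder row = new StringBuilder(p.name);
            for (int i = 0; i < meals.size(); i++) {
                row.append(", ").append(p.dateByMealIndex.getOrDefault(i, "NA"));
            }
            out.add(row.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "John, 001", "01/01/2015, hamburger", "02/01/2015, pizza", "03/01/2015, ice cream",
            "Mary, 002", "01/01/2015, hamburger", "02/01/2015, pizza",
            "John, 003", "04/01/2015, chocolate");
        pivot(input).forEach(System.out::println);
    }
}
```

Run on the sample from the question, main prints the four lines of the desired table, with the two John blocks kept as separate rows.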

Use uniVocity-parsers to handle this for you. It comes with a master-detail row processor.
// 1st, Create a RowProcessor to process all "detail" elements (dates/ingredients)
ObjectRowListProcessor detailProcessor = new ObjectRowListProcessor();
// 2nd, Create MasterDetailProcessor to identify whether or not a row is the master row (first value of the row is a name, second is an integer).
MasterDetailListProcessor masterRowProcessor = new MasterDetailListProcessor(RowPlacement.TOP, detailProcessor) {
@Override
protected boolean isMasterRecord(String[] row, ParsingContext context) {
try{
//tries to convert the second value of the row to an Integer.
Integer.parseInt(String.valueOf(row[1]));
return true;
} catch(NumberFormatException ex){
return false;
}
}
};
CsvParserSettings parserSettings = new CsvParserSettings();
// Set the RowProcessor to the masterRowProcessor.
parserSettings.setRowProcessor(masterRowProcessor);
CsvParser parser = new CsvParser(parserSettings);
parser.parse(new FileReader(yourFile));
// Here we get the MasterDetailRecord elements.
List<MasterDetailRecord> rows = masterRowProcessor.getRecords();
// Each master record has one master row and multiple detail rows.
MasterDetailRecord masterRecord = rows.get(0);
Object[] masterRow = masterRecord.getMasterRow();
List<Object[]> detailRows = masterRecord.getDetailRows();
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).

Related

Converting .txt Spark Output to .csv

Currently, I am getting the output of a Spark job in a .txt file. I am trying to convert it to .csv
.txt output (Dataset <String>)
John MIT Bachelor ComputerScience Mike UB Master ComputerScience
.csv output
NAME, UNIV, DEGREE, COURSE
John,MIT,Bachelor,ComputerScience
Mike,UB,Master,ComputerScience
I tried to collect it into a List, but I am not sure how to convert it to .csv and add the header.
This is a simple approach that converts the txt output data into a data structure (that can easily be written into a csv file).
The basic idea is to use the number of headers / columns to cut the one-line txt output into entry sets and to store them in a data structure.
Have a look at the code comments. Every "TODO 4 U" means work for you, mostly because I cannot really guess what you need to do at those positions in the code (like how to get the headers).
This is just a main method that does its work in a straightforward way. You may want to understand what it does and apply changes that make the code meet your requirements. Input and output are just Strings that you have to create, receive or process yourself.
public static void main(String[] args) {
// TODO 4 U: get the values for the header somehow
String headerLine = "NAME, UNIV, DEGREE, COURSE";
// TODO 4 U: read the txt output
String txtOutput = "John MIT Bachelor ComputerScience Mike UB Master ComputerScience";
/*
* then split the header line
* (or do anything similar, I don't know where your header comes from)
*/
String[] headers = headerLine.split(", ");
// store the amount of headers, which is the amount of columns
int amountOfColumns = headers.length;
// split txt output data by space
String[] data = txtOutput.split(" ");
/*
* declare a data structure that stores lists of Strings,
* each one is representing a line of the csv file
*/
Map<Integer, List<String>> linesForCsv = new TreeMap<Integer, List<String>>();
// get the length of the txt output data
int a = data.length;
// create a list of Strings containing the headers and put it into the data structure
List<String> columnHeaders = Arrays.asList(headers);
linesForCsv.put(0, columnHeaders);
// declare a line counter for the csv file
int l = 0;
// go through the txt output data in order to get the lines for the csv file
for (int i = 0; i < a; i++) {
// check if there is a new line to be created
if (i % amountOfColumns == 0) {
/*
* every time the amount of headers is reached,
* create a new list for a new line in the csv file
*/
l++; // increment the line counter (even at 0 because the header row is inserted at 0)
linesForCsv.put(l, new ArrayList<String>()); // create a new line-list
linesForCsv.get(l).add(data[i]); // add the data to the line-list
} else {
// if there is no new line to be created, store the data in the current one
linesForCsv.get(l).add(data[i]);
}
}
// print the lines stored in the map
// TODO 4 U: write this to a csv file instead of just printing it to the console
linesForCsv.forEach((lineNumber, line) -> {
System.out.println("Line " + lineNumber + ": " + String.join(",", line));
});
}
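For the last TODO (writing the lines to a file), something along these lines might do. It is only a sketch: the class name TxtToCsv, the helper toCsvLines and the temp-file target are placeholders of mine, and it assumes no value contains a comma or a space, so no quoting is needed.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TxtToCsv {
    // Groups the space-separated tokens into comma-joined rows, one row per header count.
    static List<String> toCsvLines(String headerLine, String txtOutput) {
        String[] headers = headerLine.split(",\\s*");
        String[] data = txtOutput.trim().split("\\s+");
        List<String> lines = new ArrayList<>();
        lines.add(String.join(",", headers));
        for (int i = 0; i < data.length; i += headers.length) {
            int end = Math.min(i + headers.length, data.length); // last row may be short
            lines.add(String.join(",", Arrays.copyOfRange(data, i, end)));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = toCsvLines("NAME, UNIV, DEGREE, COURSE",
                "John MIT Bachelor ComputerScience Mike UB Master ComputerScience");
        // Placeholder target: a temp file, replace with the real output path
        Path target = Files.createTempFile("output", ".csv");
        Files.write(target, lines, StandardCharsets.UTF_8);
        System.out.println("Wrote " + lines.size() + " lines to " + target);
    }
}
```

Files.write takes any Iterable of lines and adds the line separators itself, so the map from the answer above could be written the same way by passing its joined values.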

Counting and comparing strings from CSV file

I have a big CSV file of 18000 rows. They represent different kinds of articles from a liquor store. If I split up the rows with a BufferedReader, I get columns with things like article number, name, amount of alcohol, price, etc.
The seventh column holds the type of beverage (beer, wine, rum, etc.). What would be the best way to count how many articles there are of each type? I would like to be able to do this without having to know the types in advance. And I only want to use java.util.*.
Can I read through the column and store it in a queue? While reading, I would store each new type in a set. Then maybe I can compare all elements in the queue to this set?
br = new BufferedReader(new FileReader(filename));
while ((line = br.readLine()) != null) {
// using \t as separator
String[] articles = line.split(cvsSplitBy);
The output should be something like:
There are:
100 beers
2000 wines
etc. etc.
in the assortment
You could use a HashMap<String, Integer> to keep track of how many products of each category you have. The usage would be like:
HashMap<String, Integer> productCount = new HashMap<>(); // This must be outside the loop
// Read your CSV here; the lines below go inside the loop
String product = csvColumns[6];
if(!productCount.containsKey(product)) { // First time we see this product type
    productCount.put(product, 1);
} else { // Type already seen, add 1
    int count = productCount.get(product);
    productCount.put(product, count+1);
}
DISCLAIMER: You have to be sure all products are named the same way, e.g. beer vs. Beer. To Java, these are different Strings and so will have different counts. One way to handle this is to convert every product name to uppercase, e.g. beer -> BEER, Beer -> BEER. As a result, all names will be shown in uppercase in the results.
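Putting the counting and the case normalization together, a compact version could look like the sketch below. The names TypeCounter and countTypes are mine, the sample rows are invented, and I normalize to lowercase rather than uppercase, which is an arbitrary choice. It only uses java.util.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class TypeCounter {
    // Counts how often each value occurs in the given column, ignoring case.
    // The separator is used as a regex, as with String.split.
    static Map<String, Integer> countTypes(List<String> rows, String separator, int column) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by type name
        for (String row : rows) {
            String[] columns = row.split(separator);
            if (columns.length <= column) continue; // skip rows that are too short
            String type = columns[column].trim().toLowerCase(Locale.ROOT);
            counts.merge(type, 1, Integer::sum); // insert 1 or add 1 to the old count
        }
        return counts;
    }

    public static void main(String[] args) {
        // Invented sample rows; the beverage type sits in the seventh column (index 6)
        List<String> rows = Arrays.asList(
            "1\tA\t5.0\t12\tx\ty\tBeer",
            "2\tB\t4.5\t10\tx\ty\tbeer",
            "3\tC\t13.0\t99\tx\ty\tWine");
        System.out.println("There are:");
        countTypes(rows, "\t", 6).forEach((type, n) -> System.out.println(n + " " + type));
        System.out.println("in the assortment");
    }
}
```

Map.merge replaces the containsKey/get/put dance with one call, and no queue or separate set is needed since the map keys already are the set of types.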

Split a text file using delimiter \t and store in hashmap in java

I have a text file of n columns and n rows separated by tabs. How do I use the split function to store the columns and rows in a HashMap? Please help. My text file looks like this:
Dept Id Name Contact
IT 1 zzz 678
ECE 2 ttt 789
IT 3 rrr 908
I tried the following, but it didn't work.
Map<String,String> map=new HashMap<String,String>();
while(lineReader!=null)
{
String[] tokens = lineReader.split("\\t");
key = tokens[0];
values = tokens[1];
map.put(key , values );
System.out.println("ID:"+map.get(key ));
System.out.println("Other Column Values:"+map.get(values ));
}
This returns the key of the last entry (row) of the file and null as the value. But I need to store all rows and columns in the map. How do I do it?
If I understand your data correctly,
After
String[] tokens = lineReader.split("\\t");
is processed on the first line, you'd have 4 tokens in the array.
I think you are using the wrong logic. If you want to store the map in the following way:
IT -> (1 ZZZ 678)
.... etc., then you need to process the data differently.
What you are storing in the map is as follows:
IT -> 1
ECE -> 2
...
and so on.
That's why you get null when you try:
map.get(values);
What you should print instead is the key and map.get(key).
Actually, in any case I don't think a Map is what you want (but I don't know what you really want).
For now though, to understand the problem, try printing:
System.out.println("Total columns: "+ tokens.length);
Updated:
This should work for you. It isn't the most elegant implementation for what you want, but gets the job done. You should try improving it from here on.
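Map<String,String> map=new HashMap<String,String>();
String line;
// lineReader is assumed to be a BufferedReader over the file
while((line = lineReader.readLine()) != null)
{
    String[] tokens = line.split("\\t");
    String key = tokens[1]; // the Id column is unique per row, Dept is not
    String values = tokens[2] + " " + tokens[3];
    map.put(key , values );
    System.out.println("ID:"+key);
    System.out.println("Other Column Values:"+map.get(key));
}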
Good luck!
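If the goal is to keep every column of every row, one option is to key the map by the Id column and store the whole token array as the value. This is a sketch under my own assumptions: the names TabFileToMap and read are made up, the first line is treated as a header and skipped, and Id values are assumed to be unique.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

class TabFileToMap {
    // Reads tab-separated rows, skips the header, and keys each row by the given column.
    static Map<String, String[]> read(BufferedReader reader, int idColumn) {
        Map<String, String[]> rowsById = new LinkedHashMap<>(); // keeps file order
        reader.lines()
              .skip(1)                                   // header row
              .filter(line -> !line.isEmpty())
              .map(line -> line.split("\t"))
              .filter(tokens -> tokens.length > idColumn) // ignore malformed rows
              .forEach(tokens -> rowsById.put(tokens[idColumn], tokens));
        return rowsById;
    }

    public static void main(String[] args) {
        String sample = "Dept\tId\tName\tContact\n"
                      + "IT\t1\tzzz\t678\n"
                      + "ECE\t2\tttt\t789\n"
                      + "IT\t3\trrr\t908\n";
        Map<String, String[]> rows = read(new BufferedReader(new StringReader(sample)), 1);
        rows.forEach((id, tokens) -> System.out.println(id + " -> " + Arrays.toString(tokens)));
    }
}
```

Keying by Dept instead would overwrite the first IT row with the second one, which is why the Id column is the safer key here.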

OpenCSV returns a string even when there is no value in the CSV row

I'm writing a program to check if the first two rows, excluding the header, contain any data or not. If they do not, the file is to be ignored, and if either of the first two rows contain data, then process the file. I am using OpenCSV to retrieve the header, the first row and the second row into 3 different arrays and then checking them for my requirements. My problem is that even if the first two rows are empty, the reader returns something like [Ljava.lang.String;@13f17c9e as the output of the first and/or second row (depending on my test files).
Why does it return anything at all, other than a null, that is?
I'm not at my computer right now so excuse any mistakes~ The OpenCSV API Javadocs are rather brief but there doesn't seem to be much to it. Reading a line should parse the content into an array of strings. An empty line should result in an empty string array, which gives something like [Ljava.lang.String;@13f17c9e if you try to print the array itself: that is the array's type name and hash code, not its contents...
I would assume that the following example file:
1 |
2 |
3 | "The above lines are empty", 12345, "foo"
would produce the following if you did myCSVReader.readAll()
// List<String[]> result = myCSVReader.readAll();
0 : []
1 : []
2 : ["The above lines are empty","12345","foo"]
To perform what you describe in your question, test for length instead of some sort of null checking or string comparison.
List<String[]> lines = myCSVReader.readAll();
// let's print the output of the first three lines
for (int i=0; i<3; i++) {
    String[] lineTokens = lines.get(i);
    System.out.println("line:" + (i+1) + "\tlength:" + lineTokens.length);
    // print each of the tokens
    for (String token : lineTokens) {
        System.out.println("\ttoken: " + token);
    }
}
// only process the file if lines two or three aren't empty
if (lines.get(1).length > 0 || lines.get(2).length > 0) {
    System.out.println("Process this file!");
    processFile(lines);
}
else {
    System.out.println("Skipping...!");
}
// EXPECTED OUTPUT:
// line:1 length:0
// line:2 length:0
// line:3 length:3
// token: The above lines are empty
// token: 12345
// token: foo
// Process this file!
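To see why the odd-looking output appears, and how to print the row contents instead, here is a small self-contained sketch that does not need OpenCSV at all. The class name and the isBlankRow helper are my own. The helper also treats a row made only of blank tokens as empty, which is a defensive assumption on my part in case the parser hands back a single empty token for a blank line instead of a zero-length array.

```java
import java.util.Arrays;

class ArrayPrinting {
    // True when a parsed row carries no data: null, no tokens, or only blank tokens.
    static boolean isBlankRow(String[] row) {
        if (row == null) return true;
        for (String token : row) {
            if (token != null && !token.trim().isEmpty()) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String[] empty = new String[0];
        String[] row = {"The above lines are empty", "12345", "foo"};

        // Printing the array itself shows its type name and hash code
        System.out.println(empty);
        // Arrays.toString shows the contents instead
        System.out.println(Arrays.toString(empty));
        System.out.println(Arrays.toString(row));

        System.out.println("empty row blank? " + isBlankRow(empty));
        System.out.println("data row blank? " + isBlankRow(row));
    }
}
```

So a non-null return value for an empty line is expected; the thing to check is what is inside the array, not whether the reference is null.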

Get separate record for each value

I have a list, List<String> myList=new ArrayList<String>(); This list contains the countries that I am dealing with.
I am dealing with several records. I need to process them so that the records of each country are handled separately. I am using the following logic:
for(int zTmp = 0; zTmp<myList.size(); zTmp++)
{
System.out.println("COUNTRY IS"+myList.get(zTmp));
if((record).contains(myList.get(zTmp)))
{
// my next step
}
}
However, I find that every record gets past the if condition. The records are sorted alphabetically by country, so the records of each country are together. Please correct me.
This is my String
RECORD 1#India$
RECORD 2#India$
RECORD 3#United Arab Emirates$
RECORD 4#United Arab Emirates$
RECORD 5#United Kingdom$
Sorted as per country name.
I need a condition such that the loop body is entered once per country, i.e. the calculation must be done for RECORD 1 and RECORD 2, then break; then for records 3 and 4, then break; then for record 5, and so on.
Hope I am more clear now.
Maybe you mean this?
String currentCountry = "";
for (String record : myList) {
// Regex for entire string "^....$"
// Country between '#' and '$' (the latter escaped)
String country = record.replaceFirst("^.*#(.*)\\$$", "$1");
if (!country.equals(currentCountry)) {
currentCountry = country;
... // Deal with next country
}
}
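If it is easier to work with whole groups than to detect the country change inside the loop, the records can also be collected per country first. This is just a sketch: CountryGroups and groupByCountry are names I made up, and it assumes each record has the form shown in the question, with the country between '#' and the trailing '$'.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CountryGroups {
    // Maps each country to the record labels that belong to it, in input order.
    static Map<String, List<String>> groupByCountry(List<String> records) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String record : records) {
            int hash = record.indexOf('#');
            int dollar = record.lastIndexOf('$');
            if (hash < 0 || dollar <= hash) continue; // skip malformed records
            String country = record.substring(hash + 1, dollar);
            String label = record.substring(0, hash);
            groups.computeIfAbsent(country, k -> new ArrayList<>()).add(label);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
            "RECORD 1#India$", "RECORD 2#India$",
            "RECORD 3#United Arab Emirates$", "RECORD 4#United Arab Emirates$",
            "RECORD 5#United Kingdom$");
        // One pass over the groups: do the per-country calculation here
        groupByCountry(records).forEach((country, labels) ->
            System.out.println(country + ": " + labels));
    }
}
```

Unlike the change-detection loop, this does not rely on the records being sorted by country.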
for(int zTmp = 0; zTmp<myList.size(); zTmp++)
{
System.out.println("COUNTRY IS"+myList.get(zTmp));
if((record).contains(myList.get(zTmp)))
{
// my next step
}
}
Your if condition is false in every iteration only if record contains none of the countries in myList; otherwise it will be true in at least one iteration.
In comments you wrote:
You want to calculate 3 records
Instead of using myList, create a separate list (say myChoosenCountriesList) containing only those countries for which you want the if condition to become true.
Then replace your code with the following (note the other improvements as well):
int countryCount = myChoosenCountriesList.size();
for(int zTmp = 0; zTmp<countryCount; zTmp++)
{
String countryName = myChoosenCountriesList.get(zTmp);
System.out.println("COUNTRY IS"+countryName);
if(record.contains(countryName))
{
// my next step
}
}
The desired output has been achieved by using a do-while loop.
Here is the snippet:
int zTmp=0;
do
{
    String country=myList.get(zTmp);
    if(inputCountry.equals(country))
    {
        CalcDays(tmpTokens[iTmp]);
        myDateList.clear();
    }
    zTmp++;
}while(zTmp<myList.size());
