Currently, I am getting output from a Spark job in a .txt file. I am trying to convert it to .csv.
.txt output (Dataset<String>):
John MIT Bachelor ComputerScience Mike UB Master ComputerScience
.csv output
NAME, UNIV, DEGREE, COURSE
John,MIT,Bachelor,ComputerScience
Amit,UB,Master,ComputerScience
I tried to collect it into a List, but I am not sure how to convert it to .csv and add the header.
This is a simple approach that converts the .txt output into a data structure that can easily be written to a .csv file.
The basic idea is to use data structures together with the number of headers/columns to parse record sets out of the single-line .txt output.
Have a look at the code comments: every "TODO 4 U" marks work left for you, mostly because I cannot really guess what you need to do at those positions in the code (like how to get the headers).
This is just a main method that does its work straightforwardly. You may want to understand what it does and adapt it to meet your requirements. Input and output are just Strings that you have to create, receive, or process yourself.
// requires imports: java.util.ArrayList, java.util.Arrays, java.util.List, java.util.Map, java.util.TreeMap
public static void main(String[] args) {
    // TODO 4 U: get the values for the header somehow
    String headerLine = "NAME, UNIV, DEGREE, COURSE";
    // TODO 4 U: read the txt output
    String txtOutput = "John MIT Bachelor ComputerScience Mike UB Master ComputerScience";
    /*
     * then split the header line
     * (or do anything similar, I don't know where your header comes from)
     */
    String[] headers = headerLine.split(", ");
    // store the amount of headers, which is the amount of columns
    int amountOfColumns = headers.length;
    // split txt output data by space
    String[] data = txtOutput.split(" ");
    /*
     * declare a data structure that stores lists of Strings,
     * each one representing a line of the csv file
     */
    Map<Integer, List<String>> linesForCsv = new TreeMap<Integer, List<String>>();
    // get the length of the txt output data
    int a = data.length;
    // create a list of Strings containing the headers and put it into the data structure
    List<String> columnHeaders = Arrays.asList(headers);
    linesForCsv.put(0, columnHeaders);
    // declare a line counter for the csv file
    int l = 0;
    // go through the txt output data in order to get the lines for the csv file
    for (int i = 0; i < a; i++) {
        // check if there is a new line to be created
        if (i % amountOfColumns == 0) {
            /*
             * every time the amount of headers is reached,
             * create a new list for a new line in the csv file
             */
            l++; // increment the line counter (even at 0 because the header row is inserted at 0)
            linesForCsv.put(l, new ArrayList<String>()); // create a new line-list
            linesForCsv.get(l).add(data[i]); // add the data to the line-list
        } else {
            // if there is no new line to be created, store the data in the current one
            linesForCsv.get(l).add(data[i]);
        }
    }
    // print the lines stored in the map
    // TODO 4 U: write this to a csv file instead of just printing it to the console
    linesForCsv.forEach((lineNumber, line) -> {
        System.out.println("Line " + lineNumber + ": " + String.join(",", line));
    });
}
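To cover that last TODO, here is a minimal sketch of the write-out step, assuming the target file output.csv and the use of java.nio (both are my own placeholders, not part of the original answer):
// requires: java.nio.charset.StandardCharsets, java.nio.file.Files, java.nio.file.Paths
// (Files.write throws IOException, so the enclosing method has to declare or handle it)
List<String> csvLines = new ArrayList<String>();
for (List<String> line : linesForCsv.values()) {
    // join each line-list with commas, e.g. "NAME,UNIV,DEGREE,COURSE"
    csvLines.add(String.join(",", line));
}
// write all lines to the csv file in one go
Files.write(Paths.get("output.csv"), csvLines, StandardCharsets.UTF_8);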
I have a file of alphanumeric VINs from vehicles (saved as strings). I need to parse through this file and determine:
1) Is a VIN duplicated? If so, how many times?
2) Write the duplicated VIN and the total number of duplicates to a text file.
I have gotten it to work using the brute-force method of two nested for loops. I am looking for a more elegant way to parse the strings. I'm using Java 7 in NetBeans 8.2 and it doesn't appear to like using Set or HashMap.
Constraints
1) The VINs may be in any order
2) The duplicates can be scattered through the file at random
/* a) Open input and output files */
try {
    inputStream = new BufferedReader(new FileReader(fileName)); // csv file
    outputStream = new PrintWriter(new FileWriter("DuplicateVINs.txt"));
    /* b) Read in the file line by line,
       then slice out the 17 digit VIN from the extra data I don't care about */
    while ((thisLine = inputStream.readLine()) != null) {
        l = thisLine.substring(1, 18);
        linesVIN.add(l.split(",")); // why does this split have to be here?
    }
    /* c) Now that the List is full, calculate its size and then write to an array of strings */
    String[][] inputArray = new String[linesVIN.size()][];
    i = linesVIN.size();
    System.out.println(i);
    linesVIN.toArray(inputArray);
    /* d) Will use two nested for loops to look for duplicates */
    countj = 0;
    countk = 0;
    for (int j = 1; j <= i - 1; j++) { // j loop
        duplicateVIN = Arrays.toString(inputArray[j]);
        for (int k = 1; k <= i - 1; k++) {
            if (duplicateVIN.equals(Arrays.toString(inputArray[k]))) {
                countk = countk + 1;
                foundFlag = true;
            } else {
                if (countk >= 2) {
                    //if (j != k) {
                    System.out.println(duplicateVIN + countk);
                    //} // see if this removes the first duplicate
                }
                foundFlag = false;
                countk = 0;
            }
        } // ends k loop
        countj = j;
    } // ends j loop
} // completes the try
[2q3CDZC90JH1qqqqq], 3
[2q4RC1NG1JR1qqqqq], 4
[2q3CDZC96KH1qqqqq], 2
[1q4PJMDN8KD1qqqqq], 7
I'm using Java 7 in NetBeans 8.2 and it doesn't appear to like using Set or HashMap.
Your first step should be to figure out what you're doing wrong with the map. A HashMap is the perfect solution for this problem, and is really what you should be using.
Here's a broad example of how the solution would work, using the information you provided.
Map<String, Integer> countMap = new HashMap<String, Integer>();
while ((thisLine = inputStream.readLine()) != null) {
    l = thisLine.substring(1, 18);
    if (countMap.containsKey(l)) {
        countMap.put(l, countMap.get(l) + 1);
    } else {
        countMap.put(l, 1);
    }
}
I'm assuming that the while loop you provided is properly iterating over all VINs.
After this while loop has completed, you just need to output the count for each key, similar to this:
for (String vin : countMap.keySet()) {
    System.out.println("VIN: " + vin + " COUNT: " + countMap.get(vin));
}
If I've read your problem correctly, there is no need for a nested loop.
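Since you also need to write the duplicated VINs and their counts to a text file, a minimal Java 7-compatible sketch of that last step could look like this (countMap comes from the snippet above; the output file name is the one from your own code):
PrintWriter out = new PrintWriter(new FileWriter("DuplicateVINs.txt"));
try {
    for (Map.Entry<String, Integer> entry : countMap.entrySet()) {
        // only report VINs that occur more than once
        if (entry.getValue() > 1) {
            out.println(entry.getKey() + ", " + entry.getValue());
        }
    }
} finally {
    out.close();
}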
I have a CSV file with non-standardized content; it goes something like this:
John, 001
01/01/2015, hamburger
02/01/2015, pizza
03/01/2015, ice cream
Mary, 002
01/01/2015, hamburger
02/01/2015, pizza
John, 003
04/01/2015, chocolate
Now, what I'm trying to do is write logic in Java to separate them. I would like "John, 001" to be the header and all the rows under John, before Mary, to belong to John.
Will this be possible? Or should I just do it manually?
Edit:
For the input, even though it is not standardized, a noticeable pattern is that the rows without names always start with a date.
My output goal would be a Java object, which I can eventually store in the database in the format below.
Name, hamburger, pizza, ice cream, chocolate
John, 01/01/2015, 02/01/2015, 03/01/2015, NA
Mary, 01/01/2015, 02/01/2015, NA, NA
John, NA, NA, NA, 04/01/2015
You could just read the file into a list
List<String> lines = Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8);
Afterwards, iterate over the list and split each line on the wanted delimiter (",").
Now you could just use if-else or switch blocks to check for specific entries.
List<DataObject> objects = new ArrayList<>();
DataObject dataObject = null;
for (String s : lines) {
    String[] splitLine = s.split(",");
    if (splitLine[0].matches("(\\d{2}/){2}\\d{4}")) {
        // we found a date line
        if (dataObject != null && splitLine.length == 2) {
            String date = splitLine[0];
            String dish = splitLine[1].trim(); // trim the leading space after the comma
            dataObject.add(date, dish);
        } else {
            // handle error
        }
    } else if (splitLine.length == 2) {
        // we found a name line, so we can create a new data object
        if (dataObject != null) {
            objects.add(dataObject);
        }
        String name = splitLine[0];
        String id = splitLine[1].trim();
        dataObject = new DataObject(name, id);
    } else {
        // handle error
    }
}
// don't forget to add the last data object after the loop
if (dataObject != null) {
    objects.add(dataObject);
}
Now you can sort them into your specific categories.
Edit: Changed the loop and added a regex (which may not be optimal) for matching the date strings, using it to decide whether a line should be added to the last data object.
The DataObject class can contain data structures holding the dates/dishes. Once the CSV is parsed you can iterate over the objects list and do whatever you want. I hope this answer helps :)
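The DataObject class itself is not shown; here is a minimal sketch of what it could look like (the field names and the add method signature are assumptions derived from how it is used above):
// hypothetical DataObject: one name/id header plus its date -> dish entries
// requires: java.util.LinkedHashMap, java.util.Map
public class DataObject {
    private final String name;
    private final String id;
    private final Map<String, String> dishesByDate = new LinkedHashMap<String, String>();

    public DataObject(String name, String id) {
        this.name = name;
        this.id = id;
    }

    public void add(String date, String dish) {
        dishesByDate.put(date, dish);
    }

    public String getName() { return name; }
    public String getId() { return id; }
    public Map<String, String> getDishesByDate() { return dishesByDate; }
}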
If I have correctly understood, the specs are :
input is text, one record per line (fields are comma delimited)
2 kinds of records :
headers consisting of a name and a number (number is ignored)
actual records consisting of a date and a meal
output should contain :
one header containing the constant Name, and the meals in order of occurrence
one record per name, consisting of the name and the dates corresponding to the meals - an absent field will hold the constant string NA
we assume that we will never get, for a given name, the same date in different input records.
The algorithm is in pseudo code :
Data structures :
one list of struct<string name, hash<int mealIndex, date>> for the names: base
one list of strings for the meals: meals
Code :
name = null
iname = -1
loop per input line {
    if first field is a date {
        if name == null {
            throw Exception("incorrect structure")
        }
        meal = second field
        look for index of meal in meals
        if not found {
            index = len(meals)
            add meal at end of list meals
        }
        base[iname].hash[index] = date
    }
    else {
        name = first field
        iname += 1
        add a new struct { name, empty hash } at end of list base
    }
}
close input file
open output file
// header line
print "Name"
for meal in meals {
    print ",", meal
}
print newline
// one record per name
for (i = 0; i <= iname; i++) {
    print base[i].name
    for (m = 0; m < len(meals); m++) {
        if m is a key of base[i].hash {
            print ",", base[i].hash[m]
        }
        else {
            print ",NA"
        }
    }
    print newline
}
close output file
Just code it in correct Java and come back here if you have any problem.
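For reference, a minimal Java translation of that pseudo code could look like the following sketch (the file names input.csv and output.csv, the date regex, and the simple split(",") parsing are my own assumptions):
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MealReport {

    // one entry per name record: the name plus a map of meal index -> date
    static class Record {
        final String name;
        final Map<Integer, String> dateByMealIndex = new HashMap<>();
        Record(String name) { this.name = name; }
    }

    public static void main(String[] args) throws IOException {
        List<String> meals = new ArrayList<>();
        List<Record> base = new ArrayList<>();
        Record current = null;

        for (String line : Files.readAllLines(Paths.get("input.csv"), StandardCharsets.UTF_8)) {
            String[] fields = line.split(",");
            String first = fields[0].trim();
            if (first.matches("(\\d{2}/){2}\\d{4}")) {   // a date line belongs to the current name
                if (current == null) {
                    throw new IllegalStateException("incorrect structure");
                }
                String meal = fields[1].trim();
                int index = meals.indexOf(meal);
                if (index < 0) {                          // first time we see this meal
                    index = meals.size();
                    meals.add(meal);
                }
                current.dateByMealIndex.put(index, first);
            } else {                                      // a name line starts a new record
                current = new Record(first);
                base.add(current);
            }
        }

        try (PrintWriter out = new PrintWriter("output.csv", "UTF-8")) {
            out.println("Name," + String.join(",", meals));
            for (Record r : base) {
                StringBuilder row = new StringBuilder(r.name);
                for (int m = 0; m < meals.size(); m++) {
                    row.append(",").append(r.dateByMealIndex.getOrDefault(m, "NA"));
                }
                out.println(row.toString());
            }
        }
    }
}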
Use uniVocity-parsers to handle this for you. It comes with a master-detail row processor.
// 1st, create a RowProcessor to process all "detail" elements (dates/dishes)
ObjectRowListProcessor detailProcessor = new ObjectRowListProcessor();

// 2nd, create a MasterDetailListProcessor to identify whether or not a row is a master row
// (first value of the row is a name, second is an integer)
MasterDetailListProcessor masterRowProcessor = new MasterDetailListProcessor(RowPlacement.TOP, detailProcessor) {
    @Override
    protected boolean isMasterRecord(String[] row, ParsingContext context) {
        try {
            // tries to convert the second value of the row to an Integer
            Integer.parseInt(String.valueOf(row[1]));
            return true;
        } catch (NumberFormatException ex) {
            return false;
        }
    }
};

CsvParserSettings parserSettings = new CsvParserSettings();
// set the RowProcessor to the masterRowProcessor
parserSettings.setRowProcessor(masterRowProcessor);

CsvParser parser = new CsvParser(parserSettings);
parser.parse(new FileReader(yourFile));

// here we get the MasterDetailRecord elements
List<MasterDetailRecord> rows = masterRowProcessor.getRecords();

// each master record has one master row and multiple detail rows
MasterDetailRecord masterRecord = rows.get(0);
Object[] masterRow = masterRecord.getMasterRow();
List<Object[]> detailRows = masterRecord.getDetailRows();
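To walk over every name in the file rather than only the first one, you can iterate over the list returned by getRecords(); a short sketch (the printing is just illustrative):
for (MasterDetailRecord record : rows) {
    Object[] master = record.getMasterRow();           // e.g. [John, 001]
    System.out.println("Name: " + master[0]);
    for (Object[] detail : record.getDetailRows()) {   // e.g. [01/01/2015, hamburger]
        System.out.println("  " + detail[0] + " -> " + detail[1]);
    }
}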
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
I'm trying to figure out how to read data from a file, using an array. The data in the file is listed like this:
Clarkson 80000
Seacrest 100000
Dunkleman 75000
...
I want to store that information using an array. Currently I have something like this to read the data and use it:
String name1 = in1.next();
int vote1 = in1.nextInt();
//System.out.println(name1 +" " + vote1);
String name2 = in1.next();
int vote2 = in1.nextInt();
//System.out.println(name2 +" " + vote2);
String name3 = in1.next();
int vote3 = in1.nextInt();
...
//for all names
The problem is, the way I'm doing it means I can never handle the file data for more contestants or whatnot.
While I can do it this way, handle all the math within different methods, and get the expected output... it's really inefficient, I think.
Output expected:
American Idol Fake Results for 2099
Idol Name Votes Received % of Total Votes
__________________________________________________
Clarkson 80,000 14.4%
Seacrest 100,000 18.0%
Dunkleman 75,000 13.5%
Cowell 110,000 19.7%
Abdul 125,000 22.4%
Jackson 67,000 12.0%
Total Votes 557,000
The winner is Abdul!
I figure reading input file data into arrays is likely easy using java.io.BufferedReader; is there a way not to use that?
I looked at this: Java: How to read a text file but I'm stuck thinking this is a different implementation.
I want to process all the information through understandable arrays and maybe at least 2-3 methods (in addition to the main method that reads and stores all the data at runtime). But say I want to use that data to find percentages and such (like the output), figure out the winner... and maybe even alphabetize the results!
I want to try something and learn how the code works to get a feel for the concept at hand. ;c
int i = 0;
while (in.hasNext()) {
    name = in.next();   // e.g. "Clarkson"
    vote = in.next();   // e.g. "80000" (kept as a String here)
    // Do whatever here: print, save name and vote, etc.
    // For example, create an array and save the info there. Since both name and vote
    // are Strings here, a 2D String array works:
    array[i][0] = name;
    array[i][1] = vote;
    // if you want to store names and votes individually, create two arrays instead:
    nameArray[i] = name;
    voteArray[i] = vote;
    i++;
}
This will loop until the Scanner finds there is no more input to read. Inside the loop, you can do anything you want (print the name and votes, etc.). In this case, you save all the values into array[][].
array[][] will be this:
array[0][0] = Clarkson
array[0][1] = 80000
array[1][0] = Seacrest
array[1][1] = 100000
...and so on.
Also, I can see that you have to do some maths. So, since the votes are saved as Strings, you should convert them to double this way:
double votesInDouble= Double.parseDouble(array[linePosition][1]);
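For completeness, here is a rough sketch of the setup the loop above assumes (the file name idols.txt and the fixed size of 100 are placeholders you would replace):
// requires java.util.Scanner and java.io.File; the enclosing method must handle FileNotFoundException
Scanner in = new Scanner(new File("idols.txt"));
String[][] array = new String[100][2];   // 100 is an arbitrary upper bound on the number of contestants
String[] nameArray = new String[100];
String[] voteArray = new String[100];
String name;
String vote;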
You have several options:
create a Class to represent your File data, then have an array of those Objects
maintain two arrays in parallel, one of the names and the other of the votes
Use a Map, where the name of the person is the key and the number of votes is the value
The Map option a) gives you direct access (much like an array) and b) doesn't require you to create a class.
Option 1:
public class Idol
{
    private String name;
    private int votes;

    public Idol(String name, int votes)
    {
        this.name = name;
        this.votes = votes;
    }
}
int index = 0;
Idol[] idols = new Idol[SIZE];
// read from file
String name1 = in1.next();
int vote1 = in1.nextInt();
//create Idol
Idol i = new Idol(name1, vote1);
// insert into array, increment index
idols[index++] = i;
Option 2:
int index = 0;
String[] names = new String[SIZE];
int[] votes = new int[SIZE];
// read from file
String name1 = in1.next();
int vote1 = in1.nextInt();
// insert into arrays
names[index] = name1;
votes[index++] = vote1;
Option 3:
// create Map
Map<String, Integer> idolMap = new HashMap<>();
// read from file
String name1 = in1.next();
int vote1 = in1.nextInt();
// insert into Map
idolMap.put(name1, vote1);
Now you can go back and manipulate the data to your heart's content.
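Building on the Map from Option 3, here is a small sketch of the kind of post-processing the question asks for (total, percentages, winner); the printf formatting only approximates the expected output:
// assumes idolMap has been filled as shown in Option 3
int totalVotes = 0;
for (int v : idolMap.values()) {
    totalVotes += v;
}

String winner = null;
int maxVotes = -1;
for (Map.Entry<String, Integer> e : idolMap.entrySet()) {
    double percent = 100.0 * e.getValue() / totalVotes;
    System.out.printf("%-12s %,10d %6.1f%%%n", e.getKey(), e.getValue(), percent);
    if (e.getValue() > maxVotes) {
        maxVotes = e.getValue();
        winner = e.getKey();
    }
}
System.out.printf("Total Votes %,10d%n", totalVotes);
System.out.println("The winner is " + winner + "!");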
I'm writing a program to check if the first two rows, excluding the header, contain any data or not. If they do not, the file is to be ignored, and if either of the first two rows contain data, then process the file. I am using OpenCSV to retrieve the header, the first row and the second row into 3 different arrays and then checking them for my requirements. My problem is that even if the first two rows are empty, the reader returns something like [Ljava.lang.String;#13f17c9e as the output of the first and/or second row (depending on my test files).
Why does it return anything at all, other than a null, that is?
I'm not at my computer right now, so excuse any mistakes~ The OpenCSV API Javadoc is rather brief, but there doesn't seem to be much to it. Reading a line parses the content into an array of Strings. An empty line results in an empty String array, which still prints as something like [Ljava.lang.String;#13f17c9e if you print it directly, because Java arrays don't override toString()...
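A quick illustration of that printing behaviour (this snippet is just a reminder about how Java prints arrays, not part of OpenCSV itself):
String[] emptyRow = new String[0];
System.out.println(emptyRow);                  // [Ljava.lang.String;@<hashcode> - type name plus hash code
System.out.println(Arrays.toString(emptyRow)); // []   (requires java.util.Arrays)
System.out.println(emptyRow.length);           // 0    <- test this to detect empty rows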
I would assume that the following example file:
1 |
2 |
3 | "The above lines are empty", 12345, "foo"
would produce the following if you did myCSVReader.readAll()
// List<String[]> result = myCSVReader.readAll();
0 : []
1 : []
2 : ["The above lines are empty","12345","foo"]
To perform what you describe in your question, test for length instead of some sort of null checking or string comparison.
List<String[]> lines = myCSVReader.readAll();

// let's print the output of the first three lines
for (int i = 0; i < 3; i++) {
    String[] lineTokens = lines.get(i);
    System.out.println("line:" + (i + 1) + "\tlength:" + lineTokens.length);
    // print each of the tokens
    for (String token : lineTokens) {
        System.out.println("\ttoken: " + token);
    }
}

// only process the file if line two or line three isn't empty
if (lines.get(1).length > 0 || lines.get(2).length > 0) {
    System.out.println("Process this file!");
    processFile(lines);
} else {
    System.out.println("Skipping...!");
}
// EXPECTED OUTPUT:
// line:1 length:0
// line:2 length:0
// line:3 length:3
// token: The above lines are empty
// token: 12345
// token: foo
// Process this file!
I have a Text file (.txt) that contains data strings using comma separated values, i.e
jordan,hello,12th Feb 15:23, pending
I would like to then pull this data into an HTML table, with the "," separating each column. For example, the headings of the table would be:
Name Question Date Status
Thus Jordan would go in the name column and hello under question etc.
Currently I output the full string, but I need the individual elements.
Any advice would be appreciated.
You need a parser to read the csv file and create the individual elements. You can either use String.split(...) or, even better, leverage a CSV parsing library. Create a class Data and populate it with the parsed data (each row has a corresponding Data object). You should have a List<Data> after the entire file has been parsed, which you can pass to the JSP page. The JSP then iterates through the List and creates the table.
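The Data class itself isn't shown here; a minimal sketch could look like this (the field names are assumptions based on the table headings in the question):
// hypothetical Data holder for one parsed line: name, question, date, status
public class Data {
    private final String name;
    private final String question;
    private final String date;
    private final String status;

    public Data(String name, String question, String date, String status) {
        this.name = name;
        this.question = question;
        this.date = date;
        this.status = status;
    }

    public String getName() { return name; }
    public String getQuestion() { return question; }
    public String getDate() { return date; }
    public String getStatus() { return status; }
}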
Assuming that you don't have a , in the data, simply call split(",") on each line and create a custom formatted HTML table, something like this (haven't tested):
out.println("<table>")
for (int i=0; i<lines.length; ++i) {
out.println("<tr>" )
String[] data = line[i].split(",");
for (String val : data) {
out.println("<td>" + val + "</td>")
}
out.println("</tr>" )
}
out.println("</table>")
You can use the String#split method to convert your String into an array of Strings containing all the values:
String s = "jordan,hello,12th Feb 15:23, pending";
String[] sArray = s.split(",");
for (String si : sArray) {
    // you can print each value in your output here
}