Univocity parser - Handling lines with weird constructs - java

I am trying to figure out the best way to use the Univocity parser to handle a CSV log file whose lines look like this:
"23.62.3.74",80,"testUserName",147653,"Log Collection Device 100","31/02/15 00:05:10 GMT",-1,"10.37.255.3","TCP","destination_ip=192.62.3.74|product_id=0071|option1_type=(s-dns)|proxy_machine_ip=10.1.255.3"
As you can see, this is a comma-delimited file, but the last column holds a bunch of values, each prefixed with its field name. My requirement is to capture the values from the normal fields and, selectively, from this last big field.
I know about the master-detail row processor in Univocity, but I doubt this fits into that category. Could you point me in the right direction, please?
Note: I can handle the name-prefixed fields in rowProcessed(String[] row, ParsingContext context) if I implement a row processor, but I am looking for something native to Univocity, if possible.
Thanks,
R

There's nothing native in the parser for that. Probably the easiest way to go about it is to have your RowProcessor as you mentioned.
One thing you can try to make your life easier is to use another instance of CsvParser to parse that last column:
//initialize a parser for the pipe separated bit
CsvParserSettings detailSettings = new CsvParserSettings();
detailSettings.getFormat().setDelimiter('=');
detailSettings.getFormat().setLineSeparator("|");
CsvParser detailParser = new CsvParser(detailSettings);
//here is the content of the last column (assuming you got it from the parser)
String details = "destination_ip=192.62.3.74|product_id=0071|option1_type=(s-dns)|proxy_machine_ip=10.1.255.3";
//The result will be a list of pairs
List<String[]> pairs = detailParser.parseAll(new StringReader(details));
//You can add the pairs to a map
Map<String, String> map = new HashMap<String, String>();
for (String[] pair : pairs) {
map.put(pair[0], pair[1]);
}
//this should print: {destination_ip=192.62.3.74, product_id=0071, proxy_machine_ip=10.1.255.3, option1_type=(s-dns)}
System.out.println(map);
That won't be extremely fast but at least it's easy to work with a map if that input can have random column names and values associated with them.
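If a second parser instance feels like too much for such a small field, plain string splitting inside rowProcessed is another option. Here is a minimal sketch, assuming that keys and values never contain '|' and that keys never contain '='. DetailField is a hypothetical helper name, not something that ships with Univocity:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of univocity-parsers.
class DetailField {

    // Splits "k1=v1|k2=v2" into a map that keeps the original field order.
    static Map<String, String> parse(String details) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String pair : details.split("\\|")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue; // skip fragments that have no '=' separator
            }
            // Split on the first '=' only, so values may themselves contain '='
            map.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return map;
    }

    public static void main(String[] args) {
        String details = "destination_ip=192.62.3.74|product_id=0071|option1_type=(s-dns)|proxy_machine_ip=10.1.255.3";
        Map<String, String> map = parse(details);
        System.out.println(map.get("product_id"));       // prints 0071
        System.out.println(map.get("proxy_machine_ip")); // prints 10.1.255.3
    }
}
```

From there you can pick only the keys you care about and ignore the rest.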

Related

Using I/O stream to parse CSV file

I have a CSV file of US population data for every county in the US. I need to get each population from the 8th column of the file. I'm using a FileReader and a BufferedReader and am not sure how to use the split method to accomplish this. I know this isn't much information, but I know that I'll be using my args[0] as the destination in my class.
I'm at a loss as to where to begin, to be honest.
import java.io.FileReader;
public class Main {
public static void main(String[] args) {
BufferedReader() buff = new BufferedReader(new FileReader(args[0]));
String
}
try {
}
}
The output should be an integer of the total US population. Any help with pointing me in the right direction would be great.
Don't reinvent the wheel and don't parse CSV yourself: use a library. Even a format as simple as CSV has nuances: fields can be escaped with quotes or left unescaped, the file may or may not have a header, and so on. Besides that, you have to test and maintain the code you've written. So writing less code and reusing libraries is good.
There are plenty of CSV libraries for Java:
Apache Commons CSV
OpenCSV
Super CSV
Univocity
flatpack
IMHO, the first two are the most popular.
Here is an example for Apache Commons CSV:
final Reader in = new FileReader("counties.csv");
final Iterable<CSVRecord> records = CSVFormat.DEFAULT.parse(in);
for (final CSVRecord record : records) { // Simply iterate over the records with a for-each loop; all the parsing is handled for you
String populationString = record.get(7); // Indexes are zero-based
// Or, if your file has headers and the format is configured to read them, use the column name instead:
// String populationString = record.get("population");
… // Do whatever you want with the population
}
Look how easy it is! And it will be similar with other parsers.
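If a library really is off the table (say, for a homework assignment), here is a minimal sketch of the BufferedReader and split approach the question asks about. It assumes the file has one header row, that no field contains a quoted comma, and that the 8th column holds plain digits. PopulationSum is a made-up class name:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class PopulationSum {

    // Sums the 8th column (index 7) of every row after the header.
    static long sum(Reader source) throws IOException {
        long total = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line = reader.readLine(); // assumed header row, skipped
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",");
                if (fields.length > 7) { // ignore rows that are too short
                    total += Long.parseLong(fields[7].trim());
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // In-memory sample data; for a real file pass new FileReader(args[0]) instead
        String sample = "a,b,c,d,e,f,g,population\n"
                + "1,2,3,4,5,6,7,100\n"
                + "1,2,3,4,5,6,7,250\n";
        System.out.println(sum(new StringReader(sample))); // prints 350
    }
}
```

Note that this breaks as soon as a county name contains a comma inside quotes, which is exactly the kind of nuance a library handles for you.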

Univocity - Detect missing column when parsing CSV

I'm using the Univocity library to parse CSV and it works perfectly, but I need a way to detect whether the file being parsed has fewer columns than required.
For example, say I'm expecting a 3-column file, with columns mapped to [H1,H2,H3], and I receive a file (which has no headers) that looks like
V1_H1,V1_H2
V2_H1,V2_H2
When using
record.getString("H3");
this returns null. Instead, I need the file either to fail parsing, or to let me check whether it is missing a column and stop processing it.
Is there any way to achieve this?
So since my main issue here is to make sure that the headers count is the same as the number of columns provided in the CSV file, and since I'm using an iterator to iterate over records, I've added a check like:
CsvParser parser = new CsvParser(settings);
ResultIterator<Record, ParsingContext> iterator = parser.iterateRecords(inputStream).iterator();
if(iterator.getContext().parsedHeaders().length != settings.getHeaders().length){
throw new Exception("Invalid file");
}
It's working for me, not sure if there is a better way to do it.
I've looked through the Univocity documentation and found that you can add annotations to the destination objects you generate from the CSV input:
@Parsed
@Validate
public String notNulNotBlank; //This should fail if the field is null or blank
@Parsed
@Validate(nullable = true)
public String nullButNotBlank;
@Parsed
@Validate(allowBlanks = true)
public String notNullButBlank;
This will also help you to use the objects instead of having to work with fields.
Hope that helps :-)

How to handle importing a CSV file with differing column lengths

I'm working on a project for school and am having a really hard time figuring out how to import a CSV file into a usable format. The CSV contains a movie name in the first column, followed by that movie's showtimes in the rest of the row, so it would look something like this:
movie1, 7pm, 8pm, 9pm, 10pm
movie2, 5pm, 8pm
movie3, 3pm, 7pm, 10pm
I think I want to split each row into its own array, maybe an ArrayList of the arrays? I really don't know where to even start, so any pointers would be appreciated.
Preferably, I don't want to use any external libraries.
I would go with a Map having movie name as key and timings as list like the one below:
Map<String, List<String>> movieTimings = new HashMap<>();
The idea is to read through the CSV file and put the values into this map. If the key already exists, we just need to add the value to its list. You can use the computeIfAbsent method of Map (Java 8) to create the list only when the entry does not exist yet, e.g.:
public static void main(String[] args) {
Map<String, List<String>> movieTimings = new HashMap<>();
String timing = "7pm";//It will be read from csv
movieTimings.computeIfAbsent("test", s -> new ArrayList<>()).add(timing);
System.out.println(movieTimings);
}
This will populate your map once the file is read. As far as reading the file is concerned, you can use BufferedReader or OpenCSV (if your project allows you to use third-party libraries).
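Putting the two pieces together, a rough sketch without external libraries could look like this. It assumes that the first field of each line is the movie name and that no field contains a comma. MovieTimings is just an illustrative class name:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MovieTimings {

    // Reads "movie, time, time, ..." lines into a map of movie name to showtimes.
    static Map<String, List<String>> read(Reader source) throws IOException {
        Map<String, List<String>> movieTimings = new LinkedHashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",");
                if (fields.length == 0 || fields[0].trim().isEmpty()) {
                    continue; // skip blank lines and lines without a movie name
                }
                // Create the list on first sight of the movie, reuse it afterwards
                List<String> timings =
                        movieTimings.computeIfAbsent(fields[0].trim(), k -> new ArrayList<>());
                for (int i = 1; i < fields.length; i++) {
                    timings.add(fields[i].trim());
                }
            }
        }
        return movieTimings;
    }

    public static void main(String[] args) throws IOException {
        String csv = "movie1, 7pm, 8pm, 9pm, 10pm\n"
                + "movie2, 5pm, 8pm\n"
                + "movie3, 3pm, 7pm, 10pm\n";
        // prints {movie1=[7pm, 8pm, 9pm, 10pm], movie2=[5pm, 8pm], movie3=[3pm, 7pm, 10pm]}
        System.out.println(read(new StringReader(csv)));
    }
}
```

For a real file, pass a FileReader instead of the StringReader. Rows of different lengths are not a problem, since each movie simply gets a list as long as its row.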
I have no affiliation with Univocity - but their Java CSV parser is amazing and free. When I had a question, one of the developers got back to me immediately. http://www.univocity.com/pages/about-parsers
You read in a line and then cycle through the fields. Since you know the movie name is always there and at least one movie time, you can set it up any way you like including an arraylist of arraylists (so both are variable length arrays).
It works well with or without quotes around the fields (necessary when there are apostrophes or commas in the movie names). In the problem I solved, all rows had the same number of columns, but I did not know the number of columns before I parsed the file, and each file often had a different number of columns and different column names. It worked perfectly.
You can use opencsv to read the CSV file and add each String[] to an ArrayList. There are examples in the FAQ section of opencsv's website.
Edit: If you don't want to use external libraries you can read the CSV using a BufferedReader and split the lines by commas.
BufferedReader br = null;
try{
List<String[]> data = new ArrayList<String[]>();
br = new BufferedReader(new FileReader(new File("csvfile")));
String line;
while((line = br.readLine()) != null){
String[] lineData = line.split(",");
data.add(lineData);
}
}catch(Exception e){
e.printStackTrace();
}finally{
if(br != null) try{ br.close(); } catch(Exception e){}
}

Spark : How to merge the transformations

I have 1000 JSON files. I need to do some transformations on each file and then create a merged output file. The merge can involve overlapping operations on values (for example, the output should not contain repeated values).
So, if I read the files with wholeTextFiles as (title, content) pairs, and then in the map function parse the content as a JSON tree and perform the transformation, where and how do I merge the output?
Do I need another transformation on the resulting RDDs to merge the values, and how would this work? Can I have a shared object (a List, a Map, or an RDD?) among all map blocks, which is updated as part of the transformation, so that I can check for repeated values there?
P.S.: Even if the output is written as part files, I would still like to have no repetitions.
Code:
//read the files as JavaPairRDD , which gives <filename, content> pairs
String filename = "/sample_jsons";
JavaPairRDD<String,String> distFile = sc.wholeTextFiles(filename);
//then create a JavaRDD from the content.
JavaRDD<String> jsonContent = distFile.map(x -> x._2);
//apply transformations, the map function will return an ArrayList which would
//have property names.
JavaRDD<ArrayList<String>> apm = jsonContent.map(
new Function< String, ArrayList<String> >() {
@Override
public ArrayList<String> call(String arg0) throws Exception {
JsonNode rootNode = mapper.readTree(arg0);
return parseJsonAndFindKey(rootNode, "type", "rootParent");
}
});
So, this way I am able to get all first-level properties in an ArrayList from each JSON file.
Now I need a final ArrayList that is the union of all these ArrayLists, with duplicates removed. How can I achieve that?
Why do you need 1000 RDDs for 1000 json files?
Do you see any issue with merging the 1000 json files in the input stage into one RDD?
If you'll be using one RDD from the input stage, it shouldn't be hard to perform all the needed actions on this RDD.

Parsing a CSV file for a unique row using the new Java 8 Streams API

I am trying to use the new Java 8 Streams API (for which I am a complete newbie) to parse for a particular row (the one with 'Neda' in the name column) in a CSV file. Using the following article for motivation, I modified and fixed some errors so that I could parse the file containing 3 columns - 'name', 'age' and 'height'.
name,age,height
Marianne,12,61
Julie,13,73
Neda,14,66
Julia,15,62
Maryam,18,70
The parsing code is as follows:
@Override
public void init() throws Exception {
Map<String, String> params = getParameters().getNamed();
if (params.containsKey("csvfile")) {
Path path = Paths.get(params.get("csvfile"));
if (Files.exists(path)){
// use the new java 8 streams api to read the CSV column headings
Stream<String> lines = Files.lines(path);
List<String> columns = lines
.findFirst()
.map((line) -> Arrays.asList(line.split(",")))
.get();
columns.forEach((l)->System.out.println(l));
// find the relevant sections from the CSV file
// we are only interested in the row with Neda's name
int nameIndex = columns.indexOf("name");
int ageIndex = columns.indexOf("age");
int heightIndex = columns.indexOf("height");
// we need to know the index positions of the columns
// have to re-read the csv file to extract the values
lines = Files.lines(path);
List<List<String>> values = lines
.skip(1)
.map((line) -> Arrays.asList(line.split(",")))
.collect(Collectors.toList());
values.forEach((l)->System.out.println(l));
}
}
}
Is there any way to avoid re-reading the file following the extraction of the header line? Although this is a very small example file, I will be applying this logic to a large CSV file.
Is there a technique to use the streams API to create a map between the extracted column names (in the first scan of the file) and the values in the remaining rows?
How can I return just one row in the form of List<String> (instead of List<List<String>> containing all the rows). I would prefer to just find the row as a mapping between the column names and their corresponding values. (a bit like a result set in JDBC). I see a Collectors.mapMerger function that might be helpful here, but I have no idea how to use it.
Use a BufferedReader explicitly:
List<String> columns;
List<List<String>> values;
try(BufferedReader br=Files.newBufferedReader(path)) {
String firstLine=br.readLine();
if(firstLine==null) throw new IOException("empty file");
columns=Arrays.asList(firstLine.split(","));
values = br.lines()
.map(line -> Arrays.asList(line.split(",")))
.collect(Collectors.toList());
}
Files.lines(…) also resorts to BufferedReader.lines(…). The only difference is that Files.lines will configure the stream so that closing the stream will close the reader, which we don’t need here, as the explicit try(…) statement already ensures the closing of the BufferedReader.
Note that there is no guarantee about the state of the reader after the stream returned by lines() has been processed, but we can safely read lines before performing the stream operation.
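The same single-pass approach can also produce the "column name to value" mapping asked about in the question. This is only a sketch, assuming a plain comma-separated file without quoted fields. HeaderMappedRows and toRow are hypothetical names:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class HeaderMappedRows {

    // Reads the header once, then turns every following line into a name-to-value map.
    static List<Map<String, String>> read(Reader source) throws IOException {
        try (BufferedReader br = new BufferedReader(source)) {
            String firstLine = br.readLine();
            if (firstLine == null) throw new IOException("empty file");
            String[] columns = firstLine.split(",");
            // The stream is fully consumed here, while the reader is still open
            return br.lines()
                    .map(line -> toRow(columns, line.split(",")))
                    .collect(Collectors.toList());
        }
    }

    // Pairs each column name with the value at the same position.
    static Map<String, String> toRow(String[] columns, String[] values) {
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < columns.length && i < values.length; i++) {
            row.put(columns[i], values[i]);
        }
        return row;
    }

    public static void main(String[] args) throws IOException {
        String csv = "name,age,height\nMarianne,12,61\nNeda,14,66\n";
        List<Map<String, String>> rows = read(new StringReader(csv));
        System.out.println(rows.get(1).get("age")); // prints 14
    }
}
```

For a file on disk, Files.newBufferedReader(path) can be passed in as the Reader.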
First, your concern that this code reads the file twice is unfounded. Files.lines returns a Stream of lines that is populated lazily. So the first part of the code reads only the first line, and the second part reads the rest (it does read the first line a second time, though, even if it is skipped). Quoting its documentation:
Read all lines from a file as a Stream. Unlike readAllLines, this method does not read all lines into a List, but instead populates lazily as the stream is consumed.
Onto your second concern about returning just a single row. In functional programming, what you are trying to do is called filtering. The Stream API provides such a method with the help of Stream.filter. This method takes a Predicate as argument, which is a function that returns true for all the items that should be kept, and false otherwise.
In this case, we want a Predicate that would return true when the name is equal to "Neda". This could be written as the lambda expression s -> s.equals("Neda").
So in the second part of your code, you could have:
lines = Files.lines(path);
List<List<String>> values = lines
.skip(1)
.map(line -> Arrays.asList(line.split(",")))
.filter(list -> list.get(0).equals("Neda")) // keep only items where the name is "Neda"
.collect(Collectors.toList());
Note however that this does not ensure that there is only a single item where the name is "Neda", it collects all possible items into a List<List<String>>. You could add some logic to find the first item or throw an exception if no items are found, depending on your business requirement.
Note also that calling Files.lines(path) twice can be avoided by using a BufferedReader directly, as in @Holger's answer.
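The "find the first item or throw an exception" variant mentioned above could be sketched with Stream.findFirst, which returns an Optional and stops reading as soon as a match is found. FirstMatch and findByName are illustrative names, and the sketch assumes the name is in the first column:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Stream;

class FirstMatch {

    // Returns the first data row whose first column equals the given name, if any.
    static Optional<List<String>> findByName(Stream<String> lines, String name) {
        return lines
                .skip(1) // skip the header line
                .map(line -> Arrays.asList(line.split(",")))
                .filter(list -> list.get(0).equals(name))
                .findFirst();
    }

    public static void main(String[] args) {
        // In-memory stand-in for Files.lines(path)
        Stream<String> lines = Stream.of(
                "name,age,height", "Marianne,12,61", "Neda,14,66", "Julia,15,62");
        List<String> row = findByName(lines, "Neda")
                .orElseThrow(() -> new IllegalStateException("no row for Neda"));
        System.out.println(row); // prints [Neda, 14, 66]
    }
}
```

When the lines come from Files.lines(path), wrap the stream in a try-with-resources statement so the file is closed afterwards.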
Using a CSV-processing library
Other Answers are good. But I recommend using a CSV-processing library to read your input files. As others noted, the CSV format is not as simple as it may seem. To begin with, the values may or may not be nested in quote-marks. And there are many variations of CSV, such as those used in Postgres, MySQL, Mongo, Microsoft Excel, and so on.
The Java ecosystem offers several such libraries. I use Apache Commons CSV.
The Apache Commons CSV library does not make use of streams. But you have no need for streams if you use a library to do the scut work. The library makes easy work of looping over the rows of the file without loading a large file into memory.
create a map between the extracted column names (in the first scan of the file) to the values in the remaining rows?
Apache Commons CSV does this automatically when you call withHeader.
return just one row in the form of List
Yes, easy to do.
As you requested, we can fill List with each of the 3 field values for one particular row. This List acts as a tuple.
List < String > tuple = List.of(); // Our goal is to fill this list of values from a single row. Initialize to an empty nonmodifiable list.
We specify the format we expect of our input file: standard CSV (RFC 4180), with the first row populated by column names.
CSVFormat format = CSVFormat.RFC4180.withHeader() ;
We specify the file path where to find our input file.
Path path = Path.of("/Users/basilbourque/people.csv");
We use try-with-resources syntax (see Tutorial) to automatically close our parser.
As we read each row, we check whether the name is Neda. If it is, we fill our tuple List with that row's field values and break out of the loop. We use List.of to conveniently return a List object of some unknown concrete class that is unmodifiable, meaning you can neither add nor remove elements.
try (
CSVParser parser =CSVParser.parse( path , StandardCharsets.UTF_8, format ) ;
)
{
for ( CSVRecord record : parser )
{
if ( record.get( "name" ).equals( "Neda" ) )
{
tuple = List.of( record.get( "name" ) , record.get( "age" ) , record.get( "height" ) );
break ;
}
}
}
catch ( FileNotFoundException e )
{
e.printStackTrace();
}
catch ( IOException e )
{
e.printStackTrace();
}
If we found success, we should see some items in our List.
if ( tuple.isEmpty() )
{
System.out.println( "Bummer. Failed to report a row for `Neda` name." );
} else
{
System.out.println( "Success. Found this row for name of `Neda`:" );
System.out.println( tuple.toString() );
}
When run.
Success. Found this row for name of Neda:
[Neda, 14, 66]
Instead of using a List as a tuple, I suggest you define a Person class to represent this data with proper data types. Our code here would then return a Person instance rather than a List<String>.
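Such a Person class might look roughly like the sketch below. The field types (int for age and height) and the fromFields factory are assumptions made for illustration, not something the sample data dictates:

```java
import java.util.Arrays;
import java.util.List;

// Illustrative value class; field types are assumed from the sample data.
final class Person {
    final String name;
    final int age;
    final int height;

    Person(String name, int age, int height) {
        this.name = name;
        this.age = age;
        this.height = height;
    }

    // Builds a Person from the three field values of one CSV row.
    static Person fromFields(List<String> fields) {
        if (fields.size() != 3) {
            throw new IllegalArgumentException("expected 3 fields but got " + fields.size());
        }
        return new Person(
                fields.get(0),
                Integer.parseInt(fields.get(1).trim()),
                Integer.parseInt(fields.get(2).trim()));
    }

    @Override
    public String toString() {
        return "Person[name=" + name + ", age=" + age + ", height=" + height + "]";
    }

    public static void main(String[] args) {
        Person neda = Person.fromFields(Arrays.asList("Neda", "14", "66"));
        System.out.println(neda); // prints Person[name=Neda, age=14, height=66]
    }
}
```

On Java 16 or later, a record would express the same thing with less code.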
I know I'm responding late, but maybe this will help someone in the future.
I've written a CSV parser/writer that is easy to use thanks to its builder pattern.
For your case, you can filter the lines you want to parse using
csvLineFilter(Predicate<String>)
I hope you find it handy. Here is the source code:
https://github.com/i7paradise/CsvUtils-Java8/
I've included a main class, Demo.java, to show how it works.
