Univocity - Detect missing column when parsing CSV

I'm using the Univocity library to parse CSV and it works perfectly, but I need a way to detect whether the file being parsed has fewer columns than required.
For example, suppose I'm expecting a 3-column file, with columns mapped to [H1,H2,H3], and I receive a file (which has no headers) that looks like
V1_H1,V1_H2
V2_H1,V2_H2
When using
record.getString("H3");
this returns null. Instead, I need the file either to fail parsing, or a way to check that it is missing a column so I can stop processing it.
Is there any way to achieve this?

Since my main issue is to make sure that the header count matches the number of columns in the CSV file, and since I'm using an iterator to go over the records, I've added a check like:
CsvParser parser = new CsvParser(settings);
ResultIterator<Record, ParsingContext> iterator = parser.iterateRecords(inputStream).iterator();
if (iterator.getContext().parsedHeaders().length != settings.getHeaders().length) {
    throw new Exception("Invalid file");
}
It's working for me; I'm not sure whether there is a better way to do it.
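
If individual rows can be short even when the header check passes, the same comparison can be made per row. The following is a minimal sketch that uses plain JDK string splitting rather than Univocity; the class and method names are made up for illustration, and it assumes fields never contain quoted commas.

```java
import java.util.List;

final class ColumnCountCheck {

    // Returns the index of the first row whose column count differs from
    // expectedColumns, or -1 if every row matches.
    // Assumption: no quoted commas inside fields, so a plain split is enough.
    static int firstBadRow(List<String> rows, int expectedColumns) {
        for (int i = 0; i < rows.size(); i++) {
            // A limit of -1 keeps trailing empty fields, so "a,b," counts as 3 columns.
            int columns = rows.get(i).split(",", -1).length;
            if (columns != expectedColumns) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("V1_H1,V1_H2", "V2_H1,V2_H2");
        System.out.println(firstBadRow(rows, 3)); // prints 0
    }
}
```

With the two-column sample from the question and three expected columns, the first row is already reported as invalid, so processing can stop there.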

I've looked at the Univocity documentation and found that you can add annotations to the destination objects you generate from the CSV input:
@Parsed
@Validate
public String notNulNotBlank; // This should fail if the field is null or blank

@Parsed
@Validate(nullable = true)
public String nullButNotBlank;

@Parsed
@Validate(allowBlanks = true)
public String notNullButBlank;
This also lets you work with objects instead of individual fields.
Hope that helps :-)

Related

Univocity Error Handler for parsing fixed width multi-schema text file

I am using the univocity 2.9.1 file parser for a fixed-width text file which is multi-schema.
I have set up the different record types using the settings' addFormatForLookahead method.
settings.addFormatForLookahead("11", headerFields);
settings.addFormatForLookahead("12", accountFields);
settings.addFormatForLookahead("13", footerFields);
The fields have been defined for the different record types.
I am able to parse the records when the data is good, using parseAll or with a row processor and beginParsing/parseNext/stopParsing.
The issue is that when the file contains incorrect data, the parser throws a TextParsingException and stops parsing.
I would like the parser to continue parsing the rest of the file.
I see a RetryableErrorHandler mentioned in the documentation, but I cannot see how it applies in this scenario. I tried it and it didn't work, i.e.
settings.setProcessorErrorHandler(new RetryableErrorHandler<ParsingContext>() {
    @Override
    public void handleError(DataProcessingException error, Object[] inputRow, ParsingContext context) {
        // If there's an error in the first column, assign 50 and proceed with the record.
        if (error.getColumnIndex() == 0) {
            setDefaultValue(50);
        } else { // Otherwise keep the record anyway. Null will be used instead.
            keepRecord();
        }
    }
});
I guess I can create a default FixedWidthField field with one field, which is full length. In this case, if it doesn't identify the header, account, or footer records, it will use the default field and continue parsing on unknown records.
Please advise.
Thanks,
B

Java: Using InputStream and Apache Commons CSV without line numbers

This is probably very simple, but I have not been able to find an option to do this. I'm trying to use Apache Commons CSV to read a file for later validations. The CSV in question is submitted as an InputStream, which seems to add an additional column containing the line numbers when the file is read. I would like to be able to ignore it, if possible, as the header row does not contain a number, which causes an error. Is there already an option for InputStream to do this, or will I have to set up some kind of post-processing?
The code I'm using is as follows:
public String validateFile(InputStream filePath) throws Exception {
    System.out.println("Sending file to reader");
    System.out.println(filePath);
    InputStreamReader in = new InputStreamReader(filePath);
    // CSVFormat parse needs a reader object
    System.out.println("sending reader to CSV parse");
    for (CSVRecord record : CSVFormat.DEFAULT.withHeader().parse(in)) {
        for (String field : record) {
            System.out.print("\"" + field + "\", ");
        }
        System.out.println();
    }
    return null;
}
When using withHeader(), I end up with the following error:
java.lang.IllegalArgumentException: A header name is missing in [, Employee_ID, Department, Email]
and I can't simply skip it, as I will need to do some validations on the header row.
Also, here is an example CSV file:
"Employee_ID", "Department", "Email"
"0123456","Department of Hello World","John.Doe@gmail.com"
EDIT: Also, the end goal is to validate the following:
There are columns called "Employee_ID", "Department", and "Email". For this, I think I'll need to remove .withHeader().
Each line is comma-delimited.
There are no empty cell values.

Newer versions of Commons-CSV have trouble with empty headers.
Maybe that's the case here as well?
You mentioned "no empty cell values", but I'm not sure whether that includes the headers as well...
Also see: https://issues.apache.org/jira/browse/CSV-257
Setting .setAllowMissingColumnNames(true) did the trick for me.
final CSVFormat csvFormat = CSVFormat.Builder.create()
        .setHeader(HEADERS)
        .setAllowMissingColumnNames(true)
        .build();
final Iterable<CSVRecord> records = csvFormat.parse(reader);
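
The validation goals listed in the question (required column names present, no empty cells) can also be checked independently of the CSV library. The sketch below uses only the JDK; the class and method names are hypothetical, and it assumes fields contain no embedded commas or escaped quotes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class HeaderCheck {

    // Splits one CSV line on commas and strips surrounding whitespace and double quotes.
    // Assumption: no commas or escaped quotes inside fields.
    static List<String> cells(String line) {
        List<String> out = new ArrayList<>();
        for (String raw : line.split(",", -1)) {
            String cell = raw.trim();
            if (cell.length() >= 2 && cell.startsWith("\"") && cell.endsWith("\"")) {
                cell = cell.substring(1, cell.length() - 1);
            }
            out.add(cell);
        }
        return out;
    }

    // Returns the required header names that are absent from the header line.
    static List<String> missingHeaders(String headerLine, List<String> required) {
        List<String> missing = new ArrayList<>(required);
        missing.removeAll(cells(headerLine));
        return missing;
    }

    // True if any cell in the line is empty after trimming and unquoting.
    static boolean hasEmptyCell(String line) {
        return cells(line).stream().anyMatch(String::isEmpty);
    }

    public static void main(String[] args) {
        List<String> required = Arrays.asList("Employee_ID", "Department", "Email");
        System.out.println(missingHeaders("\"Employee_ID\", \"Department\"", required)); // prints [Email]
        System.out.println(hasEmptyCell("\"0123456\",\"\",\"John.Doe@gmail.com\""));     // prints true
    }
}
```

Because the cells are trimmed before comparison, the space after each comma in the sample header row does not affect the result.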

Parsing a text file using java with multiple values per line to be extracted

I'm not going to lie: I'm really bad at writing regular expressions. I'm currently trying to parse a text file that is giving me a lot of trouble. The goal is to extract the data between the respective "tags/titles". The file in question is a .qbo file, laid out as follows with personal information replaced by "DATA". The parts I care about retrieving are between the "STMTTRN" and "/STMTTRN" tags; I don't plan on putting the rest in my database, but I figured it would help others to see the file content I'm working with. I apologize for any confusion prior to this update.
OFXHEADER:100
DATA:OFXSGML
VERSION:102
SECURITY:NONE
ENCODING:USASCII
CHARSET:1252
COMPRESSION:NONE
OLDFILEUID:NONE
NEWFILEUID:NONE
<OFX>
<SIGNONMSGSRSV1><SONRS>
<STATUS><CODE>0</CODE><SEVERITY>INFO</SEVERITY></STATUS>
<DTSERVER>20190917133617.000[-4:EDT]</DTSERVER>
<LANGUAGE>ENG</LANGUAGE>
<FI>
<ORG>DATA</ORG>
<FID>DATA</FID>
</FI>
<INTU.BID>DATA</INTU.BID>
<INTU.USERID>DATA</INTU.USERID>
</SONRS></SIGNONMSGSRSV1>
<BANKMSGSRSV1>
<STMTTRNRS>
<TRNUID>0</TRNUID>
<STATUS><CODE>0</CODE><SEVERITY>INFO</SEVERITY></STATUS>
<STMTRS>
<CURDEF>USD</CURDEF>
<BANKACCTFROM>
<BANKID>DATA</BANKID>
<ACCTID>DATA</ACCTID>
<ACCTTYPE>CHECKING</ACCTTYPE>
<NICKNAME>FREEDOM CHECKING</NICKNAME>
</BANKACCTFROM>
<BANKTRANLIST>
<DTSTART>20190717</DTSTART><DTEND>20190917</DTEND>
<STMTTRN><TRNTYPE>POS</TRNTYPE><DTPOSTED>20190717071500</DTPOSTED><TRNAMT>-5.81</TRNAMT><FITID>3893120190717WO</FITID><NAME>DATA</NAME><MEMO>POS Withdrawal</MEMO></STMTTRN>
<STMTTRN><TRNTYPE>DIRECTDEBIT</TRNTYPE><DTPOSTED>20190717085000</DTPOSTED><TRNAMT>-728.11</TRNAMT><FITID>4649920190717WE</FITID><NAME>CHASE CREDIT CRD</NAME><MEMO>DATA</MEMO></STMTTRN>
<STMTTRN><TRNTYPE>ATM</TRNTYPE><DTPOSTED>20190717160900</DTPOSTED><TRNAMT>-201.99</TRNAMT><FITID>6674020190717WA</FITID><NAME>DATA</NAME><MEMO>ATM Withdrawal</MEMO></STMTTRN>
</BANKTRANLIST>
<LEDGERBAL><BALAMT>2024.16</BALAMT><DTASOF>20190917133617.000[-4:EDT]</DTASOF></LEDGERBAL>
<AVAILBAL><BALAMT>2020.66</BALAMT><DTASOF>20190917133617.000[-4:EDT]</DTASOF></AVAILBAL>
</STMTRS>
</STMTTRNRS>
</BANKMSGSRSV1>
</OFX>
I want to be able to end with data that looks or acts like the following so that each row of data can easily be added to a database:
[Image: Example Parse]

As David has already answered, it is good to parse the POS output XML using Java. If you are more interested in a regex to get all the information, you can use this regular expression:
<[^>]+>|\\n+
You can test it on the following sites:
https://rubular.com/
https://www.regextester.com/
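
As an illustration of the regex route, the sketch below pulls each transaction into a map of tag name to value using java.util.regex. It is not the expression quoted above: the two patterns and the class name are assumptions made for this example, and it relies on every element having a closing tag, as in the sample file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StmtTrnExtractor {

    // One whole transaction: everything between <STMTTRN> and </STMTTRN>.
    private static final Pattern TRANSACTION =
            Pattern.compile("<STMTTRN>(.*?)</STMTTRN>", Pattern.DOTALL);
    // One leaf element inside a transaction, e.g. <TRNAMT>-5.81</TRNAMT>.
    private static final Pattern FIELD =
            Pattern.compile("<([A-Z0-9.]+)>([^<]*)</\\1>");

    // Returns one map per transaction, keyed by tag name in document order.
    static List<Map<String, String>> extract(String text) {
        List<Map<String, String>> rows = new ArrayList<>();
        Matcher trn = TRANSACTION.matcher(text);
        while (trn.find()) {
            Map<String, String> row = new LinkedHashMap<>();
            Matcher field = FIELD.matcher(trn.group(1));
            while (field.find()) {
                row.put(field.group(1), field.group(2).trim());
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        String sample = "<STMTTRN><TRNTYPE>POS</TRNTYPE><DTPOSTED>20190717071500</DTPOSTED>"
                + "<TRNAMT>-5.81</TRNAMT><FITID>3893120190717WO</FITID>"
                + "<NAME>DATA</NAME><MEMO>POS Withdrawal</MEMO></STMTTRN>";
        System.out.println(extract(sample));
    }
}
```

Each resulting map corresponds to one database row; everything outside the STMTTRN elements is ignored.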

Given this is XML, I would do one of two things:
either use the Java DOM objects to marshall/unmarshall to/from Java objects (nodes and elements), or
use JAXB to achieve something similar but with better POJO representation.
Mkyong has tutorials for both. Try the dom parsing or jaxb. His tutorials are simple and easy to follow.
JAXB requires more work and dependencies. So try DOM first.

I would propose the following approach.
Read the file line by line with Files:
final List<String> lines = Files.readAllLines(Paths.get("/path/to/file"));
At this point you have all the lines of the file separated and ready to be converted into something more useful. But you should create a class beforehand.
Create a class for the data in a line, something like:
public class STMTTRN {
    private String TRNTYPE;
    private String DTPOSTED;
    ...
    ...
    // constructors
    // getters and setters
}
Now that you have the data in separate strings and a class to hold it, you can convert lines to objects with Jackson:
final XmlMapper xmlMapper = new XmlMapper();
final STMTTRN stmttrn = xmlMapper.readValue(lines.get(0), STMTTRN.class); // lines is a List, so use get(0) rather than [0]
You may want to create a loop, or use a stream with a mapper and a collector, to get the list of STMTTRN objects:
final List<STMTTRN> stmttrnData = lines.stream().map(this::mapLine).collect(Collectors.toList());
Where the mapper might be:
private STMTTRN mapLine(final String line) {
    final XmlMapper xmlMapper = new XmlMapper();
    try {
        return xmlMapper.readValue(line, STMTTRN.class);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

Using I/O stream to parse CSV file

I have a CSV file of US population data for every county in the US. I need to get each population from the 8th column of the file. I'm using a FileReader and a BufferedReader and I'm not sure how to use the split method to accomplish this. I know this isn't much information, but I know that I'll be using args[0] as the file path in my class.
I'm at a loss as to where to begin, to be honest.
import java.io.FileReader;
public class Main {
public static void main(String[] args) {
BufferedReader() buff = new BufferedReader(new FileReader(args[0]));
String
}
try {
}
}
The output should be an integer of the total US population. Any help with pointing me in the right direction would be great.

Don't reinvent the wheel and don't parse CSV yourself: use a library. Even a format as simple as CSV has nuances: fields can be quoted or unquoted, the file may or may not have a header, and so on. Besides that, you have to test and maintain the code you've written. So writing less code and reusing libraries is good.
There are plenty of CSV libraries for Java:
Apache Commons CSV
OpenCSV
Super CSV
Univocity
flatpack
IMHO, the first two are the most popular.
Here is an example for Apache Commons CSV:
final Reader in = new FileReader("counties.csv");
final Iterable<CSVRecord> records = CSVFormat.DEFAULT.parse(in);
for (final CSVRecord record : records) { // Simply iterate over the records with a for-each loop; all the parsing is handled for you
    String populationString = record.get(7); // Indexes are zero-based
    // Or, if the format is configured with headers (e.g. CSVFormat.DEFAULT.withFirstRecordAsHeader()),
    // look the column up by name instead:
    // String populationString = record.get("population");
    // … Do whatever you want with the population
}
Look how easy it is! And it will be similar with other parsers.
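
If a library is not an option and the file is known to contain no quoted fields, the split-based approach the question asks about can be sketched with the JDK alone. The class name is hypothetical, and the header line and the absence of quoted commas are assumptions about the input file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class PopulationTotal {

    // Sums the integer values in one column (zero-based index) of comma-separated text.
    // Assumptions: the first line is a header, and no field contains a quoted comma.
    static long sumColumn(Reader source, int columnIndex) {
        long total = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            reader.readLine(); // skip the header line
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // ignore blank lines
                }
                String[] fields = line.split(",", -1);
                total += Long.parseLong(fields[columnIndex].trim());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        String csv = "name,population\nA,100\nB,250\n";
        System.out.println(sumColumn(new StringReader(csv), 1)); // prints 350
    }
}
```

For the file in the question, the call would be sumColumn(new FileReader(args[0]), 7), since the 8th column has index 7.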

JSON to CSV with Java using CDL: possible to replace comma-separated values by semicolon-separated values?

Everything is in the title :)
I'm using org.json.CDL to convert a JSONArray into CSV data, but it renders a string with ',' as the separator.
I'd like to know if it's possible to replace it with ';'.
Here is a simple example of what I'm doing:
public String exportAsCsv() throws Exception {
    return CDL.toString(
        new JSONArray(
            mapper.writeValueAsString(extractAccounts()))
    );
}
Thanks in advance for any advice on this question.
Edit: A plain string replacement is not a solution, of course, as it could have an impact on large data; ideally the library used should let me specify the field separator.
Edit 2: In the end, extracting the data as a JSONArray (and String...) was not a very good solution, especially for large data files.
So I made the following changes:
use a Java CSV library (for example: http://www.csvreader.com/java_csv_samples.php)
refactor the code to stream data from the JSON input source to the CSV output source
This works better for processing large data. If you have comments, do not hesitate.

String output = "Hello,This,is,separated,by,a,comma";
// Simply call the replace method.
output = output.replace(',',';');
I found this in the String documentation.
Example
String value = "Hello,tthis,is,a,string";
value = value.replace(',', ';');
System.out.println(value);
// Outputs: Hello;tthis;is;a;string
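
Note that a blanket replace also changes commas inside field values. A minimal sketch of the alternative, joining the fields with the desired separator and quoting only where needed, is shown below. The class and method names are made up for illustration and are not part of org.json; the quoting rule (wrap in double quotes, double any embedded quotes) follows common CSV practice.

```java
import java.util.List;
import java.util.stream.Collectors;

final class DelimitedWriter {

    // Quotes a field if it contains the separator, a double quote, or a line break;
    // embedded double quotes are doubled.
    static String escape(String field, char separator) {
        boolean needsQuotes = field.indexOf(separator) >= 0
                || field.indexOf('"') >= 0
                || field.indexOf('\n') >= 0
                || field.indexOf('\r') >= 0;
        return needsQuotes ? "\"" + field.replace("\"", "\"\"") + "\"" : field;
    }

    // Joins the fields of one record with the given separator.
    static String toLine(List<String> fields, char separator) {
        return fields.stream()
                .map(f -> escape(f, separator))
                .collect(Collectors.joining(String.valueOf(separator)));
    }

    public static void main(String[] args) {
        System.out.println(toLine(List.of("Doe, John", "a;b", "plain"), ';'));
        // prints: Doe, John;"a;b";plain
    }
}
```

The comma inside "Doe, John" is left alone, while the value that contains a semicolon is quoted so it still reads back as a single field.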
