Validate the column separator from a CSV file


Is there a way to validate the column separator of a CSV file? I'm using the CsvMapper library, and I know I can set the column separator with .withColumnSeparator(). My files are supposed to use the '|' character as the separator, but sometimes a file arrives with ';' or another character instead, so I want to verify that the separator really is '|'.
For example, I have this CSV file with two lines:
A|B|3|e
w|ew|34|w
This is a valid file because it uses '|' as the separator.
But sometimes I receive a CSV file like this:
A;B;3;e
w;ew;34;w
This file is separated by ';' rather than '|', which is why I need to validate the column separator.
Thanks a lot.
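One possible way to run such a check before handing the file to CsvMapper, sketched in plain Java: confirm that every line contains the expected separator and splits into the same number of columns. The class name SeparatorCheck and the method usesSeparator are hypothetical, and the sketch assumes that fields are not quoted and never contain the separator themselves.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class SeparatorCheck {

    // Returns true when every non-empty line contains the expected separator
    // and all lines split into the same number of columns (at least two).
    // Assumption: fields are not quoted and never contain the separator.
    static boolean usesSeparator(List<String> lines, char separator) {
        String regex = Pattern.quote(String.valueOf(separator));
        int expectedColumns = -1;
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // ignore blank lines
            }
            // A limit of -1 keeps trailing empty columns in the count.
            int columns = line.split(regex, -1).length;
            if (columns < 2) {
                return false; // the separator does not occur on this line
            }
            if (expectedColumns == -1) {
                expectedColumns = columns;
            } else if (columns != expectedColumns) {
                return false; // inconsistent column count
            }
        }
        return expectedColumns != -1; // false when there were no data lines
    }

    public static void main(String[] args) {
        // prints true
        System.out.println(usesSeparator(Arrays.asList("A|B|3|e", "w|ew|34|w"), '|'));
        // prints false
        System.out.println(usesSeparator(Arrays.asList("A;B;3;e", "w;ew;34;w"), '|'));
    }
}
```

In practice it may be enough to check only the first few lines of the file and reject it early when the check fails.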

Related

read unique char: 'あ' from json file in java

I am reading a JSON file in Java using this code:
String data = Files.readFile(jsonFile)
.trim()
.replaceAll("[^\\x00-\\x7F]", "")
.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "")
.replaceAll("\\p{C}", "");
My JSON file contains the character 'あ' (code point 12354), which turns into "" (nothing) when the file is read.
How can I make this character show up in my variable "data"?
From the answers I've received, I understand that replaceAll("[^\\x00-\\x7F]", "") strips every non-ASCII character from the data. But what can I do if I want all non-ASCII characters removed except 'あ'?

The character you want is the Unicode character HIRAGANA LETTER A, code point U+3042.
You can simply add it to the set of characters to keep:
...
.replaceAll("[^\\x00-\\x7F\\u3042]", "")
...
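A small self-contained sketch of that character class in action. The class name KeepHiraganaA and the sample string are made up for illustration; only the regular expression comes from the answer.

```java
final class KeepHiraganaA {

    // Removes every character outside the ASCII range except
    // U+3042 (HIRAGANA LETTER A), which is listed in the negated class.
    static String clean(String input) {
        return input.replaceAll("[^\\x00-\\x7F\\u3042]", "");
    }

    public static void main(String[] args) {
        // The sample contains U+3042, U+3044 and U+00E9; only U+3042 survives.
        String raw = "{\"name\":\"\u3042\u3044caf\u00e9\"}";
        System.out.println(clean(raw));
    }
}
```

More characters can be kept the same way by adding further \\uXXXX escapes or ranges inside the brackets.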

Removing special characters in dynamic schema

Requirement:
Source file:
abc,''test,data'',valid
xyz,''sample,data'',invalid
We need to read the data in the source file dynamically, so we read each entire line into a single string column. The comma is both the file delimiter and part of one of the values. I have to load the data into the target table as follows, without the double quotes:
Target table :
Col1|Col2|Col3
abc|test,data|valid
xyz|sample,data|invalid

Apache common CSV formatter: IOException: invalid char between encapsulated token and delimiter

I am trying to parse a CSV file using JakartaCommons-csv
Sample input file
Field1,Field2,Field3,Field4,Field5
"Ryan, R"u"bianes"," dummy#gmail.com","29445","626","South delhi, Rohini 122001"
Formatter: CSVFormat.newFormat(',').withIgnoreEmptyLines().withQuote('"')
CSV_DELIMITER is ,
Output
Field1 value after CSV parsing should be : Ryan, R"u"bianes
Field5 value after CSV parsing should be : South delhi, Rohini 122001
Exception: Caused by: java.io.IOException: (line 2) invalid char between encapsulated token and delimiter

The problem is that your file does not follow the accepted standard for quoting in CSV files. The correct way to represent a quote inside a quoted string is to repeat the quote. For example:
Field1,Field2,Field3,Field4,Field5
"Ryan, R""u""bianes"," dummy#gmail.com","29445","626","South delhi, Rohini 122001"
If you restrict yourself to the standard form of CSV quoting, the Apache Commons CSV parser should work.
Unfortunately, it is not feasible to write a consistent parser for your variant format, because there is no way to disambiguate an embedded comma from a field separator if you need to represent a field containing "Ryan R","baines".
The rules for quoting in CSV files are set out in various places including RFC 4180.
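If you control the code that writes the file, the quote-doubling rule is easy to apply on output. The following is a minimal sketch in plain Java; the class name CsvQuoting and the method quote are hypothetical and are not part of Commons CSV.

```java
final class CsvQuoting {

    // Quotes a field the RFC 4180 way: wrap it in double quotes and
    // double every double quote that appears inside it.
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        // prints "Ryan, R""u""bianes"
        System.out.println(quote("Ryan, R\"u\"bianes"));
    }
}
```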

The problem here is that the quotes are not properly escaped, and your parser doesn't handle that. Try univocity-parsers, as it is the only parser for Java I know of that can handle unescaped quotes inside a quoted value. It is also 4 times faster than Commons CSV. Try this code:
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import com.univocity.parsers.csv.UnescapedQuoteHandling;
import java.io.StringReader;
import java.util.List;

// configure the parser to handle your situation
CsvParserSettings settings = new CsvParserSettings();
settings.setHeaderExtractionEnabled(true); // uses the first line as headers
settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_CLOSING_QUOTE);
settings.trimQuotedValues(true); // trim whitespace around values in quotes
// create the parser
CsvParser parser = new CsvParser(settings);
String input = "" +
    "Field1,Field2,Field3,Field4,Field5\n" +
    "\"Ryan, R\"u\"bianes\",\" dummy#gmail.com\",\"29445\",\"626\",\"South delhi, Rohini 122001\"";
// parse your input
List<String[]> rows = parser.parseAll(new StringReader(input));
// print the parsed values, one per line, with a divider after each row
for (String[] row : rows) {
    for (String value : row) {
        System.out.println('[' + value + ']');
    }
    System.out.println("-----");
}
This will print:
[Ryan, R"u"bianes]
[dummy#gmail.com]
[29445]
[626]
[South delhi, Rohini 122001]
-----
Hope it helps.
Disclosure: I'm the author of this library, it's open source and free (Apache 2.0 license)

CsvParameterLayout in Log4j2 is inserting NUL character if data starts and ends with {}

I am using CsvParameterLayout to generate a CSV file. One of the columns is a JSON string. This is the logger call I am using:
logger.info("log:{}",json);
Following is the JSON string:
{"id":10,"name":"sumit"}
And following is the string which is getting printed in CSV output file:
NUL{"id":10,"name":"sumit"}NUL
The NUL character is the \x00 character. It is only visible when I open the file in Notepad++; in Notepad it shows as a space. This causes a problem in the CSV processing module.
If there is a letter before the {, the CSV output is fine. It seems as if Log4j is escaping '{' with a NUL character. It happens whenever the string starts with {, (, [, ' or ".
Can someone please tell me whether I can escape '{' and '"' in code so that Log4j does not include the NUL character in the output?

How to distinguish in quotes delimiter vs out of quotes delimiter

I have a txt file that contains the following
SELECT TOP 20 personid AS "testQu;otes"
FROM myTable
WHERE lname LIKE '%pi%' OR lname LIKE '%m;i%';
SELECT TOP 10 personid AS "testQu;otes"
FROM myTable2
WHERE lname LIKE '%ti%' OR lname LIKE '%h;i%';
............
The queries above can be any legitimate SQL statements (on one line or several, i.e. however the user chooses to type them).
I need to split this text and put the statements into an array:
File file ... blah blah blah
..........................
String myArray [] = text.split(";");
But this does not work properly because it splits on every ;. I need to ignore any ; that appears inside "..." or '...'. For example, the ; in '%h;i%' does not count because it is inside single quotes. How can I split correctly?

Assuming that each ; you want to split on is at the end of a line, you can split on each ; followed by a line separator:
text.split(";"+System.lineSeparator())
If your file uses line separators other than the platform default, you can try:
text.split(";\n")
text.split(";\r\n")
text.split(";\r")
By the way, if you want to keep the ; in the split result, you can use a look-behind:
text.split("(?<=;)"+System.lineSeparator())
If you are reading the file line by line, just check line.endsWith(";").
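As a variation on the calls above, Java 8 and later offer the \R linebreak matcher, which covers \n, \r\n and \r in a single pattern. A minimal sketch, where the class name StatementSplit and the sample text are made up for illustration:

```java
import java.util.Arrays;

final class StatementSplit {

    // Splits on a semicolon that is immediately followed by a line break.
    // \R (Java 8+) matches \n, \r\n and \r, among other line terminators.
    static String[] splitStatements(String text) {
        return text.split(";\\R");
    }

    public static void main(String[] args) {
        String text = "SELECT 'a;b' FROM t1;\r\nSELECT 'c;d'\nFROM t2;\n";
        // The semicolons inside the quoted literals are not followed by a
        // line break, so they are left alone.
        System.out.println(Arrays.toString(splitStatements(text)));
    }
}
```

Like the original suggestion, this only works when every statement terminator sits at the end of a line, and a final ; with no line break after it stays attached to the last statement.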

I see a line break after each of your ';' delimiters. Does that hold for the whole text file?
If you must or want to use a regular expression, you could split with a regex of the form
;$
In Java, $ matches at the end of each line only when multiline mode is enabled, for example with the (?m) flag; by default it matches only at the end of the input (or just before a final line terminator).
I would not use a regex for this kind of task. Parsing the text and counting the ' and " characters is enough to recognize the real ";" delimiters.
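A minimal sketch of such a scanner in plain Java. Instead of counting quotes it remembers which quote character is currently open, which amounts to the same thing. The class name QuoteAwareSplitter is hypothetical, and the sketch assumes balanced quotes and no SQL comments containing ; or quote characters.

```java
import java.util.ArrayList;
import java.util.List;

final class QuoteAwareSplitter {

    // Splits on ';' only while the scanner is outside '...' and "..." spans.
    // Assumption: quotes are balanced and the text has no SQL comments.
    static List<String> split(String text) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        char openQuote = 0; // 0 means "not inside a quoted span"
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (openQuote != 0) {
                if (c == openQuote) {
                    openQuote = 0; // closing quote
                }
                current.append(c);
            } else if (c == '\'' || c == '"') {
                openQuote = c; // opening quote
                current.append(c);
            } else if (c == ';') {
                parts.add(current.toString().trim()); // real delimiter
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        String tail = current.toString().trim();
        if (!tail.isEmpty()) {
            parts.add(tail); // last statement without a trailing ';'
        }
        return parts;
    }

    public static void main(String[] args) {
        String sql = "SELECT a AS \"x;y\" FROM t WHERE b LIKE '%m;i%';\n"
                   + "SELECT c FROM u;";
        split(sql).forEach(System.out::println);
    }
}
```

Unlike the line-based approaches, this does not depend on where the line breaks fall, so it also handles several statements typed on one line.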
