Regex to clean csv from confusing characters - java

My problem is:
I'm using a CSV file that was exported by some software, and this software doesn't handle CSV very well: some strings in the file contain a quote character, and strings are also wrapped in quotes, so I'm having issues parsing it.
This is a normal CSV line:
"one","two","three"
and here is my case:
"one","tw"o","three"
So I'm having issues parsing strings like "tw"o". This is basically a problem with the software that outputs the file, and I can't edit that software.
So I thought I could create a regex that removes the unnecessary quotes or commas and makes sure that each string is wrapped in quotes and delimited by commas. Does anyone know how I can achieve this?
I'm using the tototoshi library for Scala.

I tried Python's csv module, and it was able to do that (it sounds like a hack, but the input file is wrong after all, and using a regex would be a hack too):
import csv
z = '''"one","tw"o","three"'''
cr = csv.reader([z])
print(next(cr))
result:
['one', 'two"', 'three']
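The quote has been moved to the end of the string: the parser treats the second quote as the end of the quoted part and then appends the remaining characters, including the last quote, literally (a valid way to put a double quote in a field would be to double it).
To remove it, strip the quotes from each field instead of printing the row directly: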
print([x.replace('"',"") for x in next(cr)])
to get
['one', 'two', 'three']
Note that csv will return 4 fields for "one","tw",o","three", so if the stray quote is followed by a comma, nothing works; only human verification can fix this.

One pretty simple regex solution that may work for you is this:
regex: (?<=\w)"(?=\w) //global flag
replace: '' //blank string
As long as we can view "bad" double quotes as those that are surrounded by alphanumerics, this will work. It's just a lookbehind for an alphanumeric, a double quote, and a lookahead for an alphanumeric. It would not match a double quote escaped with a backslash or by another double quote, so "" or \" would be okay.
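As a rough illustration of that substitution, here is a minimal Python sketch (the function name strip_inner_quotes is made up for this example; the same pattern should carry over to java.util.regex unchanged). It also shows the limit of the approach: a stray quote that sits next to a comma is left alone.

```python
import re

# A double quote with a word character (letter, digit or underscore) on both sides.
INNER_QUOTE = re.compile(r'(?<=\w)"(?=\w)')

def strip_inner_quotes(line):
    """Remove every quote that sits between two word characters."""
    return INNER_QUOTE.sub("", line)

# The stray quote in "tw"o" is removed.
print(strip_inner_quotes('"one","tw"o","three"'))
# Here the stray quote is followed by a comma, so the line comes back unchanged.
print(strip_inner_quotes('"one","tw",o","three"'))
```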

Looks like you can't predict what kind of values with unescaped quotes you might get. There's no way to clean this up reliably with regex.
Maybe try univocity-parsers as it has a CSV parser that can handle this sort of input properly. Example:
//first configure the parser
CsvParserSettings settings = new CsvParserSettings();
//override the default unescaped quote handling. This seems more appropriate for your case.
settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_CLOSING_QUOTE);
//then create a parser and parse your input line:
CsvParser parser = new CsvParser(settings);
List<String[]> results = parser.parseAll(<your input here>);
Hope it helps.
Disclaimer: I'm the author of this library. It's open-source and free (Apache v2.0 license)

Related

storing CSV files containing floating point numbers with phone language set to Deutsch [duplicate]

I've got a two-column CSV with a name and a number. Some people's names contain commas, for example Joe Blow, CFA. This comma breaks the CSV format, since it's interpreted as a new column.
I've read up, and the most common prescription seems to be replacing that character, or replacing the delimiter, with a new value (e.g. this|that|the, other).
I'd really like to keep the comma separator (I know Excel supports other delimiters, but other interpreters may not). I'd also like to keep the comma in the name, as Joe Blow| CFA looks pretty silly.
Is there a way to include commas in CSV columns without breaking the formatting, for example by escaping them?
To encode a field containing comma (,) or double-quote (") characters, enclose the field in double-quotes:
field1,"field, 2",field3, ...
Literal double-quote characters are typically represented by a pair of double-quotes (""). For example, a field exclusively containing one double-quote character is encoded as """".
For example:
Sheet: |Hello, World!|You "matter" to us.|
CSV: "Hello, World!","You ""matter"" to us."
More examples (sheet → csv):
regular_value → regular_value
Fresh, brown "eggs" → "Fresh, brown ""eggs"""
" → """"
"," → ""","""
,,," → ",,,"""
,"", → ","""","
""" → """"""""
See wikipedia.
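The quoting rules above can be checked with Python's csv module, whose default settings quote only when needed and double any embedded quotes. This is a small sketch; encode_field and decode_line are helper names invented for the example.

```python
import csv
import io

def encode_field(value):
    """Encode a single field the way csv.writer does with its default settings."""
    buf = io.StringIO()
    csv.writer(buf).writerow([value])
    # writerow appends a line terminator; drop it to get just the field.
    return buf.getvalue().rstrip("\r\n")

def decode_line(line):
    """Parse one CSV line back into a list of fields."""
    return next(csv.reader([line]))

for raw in ['regular_value', 'Fresh, brown "eggs"', '"', '","']:
    print(raw, "->", encode_field(raw))

print(decode_line('"Hello, World!","You ""matter"" to us."'))
```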
I found that some applications, like Numbers on Mac, ignore the double quote if there is a space before it.
a, "b,c" doesn't work while a,"b,c" works.
The problem with the CSV format is that there's not one spec; there are several accepted methods, with no way of telling which one should be used (for generating or interpreting). I discussed all the methods to escape characters (newlines in that case, but the same basic premise) in another post. Basically it comes down to using a CSV generation/escaping process that suits the intended users, and hoping the rest don't mind.
Reference spec document.
If you want to do what you described, you can use quotes. Something like this:
$name = "Joe Blow, CFA.";
$arr[] = "\"".$name."\"";
Now you can use a comma in your name variable.
You need to quote those values.
Here is a more detailed spec.
In addition to the points in other answers: one thing to note if you are using quotes in Excel is the placement of your spaces. If you have a line of code like this:
print('%s, "%s", "%s", "%s"' % (value_1, value_2, value_3, value_4))
Excel will treat the initial quote as a literal quote instead of using it to escape commas. Your code will need to change to
print('%s,"%s","%s","%s"' % (value_1, value_2, value_3, value_4))
It was this subtlety that brought me here.
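The effect of a space before the opening quote can be reproduced with Python's csv module, used here only as a stand-in: Excel and Numbers are not involved, and the helper name parse is made up for this sketch. By default the reader treats a quote that follows a space as literal text; its skipinitialspace option makes it tolerate the space.

```python
import csv

def parse(line, **options):
    """Parse a single CSV line with the given csv.reader options."""
    return next(csv.reader([line], **options))

# No space: the quotes protect the comma.
print(parse('a,"b,c"'))
# Space before the quote: the quote is kept as text and the comma splits the field.
print(parse('a, "b,c"'))
# skipinitialspace=True ignores the space, so the quotes work again.
print(parse('a, "b,c"', skipinitialspace=True))
```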
You can use Template literals (Template strings)
e.g -
`"${item}"`
CSV files can actually be formatted using different delimiters, comma is just the default.
You can use the sep flag to specify the delimiter you want for your CSV file.
Just add the line sep=; as the very first line in your CSV file, that is if you want your delimiter to be semi-colon. You can change it to any other character.
This isn't a perfect solution, but you can replace every comma with ‚ (a low single quotation mark). It looks very similar to a comma and serves the same purpose visually. No quotes are required.
In JS this would be:
stringVal.replaceAll(',', '‚')
You will need to be very careful in cases where you need to compare that data directly, though.
Depending on your language, there may be a to_json method available. That will escape many things that break CSVs.
I faced the same problem and quoting the , did not help. Eventually, I replaced the , with +, finished the processing, saved the output into an outfile and replaced the + with ,. This may seem ugly but it worked for me.
May not be what is needed here but it's a very old question and the answer may help others. A tip I find useful with importing into Excel with a different separator is to open the file in a text editor and add a first line like:
sep=|
where | is the separator you wish Excel to use.
Alternatively you can change the default separator in Windows but a bit long-winded:
Control Panel>Clock & region>Region>Formats>Additional>Numbers>List separator [change from comma to your preferred alternative]. That means Excel will also default to exporting CSVs using the chosen separator.
You could encode your values, for example in PHP base64_encode($str) / base64_decode($str)
IMO this is simpler than doubling up quotes, etc.
https://www.php.net/manual/en/function.base64-encode.php
The encoded values will never contain a comma so every comma in your CSV will be a separator.
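A minimal Python sketch of the same idea (the PHP functions above have direct counterparts in Python's base64 module; encode_value and decode_value are names chosen for this example). The Base64 alphabet contains no comma, so splitting on commas is safe, at the cost of a file that is no longer readable by eye.

```python
import base64

def encode_value(text):
    """Base64-encode a value so it cannot contain a comma or a quote."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

def decode_value(token):
    """Reverse encode_value."""
    return base64.b64decode(token).decode("utf-8")

# Build one CSV row from encoded values, then split and decode it again.
row = ",".join(encode_value(v) for v in ["Joe Blow, CFA", "42"])
print(row)
print([decode_value(t) for t in row.split(",")])
```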
You can set the Text_Qualifier field in your Flat File connection manager to ". This should wrap your data in quotes and separate only on commas that are outside the quotes.
First, if the item value contains a double quote character ("), replace it with two double quote characters (""):
item = item.ToString().Replace("""", """""")
Finally, wrap item value:
ON LEFT: With double quote character (")
ON RIGHT: With double quote character (") and comma character (,)
csv += """" & item.ToString() & ""","
Doubling the quotes did not work for me; escaping them as \" did. If you want to place a double quote, for example, you can write \"\".
You can build formulas this way, for example:
fprintf(strout, "\"=if(C3=1,\"\"\"\",B3)\"\n");
This writes "=if(C3=1,"""",B3)" to the CSV file, which Excel reads as the formula:
=IF(C3=1,"",B3)
A C# method for escaping delimiter characters and quotes in column text. It should be all you need to ensure your csv is not mangled.
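private string EscapeDelimiter(string field)
{
    // Quote the field when it contains the delimiter, a double quote, or a line break.
    if (field.Contains(yourEscapeCharacter) || field.Contains("\"") || field.Contains("\n") || field.Contains("\r"))
    {
        // Double any embedded quotes, then wrap the whole field in quotes.
        field = field.Replace("\"", "\"\"");
        field = $"\"{field}\"";
    }
    return field;
}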

java regex to remove unwanted double quotes in csv

I have a CSV file that contains the following line. As you can see, numbers are NOT enclosed in double quotes.
String theLine = "Corp:Industrial","5Nearest",51.93000000,"10:21:29","","","","10:21:29","7/5/2016","PER PHONE CALL WITH SAP, CORRECTING "C","359/317 97 SMRD 96.961 MADV",""
I try to read the above line and split using the regEX
String[] tokens = theLine.split(",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))");
This doesn't split at every comma the way I want it to.
"PER PHONE CALL WITH SAP, CORRECTING "C", is messing it up because it contains an additional , (comma) and " (double quote). Can someone please help me write a regex that will escape an additional double quote and a comma within two double quotes?
I basically want :
"Corp:Industrial","5Nearest",51.93000000,"10:21:29","","","","10:21:29","7/5/2016","**PER PHONE CALL WITH SAP CORRECTING C**","359/317 97 SMRD 96.961 MADV",""
There are jobs that parsers are much better at than Regular Expressions, and this sort of thing is typically one of them. I'm not saying you can't make it work for you, but ... there are also open-source CSV Parsers you could press into service.
Having said that, your CSV looks suspect to me.
"PER PHONE CALL WITH SAP, CORRECTING "C",
That value has three quotes in it -- is it meant to represent a string with only a single quote inside? Or should the C be surrounded by quotes as well as the String?
Normally if you're going to include a double quote inside a double quote you need a special syntax for it. For CSV, the most common options would be doubling it, or escaping it with a character like a backslash:
"PER PHONE CALL WITH SAP, CORRECTING ""C""",
Or:
"PER PHONE CALL WITH SAP, CORRECTING \"C\"",
None of which will directly change your problem of using Regular Expressions, but once you have well-formed CSV, your odds of parsing it successfully go up.
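To illustrate the point about well-formed input, here is a short sketch using Python's csv module in place of a Java parser (the sample lines are shortened from the question, and the function names are invented for the example). Both escaping styles parse cleanly once the inner quotes are escaped.

```python
import csv

# The same record with the inner quotes doubled, and with backslash escapes.
doubled = '"7/5/2016","PER PHONE CALL WITH SAP, CORRECTING ""C""","359/317 97 SMRD 96.961 MADV",""'
escaped = r'"7/5/2016","PER PHONE CALL WITH SAP, CORRECTING \"C\"","359/317 97 SMRD 96.961 MADV",""'

def parse_doubled(line):
    """Parse a line that escapes quotes by doubling them (the csv default)."""
    return next(csv.reader([line]))

def parse_backslash(line):
    """Parse a line that escapes quotes with a backslash."""
    return next(csv.reader([line], doublequote=False, escapechar="\\"))

print(parse_doubled(doubled))
print(parse_backslash(escaped))
```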

How can I get a CSVPrinter to quote fields with spaces?

I'm using the CSVPrinter class from the Apache Commons in order to output a CSV file. What I would like to have happen is that if a given field contains any spaces in it, that it gets encapsulated in quotes. But if it's just a long string of numbers, or a date string, for example, then those do not need to be quoted.
Unfortunately the QuoteMode enum seems pretty limited; it offers the four following choices:
ALL Quotes all fields.
MINIMAL Quotes fields which contain special characters such as a delimiter, quote character or any of the characters in line separator.
NON_NUMERIC Quotes all non-numeric fields.
NONE Never quotes fields.
The MINIMAL option seems to be the closest to what I want to do here, but since the space character is not part of a line separator, that doesn't work. Is there any way to configure a CSVPrinter object to quote fields that have spaces in them?
CSVPrinter is not that flexible, and it is also final, so you can't override the printing implementation.
I suggest you find a different csv library or look at the source code of CSVPrinter and implement your own version with your own requirements. It is definitely not a requirement of the CSV format to quote strings with whitespace though. Any implementation that complies with the format should be able to read strings with whitespace in them (and not quoted).
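If you do end up writing your own version, the quoting rule itself is small. The sketch below is plain Python, not the Commons CSV API, and format_field and format_row are hypothetical names: it quotes a field when it contains whitespace, the delimiter, or a quote, and leaves everything else bare.

```python
def format_field(value, delimiter=",", quote='"'):
    """Quote a field only if it contains whitespace, the delimiter, or a quote."""
    text = str(value)
    needs_quotes = any(ch.isspace() or ch in (delimiter, quote) for ch in text)
    if not needs_quotes:
        return text
    # Embedded quotes are doubled, as in standard CSV.
    return quote + text.replace(quote, quote * 2) + quote

def format_row(values, delimiter=","):
    """Join already-formatted fields into one CSV line (no line terminator)."""
    return delimiter.join(format_field(v, delimiter) for v in values)

print(format_row(["12345", "2016-07-05", "two words", 'say "hi"']))
```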

Java Splitting a String

I have this string
G234101,Non-Essential,ATPases,Respiration chain complexes,"Auxotrophies, carbon and",PS00017,2,IONIC HOMEOSTASIS,mitochondria.
which I have been trying to split in Java. The file is comma-delimited, but some of the strings have commas within them and I don't want them to get split up. Currently, in the above example,
"Auxotrophies, carbon and"
is getting split into two strings.
Any suggestions on how best to split this up by commas? Not all of the strings have the " ", for example the following string:
G234103,Essential,Protein Kinases,?,Cell cycle defects,PS00479,2,CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION,cytoplasm.
http://opencsv.sourceforge.net/
But if you really do need to reinvent the wheel (homework), you need to use a more complicated regular expression than just "what,ever".split(","). It's not simple though. And you might be better off creating your own custom Lexer. http://en.wikipedia.org/wiki/Lexical_analysis
This isn't too hard in your case. As you process your text character by character you just need to keep track of opening and closing quotes to decide when to ignore commas and when to act on them.
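The character-by-character approach can be sketched as follows (in Python for brevity; the same loop translates directly to Java, and split_csv_line is a name made up here). It only tracks whether the current position is inside quotes, drops the quote characters themselves, and does not restore doubled quotes inside a field.

```python
def split_csv_line(line):
    """Split on commas that are outside double quotes."""
    fields = []
    current = []
    in_quotes = False
    for ch in line:
        if ch == '"':
            # Toggle the state; the quote itself is not kept.
            in_quotes = not in_quotes
        elif ch == "," and not in_quotes:
            # A comma outside quotes ends the current field.
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields

line = 'G234101,Non-Essential,ATPases,Respiration chain complexes,"Auxotrophies, carbon and",PS00017,2,IONIC HOMEOSTASIS,mitochondria.'
print(split_csv_line(line))
```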
Also see StreamTokenizer for a built-in configurable Lexer - you should be able to use this to meet your requirements.
I would think of this as a multi-step process. First, find all the commas inside quotes in your original string and replace them with something like {comma}. You can do this with some regex. Then split the new string on the comma symbol (,). Then go through your list and replace each {comma} with the comma symbol (,).

how to built regular expression to get value between two single quotes and if there is no single qoute, extract between commas

Problem that I face:
- I have an input string, an SQL statement, that I need to parse
- I need to extract the values to be inserted, based on the column names specified
- I can extract values that are wrapped in two single quotes, but:
-- what about values that have no single quotes around them (like an integer or double)?
-- what if the value itself already contains single quotes (like 'James''s dictionary')?
Below is the sample input string:
INSERT INTO LJS1_DX (base, doc, key1, key2, no, sq, eq, ln, en, date, line)
VALUES ('GET','','#000210','',' 0',' 1','5',1,0,'20100706','Street''James''s dictionary')
The Java code I have below matches values between two single quotes only:
Pattern p = Pattern.compile("'.*?'");
columnValues = "'GET0','','#000210','',' 0',' 1','5',1,0,'20100706','Street''James''s dictionary'";
Matcher m = p.matcher(columnValues); // get a matcher object
StringBuffer output = new StringBuffer();
while (m.find()) {
logger.trace(m.group());
}
Appreciate if anyone can provide any guideline or sample to this question.
Thank you!!
I agree with gnibbler that this is a job for a csv parser.
A regex that works on your example would be
'(?:''|[^'])*'|[^',\s]+
which looks challenging to debug and maintain, doesn't it?
Explanation:
'            # First alternative: match an "opening" '
(?:          # followed by either...
  ''         # two ' in a row (escaped ')
 |           # or...
  [^']       # any character that is not a '
)*           # zero or more times,
'            # then match a "closing" '
|            # or (second alternative):
[^',\s]+     # match any run of characters except ', comma or whitespace
It also works if there is whitespace around the values/commas (and will leave that out of the match).
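Here is a small Python sketch that applies the pattern from the explanation to the VALUES list in the question and then unescapes the quoted values (extract_values is a hypothetical helper; the same pattern can be used with java.util.regex.Pattern).

```python
import re

# Either a single-quoted value (with '' as an escaped quote) or a bare run of
# characters that contains no quote, comma or whitespace.
VALUE = re.compile(r"'(?:''|[^'])*'|[^',\s]+")

def extract_values(column_values):
    """Return the values of a SQL VALUES list as plain strings."""
    values = []
    for token in VALUE.findall(column_values):
        if token.startswith("'"):
            # Strip the surrounding quotes and collapse '' to a single '.
            token = token[1:-1].replace("''", "'")
        values.append(token)
    return values

sample = "'GET','','#000210','',' 0',' 1','5',1,0,'20100706','Street''James''s dictionary'"
print(extract_values(sample))
```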
Regex are not really suitable for this. You will always find cases that fail
A csv parser such as opencsv is probably a better option
In general, when you need to parse complex languages, regexps are not the best tool - there's too much context to make sense of. So, if reading XML use an XML parser, if reading C code use a C language parser, and if reading SQL ...
There's a Java SQL parser here; I would use something like this.
For other languages it may be best to use a "YACC"-like parser, for example JACK.
Instead, you can get all the values by taking the substring after the VALUES keyword, and get the column names the same way. You will then have two comma-separated strings that can be converted to arrays, one for names and one for values, and you can check which parameter has which value.
hope this helps.
I think Tim had the right idea; it just needs to be implemented more efficiently. Here's a much more efficient version:
'[^']*+(?:''[^']*+)*+'|[^',\s]++
It uses Friedl's "unrolled loop" technique to avoid excessive reliance on alternations that match one or two characters at a time (I think that's what did you in, Tim), plus possessive quantifiers throughout.
Regular expressions are not easy to use with this (but everything is possible).
I would suggest parsing it yourself, or use a library to do the parsing. By writing the parser yourself you are certain that it works exactly as you need it to.
