java regex to remove unwanted double quotes in csv - java

I have a CSV file that contains the following line. As you can see, the numbers are NOT enclosed in double quotes.
String theLine = "Corp:Industrial","5Nearest",51.93000000,"10:21:29","","","","10:21:29","7/5/2016","PER PHONE CALL WITH SAP, CORRECTING "C","359/317 97 SMRD 96.961 MADV",""
I read the line above and try to split it using this regex:
String[] tokens = theLine.split(",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))");
This doesn't split at every comma like I want it to.
"PER PHONE CALL WITH SAP, CORRECTING "C", is messing it up because it has an additional , (comma) and " (double quote). Can someone please help me write a regex that will escape an additional double quote and a comma within two double quotes?
I basically want :
"Corp:Industrial","5Nearest",51.93000000,"10:21:29","","","","10:21:29","7/5/2016","**PER PHONE CALL WITH SAP CORRECTING C**","359/317 97 SMRD 96.961 MADV",""

There are jobs that parsers are much better at than Regular Expressions, and this sort of thing is typically one of them. I'm not saying you can't make it work for you, but ... there are also open-source CSV Parsers you could press into service.
Having said that, your CSV looks suspect to me.
"PER PHONE CALL WITH SAP, CORRECTING "C",
That value has three quotes in it -- is it meant to represent a string with only a single quote inside? Or should the C be surrounded by quotes as well as the String?
Normally if you're going to include a double quote inside a double quote you need a special syntax for it. For CSV, the most common options would be doubling it, or escaping it with a character like a backslash:
"PER PHONE CALL WITH SAP, CORRECTING ""C""",
Or:
"PER PHONE CALL WITH SAP, CORRECTING \"C\"",
None of which will directly change your problem of using Regular Expressions, but once you have well-formed CSV, your odds of parsing it successfully go up.
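To illustrate why well-formed input helps: once embedded quotes are doubled, a small hand-written scanner is enough to split a line, no regex required. The following is only a minimal sketch, assuming quotes are escaped by doubling and that a field never spans several lines. The class and method names are made up for this example, and an established CSV library is likely still the safer choice.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not from the original post): splits ONE line of
// well-formed CSV in which embedded quotes are doubled ("").
final class CsvLineSplitter {

    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    // A doubled quote inside a quoted field is a literal quote.
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"');
                        i++; // skip the second quote of the pair
                    } else {
                        inQuotes = false; // closing quote
                    }
                } else {
                    current.append(c); // commas in here are plain data
                }
            } else if (c == '"') {
                inQuotes = true; // opening quote
            } else if (c == ',') {
                fields.add(current.toString()); // separator: field is complete
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // the last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        // A shortened version of the line from the question, with the inner quotes doubled.
        String line = "\"Corp:Industrial\",51.93000000,"
                + "\"PER PHONE CALL WITH SAP, CORRECTING \"\"C\"\"\",\"\"";
        for (String field : split(line)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

With the doubled form, the comma and the quotes around C stay inside a single field instead of producing extra columns.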

Related

storing CSV files containing floating point numbers with phone language set to Deutsch [duplicate]

I've got a two column CSV with a name and a number. Some people's name use commas, for example Joe Blow, CFA. This comma breaks the CSV format, since it's interpreted as a new column.
I've read up and the most common prescription seems to be replacing that character, or replacing the delimiter, with a new value (e.g. this|that|the, other).
I'd really like to keep the comma separator (I know excel supports other delimiters but other interpreters may not). I'd also like to keep the comma in the name, as Joe Blow| CFA looks pretty silly.
Is there a way to include commas in CSV columns without breaking the formatting, for example by escaping them?
To encode a field containing comma (,) or double-quote (") characters, enclose the field in double-quotes:
field1,"field, 2",field3, ...
Literal double-quote characters are typically represented by a pair of double-quotes (""). For example, a field exclusively containing one double-quote character is encoded as """".
For example:
Sheet: |Hello, World!|You "matter" to us.|
CSV: "Hello, World!","You ""matter"" to us."
More examples (sheet → csv):
regular_value → regular_value
Fresh, brown "eggs" → "Fresh, brown ""eggs"""
" → """"
"," → ""","""
,,," → ",,,"""
,"", → ","""","
""" → """"""""
See wikipedia.
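The quoting rule above can be sketched in a few lines of Java. This is an illustration rather than a complete CSV writer: the class and method names are invented here, and also quoting fields that contain line breaks is an assumption borrowed from RFC 4180, not something stated in this answer.

```java
// Hypothetical helper: encodes a single CSV field using the
// "wrap in quotes, double any embedded quote" rule described above.
final class CsvFieldEncoder {

    static String encode(String field) {
        // Assumption: line breaks are quoted too (as RFC 4180 suggests).
        boolean needsQuotes = field.contains(",")
                || field.contains("\"")
                || field.contains("\n")
                || field.contains("\r");
        if (!needsQuotes) {
            return field; // plain values are written unchanged
        }
        // Double every embedded quote, then wrap the whole field in quotes.
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        // Note: no space after the separator, so the quote starts the field.
        String row = String.join(",", encode("Joe Blow, CFA"), encode("42"));
        System.out.println(row); // prints "Joe Blow, CFA",42
    }
}
```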
I found that some applications, like Numbers on the Mac, ignore the double quote if there is a space before it.
a, "b,c" doesn't work while a,"b,c" works.
The problem with the CSV format is that there's not one spec: there are several accepted methods, with no way of distinguishing which should be used (for generating or interpreting). I discussed all the methods for escaping characters (newlines in that case, but the same basic premise) in another post. Basically, it comes down to using a CSV generation/escaping process suited to the intended users, and hoping the rest don't mind.
Reference spec document.
If you want to do what you described, you can use quotes. Something like this:
$name = "Joe Blow, CFA.";
$arr[] = "\"".$name."\"";
Now you can use a comma in your name variable.
You need to quote those values.
Here is a more detailed spec.
In addition to the points in other answers: one thing to note if you are using quotes in Excel is the placement of your spaces. If you have a line of code like this:
print '%s, "%s", "%s", "%s"' % (value_1, value_2, value_3, value_4)
Excel will treat the initial quote as a literal quote instead of using it to escape commas. Your code will need to change to
print '%s,"%s","%s","%s"' % (value_1, value_2, value_3, value_4)
It was this subtlety that brought me here.
You can use Template literals (Template strings)
e.g -
`"${item}"`
CSV files can actually be formatted using different delimiters, comma is just the default.
You can use the sep flag to specify the delimiter you want for your CSV file.
Just add the line sep=; as the very first line in your CSV file, that is if you want your delimiter to be semi-colon. You can change it to any other character.
This isn't a perfect solution, but you can replace every comma with ‚ (a low quotation mark). It looks very similar to a comma and serves the same visual purpose, and no quotes are required.
In JS this would be:
stringVal.replaceAll(',', '‚')
You will need to be very careful in cases where you need to compare that data directly, though.
Depending on your language, there may be a to_json method available. That will escape many things that break CSVs.
I faced the same problem and quoting the , did not help. Eventually, I replaced the , with +, finished the processing, saved the output into an outfile and replaced the + with ,. This may seem ugly but it worked for me.
May not be what is needed here but it's a very old question and the answer may help others. A tip I find useful with importing into Excel with a different separator is to open the file in a text editor and add a first line like:
sep=|
where | is the separator you wish Excel to use.
Alternatively you can change the default separator in Windows but a bit long-winded:
Control Panel>Clock & region>Region>Formats>Additional>Numbers>List separator [change from comma to your preferred alternative]. That means Excel will also default to exporting CSVs using the chosen separator.
You could encode your values, for example in PHP base64_encode($str) / base64_decode($str)
IMO this is simpler than doubling up quotes, etc.
https://www.php.net/manual/en/function.base64-encode.php
The encoded values will never contain a comma so every comma in your CSV will be a separator.
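The same idea in Java could look roughly like this, using the standard java.util.Base64 class. It is only a sketch: the class and method names are invented for illustration, and UTF-8 is an assumed encoding. The obvious trade-off is that the file is no longer readable in a spreadsheet or text editor without decoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: stores every value Base64-encoded, so a comma
// in the row can only ever be a separator.
final class Base64CsvRow {

    static String encodeRow(String... values) {
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                row.append(',');
            }
            byte[] bytes = values[i].getBytes(StandardCharsets.UTF_8);
            // The standard Base64 alphabet is A-Z, a-z, 0-9, '+', '/' and '='.
            row.append(Base64.getEncoder().encodeToString(bytes));
        }
        return row.toString();
    }

    static String[] decodeRow(String row) {
        // The -1 limit keeps trailing empty cells.
        String[] cells = row.split(",", -1);
        String[] values = new String[cells.length];
        for (int i = 0; i < cells.length; i++) {
            values[i] = new String(Base64.getDecoder().decode(cells[i]), StandardCharsets.UTF_8);
        }
        return values;
    }

    public static void main(String[] args) {
        String row = encodeRow("Joe Blow, CFA", "He said \"hi\"");
        System.out.println(row);
        for (String value : decodeRow(row)) {
            System.out.println(value);
        }
    }
}
```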
You can set the Text_Qualifier field in your Flat File Connection Manager to ". This should wrap your data in quotes and only separate on commas which are outside the quotes.
First, if the item value contains a double-quote character ("), replace each one with two double-quote characters (""):
item = item.ToString().Replace("""", """""")
Finally, wrap the item value: on the left with a double-quote character ("), and on the right with a double-quote character (") followed by a comma (,):
csv += """" & item.ToString() & ""","
Plain double quotes did not work for me; what worked was writing each quote as \" in the C string literal. To place a literal double quote inside a quoted field, write \"\".
You can build formulas this way, for example:
fprintf(strout, "\"=if(C3=1,\"\"\"\",B3)\"\n");
writes "=if(C3=1,"""",B3)" to the CSV, which Excel reads as:
=IF(C3=1,"",B3)
A C# method for escaping delimiter characters and quotes in column text. It should be all you need to ensure your csv is not mangled.
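private string EscapeDelimiter(string field)
{
    // yourEscapeCharacter is a placeholder for the delimiter in use, e.g. ",".
    // Quote the field if it contains the delimiter or a double quote.
    if (field.Contains(yourEscapeCharacter) || field.Contains("\""))
    {
        // Double any embedded quotes, then wrap the whole field in quotes.
        field = field.Replace("\"", "\"\"");
        field = $"\"{field}\"";
    }
    return field;
}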

Regex to clean csv from confusing characters

My problem is:
I'm using a CSV that came out of some software, and that software doesn't handle CSV well: some strings in the file contain quotes, and strings are also wrapped in quotes, so I'm having trouble parsing it.
So this is normal CSV:
"one","two","three"
and here is my case:
"one","tw"o","three"
So I'm having issues parsing strings like "tw"o". This is basically a problem with the software that is outputting the file, and I can't edit that software.
So I thought I could create a regex that removes the unnecessary quotes or commas and makes sure that each string is wrapped in quotes and delimited by a comma. Does someone know how I can achieve that?
I'm using the tototoshi library for Scala.
I tried Python csv module, and it was able to do that (sounds like a hack but the input file is wrong after all, and using regex would be a hack too):
import csv
z = '''"one","tw"o","three"'''
cr = csv.reader([z])
print(next(cr))
result:
['one', 'two"', 'three']
For some reason, the quote has been moved to the end of the string (a valid way to put a double quote in a field would be to double it).
To remove it, replace the print above with:
print([x.replace('"',"") for x in next(cr)])
to get
['one', 'two', 'three']
Note that csv will yield 4 fields for "one","tw",o","three", so if the stray quote is followed by a comma, nothing works; only human verification can fix this.
One pretty simple regex solution that may work for you is this:
regex: (?<=\w)"(?=\w) //global flag
replace: '' //blank string
As long as we can view "bad" double quotes as those that are surrounded by alphanumerics, this will work. It's just a lookbehind for an alphanumeric, a double quote, and a lookahead for an alphanumeric. It would not match a double quote escaped with a backslash or another double quote, so "" or \" would be okay.
demo here
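In Java, the same lookaround idea might look like the sketch below. The class and method names are made up for illustration; the pattern is the one from this answer written as a Java string literal, and replaceAll already replaces every match, so no global flag is needed.

```java
// Hypothetical helper: removes double quotes that sit directly between
// two word characters, e.g. the middle quote in "tw"o".
final class StrayQuoteCleaner {

    // (?<=\w)  lookbehind: a word character precedes the quote
    // "        the stray quote itself
    // (?=\w)   lookahead: a word character follows the quote
    private static final String STRAY_QUOTE = "(?<=\\w)\"(?=\\w)";

    static String clean(String line) {
        return line.replaceAll(STRAY_QUOTE, "");
    }

    public static void main(String[] args) {
        // prints "one","two","three"
        System.out.println(clean("\"one\",\"tw\"o\",\"three\""));
    }
}
```

Quotes next to a space or a comma are left alone, so a value such as "PER PHONE CALL WITH SAP, CORRECTING "C" would not be repaired by this pattern.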
Looks like you can't predict what kind of values with unescaped quotes you might get. There's no way to clean this up reliably with regex.
Maybe try univocity-parsers as it has a CSV parser that can handle this sort of input properly. Example:
//first configure the parser
CsvParserSettings settings = new CsvParserSettings();
//override the default unescape quote handling. This seems more appropriate for your case.
settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_CLOSING_QUOTE);
//then create a parser and parse your input line:
CsvParser parser = new CsvParser(settings);
List<String[]> results = parser.parseAll(<your input here>);
Hope it helps.
Disclaimer: I'm the author of this library. It's open-source and free (Apache v2.0 license)

Java regex to match quoted numbers

I need to clean up a JSON including incorrectly quoted numbers via a short Java (not JS!) Regex snippet. Example for what I have:
[{"series":"a","x":"1","y":"111.71"},{"series":"a","x":"2","y":"120.25"}]
Example for what I would need to get:
[{"series":"a","x":1,"y":111.71},{"series":"a","x":2,"y":120.25}]
So I only need to match and eliminate quote characters if preceded or followed by [0-9], but how to avoid replacing part of the number is beyond my lowly regex skills.
Any help greatly appreciated!
EDIT (2nd round):
Thanks for the fast feedback! I'm not too worried about false positives since I can control the contents of the descriptors, and I'll make sure they're text-only. Spaces can be avoided as well, only negative numbers might occur - good one! Separators are always commas (",") for the JSON, and the decimals of the double values (of which there can be any number) are always separated by dots ("."). I cannot fix the JSON source unfortunately, and I definitely want to clean this up in Java.
Trying out the suggestions now and reporting back. I'll also toy around with this: http://www.regular-expressions.info/lookaround.html#lookbehind
How about replaceAll("\"(-?\\d+([.]\\d+)?)\"","$1");
This works for your specific example, but would not work if other numbers have a different format (for example, negative numbers):
String s = "[{\"series\":\"a\",\"x\":\"1\",\"y\":\"111.71\"},{\"series\":\"a\",\"x\":\"2\",\"y\":\"120.25\"}]";
String clean = s.replaceAll("\"(\\d+\\.?\\d*)\"", "$1");
System.out.println(clean);
outputs:
[{"series":"a","x":1,"y":111.71},{"series":"a","x":2,"y":120.25}]
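Since the question mentions that negative numbers might occur, here is a small sketch that combines both suggestions: an optional minus sign and an optional fractional part. The class and method names are invented for this example. It assumes, as the asker states, that text values and keys never consist of digits only, because such strings would lose their quotes as well.

```java
// Hypothetical helper: strips the quotes around values that look like
// plain decimal numbers, such as "1", "-2" or "111.71".
final class QuotedNumberCleaner {

    // "        opening quote
    // -?       optional minus sign
    // \d+      integer part
    // (?:\.\d+)?  optional fractional part
    // "        closing quote
    private static final String QUOTED_NUMBER = "\"(-?\\d+(?:\\.\\d+)?)\"";

    static String unquoteNumbers(String json) {
        // $1 re-inserts the captured number without the surrounding quotes.
        return json.replaceAll(QUOTED_NUMBER, "$1");
    }

    public static void main(String[] args) {
        String s = "[{\"series\":\"a\",\"x\":\"-1\",\"y\":\"111.71\"}]";
        // prints [{"series":"a","x":-1,"y":111.71}]
        System.out.println(unquoteNumbers(s));
    }
}
```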

In Java (Pig) Regex, how could I do the following?

I have data coming in a txt file delimited by pipes. The unfortunate thing is 2 fields can have multiple values. To separate these multiples, the sender used pipes again, but put quotes around it. My regex worked for months until a certain rare situation...
Regex currently:
([^\|]*)\|"?([^"]*)"?\|([^\|]*)\|"?([^"]*)"?
And it worked for the following situation which happens most of the time:
abc|"part1|part2"|abc|"tool1|tool2"
But this case is where the ([^"]*) jumps ahead and takes all from the blank to the end of the quotes:
abc||abc|"tool1|tool2"
So I realize I must account for when there is a pipe next instead of a quote.
Just not sure how.............
P.S. For those PIG people that might be looking at this, I removed a backslash from each escape, to make it look more like Java, but in PIG you need 2, fyi.
In your expression you need to specify that the part between |s can be either quoted or not quoted. You can do it as follows:
(("[^"]*")|((?!")[^|]*))
Now you can repeat this part several times with |s in between, as you need.
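As a rough sketch of that repetition in plain Java (single backslashes doubled only for the string literal), the field pattern can be joined with literal pipes and matched against the whole line. The four-field layout is an assumption taken from the regex in the question, and the class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: parses a four-field, pipe-delimited record in which
// a field may be quoted so that it can contain pipes itself.
final class PipeRecordParser {

    // One field: either a quoted run (pipes allowed inside), or an
    // unquoted run that does not start with a quote and has no pipes.
    private static final String FIELD = "(\"[^\"]*\"|(?!\")[^|]*)";

    private static final Pattern RECORD = Pattern.compile(
            FIELD + "\\|" + FIELD + "\\|" + FIELD + "\\|" + FIELD);

    static List<String> parse(String line) {
        Matcher m = RECORD.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected record: " + line);
        }
        List<String> fields = new ArrayList<>();
        for (int i = 1; i <= m.groupCount(); i++) {
            String raw = m.group(i);
            // Only the quoted alternative can start with a quote; strip the pair.
            if (raw.length() >= 2 && raw.startsWith("\"")) {
                raw = raw.substring(1, raw.length() - 1);
            }
            fields.add(raw);
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parse("abc|\"part1|part2\"|abc|\"tool1|tool2\""));
        // The rare case from the question: an empty, unquoted second field.
        System.out.println(parse("abc||abc|\"tool1|tool2\""));
    }
}
```

Because the unquoted alternative can match an empty string, the blank second field no longer lets the pattern run ahead to the next quote.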

Regular expression for splitting JSON text in lines after symbols

I am trying to use a regular expression to have this kind of string
{
"key1"
:
value1
,
"key2"
:
"value2"
,
"arrayKey"
:
[
{
"keyA"
:
valueA
,
"keyB"
:
"valueB"
,
"keyC"
:
[
0
,
1
,
2
]
}
]
}
from
JSONObject.toString()
that is one long line of text in my Android Java app
{"key1":"value1","key2":"value2","arrayKey":[{"keyA":"valueA","keyB":"valueB","keyC":[0,1,2]}]}
I found this regular expression for finding all commas.
/(,)(?=(?:[^"]|"[^"]*")*$)/
Now I need to know:
0- if this is reliable, that is, does what they say.
1- if this works also with commas inside double-quotes.
2- if this takes into account escaped double-quotes.
3- if I have to take into account also single quotes, as this file is produced by my app but occasionally it could be manually edited by the user.
5- It has to be used with the multi-line flag to work with multi-line text.
6- It has to work with replaceAll().
The resulting regular expression will be used for replacing each symbol with a two-char sequence made of the symbol itself plus the \n character.
The resulting text has to be still JSON text.
Subsequent replace actions will take place also for the other symbols
: [ ] { }
and other symbols that can be found in JSON files outside the alphanumeric sequences between quotes (I do not know if the mentioned symbols are the only ones).
It's not that simple, but if you want to do it, you need to find the structural characters ([, {, ", ', :) and replace each one with itself followed by a newline character.
For example:
[ should get replaced with [\n
The answer to your question is yes: it is quite reliable and good to implement, and it is just a single line of code doing it all. That's what regex is made for.
0- if this is reliable, that is, does what they say.
Let's break down the expression a little:
(,) is a capturing group that matches a single comma
(?=...) would mean a positive lookahead meaning the comma would need to be followed by a match of that group's content
(?:...)* would be a non-capturing group that can occur 0 to many times
[^"]|"[^"]*" would match either any character except a double quote ([^"]) or (|) a pair of double quotes with any character in between except other double quotes ("[^"]*")
As you can see especially the last part could make it unreliable if there are escaped double quotes in a text value, so the answer would be "this is reliable if the input is simple enough".
1- if this works also with commas inside double-quotes.
If the double quote pairs are correctly identified any commas in between would be ignored.
2- if this takes into account escaped double-quotes.
Here's one of the major problems: escaped double quotes would need to be handled. This can get quite complex if you want to handle arbitrary cases, especially if the texts could contain commas as well.
3- if I have to take into account also single quotes, as this file is produced by my app but occasionally it could be manually edited by the user.
Single quotes aren't allowed by the JSON specification, but many parsers support them because humans tend to use them anyway. Thus you might need to take them into account, and that makes no. 2 even more complex because now there might be an unescaped double quote inside a single-quoted text.
5- It has to be used with the multi-line flag to work with multi-line text.
I'm not entirely sure about that, but adding the multi-line flag shouldn't hurt. You could add it to the expression itself though, i.e. by prepending (?m).
6- It has to work with replaceAll().
In its current form the regex would work with String#replaceAll() because it only matches the comma - the lookahead is used to determine a match but won't result in the wrong parts being replaced. The matches themselves might not be correct though, as described above.
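A small sketch of that behaviour, using the pattern from the question as a Java string literal (the class and method names are invented for illustration). It shows both the case that works and the escaped-quote case that breaks, which is the limitation described under points 0 and 2.

```java
// Hypothetical helper: inserts a newline after every comma that the
// lookahead considers to be outside a pair of double quotes.
final class JsonCommaBreaker {

    // A comma followed, up to the end of the input, only by non-quote
    // characters or complete "..." pairs. Escaped quotes are NOT understood.
    private static final String COMMA_OUTSIDE_QUOTES =
            ",(?=(?:[^\"]|\"[^\"]*\")*$)";

    static String breakAfterCommas(String json) {
        return json.replaceAll(COMMA_OUTSIDE_QUOTES, ",\n");
    }

    public static void main(String[] args) {
        // Simple input: the comma inside "a,b" is left alone.
        System.out.println(breakAfterCommas("{\"key1\":\"a,b\",\"keyC\":[0,1,2]}"));

        // An escaped quote after the comma throws the quote pairing off,
        // so the comma INSIDE the string value gets a newline as well.
        System.out.println(breakAfterCommas("{\"a\":\"x,y\\\"z\",\"b\":1}"));
    }
}
```

The second call inserts a raw line break inside a string value, so its output is no longer valid JSON.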
That being said, you should note that JSON is not a regular language and only regular languages are a perfect fit for regular expressions.
Thus I'd recommend using a proper JSON parser (there are quite a lot out there) to parse the JSON into POJOs (might just be a bunch of generic JsonObject and JsonArray instances) and reformat that according to your needs.
Here's an example of how Jackson could be used to accomplish that: https://kodejava.org/how-to-pretty-print-json-string-using-jackson/
In fact, since you're already using JSONObject.toString() you probably don't need the parser itself but just a proper formatter (if you want/need to roll your own, you could have a look at the org.json.JSONObject sources).
