Parsing CSV with a RegEx in java - escape double quote within cell - java

I am looking for a Java regex which will escape a double quote within an Excel cell.
I have followed this example but need another change in the regular expression to make it work when one of the cells contains a double quote.
Parsing CSV input with a RegEx in java
private final Pattern pattern = Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?=,|$)");
Example Data:
"A,B","2" size","text1,text2, text3"
The regex from above fails at 2".
I want the output to be as below. It doesn't matter whether the outer double quotes are there or not.
"A,B"
"2" size"
"text1,text2, text3"

While I agree that using a regex to parse CSV is not really the best way, a slightly better pattern is:
Pattern pattern = Pattern.compile("^\"([^\"]*)\",|,\"([^\"]*)\",|,\"([^\"]*)\"$|(?<=,|^)([^,]*)(?=,|$)");
This will terminate a cell value only after a quote and a comma, or start it after a comma and a quote.
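To see how that pattern behaves on the example data, here is a minimal sketch. The class and method names (QuotedCellDemo, cells) are made up for illustration; only the pattern itself comes from the answer above. Note that because the first alternative consumes the trailing comma, a quoted cell in the middle of the line is picked up by the last alternative and keeps its outer quotes, which the question says is acceptable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedCellDemo {
    // Pattern from the answer above: a quoted cell ends only at quote + comma
    // (or quote + end of line); anything else is taken as "text between commas".
    static final Pattern CELL = Pattern.compile(
        "^\"([^\"]*)\",|,\"([^\"]*)\",|,\"([^\"]*)\"$|(?<=,|^)([^,]*)(?=,|$)");

    // Hypothetical helper: collects whichever group took part in each match.
    static List<String> cells(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = CELL.matcher(line);
        while (m.find()) {
            for (int g = 1; g <= 4; g++) {
                if (m.group(g) != null) {
                    out.add(m.group(g));
                    break;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String cell : cells("\"A,B\",\"2\" size\",\"text1,text2, text3\"")) {
            System.out.println(cell);
        }
    }
}
```

For the example line this yields A,B then "2" size" (outer quotes kept) then text1,text2, text3.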

Well, as F.J commented, the input data is ambiguous. But for your example input, you could try the
string.split("\",\"") method to get a String[].
After this, you have an array with 3 elements:
[
"A,B,
2" size,
text1,text2, text3"
]
remove the first character (which is double quote) of the first element of the array
remove the last character (which is double quote) of the last element of the array
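Those steps could be sketched as follows. This assumes every cell is wrapped in double quotes and cells are separated by exactly quote, comma, quote with no spaces in between; the class and method names are only for illustration.

```java
import java.util.Arrays;

class SplitOnQuoteCommaQuote {
    // Hypothetical helper implementing the three steps described above.
    static String[] cells(String line) {
        // Step 1: split on the three-character separator ","
        String[] parts = line.split("\",\"");
        if (parts.length == 0) {
            return parts; // the line consisted of separators only
        }
        int last = parts.length - 1;
        // Step 2: drop the opening quote of the first element.
        if (parts[0].startsWith("\"")) {
            parts[0] = parts[0].substring(1);
        }
        // Step 3: drop the closing quote of the last element.
        if (parts[last].endsWith("\"")) {
            parts[last] = parts[last].substring(0, parts[last].length() - 1);
        }
        return parts;
    }

    public static void main(String[] args) {
        String line = "\"A,B\",\"2\" size\",\"text1,text2, text3\"";
        System.out.println(Arrays.toString(cells(line)));
    }
}
```

A cell whose content itself contains the sequence quote, comma, quote would still be split in two, which is the ambiguity mentioned above.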

Related

Using .split() for multiple characters in Java

I'm trying to split an input by ".,:;()[]"'\/!? " chars and add the words to a list. I've tried .split("\\W+?") and .split("\\W"), but both of them are returning empty elements in the list.
Additionally, I've tried .split("\\W+"), which returns only words without any special characters that should go along with them (for instance, if one of the input words is "C#", it writes "C" in the list). Lastly, I've also tried to put all of the special chars above into the .split() method: .split("\\.,:;\\(\\)\\[]\"'\\\\/!\\? "), but this isn't splitting the input at all. Could anyone advise please?
The split() function accepts a regex.
This is not the regex you're looking for: .split("\\.,:;\\(\\)\\[]\"'\\\\/!\\? "). It matches only that exact sequence of characters, not any one of them.
Try creating a character class like [.,:;()\[\]'\\\/!\?\s"] and add + to match one or more occurrences.
I also suggest replacing the space character with the generic \s, which matches all whitespace variations such as \t.
If you're sure about the list of characters you have selected as splitters, this should be your correct split with the correct Java string literal, as @Andreas suggested:
.split("[.,:;()\\[\\]'\\\\\\/!\\?\\s\"]+")
BTW: I've found a particularly useful Eclipse editor option which escapes text when you paste it into a string literal. Go to Window/Preferences and, under Java/Editor/Typing/, check the box next to "Escape text when pasting into a string literal".
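A quick way to try the character-class split from the answer above. The sample sentence and the names SplitOnPunctuation and words are invented for this sketch. Empty tokens are filtered out because split() still returns one empty leading element when the input starts with a delimiter.

```java
import java.util.ArrayList;
import java.util.List;

class SplitOnPunctuation {
    // Hypothetical helper: splits on runs of the listed punctuation/whitespace.
    static List<String> words(String input) {
        List<String> out = new ArrayList<>();
        for (String w : input.split("[.,:;()\\[\\]'\\\\\\/!\\?\\s\"]+")) {
            if (!w.isEmpty()) { // skip the empty token caused by a leading delimiter
                out.add(w);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // '#' is not in the character class, so "C#" stays intact.
        System.out.println(words("Hello, world! (C# rocks): \"yes\""));
    }
}
```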

Can't split at dot - Velocity

I would like to split a date at the dots between day.month.year. For example: 14.01.2015 to {14, 01, 2015}
This is the code that I used:
dates3.get(0) contains the String "14.01.2015", which I get from a text field of the page.
##Splitting startingDate by point
#set($dates4 = [])
#foreach($id in $dates3.get(0).split(".")) ##BUG
#set($foo = $dates4.add($id))
$id<br>
#end
The array does not contain anything afterwards, and when I print $id it just prints an empty line.
I found that when I use - as a delimiter it works, but only for the month value.
I have to put a - at the start and the end for it to work (like this: "-14-01-2015-") and use indexes 1-3 instead of 0-2, so that it works for all three values.
split() wants a regular expression (regex). The dot in a regex stands for "any character", so you need to escape it:
.split("\.")
(for the generic reader: in other contexts the backslash must be escaped by another backslash in order to survive the syntax of strings: .split("\\."))
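Since Velocity ends up calling Java's String.split(), the same effect can be reproduced in plain Java. This is only a sketch with invented names (SplitOnDot, parts); Pattern.quote is shown as an alternative to escaping the dot by hand.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitOnDot {
    // Hypothetical helper: splits a dd.MM.yyyy string on the literal dot.
    static String[] parts(String date) {
        return date.split("\\.");
    }

    public static void main(String[] args) {
        String date = "14.01.2015";
        // Unescaped dot: every character is a delimiter, so only empty
        // strings remain and split() drops them all.
        System.out.println(date.split(".").length);
        // Escaped dot: splits on the literal character.
        System.out.println(Arrays.toString(parts(date)));
        // Pattern.quote builds a literal pattern without manual escaping.
        System.out.println(Arrays.toString(date.split(Pattern.quote("."))));
    }
}
```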

Remove inner double quote in CSV file

I have a CSV file which contains double quote inside the data.
EI_SS
EI_HDR,"Test FileReport, for" Testing"
EI_DT,tx,tx,tx,tx,tx,tx,tx,tx,tx,tx,tx,tx,tx,tx,tx,dt8,tx
EI_COL,"Carrier, Name","Carrier ID","Account Name","Account ID","Group Name","Group ID","Action Code","File ID","Contract","PBP ID","Response Status","Claim Number","Claim Seq","HICN","Cardholder ID","Date of Service","NDC"
"Test Carrier" ,"MPDH5427" ,"1234" ,"CSNP" ,"TestD" Test" ,"FH14077" ,"" ,"PD14079042" ,"H5427" ,"077" ,"REJ" ,"133658279751004" ,"999" ,"304443938A" ,"P0002067501" ,01/01/2014,"50742010110" ,"C"
"Test, Carrier1" ,"BCRIMA" ,"Carrier" ,"14" ,"123333" ,"00000MCA0014001" ,"" ,"PD14024142" ,"H4152" ,"013" ,"REJ" ,"133658317280023" ,"999" ,"035225520A" ,"ZBM200416667" ,01/01/2014,"00378350505"
The Updated Actual CSV
Now I want to remove the inner quotes from these data but need to keep outer double quotes for each data.
For processing file, I have used "\"[a-zA-Z0-9 ]+[,][ a-zA-Z0-9]+\"" pattern to split the file. But if there is any inner quote in any row then the code breaks.
I need to convert this into XLSX, keeping the commas and replacing the inner quotes (or, if that is not possible, removing them).
Please help me to solve this issue.
I think it is not possible, because the way you demarcate two values is ambiguous. For example, how to split the following value?
""I am", "a single", ", value""
Is it meant to be:
I am
a single
, value
or
I am
a single, , value
or even
I am, a single, , value
?
First of all, why don't you use the proper regex shorthand?
There is a character class, \w, which means [a-zA-Z_0-9], instead of your [a-zA-Z0-9] (nearly the same, just adding _, but much more readable I think ^^)
As for your pattern, as others said, the best way is to fix the way the CSV is generated in the first place ;)
If your data has just one inner double quote per field, as in ,"abc "def", the following should help:
test.txt
"abc","def"gh","ijk"
"lmn","o"pq","rst"
sed -i -E 's/([^,])\"([^,])/\1\"\"\2/g' test.txt
The command above looks for a three-character sequence matching ?"? where each ? is any character except a comma. In other words, it finds a " that is not next to a comma and replaces it with "". The -E flag enables extended regular expressions, so the unescaped parentheses act as capture groups.
Command split:
([^,]) - character that is not a comma - () are for remembering this character
\" - Double quote
\1 - First character which is remembered
\2 - Second character which is remembered.
Note: this does not work if a field contains two inner double quotes separated by a single character, because matches cannot overlap. For example, the command above does not escape the second inner " in ,"a"b"cc",
Hope this helps a bit.
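For anyone doing this from Java instead of sed, the same substitution could be written with String.replaceAll. This is a sketch under the same assumptions as the sed command, plus one more: there is no whitespace between a closing quote and the following comma (the sample data in the question does have such spaces, which would be matched as inner quotes). The names EscapeInnerQuote and escape are invented.

```java
class EscapeInnerQuote {
    // Hypothetical helper: doubles a quote that has a non-comma character
    // on both sides, mirroring the sed substitution above.
    static String escape(String line) {
        return line.replaceAll("([^,])\"([^,])", "$1\"\"$2");
    }

    public static void main(String[] args) {
        System.out.println(escape("\"abc\",\"def\"gh\",\"ijk\""));
        System.out.println(escape("\"lmn\",\"o\"pq\",\"rst\""));
    }
}
```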

Regular Expression to select first five CSVs from a string

I have a CSV string like apple404, orange pie, wind\,cool, sun\\mooon, earth, in Java. To be precise each value of the csv string could be any thing provided commas and backslash are escaped using a back slash.
I need a regular expression to find the first five values. After some googling I came up with the following, but it won't allow escaped commas within the values.
Pattern pattern = Pattern.compile("([^,]+,){0,5}");
Matcher matcher = pattern.matcher("apple404, orange pie, wind\\,cool, sun\\\\mooon, earth,");
if (matcher.find()) {
System.out.println(matcher.group());
} else {
System.out.println("No match found.");
}
Does anybody know how to make it work for escaped commas within values?
Following negative look-behind based regex will work:
Pattern pattern = Pattern.compile("(?:.*?(?<!(?:(?<!\\\\)\\\\)),){0,5}");
However for full fledged CSV parsing better use a dedicated CSV parser like JavaCSV.
You can use String.split() here. By specifying the limit as 6, the first five elements (index 0 to 4) will always be the first five column values from your CSV string. If any extra column values are present, they all overflow into index 5.
The regex (?<!\\\\), makes sure the CSV string is only split at a , comma not preceded with a \.
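// Parenthesize the concatenation: a method call binds tighter than +,
// so without the parentheses split() would apply only to the second literal.
String[] cols = ("apple404, orange pie, wind\\,cool, sun\\\\mooon, earth, " +
"mars, venus, pluto").split("(?<!\\\\),", 6);
System.out.println(cols.length); // 6
System.out.println(Arrays.toString(cols)); // needs import java.util.Arrays
// [apple404,  orange pie,  wind\,cool,  sun\\mooon,  earth,  mars, venus, pluto]
System.out.println(cols[4]); // 5th = " earth" (fields keep their leading space)
System.out.println(cols[5]); // 6th, to discard = " mars, venus, pluto"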
This regular expression works well. It also properly recognizes not only backslash-escaped commas, but also backslash-escaped backslashes. Also, the matches it produces do not contain the commas.
/(?:\\\\|\\,|[^,])*/g
(I am using standard regular expression notation with the understanding that you would replace the delimiters with quote marks and double all backslashes when representing this regular expression within a Java string literal.)
example input
"apple404, orange pie, wind\,cool, sun\\,mooon, earth"
produces this output
"apple404"
" orange pie"
" wind\,cool"
" sun\\"
"mooon"
Note that the double backslash after "sun" is escaped and therefore does not escape the following comma.
The way this regular expression works is by atomizing the input, trying the longest sequences first: double backslashes (treated as a single two-character unit), then escaped commas (a second two-character unit), then any single non-comma character. Any number of these atoms are matched, so each match stops just before the next unescaped comma.
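In Java, looping with find() over the bare expression also produces empty matches at each comma, so the sketch below uses a slight variation that is my own assumption, not part of the answer: the field is captured in group 1 and the separator (a comma, or the end of input) in group 2. The names EscapedCommaFields and fields are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedCommaFields {
    // Group 1: the field (escaped backslash | escaped comma | any non-comma)*.
    // Group 2: what ends the field, either a comma or the end of the input.
    static final Pattern FIELD =
        Pattern.compile("((?:\\\\\\\\|\\\\,|[^,])*)(,|$)");

    // Hypothetical helper: returns every field, escapes left untouched.
    static List<String> fields(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(input);
        while (m.find()) {
            out.add(m.group(1));
            if (m.group(2).isEmpty()) {
                break; // separator was the end of input, so this was the last field
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "apple404, orange pie, wind\\,cool, sun\\\\,mooon, earth";
        for (String f : fields(input)) {
            System.out.println("\"" + f + "\"");
        }
    }
}
```

Unlike the listing above, this variant also returns the last field (" earth"), since it does not need a trailing comma.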
In order to obtain the first N fields, one may simply splice the array of matches from the previous answer or surround the main expression in additional parentheses, include an optional comma in order to match the contents between fields, anchor it to the beginning of the string to prevent the engine from returning further groups of N fields, and quantify it (with N = 5 here):
/^((?:\\\\|\\,|[^,])*,?){0,5}/g
Once again, I am using standard regular expression notation, but here I will also do the trivial exercise of quoting this as a Java string:
"^((?:\\\\\\\\|\\\\,|[^,])*,?){0,5}"
This is the only solution on this page so far which actually answers both parts of the precise requirements specified by the OP, "...commas and backslash are escaped using a back slash." For the input fi\,eld1\\,field2\\,field3\\,field4\\,field5\\,field6\\,, it properly matches only the first five fields fi\,eld1\\,field2\\,field3\\,field4\\,field5\\,.
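The claim above can be checked with a few lines of Java. This is just a sketch around the quoted pattern; the names FirstFiveFields and firstFive are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstFiveFields {
    // The Java-quoted pattern from the answer above.
    static final Pattern FIRST_FIVE =
        Pattern.compile("^((?:\\\\\\\\|\\\\,|[^,])*,?){0,5}");

    // Hypothetical helper: returns the text covering at most five fields.
    static String firstFive(String input) {
        Matcher m = FIRST_FIVE.matcher(input);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        // Runtime value: fi\,eld1\\,field2\\,field3\\,field4\\,field5\\,field6\\,
        String input = "fi\\,eld1\\\\,field2\\\\,field3\\\\,field4\\\\,"
            + "field5\\\\,field6\\\\,";
        System.out.println(firstFive(input));
    }
}
```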
Note: my first answer made the same assumption that is implicitly part of the OP's original code and example data, namely that a comma follows every field. The problem was that if the input has five fields or fewer, and the last field is not followed by a comma (equivalently, by an empty field), then the final field would not be matched. I did not like this, and so I updated both of my answers so that they do not require trailing commas.
The shortcoming with this answer is that it follows the OP's assumption that values between commas contain "anything" plus escaped commas or escaped backslashes (i.e., no distinction between strings in double quotes, etc., but only recognition of escaped commas and backslashes). My answer fulfills the criteria of that imaginary scenario. But in the real world, someone would expect to be able to use double quotes around a CSV field in order to include commas within a field without using backslashes.
So I echo the words of @anubhava and suggest that a "real" CSV parser should always be used when handling CSV data. Doing otherwise is just being a script kiddie and not in any way truly "handling" CSV data.

Regex to find commas within quoted elems of csv

I'm trying to replace commas with placeholder text within double-quoted elements of a CSV.
For instance, given this line in a CSV:
1,2,"three,four,five",6,7,8,"nine,ten",11,12
Using this regex (quotes escaped for Java):
(?<=\")([^"]+?),([^"]+?)(?=\")
I replace the first match with:
$1<COMMA>$2
Which gives me this result string:
1,2,"three<COMMA> four, five",6,7,8,"nine,ten",11,12
I repeat these steps against the resultString until there are no more matches. Here are the progressive result strings:
1,2,"three<COMMA> four, five",6,7,8,"nine,ten",11,12
1,2,"three<COMMA> four<COMMA> five",6,7,8,"nine,ten",11,12
1,2,"three<COMMA> four<COMMA> five",6<COMMA>7,8,"nine,ten",11,12
1,2,"three<COMMA> four<COMMA> five",6<COMMA>7<COMMA>8,"nine,ten",11,12
1,2,"three<COMMA> four<COMMA> five",6<COMMA>7<COMMA>8,"nine<COMMA>ten",11,12
1,2,"three<COMMA> four<COMMA> five",6<COMMA>7<COMMA>8,"nine<COMMA>ten",11,12
How can I tweak my regex so it only replaces the "," within the list items and not the delimiters themselves? In the 3rd iteration, I'm getting a match on: ",6,7,8,"
I tried to prevent this by having my lookbehind match only against one double quote with no double quotes around it, or groups of three double quotes, but ran into a "Look-behind group does not have an obvious maximum length" error.
You could change it so that the first matched character inside quotation marks can't be a comma: (?<=\")([^",][^"]*?),([^"]+?)(?=\").
Having said that, I don't think repeating the replacement until nothing matches is a very nice way of doing it. Personally I'd probably split the line into an array of strings at the unescaped commas, then iterate through the array and do a global search-and-replace (the /g modifier) on each "-delimited string in the array. But it's personal choice I suppose.
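A single-pass variation on that idea in Java: rather than splitting first, find each quoted element and replace the commas inside it. This is a sketch that assumes quoted elements never contain an escaped or doubled quote; the names CommaPlaceholder and protect are invented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommaPlaceholder {
    // A quoted element: opening quote, any non-quote characters, closing quote.
    static final Pattern QUOTED = Pattern.compile("\"[^\"]*\"");

    // Hypothetical helper: replaces commas only inside quoted elements.
    static String protect(String line) {
        Matcher m = QUOTED.matcher(line);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String replaced = m.group().replace(",", "<COMMA>");
            // quoteReplacement guards against $ or \ in the element's text.
            m.appendReplacement(sb, Matcher.quoteReplacement(replaced));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(protect("1,2,\"three,four,five\",6,7,8,\"nine,ten\",11,12"));
    }
}
```

Because each quoted element is handled exactly once, the delimiters between 6, 7 and 8 are never touched and no repeated passes are needed.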
After a quick google:
^(("(?:[^"]|"")*"|[^,]*)(,("(?:[^"]|"")*"|[^,]*))*)$
This matches single elements in line of the csv file.
http://www.kimgentes.com/worshiptech-web-tools-page/2008/10/14/regex-pattern-for-parsing-csv-files-with-embedded-commas-dou.html
