Remove inner double quote in CSV file

I have a CSV file that contains double quotes inside the data:
EI_SS
EI_HDR,"Test FileReport, for" Testing"
EI_DT,tx,tx,tx,tx,tx,tx,tx,tx,tx,tx,tx,tx,tx,tx,tx,dt8,tx
EI_COL,"Carrier, Name","Carrier ID","Account Name","Account ID","Group Name","Group ID","Action Code","File ID","Contract","PBP ID","Response Status","Claim Number","Claim Seq","HICN","Cardholder ID","Date of Service","NDC"
"Test Carrier" ,"MPDH5427" ,"1234" ,"CSNP" ,"TestD" Test" ,"FH14077" ,"" ,"PD14079042" ,"H5427" ,"077" ,"REJ" ,"133658279751004" ,"999" ,"304443938A" ,"P0002067501" ,01/01/2014,"50742010110" ,"C"
"Test, Carrier1" ,"BCRIMA" ,"Carrier" ,"14" ,"123333" ,"00000MCA0014001" ,"" ,"PD14024142" ,"H4152" ,"013" ,"REJ" ,"133658317280023" ,"999" ,"035225520A" ,"ZBM200416667" ,01/01/2014,"00378350505"
Now I want to remove the inner quotes from this data while keeping the outer double quotes around each value.
To process the file, I used the pattern "\"[a-zA-Z0-9 ]+[,][ a-zA-Z0-9]+\"" to split it, but the code breaks whenever a row contains an inner quote.
I need to convert this into XLSX, keeping the commas and replacing the inner quotes (or, if that is not possible, removing them).
Please help me solve this issue.

I think it is not possible, because the way you demarcate two values is ambiguous. For example, how to split the following value?
""I am", "a single", ", value""
Is it meant to be:
I am
a single
, value
or
I am
a single, , value
or even
I am, a single, , value
?
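One way past the ambiguity is to pick a convention and accept that it will misread some inputs. The following is a minimal sketch, not a general CSV repair tool. It assumes that a quote delimits a field only when nothing but whitespace separates it from a comma or a line boundary, and it doubles every other quote (the escaping convention described in RFC 4180). The class and method names are hypothetical.

```java
final class InnerQuoteFixer {
    // Doubles every quote that does not look like a field delimiter.
    static String escapeInnerQuotes(String line) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"' && !opensField(line, i) && !closesField(line, i)) {
                out.append("\"\""); // inner quote: escape it by doubling
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Assumption: an opening quote has only whitespace between it and
    // the previous comma (or the start of the line).
    static boolean opensField(String line, int i) {
        int j = i - 1;
        while (j >= 0 && Character.isWhitespace(line.charAt(j))) {
            j--;
        }
        return j < 0 || line.charAt(j) == ',';
    }

    // Assumption: a closing quote has only whitespace between it and
    // the next comma (or the end of the line).
    static boolean closesField(String line, int i) {
        int j = i + 1;
        while (j < line.length() && Character.isWhitespace(line.charAt(j))) {
            j++;
        }
        return j == line.length() || line.charAt(j) == ',';
    }
}
```

With the question's data, "TestD" Test" , becomes "TestD"" Test" ,. A value that itself contains ", " (as in the example above) is still split in the wrong place; no heuristic can recover that.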

First of all, why not use the predefined character classes of regex?
There is \w, which means [a-zA-Z_0-9]; it is nearly the same as your [a-zA-Z0-9] (it only adds _) but much more readable, I think.
As for your pattern, as others said, the best fix is to correct the way the CSV is generated in the first place.

If your data has just one stray double quote per field, as in ,"abc "def", the following should help:
test.txt
"abc","def"gh","ijk"
"lmn","o"pq","rst"
sed -i -E 's/([^,])"([^,])/\1""\2/g' test.txt
The command looks for a three-character sequence of the form ?"?, where ? is anything but a comma, and replaces the " with "". In other words, a quote that is not next to a comma is treated as an inner quote.
Command breakdown:
-E - use extended regular expressions, so that ( ) act as capture groups without backslashes
([^,]) - a character that is not a comma; the parentheses capture it
" - a double quote
\1 - the first captured character
\2 - the second captured character
Note: this does not work if a field contains two inner double quotes separated by a single character, because the two matches would overlap. For example, the command escapes only the first inner " in ,"a"b"cc",
Hope this helps a bit.
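For a Java program, roughly the same idea can be written with String.replaceAll. This sketch swaps the capture groups for lookarounds (my substitution, not part of the sed command), so the neighbouring characters are not consumed and two inner quotes close together are both escaped. It still assumes that delimiting quotes always touch a comma or a line boundary; the class name is hypothetical.

```java
final class SedLikeQuoteEscaper {
    // A quote with a non-comma character on both sides is treated as an
    // inner quote and doubled. Quotes at the start or end of the line, or
    // next to a comma, are left alone.
    static String escapeInnerQuotes(String line) {
        return line.replaceAll("(?<=[^,])\"(?=[^,])", "\"\"");
    }
}
```

Note that data such as "TestD" Test" ,"FH14077" (with a space between the closing quote and the comma) breaks this assumption, so the closing quote would be doubled as well.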

Related

Split String While Ignoring Escaped Character

I want to split a string along spaces, ignoring spaces if they are contained inside single quotes, and ignoring single quotes if they are escaped (i.e., \' )
I have the following completed from another question.
String s = "Some message I want to split 'but keeping this a\'s a single string' Voila!";
for (String a : s.split(" (?=([^\']*\'[^\"]*\')*[^\']*$)")) {
    System.out.println(a);
}
The output of the above code is
Some
message
I
want
to
split
'but
keeping
this
'a's a single string'
Voila!
However, I need single quotes to be ignored if they are escaped ( \' ), which the above does not do. Also, I need the first and last single quotes removed, and the backslashes removed if and only if they are escaping a single quote (so that 'this is a \'string' would become this is a 'string). I have no idea how to use regex. How would I accomplish this?

You need to use a negative lookbehind to take care of escaped single quotes:
String str =
    "Some message I want to split 'but keeping this a\\'s a single string' Voila!";
String[] toks = str.split( " +(?=((.*?(?<!\\\\)'){2})*[^']*$)" );
for (String tok : toks)
    System.out.printf("<%s>%n", tok);
output:
<Some>
<message>
<I>
<want>
<to>
<split>
<'but keeping this a\'s a single string'>
<Voila!>
PS: As you noted, an escaped single quote must be typed as \\' in a Java string literal; otherwise it is treated as a plain '.

Or you could use this pattern to capture what you want (a quote inside the quoted part is accepted only when a backslash precedes it):
('(?:[^']|(?<=\\\\)')*'|\S+)

I was really overthinking this one.
This should work, and the best part is that it doesn't use lookarounds at all (so it works in nearly every regex implementation, most famously JavaScript):
('[^']*?(?:\\'[^']*?)*'|[^\s]+)
Instead of using a split, use a match to build an array with this regex.
My objectives were:
It can discern between an escaped apostrophe and an unescaped one (of course).
It's fast. The behemoth I wrote before actually took time.
It works with multiple subquotes; a lot of the suggestions here don't.
Demo
Test String: Discerning between 'the single quote\'s double purpose' as a 'quote marker', like ", and a 'a cotraction\'s marker.'.
If you asked the author and he was speaking in the third person, he would say 'CFQueryParam\'s example is contrived, and he knew that but he had the world\'s most difficult time thinking up an example.'
Some message I want to split 'but keeping this a\'s a single string' Voila!
Result: Discerning, between, 'the single quote\'s double purpose', as, a, 'quote marker',,, like, ",, and, a, 'a cotraction\'s marker.',.,
If, you, asked, the, author, and, he, was, speaking, in, the, third, person,, he, would, say, 'CFQueryParam\'s example is contrived, and he knew that but he had the world\'s most difficult time thinking up an example.',
Some, message, I, want, to, split, 'but keeping this a\'s a single string', Voila!
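In Java, the match-instead-of-split approach could look like the sketch below. It reuses the lookaround-free regex above and additionally strips the outer quotes and unescapes \' inside quoted tokens, which the question asked for. The class and method names are hypothetical, and the unescaping step is my addition rather than part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuoteAwareTokenizer {
    // Regex level: '[^']*?(?:\\'[^']*?)*'|[^\s]+
    private static final Pattern TOKEN =
            Pattern.compile("'[^']*?(?:\\\\'[^']*?)*'|[^\\s]+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            String token = m.group();
            // Quoted token: drop the outer quotes, then turn \' into '.
            if (token.length() >= 2 && token.startsWith("'") && token.endsWith("'")) {
                token = token.substring(1, token.length() - 1).replace("\\'", "'");
            }
            tokens.add(token);
        }
        return tokens;
    }
}
```

Unquoted tokens are returned unchanged, so a backslash outside single quotes is kept as-is.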

Regular Expression to select first five CSVs from a string

I have a CSV string like apple404, orange pie, wind\,cool, sun\\mooon, earth, in Java. To be precise, each value of the CSV string could be anything, provided commas and backslashes are escaped using a backslash.
I need a regular expression to find the first five values. After some googling I came up with the following, but it won't allow escaped commas within the values.
Pattern pattern = Pattern.compile("([^,]+,){0,5}");
Matcher matcher = pattern.matcher("apple404, orange pie, wind\\,cool, sun\\\\mooon, earth,");
if (matcher.find()) {
    System.out.println(matcher.group());
} else {
    System.out.println("No match found.");
}
Does anybody know how to make it work for escaped commas within values?

The following negative look-behind based regex will work:
Pattern pattern = Pattern.compile("(?:.*?(?<!(?:(?<!\\\\)\\\\)),){0,5}");
However, for full-fledged CSV parsing it is better to use a dedicated CSV parser like JavaCSV.

You can use String.split() here. With the limit set to 6, the first five elements (index 0 to 4) are always the first five column values from your CSV string. Any extra column values all overflow into index 5.
The regex (?<!\\\\), makes sure the CSV string is only split at a comma that is not preceded by a \.
// The parentheses make split() apply to the whole concatenated string,
// not just to the second literal. Arrays needs import java.util.Arrays.
String[] cols = ("apple404, orange pie, wind\\,cool, sun\\\\mooon, earth, " +
        "mars, venus, pluto").split("(?<!\\\\),", 6);
System.out.println(cols.length); // 6
System.out.println(Arrays.toString(cols));
// [apple404,  orange pie,  wind\,cool,  sun\\mooon,  earth,  mars, venus, pluto]
// (every value after the first keeps its leading space)
System.out.println(cols[4]); // 5th value: " earth"
System.out.println(cols[5]); // remainder, to be discarded: " mars, venus, pluto"

This regular expression works well. It also properly recognizes not only backslash-escaped commas, but also backslash-escaped backslashes. Also, the matches it produces do not contain the commas.
/(?:\\\\|\\,|[^,])*/g
(I am using standard regular expression notation with the understanding that you would replace the delimiters with quote marks and double all backslashes when representing this regular expression within a Java string literal.)
example input
"apple404, orange pie, wind\,cool, sun\\,mooon, earth"
produces this output
"apple404"
" orange pie"
" wind\,cool"
" sun\\"
"mooon"
Note that the double backslash after "sun" is escaped and therefore does not escape the following comma.
The regular expression works by atomizing the input, trying the longest sequences first: double backslashes (treated as one multi-character atom), then escaped commas (a second multi-character atom), then any single non-comma character. Any number of these atoms are matched, so each match ends just before an unescaped comma or at the end of the input.
In order to obtain the first N fields, one may simply splice the array of matches from the previous answer or surround the main expression in additional parentheses, include an optional comma in order to match the contents between fields, anchor it to the beginning of the string to prevent the engine from returning further groups of N fields, and quantify it (with N = 5 here):
/^((?:\\\\|\\,|[^,])*,?){0,5}/g
Once again, I am using standard regular expression notation, but here I will also do the trivial exercise of quoting this as a Java string:
"^((?:\\\\\\\\|\\\\,|[^,])*,?){0,5}"
This is the only solution on this page so far which actually answers both parts of the precise requirements specified by the OP, "...commas and backslash are escaped using a back slash." For the input fi\,eld1\\,field2\\,field3\\,field4\\,field5\\,field6\\,, it properly matches only the first five fields fi\,eld1\\,field2\\,field3\\,field4\\,field5\\,.
Note: my first answer made the same assumption that is implicit in the OP's original code and example data, namely that a comma follows every field. The problem was that if the input had five or fewer fields and the last field was not followed by a comma (equivalently, by an empty field), the final field would not be matched. I did not like this, so I updated both of my answers so that they no longer require trailing commas.
The shortcoming with this answer is that it follows the OP's assumption that values between commas contain "anything" plus escaped commas or escaped backslashes (i.e., no distinction between strings in double quotes, etc., but only recognition of escaped commas and backslashes). My answer fulfills the criteria of that imaginary scenario. But in the real world, someone would expect to be able to use double quotes around a CSV field in order to include commas within a field without using backslashes.
So I echo the words of @anubhava and suggest that a "real" CSV parser should always be used when handling CSV data. Doing otherwise is just being a script kiddie and not in any way truly "handling" CSV data.
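For the narrow format in the question (only commas and backslashes are escaped, each with a backslash), a plain character scan is an alternative to the regexes above. This is a minimal sketch under that assumption; the class and method names are hypothetical, escape sequences are copied through unchanged, and an empty trailing field is dropped.

```java
import java.util.ArrayList;
import java.util.List;

final class EscapedCsvFields {
    // Returns at most `limit` leading fields, splitting only on
    // commas that are not escaped by a backslash.
    static List<String> firstFields(String input, int limit) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length() && fields.size() < limit; i++) {
            char c = input.charAt(i);
            if (c == '\\' && i + 1 < input.length()) {
                // Keep the backslash and the character it escapes together.
                current.append(c).append(input.charAt(++i));
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        // Last field when the input does not end with a comma.
        if (fields.size() < limit && current.length() > 0) {
            fields.add(current.toString());
        }
        return fields;
    }
}
```

Because the scan consumes a backslash together with the character after it, \\, is read as an escaped backslash followed by a real separator, which is the case the simple (?<!\\), look-behind gets wrong.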

Parsing CSV with a RegEx in java - escape double quote within cell

I am looking for a Java regex that will handle a double quote inside an Excel cell.
I have followed this example, but the regular expression needs another change to work when one of the cells contains a double quote:
Parsing CSV input with a RegEx in java
private final Pattern pattern = Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?=,|$)");
Example Data:
"A,B","2" size","text1,text2, text3"
The regex above fails at 2".
I want the output to be as below. It doesn't matter whether the outer double quotes are there or not.
"A,B"
"2" size"
"text1,text2, text3"

While I agree that using a regex to parse CSV is not really the best way, a slightly better pattern is:
Pattern pattern = Pattern.compile("^\"([^\"]*)\",|,\"([^\"]*)\",|,\"([^\"]*)\"$|(?<=,|^)([^,]*)(?=,|$)");
This terminates a cell value only after a quote and a comma, and starts one only after a comma and a quote.

Well, as F.J commented, the input data is ambiguous. But for your example input, you could try the
string.split("\",\"") method to get a String[].
After this, you have an array with 3 elements:
[
"A,B,
2" size,
text1,text2, text3"
]
Then remove the first character (a double quote) of the first element of the array,
and remove the last character (a double quote) of the last element of the array.
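Put together, those steps might look like the following sketch. It assumes every cell is quoted and that the sequence "," never occurs inside a cell; the class and method names are hypothetical.

```java
final class QuotedCellSplitter {
    // Strips the outermost quotes, then splits on the "," sequence
    // that separates two quoted cells.
    static String[] splitCells(String line) {
        String trimmed = line.trim();
        if (trimmed.length() >= 2 && trimmed.startsWith("\"") && trimmed.endsWith("\"")) {
            trimmed = trimmed.substring(1, trimmed.length() - 1);
        }
        // Limit -1 keeps trailing empty cells.
        return trimmed.split("\",\"", -1);
    }
}
```

Stripping the outer quotes before splitting, rather than after, avoids having to treat the first and last array elements specially.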

java regex string split by " not \"

Actually, I just need to write a simple program in Java to convert MySQL INSERT lines into CSV files (each MySQL table equals one CSV file).
Is regex the best solution for this in Java?
My main problem is how to correctly match a value like this: 'this is \'cool\'...'
(how to ignore the escaped ')
example:
INSERT INTO `table1` VALUES ('this is \'cool\'...' ,'some2');
INSERT INTO `table1` (`field1`,`field2`) VALUES ('this is \'cool\'...' ,'some2');
Thanks

Assuming that your SQL statements are syntactically valid, you could use
Pattern regex = Pattern.compile("'(?:\\\\.|[^'\\\\])*'");
to get a regex that matches all single-quoted strings, ignoring escaped characters inside them.
Explanation without all those extra backslashes:
' # Match '
(?: # Either match...
\\. # an escaped character
| # or
[^'\\] # any character except ' or \
)* # any number of times.
' # Match '
Given the string
'this', 'is a \' valid', 'string\\', 'even \\\' with', 'escaped quotes.\\\''
this matches
'this'
'is a \' valid'
'string\\'
'even \\\' with'
'escaped quotes.\\\''
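A short usage sketch in Java, applying this pattern to one of the INSERT lines from the question. The class and method names are hypothetical. The unescaping step is a simplification I am assuming for illustration: it just drops the backslash, so sequences such as \n are not turned into control characters, and MySQL's doubled-quote ('') escaping is not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SqlStringLiterals {
    // Regex level: '(?:\\.|[^'\\])*'
    private static final Pattern QUOTED = Pattern.compile("'(?:\\\\.|[^'\\\\])*'");

    // Returns the contents of every single-quoted literal in the statement.
    static List<String> extract(String sql) {
        List<String> values = new ArrayList<>();
        Matcher m = QUOTED.matcher(sql);
        while (m.find()) {
            String literal = m.group();
            String body = literal.substring(1, literal.length() - 1);
            // Simplified unescape: backslash + X becomes X.
            values.add(body.replaceAll("\\\\(.)", "$1"));
        }
        return values;
    }
}
```

For the first INSERT line in the question this yields the two values this is 'cool'... and some2, which can then be written out as one CSV row.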

You can match on chars within non-escaped quotes by using this regex:
(?<!\\)'([^']*)(?<!\\)'
This uses a negative look-behind to assert that the character before the quote is not a backslash.
In Java, you have to double-escape (once for the String, once for the regex), so it looks like:
String regex = "(?<!\\\\)'([^']*)(?<!\\\\)'";
If you are working on Linux, I would use sed to do all the work.

Four backslashes (two to represent a backslash) plus dot. "'(\\\\.|.)*'"

Although regexes give you a very powerful mechanism to parse text, I think you might be better off with a non-regex parser. I think your code will be easier to write, easier to understand and have fewer bugs.
Something like:
find "INSERT INTO"
find table name
find column names
find "VALUES"
find value set (loop this part)
Writing the regex to do all of the above, with optional column values and an optional number of value sets is non-trivial and error-prone.

You have to use \\\\. In a Java string literal, \\ is one \, because the backslash introduces escape sequences such as \n and \t. In a regex, a literal backslash must itself be escaped as \\, so matching one backslash takes four backslashes in the Java source.
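The two levels of escaping can be checked with a few lines of Java. This is only an illustrative sketch; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BackslashEscaping {
    // Java source "\\\\" is the two-character string \\ ,
    // which the regex engine reads as "one literal backslash".
    static final Pattern ONE_BACKSLASH = Pattern.compile("\\\\");

    static int countBackslashes(String s) {
        int count = 0;
        Matcher m = ONE_BACKSLASH.matcher(s);
        while (m.find()) {
            count++;
        }
        return count;
    }
}
```

Here countBackslashes("a\\b") sees the three-character string a\b and finds exactly one backslash.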

Regular expression to match strings enclosed in square brackets or double quotes

I need 2 simple reg exps that will:
Match if a string is contained within square brackets ([] e.g [word])
Match if string is contained within double quotes ("" e.g "word")

\[\w+\]
"\w+"
Explanation:
The \[ and \] escape the special bracket characters to match their literals.
The \w means "any word character", usually considered same as alphanumeric or underscore.
The + means one or more of the preceding item.
The " are literal characters.
NOTE: If you want to ensure the whole string matches (not just part of it), prefix with ^ and suffix with $.
And next time, you should be able to answer this yourself, by reading regular-expressions.info
Update:
OK, so based on your comment, what you appear to want to know is whether the first character is [ and the last is ], or whether the first and last are both "?
If so, these will match those:
^\[.*\]$ (or ^\\[.*\\]$ in a Java String)
^".*"$
However, unless you need to do some special checking of the centre characters, you can simply do:
if ( MyString.startsWith("[") && MyString.endsWith("]") )
and
if ( MyString.startsWith("\"") && MyString.endsWith("\"") )
which I suspect would be faster than a regex.
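As a small Java sketch of both variants (class and method names are hypothetical): the length check is my addition, so that a string consisting of a single " is not reported as quoted, since startsWith and endsWith would both succeed on it.

```java
final class DelimiterCheck {
    static boolean isBracketed(String s) {
        return s.length() >= 2 && s.startsWith("[") && s.endsWith("]");
    }

    static boolean isQuoted(String s) {
        return s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"");
    }

    // Regex variant of isBracketed. String.matches() must match the whole
    // string, so ^ and $ are not needed. Note that . does not match line
    // terminators by default.
    static boolean isBracketedRegex(String s) {
        return s.matches("\\[.*\\]");
    }
}
```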

Important issues that may make this hard/impossible in a regex:
Can [] be nested (e.g. [foo [bar]])? If so, then a traditional regex cannot help you. Perl's extended regexes can, but it is probably better to write a parser.
Can [, ], or " appear escaped (e.g. "foo said \"bar\"") in the string? If so, see How can I match double-quoted strings with escaped double-quote characters?
Is it possible for there to be more than one instance of these in the string you are matching? If so, you probably want to use the non-greedy quantifier modifier (i.e. ?) to get the smallest string that matches: /(".*?"|\[.*?\])/g
Based on comments, you seem to want to match things like "this is a "long" word"
#!/usr/bin/perl
use strict;
use warnings;
my $s = 'The non-string "this is a crazy "string"" is bad (has own delimiter)';
print $s =~ /^.*?(".*").*?$/, "\n";

Are they two separate expressions?
\[[A-Za-z]+\]
\"[A-Za-z]+\"
If they are in a single expression:
[[\"]+[a-zA-Z]+[]\"]+
Remember that in .NET you'll need to escape each double quote " as "".
