I'm trying to replace commas with placeholder text within double-quoted elements of a CSV.
For instance, given this line in a CSV:
1,2,"three,four,five",6,7,8,"nine,ten",11,12
Using this regex (quotes escaped for Java):
(?<=\")([^"]+?),([^"]+?)(?=\")
I replace the first match with:
$1<COMMA>$2
Which gives me this result string:
1,2,"three<COMMA> four, five",6,7,8,"nine,ten",11,12
I repeat these steps against the resultString until there are no more matches. Here are the progressive result strings:
1,2,"three<COMMA> four, five",6,7,8,"nine,ten",11,12
1,2,"three<COMMA> four<COMMA> five",6,7,8,"nine,ten",11,12
1,2,"three<COMMA> four<COMMA> five",6<COMMA>7,8,"nine,ten",11,12
1,2,"three<COMMA> four<COMMA> five",6<COMMA>7<COMMA>8,"nine,ten",11,12
1,2,"three<COMMA> four<COMMA> five",6<COMMA>7<COMMA>8,"nine<COMMA>ten",11,12
1,2,"three<COMMA> four<COMMA> five",6<COMMA>7<COMMA>8,"nine<COMMA>ten",11,12
How can I tweak my regex so it only replaces the "," within the quoted list items and not the delimiters themselves? In the 3rd iteration, I'm getting a match on: ",6,7,8,"
I tried to prevent this by having my lookbehind match only a single double quote with no double quotes around it, or groups of three double quotes, but ran into a "Look-behind group does not have an obvious maximum length" error.
You could change it so that the first matched character inside quotation marks can't be a comma: (?<=\")([^",][^"]*?),([^"]+?)(?=\").
Having said that, I don't think repeating the replacement until nothing matches any more is a very nice way of doing it. Personally I'd probably split the line into an array of strings on the unescaped commas, then iterate through the array and do a search-and-replace on each "-delimited string in the array with the /g modifier. But it's personal choice I suppose.
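A single-pass alternative to the repeated replacement could look like the sketch below. It is not the regex from the question: it finds each double-quoted field and replaces the commas inside it. The class and method names are made up for illustration, and it assumes quoted fields never contain escaped ("") quotes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedCommaReplacer {
    // One double-quoted field. Assumption: no escaped ("") quotes inside a field.
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]*\"");

    static String replaceCommasInQuotes(String line, String placeholder) {
        Matcher m = QUOTED.matcher(line);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Only the text of the quoted field is touched.
            String field = m.group().replace(",", placeholder);
            // quoteReplacement keeps '$' and '\' in the field from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(field));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "1,2,\"three,four,five\",6,7,8,\"nine,ten\",11,12";
        System.out.println(replaceCommasInQuotes(line, "<COMMA>"));
        // prints 1,2,"three<COMMA>four<COMMA>five",6,7,8,"nine<COMMA>ten",11,12
    }
}
```

Because the delimiters between fields are never part of a match, the ",6,7,8," problem from the third iteration cannot occur.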
After a quick google:
^(("(?:[^"]|"")*"|[^,]*)(,("(?:[^"]|"")*"|[^,]*))*)$
This matches the individual elements in one line of the CSV file.
http://www.kimgentes.com/worshiptech-web-tools-page/2008/10/14/regex-pattern-for-parsing-csv-files-with-embedded-commas-dou.html
I'm trying to split an input by ".,:;()[]"'\/!? " chars and add the words to a list. I've tried .split("\\W+?") and .split("\\W"), but both of them are returning empty elements in the list.
Additionally, I've tried .split("\\W+"), which returns only words without any special characters that should go along with them (for instance, if one of the input words is "C#", it writes "C" in the list). Lastly, I've also tried to put all of the special chars above into the .split() method: .split("\\.,:;\\(\\)\\[]\"'\\\\/!\\? "), but this isn't splitting the input at all. Could anyone advise please?
split() function accepts a regex.
This is not the regex you're looking for .split("\\.,:;\\(\\)\\[]\"'\\\\/!\\? ")
Try creating a character class like [.,:;()\[\]'\\\/!\?\s"] and add + to match one or more occurrences.
I also suggest replacing the space character with the generic \s, which covers all whitespace variations such as \t.
If you're sure about the list of characters you have selected as splitters, this should be your correct split with the correct Java string literal, as @Andreas suggested:
.split("[.,:;()\\[\\]'\\\\\\/!\\?\\s\"]+")
BTW: I've found a particularly useful eclipse editor option which escapes the string when you're pasting them into the quotes. Go to Window/Preferences, under Java/Editor/Typing/, check the box next to Escape text when pasting into a string literal
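To see the character-class split in action, here is a small sketch. The class name and the sample sentence are made up; the delimiter set is the one from the question.

```java
import java.util.Arrays;

class DelimiterSplit {
    // Regex: [.,:;()\[\]'\\/!?\s"]+  (one or more delimiter characters in a row)
    static final String DELIMITERS = "[.,:;()\\[\\]'\\\\/!?\\s\"]+";

    static String[] words(String input) {
        return input.split(DELIMITERS);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(words("I like C#, Java; and (maybe) C++!")));
        // prints [I, like, C#, Java, and, maybe, C++]
    }
}
```

Since # and + are not in the class, "C#" and "C++" survive intact. Note that an input that starts with a delimiter would still produce one empty string at index 0.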
I have a problem that I can't seem to find an answer here for, so I'm asking it.
The thing is that I have a string and I have delimiters. I want to create an array of strings from the things which are between those delimiters (might be words, numbers, etc). However, if I have two delimiters next to one another, the split method will return an empty string for one of the instances.
I tested this against even more delimiters that are in succession. I found out that if I have n delimiters, I will have n-1 empty strings in the result array. In other words, if I have both "," and " " as delimiters, and the sentence "This is a very nice day, isn't it", then the array with results would be like:
{... , "day", "", "isn't" ...}
I want to get those extra empty strings out and I can't figure out how to do that. A sample regex for the delimiters that I have is:
"[\\s,.-\\'\\[\\]\\(\\)]"
Also can you explain why there are extra empty strings in the result array?
P.S. I read some of the similar posts which included information about the second parameter of the split method. I tried negative, zero, and positive numbers, and I didn't get the result that I'm looking for. (One of the questions had an answer saying that -1 as a parameter might solve the problem, but it didn't.)
You can use this regex for splitting:
[\\s,.'\\[\\]()-]+
Keep an unescaped hyphen at the first or last position in the character class, otherwise it is treated as a range like A-Z or 0-9.
You must use the quantifier + to match one or more delimiters.
I think your problem is just the regex itself. You should use a greedy quantifier (and escape the hyphen so that it is not read as a range):
"[\\s,.\\-'\\[\\]\\(\\)]+"
See http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#sum
X+ ... X, one or more times
Your regular expression describes just one single character. If you want it to match multiple separators at once, use a quantifier:
String s = "This is a very nice day, isn't it";
String[] tokens = s.split("[\\s,.\\-\\[\\]()']+");
(Note the '+' at the end of the expression)
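To make the cause of the empty strings visible, here is a side-by-side sketch. It uses a reduced delimiter set (whitespace and comma only), which is an assumption made so that "isn't" stays in one piece; the class and method names are made up.

```java
import java.util.Arrays;

class EmptyTokenDemo {
    // Each single delimiter character is its own separator.
    static String[] splitSingle(String s) {
        return s.split("[\\s,]");
    }

    // A whole run of delimiter characters counts as one separator.
    static String[] splitRuns(String s) {
        return s.split("[\\s,]+");
    }

    public static void main(String[] args) {
        String s = "This is a very nice day, isn't it";
        System.out.println(Arrays.toString(splitSingle(s)));
        // prints [This, is, a, very, nice, day, , isn't, it]
        System.out.println(Arrays.toString(splitRuns(s)));
        // prints [This, is, a, very, nice, day, isn't, it]
    }
}
```

Without the +, the comma and the space after "day" are two separate separators, and the nothing between them is returned as an empty string. That is why n adjacent delimiters give n-1 empty strings.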
If you want to get rid of empty strings, you can use the Guava project Splitter class.
on method:
Returns a splitter that uses the given fixed string as a separator.
Example (ignoring empty strings):
System.out.println(
    Splitter.on(',')
        .trimResults()
        .omitEmptyStrings()
        .split("foo,bar,, qux")
);
Output:
[foo, bar, qux]
onPattern method:
Returns a splitter that considers any subsequence matching a given
pattern (regular expression) to be a separator.
Example (ignoring empty strings):
System.out.println(
    Splitter
        .onPattern("([,.|])")
        .trimResults()
        .omitEmptyStrings()
        .split("foo|bar,, qux.hi")
);
Output:
[foo, bar, qux, hi]
For more details, consult Splitter documentation.
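If adding Guava is not an option, roughly the same result can be had with the JDK alone. This is only a sketch of that alternative, not Guava's implementation; the class and method names are made up, and it assumes Java 8 or later for Pattern.splitAsStream.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class JdkSplitter {
    private static final Pattern SEPARATORS = Pattern.compile("[,.|]");

    static List<String> splitNonEmpty(String input) {
        return SEPARATORS.splitAsStream(input)
                .map(String::trim)                  // like trimResults()
                .filter(token -> !token.isEmpty())  // like omitEmptyStrings()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(splitNonEmpty("foo|bar,, qux.hi"));
        // prints [foo, bar, qux, hi]
    }
}
```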
I have a CSV string like apple404, orange pie, wind\,cool, sun\\mooon, earth, in Java. To be precise, each value of the CSV string could be anything, provided commas and backslashes are escaped using a backslash.
I need a regular expression to find the first five values. After some googling I came up with the following, but it won't allow escaped commas within the values.
Pattern pattern = Pattern.compile("([^,]+,){0,5}");
Matcher matcher = pattern.matcher("apple404, orange pie, wind\\,cool, sun\\\\mooon, earth,");
if (matcher.find()) {
    System.out.println(matcher.group());
} else {
    System.out.println("No match found.");
}
Does anybody know how to make it work for escaped commas within values?
The following negative-lookbehind regex will work:
Pattern pattern = Pattern.compile("(?:.*?(?<!(?:(?<!\\\\)\\\\)),){0,5}");
However, for full-fledged CSV parsing it is better to use a dedicated CSV parser like JavaCSV.
You can use String.split() here. By specifying the limit as 6, the first five elements (index 0 to 4) will always be the first five column values from your CSV string. Any extra column values all overflow into index 5.
The regex (?<!\\\\), makes sure the CSV string is only split at a comma that is not preceded by a \.
// The parentheses matter: without them split() would apply to the second literal only.
String[] cols = ("apple404, orange pie, wind\\,cool, sun\\\\mooon, earth, " +
        "mars, venus, pluto").split("(?<!\\\\),", 6);
System.out.println(cols.length); // 6
System.out.println(Arrays.toString(cols));
// [apple404,  orange pie,  wind\,cool,  sun\\mooon,  earth,  mars, venus, pluto]
// (every field after the first keeps its leading space)
System.out.println(cols[4]); // 5th = " earth"
System.out.println(cols[5]); // 6th (overflow) = " mars, venus, pluto"
This regular expression works well. It also properly recognizes not only backslash-escaped commas, but also backslash-escaped backslashes. Also, the matches it produces do not contain the commas.
/(?:\\\\|\\,|[^,])*/g
(I am using standard regular expression notation with the understanding that you would replace the delimiters with quote marks and double all backslashes when representing this regular expression within a Java string literal.)
example input
"apple404, orange pie, wind\,cool, sun\\,mooon, earth"
produces this output
"apple404"
" orange pie"
" wind\,cool"
" sun\\"
"mooon"
" earth"
Note that the double backslash after "sun" is escaped and therefore does not escape the following comma.
The way this regular expression works is by atomizing the input into longest sequences first, beginning with double backslashes (treating them as one possible multi-byte character value alternative), followed by escaped commas (a second possible multi-byte character alternative), followed by any non-comma value. Any number of these atoms are matched, followed by a literal comma.
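The same atom-by-atom reading can be written as a plain loop, which may be easier to follow than the regex. This is a hand-written sketch of the idea and not the regex itself; the class and method names are made up, and a backslash is assumed to always escape the single character after it.

```java
import java.util.ArrayList;
import java.util.List;

class EscapedCsvSplitter {
    static List<String> splitEscaped(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '\\' && i + 1 < line.length()) {
                // A backslash and the character after it form one atom (\\ or \,).
                current.append(c).append(line.charAt(++i));
            } else if (c == ',') {
                // An unescaped comma ends the current field.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        String input = "apple404, orange pie, wind\\,cool, sun\\\\,mooon, earth";
        for (String field : splitEscaped(input)) {
            System.out.println("\"" + field + "\"");
        }
    }
}
```

Because the pair \\ is consumed as one atom, the comma after "sun\\" is seen as unescaped and ends the field, just as described above.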
In order to obtain the first N fields, one may simply splice the array of matches from the previous answer or surround the main expression in additional parentheses, include an optional comma in order to match the contents between fields, anchor it to the beginning of the string to prevent the engine from returning further groups of N fields, and quantify it (with N = 5 here):
/^((?:\\\\|\\,|[^,])*,?){0,5}/g
Once again, I am using standard regular expression notation, but here I will also do the trivial exercise of quoting this as a Java string:
"^((?:\\\\\\\\|\\\\,|[^,])*,?){0,5}"
This is the only solution on this page so far which actually answers both parts of the precise requirements specified by the OP, "...commas and backslash are escaped using a back slash." For the input fi\,eld1\\,field2\\,field3\\,field4\\,field5\\,field6\\,, it properly matches only the first five fields fi\,eld1\\,field2\\,field3\\,field4\\,field5\\,.
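For anyone who wants to try the first-five-fields expression directly, here is a small sketch that runs it against the input quoted above. The class and method names are made up; the pattern is the Java string form of /^((?:\\\\|\\,|[^,])*,?){0,5}/.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstFiveFields {
    private static final Pattern FIRST_FIVE =
            Pattern.compile("^((?:\\\\\\\\|\\\\,|[^,])*,?){0,5}");

    static String firstFive(String csv) {
        Matcher m = FIRST_FIVE.matcher(csv);
        // The pattern can match the empty string, so find() succeeds on any input.
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        String input = "fi\\,eld1\\\\,field2\\\\,field3\\\\,field4\\\\,field5\\\\,field6\\\\,";
        System.out.println(firstFive(input));
        // prints fi\,eld1\\,field2\\,field3\\,field4\\,field5\\,
    }
}
```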
Note: my first answer made the same assumption that is implicitly part of the OP's original code and example data, namely that a comma follows every field. The problem was that if the input has five fields or fewer and the last field is not followed by a comma (equivalently, by an empty field), then the final field would not be matched. I did not like this, so I updated both of my answers so that they do not require trailing commas.
The shortcoming with this answer is that it follows the OP's assumption that values between commas contain "anything" plus escaped commas or escaped backslashes (i.e., no distinction between strings in double quotes, etc., but only recognition of escaped commas and backslashes). My answer fulfills the criteria of that imaginary scenario. But in the real world, someone would expect to be able to use double quotes around a CSV field in order to include commas within a field without using backslashes.
So I echo the words of @anubhava and suggest that a "real" CSV parser should always be used when handling CSV data. Doing otherwise is just being a script kiddie and not in any way truly "handling" CSV data.
I have a string (a list of author names for a book) which is in the following format:
author_name1, author_name2, author_name3 and author_name4
How can I parse the string so that I get the list of author names as an array of String? The delimiters in this case are , and the word and. I'm not sure how I can split the string based on these delimiters, since one of the delimiters here is a word and not a single character.
You can use myString.split(",|and") it will do what you want :)
You should use regular expressions:
"someString".split("(,|and)")
Try:
yourString.split("\\s*(,|and)\\s*")
\\s* means zero or more whitespace characters (so the surrounding spaces aren't included in your split).
(,|and) means , or and.
Test (Arrays.toString prints the array in the form - [element1, element2, ..., elementN]).
Java regex reference.
I think you need to include the regex OR operator:
String[]tokens = someString.split(",|and");
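One thing to watch with a bare and in the pattern is that it also matches inside names such as "Sandra" or "Amanda". A possible variant, sketched below, only treats and as a delimiter when it has whitespace on both sides. The class name, method names and sample author names are made up for illustration.

```java
import java.util.Arrays;

class AuthorSplit {
    // The pattern suggested above: also splits inside words containing "and".
    static String[] naive(String authors) {
        return authors.split("\\s*(,|and)\\s*");
    }

    // Variant: a comma, or the word "and" surrounded by whitespace.
    static String[] wordAware(String authors) {
        return authors.split("\\s*,\\s*|\\s+and\\s+");
    }

    public static void main(String[] args) {
        String authors = "Sandra Smith, Andy Lee, Bob Ray and Amanda Poe";
        System.out.println(Arrays.toString(naive(authors)));
        // prints [S, ra Smith, Andy Lee, Bob Ray, Am, a Poe]
        System.out.println(Arrays.toString(wordAware(authors)));
        // prints [Sandra Smith, Andy Lee, Bob Ray, Amanda Poe]
    }
}
```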
I am looking for a java regex which will escape the doublequote within an excel cell.
I have followed this example but need another change in the regular expression to make it work for escaping doublequote within one of the cells.
Parsing CSV input with a RegEx in java
private final Pattern pattern = Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?=,|$)");
Example Data:
"A,B","2" size","text1,text2, text3"
The regex from above fails at 2".
I want the output to be as below. It doesn't matter whether the outer double quotes are there or not.
"A,B"
"2" size"
"text1,text2, text3"
While I agree that using a regex to parse CSV is not really the best way, a slightly better pattern is:
Pattern pattern = Pattern.compile("^\"([^\"]*)\",|,\"([^\"]*)\",|,\"([^\"]*)\"$|(?<=,|^)([^,]*)(?=,|$)");
This will terminate a cell value only after a quote and a comma, or start it after a comma and a quote.
Well, as F.J commented, the input data is ambiguous. But for your example input, you could try the
string.split("\",\"") method to get a String[].
After this, you get an array with 3 elements:
[
"A,B,
2" size,
text1,text2, text3"
]
remove the first character (which is double quote) of the first element of the array
remove the last character (which is double quote) of the last element of the array
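Put together, the steps above could be sketched like this. The class and method names are made up, and the code assumes a non-empty row in which every cell is wrapped in double quotes.

```java
import java.util.Arrays;

class QuotedRowSplit {
    static String[] cells(String row) {
        // Split on the three-character sequence "," that separates two quoted cells.
        String[] parts = row.split("\",\"");
        // Remove the opening quote of the first cell.
        parts[0] = parts[0].substring(1);
        // Remove the closing quote of the last cell.
        int last = parts.length - 1;
        parts[last] = parts[last].substring(0, parts[last].length() - 1);
        return parts;
    }

    public static void main(String[] args) {
        String row = "\"A,B\",\"2\" size\",\"text1,text2, text3\"";
        System.out.println(Arrays.toString(cells(row)));
        // prints [A,B, 2" size, text1,text2, text3]
    }
}
```

A cell whose own text contains the sequence "," would still be split in the wrong place, which is the ambiguity mentioned above.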