Regular Expression to select first five CSVs from a string - java

I have a CSV string like apple404, orange pie, wind\,cool, sun\\mooon, earth, in Java. To be precise, each value of the CSV string could be anything, provided commas and backslashes are escaped using a backslash.
I need a regular expression to find the first five values. After some googling I came up with the following, but it won't allow escaped commas within the values.
Pattern pattern = Pattern.compile("([^,]+,){0,5}");
Matcher matcher = pattern.matcher("apple404, orange pie, wind\\,cool, sun\\\\mooon, earth,");
if (matcher.find()) {
    System.out.println(matcher.group());
} else {
    System.out.println("No match found.");
}
Does anybody know how to make it work for escaped commas within values?

The following negative look-behind based regex will work:
Pattern pattern = Pattern.compile("(?:.*?(?<!(?:(?<!\\\\)\\\\)),){0,5}");
However, for full-fledged CSV parsing it is better to use a dedicated CSV parser such as JavaCSV.

You can use String.split() here. By specifying the limit as 6, the first five elements (index 0 to 4) will always be the first five column values from your CSV string. If any extra column values are present, they all overflow into index 5.
The regex (?<!\\\\), makes sure the CSV string is only split at a comma that is not preceded by a \.
// requires: import java.util.Arrays;
// The parentheses make split() apply to the whole concatenated string, not just the second literal.
String[] cols = ("apple404, orange pie, wind\\,cool, sun\\\\mooon, earth, " +
        "mars, venus, pluto").split("(?<!\\\\),", 6);
System.out.println(cols.length); // 6
System.out.println(Arrays.toString(cols));
// [apple404,  orange pie,  wind\,cool,  sun\\mooon,  earth,  mars, venus, pluto]
System.out.println(cols[4]); // 5th = " earth" (the leading space is kept)
System.out.println(cols[5]); // 6th = overflow: " mars, venus, pluto"

This regular expression works well. It also properly recognizes not only backslash-escaped commas, but also backslash-escaped backslashes. Also, the matches it produces do not contain the commas.
/(?:\\\\|\\,|[^,])*/g
(I am using standard regular expression notation with the understanding that you would replace the delimiters with quote marks and double all backslashes when representing this regular expression within a Java string literal.)
example input
"apple404, orange pie, wind\,cool, sun\\,mooon, earth"
produces this output
"apple404"
" orange pie"
" wind\,cool"
" sun\\"
"mooon"
Note that the double backslash after "sun" is escaped and therefore does not escape the following comma.
The way this regular expression works is by atomizing the input into longest sequences first, beginning with double backslashes (treating them as one possible multi-character alternative), followed by escaped commas (a second possible multi-character alternative), followed by any non-comma character. Any number of these atoms are matched, so each match stops just before the next unescaped comma.
In order to obtain the first N fields, one may simply splice the array of matches from the previous answer or surround the main expression in additional parentheses, include an optional comma in order to match the contents between fields, anchor it to the beginning of the string to prevent the engine from returning further groups of N fields, and quantify it (with N = 5 here):
/^((?:\\\\|\\,|[^,])*,?){0,5}/g
Once again, I am using standard regular expression notation, but here I will also do the trivial exercise of quoting this as a Java string:
"^((?:\\\\\\\\|\\\\,|[^,])*,?){0,5}"
This is the only solution on this page so far which actually answers both parts of the precise requirements specified by the OP, "...commas and backslash are escaped using a back slash." For the input fi\,eld1\\,field2\\,field3\\,field4\\,field5\\,field6\\,, it properly matches only the first five fields fi\,eld1\\,field2\\,field3\\,field4\\,field5\\,.
Note: my first answer made the same assumption that is implicitly part of the OP's original code and example data, namely that a comma follows every field. The problem was that if the input has five fields or fewer and the last field is not followed by a comma (equivalently, by an empty field), then the final field would not be matched. I did not like this, so I updated both of my answers so that they do not require trailing commas.
The shortcoming with this answer is that it follows the OP's assumption that values between commas contain "anything" plus escaped commas or escaped backslashes (i.e., no distinction between strings in double quotes, etc., but only recognition of escaped commas and backslashes). My answer fulfills the criteria of that imaginary scenario. But in the real world, someone would expect to be able to use double quotes around a CSV field in order to include commas within a field without using backslashes.
So I echo the words of @anubhava and suggest that a "real" CSV parser should always be used when handling CSV data. Doing otherwise is just being a script kiddie and not in any way truly "handling" CSV data.
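As a hedged illustration of the field-by-field idea above, the following Java sketch applies the per-field expression repeatedly and stops after N fields. The names FirstFields and firstFields are made up for this example, and it assumes, like the question, that only commas and backslashes are escaped, each with a single backslash.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstFields {
    // One field: a run of escaped backslashes, escaped commas, or non-comma characters.
    // Regex notation: (?:\\\\|\\,|[^,])*  (each backslash is doubled again in the Java literal)
    private static final Pattern FIELD = Pattern.compile("(?:\\\\\\\\|\\\\,|[^,])*");

    // Returns at most n leading fields, without the separating commas.
    static List<String> firstFields(String csv, int n) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(csv);
        int pos = 0;
        while (fields.size() < n && pos < csv.length()) {
            m.region(pos, csv.length());
            if (!m.lookingAt()) {
                break; // not expected: the pattern also matches the empty string
            }
            fields.add(m.group());
            pos = m.end() + 1; // skip the unescaped comma that ended this field
        }
        return fields;
    }

    public static void main(String[] args) {
        // Java literal for: apple404, orange pie, wind\,cool, sun\\,mooon, earth
        String input = "apple404, orange pie, wind\\,cool, sun\\\\,mooon, earth";
        for (String field : firstFields(input, 5)) {
            System.out.println("\"" + field + "\"");
        }
    }
}
```

Matching one field at a time keeps the expression short and returns the values as a list, at the cost of a small loop around the matcher.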

Related

Matching The Arabic punctuation marks in Java

I want to edit REGEX_PATTERN2 in this code so that it works with the matches() method for the Arabic punctuation marks.
String REGEX_PATTERN = "[\\.|,|:|;|!|_|\\?]+";
String s1 = "My life :is happy, stable";
String[] result = s1.split(REGEX_PATTERN);
for (String myString : result) {
    System.out.println(myString);
}
String REGEX_PATTERN2 = "[\\.|,|:|;|!|_|،|؛|؟\\?]+";
String s2 = " حياتي ؛ سعيدة، مستقر";
String[] result2 = s2.split(REGEX_PATTERN2);
for (String myString : result2) {
    System.out.println(myString);
}
The output I wanted
My life
is happy
stable
حياتي
سعيدة
مستقر
How can I edit this code to use the matches() method instead of split() and get the same output with Arabic punctuation marks?
There are a few problems here. First this example:
if (word.matches("[\\.|,|:|;|!|\\?]+"))
That is mildly incorrect (see note 1) for the following reasons:
A . does not need to be escaped in a character class.
A | does not mean alternation in a character class.
A ? does not need to be escaped in a character class.
(For more details, read the javadoc or a tutorial on Java regexes.)
So you can rewrite the above as:
if (word.matches("[.,:;!?]+"))
... assuming that you don't want to classify the pipe character as punctuation.
Now this:
if (word.matches("[\.|,|:|;|!|،|؛|..|...|؟|\?]+"))
You have the same problems as above. In addition, you seem to have used two and three full-stop / period characters instead of (presumably) some Unicode character. I suspect they might be \ufbb7, \u061e or \u06db, but I'm no linguist. (Certainly 2 or 3 full-stops is incorrect.)
So what are the punctuation characters in Arabic?
To be honest, I think that the answer depends on what source you look at, but Wikipedia states:
Only the Arabic question mark ⟨؟⟩ and the Arabic comma ⟨،⟩ are used in regular Arabic script typing and the comma is often substituted for the Latin script comma (,).
1 - By mildly incorrect, I mean that the mistakes in this example are mostly harmless. However, your inclusion of (multiple instances of) the | character in the class does mean that you will incorrectly classify a "pipe" as punctuation.
[] denotes a regex character class, which means it only matches single characters. ... is 3 characters, so it cannot be used in a character class.
In a character class, you don't separate characters with |, and you don't need to escape . and ?.
You probably meant this, which is a list of alternate character sequences:
"(?:\\.|,|:|;|!|\\?|،|؛|؟|\\.\\.|\\.\\.\\.)+"
You might get better performance if you do use a character class where you can:
"(?:\\.{1,3}|[,:;!?،؛؟])+"
Of course, with the + at the end, matching 1-3 periods in each iteration is rather redundant, so this will do:
"[.,:;!?،؛؟]+"
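A minimal sketch of how the simplified character class could be used with both matches() and split(). The class and method names are hypothetical. The Arabic marks are written as Unicode escapes (U+060C comma, U+061B semicolon, U+061F question mark) so the source file encoding does not matter, and the optional whitespace around the separator is an addition made for this sketch, not part of the answer above.

```java
import java.util.Arrays;

class ArabicPunctuation {
    // Latin marks plus the Arabic comma (U+060C), semicolon (U+061B) and question mark (U+061F).
    static final String PUNCT = "[.,:;!?\u060C\u061B\u061F]+";

    // True if the whole token consists of punctuation characters only.
    static boolean isPunctuation(String token) {
        return token.matches(PUNCT);
    }

    // Splits on runs of punctuation; the optional whitespace on both sides
    // (an assumption of this sketch) removes the padding around each piece.
    static String[] splitOnPunctuation(String text) {
        return text.trim().split("\\s*" + PUNCT + "\\s*");
    }

    public static void main(String[] args) {
        System.out.println(isPunctuation("\u061F\u060C"));
        System.out.println(isPunctuation("a,"));
        System.out.println(Arrays.toString(splitOnPunctuation("My life :is happy, stable")));
    }
}
```

Note that matches() tests a whole token, so it answers "is this token punctuation?", while split() is still the simpler tool for cutting a sentence into pieces.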
Here's a different approach, that uses Unicode properties instead of specific characters (In case you care about more Arabic marks than just the question mark and comma mentioned in another answer):
"(?=^[\\p{InArabic}.,:;!?]+$)^\\p{IsPunctuation}+$"
It matches an entire string of characters that have a punctuation category, that also are either in the Arabic block or are one of the other punctuation characters you listed in your efforts.
It'll match strings like "؟،" or "؟،:", but not "؟،ؠ" or "؟،a".

Using regex to only match those Strings which use escape character correctly (according to Java syntax)?

take these strings for example:
"hello world\n" (correct - regex should match this)
"I'm happy \ here" (this is incorrect as the escape character is not
used correctly - regex should not match this one)
I've tried searching on google but didn't find anything helpful.
I want this one to be used in a parser which only parses string literals from a java code file.
Here is the regex I used:
"\\\"(\\[tbnrf\'\"\\])*[a-zA-Z0-9\\`\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)\\_\\-\\+\\=\\|\\{\\[\\}\\]\\;\\:\\'\\/\\?\\>\\.\\<\\,]\\\""
what am I doing wrong?
I guess you gave us the regex in Java String literal form, like
String regex = "\\\"(\\[tbnrf\'\"\\])*[a-zA-Z0-9\\`\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)\\_\\-\\+\\=\\|\\{\\[\\}\\]\\;\\:\\'\\/\\?\\>\\.\\<\\,]\\\"";
Unpacking that from Java's String escaping syntax gives the raw regex:
\"(\[tbnrf'"\])*[a-zA-Z0-9\`\~\!\#\#\$\%\^\&\*\(\)\_\-\+\=\|\{\[\}\]\;\:\'\/\?\>\.\<\,]\"
That consists of:
\" matching a double-quote character (Java String literal begins here). Escaping the double quotes with backslash isn't necessary: " on its own is ok as well.
(\[tbnrf'"\])*: a group, repeated 0...n times. I guess you want that to match against the various Java backslash escapes, but that should read (\\[tbnrf'"\\])* with a double backslash in front and inside the character class. And maybe you want to cover the Java octal escapes as well (see the language specification), giving (\\[tbnrf01234567'"\\])*
[a-zA-Z0-9\`\~\!\#\#\$\%\^\&\*\(\)\_\-\+\=\|\{\[\}\]\;\:\'\/\?\>\.\<\,]: a character class matching one character from a selected list of alphabetic and punctuation characters. I'd replace that with [^"\\], meaning anything but double quote or backslash.
\" matching a double-quote character (string literal ends here). Once again, no need to escape the double quote.
Besides the individual elements, the overall structure of the regex probably isn't what you want: You allow only strings beginning with any number of backslash escapes, followed by exactly one non-escape character, and this enclosed in a pair of double quotes.
The overall structure should instead be "(backslash_escape|simple_character)*"
So, the complete regex would be:
"(\\[tbnrf01234567'"\\]|[^"\\])*"
or, expressed in a Java literal:
String regex = "\"(\\\\[tbnrf01234567'\"\\\\]|[^\"\\\\])*\"";
And, although this is shorter than your original attempt, I'd still not call it readable and opt for a different implementation, not using regular expressions.
P.S. Although I did some testing with my regex, I'm not at all sure that it covers all relevant cases correctly.
P.P.S. There are the \uxxxx escapes, not yet covered by the regex.
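To try the complete regex against the two strings from the question, a small check along these lines might help. The class and method names are hypothetical, and, as noted above, \uxxxx escapes are not covered.

```java
import java.util.regex.Pattern;

class StringLiteralCheck {
    // Raw regex: "(\\[tbnrf01234567'"\\]|[^"\\])*"
    private static final Pattern LITERAL =
            Pattern.compile("\"(\\\\[tbnrf01234567'\"\\\\]|[^\"\\\\])*\"");

    // True if the text is a double-quoted literal whose backslashes all start a known escape.
    static boolean isValidLiteral(String source) {
        return LITERAL.matcher(source).matches();
    }

    public static void main(String[] args) {
        String good = "\"hello world\\n\"";    // source text: "hello world\n"
        String bad = "\"I'm happy \\ here\"";  // source text: "I'm happy \ here"
        System.out.println(isValidLiteral(good)); // expected: true
        System.out.println(isValidLiteral(bad));  // expected: false
    }
}
```

The second string is rejected because its backslash is followed by a space, which is neither in the escape list nor allowed as a plain character after a backslash.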

regex not matching with some numbers

I have this regex: ^(\d*.?\d*)$ for all numbers, but some numbers won't match with this regex
Some Examples:
54139 // work
24.711 // won't work, not a float but dot is the separator
0 // won't work
60 // won't work
I used this regex in RegexValidator. I'm validating a textfield:
TextField textField = new TextField(caption);
textField.setValue(value);
textField.addValidator(new StringLengthValidator(value + " ...",10, 50, true));
textField.addValidator(new RegexpValidator("^(\\d*.?\\d*)$", value + " ..."));
I tried it with another regex: ^[0-9,.]+$
If I understood your problem correctly, you're validating the content of a multi-line TextField and want to accept a content consisting of one or multiple lines of non-empty sequences of .-separated floats and integers.
This statement can be simplified as follows: you're looking to match a sequence of numbers separated by . or linefeeds, where a number can contain a decimal part.
I would then use the following RegexpValidator:
new RegexpValidator("\\d+(,\\d+)?([.\\n]\\d+(,\\d+)?)*", true, value + "...")
In this regular expression, a number is represented as \\d+(,\\d+)?, which represents a mandatory integer part followed by an optional decimal part with its comma separator.
The global regular expression accepts a number such as defined above, followed by a (possibly empty) sequence of other numbers preceded by one of the accepted delimiters, a . or a linefeed.
I verified my answer with the following expression which returns true :
"54139\n12.5\n1,2.5,3\n12,5".matches("\\d+(,\\d+)?([.\\n]\\d+(,\\d+)?)*");
Note that I removed the anchors you used and used the complete parameter of RegexpValidator instead.
The regex could arguably be reduced to \d+([.,\n]\d+)*, but it would then stop representing the difference between . (a separator between numbers) and , (part of a number). That doesn't matter at runtime for a validator, but it could confuse people maintaining your code, and the regex couldn't easily be reused if you later want to match the individual numbers.
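The expression can also be exercised outside of the validator. The sketch below uses plain String.matches(), which, like the complete flag, requires the whole input to match. The class name and the extra sample inputs are assumptions made for illustration.

```java
class NumberListCheck {
    // A number is digits with an optional ",digits" part;
    // numbers are separated by '.' or a line feed.
    static final String NUMBERS = "\\d+(,\\d+)?([.\\n]\\d+(,\\d+)?)*";

    static boolean isValid(String text) {
        return text.matches(NUMBERS);
    }

    public static void main(String[] args) {
        String[] samples = {"54139", "24.711", "0", "60", "", "12,,5"};
        for (String sample : samples) {
            System.out.println("'" + sample + "' -> " + isValid(sample));
        }
    }
}
```

The four values from the question are accepted, while the empty string and a doubled comma are rejected.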

Why the space appears as sub string in this split instruction?

I have a string with spaces and some non-informative characters and substrings that need to be excluded, keeping only some important sections. I used split as below:
String myString[]={"01: Hi you look tired today? Can I help you?"};
myString=myString[0].split("[\\s+]");// Split based on any white spaces
for(int ii=0;ii<myString.length;ii++)
    System.out.println(myString[ii]);
The result is :
01:
Hi
you
look
tired
today?
Can
I
help
you?
The spaces appeared as substrings after the split when the regex is "[\s+]" but disappeared when the regex is "\s+". I am confused and could not find an answer in the related Stack Overflow pages. The link regex-Pattern made me even more confused.
Please help, I am new with java.
19/1/2015:Edit
After your valuable advice, I reached a point in my program where a conditional statement needs to be decomposed and processed. The case I have is:
String s1="01:IF rd.h && dq.L && o.LL && v.L THEN la.VHB , av.VHR with 0.4610;";
String [] s2=s1.split(("[\\s\\&\\,]+"));
for(int ii=0;ii<s2.length;ii++)System.out.println(s2[ii]);
The result is fine till now as:
01:IF
rd.h
dq.L
o.LL
v.L
THEN
la.VHB
av.VHR
with
0.4610;
My next step is to add the string "with" to the regex and get rid of this word while doing the split.
I tried it this way:
String s1="01:IF rd.h && dq.L && o.LL && v.L THEN la.VHB , av.VHR with 0.4610;";
String [] s2=s1.split(("[\\s\\&\\, with]+"));
for(int ii=0;ii<s2.length;ii++)System.out.println(s2[ii]);
The result is not perfect, because I got an unwanted extra split at every "h" letter:
01:IF
rd.
dq.L
o.LL
v.L
THEN
la.VHB
av.VHR
0.4610;
Any advice on how to specify a string with mixed white spaces and separation marks?
Many thanks.
Inside square brackets, [\s+] represents a character class containing the whitespace characters plus the plus sign. It matches only one character, so a sequence of spaces produces many empty strings, as Todd noted, and + is also treated as a separator.
You should use \s+ (without brackets) as the separator. That means one or more whitespace characters.
myString=myString[0].split("\\s+");
Your biggest problem is not understanding enough about regular expressions to write them properly. One key point you don't comprehend is that [...] is a character class, which is a list of characters any one of which can match. For example:
[abc] matches either a, b or c (it does not match "abc")
[\\s+] matches any whitespace or "+" character
[with] matches a single character that is either w, i, t or h
[.$&^?] matches those literal characters - most characters lose their special regex meaning when in a character class
To split on any number of whitespace, comma and ampersand and consume "with" (if it appears), do this:
String [] s2 = s1.split("[\\s,&]+(with[\\s,&]+)?");
You can try it easily here Online Regex and get useful comments.
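The difference between the character class and the quantified form, and the effect of the optional "with" group, can be checked with a short program. This is only a sketch; SplitDemo and tokens are hypothetical names.

```java
import java.util.Arrays;

class SplitDemo {
    // Splits on runs of whitespace, commas and ampersands,
    // and swallows a following "with" together with its trailing separators.
    static String[] tokens(String line) {
        return line.split("[\\s,&]+(with[\\s,&]+)?");
    }

    public static void main(String[] args) {
        // [\s+] matches ONE whitespace or '+' character, so two spaces yield an empty string.
        System.out.println(Arrays.toString("a  b".split("[\\s+]"))); // [a, , b]
        // \s+ matches a whole run of whitespace.
        System.out.println(Arrays.toString("a  b".split("\\s+")));   // [a, b]

        String s1 = "01:IF rd.h && dq.L && o.LL && v.L THEN la.VHB , av.VHR with 0.4610;";
        for (String token : tokens(s1)) {
            System.out.println(token);
        }
    }
}
```

Because "with" is matched as a literal sequence outside the brackets, the "h" in rd.h is no longer treated as a separator.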

Parsing CSV with a RegEx in java - escape double quote within cell

I am looking for a java regex which will escape the doublequote within an excel cell.
I have followed this example but need another change in the regular expression to make it work for escaping doublequote within one of the cells.
Parsing CSV input with a RegEx in java
private final Pattern pattern = Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?=,|$)");
Example Data:
"A,B","2" size","text1,text2, text3"
The regex from above fails at 2".
I want the output to be as below .Doesn't matter if the outer double quotes are there or not.
"A,B"
"2" size"
"text1,text2, text3"
While I agree that using a regex for parsing CSV is not really the best way, a slightly better pattern is:
Pattern pattern = Pattern.compile("^\"([^\"]*)\",|,\"([^\"]*)\",|,\"([^\"]*)\"$|(?<=,|^)([^,]*)(?=,|$)");
This will terminate a cell value only after a quote and a comma, or start it after a comma and a quote.
Well, as F.J commented, the input data is ambiguous. But for your example input, you could try the
string.split("\",\"") method to get a String[].
After this, you have an array with 3 elements:
[
"A,B,
2" size,
text1,text2, text3"
]
Then remove the first character (a double quote) from the first element of the array,
and remove the last character (a double quote) from the last element of the array.
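The split-and-trim steps above can be sketched as follows. QuotedCells and cells are hypothetical names, and the sketch assumes that every cell on the line is wrapped in double quotes and that the line is not empty; it is not a general CSV parser.

```java
import java.util.Arrays;

class QuotedCells {
    // Assumes the line looks like "cell","cell",...,"cell" with every cell quoted.
    static String[] cells(String line) {
        String[] parts = line.split("\",\""); // split on the three characters ","
        int last = parts.length - 1;
        parts[0] = parts[0].substring(1); // drop the opening quote of the first cell
        parts[last] = parts[last].substring(0, parts[last].length() - 1); // drop the closing quote
        return parts;
    }

    public static void main(String[] args) {
        // Java literal for: "A,B","2" size","text1,text2, text3"
        String line = "\"A,B\",\"2\" size\",\"text1,text2, text3\"";
        for (String cell : cells(line)) {
            System.out.println(cell);
        }
    }
}
```

A cell that itself contains the sequence "," would still be split in the wrong place, which is the ambiguity mentioned above.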
