I found this code that splits a CSV line into fields, including fields wrapped in double quotes.
But I don't really understand how the regex pattern matching works.
If someone could give me a step-by-step explanation of how this expression evaluates a pattern, it would be appreciated.
"([^\"]*)"|(?<=,|^)([^,]*)(?:,|$)
Thanks
====
Old posting
This is working well for me - either it matches on "two quotes and whatever is between them", or "something between the start of the line or a comma and the end of the line or a comma". Iterating through the matches gets me all the fields, even if they are empty. For instance,
the quick,"brown, fox jumps",over,"the",,"lazy dog"
breaks down into six fields (the quotes are dropped by the capture group, and the fifth field is empty):
the quick
brown, fox jumps
over
the
(empty field)
lazy dog
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class CSVParser {
/*
* This Pattern will match on either quoted text or text between commas, including
* whitespace, and accounting for beginning and end of line.
*/
private final Pattern csvPattern = Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?:,|$)");
private ArrayList<String> allMatches = null;
private Matcher matcher = null;
private String match = null;
private int size;
public CSVParser() {
allMatches = new ArrayList<String>();
matcher = null;
match = null;
}
public String[] parse(String csvLine) {
matcher = csvPattern.matcher(csvLine);
allMatches.clear();
String match;
while (matcher.find()) {
match = matcher.group(1);
if (match!=null) {
allMatches.add(match);
}
else {
allMatches.add(matcher.group(2));
}
}
size = allMatches.size();
if (size > 0) {
return allMatches.toArray(new String[size]);
}
else {
return new String[0];
}
}
public static void main(String[] args) {
String lineinput = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";
CSVParser myCSV = new CSVParser();
System.out.println("Testing CSVParser with: \n " + lineinput);
for (String s : myCSV.parse(lineinput)) {
System.out.println(s);
}
}
}
I'll try to give you hints and the vocabulary you need to find very good explanations on regular-expressions.info
"([^\"]*)"|(?<=,|^)([^,]*)(?:,|$)
() is a group
* is a quantifier
If there is a ? right after the opening bracket then it's a special group, here (?<=,|^) is a lookbehind assertion.
Square brackets declare a character class e.g. [^\"]. This one is a special one, because of the ^ at the start. It is a negated character class.
| denotes an alternation, i.e. an OR operator.
(?:,|$) is a non capturing group
$ is a special character in regex, it is an anchor (which matches the end of the string)
"([^\"]*)"|(?<=,|^)([^,]*)(?:,|$)
() capture group
(?:) non-capture group
[] character class (matches any one character listed inside the brackets; a leading ^, as in [^\"], negates it)
\ escape character, used here to write a literal " inside the pattern
(?<=) positive lookbehind (checks that the contained expression matches immediately before the current position, without consuming anything)
| either or operator (matches either side of the pipe)
^ beginning of line anchor
* zero or more of the preceding element
$ end of line anchor (\z is the end of the input)
For future reference, please bookmark a good regex reference; it can explain each part quite well.
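To see the pieces listed above working together, here is a minimal sketch that runs the expression from the question over a sample line and collects one field per match. The class name CsvRegexWalkthrough and the helper method fields are made up for illustration; the sample line is the one from the CSVParser code earlier in the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original CSVParser.
class CsvRegexWalkthrough {
    // The expression from the question, escaped for a Java string literal.
    private static final Pattern FIELD =
            Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?:,|$)");

    static List<String> fields(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            if (m.group(1) != null) {
                // First alternative: the text between two double quotes.
                result.add(m.group(1));
            } else {
                // Second alternative: the text after a comma or the line start,
                // up to the next comma or the end of the line (may be empty).
                result.add(m.group(2));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";
        for (String field : fields(line)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

The quoted alternative is tried first at every position, which is why the comma inside "brown, fox jumps" does not split that field. One limitation worth knowing: an opening quote that does not directly follow a comma (for example after a space) is picked up by the second alternative and treated as part of an unquoted field.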
Related
I have to split a String wherever ^ and _live occur. I am able to split on ^ only, but I need to split on both ^ and _live. The result should be
[ab,cb,db,qw]
How can this be done?
String usergroup="ab_live^cb_live^db_live^qw_live";
String[] userGroupParts = usergroup.split("\\^");
List<String> listUserGroupParts = Arrays.asList(userGroupParts);
Set<String> SMGroupDetails = new HashSet<String>(listUserGroupParts);
We can say that the split separator should be _live^, or just _live at the end of the line.
That is why the regular expression must consist of _live and a capturing group (\^|$), which contains two alternatives separated by | (or):
the 1st alternative \^ matches the character ^ literally (by using the escape character before it), and the 2nd alternative $ asserts the position at the end of a line.
String[] userGroupParts = usergroup.split("_live(\\^|$)");
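As a quick check of the separator described above, here is a minimal, self-contained sketch. The class name SplitOnLiveSuffix and the method groupNames are hypothetical; the input string is the one from the question.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class for the "_live(\\^|$)" separator.
class SplitOnLiveSuffix {
    static List<String> groupNames(String usergroup) {
        // Separator: "_live" followed by either a literal ^ or the end of the input.
        // String.split drops the trailing empty string left after the last "_live".
        return Arrays.asList(usergroup.split("_live(\\^|$)"));
    }

    public static void main(String[] args) {
        // prints [ab, cb, db, qw]
        System.out.println(groupNames("ab_live^cb_live^db_live^qw_live"));
    }
}
```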
This should do it...
public static void main(String[] args) {
    String usergroup = "ab_live^cb_live^db_live^qw_live";
    String[] userGroupParts = usergroup.split("\\^");
    for (int i = 0; i < userGroupParts.length; i++) {
        userGroupParts[i] = userGroupParts[i].split("_")[0];
    }
    for (String s : userGroupParts) {
        System.out.println(s);
    }
}
i.e. you first split by ^ and then you cycle through the resulting strings splitting on _, retaining only the bit prior to the underscore
I would not use method split of class java.lang.String, but rather the Pattern and Matcher classes of package java.util.regex.
You want to create a list of all the runs of word characters that appear before the string _live, each optionally preceded by the literal character ^. The following code achieves this. (Explanations after the code.)
/* Required imports:
* java.util.ArrayList
* java.util.List
* java.util.regex.Matcher
* java.util.regex.Pattern
*/
String usergroup="ab_live^cb_live^db_live^qw_live";
Pattern pattern = Pattern.compile("\\^?(\\w+)_live");
Matcher matcher = pattern.matcher(usergroup);
List<String> listUserGroupParts = new ArrayList<>();
while (matcher.find()) {
listUserGroupParts.add(matcher.group(1));
}
System.out.println(listUserGroupParts);
The regular expression, i.e. the argument to method compile in the above code, looks for the following:
1. an optional literal character ^ (the first entry has none), followed by
2. at least one word character, followed by
3. the literal string _live
Note that part 2 is surrounded by parentheses, which makes it a capturing group.
The while loop searches usergroup for the next occurrence of the regular expression and each time it finds an occurrence, it extracts the contents of the group and adds it to the List.
The output when running the above code is:
[ab, cb, db, qw]
I have the string "B2BNewQuoteProcess". When I use Guava to convert from Camel Case to Lower Hyphen as follows:
CaseFormat.UPPER_CAMEL.to(CaseFormat.LOWER_HYPHEN,"B2BNewQuoteProcess");
I get "b2-b-new-quote-process".
What I am looking for is "b2b-new-quote-process"...
How do I do this in Java?
Edit
To prevent - at the beginning of a line, use the following instead of my original answer:
(?!^)(?=[A-Z][a-z])
Code
(?=[A-Z][a-z])
Replacement: -
Note: The regex above doesn't convert the uppercase characters to lowercase; it simply inserts - into the positions that should have them. The conversion of uppercase characters to lowercase occurs in the Java code below using .toLowerCase().
Usage
import java.util.*;
import java.lang.*;
import java.io.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
/* Name of the class has to be "Main" only if the class is public. */
class Ideone
{
public static void main (String[] args) throws java.lang.Exception
{
final String regex = "(?=[A-Z][a-z])";
final String string = "B2BNewQuoteProcess";
final String subst = "-";
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(string);
// The substituted value will be contained in the result variable
final String result = matcher.replaceAll(subst);
System.out.println("Substitution result: " + result.toLowerCase());
}
}
Explanation
(?=[A-Z][a-z]) Positive lookahead ensuring what follows is an uppercase ASCII letter followed by a lowercase ASCII letter. This is used as an assertion for the position. The replacement simply inserts a hyphen - into the positions that match this lookahead.
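For comparison, here is a compact sketch that uses the variant from the edit, (?!^)(?=[A-Z][a-z]), through String.replaceAll, so that an input starting with a capitalised word does not get a leading hyphen. The class and method names are invented for this example. It only handles the cases discussed here: a capital that is not followed by a lowercase letter (as in "quoteID") gets no hyphen.

```java
import java.util.Locale;

// Hypothetical helper illustrating the lookahead-based replacement.
class CamelToLowerHyphen {
    static String toLowerHyphen(String camelCase) {
        // Insert "-" before every uppercase letter that is followed by a
        // lowercase letter, except at the very start of the input.
        String hyphenated = camelCase.replaceAll("(?!^)(?=[A-Z][a-z])", "-");
        return hyphenated.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(toLowerHyphen("B2BNewQuoteProcess")); // b2b-new-quote-process
        System.out.println(toLowerHyphen("NewQuoteProcess"));    // new-quote-process
    }
}
```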
Camel case            Lower hyphen               Should have been
B2BNewQuoteProcess    b2-b-new-quote-process     b2b-new-quote-process
BaBNewQuoteProcess    ba-b-new-quote-process
B2NewQuoteProcess     b2-new-quote-process
BABNewQuoteProcess    b-a-b-new-quote-process    bab-new-quote-process
So:
Digits after a capital count as capital too
With N+1 "capitals" the first N capitals form a word
A repair of the wrong result would be:
String expr = "b2-b-new-quote-process";
expr = Pattern.compile("\\b[a-z]\\d*(-[a-z]\\d*)+\\b")
.matcher(expr).replaceAll(mr ->
mr.group().replace("-", ""));
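This searches, between word boundaries (\b), for a letter with any trailing digits, followed by one or more repetitions of a hyphen plus a letter with any trailing digits, and removes the hyphens inside each match. Note that Matcher.replaceAll with a lambda requires Java 9 or later.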
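On Java versions where Matcher.replaceAll does not accept a function, the same repair can be sketched with a find loop and appendReplacement. This is only an illustration of the idea under that assumption; the class name HyphenRepair and the method repair are made up, and the regex is the one from the snippet above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: merges runs of single letters (with optional digits)
// that were split apart by hyphens, e.g. "b2-b" becomes "b2b".
class HyphenRepair {
    private static final Pattern SPLIT_RUN =
            Pattern.compile("\\b[a-z]\\d*(-[a-z]\\d*)+\\b");

    static String repair(String hyphenated) {
        Matcher m = SPLIT_RUN.matcher(hyphenated);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Drop the hyphens inside the matched run only.
            String merged = m.group().replace("-", "");
            m.appendReplacement(out, Matcher.quoteReplacement(merged));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(repair("b2-b-new-quote-process"));  // b2b-new-quote-process
        System.out.println(repair("b-a-b-new-quote-process")); // bab-new-quote-process
        System.out.println(repair("b2-new-quote-process"));    // b2-new-quote-process
    }
}
```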
METHOD FOR VERSIONS BELOW JAVA 8
Use this method to convert any camel case string. You can select any type of separator.
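private String camelCaseToLowerHyphen(String s) {
    if (s.isEmpty()) {
        return s;
    }
    // Change the separator here if you want something other than a hyphen.
    final String separator = "-";
    StringBuilder parsedString = new StringBuilder(s.substring(0, 1).toLowerCase());
    for (char c : s.substring(1).toCharArray()) {
        if (Character.isUpperCase(c)) {
            // Start a new word before every uppercase letter.
            parsedString.append(separator).append(Character.toLowerCase(c));
        } else {
            parsedString.append(c);
        }
    }
    return parsedString.toString().toLowerCase();
}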
Original code taken from: http://www.java2s.com/example/java-utility-method/string-camel-to-hyphen-index-0.html
When I use Java's regular-expression pattern.matcher(), the source does not match the regex, but I expect it to match.
String source = "ONE.TWO"
String regex = "^ONE\\.TWO\\..*"
import java.util.regex.Pattern;
public class Test {
public static void main(String[] args) {
test();
}
public static void test() {
Test stringDemo = new Test();
stringDemo.testMatcher();
}
public void testMatcher() {
String source = "ONE.TWO";
String regex = "^ONE\\.TWo\\..*";
// The result is false ("not match"), but the expected result is true ("match")
matcher(source, regex);
}
public void matcher(String source, String regex) {
Pattern pattern = Pattern.compile(regex);
boolean match = pattern.matcher(source).matches();
if (match) {
System.out.println("match");
} else {
System.out.println("not match");
}
}
}
In your code, your regular expression expects the o in TWO to be lower case and expects it to be followed by a literal dot (.).
Try:
String source = "ONE.TWo.";
This will match your regular expression as coded in your question.
The expression \. means match a literal dot (rather than any character). When you code this into a Java String, you have to escape the backslash with another backslash, so it becomes "\\.".
The .* on the end of the expression means "match zero or more of any character (except line-break)".
So this would also match:
String source = "ONE.TWo.blah blah";
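The points above can be checked with a small sketch. The class name FullMatchDemo and the helper fullMatch are hypothetical; the first three calls use the regex exactly as coded in the question, and the last pattern (without the required trailing dot) is an assumption about what the asker probably wanted.

```java
import java.util.regex.Pattern;

// Hypothetical demo of Matcher.matches(), which must consume the whole input.
class FullMatchDemo {
    static boolean fullMatch(String regex, String source) {
        return Pattern.compile(regex).matcher(source).matches();
    }

    public static void main(String[] args) {
        String asCoded = "^ONE\\.TWo\\..*";
        System.out.println(fullMatch(asCoded, "ONE.TWO"));           // false: wrong case, no trailing dot
        System.out.println(fullMatch(asCoded, "ONE.TWo."));          // true
        System.out.println(fullMatch(asCoded, "ONE.TWo.blah blah")); // true
        // Assumed intent: uppercase O and no mandatory dot after TWO.
        System.out.println(fullMatch("^ONE\\.TWO.*", "ONE.TWO"));    // true
    }
}
```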
Well, it doesn't match for two reasons:
Your regex "^ONE\\.TWo\\..*" is case sensitive, so TWo cannot match TWO.
And your regex expects a . character after TWO, while your string "ONE.TWO" doesn't have one.
Use the following Regex, to match your source string:
String regex = "^ONE\\.TWO\\.*.*";
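Pattern matching is case sensitive by default. In your case the source has an uppercase O and the regex a lowercase o.
So you have to add Pattern.CASE_INSENSITIVE or change the case of the o. (This fixes only the case mismatch; the pattern also requires a literal dot after TWO, which the source "ONE.TWO" does not have.)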
Pattern pattern = Pattern.compile(regex,Pattern.CASE_INSENSITIVE );
or
String regex = "^ONE\\.TWO\\..*";
Your regex is a bit incorrect. You have an extra dot here:
String regex = "^ONE\\.TWO\\..*";  // the \\. after TWO is the extra dot
Try this one, without the dot:
String regex = "^ONE\\.TWO.*";
String regex = "^ONE\\.TWO\\..*"
In a Java string literal, the double backslash \\ produces a single backslash in the regex, so \\. is the regex \. and matches a literal dot in the source string.
The .* at the end matches any character, zero or more times, except line breaks.
To match the regex, your source should look like
String source = "ONE.TWO.three blah ##$ etc"; or
String source = "ONE.TWO.123##$ etc";
Basically, it is any string that starts with ONE.TWO. and contains no line breaks.
I have following regular expression
(?i)\b((https?:\/\/www\.)|(https?:\/\/)|(www\.))?(localhost).*\b
and following url
http://localhost:8081/saman/ab/cde/fgh/ijkl.jsf?gdi=ff8081abcdef02a011b0af032170001&ci=
It matches when tried with both https://regex101.com/ and http://rubular.com/r/kyiKS9OlsM
But in Java, when there is a special character at the end, the URL does not match
import java.text.Format;
import java.text.MessageFormat;
import java.util.regex.Pattern;
public class JavaApplication1 {
/**
* @param args the command line arguments
*/
private static final String URL_MATCH_REGEX = "(?i)\\b((https?:\\/\\/www\\.)|(https?:\\/\\/)|(www\\.))?({0}).*\\b";
private static final Format format = new MessageFormat(URL_MATCH_REGEX);
static String regex = "";
static String url = "http://localhost:8081/saman/ab/cde/fgh/ijkl.jsf?gdi=ff8081abcdef02a011b0af032170001&ci=";
public static void main(String[] args) {
try {
regex = format.format(new Object[]{replaceDomainToUseInRegex("localhost")});
System.out.println(regex);
Pattern pattern = Pattern.compile(regex);
System.out.println(pattern.matcher( url ).matches());
} catch (Exception e) {
}
}
private static String replaceDomainToUseInRegex(String domain) {
return domain.replace(".", "\\.").replace("/", "\\/").replace("?", "\\?");
}
}
Can anyone help me to figure out the issue here?
Your problem is that you're using two different kinds of matches. Java's matches() requires the entire string to match the regular expression. regex101.com does not. So it says there's a match if any substring of your input string matches the regex. However, in regex101.com, you can get the same kind of match by putting ^ in the front of the regex and $ at the end; now it requires the entire string to match. And it doesn't match.
(\b matches a "word boundary"; it matches the "zero-width substring" between a non-word character and a word character (in either order), or between a word character and the beginning or end of the string. = is not a word character, thus \b doesn't match the position between = and the end of the string.)
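The difference between the two kinds of match can be reproduced with a short sketch. The class and method names are invented for illustration; the pattern is the one from the question with localhost filled in, and the second URL (ending in a word character) is an assumed variation used only to show the contrast.

```java
import java.util.regex.Pattern;

// Hypothetical demo: matches() needs the whole input, find() needs any substring.
class MatchesVersusFind {
    private static final Pattern URL = Pattern.compile(
            "(?i)\\b((https?:\\/\\/www\\.)|(https?:\\/\\/)|(www\\.))?(localhost).*\\b");

    static boolean wholeInputMatches(String url) {
        return URL.matcher(url).matches();
    }

    static boolean someSubstringMatches(String url) {
        return URL.matcher(url).find();
    }

    public static void main(String[] args) {
        String endsWithEquals =
                "http://localhost:8081/saman/ab/cde/fgh/ijkl.jsf?gdi=ff8081abcdef02a011b0af032170001&ci=";
        String endsWithDigit = endsWithEquals + "1"; // assumed variation

        System.out.println(wholeInputMatches(endsWithEquals));    // false: no \b after the final '='
        System.out.println(someSubstringMatches(endsWithEquals)); // true: the match stops before '='
        System.out.println(wholeInputMatches(endsWithDigit));     // true: \b holds after the final '1'
    }
}
```

If the whole URL must be accepted regardless of its last character, dropping the trailing \b from the pattern is one option to consider.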
I'm looking for the regex pattern, not the Java code, to match the last word in an English (or European language) sentence. If the last word is, in this case, "hi" then I want to match "hi" and not "hi."
The regex (\w+)\.$ will match "hi.", whereas the output should be just "hi". What's the correct regex?
thufir@dur:~/NetBeansProjects/regex$
thufir@dur:~/NetBeansProjects/regex$ java -jar dist/regex.jar
trying
a b cd efg hi
matches:
hi
trying
a b cd efg hi.
matches:
thufir@dur:~/NetBeansProjects/regex$
code:
package regex;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String matchesLastWordFine = "a b cd efg hi";
lastWord(matchesLastWordFine);
String noMatchFound = matchesLastWordFine + ".";
lastWord(noMatchFound);
}
private static void lastWord(String sentence) {
System.out.println("\n\ntrying\n" + sentence + "\nmatches:");
Pattern pattern = Pattern.compile("(\\w+)$");
Matcher matcher = pattern.matcher(sentence);
String match = null;
while (matcher.find()) {
match = matcher.group();
System.out.println(match);
}
}
}
My code is in Java, but that's neither here nor there. I'm strictly looking for the regex, not the Java code. (Yes, I know it's possible to strip out the last character with Java.)
What regex should I put in the pattern?
You can use a lookahead assertion. For example, to match the sentence without the period:
[\w\s]+(?=\.)
and, for just the last word (the word before the "."):
[\w]+(?=\.)
If you need to have the whole match be the last word you can use lookahead.
\w+(?=(\.))
This matches a set of word characters that are followed by a period, without matching the period.
If you want the last word in the line, regardless of whether the line ends at the end of a sentence or not, you can use:
\w+(?=(\.?$))
Or if you want to also include ,!;: etc then
\w+(?=(\p{Punct}?$))
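As a rough check of the last pattern, the sketch below wraps it in a small helper and tries it on the sentences from the question. The class name LastWord and the method lastWord are hypothetical, and the helper returns null when nothing matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: last word of a line, ignoring one trailing punctuation mark.
class LastWord {
    private static final Pattern LAST_WORD =
            Pattern.compile("\\w+(?=(\\p{Punct}?$))");

    static String lastWord(String sentence) {
        Matcher m = LAST_WORD.matcher(sentence);
        // The lookahead only checks what follows; the match itself is just the word.
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(lastWord("a b cd efg hi"));  // hi
        System.out.println(lastWord("a b cd efg hi.")); // hi
        System.out.println(lastWord("a b cd efg hi!")); // hi
    }
}
```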
You can use matcher.group(1) to get the content of the first capturing group ((\w+) in your case). To say a little more, matcher.group(0) would return the full match. So your regex is almost correct. An improvement is related to your use of $, which matches only at the end of the line. Use it only if your sentence fills the line exactly!
With the regular expression (\w+)\p{Punct} you get a group count of 1, which means you get the match including the punctuation from matcher.group(0) and the match without the punctuation from matcher.group(1).
To write the regular expression in Java, use: "(\\w+)\\p{Punct}"
To test your regular expressions online with Java (and actually a lot of other languages) see RegexPlanet
By using the $ operator you will only get a match at the end of a line. So if you have multiple sentences on one line you will not get a match in the middle one.
So you should just use:
(\w+)\.
the capture group will give the correct match.
I don't understand why really, but this works:
package regex;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String matchesLastWordFine = "a b cd efg hi";
lastWord(matchesLastWordFine);
String noMatchFound = matchesLastWordFine + ".";
lastWord(noMatchFound);
}
private static void lastWord(String sentence) {
System.out.println("\n\ntrying\n" + sentence + "\nmatches:");
Pattern pattern = Pattern.compile("(\\w+)"); //(\w+)\.
Matcher matcher = pattern.matcher(sentence);
String match = null;
while (matcher.find()) {
match = matcher.group();
}
System.out.println(match);
}
}
I guess regex \w+ will match all the words (doh). Then the last word is what I was after. Too simple, really, I was trying to exclude punctuation, but I guess regex does that automagically for you..?