I can replace ABC(10,5) with (10)%(5) using:
replaceAll("ABC\\(([^,]*)\\,([^,]*)\\)", "($1)%($2)")
but I'm unable to figure out how to do it for ABC(ABC(20,2),5) or ABC(ABC(30,2),3+2).
If I'm able to convert to ((20)%(2))%5 how can I convert back to ABC(ABC(20,2),5)?
Thanks,
j
I am going to answer the first question. I was not able to do the task in a single replaceAll, and I don't think that is even achievable. However, if you use a loop, this should do the work for you:
String termString = "([0-9+\\-*/()%]*)";
String pattern = "ABC\\(" + termString + "\\," + termString + "\\)";
String [] strings = {"ABC(10,5)", "ABC(ABC(20,2),5)", "ABC(ABC(30,2),3+2)"};
for (String str : strings) {
while (true) {
String replaced = str.replaceAll(pattern, "($1)%($2)");
if (replaced.equals(str)) {
break;
}
str = replaced;
}
System.out.println(str);
}
I am assuming you are writing a parser for numeric expressions, hence the definition of a term as termString = "([0-9+\\-*/()%]*)". It outputs this:
(10)%(5)
((20)%(2))%(5)
((30)%(2))%(3+2)
EDIT: As per the OP's request, I am adding the code for decoding the strings. It is a bit more hacky than the forward scenario:
String [] encoded = {"(10)%(5)", "((20)%(2))%(5)", "((30)%(2))%(3+2)"};
String decodeTerm = "([0-9+\\-*ABC\\[\\],]*)";
String decodePattern = "\\(" + decodeTerm + "\\)%\\(" + decodeTerm + "\\)";
for (String str : encoded) {
while (true) {
String replaced = str.replaceAll(decodePattern, "ABC[$1,$2]");
if (replaced.equals(str)) {
break;
}
str = replaced;
}
str = str.replaceAll("\\[", "(");
str = str.replaceAll("\\]", ")");
System.out.println(str);
}
And the output is:
ABC(10,5)
ABC(ABC(20,2),5)
ABC(ABC(30,2),3+2)
You can start by evaluating the innermost reducible expressions first, until no more redexes exist. However, you have to take care of other commas and parentheses in the input. The solution of @BorisStrandjev is better and more bulletproof.
String infix(String expr) {
// Emit placeholders instead of '(' and ')' so that already converted terms
// still match [^,()]. Assumes the only parentheses in the input belong to ABC(...).
for (;;) {
String expr2 = expr.replaceAll("ABC\\(([^,()]*)\\,([^,()]*)\\)",
"<<$1>>%<<$2>>");
if (expr2.equals(expr))
break;
expr = expr2;
}
// Turn the placeholders back into real parentheses.
expr = expr.replace("<<", "(");
expr = expr.replace(">>", ")");
return expr;
}
You could use this Regular Expressions library https://github.com/florianingerl/com.florianingerl.util.regex , that also supports Recursive Regular Expressions.
Converting ABC(ABC(20,2),5) to ((20)%(2))%(5) looks like this:
Pattern pattern = Pattern.compile("(?<abc>ABC\\((?<arg1>(?:(?'abc')|[^,])+)\\,(?<arg2>(?:(?'abc')|[^)])+)\\))");
Matcher matcher = pattern.matcher("ABC(ABC(20,2),5)");
String replacement = matcher.replaceAll(new DefaultCaptureReplacer() {
@Override
public String replace(CaptureTreeNode node) {
if ("abc".equals(node.getGroupName())) {
return "(" + replace(node.getChildren().get(0)) + ")%(" + replace(node.getChildren().get(1)) + ")";
} else
return super.replace(node);
}
});
System.out.println(replacement);
assertEquals("((20)%(2))%(5)", replacement);
Converting back again, i.e. from ((20)%(2))%(5) to ABC(ABC(20,2),5) looks like this:
Pattern pattern = Pattern.compile("(?<fraction>(?<arg>\\(((?:(?'fraction')|[^)])+)\\))%(?'arg'))");
Matcher matcher = pattern.matcher("((20)%(2))%(5)");
String replacement = matcher.replaceAll(new DefaultCaptureReplacer() {
@Override
public String replace(CaptureTreeNode node) {
if ("fraction".equals(node.getGroupName())) {
return "ABC(" + replace(node.getChildren().get(0)) + "," + replace(node.getChildren().get(1)) + ")";
} else if ("arg".equals(node.getGroupName())) {
return replace(node.getChildren().get(0));
} else
return super.replace(node);
}
});
System.out.println(replacement);
assertEquals("ABC(ABC(20,2),5)", replacement);
You can try to rewrite the string using the Polish notation and then replace any % X Y with ABC(X,Y).
Here's the wiki link for the Polish notation.
The problem is that you need to find out which rewrite of ABC(X,Y) occurred first when you recursively replaced them in your string. The Polish notation is useful for "deciphering" the order that these rewrites occur and is widely used in expression evaluation.
You can do this by using a stack and recording which replacement occurred first: find the innermost set of parentheses, push only that expression onto the stack, then remove it from your string. When you want to reconstruct the original expression, just start at the top of the stack and apply the reverse transformation (X)%(Y) -> ABC(X,Y).
This is somewhat a form of the Polish notation, with the only difference being that you don't store the entire expression as a string, but rather store it in a stack for easier processing.
In short, when replacing, start with the inner-most terms (the ones that have no parentheses in them) and apply the reverse replace.
It may be helpful to use (X)%(Y) -> ABC{X,Y} as an intermediary rewrite rule, then rewrite the curly brackets as round brackets. This way it will be easier to determine which is the inner-most term, as the new terms won't use round brackets. Also it is easier to implement, but not as elegant.
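The innermost-first idea with the curly-bracket intermediary could look roughly like the sketch below. The class and method names (InnermostFirst, decode) are made up for illustration, and the sketch assumes that the operands never contain curly brackets themselves.

```java
public class InnermostFirst {
    // Both operands must be free of round brackets, so a match can only be
    // an innermost (X)%(Y) term.
    private static final String INNERMOST = "\\(([^()]*)\\)%\\(([^()]*)\\)";

    static String decode(String expr) {
        String previous;
        do {
            previous = expr;
            // Rewrite innermost terms with curly brackets, so the rewritten
            // term can itself become an operand on the next pass.
            expr = expr.replaceAll(INNERMOST, "ABC{$1,$2}");
        } while (!expr.equals(previous));
        // Finally turn the intermediary curly brackets into round ones.
        return expr.replace('{', '(').replace('}', ')');
    }

    public static void main(String[] args) {
        System.out.println(decode("((20)%(2))%(5)"));
        System.out.println(decode("((30)%(2))%(3+2)"));
    }
}
```

Each pass only touches terms whose operands contain no round brackets, which is what makes the order of the rewrites unambiguous.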
I have a string vaguely like this:
foo,bar,c;qual="baz,blurb",d;junk="quux,syzygy"
that I want to split by commas -- but I need to ignore commas in quotes. How can I do this? Seems like a regexp approach fails; I suppose I can manually scan and enter a different mode when I see a quote, but it would be nice to use preexisting libraries. (edit: I guess I meant libraries that are already part of the JDK or already part of a commonly-used libraries like Apache Commons.)
the above string should split into:
foo
bar
c;qual="baz,blurb"
d;junk="quux,syzygy"
note: this is NOT a CSV file, it's a single string contained in a file with a larger overall structure
Try:
public class Main {
public static void main(String[] args) {
String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String[] tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
for(String t : tokens) {
System.out.println("> "+t);
}
}
}
Output:
> foo
> bar
> c;qual="baz,blurb"
> d;junk="quux,syzygy"
In other words: split on the comma only if that comma has zero, or an even number of quotes ahead of it.
Or, a bit friendlier for the eyes:
public class Main {
public static void main(String[] args) {
String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String otherThanQuote = " [^\"] ";
String quotedString = String.format(" \" %s* \" ", otherThanQuote);
String regex = String.format("(?x) "+ // enable comments, ignore white spaces
", "+ // match a comma
"(?= "+ // start positive look ahead
" (?: "+ // start non-capturing group 1
" %s* "+ // match 'otherThanQuote' zero or more times
" %s "+ // match 'quotedString'
" )* "+ // end group 1 and repeat it zero or more times
" %s* "+ // match 'otherThanQuote'
" $ "+ // match the end of the string
") ", // stop positive look ahead
otherThanQuote, quotedString, otherThanQuote);
String[] tokens = line.split(regex, -1);
for(String t : tokens) {
System.out.println("> "+t);
}
}
}
which produces the same as the first example.
EDIT
As mentioned by @MikeFHay in the comments:
I prefer using Guava's Splitter, as it has saner defaults (see discussion above about empty matches being trimmed by String#split()), so I did:
Splitter.on(Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))
While I do like regular expressions in general, for this kind of state-dependent tokenization I believe a simple parser (which in this case is much simpler than that word might make it sound) is probably a cleaner solution, in particular with regards to maintainability, e.g.:
String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
List<String> result = new ArrayList<String>();
int start = 0;
boolean inQuotes = false;
for (int current = 0; current < input.length(); current++) {
if (input.charAt(current) == '\"') inQuotes = !inQuotes; // toggle state
else if (input.charAt(current) == ',' && !inQuotes) {
result.add(input.substring(start, current));
start = current + 1;
}
}
result.add(input.substring(start));
If you don't care about preserving the commas inside the quotes you could simplify this approach (no handling of start index, no last character special case) by replacing your commas in quotes by something else and then split at commas:
String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
StringBuilder builder = new StringBuilder(input);
boolean inQuotes = false;
for (int currentIndex = 0; currentIndex < builder.length(); currentIndex++) {
char currentChar = builder.charAt(currentIndex);
if (currentChar == '\"') inQuotes = !inQuotes; // toggle state
if (currentChar == ',' && inQuotes) {
builder.setCharAt(currentIndex, ';'); // or '♡', and replace later
}
}
List<String> result = Arrays.asList(builder.toString().split(","));
http://sourceforge.net/projects/javacsv/
https://github.com/pupi1985/JavaCSV-Reloaded
(fork of the previous library that will allow the generated output to have Windows line terminators \r\n when not running Windows)
http://opencsv.sourceforge.net/
CSV API for Java
Can you recommend a Java library for reading (and possibly writing) CSV files?
Java lib or app to convert CSV to XML file?
I would not advise the regex answer from Bart; I find the parsing solution better in this particular case (as Fabian proposed). I tried the regex solution and my own parsing implementation, and found that:
Parsing is much faster than splitting with a lookahead regex: ~20 times faster for short strings, ~40 times faster for long strings.
Regex fails to find the empty string after the last comma. That was not in the original question though; it was my own requirement.
My solution and test are below.
String tested = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\",";
long start = System.nanoTime();
String[] tokens = tested.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
long timeWithSplitting = System.nanoTime() - start;
start = System.nanoTime();
List<String> tokensList = new ArrayList<String>();
boolean inQuotes = false;
StringBuilder b = new StringBuilder();
for (char c : tested.toCharArray()) {
switch (c) {
case ',':
if (inQuotes) {
b.append(c);
} else {
tokensList.add(b.toString());
b = new StringBuilder();
}
break;
case '\"':
inQuotes = !inQuotes;
default:
b.append(c);
break;
}
}
tokensList.add(b.toString());
long timeWithParsing = System.nanoTime() - start;
System.out.println(Arrays.toString(tokens));
System.out.println(tokensList.toString());
System.out.printf("Time with splitting:\t%10d\n",timeWithSplitting);
System.out.printf("Time with parsing:\t%10d\n",timeWithParsing);
Of course, you are free to change the switch to else-ifs in this snippet if you feel uncomfortable with its ugliness. Note the lack of a break after the quote case: the fall-through to default is intentional, so that the quote character is appended too. StringBuilder was chosen over StringBuffer by design to increase speed, since thread safety is irrelevant here.
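As a side note on the missing empty string after the last comma: String.split discards trailing empty strings when called without a limit and keeps them when given a negative limit. A small sketch of the difference (the class name SplitLimitDemo and its helper methods are made up for illustration):

```java
import java.util.Arrays;

public class SplitLimitDemo {
    // Same lookahead idea as above: split on commas followed by an even number of quotes.
    static final String REGEX = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    static String[] splitDefault(String s) {
        return s.split(REGEX);      // trailing empty strings are removed
    }

    static String[] splitKeepTrailing(String s) {
        return s.split(REGEX, -1);  // negative limit keeps trailing empty strings
    }

    public static void main(String[] args) {
        String tested = "foo,bar,c;qual=\"baz,blurb\",";
        System.out.println(Arrays.toString(splitDefault(tested)));
        System.out.println(Arrays.toString(splitKeepTrailing(tested)));
    }
}
```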
You're in that annoying boundary area where regexps almost won't do (as has been pointed out by Bart, escaping the quotes would make life hard) , and yet a full-blown parser seems like overkill.
If you are likely to need greater complexity any time soon I would go looking for a parser library. For example this one
I was impatient and chose not to wait for answers... for reference it doesn't look that hard to do something like this (which works for my application, I don't need to worry about escaped quotes, as the stuff in quotes is limited to a few constrained forms):
final static private Pattern splitSearchPattern = Pattern.compile("[\",]");
private List<String> splitByCommasNotInQuotes(String s) {
if (s == null)
return Collections.emptyList();
List<String> list = new ArrayList<String>();
Matcher m = splitSearchPattern.matcher(s);
int pos = 0;
boolean quoteMode = false;
while (m.find())
{
String sep = m.group();
if ("\"".equals(sep))
{
quoteMode = !quoteMode;
}
else if (!quoteMode && ",".equals(sep))
{
int toPos = m.start();
list.add(s.substring(pos, toPos));
pos = m.end();
}
}
if (pos < s.length())
list.add(s.substring(pos));
return list;
}
(exercise for the reader: extend to handling escaped quotes by looking for backslashes also.)
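One possible way to do that exercise is sketched below. It is only an illustration under the assumption that a backslash escapes whatever single character follows it; the class name EscapedQuoteSplitter is made up, and the escape sequences are left untouched in the returned tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapedQuoteSplitter {
    // A backslash plus the character it escapes, or a quote, or a comma.
    // The escape alternative comes first so that \" is consumed as one unit.
    private static final Pattern TOKEN = Pattern.compile("\\\\.|[\",]");

    static List<String> split(String s) {
        List<String> list = new ArrayList<>();
        if (s == null) {
            return list;
        }
        Matcher m = TOKEN.matcher(s);
        int pos = 0;
        boolean quoteMode = false;
        while (m.find()) {
            String sep = m.group();
            if (sep.charAt(0) == '\\') {
                continue; // escaped character: neither a delimiter nor a quote toggle
            }
            if ("\"".equals(sep)) {
                quoteMode = !quoteMode;
            } else if (!quoteMode) { // an unquoted comma
                list.add(s.substring(pos, m.start()));
                pos = m.end();
            }
        }
        if (pos < s.length()) {
            list.add(s.substring(pos));
        }
        return list;
    }

    public static void main(String[] args) {
        // The input is: a,"b\",c",d
        System.out.println(split("a,\"b\\\",c\",d"));
    }
}
```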
Try a lookaround like (?!\"),(?!\"). This should match , that are not surrounded by ".
The simplest approach is not to match delimiters, i.e. commas, with a complex additional logic to match what is actually intended (the data which might be quoted strings), just to exclude false delimiters, but rather match the intended data in the first place.
The pattern consists of two alternatives, a quoted string ("[^"]*" or ".*?") or everything up to the next comma ([^,]+). To support empty cells, we have to allow the unquoted item to be empty and to consume the next comma, if any, and use the \\G anchor:
Pattern p = Pattern.compile("\\G\"(.*?)\",?|([^,]*),?");
The pattern also contains two capturing groups to get either, the quoted string’s content or the plain content.
Then, with Java 9, we can get an array as
String[] a = p.matcher(input).results()
.map(m -> m.group(m.start(1)<0? 2: 1))
.toArray(String[]::new);
whereas older Java versions need a loop like
for(Matcher m = p.matcher(input); m.find(); ) {
String token = m.group(m.start(1)<0? 2: 1);
System.out.println("found: "+token);
}
Adding the items to a List or an array is left as an exercise to the reader.
For Java 8, you can use the results() implementation of this answer, to do it like the Java 9 solution.
For mixed content with embedded strings, like in the question, you can simply use
Pattern p = Pattern.compile("\\G((\"(.*?)\"|[^,])*),?");
But then, the strings are kept in their quoted form.
What about a one-liner using String.split()?
String s = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String[] split = s.split( "(?<!\".{0,255}[^\"]),|,(?![^\"].*\")" );
A regular expression is not capable of handling escaped characters. For my application, I needed the ability to escape quotes and spaces (my separator is spaces, but the code is the same).
Here is my solution in Kotlin (the language from this particular application), based on the one from Fabian Steeg:
fun parseString(input: String): List<String> {
val result = mutableListOf<String>()
var inQuotes = false
var inEscape = false
val current = StringBuilder()
for (i in input.indices) {
// If this character is escaped, add it without looking
if (inEscape) {
inEscape = false
current.append(input[i])
continue
}
when (val c = input[i]) {
'\\' -> inEscape = true // escape the next character, \ isn't added to result
',' -> if (inQuotes) {
current.append(c)
} else {
result += current.toString()
current.clear()
}
'"' -> inQuotes = !inQuotes
else -> current.append(c)
}
}
if (current.isNotEmpty()) {
result += current.toString()
}
return result
}
I think this is not a place to use regular expressions. Contrary to other opinions, I don't think a parser is overkill. It's about 20 lines and fairly easy to test.
Rather than use lookahead and other crazy regex, just pull out the quotes first. That is, for every quote grouping, replace that grouping with __IDENTIFIER_1 or some other indicator, and map that grouping to a map of string,string.
After you split on comma, replace all mapped identifiers with the original string values.
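A rough sketch of this placeholder idea follows. The class name PlaceholderSplitter and the exact identifier format are made up for illustration, and the sketch assumes the input never contains text that looks like one of the generated identifiers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderSplitter {
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]*\"");

    static List<String> split(String input) {
        // Step 1: pull out every quoted group and remember it under an identifier.
        Map<String, String> quotedParts = new HashMap<>();
        StringBuffer masked = new StringBuffer();
        Matcher m = QUOTED.matcher(input);
        int counter = 0;
        while (m.find()) {
            // The trailing "__" keeps identifier 1 from being a prefix of identifier 10.
            String key = "__IDENTIFIER_" + (counter++) + "__";
            quotedParts.put(key, m.group());
            m.appendReplacement(masked, key);
        }
        m.appendTail(masked);

        // Step 2: split on plain commas, then put the quoted groups back.
        List<String> result = new ArrayList<>();
        for (String token : masked.toString().split(",", -1)) {
            for (Map.Entry<String, String> e : quotedParts.entrySet()) {
                token = token.replace(e.getKey(), e.getValue());
            }
            result.add(token);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(split("foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\""));
    }
}
```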
I would do something like this:
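List<String> parts = new ArrayList<String>();
StringBuilder token = new StringBuilder();
boolean foundQuote = false;
for (char c : currentString.toCharArray())
{
if (c == '"')
{
foundQuote = !foundQuote; // entering or leaving a quoted section
}
if (c == ',' && !foundQuote)
{
parts.add(token.toString()); // split only on commas outside quotes
token.setLength(0);
}
else
{
token.append(c); // inside quotes: do nothing special, keep the character
}
}
parts.add(token.toString());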
I am creating a function that strips the illegal wildcard patterns from the input string. The ideal solution should use a single regex expression, if at all possible.
The illegal wildcard patterns are: %% and %_%. Each instance of those should be replaced with %.
Here's the rub... I'm trying to perform some fuzz testing by running the function against various inputs to try to make it and break it.
It works for the most part; however, with complicated inputs, it doesn't.
The rest of this question has been updated:
The following inputs should return empty string (not an exhaustive list):
The following inputs should return % (not an exhaustive list).
%_%
%%
%%_%%
%_%%%
%%_%_%
%%_%%%_%%%_%
There will be cases where there are other characters with the input... like:
Foo123%_%
Should return "Foo123%"
B4r$%_%
Should return "B4r$%"
B4rs%%_%
Should return "B4rs%"
%%Lorem_%%
Should return "%Lorem_%"
I have tried using several different patterns and my tests are failing.
String input = "%_%%%%_%%%_%";
// old method:
public static String ancientMethod1(String input){
if (input == null)
return "";
return input.replaceAll("%_%", "").replaceAll("%%", ""); // Output: ""
}
// Attempt 1:
// Doesn't quite work right.
// "A%%" is returned as "A%%" instead of "A%"
public static String newMethod1(String input) {
String result = input;
while (result.contains("%%") || result.contains("%_%"))
result = result.replaceAll("%%","%").replaceAll("%_%","%");
if (result.equals("%"))
return "";
return input;
}
// Attempt 2:
// Succeeds, but I would like to simplify this:
public static String newMethod2(String input) {
if (input == null)
return "";
String illegalPattern1 = "%%";
String illegalPattern2 = "%_%";
String result = input;
while (result.contains(illegalPattern1) || result.contains(illegalPattern2)) {
result = result.replace(illegalPattern1, "%");
result = result.replace(illegalPattern2, "%");
}
if (result.equals("%") || result.equals("_"))
return "";
return result;
}
Here's a more complete defined example of how I'm using this: https://gist.github.com/sometowngeek/697c839a1bf1c9ee58be283b1396cf2e
This regular expression string matches all your examples:
"%(?:_?%)+"
It matches strings consisting of a '%' character followed by one or more sequences consisting of zero or one '_' character and one '%' character (close to literal translation), which is another way of saying what I did in comments: "a sequence of '%' and '_' characters, beginning and ending with '%', and not containing two consecutive '_' characters".
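To see the pattern in action, here is a minimal sketch that applies it with replaceAll to some of the inputs from the question. The class and method names (WildcardCleaner, collapse) are made up, and the null handling simply mirrors the question's own methods.

```java
public class WildcardCleaner {
    // Collapses every run matched by %(?:_?%)+ into a single '%'.
    static String collapse(String input) {
        if (input == null) {
            return "";
        }
        return input.replaceAll("%(?:_?%)+", "%");
    }

    public static void main(String[] args) {
        String[] inputs = {"%_%", "%%_%%%_%%%_%", "Foo123%_%", "B4r$%_%", "%%Lorem_%%"};
        for (String in : inputs) {
            System.out.println(in + " -> " + collapse(in));
        }
    }
}
```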
I'm not quite sure whether the listed inputs cover all cases. If they do, an expression with start and end anchors might be more applicable here, applied either to each input separately or with something similar to:
^%{1,3}(_%{1,3})?(_%{1,3})?(_%)?$
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String regex = "^%{1,3}(_%{1,3})?(_%{1,3})?(_%)?$";
final String string = "%_%\n"
+ "%%\n"
+ "%%_%%\n"
+ "%%%_%%%\n"
+ "%_%%%\n"
+ "%%%_%\n"
+ "%%_%_%\n"
+ "%%_%%%_%%%_%";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}
Your newMethod1 actually works, except you have a typo: you're returning the input parameter, not the result of your processing!
Change:
return input; // oops!
to:
return result;
Also, because you're not using regex, you should use replace() rather than replaceAll(), ie:
result = result.replace("%%","%").replace("%_%","%"); // still replaces all occurrences
replace() still replaces all occurrences.
BTW, although not as strict, this works for all of your (currently) posted examples:
public static String myMethod(String input) {
return input.replaceAll("%[%_]*", "%");
}
It looks like all the patterns start with %, then have 0+ % or _ chars and end with %.
Use a mere
input = input.replaceAll("%[%_]*%", "%");
Details
% - a % char
[%_]* - 0 or more % or _ chars
% - a % char.
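The patterns suggested in these answers agree on the posted examples but differ on other inputs. The following sketch compares them on a few inputs that are not from the question; the class and method names are made up for illustration.

```java
public class PatternComparison {
    // '%', then one or more groups of an optional single '_' and a '%'.
    static String strict(String s) {
        return s.replaceAll("%(?:_?%)+", "%");
    }

    // '%', any mix of '%' and '_', then a closing '%'.
    static String anyRun(String s) {
        return s.replaceAll("%[%_]*%", "%");
    }

    // '%' followed by any mix of '%' and '_', no closing '%' required.
    static String loosest(String s) {
        return s.replaceAll("%[%_]*", "%");
    }

    public static void main(String[] args) {
        for (String in : new String[] {"%%_%", "%__%", "a%_b"}) {
            System.out.println(in + " -> " + strict(in) + " | " + anyRun(in) + " | " + loosest(in));
        }
    }
}
```

For "%__%" only the first pattern leaves the input alone, because it does not allow two consecutive underscores; for "a%_b" only the last pattern changes anything, because it also swallows an underscore that is not followed by '%'.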
My String is
String s="(decode(W_Employee_D_3.Fst_Name,NULL,"
+ "decode(W_Employee_D_3.Last_Name,NULL,"
+ "decode(W_Employee_D_3.Mid_Name,NULL,'emptyString','midnamevailable'),'lastnameavil'),"
+ "concat(CONCAT ( concat(W_Employee_D_3.last_Name, ' ,'),"
+ "W_Employee_D_3.Fst_Name ),W_Employee_D_3.Mid_Name)))";
I need to write some generalized logic that gives fun1=decode, fun2=decode, fun3=decode, fun4=concat, fun5=Concat, fun6=Concat and their respective parameters (para1, para2, para3) in any type of collection in Java.
Parameters are the values passed to a function,
for example
concat(W_Employee_D_3.last_Name, ' ,')
Here concat is the function and the parameters are W_Employee_D_3.last_Name and ' ,'.
The string can contain any number of functions and parameters, and the functions can differ.
This looks like a mean kind of task. Maybe you need an ANTLR grammar?
I'll just offer an approach to the less tedious part of the work.
For this kind of nesting, one would either
parse from left to right using a stack, or
use reducible expressions, starting with the innermost redexes found, and build some result.
I use the latter, with a map to hold the resulting structures, each replaced in the string by some "variable".
We could introduce variables for
a string literal, as that could contain commas, parentheses and other text that must not be parsed,
a function call
I give a solution for a simplified case:
String s = "a(b(c),d(e),f,g(h(),i))";
// Variables are like "§0013" (4 digits)
Map<String, String> variables = new HashMap<>();
int maxVar = 0;
String expr = "(\\w+|§\\d{4})"; // Either simple term or var name.
Pattern callRedex = Pattern.compile("\\w+\\((" + expr + "(," + expr + ")*)?\\)");
boolean reduced;
do {
reduced = false;
Matcher m = callRedex.matcher(s);
StringBuffer sb = new StringBuffer();
while (m.find()) {
String value = m.group();
String var = String.format("§%04d", maxVar++);
variables.put(var, value);
m.appendReplacement(sb, var);
reduced = true;
}
m.appendTail(sb);
s = sb.toString();
} while (reduced);
Now one has the function calls as variables. Their values contain variable names, which in turn have to be replaced again.
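Replacing the variable names inside the values could be done recursively, for instance as in the sketch below. The names VariableExpander, expand and demoVariables are made up for illustration, and the map contents are what the reduction would produce for the balanced input a(b(c),d(e),f,g(h(),i)).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VariableExpander {
    static final String MARK = "\u00A7"; // the section sign used as variable prefix
    private static final Pattern VAR = Pattern.compile(MARK + "\\d{4}");

    // Replaces every variable in text by its value, expanding nested variables first.
    static String expand(String text, Map<String, String> variables) {
        StringBuffer sb = new StringBuffer();
        Matcher m = VAR.matcher(text);
        while (m.find()) {
            String value = variables.get(m.group());
            // Unknown variables are left as they are.
            String expanded = (value == null) ? m.group() : expand(value, variables);
            m.appendReplacement(sb, Matcher.quoteReplacement(expanded));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    // Hypothetical map contents for the simplified example.
    static Map<String, String> demoVariables() {
        Map<String, String> v = new HashMap<>();
        v.put(MARK + "0000", "b(c)");
        v.put(MARK + "0001", "d(e)");
        v.put(MARK + "0002", "h()");
        v.put(MARK + "0003", "g(" + MARK + "0002,i)");
        v.put(MARK + "0004", "a(" + MARK + "0000," + MARK + "0001,f," + MARK + "0003)");
        return v;
    }

    public static void main(String[] args) {
        System.out.println(expand(MARK + "0004", demoVariables()));
    }
}
```

Instead of expanding back to text, the same recursion could build a tree of function names and parameter lists, which is closer to what the question asks for.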
I am trying to split a String by a byte value, like "first\x00second" with 0x00 as the splitter. I found that the compiler cannot combine the \x token with a variable.
static public ArrayList split_by_byte(String value, byte spliter) {
if (spliter < 0)
throw new IllegalArgumentException("Negative separator value: " + spliter);
ArrayList<String> result = new ArrayList();
String[] groups = value.split("[\\x" + spliter + "]");
for (String group : groups) {
result.add(group);
}
return result;
}
How can I use a variable value in patterns like \xNN?
In regex you cannot use \x in a single-quoted / non-interpolated string. It must be seen by the lexer.
because tilde isn’t a meta-character.
Add use regex "debug" and you will see what is actually happening.
You can also use the Pattern and Matcher classes and the split method...
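In Java, one way to get a variable into a \xNN escape is to format the byte as exactly two hex digits before building the regex. The sketch below assumes that approach; the class and method names (ByteSplitter, splitByByte) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ByteSplitter {
    static List<String> splitByByte(String value, byte splitter) {
        if (splitter < 0) {
            throw new IllegalArgumentException("Negative separator value: " + splitter);
        }
        // \xhh needs exactly two hex digits, so 0 becomes \x00 and 10 becomes \x0A.
        String regex = String.format("\\x%02X", splitter);
        return new ArrayList<>(Arrays.asList(value.split(regex)));
    }

    public static void main(String[] args) {
        String input = "first" + (char) 0 + "second";
        System.out.println(splitByByte(input, (byte) 0));
    }
}
```

An alternative that avoids the hex escape altogether would be to split on Pattern.quote(String.valueOf((char) splitter)).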