Regex (java) help

Regex (java) help - java

How do I split this comma+quote delimited String into a set of strings:
String test = "[\"String 1\",\"String, two\"]";
String[] embeddedStrings = test.split("<insert magic regex here>");
//note: It should also work for this string, with a space after the separating comma: "[\"String 1\", \"String, two\"]";
assertEquals("String 1", embeddedStrings[0]);
assertEquals("String, two", embeddedStrings[1]);
I'm fine with trimming the square brackets as a first step. But the catch is, even if I do that, I can't just split on a comma because embedded strings can have commas in them.
Using Apache StringUtils is also acceptable.

You could also use one of the many open source small libraries for parsing CSVs, e.g. opencsv or Commons CSV.

If you can remove [\" from the start of the outer string and \"] from the end of it
to become:
String test = "String 1\",\"String, two";
You can use:
test.split("\",\"");

This is extremely fragile and should be avoided, but you could match the string literals.
Pattern p = Pattern.compile("\"((?:[^\"]+|\\\\\")*)\"");
String test = "[\"String 1\",\"String, two\"]";
Matcher m = p.matcher(test);
ArrayList<String> embeddedStrings = new ArrayList<String>();
while (m.find()) {
embeddedStrings.add(m.group(1));
}
The regular expression assumes that double quotes in the input are escaped using \" and not "". The pattern would break if the input had an odd number of (unescaped) double quotes.

Brute-force method, some of this may be pseudocode and I think there's a fencepost problem when setting currStart and/or String.substring(). This assumes that brackets are already removed.
boolean inquote = false;
List strings = new ArrayList();
int currStart=0;
for (int i=0; i<test.length(); i++) {
char c = test.charAt(i);
if (c == ',' && ! inquote) {
strings.add(test.substring(currStart, i);
currStart = i;
}
else if (c == ' ' && currStart + == i)
currStart = i; // strip off spaces after a comma
else if (c == '"')
inquote != inquote;
}
strings.add(test.substring(currStart,i));
String embeddedStrings = strings.toArray();

Related

Splitting csv lines that use "escaped" delimiter [duplicate]

I have a string vaguely like this:
foo,bar,c;qual="baz,blurb",d;junk="quux,syzygy"
that I want to split by commas -- but I need to ignore commas in quotes. How can I do this? Seems like a regexp approach fails; I suppose I can manually scan and enter a different mode when I see a quote, but it would be nice to use preexisting libraries. (edit: I guess I meant libraries that are already part of the JDK or already part of a commonly-used libraries like Apache Commons.)
the above string should split into:
foo
bar
c;qual="baz,blurb"
d;junk="quux,syzygy"
note: this is NOT a CSV file, it's a single string contained in a file with a larger overall structure

Try:
public class Main {
public static void main(String[] args) {
String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String[] tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
for(String t : tokens) {
System.out.println("> "+t);
}
}
}
Output:
> foo
> bar
> c;qual="baz,blurb"
> d;junk="quux,syzygy"
In other words: split on the comma only if that comma has zero, or an even number of quotes ahead of it.
Or, a bit friendlier for the eyes:
public class Main {
public static void main(String[] args) {
String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String otherThanQuote = " [^\"] ";
String quotedString = String.format(" \" %s* \" ", otherThanQuote);
String regex = String.format("(?x) "+ // enable comments, ignore white spaces
", "+ // match a comma
"(?= "+ // start positive look ahead
" (?: "+ // start non-capturing group 1
" %s* "+ // match 'otherThanQuote' zero or more times
" %s "+ // match 'quotedString'
" )* "+ // end group 1 and repeat it zero or more times
" %s* "+ // match 'otherThanQuote'
" $ "+ // match the end of the string
") ", // stop positive look ahead
otherThanQuote, quotedString, otherThanQuote);
String[] tokens = line.split(regex, -1);
for(String t : tokens) {
System.out.println("> "+t);
}
}
}
which produces the same as the first example.
EDIT
As mentioned by #MikeFHay in the comments:
I prefer using Guava's Splitter, as it has saner defaults (see discussion above about empty matches being trimmed by String#split(), so I did:
Splitter.on(Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))

While I do like regular expressions in general, for this kind of state-dependent tokenization I believe a simple parser (which in this case is much simpler than that word might make it sound) is probably a cleaner solution, in particular with regards to maintainability, e.g.:
String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
List<String> result = new ArrayList<String>();
int start = 0;
boolean inQuotes = false;
for (int current = 0; current < input.length(); current++) {
if (input.charAt(current) == '\"') inQuotes = !inQuotes; // toggle state
else if (input.charAt(current) == ',' && !inQuotes) {
result.add(input.substring(start, current));
start = current + 1;
}
}
result.add(input.substring(start));
If you don't care about preserving the commas inside the quotes you could simplify this approach (no handling of start index, no last character special case) by replacing your commas in quotes by something else and then split at commas:
String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
StringBuilder builder = new StringBuilder(input);
boolean inQuotes = false;
for (int currentIndex = 0; currentIndex < builder.length(); currentIndex++) {
char currentChar = builder.charAt(currentIndex);
if (currentChar == '\"') inQuotes = !inQuotes; // toggle state
if (currentChar == ',' && inQuotes) {
builder.setCharAt(currentIndex, ';'); // or '♡', and replace later
}
}
List<String> result = Arrays.asList(builder.toString().split(","));

http://sourceforge.net/projects/javacsv/
https://github.com/pupi1985/JavaCSV-Reloaded
(fork of the previous library that will allow the generated output to have Windows line terminators \r\n when not running Windows)
http://opencsv.sourceforge.net/
CSV API for Java
Can you recommend a Java library for reading (and possibly writing) CSV files?
Java lib or app to convert CSV to XML file?

I would not advise a regex answer from Bart, I find parsing solution better in this particular case (as Fabian proposed). I've tried regex solution and own parsing implementation I have found that:
Parsing is much faster than splitting with regex with backreferences - ~20 times faster for short strings, ~40 times faster for long strings.
Regex fails to find empty string after last comma. That was not in original question though, it was mine requirement.
My solution and test below.
String tested = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\",";
long start = System.nanoTime();
String[] tokens = tested.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
long timeWithSplitting = System.nanoTime() - start;
start = System.nanoTime();
List<String> tokensList = new ArrayList<String>();
boolean inQuotes = false;
StringBuilder b = new StringBuilder();
for (char c : tested.toCharArray()) {
switch (c) {
case ',':
if (inQuotes) {
b.append(c);
} else {
tokensList.add(b.toString());
b = new StringBuilder();
}
break;
case '\"':
inQuotes = !inQuotes;
default:
b.append(c);
break;
}
}
tokensList.add(b.toString());
long timeWithParsing = System.nanoTime() - start;
System.out.println(Arrays.toString(tokens));
System.out.println(tokensList.toString());
System.out.printf("Time with splitting:\t%10d\n",timeWithSplitting);
System.out.printf("Time with parsing:\t%10d\n",timeWithParsing);
Of course you are free to change switch to else-ifs in this snippet if you feel uncomfortable with its ugliness. Note then lack of break after switch with separator. StringBuilder was chosen instead to StringBuffer by design to increase speed, where thread safety is irrelevant.

You're in that annoying boundary area where regexps almost won't do (as has been pointed out by Bart, escaping the quotes would make life hard) , and yet a full-blown parser seems like overkill.
If you are likely to need greater complexity any time soon I would go looking for a parser library. For example this one

I was impatient and chose not to wait for answers... for reference it doesn't look that hard to do something like this (which works for my application, I don't need to worry about escaped quotes, as the stuff in quotes is limited to a few constrained forms):
final static private Pattern splitSearchPattern = Pattern.compile("[\",]");
private List<String> splitByCommasNotInQuotes(String s) {
if (s == null)
return Collections.emptyList();
List<String> list = new ArrayList<String>();
Matcher m = splitSearchPattern.matcher(s);
int pos = 0;
boolean quoteMode = false;
while (m.find())
{
String sep = m.group();
if ("\"".equals(sep))
{
quoteMode = !quoteMode;
}
else if (!quoteMode && ",".equals(sep))
{
int toPos = m.start();
list.add(s.substring(pos, toPos));
pos = m.end();
}
}
if (pos < s.length())
list.add(s.substring(pos));
return list;
}
(exercise for the reader: extend to handling escaped quotes by looking for backslashes also.)

Try a lookaround like (?!\"),(?!\"). This should match , that are not surrounded by ".

The simplest approach is not to match delimiters, i.e. commas, with a complex additional logic to match what is actually intended (the data which might be quoted strings), just to exclude false delimiters, but rather match the intended data in the first place.
The pattern consists of two alternatives, a quoted string ("[^"]*" or ".*?") or everything up to the next comma ([^,]+). To support empty cells, we have to allow the unquoted item to be empty and to consume the next comma, if any, and use the \\G anchor:
Pattern p = Pattern.compile("\\G\"(.*?)\",?|([^,]*),?");
The pattern also contains two capturing groups to get either, the quoted string’s content or the plain content.
Then, with Java 9, we can get an array as
String[] a = p.matcher(input).results()
.map(m -> m.group(m.start(1)<0? 2: 1))
.toArray(String[]::new);
whereas older Java versions need a loop like
for(Matcher m = p.matcher(input); m.find(); ) {
String token = m.group(m.start(1)<0? 2: 1);
System.out.println("found: "+token);
}
Adding the items to a List or an array is left as an excise to the reader.
For Java 8, you can use the results() implementation of this answer, to do it like the Java 9 solution.
For mixed content with embedded strings, like in the question, you can simply use
Pattern p = Pattern.compile("\\G((\"(.*?)\"|[^,])*),?");
But then, the strings are kept in their quoted form.

what about a one-liner using String.split()?
String s = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String[] split = s.split( "(?<!\".{0,255}[^\"]),|,(?![^\"].*\")" );

A regular expression is not capable of handling escaped characters. For my application, I needed the ability to escape quotes and spaces (my separator is spaces, but the code is the same).
Here is my solution in Kotlin (the language from this particular application), based on the one from Fabian Steeg:
fun parseString(input: String): List<String> {
val result = mutableListOf<String>()
var inQuotes = false
var inEscape = false
val current = StringBuilder()
for (i in input.indices) {
// If this character is escaped, add it without looking
if (inEscape) {
inEscape = false
current.append(input[i])
continue
}
when (val c = input[i]) {
'\\' -> inEscape = true // escape the next character, \ isn't added to result
',' -> if (inQuotes) {
current.append(c)
} else {
result += current.toString()
current.clear()
}
'"' -> inQuotes = !inQuotes
else -> current.append(c)
}
}
if (current.isNotEmpty()) {
result += current.toString()
}
return result
}
I think this is not a place to use regular expressions. Contrary to other opinions, I don't think a parser is overkill. It's about 20 lines and fairly easy to test.

Rather than use lookahead and other crazy regex, just pull out the quotes first. That is, for every quote grouping, replace that grouping with __IDENTIFIER_1 or some other indicator, and map that grouping to a map of string,string.
After you split on comma, replace all mapped identifiers with the original string values.

I would do something like this:
boolean foundQuote = false;
if(charAtIndex(currentStringIndex) == '"')
{
foundQuote = true;
}
if(foundQuote == true)
{
//do nothing
}
else
{
string[] split = currentString.split(',');
}

Make regex not affecting Quotation mark [duplicate]

I have a string vaguely like this:
foo,bar,c;qual="baz,blurb",d;junk="quux,syzygy"
that I want to split by commas -- but I need to ignore commas in quotes. How can I do this? Seems like a regexp approach fails; I suppose I can manually scan and enter a different mode when I see a quote, but it would be nice to use preexisting libraries. (edit: I guess I meant libraries that are already part of the JDK or already part of a commonly-used libraries like Apache Commons.)
the above string should split into:
foo
bar
c;qual="baz,blurb"
d;junk="quux,syzygy"
note: this is NOT a CSV file, it's a single string contained in a file with a larger overall structure

Try:
public class Main {
public static void main(String[] args) {
String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String[] tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
for(String t : tokens) {
System.out.println("> "+t);
}
}
}
Output:
> foo
> bar
> c;qual="baz,blurb"
> d;junk="quux,syzygy"
In other words: split on the comma only if that comma has zero, or an even number of quotes ahead of it.
Or, a bit friendlier for the eyes:
public class Main {
public static void main(String[] args) {
String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String otherThanQuote = " [^\"] ";
String quotedString = String.format(" \" %s* \" ", otherThanQuote);
String regex = String.format("(?x) "+ // enable comments, ignore white spaces
", "+ // match a comma
"(?= "+ // start positive look ahead
" (?: "+ // start non-capturing group 1
" %s* "+ // match 'otherThanQuote' zero or more times
" %s "+ // match 'quotedString'
" )* "+ // end group 1 and repeat it zero or more times
" %s* "+ // match 'otherThanQuote'
" $ "+ // match the end of the string
") ", // stop positive look ahead
otherThanQuote, quotedString, otherThanQuote);
String[] tokens = line.split(regex, -1);
for(String t : tokens) {
System.out.println("> "+t);
}
}
}
which produces the same as the first example.
EDIT
As mentioned by #MikeFHay in the comments:
I prefer using Guava's Splitter, as it has saner defaults (see discussion above about empty matches being trimmed by String#split(), so I did:
Splitter.on(Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))

While I do like regular expressions in general, for this kind of state-dependent tokenization I believe a simple parser (which in this case is much simpler than that word might make it sound) is probably a cleaner solution, in particular with regards to maintainability, e.g.:
String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
List<String> result = new ArrayList<String>();
int start = 0;
boolean inQuotes = false;
for (int current = 0; current < input.length(); current++) {
if (input.charAt(current) == '\"') inQuotes = !inQuotes; // toggle state
else if (input.charAt(current) == ',' && !inQuotes) {
result.add(input.substring(start, current));
start = current + 1;
}
}
result.add(input.substring(start));
If you don't care about preserving the commas inside the quotes you could simplify this approach (no handling of start index, no last character special case) by replacing your commas in quotes by something else and then split at commas:
String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
StringBuilder builder = new StringBuilder(input);
boolean inQuotes = false;
for (int currentIndex = 0; currentIndex < builder.length(); currentIndex++) {
char currentChar = builder.charAt(currentIndex);
if (currentChar == '\"') inQuotes = !inQuotes; // toggle state
if (currentChar == ',' && inQuotes) {
builder.setCharAt(currentIndex, ';'); // or '♡', and replace later
}
}
List<String> result = Arrays.asList(builder.toString().split(","));

http://sourceforge.net/projects/javacsv/
https://github.com/pupi1985/JavaCSV-Reloaded
(fork of the previous library that will allow the generated output to have Windows line terminators \r\n when not running Windows)
http://opencsv.sourceforge.net/
CSV API for Java
Can you recommend a Java library for reading (and possibly writing) CSV files?
Java lib or app to convert CSV to XML file?

I would not advise a regex answer from Bart, I find parsing solution better in this particular case (as Fabian proposed). I've tried regex solution and own parsing implementation I have found that:
Parsing is much faster than splitting with regex with backreferences - ~20 times faster for short strings, ~40 times faster for long strings.
Regex fails to find empty string after last comma. That was not in original question though, it was mine requirement.
My solution and test below.
String tested = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\",";
long start = System.nanoTime();
String[] tokens = tested.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
long timeWithSplitting = System.nanoTime() - start;
start = System.nanoTime();
List<String> tokensList = new ArrayList<String>();
boolean inQuotes = false;
StringBuilder b = new StringBuilder();
for (char c : tested.toCharArray()) {
switch (c) {
case ',':
if (inQuotes) {
b.append(c);
} else {
tokensList.add(b.toString());
b = new StringBuilder();
}
break;
case '\"':
inQuotes = !inQuotes;
default:
b.append(c);
break;
}
}
tokensList.add(b.toString());
long timeWithParsing = System.nanoTime() - start;
System.out.println(Arrays.toString(tokens));
System.out.println(tokensList.toString());
System.out.printf("Time with splitting:\t%10d\n",timeWithSplitting);
System.out.printf("Time with parsing:\t%10d\n",timeWithParsing);
Of course you are free to change switch to else-ifs in this snippet if you feel uncomfortable with its ugliness. Note then lack of break after switch with separator. StringBuilder was chosen instead to StringBuffer by design to increase speed, where thread safety is irrelevant.

You're in that annoying boundary area where regexps almost won't do (as has been pointed out by Bart, escaping the quotes would make life hard) , and yet a full-blown parser seems like overkill.
If you are likely to need greater complexity any time soon I would go looking for a parser library. For example this one

I was impatient and chose not to wait for answers... for reference it doesn't look that hard to do something like this (which works for my application, I don't need to worry about escaped quotes, as the stuff in quotes is limited to a few constrained forms):
final static private Pattern splitSearchPattern = Pattern.compile("[\",]");
private List<String> splitByCommasNotInQuotes(String s) {
if (s == null)
return Collections.emptyList();
List<String> list = new ArrayList<String>();
Matcher m = splitSearchPattern.matcher(s);
int pos = 0;
boolean quoteMode = false;
while (m.find())
{
String sep = m.group();
if ("\"".equals(sep))
{
quoteMode = !quoteMode;
}
else if (!quoteMode && ",".equals(sep))
{
int toPos = m.start();
list.add(s.substring(pos, toPos));
pos = m.end();
}
}
if (pos < s.length())
list.add(s.substring(pos));
return list;
}
(exercise for the reader: extend to handling escaped quotes by looking for backslashes also.)

Try a lookaround like (?!\"),(?!\"). This should match , that are not surrounded by ".

The simplest approach is not to match delimiters, i.e. commas, with a complex additional logic to match what is actually intended (the data which might be quoted strings), just to exclude false delimiters, but rather match the intended data in the first place.
The pattern consists of two alternatives, a quoted string ("[^"]*" or ".*?") or everything up to the next comma ([^,]+). To support empty cells, we have to allow the unquoted item to be empty and to consume the next comma, if any, and use the \\G anchor:
Pattern p = Pattern.compile("\\G\"(.*?)\",?|([^,]*),?");
The pattern also contains two capturing groups to get either, the quoted string’s content or the plain content.
Then, with Java 9, we can get an array as
String[] a = p.matcher(input).results()
.map(m -> m.group(m.start(1)<0? 2: 1))
.toArray(String[]::new);
whereas older Java versions need a loop like
for(Matcher m = p.matcher(input); m.find(); ) {
String token = m.group(m.start(1)<0? 2: 1);
System.out.println("found: "+token);
}
Adding the items to a List or an array is left as an excise to the reader.
For Java 8, you can use the results() implementation of this answer, to do it like the Java 9 solution.
For mixed content with embedded strings, like in the question, you can simply use
Pattern p = Pattern.compile("\\G((\"(.*?)\"|[^,])*),?");
But then, the strings are kept in their quoted form.

what about a one-liner using String.split()?
String s = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String[] split = s.split( "(?<!\".{0,255}[^\"]),|,(?![^\"].*\")" );

A regular expression is not capable of handling escaped characters. For my application, I needed the ability to escape quotes and spaces (my separator is spaces, but the code is the same).
Here is my solution in Kotlin (the language from this particular application), based on the one from Fabian Steeg:
fun parseString(input: String): List<String> {
val result = mutableListOf<String>()
var inQuotes = false
var inEscape = false
val current = StringBuilder()
for (i in input.indices) {
// If this character is escaped, add it without looking
if (inEscape) {
inEscape = false
current.append(input[i])
continue
}
when (val c = input[i]) {
'\\' -> inEscape = true // escape the next character, \ isn't added to result
',' -> if (inQuotes) {
current.append(c)
} else {
result += current.toString()
current.clear()
}
'"' -> inQuotes = !inQuotes
else -> current.append(c)
}
}
if (current.isNotEmpty()) {
result += current.toString()
}
return result
}
I think this is not a place to use regular expressions. Contrary to other opinions, I don't think a parser is overkill. It's about 20 lines and fairly easy to test.

Rather than use lookahead and other crazy regex, just pull out the quotes first. That is, for every quote grouping, replace that grouping with __IDENTIFIER_1 or some other indicator, and map that grouping to a map of string,string.
After you split on comma, replace all mapped identifiers with the original string values.

I would do something like this:
boolean foundQuote = false;
if(charAtIndex(currentStringIndex) == '"')
{
foundQuote = true;
}
if(foundQuote == true)
{
//do nothing
}
else
{
string[] split = currentString.split(',');
}

How do I create a regular expression to replace a known string while keeping optional parameters intact?

I've been trying out now for a while, but don't get it right:
In Java, I am trying to create a regular expression to match and replace a (to me known) string out of a string while keeping optional parameters intact.
Example inputs:
{067e6162-3b6f-4ae2-a171-2470b63dff00}
{067e6162-3b6f-4ae2-a171-2470b63dff00,number}
{067e6162-3b6f-4ae2-a171-2470b63dff00,number,integer}
{067e6162-3b6f-4ae2-a171-2470b63dff00,choice,1#one more item|1<another {067e6162-3b6f-4ae2-a171-2470b63dff00,number,integer} items}
(Note that the last example contains a nested reference to the same input string).
The format is always enclosing the to-be-replaced string in curly brackets {...} but with an optional list of comma-separated parameter(s).
I want to replace the input string with a number, e.g. for above input strings the result should be:
{2}
{2,number}
{2,number,integer}
{2,choice,1#one more item|1<another {2,number,integer} items}
Ideally, I'd like to have a regex that is flexible enough to handle (almost) any string as pattern to be replaced, so not just UUID kind of strings as above but also something like this:
A test string with {the_known_input_value_to_be_replaced,number,integer} not replacing the_known_input_value_to_be_replaced if its not in curly brackets of course.
which should end up as e.g.:
A test string with {3,number,integer} not replacing the_known_input_value_to_be_replaced if its not in curly brackets of course.
Note that the substitution should only take place if the input string is in curly brackets.
In Java I will be able to construct the pattern at runtime, taking the to-be-replaced string into account verbosely.
I tried e.g. \{(067e6162-3b6f-4ae2-a171-2470b63dff00)(,?.*)\} (not java escaped yet) and more generic approaches like \{(+?)(,?.*)\} , but they all don't do it right.
Any advice from regex ninjas highly appreciated :)

If you the known old string always occurs right after { you can just use
String result = old_text.replace("{" + my_old_keyword, "{" + my_new_keyword);
If you really have multiple known strings inside curly brackets (and there are no escaped curly brackets to take care of), you can use the following code:
String input = "067e6162-3b6f-4ae2-a171-2470b63dff00 is outside {067e6162-3b6f-4ae2-a171-2470b63dff00,choice,067e6162-3b6f-4ae2-a171-2470b63dff00,1#one more item|1<another {067e6162-3b6f-4ae2-a171-2470b63dff00,number,067e6162-3b6f-4ae2-a171-2470b63dff00,integer} items} 067e6162-3b6f-4ae2-a171-2470b63dff00 is outside ";
String old_key = "067e6162-3b6f-4ae2-a171-2470b63dff00";
String new_key = "NEW_KEY";
List<String> chunks = replaceInBalancedSubstrings(input, '{', '}', old_key, new_key);
System.out.println(String.join("", chunks));
Result: 067e6162-3b6f-4ae2-a171-2470b63dff00 is outside {{NEW_KEY,choice,NEW_KEY,1#one more item|1<another {NEW_KEY,number,NEW_KEY,integer} items} 067e6162-3b6f-4ae2-a171-2470b63dff00 is outside
The replaceInBalancedSubstrings method will look like:
public static List<String> replaceInBalancedSubstrings(String s, Character markStart, Character markEnd, String old_key, String new_key) {
List<String> subTreeList = new ArrayList<String>();
int level = 0;
int prevStart = 0;
StringBuffer sb = new StringBuffer();
int lastOpenBracket = -1;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (level == 0) {
sb.append(c);
}
if (c == markStart) {
level++;
if (level == 1) {
lastOpenBracket = i;
if (sb.length() > 0) {
subTreeList.add(sb.toString());
sb.delete(0, sb.length());
}
}
}
else if (c == markEnd) {
if (level == 1) {
subTreeList.add(s.substring(lastOpenBracket, i+1).replace(old_key, new_key)); // String replacement here
}
level--;
}
}
if (sb.length() > 0) {
subTreeList.add(sb.toString());
}
return subTreeList;
}
See IDEONE demo
This code will deal with replacements only inside substrings inside balanced (nested) curly braces.

Split String Using comma but ignore comma if it is brackets or quotes

I've seen many examples, but I am not getting the expected result.
Given a String:
"manikanta, Santhosh, ramakrishna(mani, santhosh), tester"
I would like to get the String array as follows:
manikanta,
Santhosh,
ramakrishna(mani, santhosh),
tester
I tried the following regex (got from another example):
"(\".*?\"|[^\",\\s]+)(?=\\s*,|\\s*$)"

This does this trick:
String[] parts = input.split(", (?![^(]*\\))");
which employs a negative lookahead to assert that the next bracket char is not a close bracket, and produces:
manikanta
Santhosh
ramakrishna(mani, santhosh)
tester
The desired output as per your question keeps the trailing commas, which I assume is an oversight, but if you really do want to keep the commas:
String[] parts = input.split("(?<=,) (?![^(]*\\))");
which produces the same, but with the trailing commas intact:
manikanta,
Santhosh,
ramakrishna(mani, santhosh),
tester

Suppose, we can split with whitespaces (due to your example), then you can try this regex \s+(?=([^\)]*\()|([^\)\(]*$)) like:
String str = "manikanta, Santhosh, ramakrishna(mani, santhosh), ramakrishna(mani, santhosh), tester";
String[] ar = str.split("\\s+(?=([^\\)]*\\()|([^\\)\\(]*$))");
Where:
\s+ any number of whitespaces
(?=...) positive lookahead, means that after current position must be the string, that matches to ([^\\)]*\\() or | to ([^\\)\\(]*$)
([^\\)]*\\() ignores whitespaces inside the ( and )
([^\\)\\(]*$)) all whitespaces, if they are not followed by ( and ), here is used to split a part with the tester word

As I stated in my comment to the question this problem may be impossible to solve by regular expressions.
The following code (java) gives a hint what to do:
private void parse() {
String string = null;
char[] chars = string.toCharArray();
List<String> parts = new ArrayList<String>();
boolean split = true;
int lastEnd = 0;
for (int i = 0; i < chars.length; i++) {
char c = chars[i];
switch (c) {
case '(':
split = false;
break;
case ')':
split = true;
break;
}
if (split && c == ',') {
parts.add(string.substring(lastEnd, i - 1));
lastEnd = i++;
}
}
}
Note that the code lacks some checks for constraints (provided string is null, array borders, ...).

Handling delimiter with escape characters in Java String.split() method

I have searched the web for my query, but didn't get the answer which fits my requirement exactly. I have my string like below:
A|B|C|The Steading\|Keir Allan\|Braco|E
My Output should look like below:
A
B
C
The Steading|Keir Allan|Braco
E
My requirement is to skip the delimiter if it is preceded by the escape sequence. I have tried the following using negative lookbehinds in String.split():
(?<!\\)\|
But, my problem is the delimiter will be defined by the end user dynamically and it need not be always |. It can be any character on the keyboard (no restrictions). Hence, my doubt is that the above regex might fail for some of the special characters which are not allowed in regex.
I just wanted to know if this is the perfect way to do it.

You can use Pattern.quote():
String regex = "(?<!\\\\)" + Pattern.quote(delim);
Using your example:
String delim = "|";
String regex = "(?<!\\\\)" + Pattern.quote(delim);
for (String s : "A|B|C|The Steading\\|Keir Allan\\|Braco|E".split(regex))
System.out.println(s);
A
B
C
The Steading\|Keir Allan\|Braco
E
You can extend this to use a custom escape sequence as well:
String delim = "|";
String esc = "+";
String regex = "(?<!" + Pattern.quote(esc) + ")" + Pattern.quote(delim);
for (String s : "A|B|C|The Steading+|Keir Allan+|Braco|E".split(regex))
System.out.println(s);
A
B
C
The Steading+|Keir Allan+|Braco
E

I know this is an old thread, but the lookbehind solution has an issue, that it doesn't allow escaping of the escape character (the split would not occur on A|B|C|The Steading\\|Keir Allan\|Braco|E)).
The positive matching solution in thread Regex and escaped and unescaped delimiter works better (with modification using Pattern.quote() if the delimiter is dynamic).

private static void splitString(String str, char escapeCharacter, char delimiter, Consumer<String> resultConsumer) {
final StringBuilder sb = new StringBuilder();
boolean isEscaped = false;
for (int i = 0; i < str.length(); i++) {
char c = str.charAt(i);
if (c == escapeCharacter) {
isEscaped = ! isEscaped;
sb.append(c);
} else if (c == delimiter) {
if (isEscaped) {
sb.append(c);
isEscaped = false;
} else {
resultConsumer.accept(sb.toString());
sb.setLength(0);
}
} else {
isEscaped = false;
sb.append(c);
}
}
resultConsumer.accept(sb.toString());
}

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Regex (java) help - java

You could also use one of the many open source small libraries for parsing CSVs, e.g. opencsv or Commons CSV.

If you can remove [\" from the start of the outer string and \"] from the end of it to become: String test = "String 1\",\"String, two"; You can use: test.split("\",\"");

Related

Splitting csv lines that use "escaped" delimiter [duplicate]

Make regex not affecting Quotation mark [duplicate]

How do I create a regular expression to replace a known string while keeping optional parameters intact?

Split String Using comma but ignore comma if it is brackets or quotes

Handling delimiter with escape characters in Java String.split() method

Categories

Resources