Splitting csv lines that use "escaped" delimiter [duplicate]

I have a string vaguely like this:
foo,bar,c;qual="baz,blurb",d;junk="quux,syzygy"
that I want to split by commas, but I need to ignore commas in quotes. How can I do this? A regexp approach seems to fail; I suppose I can manually scan and enter a different mode when I see a quote, but it would be nice to use preexisting libraries. (edit: I meant libraries that are already part of the JDK or part of a commonly used library like Apache Commons.)
the above string should split into:
foo
bar
c;qual="baz,blurb"
d;junk="quux,syzygy"
note: this is NOT a CSV file, it's a single string contained in a file with a larger overall structure

Try:
public class Main {
public static void main(String[] args) {
String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String[] tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
for(String t : tokens) {
System.out.println("> "+t);
}
}
}
Output:
> foo
> bar
> c;qual="baz,blurb"
> d;junk="quux,syzygy"
In other words: split on the comma only if that comma has zero, or an even number of quotes ahead of it.
Or, a bit friendlier for the eyes:
public class Main {
public static void main(String[] args) {
String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String otherThanQuote = " [^\"] ";
String quotedString = String.format(" \" %s* \" ", otherThanQuote);
String regex = String.format("(?x) "+ // enable comments, ignore white spaces
", "+ // match a comma
"(?= "+ // start positive look ahead
" (?: "+ // start non-capturing group 1
" %s* "+ // match 'otherThanQuote' zero or more times
" %s "+ // match 'quotedString'
" )* "+ // end group 1 and repeat it zero or more times
" %s* "+ // match 'otherThanQuote'
" $ "+ // match the end of the string
") ", // stop positive look ahead
otherThanQuote, quotedString, otherThanQuote);
String[] tokens = line.split(regex, -1);
for(String t : tokens) {
System.out.println("> "+t);
}
}
}
which produces the same as the first example.
EDIT
As mentioned by @MikeFHay in the comments:
I prefer using Guava's Splitter, as it has saner defaults (see discussion above about empty matches being trimmed by String#split()), so I did:
Splitter.on(Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))

While I do like regular expressions in general, for this kind of state-dependent tokenization I believe a simple parser (which in this case is much simpler than that word might make it sound) is probably a cleaner solution, in particular with regards to maintainability, e.g.:
String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
List<String> result = new ArrayList<String>();
int start = 0;
boolean inQuotes = false;
for (int current = 0; current < input.length(); current++) {
if (input.charAt(current) == '\"') inQuotes = !inQuotes; // toggle state
else if (input.charAt(current) == ',' && !inQuotes) {
result.add(input.substring(start, current));
start = current + 1;
}
}
result.add(input.substring(start));
If you don't care about preserving the commas inside the quotes you could simplify this approach (no handling of start index, no last character special case) by replacing your commas in quotes by something else and then split at commas:
String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
StringBuilder builder = new StringBuilder(input);
boolean inQuotes = false;
for (int currentIndex = 0; currentIndex < builder.length(); currentIndex++) {
char currentChar = builder.charAt(currentIndex);
if (currentChar == '\"') inQuotes = !inQuotes; // toggle state
if (currentChar == ',' && inQuotes) {
builder.setCharAt(currentIndex, ';'); // or '♡', and replace later
}
}
List<String> result = Arrays.asList(builder.toString().split(","));

http://sourceforge.net/projects/javacsv/
https://github.com/pupi1985/JavaCSV-Reloaded
(fork of the previous library that will allow the generated output to have Windows line terminators \r\n when not running Windows)
http://opencsv.sourceforge.net/
CSV API for Java
Can you recommend a Java library for reading (and possibly writing) CSV files?
Java lib or app to convert CSV to XML file?

I would not advise the regex answer from Bart; I find the parsing solution better in this particular case (as Fabian proposed). I tried both the regex solution and my own parsing implementation and found that:
Parsing is much faster than splitting with a lookahead regex: ~20 times faster for short strings, ~40 times faster for long strings.
The regex split, when called without a limit argument as in the test below, does not return the empty string after the last comma. That was not in the original question, though; it was my requirement.
My solution and test are below.
String tested = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\",";
long start = System.nanoTime();
String[] tokens = tested.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
long timeWithSplitting = System.nanoTime() - start;
start = System.nanoTime();
List<String> tokensList = new ArrayList<String>();
boolean inQuotes = false;
StringBuilder b = new StringBuilder();
for (char c : tested.toCharArray()) {
switch (c) {
case ',':
if (inQuotes) {
b.append(c);
} else {
tokensList.add(b.toString());
b = new StringBuilder();
}
break;
case '\"':
inQuotes = !inQuotes;
default:
b.append(c);
break;
}
}
tokensList.add(b.toString());
long timeWithParsing = System.nanoTime() - start;
System.out.println(Arrays.toString(tokens));
System.out.println(tokensList.toString());
System.out.printf("Time with splitting:\t%10d\n",timeWithSplitting);
System.out.printf("Time with parsing:\t%10d\n",timeWithParsing);
Of course you are free to change the switch to else-ifs in this snippet if you find it ugly. Note the deliberate lack of a break after the quote case: it falls through to default, so the quote character itself is appended to the token. StringBuilder was chosen over StringBuffer by design for speed, since thread safety is irrelevant here.
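The missing trailing token noted above comes from String.split's default limit of 0, which discards trailing empty strings; a negative limit keeps them. A minimal sketch of the difference, using the lookahead pattern from the first answer (the class and method names are only for illustration):

```java
import java.util.Arrays;

final class SplitLimitDemo {
    // A comma that is followed by an even number of quotes up to the end of the input.
    static final String COMMA_OUTSIDE_QUOTES = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    static String[] splitDroppingTrailingEmpties(String s) {
        return s.split(COMMA_OUTSIDE_QUOTES);      // limit 0: trailing empty strings are removed
    }

    static String[] splitKeepingTrailingEmpties(String s) {
        return s.split(COMMA_OUTSIDE_QUOTES, -1);  // negative limit: trailing empty strings are kept
    }

    public static void main(String[] args) {
        String tested = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\",";
        System.out.println(Arrays.toString(splitDroppingTrailingEmpties(tested)));
        System.out.println(Arrays.toString(splitKeepingTrailingEmpties(tested)));
    }
}
```

So the regex approach can return the empty last field as well; the speed comparison is a separate matter.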

You're in that annoying boundary area where regexps almost won't do (as has been pointed out by Bart, escaping the quotes would make life hard), and yet a full-blown parser seems like overkill.
If you are likely to need greater complexity any time soon I would go looking for a parser library. For example this one

I was impatient and chose not to wait for answers... for reference it doesn't look that hard to do something like this (which works for my application, I don't need to worry about escaped quotes, as the stuff in quotes is limited to a few constrained forms):
final static private Pattern splitSearchPattern = Pattern.compile("[\",]");
private List<String> splitByCommasNotInQuotes(String s) {
if (s == null)
return Collections.emptyList();
List<String> list = new ArrayList<String>();
Matcher m = splitSearchPattern.matcher(s);
int pos = 0;
boolean quoteMode = false;
while (m.find())
{
String sep = m.group();
if ("\"".equals(sep))
{
quoteMode = !quoteMode;
}
else if (!quoteMode && ",".equals(sep))
{
int toPos = m.start();
list.add(s.substring(pos, toPos));
pos = m.end();
}
}
if (pos < s.length())
list.add(s.substring(pos));
return list;
}
(exercise for the reader: extend to handling escaped quotes by looking for backslashes also.)
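One way the exercise above might be approached, as a rough sketch only. It assumes a backslash escapes exactly the next character and that escape sequences are kept verbatim in the output; it also swaps the Pattern/Matcher loop for a plain character scan. The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class EscapeAwareSplitter {
    static List<String> splitByCommasNotInQuotes(String s) {
        List<String> list = new ArrayList<>();
        if (s == null) {
            return list;
        }
        boolean quoteMode = false;
        int pos = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                i++; // skip the escaped character so an escaped quote does not toggle quoteMode
            } else if (c == '"') {
                quoteMode = !quoteMode;
            } else if (c == ',' && !quoteMode) {
                list.add(s.substring(pos, i));
                pos = i + 1;
            }
        }
        if (pos < s.length()) {
            list.add(s.substring(pos)); // same end-of-input handling as the method above
        }
        return list;
    }
}
```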

Try a lookaround like (?!\"),(?!\"). This should match , that are not surrounded by ".

The simplest approach is not to match delimiters, i.e. commas, with a complex additional logic to match what is actually intended (the data which might be quoted strings), just to exclude false delimiters, but rather match the intended data in the first place.
The pattern consists of two alternatives, a quoted string ("[^"]*" or ".*?") or everything up to the next comma ([^,]+). To support empty cells, we have to allow the unquoted item to be empty and to consume the next comma, if any, and use the \\G anchor:
Pattern p = Pattern.compile("\\G\"(.*?)\",?|([^,]*),?");
The pattern also contains two capturing groups to get either, the quoted string’s content or the plain content.
Then, with Java 9, we can get an array as
String[] a = p.matcher(input).results()
.map(m -> m.group(m.start(1)<0? 2: 1))
.toArray(String[]::new);
whereas older Java versions need a loop like
for(Matcher m = p.matcher(input); m.find(); ) {
String token = m.group(m.start(1)<0? 2: 1);
System.out.println("found: "+token);
}
Adding the items to a List or an array is left as an exercise to the reader.
For Java 8, you can use the results() implementation of this answer, to do it like the Java 9 solution.
For mixed content with embedded strings, like in the question, you can simply use
Pattern p = Pattern.compile("\\G((\"(.*?)\"|[^,])*),?");
But then, the strings are kept in their quoted form.

what about a one-liner using String.split()?
String s = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String[] split = s.split( "(?<!\".{0,255}[^\"]),|,(?![^\"].*\")" );

A regular expression is not capable of handling escaped characters. For my application, I needed the ability to escape quotes and spaces (my separator is spaces, but the code is the same).
Here is my solution in Kotlin (the language from this particular application), based on the one from Fabian Steeg:
fun parseString(input: String): List<String> {
val result = mutableListOf<String>()
var inQuotes = false
var inEscape = false
val current = StringBuilder()
for (i in input.indices) {
// If this character is escaped, add it without looking
if (inEscape) {
inEscape = false
current.append(input[i])
continue
}
when (val c = input[i]) {
'\\' -> inEscape = true // escape the next character, \ isn't added to result
',' -> if (inQuotes) {
current.append(c)
} else {
result += current.toString()
current.clear()
}
'"' -> inQuotes = !inQuotes
else -> current.append(c)
}
}
if (current.isNotEmpty()) {
result += current.toString()
}
return result
}
I think this is not a place to use regular expressions. Contrary to other opinions, I don't think a parser is overkill. It's about 20 lines and fairly easy to test.

Rather than use lookahead and other crazy regex, just pull out the quotes first. That is, for every quote grouping, replace that grouping with __IDENTIFIER_1 or some other indicator, and map that grouping to a map of string,string.
After you split on comma, replace all mapped identifiers with the original string values.
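The placeholder idea above could look roughly like this. It is a sketch under assumptions: quoted groups contain no escaped quotes, the input never contains the placeholder text itself, and a list index stands in for the suggested map. The placeholder format `__IDENTIFIER_n__` (with a trailing `__` so that 1 and 10 cannot be confused) and the class name are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PlaceholderSplitter {
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]*\"");
    private static final Pattern PLACEHOLDER = Pattern.compile("__IDENTIFIER_(\\d+)__");

    static List<String> split(String input) {
        // Step 1: pull out every quoted group and leave a numbered placeholder behind.
        List<String> quoted = new ArrayList<>();
        StringBuffer masked = new StringBuffer();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            quoted.add(m.group());
            m.appendReplacement(masked, "__IDENTIFIER_" + (quoted.size() - 1) + "__");
        }
        m.appendTail(masked);

        // Step 2: split on commas (none of them are inside quotes any more),
        // then put the original quoted text back into each token.
        List<String> result = new ArrayList<>();
        for (String token : masked.toString().split(",", -1)) {
            Matcher pm = PLACEHOLDER.matcher(token);
            StringBuffer restored = new StringBuffer();
            while (pm.find()) {
                String original = quoted.get(Integer.parseInt(pm.group(1)));
                pm.appendReplacement(restored, Matcher.quoteReplacement(original));
            }
            pm.appendTail(restored);
            result.add(restored.toString());
        }
        return result;
    }
}
```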

I would do something like this:
boolean foundQuote = false;
// currentString and currentStringIndex are assumed to be defined by the caller
if (currentString.charAt(currentStringIndex) == '"')
{
foundQuote = true;
}
if (foundQuote)
{
// do nothing
}
else
{
String[] split = currentString.split(",");
}

Related

How to remove leading 0 in the time timestamp 02:25PM using java? [duplicate]

I've seen questions on how to prefix zeros here on SO, but not the other way around!
Can you suggest how to remove the leading zeros in alphanumeric text? Are there any built-in APIs, or do I need to write a method to trim the leading zeros?
Example:
01234 converts to 1234
0001234a converts to 1234a
001234-a converts to 1234-a
101234 remains as 101234
2509398 remains as 2509398
123z remains as 123z
000002829839 converts to 2829839

Regex is the best tool for the job; what it should be depends on the problem specification. The following removes leading zeroes, but leaves one if necessary (i.e. it wouldn't just turn "0" to a blank string).
s.replaceFirst("^0+(?!$)", "")
The ^ anchor makes sure that the 0+ being matched is at the beginning of the input. The (?!$) negative lookahead ensures that the entire string is never matched, so at least one character always remains.
Test harness:
String[] in = {
"01234", // "[1234]"
"0001234a", // "[1234a]"
"101234", // "[101234]"
"000002829839", // "[2829839]"
"0", // "[0]"
"0000000", // "[0]"
"0000009", // "[9]"
"000000z", // "[z]"
"000000.z", // "[.z]"
};
for (String s : in) {
System.out.println("[" + s.replaceFirst("^0+(?!$)", "") + "]");
}
See also
regular-expressions.info
repetitions, lookarounds, and anchors
String.replaceFirst(String regex, String replacement)

You can use the StringUtils class from Apache Commons Lang like this:
StringUtils.stripStart(yourString,"0");

If you are using Kotlin, this is the only code that you need:
yourString.trimStart('0')

How about the regex way:
String s = "001234-a";
s = s.replaceFirst ("^0*", "");
The ^ anchors to the start of the string (I'm assuming from context your strings are not multi-line here, otherwise you may need to look into \A for start of input rather than start of line). The 0* means zero or more 0 characters (you could use 0+ as well). The replaceFirst just replaces all those 0 characters at the start with nothing.
And if, like Vadzim, your definition of leading zeros doesn't include turning "0" (or "000" or similar strings) into an empty string (a rational enough expectation), simply put it back if necessary:
String s = "00000000";
s = s.replaceFirst ("^0*", "");
if (s.isEmpty()) s = "0";

A clear way without any need of regExp and any external libraries.
public static String trimLeadingZeros(String source) {
for (int i = 0; i < source.length(); ++i) {
char c = source.charAt(i);
if (c != '0') {
return source.substring(i);
}
}
return ""; // or return "0";
}

To go with thelost's Apache Commons answer: using guava-libraries (Google's general-purpose Java utility library which I would argue should now be on the classpath of any non-trivial Java project), this would use CharMatcher:
CharMatcher.is('0').trimLeadingFrom(inputString);

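You could just do this (it only works for purely numeric strings that fit in an int; anything else throws a NumberFormatException):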
String s = Integer.valueOf("0001007").toString();

Use this:
String x = "00123".replaceAll("^0*", ""); // -> 123

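Use the Apache Commons StringUtils class (note that strip removes the given characters from both ends of the string; stripStart removes them from the start only):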
StringUtils.strip(String str, String stripChars);

Using Regexp with groups:
Pattern pattern = Pattern.compile("(0*)(.*)");
String result = "";
Matcher matcher = pattern.matcher(content);
if (matcher.matches())
{
// first group contains 0, second group the remaining characters
// 000abcd - > 000, abcd
result = matcher.group(2);
}
return result;

Using regex as some of the answers suggest is a good way to do that. If you don't want to use regex then you can use this code:
String s = "00a0a121";
while(s.length()>0 && s.charAt(0)=='0')
{
s = s.substring(1);
}

If you (like me) need to remove all the leading zeros from each "word" in a string, you can modify @polygenelubricants' answer to the following:
String s = "003 d0g 00ss 00 0 00";
s = s.replaceAll("\\b0+(?!\\b)", "");
which results in:
3 d0g ss 0 0 0

I think this is easy to do. You can just loop over the string from the start and skip zeros until you find a non-zero char.
int leadingZeros = 0;
for (int i = 0; i < str.length(); i++) {
if (str.charAt(i) == '0') {
leadingZeros++; // still inside the run of leading zeros
} else {
break;
}
}
str = str.substring(leadingZeros); // leaves str unchanged when there are no leading zeros

Without using regex or the substring() function on String, which would be inefficient:
public static String removeZero(String str){
StringBuffer sb = new StringBuffer(str);
while (sb.length()>1 && sb.charAt(0) == '0')
sb.deleteCharAt(0);
return sb.toString(); // return in String
}

Using kotlin it is easy
value.trimStart('0')

You could replace "^0*(.*)" to "$1" with regex

String s="0000000000046457657772752256266542=56256010000085100000";
String removeString="";
for(int i =0;i<s.length();i++){
if(s.charAt(i)=='0')
removeString=removeString+"0";
else
break;
}
System.out.println("original string - "+s);
System.out.println("after removing 0's -"+s.replaceFirst(removeString,""));

If you don't want to use regex or external library.
You can do with "for":
String input="0000008008451";
String output = input.trim();
for( ;output.length() > 1 && output.charAt(0) == '0'; output = output.substring(1));
System.out.println(output);//8008451

I ran some benchmarks and found that the fastest way (by far) is this solution:
private static String removeLeadingZeros(String s) {
try {
Integer intVal = Integer.parseInt(s);
s = intVal.toString();
} catch (Exception ex) {
// whatever
}
return s;
}
Regular expressions in particular are very slow in a long iteration. (I needed to find the fastest way for a batch job.)

And what about just searching for the first non-zero character?
[1-9]\d*
This regex finds the first digit between 1 and 9 followed by any number of digits, so for "00012345" it returns "12345".
It can be easily adapted for alphanumeric strings.
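One possible adaptation for alphanumeric input, sketched under two assumptions: "leading zeros" means every '0' before the first other character, and an all-zero string should collapse to "0". The pattern `[^0].*` and the class name are illustrative choices, not part of the answer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FirstNonZero {
    // First character that is not '0', followed by the rest of the input.
    private static final Pattern FROM_FIRST_NON_ZERO = Pattern.compile("[^0].*", Pattern.DOTALL);

    static String strip(String s) {
        Matcher m = FROM_FIRST_NON_ZERO.matcher(s);
        if (m.find()) {
            return m.group();
        }
        // No non-zero character at all: keep an empty input empty, reduce "000" to "0".
        return s.isEmpty() ? s : "0";
    }
}
```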

Efficient encoding all the special characters in a string into entities

I have a string like this "abcd !#&$%^^&*()<>!/". I have a list of the entity codes for these characters in a separate string, i.e. only those characters that appear in another string "!=&4....^=9..." should be encoded. I want to convert all special characters, i.e. everything except alphanumerics, into their entities using a regex, as looping over the characters one by one is too slow.
e.g. it should show "abc &#4..;&#4..", in other words convert all the special characters on the keyboard.
Is there an efficient regex I can write? I have tried this with loops, but it is too slow to look at each character one by one and maintain a list of all special character entities in another string.
There are libraries, but they do not convert all of the characters.
The code I wrote
// String to be encoded
String sDecoded = "abcd !##$%^&*();'m,";
// List of special characters and the entities that replace them. It is tokenized on the cross and divide symbols, as they cannot be entered by the user on a keyboard
String specialCharacters = "&÷&amp;×–÷&ndash;";
// Check the input
if (sDecoded == null || sDecoded.trim ().length () == 0)
return (sDecoded);
// Use StringTokenizer which is faster than split method
StringTokenizer st = new StringTokenizer(specialCharacters, "×");
String[] reg = null;
String[] charactersArray = sDecoded.split("");
String sEncoded = "";
// now loop on it and in each iteration, we will be getting a decodedCharacter:EncodedEntity pair
for(int i = 0; i < charactersArray.length; i++)
{
st = new StringTokenizer(specialCharacters, "×");
while(st.hasMoreElements())
{
reg = st.nextElement().toString().split("÷");
// This is an error, the character should not be blank ever because it will be character that we will encode
if(StringUtils.isBlank(reg[0]))
return sDecoded;
String c = charactersArray[i];
if(c.equalsIgnoreCase(reg[0]))
{
sEncoded = sEncoded + c.replace(reg[0], reg[1]);
break;
}
if(st.countTokens() == 0)
sEncoded = sEncoded + c.toString();
}
}
return (sEncoded);

I don't know what definition of "efficient" you are using, but there's the "don't reinvent the wheel" efficiency of using a simple call to Apache commons-text StringEscapeUtils utility class:
String encoded = StringEscapeUtils.escapeXml11(str);
or
String encoded = StringEscapeUtils.escapeHtml4(str);
and a variety of other similar methods, depending on which exact encoding you want.
Note: This class was originally in the commons-lang3 library, but was deprecated there and moved to the commons-text library.

Your approach is quite slow and inefficient. Maybe it looks elegant nowadays to use regex as a silver bullet for everything, but it is definitely not suited to this task. I see you are also using a tokenizer, which is also slow. A loop inside a loop will degrade performance as well.
I would recommend an iterative approach with a StringBuilder, which should be considerably faster; try it for yourself. For each special character, add an 'if' branch. Even if it looks like a lot of code, it will be fast.
Try this :
class Scratch {
public static void main(String[] args) {
System.out.println(escapeSpecials("abc &"));
}
public static String escapeSpecials(String origin) {
StringBuilder result = new StringBuilder();
char[] chars = origin.toCharArray();
for (char c : chars) {
if (c == '&') {
result.append("&amp;");
} else if (c == '\u2013') {
result.append("&ndash;");
} else {
// not a special character
result.append(c);
}
}
return result.toString();
}
}
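If the chain of 'if' statements grows long, a lookup table is one alternative that keeps the single pass over the input. This is a swapped-in variant, not the approach shown above, and it makes no performance claim. The class name and the particular characters in the table are assumptions for illustration; the entity names themselves are standard HTML.

```java
import java.util.HashMap;
import java.util.Map;

final class EntityTableEscaper {
    // Character-to-entity table; extend it with whatever characters the application needs.
    private static final Map<Character, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put('&', "&amp;");
        ENTITIES.put('<', "&lt;");
        ENTITIES.put('>', "&gt;");
        ENTITIES.put('"', "&quot;");
        ENTITIES.put('\u2013', "&ndash;");
    }

    static String escape(String origin) {
        StringBuilder result = new StringBuilder(origin.length() + 16);
        for (int i = 0; i < origin.length(); i++) {
            char c = origin.charAt(i);
            String entity = ENTITIES.get(c);
            if (entity != null) {
                result.append(entity); // special character: emit its entity
            } else {
                result.append(c);      // not a special character
            }
        }
        return result.toString();
    }
}
```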

Regular expression troubles, escaped quotes

Basically, I'm being passed a string and I need to tokenise it in much the same manner as command line options are tokenised by a *nix shell
Say I have the following string
"Hello\" World" "Hello Universe" Hi
How could I turn it into a 3 element list
Hello" World
Hello Universe
Hi
The following is my first attempt, but it's got a number of problems
It leaves the quote characters
It doesn't catch the escaped quote
Code:
public void test() {
String str = "\"Hello\\\" World\" \"Hello Universe\" Hi";
List<String> list = split(str);
}
public static List<String> split(String str) {
Pattern pattern = Pattern.compile(
"\"[^\"]*\"" + /* double quoted token*/
"|'[^']*'" + /*single quoted token*/
"|[A-Za-z']+" /*everything else*/
);
List<String> opts = new ArrayList<String>();
Scanner scanner = new Scanner(str).useDelimiter(pattern);
String token;
while ((token = scanner.findInLine(pattern)) != null) {
opts.add(token);
}
return opts;
}
So the incorrect output of the above code is
"Hello\"
World
" "
Hello
Universe
Hi
EDIT I'm totally open to a non regex solution. It's just the first solution that came to mind

If you decide you want to forego regex, and do parsing instead, there are a couple of options. If you are willing to have just a double quote or a single quote (but not both) as your quote, then you can use StreamTokenizer to solve this easily:
public static List<String> tokenize(String s) throws IOException {
List<String> opts = new ArrayList<String>();
StreamTokenizer st = new StreamTokenizer(new StringReader(s));
st.quoteChar('\"');
while (st.nextToken() != StreamTokenizer.TT_EOF) {
opts.add(st.sval);
}
return opts;
}
If you must support both quotes, here is a naive implementation that should work (caveat that a string like '"blah \" blah"blah' will yield something like 'blah " blahblah'. If that isn't OK, you will need to make some changes):
public static List<String> splitSSV(String in) throws IOException {
ArrayList<String> out = new ArrayList<String>();
StringReader r = new StringReader(in);
StringBuilder b = new StringBuilder();
int inQuote = -1;
boolean escape = false;
int c;
// read each character
while ((c = r.read()) != -1) {
if (escape) { // if the previous char is escape, add the current char
b.append((char)c);
escape = false;
continue;
}
switch (c) {
case '\\': // deal with escape char
escape = true;
break;
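case '\"':
case '\'': // deal with quote chars
if (inQuote == -1) { // not in a quote
inQuote = c; // now we are, and we remember which quote opened it
} else if (inQuote == c) {
inQuote = -1; // the matching quote closes the quoted section
} else {
b.append((char)c); // the other quote char inside a quote is literal
}
break;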
case ' ':
if (inQuote == -1) { // if we aren't in a quote, then add token to list
out.add(b.toString());
b.setLength(0);
} else {
b.append((char)c); // else append space to current token
}
break;
default:
b.append((char)c); // append all other chars to current token
}
}
if (b.length() > 0) {
out.add(b.toString()); // add final token to list
}
return out;
}

I'm pretty sure you can't do this by just tokenising on a regex. If you need to deal with nested and escaped delimiters, you need to write a parser. See e.g. http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html
There will be open source parsers which can do what you want, although I don't know any. You should also check out the StreamTokenizer class.

To recap, you want to split on whitespace, except when surrounded by double quotes, which are not preceded by a backslash.
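Step 1: tokenize the input: /([ \t]+)|(\\")|(")|([^ \t"\\]+|\\)/
This gives you a sequence of SPACE, ESCAPED_QUOTE, QUOTE and TEXT tokens. TEXT excludes the backslash so that \" is always matched as ESCAPED_QUOTE; a lone backslash counts as TEXT.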
Step 2: build a finite state machine matching and reacting to the tokens:
State: START
SPACE -> return empty string
ESCAPED_QUOTE -> Error (?)
QUOTE -> State := WITHIN_QUOTES
TEXT -> return text
State: WITHIN_QUOTES
SPACE -> add value to accumulator
ESCAPED_QUOTE -> add quote to accumulator
QUOTE -> return and clear accumulator; State := START
TEXT -> add text to accumulator
Step 3: Profit!!
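The two steps above can be sketched in Java roughly as follows. This is only one reading of the state machine, not a definitive implementation: the class name QuoteTokenizer is made up, whitespace outside quotes is skipped instead of producing empty strings, an escaped quote outside quotes is treated as an error, and the TEXT group is written so that it never swallows a backslash.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteTokenizer {
    // Groups: 1 = SPACE, 2 = ESCAPED_QUOTE, 3 = QUOTE, 4 = TEXT.
    // TEXT is a run without space, tab, quote or backslash, or a lone backslash.
    private static final Pattern TOKEN =
        Pattern.compile("([ \\t]+)|(\\\\\")|(\")|([^ \\t\"\\\\]+|\\\\)");

    static List<String> tokenize(String input) {
        List<String> result = new ArrayList<>();
        StringBuilder acc = new StringBuilder();
        boolean withinQuotes = false; // false = START, true = WITHIN_QUOTES
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {            // SPACE
                if (withinQuotes) {
                    acc.append(m.group(1));      // spaces are kept inside quotes
                }                                // and skipped outside (assumption)
            } else if (m.group(2) != null) {     // ESCAPED_QUOTE
                if (!withinQuotes) {
                    throw new IllegalArgumentException(
                        "escaped quote outside quotes at index " + m.start());
                }
                acc.append('"');
            } else if (m.group(3) != null) {     // QUOTE
                if (withinQuotes) {              // closing quote: emit the accumulator
                    result.add(acc.toString());
                    acc.setLength(0);
                }
                withinQuotes = !withinQuotes;
            } else {                             // TEXT
                if (withinQuotes) {
                    acc.append(m.group(4));
                } else {
                    result.add(m.group(4));
                }
            }
        }
        if (withinQuotes) {
            throw new IllegalArgumentException("unbalanced quote");
        }
        return result;
    }
}
```

For the string in the question, tokenize returns the three elements Hello" World, Hello Universe and Hi. Because a closing quote always emits a token, input like foo"bar" comes out as two tokens; change the START transitions if you want shell-style concatenation.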

I think if you use a pattern like this:
Pattern pattern = Pattern.compile("\".*?(?<!\\\\)\"|'.*?(?<!\\\\)'|[A-Za-z']+");
Then it will give you the desired output. When I ran it with your input data I got this list:
["Hello\" World", "Hello Universe", Hi]
I used [A-Za-z']+ from your own question, but shouldn't it be just [A-Za-z]+?
EDIT
Change your opts.add(token); line to:
opts.add(token.replaceAll("^\"|\"$|^'|'$", ""));
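Putting the pattern and the quote-stripping step together might look like the sketch below. It uses a plain Matcher loop instead of the Scanner from the question, and the class name LookbehindSplit is made up. Note that this only strips the enclosing quotes: the backslash in front of an escaped quote stays in the token, and a quoted token that ends in an escaped backslash (\\") would not be closed correctly by the lookbehind.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindSplit {
    // A quoted token ends at the first quote that is not preceded by a backslash.
    private static final Pattern TOKEN = Pattern.compile(
        "\".*?(?<!\\\\)\"|'.*?(?<!\\\\)'|[A-Za-z']+");

    static List<String> split(String str) {
        List<String> opts = new ArrayList<>();
        Matcher m = TOKEN.matcher(str);
        while (m.find()) {
            // remove one enclosing quote character from each end, if present
            opts.add(m.group().replaceAll("^\"|\"$|^'|'$", ""));
        }
        return opts;
    }
}
```

If the backslash should go as well, one option is an extra token.replace("\\\"", "\"") before adding the token to the list.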

The first thing you need to do is stop thinking of the job in terms of split(). split() is meant for breaking down simple strings like this/that/the other, where / is always a delimiter. But you're trying to split on whitespace, unless the whitespace is within quotes, except if the quotes are escaped with backslashes (and if backslashes escape quotes, they probably escape other things, like other backslashes).
With all those exceptions-to-exceptions, it's just not possible to create a regex to match all possible delimiters, not even with fancy gimmicks like lookarounds, conditionals, reluctant and possessive quantifiers. What you want to do is match the tokens, not the delimiters.
In the following code, a token that's enclosed in double-quotes or single-quotes may contain whitespace as well as the quote character if it's preceded by a backslash. Everything except the enclosing quotes is captured in group #1 (for double-quoted tokens) or group #2 (single-quoted). Any character may be escaped with a backslash, even in non-quoted tokens; the "escaping" backslashes are removed in a separate step.
public static void test()
{
String str = "\"Hello\\\" World\" 'Hello Universe' Hi";
List<String> commands = parseCommands(str);
for (String s : commands)
{
System.out.println(s);
}
}
public static List<String> parseCommands(String s)
{
String rgx = "\"((?:[^\"\\\\]++|\\\\.)*+)\"" // double-quoted
+ "|'((?:[^'\\\\]++|\\\\.)*+)'" // single-quoted
+ "|\\S+"; // not quoted
Pattern p = Pattern.compile(rgx);
Matcher m = p.matcher(s);
List<String> commands = new ArrayList<String>();
while (m.find())
{
String cmd = m.start(1) != -1 ? m.group(1) // strip double-quotes
: m.start(2) != -1 ? m.group(2) // strip single-quotes
: m.group();
cmd = cmd.replaceAll("\\\\(.)", "$1"); // remove escape characters
commands.add(cmd);
}
return commands;
}
output:
Hello" World
Hello Universe
Hi
This is about as simple as it gets for a regex-based solution--and it doesn't really deal with malformed input, like unbalanced quotes. If you're not fluent in regexes, you might be better off with a purely hand-coded solution or, even better, a dedicated command-line interpreter (CLI) library.

Regex (java) help

How do I split this comma+quote delimited String into a set of strings:
String test = "[\"String 1\",\"String, two\"]";
String[] embeddedStrings = test.split("<insert magic regex here>");
//note: It should also work for this string, with a space after the separating comma: "[\"String 1\", \"String, two\"]";
assertEquals("String 1", embeddedStrings[0]);
assertEquals("String, two", embeddedStrings[1]);
I'm fine with trimming the square brackets as a first step. But the catch is, even if I do that, I can't just split on a comma because embedded strings can have commas in them.
Using Apache StringUtils is also acceptable.

You could also use one of the many open source small libraries for parsing CSVs, e.g. opencsv or Commons CSV.

If you can remove [\" from the start of the outer string and \"] from the end of it, so that it becomes:
String test = "String 1\",\"String, two";
You can use:
test.split("\",\"");
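To also cover the variant from the question with a space after the separating comma, the delimiter can allow optional whitespace around the comma. A minimal sketch, assuming the input always starts with [" and ends with "] and that no embedded string contains a quote-comma-quote sequence itself (the class name BracketSplit is made up):

```java
class BracketSplit {
    static String[] split(String test) {
        // drop the leading [" and the trailing "]
        String inner = test.substring(2, test.length() - 2);
        // split on quote, optional whitespace, comma, optional whitespace, quote
        return inner.split("\"\\s*,\\s*\"");
    }
}
```

A comma inside an embedded string is left alone because it is not wrapped in quotes on both sides.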

This is extremely fragile and should be avoided, but you could match the string literals.
Pattern p = Pattern.compile("\"((?:[^\"\\\\]+|\\\\.)*)\""); // the first alternative must not consume backslashes, or \" is never recognised
String test = "[\"String 1\",\"String, two\"]";
Matcher m = p.matcher(test);
ArrayList<String> embeddedStrings = new ArrayList<String>();
while (m.find()) {
embeddedStrings.add(m.group(1));
}
The regular expression assumes that double quotes in the input are escaped using \" and not "". The pattern would break if the input had an odd number of (unescaped) double quotes.

Brute-force method. This assumes that the brackets are already removed; each resulting token still carries its surrounding quotes.
boolean inQuote = false;
List<String> strings = new ArrayList<String>();
int currStart = 0;
for (int i = 0; i < test.length(); i++) {
char c = test.charAt(i);
if (c == ',' && !inQuote) {
strings.add(test.substring(currStart, i));
currStart = i + 1; // the next token starts after the comma
}
else if (c == ' ' && !inQuote && currStart == i)
currStart = i + 1; // strip off spaces after a comma
else if (c == '"')
inQuote = !inQuote; // toggle on every quote
}
strings.add(test.substring(currStart)); // add the final token
String[] embeddedStrings = strings.toArray(new String[0]);

Splitting csv lines that use "escaped" delimiter [duplicate] - java

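Try a lookaround like (?<!\"),(?!\"). This matches a , that is not directly preceded or followed by a ". Note that it only looks at the adjacent characters, so a comma in the middle of a quoted value such as "baz,blurb" is still treated as a delimiter.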

What about a one-liner using String.split()?
String s = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String[] split = s.split( "(?<!\".{0,255}[^\"]),|,(?![^\"].*\")" );

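I would do something like this: scan the string, keep track of whether you are inside quotes, and only split on commas found outside of them.
List<String> parts = new ArrayList<String>();
boolean foundQuote = false;
int start = 0;
for (int i = 0; i < currentString.length(); i++) {
char ch = currentString.charAt(i);
if (ch == '"') {
foundQuote = !foundQuote; // toggle on every quote
} else if (ch == ',' && !foundQuote) {
parts.add(currentString.substring(start, i)); // split only outside quotes
start = i + 1;
}
}
parts.add(currentString.substring(start)); // the last field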
