Regex to consolidate multiple rules

I'm looking at optimising my string manipulation code and consolidating all of my replaceAll's to just one pattern if possible
Rules -
strip all special chars except -
replace space with -
condense consecutive - 's to just one -
Remove leading and trailing -'s
My code -
public static String slugifyTitle(String value) {
String slugifiedVal = null;
if (StringUtils.isNotEmpty(value))
slugifiedVal = value
.replaceAll("[ ](?=[ ])|[^-A-Za-z0-9 ]+", "") // strips all special chars except -
.replaceAll("\\s+", "-") // converts spaces to -
.replaceAll("--+", "-"); // replaces consecutive -'s with just one -
slugifiedVal = StringUtils.stripStart(slugifiedVal, "-"); // strips leading -
slugifiedVal = StringUtils.stripEnd(slugifiedVal, "-"); // strips trailing -
return slugifiedVal;
}
Does the job but obviously looks shoddy.
My test assertions -
Heading with symbols *~!##$%^&()_+-=[]{};',.<>?/ ==> heading-with-symbols
Heading with an asterisk* ==> heading-with-an-asterisk
Custom-id-&-stuff ==> custom-id-stuff
--Custom-id-&-stuff-- ==> custom-id-stuff

Disclaimer: I don't think a regex approach to this problem is wrong, or that this is an objectively better approach. I am merely presenting an alternative approach as food for thought.
I have a tendency against regex approaches to problems where you have to ask how to solve with regex, because that implies you're going to struggle to maintain that solution in the future. There is an opacity to regexes where "just do this" is obvious, when you know just to do this.
Some problems typically solved with regex, like this one, can be solved using imperative code. It tends to be more verbose, but it uses simple, apparent, code constructs; it's easier to debug; and can be faster because it doesn't involve the full "machinery" of the regex engine.
static String slugifyTitle(String value) {
boolean appendHyphen = false;
StringBuilder sb = new StringBuilder(value.length());
// Go through value one character at a time...
for (int i = 0; i < value.length(); i++) {
char c = value.charAt(i);
if (isAppendable(c)) {
// We have found a character we want to include in the string.
if (appendHyphen) {
// We previously found character(s) that we want to append a single
// hyphen for.
sb.append('-');
appendHyphen = false;
}
sb.append(c);
} else if (requiresHyphen(c)) {
// We want to replace hyphens or spaces with a single hyphen.
// Only append a hyphen if it's not going to be the first thing in the output.
// Doesn't matter if this is set for trailing hyphen/whitespace,
// since we then never hit the "isAppendable" condition.
appendHyphen = sb.length() > 0;
} else {
// Other characters are simply ignored.
}
}
// You can lowercase when appending the character, but `Character.toLowerCase()`
// recommends using `String.toLowerCase` instead.
return sb.toString().toLowerCase(Locale.ROOT);
}
// Some predicate on characters you want to include in the output.
static boolean isAppendable(char c) {
return (c >= 'A' && c <= 'Z')
|| (c >= 'a' && c <= 'z')
|| (c >= '0' && c <= '9');
}
// Some predicate on characters you want to replace with a single '-'.
static boolean requiresHyphen(char c) {
return c == '-' || Character.isWhitespace(c);
}
(This code is wildly over-commented, for the purpose of explaining it in this answer. Strip out the comments and unnecessary things like the else, it's actually not super complicated).

Consider the following regex parts:
Any special chars other than -: [\p{S}\p{P}&&[^-]]+ (character class subtraction)
One or more whitespace characters or hyphens: [\s-]+ (this will be replaced with a single -)
You will still need to remove leading/trailing hyphens, it will be a separate post-processing step. If you wish, you can use a ^-+|-+$ regex.
So, you can only reduce this to three .replaceAll invocations keeping the code precise and readable:
public static String slugifyTitle(String value) {
String slugifiedVal = null;
if (value != null && !value.trim().isEmpty())
slugifiedVal = value.toLowerCase()
.replaceAll("[\\p{S}\\p{P}&&[^-]]+", "") // strips all special chars except -
.replaceAll("[\\s-]+", "-") // converts spaces/hyphens to -
.replaceAll("^-+|-+$", ""); // remove trailing/leading hyphens
return slugifiedVal;
}
See the Java demo:
List<String> strs = Arrays.asList("Heading with symbols *~!##$%^&()_+-=[]{};',.<>?/",
"Heading with an asterisk*",
"Custom-id-&-stuff",
"--Custom-id-&-stuff--");
for (String str : strs) {
    System.out.println("\"" + str + "\" => " + slugifyTitle(str));
}
Output:
"Heading with symbols *~!##$%^&()_+-=[]{};',.<>?/" => heading-with-symbols
"Heading with an asterisk*" => heading-with-an-asterisk
"Custom-id-&-stuff" => custom-id-stuff
"--Custom-id-&-stuff--" => custom-id-stuff
NOTE: if your strings can contain any Unicode whitespace, replace "[\\s-]+" with "(?U)[\\s-]+".
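String.replaceAll compiles its pattern on every call, so for code that runs often the three patterns can be compiled once and reused. A minimal sketch, assuming the whitelist from the question's rules (ASCII letters, digits, whitespace and hyphens) rather than the \p{S}\p{P} classes above; the class, method and constant names are made up for illustration:

```java
import java.util.Locale;
import java.util.regex.Pattern;

class SlugifySketch {
    // Assumption: anything that is not an ASCII letter, digit, whitespace or hyphen is dropped.
    private static final Pattern NOT_ALLOWED = Pattern.compile("[^A-Za-z0-9\\s-]+");
    // Runs of whitespace and/or hyphens collapse to a single hyphen.
    private static final Pattern SEPARATORS = Pattern.compile("[\\s-]+");
    // Hyphens left at either end of the string.
    private static final Pattern EDGE_HYPHENS = Pattern.compile("^-+|-+$");

    static String slugify(String value) {
        if (value == null || value.trim().isEmpty()) {
            return null; // mirrors the null result of the code above for empty input
        }
        String s = NOT_ALLOWED.matcher(value).replaceAll("");
        s = SEPARATORS.matcher(s).replaceAll("-");
        s = EDGE_HYPHENS.matcher(s).replaceAll("");
        return s.toLowerCase(Locale.ROOT);
    }
}
```

It is still three passes over the string; the only saving is that the patterns are not recompiled on every call.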

How to remove leading 0 in the time timestamp 02:25PM using java? [duplicate]

I've seen questions on how to prefix zeros here in SO. But not the other way!
Can you guys suggest me how to remove the leading zeros in alphanumeric text? Are there any built-in APIs or do I need to write a method to trim the leading zeros?
Example:
01234 converts to 1234
0001234a converts to 1234a
001234-a converts to 1234-a
101234 remains as 101234
2509398 remains as 2509398
123z remains as 123z
000002829839 converts to 2829839

Regex is the best tool for the job; what it should be depends on the problem specification. The following removes leading zeroes, but leaves one if necessary (i.e. it wouldn't just turn "0" to a blank string).
s.replaceFirst("^0+(?!$)", "")
The ^ anchor will make sure that the 0+ being matched is at the beginning of the input. The (?!$) negative lookahead ensures that not the entire string will be matched.
Test harness:
String[] in = {
"01234", // "[1234]"
"0001234a", // "[1234a]"
"101234", // "[101234]"
"000002829839", // "[2829839]"
"0", // "[0]"
"0000000", // "[0]"
"0000009", // "[9]"
"000000z", // "[z]"
"000000.z", // "[.z]"
};
for (String s : in) {
System.out.println("[" + s.replaceFirst("^0+(?!$)", "") + "]");
}
See also
regular-expressions.info
repetitions, lookarounds, and anchors
String.replaceFirst(String regex)

You can use the StringUtils class from Apache Commons Lang like this:
StringUtils.stripStart(yourString,"0");

If you are using Kotlin, this is the only code that you need:
yourString.trimStart('0')

How about the regex way:
String s = "001234-a";
s = s.replaceFirst ("^0*", "");
The ^ anchors to the start of the string (I'm assuming from context your strings are not multi-line here, otherwise you may need to look into \A for start of input rather than start of line). The 0* means zero or more 0 characters (you could use 0+ as well). The replaceFirst just replaces all those 0 characters at the start with nothing.
And if, like Vadzim, your definition of leading zeros doesn't include turning "0" (or "000" or similar strings) into an empty string (a rational enough expectation), simply put it back if necessary:
String s = "00000000";
s = s.replaceFirst ("^0*", "");
if (s.isEmpty()) s = "0";

A clear way without any need of regExp and any external libraries.
public static String trimLeadingZeros(String source) {
for (int i = 0; i < source.length(); ++i) {
char c = source.charAt(i);
if (c != '0') {
return source.substring(i);
}
}
return ""; // or return "0";
}

To go with thelost's Apache Commons answer: using guava-libraries (Google's general-purpose Java utility library which I would argue should now be on the classpath of any non-trivial Java project), this would use CharMatcher:
CharMatcher.is('0').trimLeadingFrom(inputString);

You could just do:
String s = Integer.valueOf("0001007").toString();

Use this:
String x = "00123".replaceAll("^0*", ""); // -> 123

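Use the Apache Commons StringUtils class. Note that strip removes the given characters from both ends, so for leading zeros only, stripStart is the one to call:
StringUtils.stripStart(str, "0");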

Using Regexp with groups:
Pattern pattern = Pattern.compile("(0*)(.*)");
String result = "";
Matcher matcher = pattern.matcher(content);
if (matcher.matches())
{
// first group contains 0, second group the remaining characters
// 000abcd - > 000, abcd
result = matcher.group(2);
}
return result;

Using regex as some of the answers suggest is a good way to do that. If you don't want to use regex then you can use this code:
String s = "00a0a121";
while(s.length()>0 && s.charAt(0)=='0')
{
s = s.substring(1);
}

If you (like me) need to remove all the leading zeros from each "word" in a string, you can modify @polygenelubricants' answer to the following:
String s = "003 d0g 00ss 00 0 00";
s = s.replaceAll("\\b0+(?!\\b)", "");
which results in:
3 d0g ss 0 0 0

I think this is easy to do. Just scan from the start of the string, skipping zeros until you find a non-zero character, then keep everything from that position on.
int firstNonZeroIndex = 0;
while (firstNonZeroIndex < str.length() && str.charAt(firstNonZeroIndex) == '0') {
    firstNonZeroIndex++;
}
str = str.substring(firstNonZeroIndex);

Without using a regex or the substring() function on String, which would be inefficient:
public static String removeZero(String str){
StringBuffer sb = new StringBuffer(str);
while (sb.length()>1 && sb.charAt(0) == '0')
sb.deleteCharAt(0);
return sb.toString(); // return in String
}

Using kotlin it is easy
value.trimStart('0')

You could replace "^0*(.*)" to "$1" with regex

String s="0000000000046457657772752256266542=56256010000085100000";
String removeString="";
for(int i =0;i<s.length();i++){
if(s.charAt(i)=='0')
removeString=removeString+"0";
else
break;
}
System.out.println("original string - "+s);
System.out.println("after removing 0's -"+s.replaceFirst(removeString,""));

If you don't want to use regex or external library.
You can do with "for":
String input="0000008008451";
String output = input.trim();
for( ;output.length() > 1 && output.charAt(0) == '0'; output = output.substring(1));
System.out.println(output);//8008451

I made some benchmark tests and found, that the fastest way (by far) is this solution:
private static String removeLeadingZeros(String s) {
try {
Integer intVal = Integer.parseInt(s);
s = intVal.toString();
} catch (Exception ex) {
// whatever
}
return s;
}
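Regular expressions in particular are very slow over many iterations. (I needed to find the fastest way for a batch job.) Note that this only strips zeros from strings that parse as an int; anything else, such as 0001234a, is returned unchanged.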

And what about just searching for the first non-zero character?
[1-9]\d*
This regex finds the first digit between 1 and 9 followed by any number of digits, so for "00012345" it returns "12345".
It can be easily adapted for alphanumeric strings.
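One possible adaptation for alphanumeric strings, as a sketch: search for the first character that is not 0 and keep everything from there on. The class and method names are hypothetical, and returning "0" for an all-zero input is an assumption borrowed from the other answers, not part of this one:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstNonZeroSketch {
    // First character that is not '0', followed by the rest of the input.
    private static final Pattern FROM_FIRST_NON_ZERO = Pattern.compile("[^0].*", Pattern.DOTALL);

    static String trimLeadingZeros(String s) {
        Matcher m = FROM_FIRST_NON_ZERO.matcher(s);
        if (m.find()) {
            return m.group(); // everything from the first non-zero character on
        }
        // No non-zero character at all: keep a single "0" unless the input was empty.
        return s.isEmpty() ? "" : "0";
    }
}
```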

Splitting csv lines that use "escaped" delimiter [duplicate]

I have a string vaguely like this:
foo,bar,c;qual="baz,blurb",d;junk="quux,syzygy"
that I want to split by commas -- but I need to ignore commas in quotes. How can I do this? Seems like a regexp approach fails; I suppose I can manually scan and enter a different mode when I see a quote, but it would be nice to use preexisting libraries. (edit: I guess I meant libraries that are already part of the JDK or already part of a commonly-used libraries like Apache Commons.)
the above string should split into:
foo
bar
c;qual="baz,blurb"
d;junk="quux,syzygy"
note: this is NOT a CSV file, it's a single string contained in a file with a larger overall structure

Try:
public class Main {
public static void main(String[] args) {
String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String[] tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
for(String t : tokens) {
System.out.println("> "+t);
}
}
}
Output:
> foo
> bar
> c;qual="baz,blurb"
> d;junk="quux,syzygy"
In other words: split on the comma only if that comma has zero, or an even number of quotes ahead of it.
Or, a bit friendlier for the eyes:
public class Main {
public static void main(String[] args) {
String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String otherThanQuote = " [^\"] ";
String quotedString = String.format(" \" %s* \" ", otherThanQuote);
String regex = String.format("(?x) "+ // enable comments, ignore white spaces
", "+ // match a comma
"(?= "+ // start positive look ahead
" (?: "+ // start non-capturing group 1
" %s* "+ // match 'otherThanQuote' zero or more times
" %s "+ // match 'quotedString'
" )* "+ // end group 1 and repeat it zero or more times
" %s* "+ // match 'otherThanQuote'
" $ "+ // match the end of the string
") ", // stop positive look ahead
otherThanQuote, quotedString, otherThanQuote);
String[] tokens = line.split(regex, -1);
for(String t : tokens) {
System.out.println("> "+t);
}
}
}
which produces the same as the first example.
EDIT
As mentioned by @MikeFHay in the comments:
I prefer using Guava's Splitter, as it has saner defaults (see discussion above about empty matches being trimmed by String#split()), so I did:
Splitter.on(Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))

While I do like regular expressions in general, for this kind of state-dependent tokenization I believe a simple parser (which in this case is much simpler than that word might make it sound) is probably a cleaner solution, in particular with regards to maintainability, e.g.:
String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
List<String> result = new ArrayList<String>();
int start = 0;
boolean inQuotes = false;
for (int current = 0; current < input.length(); current++) {
if (input.charAt(current) == '\"') inQuotes = !inQuotes; // toggle state
else if (input.charAt(current) == ',' && !inQuotes) {
result.add(input.substring(start, current));
start = current + 1;
}
}
result.add(input.substring(start));
If you don't care about preserving the commas inside the quotes you could simplify this approach (no handling of start index, no last character special case) by replacing your commas in quotes by something else and then split at commas:
String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
StringBuilder builder = new StringBuilder(input);
boolean inQuotes = false;
for (int currentIndex = 0; currentIndex < builder.length(); currentIndex++) {
char currentChar = builder.charAt(currentIndex);
if (currentChar == '\"') inQuotes = !inQuotes; // toggle state
if (currentChar == ',' && inQuotes) {
builder.setCharAt(currentIndex, ';'); // or '♡', and replace later
}
}
List<String> result = Arrays.asList(builder.toString().split(","));

http://sourceforge.net/projects/javacsv/
https://github.com/pupi1985/JavaCSV-Reloaded
(fork of the previous library that will allow the generated output to have Windows line terminators \r\n when not running Windows)
http://opencsv.sourceforge.net/
CSV API for Java
Can you recommend a Java library for reading (and possibly writing) CSV files?
Java lib or app to convert CSV to XML file?

I would not advise the regex answer from Bart; I find the parsing solution better in this particular case (as Fabian proposed). I tried both the regex solution and my own parsing implementation, and found that:
Parsing is much faster than splitting with the lookahead regex: ~20 times faster for short strings, ~40 times faster for long strings.
The regex split fails to return the empty string after the last comma. That was not in the original question, though; it was my own requirement.
My solution and test below.
String tested = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\",";
long start = System.nanoTime();
String[] tokens = tested.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
long timeWithSplitting = System.nanoTime() - start;
start = System.nanoTime();
List<String> tokensList = new ArrayList<String>();
boolean inQuotes = false;
StringBuilder b = new StringBuilder();
for (char c : tested.toCharArray()) {
switch (c) {
case ',':
if (inQuotes) {
b.append(c);
} else {
tokensList.add(b.toString());
b = new StringBuilder();
}
break;
case '\"':
inQuotes = !inQuotes;
default:
b.append(c);
break;
}
}
tokensList.add(b.toString());
long timeWithParsing = System.nanoTime() - start;
System.out.println(Arrays.toString(tokens));
System.out.println(tokensList.toString());
System.out.printf("Time with splitting:\t%10d\n",timeWithSplitting);
System.out.printf("Time with parsing:\t%10d\n",timeWithParsing);
Of course, you are free to change the switch to else-ifs in this snippet if you feel uncomfortable with its ugliness. Note the deliberate lack of a break after the quote case: it falls through to default so that the quote character is still appended. StringBuilder was chosen over StringBuffer by design to increase speed, since thread safety is irrelevant here.
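The missing empty token after the last comma comes from String.split's default of discarding trailing empty strings rather than from the regex itself; passing a negative limit keeps them. A small sketch to illustrate, with hypothetical class and method names:

```java
class SplitLimitSketch {
    // Same lookahead as above: a comma followed by an even number of quotes.
    static final String REGEX = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    // Default behaviour: trailing empty strings are removed from the result.
    static String[] splitDefault(String s) {
        return s.split(REGEX);
    }

    // A negative limit keeps trailing empty strings.
    static String[] splitKeepingEmpty(String s) {
        return s.split(REGEX, -1);
    }
}
```

This does not change the speed comparison; it only addresses the trailing empty token.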

You're in that annoying boundary area where regexps almost won't do (as has been pointed out by Bart, escaping the quotes would make life hard) , and yet a full-blown parser seems like overkill.
If you are likely to need greater complexity any time soon I would go looking for a parser library. For example this one

I was impatient and chose not to wait for answers... for reference it doesn't look that hard to do something like this (which works for my application, I don't need to worry about escaped quotes, as the stuff in quotes is limited to a few constrained forms):
final static private Pattern splitSearchPattern = Pattern.compile("[\",]");
private List<String> splitByCommasNotInQuotes(String s) {
if (s == null)
return Collections.emptyList();
List<String> list = new ArrayList<String>();
Matcher m = splitSearchPattern.matcher(s);
int pos = 0;
boolean quoteMode = false;
while (m.find())
{
String sep = m.group();
if ("\"".equals(sep))
{
quoteMode = !quoteMode;
}
else if (!quoteMode && ",".equals(sep))
{
int toPos = m.start();
list.add(s.substring(pos, toPos));
pos = m.end();
}
}
if (pos < s.length())
list.add(s.substring(pos));
return list;
}
(exercise for the reader: extend to handling escaped quotes by looking for backslashes also.)

Try a lookaround like (?<!\"),(?!\"). This should match commas that are not directly preceded or followed by a ".

The simplest approach is not to match delimiters, i.e. commas, with a complex additional logic to match what is actually intended (the data which might be quoted strings), just to exclude false delimiters, but rather match the intended data in the first place.
The pattern consists of two alternatives, a quoted string ("[^"]*" or ".*?") or everything up to the next comma ([^,]+). To support empty cells, we have to allow the unquoted item to be empty and to consume the next comma, if any, and use the \\G anchor:
Pattern p = Pattern.compile("\\G\"(.*?)\",?|([^,]*),?");
The pattern also contains two capturing groups to get either, the quoted string’s content or the plain content.
Then, with Java 9, we can get an array as
String[] a = p.matcher(input).results()
.map(m -> m.group(m.start(1)<0? 2: 1))
.toArray(String[]::new);
whereas older Java versions need a loop like
for(Matcher m = p.matcher(input); m.find(); ) {
String token = m.group(m.start(1)<0? 2: 1);
System.out.println("found: "+token);
}
Adding the items to a List or an array is left as an exercise to the reader.
For Java 8, you can use the results() implementation of this answer, to do it like the Java 9 solution.
For mixed content with embedded strings, like in the question, you can simply use
Pattern p = Pattern.compile("\\G((\"(.*?)\"|[^,])*),?");
But then, the strings are kept in their quoted form.

what about a one-liner using String.split()?
String s = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String[] split = s.split( "(?<!\".{0,255}[^\"]),|,(?![^\"].*\")" );

A regular expression is not capable of handling escaped characters. For my application, I needed the ability to escape quotes and spaces (my separator is spaces, but the code is the same).
Here is my solution in Kotlin (the language from this particular application), based on the one from Fabian Steeg:
fun parseString(input: String): List<String> {
val result = mutableListOf<String>()
var inQuotes = false
var inEscape = false
val current = StringBuilder()
for (i in input.indices) {
// If this character is escaped, add it without looking
if (inEscape) {
inEscape = false
current.append(input[i])
continue
}
when (val c = input[i]) {
'\\' -> inEscape = true // escape the next character, \ isn't added to result
',' -> if (inQuotes) {
current.append(c)
} else {
result += current.toString()
current.clear()
}
'"' -> inQuotes = !inQuotes
else -> current.append(c)
}
}
if (current.isNotEmpty()) {
result += current.toString()
}
return result
}
I think this is not a place to use regular expressions. Contrary to other opinions, I don't think a parser is overkill. It's about 20 lines and fairly easy to test.

Rather than use lookahead and other crazy regex, just pull out the quotes first. That is, for every quote grouping, replace that grouping with __IDENTIFIER_1 or some other indicator, and map that grouping to a map of string,string.
After you split on comma, replace all mapped identifiers with the original string values.
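A rough sketch of that idea, assuming quotes are never escaped and that the input never contains the placeholder text itself. The placeholder format (with a trailing __ so that one key is never a prefix of another) and all names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderSplitSketch {
    // A quoted group: a quote, any non-quote characters, a closing quote.
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]*\"");

    static List<String> split(String input) {
        // Step 1: swap every quoted group for a placeholder and remember the original.
        Map<String, String> saved = new LinkedHashMap<>();
        Matcher m = QUOTED.matcher(input);
        StringBuffer masked = new StringBuffer();
        int n = 0;
        while (m.find()) {
            String key = "__IDENTIFIER_" + (n++) + "__";
            saved.put(key, m.group());
            m.appendReplacement(masked, key);
        }
        m.appendTail(masked);

        // Step 2: split on commas, then put the quoted groups back into each token.
        List<String> result = new ArrayList<>();
        for (String token : masked.toString().split(",", -1)) {
            for (Map.Entry<String, String> e : saved.entrySet()) {
                token = token.replace(e.getKey(), e.getValue());
            }
            result.add(token);
        }
        return result;
    }
}
```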

I would do something like this:
boolean foundQuote = false;
if (currentString.charAt(currentStringIndex) == '"')
{
    foundQuote = true;
}
if (foundQuote)
{
    // do nothing
}
else
{
    String[] split = currentString.split(",");
}

Remove all punctuation from the end of a string

Examples:
// A B C. -> A B C
// !A B C! -> !A B C
// A? B?? C??? -> A? B?? C
Here's what I have so far:
while (endsWithRegex(word, "\\p{P}")) {
word = word.substring(0, word.length() - 1);
}
public static boolean endsWithRegex(String word, String regex) {
return word != null && !word.isEmpty() &&
word.substring(word.length() - 1).replaceAll(regex, "").isEmpty();
}
This current solution works, but since it's already calling String.replaceAll within endsWithRegex, we should be able to do something like this:
word = word.replaceAll(/* regex */, "");
Any advice?

I suggest using
\s*\p{Punct}+\s*$
It will match optional whitespace and punctuation at the end of the string.
If you do not care about the whitespace, just use \p{Punct}+$.
Do not forget that in Java strings, backslashes should be doubled to denote literal backslashes (that must be used as regex escape symbols).
Java demo
String word = "!Words word! ";
word = word.replaceAll("\\s*\\p{Punct}+\\s*$", "");
System.out.println(word); // => !Words word

You can use:
str = str.replaceFirst("\\p{P}+$", "");
To include space also:
str = str.replaceFirst("[\\p{Space}\\p{P}]+$", "")

How about this, if you can take a minor hit in efficiency:
reverse the input string
keep removing characters until you hit a letter
reverse the string again and return it
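The same effect can be had without reversing, by walking backwards from the end of the string. A sketch under the assumption that "stop" means the first letter or digit (so trailing whitespace is removed as well); the names are hypothetical:

```java
class TrailingPunctuationSketch {
    static String stripTrailingNonAlphanumerics(String word) {
        int end = word.length();
        // Move the end index left while the last kept character is not a letter or digit.
        while (end > 0 && !Character.isLetterOrDigit(word.charAt(end - 1))) {
            end--;
        }
        return word.substring(0, end);
    }
}
```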

I have modified the logic of your method
public static boolean endsWithRegex(String word, String regex) {
return word != null && !word.isEmpty() && word.matches(regex);
}
and your regex is : regex = ".*[^a-zA-Z]$";

Checking if a string only contains certain characters

I have a string representing a 32 character long barcode made up of "|" and ":".
I want to check the validity of any given string to make sure it is a barcode. One of the tests is to check that the only symbols it contains are the two mentioned above. How can I check that?
At first I was using a delimiter, but I don't think that is the right way to go about this.
public boolean isValidBarCode (String barCode)
{
barCode.useDelimiter ("[|:]");
if (barCode.length() == 32)
{
return true;
}
else
{
return false;
}
I know there are other things I need to check in order to validate it as a barcode, but I'm asking only for the purposes of checking the symbols within the given string.
I'm a beginner programmer, so the help is greatly appreciated!

You can use a regex:
boolean correct = string.matches("[\\:\\|]+");
Explanation of the regex: it checks that the string consists of one or more characters (that's what the + suffix does), each being either : or |. Inside a character class neither : nor | is special, so [:|]+ works as well; the backslash escapes here are harmless but not required. If you do keep them, each backslash must itself be escaped in a string literal, hence the double backslash.
Or you can simply code a five-line algorithm using a loop:
boolean correct = true; // must start as true, or the loop body never runs
for (int i = 0; i < string.length() && correct; i++) {
char c = string.charAt(i);
if (c != ':' && c != '|') {
correct = false;
}
}

Since you require the barcode to be exactly 32 characters long and consist only of the : and | characters, you should use a combination of length and regex checking:
boolean isCorrect = barCode.matches( "[\\|\\:]*" );
if(isCorrect && barCode.length() == 32) {
//true case
} else {
//false case
}
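The length check and the character check can also be folded into a single pattern with a counted quantifier. A minimal sketch; the class name is made up and the null check is an added assumption:

```java
class BarcodeSketch {
    static boolean isValidBarCode(String barCode) {
        // Exactly 32 characters, each of which is ':' or '|'.
        return barCode != null && barCode.matches("[:|]{32}");
    }
}
```

String.matches requires the whole string to match, so no ^ or $ anchors are needed.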

boolean isBarCode = barCode.matches( "[\\|\\:]*" );
