Replacing spaces within quotes - java

I'm really struggling with regex here. Using Java how would I go about replacing all spaces within quotes (double quotes really) with another character (or escaped space "\ ") but ONLY if the phrase ends with a wildcard character.
word1 AND "word2 word3 word4*" OR "word5 word6" OR word7
to
word1 AND "word2\ word3\ word4*" OR "word5 word6" OR word7

I think the best solution is to use a regular expression to find the quoted strings you want, and then to replace the spaces within the regex's match. Something like this:
import java.util.regex.*;
class SOReplaceSpacesInQuotes {
public static void main(String[] args) {
Pattern findQuotes = Pattern.compile("\"[^\"]+\\*\"");
for (String arg : args) {
Matcher m = findQuotes.matcher(arg);
StringBuffer result = new StringBuffer();
while (m.find())
m.appendReplacement(result, m.group().replace(" ", "\\\\ "));
m.appendTail(result);
System.out.println(arg + " -> " + result.toString());
}
}
}
Running java SOReplaceSpacesInQuotes 'word1 AND "word2 word3 word4*" OR "word5 word6*" OR word7' then happily produced the output word1 AND "word2 word3 word4*" OR "word5 word6*" OR word7 -> word1 AND "word2\ word3\ word4*" OR "word5\ word6*" OR word7, which is exactly what you wanted.
The pattern is "[^"]+\*", but backslashes and quotes have to be escaped for Java. This matches a literal quote, any number of non-quotes, a *, and a quote, which is what you want. This assumes that (a) you aren't allowed to have embedded \" escape sequences, and (b) that * is the only wildcard. If you have embedded escape sequences, then use "([^\\"]|\\.)\*" (which, escaped for Java, is \"([^\\\\\\"]|\\\\.)\\*\"); if you have multiple wildcards, use "[^"]+[*+]"; and if you have both, combine them in the obvious way. Dealing with multiple wildcards is a matter of just letting any of them match at the end of the string; dealing with escape sequences is done by matching a quote followed by any number of non-backslash, non-quote characters, or a backslash preceding anything at all.
Now, that pattern finds the quoted strings you want. For each argument to the program, we then match all of them, and using m.group().replace(" ", "\\\\ "), replace each space in what was matched (the quoted string) with a backslash and a space. (This string is \\—why two real backslashes are required, I'm not sure.) If you haven't seen appendReplacement and appendTail before (I hadn't), here's what they do: in tandem, they iterate through the entire string, replacing whatever was matched with the second argument to appendReplacement, and appending it all to the given StringBuffer. The appendTail call is necessary to catch whatever didn't match at the end. The documentation for Matcher.appendReplacement(StringBuffer,String) contains a good example of their use.
Edit: As Roland Illig pointed out, this is problematic if certain kinds of invalid input can appear, such as a AND "b" AND *"c", which would become a AND "b"\ AND\ *"c". If this is a danger (or if it could possibly become a danger in the future, which it likely could), then you should make it more robust by always matching quotes, but only replacing if they ended in a wildcard character. This will work as long as your quotes are always appropriately paired, which is a much weaker assumption. The resulting code is very similar:
import java.util.regex.*;
class SOReplaceSpacesInQuotes {
public static void main(String[] args) {
Pattern findQuotes = Pattern.compile("\"[^\"]+?(\\*)?\"");
for (String arg : args) {
Matcher m = findQuotes.matcher(arg);
StringBuffer result = new StringBuffer();
while (m.find()) {
if (m.group(1) == null)
m.appendReplacement(result, m.group());
else
m.appendReplacement(result, m.group().replace(" ", "\\\\ "));
}
m.appendTail(result);
System.out.println(arg + " -> " + result.toString());
}
}
}
We put the wildcard character in a group, and make it optional, and make the body of the quotes reluctant with +?, so that it will match as little as possible and let the wildcard character get grouped. This way, we match each successive pair of quotes, and since the regex engine won't restart in the middle of a match, we'll only ever match the insides, not the outsides, of quotes. But now we don't always want to replace the spaces—we only want to do so if there was a wildcard character. This is easy: test to see if group 1 is null. If it is, then there wasn't a wildcard character, so replace the string with itself. Otherwise, replace the spaces. And indeed, java SOReplaceSpacesInQuotes 'a AND "b d" AND *"c d"' yields the desired a AND "b d" AND *"c d" -> a AND "b d" AND *"c d", while java SOReplaceSpacesInQuotes 'a AND "b d" AND "c d*"' performs a substitution to get a AND "b d" AND *"c d" -> a AND "b d" AND "c\ *d".

Do you really need regular expressions here? The task seems well-described, but a little too complex for regular expressions. So I would rather program it out explicitly.
package so4478038;
import static org.junit.Assert.*;
import org.junit.Test;
public class QuoteSpaces {
public static String escapeSpacesInQuotes(String input) {
StringBuilder sb = new StringBuilder();
StringBuilder quotedWord = new StringBuilder();
boolean inQuotes = false;
for (int i = 0, imax = input.length(); i < imax; i++) {
char c = input.charAt(i);
if (c == '"') {
if (!inQuotes) {
quotedWord.setLength(0);
} else {
String qw = quotedWord.toString();
if (qw.endsWith("*")) {
sb.append(qw.replace(" ", "\\ "));
} else {
sb.append(qw);
}
}
inQuotes = !inQuotes;
}
if (inQuotes) {
quotedWord.append(c);
} else {
sb.append(c);
}
}
return sb.toString();
}
#Test
public void test() {
assertEquals("word1 AND \"word2\\ word3\\ word4*\" OR \"word5 word6\" OR word7", escapeSpacesInQuotes("word1 AND \"word2 word3 word4*\" OR \"word5 word6\" OR word7"));
}
}

Does it work ?
str.replaceAll("\"", "\\");
I don't have IDE now and I don't test it

Related

Java: How to replace consecutive characters with a single character?

How can I replace consecutive characters with a single character in java?
String fileContent = "def mnop.UVW";
String oldDelimiters = " .";
String newDelimiter = "!";
for (int i = 0; i < oldDelimiters.length(); i++){
Character character = oldDelimiters.charAt(i);
fileContent = fileContent.replace(String.valueOf(character), newDelimiter);
}
Current output: def!!mnop!UVW
Desired output: def!mnop!UVW
Notice the two spaces are replaced with two exclamation marks. How can I replace consecutive delimiters with one delimiter?
Since you want to match consecutive characters from the old delimiter, a regex solution doesn't seem to be feasible here. You can instead match char by char if it belongs to one of the old delimiter chars and then set it with the new one as shown below.
import java.util.*;
public class Main{
public static void main(String[] args) {
String fileContent = "def mnop.UVW";
String oldDelimiters = " .";
// add all old delimiters in a set for fast checks
Set<Character> set = new HashSet<>();
for(int i=0;i<oldDelimiters.length();++i) set.add(oldDelimiters.charAt(i));
/*
match all consecutive chars at once, check if it belongs to an old delimiter
and replace it with the new one
*/
String newDelimiter = "!";
StringBuilder res = new StringBuilder("");
for(int i=0;i<fileContent.length();++i){
if(set.contains(fileContent.charAt(i))){
while(i + 1 < fileContent.length() && fileContent.charAt(i) == fileContent.charAt(i+1)) i++;
res.append(newDelimiter);
}else{
res.append(fileContent.charAt(i));
}
}
System.out.println(res.toString());
}
}
Demo: https://onlinegdb.com/r1BC6qKP8
s = s.replaceAll("([ \\.])[ \\.]+", "$1");
Or if only several same delimiters have to be replaced:
s = s.replaceAll("([ \\.])\\1+", "$1");
[....] is a group of alternative characters
First (...) is group 1, $1
\\1 is the text of the first group
While not using regex, I thought a solution with StreamS was needed, because everyone loves streams:
private static class StatefulFilter implements Predicate<String> {
private final String needle;
private String last = null;
public StatefulFilter(String needle) {
this.needle = needle;
}
#Override
public boolean test(String value) {
boolean duplicate = last != null && last.equals(value) && value.equals(needle);
last = value;
return !duplicate;
}
}
public static void main(String[] args) {
System.out.println(
"def mnop.UVW"
.codePoints()
.sequential()
.mapToObj(c -> String.valueOf((char) c))
.filter(new StatefulFilter(" "))
.map(x -> x.equals(" ") ? "!" : x)
.collect(Collectors.joining(""))
);
}
Runnable example: https://onlinegdb.com/BkY0R2twU
Explanation:
Theoretically, you aren't really supposed to have a stateful filter, but technically, as long as the stream is not parallelized, it works fine:
.codePoints() - splits the String into a Stream
.sequential() - since we care about the order of characters, our Stream may not be processed in parallel
.mapToObj(c -> String.valueOf((char) c)) - the comparison in the filter is more intuitive if we convert to String, but it's not really needed
.filter(new StatefulFilter(" ")) - here we filter out any space that comes after another space
.map(x -> x.equals(" ") ? "!" : x) - now we can replace the remaining spaces with exclamation marks
.collect(Collectors.joining("")) - and finally we can join the characters together to reconstitute a String
The StatefulFilter itself is pretty straight forward - it checks whether a) we have a previous character at all, b) whether the previous character is the same as the current character and c) whether the current character is the delimiter (space). It returns false (meaning the character gets deleted) only if all a, b and c are true.
The biggest difficulty to using a regex for this, is to create an expression from your oldDelimiters string. For example:
String oldDelimiters = " .";
String expression = "\\" + String.join("+|\\", oldDelimiters.split("")) + "+";
String text = "def mnop.UVW;abc .df";
String result = text.replaceAll(expression, "!");
(Edit: since characters in the expression are now escaped anyway, I removed the character classes and edited the following text to reflect that change.)
Where the generated expression looks like \ +|\.+, i.e. each character is quantified and constitutes one alternative of the expression. The engine will match and replace one alternative at a time if it can be matched. result now contains:
def!mnop!UVW;abc!!df
Not sure how backwards compatible this is due to split() behaviour in previous versions of Java (producing a leading space in splitting on the empty string), but with current versions this should be fine.
Edit: As it is, this breaks if the delimiting characters contain digits or characters representing unescaped regex tokens (i.e. 1, b, etc.).

Split a string using multiple delimiters in java [duplicate]

I'm new to regular expressions and would appreciate your help. I'm trying to put together an expression that will split the example string using all spaces that are not surrounded by single or double quotes. My last attempt looks like this: (?!") and isn't quite working. It's splitting on the space before the quote.
Example input:
This is a string that "will be" highlighted when your 'regular expression' matches something.
Desired output:
This
is
a
string
that
will be
highlighted
when
your
regular expression
matches
something.
Note that "will be" and 'regular expression' retain the space between the words.
I don't understand why all the others are proposing such complex regular expressions or such long code. Essentially, you want to grab two kinds of things from your string: sequences of characters that aren't spaces or quotes, and sequences of characters that begin and end with a quote, with no quotes in between, for two kinds of quotes. You can easily match those things with this regular expression:
[^\s"']+|"([^"]*)"|'([^']*)'
I added the capturing groups because you don't want the quotes in the list.
This Java code builds the list, adding the capturing group if it matched to exclude the quotes, and adding the overall regex match if the capturing group didn't match (an unquoted word was matched).
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("[^\\s\"']+|\"([^\"]*)\"|'([^']*)'");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
if (regexMatcher.group(1) != null) {
// Add double-quoted string without the quotes
matchList.add(regexMatcher.group(1));
} else if (regexMatcher.group(2) != null) {
// Add single-quoted string without the quotes
matchList.add(regexMatcher.group(2));
} else {
// Add unquoted word
matchList.add(regexMatcher.group());
}
}
If you don't mind having the quotes in the returned list, you can use much simpler code:
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("[^\\s\"']+|\"[^\"]*\"|'[^']*'");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
matchList.add(regexMatcher.group());
}
There are several questions on StackOverflow that cover this same question in various contexts using regular expressions. For instance:
parsings strings: extracting words and phrases
Best way to parse Space Separated Text
UPDATE: Sample regex to handle single and double quoted strings. Ref: How can I split on a string except when inside quotes?
m/('.*?'|".*?"|\S+)/g
Tested this with a quick Perl snippet and the output was as reproduced below. Also works for empty strings or whitespace-only strings if they are between quotes (not sure if that's desired or not).
This
is
a
string
that
"will be"
highlighted
when
your
'regular expression'
matches
something.
Note that this does include the quote characters themselves in the matched values, though you can remove that with a string replace, or modify the regex to not include them. I'll leave that as an exercise for the reader or another poster for now, as 2am is way too late to be messing with regular expressions anymore ;)
If you want to allow escaped quotes inside the string, you can use something like this:
(?:(['"])(.*?)(?<!\\)(?>\\\\)*\1|([^\s]+))
Quoted strings will be group 2, single unquoted words will be group 3.
You can try it on various strings here: http://www.fileformat.info/tool/regex.htm or http://gskinner.com/RegExr/
The regex from Jan Goyvaerts is the best solution I found so far, but creates also empty (null) matches, which he excludes in his program. These empty matches also appear from regex testers (e.g. rubular.com).
If you turn the searches arround (first look for the quoted parts and than the space separed words) then you might do it in once with:
("[^"]*"|'[^']*'|[\S]+)+
(?<!\G".{0,99999})\s|(?<=\G".{0,99999}")\s
This will match the spaces not surrounded by double quotes.
I have to use min,max {0,99999} because Java doesn't support * and + in lookbehind.
It'll probably be easier to search the string, grabbing each part, vs. split it.
Reason being, you can have it split at the spaces before and after "will be". But, I can't think of any way to specify ignoring the space between inside a split.
(not actual Java)
string = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
regex = "\"(\\\"|(?!\\\").)+\"|[^ ]+"; // search for a quoted or non-spaced group
final = new Array();
while (string.length > 0) {
string = string.trim();
if (Regex(regex).test(string)) {
final.push(Regex(regex).match(string)[0]);
string = string.replace(regex, ""); // progress to next "word"
}
}
Also, capturing single quotes could lead to issues:
"Foo's Bar 'n Grill"
//=>
"Foo"
"s Bar "
"n"
"Grill"
String.split() is not helpful here because there is no way to distinguish between spaces within quotes (don't split) and those outside (split). Matcher.lookingAt() is probably what you need:
String str = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
str = str + " "; // add trailing space
int len = str.length();
Matcher m = Pattern.compile("((\"[^\"]+?\")|('[^']+?')|([^\\s]+?))\\s++").matcher(str);
for (int i = 0; i < len; i++)
{
m.region(i, len);
if (m.lookingAt())
{
String s = m.group(1);
if ((s.startsWith("\"") && s.endsWith("\"")) ||
(s.startsWith("'") && s.endsWith("'")))
{
s = s.substring(1, s.length() - 1);
}
System.out.println(i + ": \"" + s + "\"");
i += (m.group(0).length() - 1);
}
}
which produces the following output:
0: "This"
5: "is"
8: "a"
10: "string"
17: "that"
22: "will be"
32: "highlighted"
44: "when"
49: "your"
54: "regular expression"
75: "matches"
83: "something."
I liked Marcus's approach, however, I modified it so that I could allow text near the quotes, and support both " and ' quote characters. For example, I needed a="some value" to not split it into [a=, "some value"].
(?<!\\G\\S{0,99999}[\"'].{0,99999})\\s|(?<=\\G\\S{0,99999}\".{0,99999}\"\\S{0,99999})\\s|(?<=\\G\\S{0,99999}'.{0,99999}'\\S{0,99999})\\s"
Jan's approach is great but here's another one for the record.
If you actually wanted to split as mentioned in the title, keeping the quotes in "will be" and 'regular expression', then you could use this method which is straight out of Match (or replace) a pattern except in situations s1, s2, s3 etc
The regex:
'[^']*'|\"[^\"]*\"|( )
The two left alternations match complete 'quoted strings' and "double-quoted strings". We will ignore these matches. The right side matches and captures spaces to Group 1, and we know they are the right spaces because they were not matched by the expressions on the left. We replace those with SplitHere then split on SplitHere. Again, this is for a true split case where you want "will be", not will be.
Here is a full working implementation (see the results on the online demo).
import java.util.*;
import java.io.*;
import java.util.regex.*;
import java.util.List;
class Program {
public static void main (String[] args) throws java.lang.Exception {
String subject = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
Pattern regex = Pattern.compile("\'[^']*'|\"[^\"]*\"|( )");
Matcher m = regex.matcher(subject);
StringBuffer b= new StringBuffer();
while (m.find()) {
if(m.group(1) != null) m.appendReplacement(b, "SplitHere");
else m.appendReplacement(b, m.group(0));
}
m.appendTail(b);
String replaced = b.toString();
String[] splits = replaced.split("SplitHere");
for (String split : splits) System.out.println(split);
} // end main
} // end Program
If you are using c#, you can use
string input= "This is a string that \"will be\" highlighted when your 'regular expression' matches <something random>";
List<string> list1 =
Regex.Matches(input, #"(?<match>\w+)|\""(?<match>[\w\s]*)""|'(?<match>[\w\s]*)'|<(?<match>[\w\s]*)>").Cast<Match>().Select(m => m.Groups["match"].Value).ToList();
foreach(var v in list1)
Console.WriteLine(v);
I have specifically added "|<(?[\w\s]*)>" to highlight that you can specify any char to group phrases. (In this case I am using < > to group.
Output is :
This
is
a
string
that
will be
highlighted
when
your
regular expression
matches
something random
1st one-liner using String.split()
String s = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
String[] split = s.split( "(?<!(\"|').{0,255}) | (?!.*\\1.*)" );
[This, is, a, string, that, "will be", highlighted, when, your, 'regular expression', matches, something.]
don't split at the blank, if the blank is surrounded by single or double quotes
split at the blank when the 255 characters to the left and all characters to the right of the blank are neither single nor double quotes
adapted from original post (handles only double quotes)
I'm reasonably certain this is not possible using regular expressions alone. Checking whether something is contained inside some other tag is a parsing operation. This seems like the same problem as trying to parse XML with a regex -- it can't be done correctly. You may be able to get your desired outcome by repeatedly applying a non-greedy, non-global regex that matches the quoted strings, then once you can't find anything else, split it at the spaces... that has a number of problems, including keeping track of the original order of all the substrings. Your best bet is to just write a really simple function that iterates over the string and pulls out the tokens you want.
A couple hopefully helpful tweaks on Jan's accepted answer:
(['"])((?:\\\1|.)+?)\1|([^\s"']+)
Allows escaped quotes within quoted strings
Avoids repeating the pattern for the single and double quote; this also simplifies adding more quoting symbols if needed (at the expense of one more capturing group)
You can also try this:
String str = "This is a string that \"will be\" highlighted when your 'regular expression' matches something";
String ss[] = str.split("\"|\'");
for (int i = 0; i < ss.length; i++) {
if ((i % 2) == 0) {//even
String[] part1 = ss[i].split(" ");
for (String pp1 : part1) {
System.out.println("" + pp1);
}
} else {//odd
System.out.println("" + ss[i]);
}
}
The following returns an array of arguments. Arguments are the variable 'command' split on spaces, unless included in single or double quotes. The matches are then modified to remove the single and double quotes.
using System.Text.RegularExpressions;
var args = Regex.Matches(command, "[^\\s\"']+|\"([^\"]*)\"|'([^']*)'").Cast<Match>
().Select(iMatch => iMatch.Value.Replace("\"", "").Replace("'", "")).ToArray();
When you come across this pattern like this :
String str = "2022-11-10 08:35:00,470 RAV=REQ YIP=02.8.5.1 CMID=caonaustr CMN=\"Some Value Pyt Ltd\"";
//this helped
String[] str1= str.split("\\s(?=(([^\"]*\"){2})*[^\"]*$)\\s*");
System.out.println("Value of split string is "+ Arrays.toString(str1));
This results in :[2022-11-10, 08:35:00,470, PLV=REQ, YIP=02.8.5.1, CMID=caonaustr, CMN="Some Value Pyt Ltd"]
This regex matches spaces ONLY if it is followed by even number of double quotes.

java regex replaceAll with negated groups

I'm trying to use the String.replaceAll() method with regex to only keep letter characters and ['-_]. I'm trying to do this by replacing every character that is neither a letter nor one of the characters above by an empty string.
So far I have tried something like this (in different variations) which correctly keeps letters but replaces the special characters I want to keep:
current = current.replaceAll("(?=\\P{L})(?=[^\\'-_])", "");
Make it simplier :
current = current.replaceAll("[^a-zA-Z'_-]", "");
Explanation :
Match any char not in a to z, A to Z, ', _, - and replaceAll() method will replace any matched char with nothing.
Tested input : "a_zE'R-z4r#m"
Output : a_zE'R-zrm
You don't need lookahead, just use negated regex:
current = current.replaceAll("[^\\p{L}'_-]+", "");
[^\\p{L}'_-] will match anything that is not a letter (unicode) or single quote or underscore or hyphen.
Your regex is too complicated. Just specify the characters you want to keep, and use ^ to negate, so [^a-z'_-] means "anything but these".
public class Replacer {
public static void main(String[] args) {
System.out.println("with 1234 &*()) -/.,>>?chars".replaceAll("[^\\w'_-]", ""));
}
}
You can try this:
String str = "Se#rbi323a`and_Eur$ope#-t42he-[A%merica]";
str = str.replaceAll("[\\d+\\p{Punct}&&[^-'_\\[\\]]]+", "");
System.out.println("str = " + str);
And it is the result:
str = Serbia'and_Europe-the-[America]

Regular expression troubles, escaped quotes

Basically, I'm being passed a string and I need to tokenise it in much the same manner as command line options are tokenised by a *nix shell
Say I have the following string
"Hello\" World" "Hello Universe" Hi
How could I turn it into a 3 element list
Hello" World
Hello Universe
Hi
The following is my first attempt, but it's got a number of problems
It leaves the quote characters
It doesn't catch the escaped quote
Code:
public void test() {
String str = "\"Hello\\\" World\" \"Hello Universe\" Hi";
List<String> list = split(str);
}
public static List<String> split(String str) {
Pattern pattern = Pattern.compile(
"\"[^\"]*\"" + /* double quoted token*/
"|'[^']*'" + /*single quoted token*/
"|[A-Za-z']+" /*everything else*/
);
List<String> opts = new ArrayList<String>();
Scanner scanner = new Scanner(str).useDelimiter(pattern);
String token;
while ((token = scanner.findInLine(pattern)) != null) {
opts.add(token);
}
return opts;
}
So the incorrect output of the following code is
"Hello\"
World
" "
Hello
Universe
Hi
EDIT I'm totally open to a non regex solution. It's just the first solution that came to mind
If you decide you want to forego regex, and do parsing instead, there are a couple of options. If you are willing to have just a double quote or a single quote (but not both) as your quote, then you can use StreamTokenizer to solve this easily:
public static List<String> tokenize(String s) throws IOException {
List<String> opts = new ArrayList<String>();
StreamTokenizer st = new StreamTokenizer(new StringReader(s));
st.quoteChar('\"');
while (st.nextToken() != StreamTokenizer.TT_EOF) {
opts.add(st.sval);
}
return opts;
}
If you must support both quotes, here is a naive implementation that should work (caveat that a string like '"blah \" blah"blah' will yield something like 'blah " blahblah'. If that isn't OK, you will need to make some changes):
public static List<String> splitSSV(String in) throws IOException {
ArrayList<String> out = new ArrayList<String>();
StringReader r = new StringReader(in);
StringBuilder b = new StringBuilder();
int inQuote = -1;
boolean escape = false;
int c;
// read each character
while ((c = r.read()) != -1) {
if (escape) { // if the previous char is escape, add the current char
b.append((char)c);
escape = false;
continue;
}
switch (c) {
case '\\': // deal with escape char
escape = true;
break;
case '\"':
case '\'': // deal with quote chars
if (c == '\"' || c == '\'') {
if (inQuote == -1) { // not in a quote
inQuote = c; // now we are
} else {
inQuote = -1; // we were in a quote and now we aren't
}
}
break;
case ' ':
if (inQuote == -1) { // if we aren't in a quote, then add token to list
out.add(b.toString());
b.setLength(0);
} else {
b.append((char)c); // else append space to current token
}
break;
default:
b.append((char)c); // append all other chars to current token
}
}
if (b.length() > 0) {
out.add(b.toString()); // add final token to list
}
return out;
}
I'm pretty sure you can't do this by just tokenising on a regex. If you need to deal with nested and escaped delimiters, you need to write a parser. See e.g. http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html
There will be open source parsers which can do what you want, although I don't know any. You should also check out the StreamTokenizer class.
To recap, you want to split on whitespace, except when surrounded by double quotes, which are not preceded by a backslash.
Step 1: tokenize the input: /([ \t]+)|(\\")|(")|([^ \t"]+)/
This gives you a sequence of SPACE, ESCAPED_QUOTE, QUOTE and TEXT tokens.
Step 2: build a finite state machine matching and reacting to the tokens:
State: START
SPACE -> return empty string
ESCAPED_QUOTE -> Error (?)
QUOTE -> State := WITHIN_QUOTES
TEXT -> return text
State: WITHIN_QUOTES
SPACE -> add value to accumulator
ESCAPED_QUOTE -> add quote to accumulator
QUOTE -> return and clear accumulator; State := START
TEXT -> add text to accumulator
Step 3: Profit!!
I think if you use pattern like this:
Pattern pattern = Pattern.compile("\".*?(?<!\\\\)\"|'.*?(?<!\\\\)'|[A-Za-z']+");
Then it will give you desired output. When I ran with your input data I got this list:
["Hello\" World", "Hello Universe", Hi]
I used [A-Za-z']+ from your own question but shouldn't it be just : [A-Za-z]+
EDIT
Change your opts.add(token); line to:
opts.add(token.replaceAll("^\"|\"$|^'|'$", ""));
The first thing you need to do is stop thinking of the job in terms of split(). split() is meant for breaking down simple strings like this/that/the other, where / is always a delimiter. But you're trying to split on whitespace, unless the whitespace is within quotes, except if the quotes are escaped with backslashes (and if backslashes escape quotes, they probably escape other things, like other backslashes).
With all those exceptions-to-exceptions, it's just not possible to create a regex to match all possible delimiters, not even with fancy gimmicks like lookarounds, conditionals, reluctant and possessive quantifiers. What you want to do is match the tokens, not the delimiters.
In the following code, a token that's enclosed in double-quotes or single-quotes may contain whitespace as well as the quote character if it's preceded by a backslash. Everything except the enclosing quotes is captured in group #1 (for double-quoted tokens) or group #2 (single-quoted). Any character may be escaped with a backslash, even in non-quoted tokens; the "escaping" backslashes are removed in a separate step.
public static void test()
{
String str = "\"Hello\\\" World\" 'Hello Universe' Hi";
List<String> commands = parseCommands(str);
for (String s : commands)
{
System.out.println(s);
}
}
public static List<String> parseCommands(String s)
{
String rgx = "\"((?:[^\"\\\\]++|\\\\.)*+)\"" // double-quoted
+ "|'((?:[^'\\\\]++|\\\\.)*+)'" // single-quoted
+ "|\\S+"; // not quoted
Pattern p = Pattern.compile(rgx);
Matcher m = p.matcher(s);
List<String> commands = new ArrayList<String>();
while (m.find())
{
String cmd = m.start(1) != -1 ? m.group(1) // strip double-quotes
: m.start(2) != -1 ? m.group(2) // strip single-quotes
: m.group();
cmd = cmd.replaceAll("\\\\(.)", "$1"); // remove escape characters
commands.add(cmd);
}
return commands;
}
output:
Hello" World
Hello Universe
Hi
This is about as simple as it gets for a regex-based solution--and it doesn't really deal with malformed input, like unbalanced quotes. If you're not fluent in regexes, you might be better off with a purely hand-coded solution or, even better, a dedicated command-line interpreter (CLI) library.

Regular Expression problem in Java

I am trying to create a regular expression for the replaceAll method in Java. The test string is abXYabcXYZ and the pattern is abc. I want to replace any symbol except the pattern with +. For example the string abXYabcXYZ and pattern [^(abc)] should return ++++abc+++, but in my case it returns ab++abc+++.
public static String plusOut(String str, String pattern) {
pattern= "[^("+pattern+")]" + "".toLowerCase();
return str.toLowerCase().replaceAll(pattern, "+");
}
public static void main(String[] args) {
String text = "abXYabcXYZ";
String pattern = "abc";
System.out.println(plusOut(text, pattern));
}
When I try to replace the pattern with + there is no problem - abXYabcXYZ with pattern (abc) returns abxy+xyz. Pattern (^(abc)) returns the string without replacement.
Is there any other way to write NOT(regex) or group symbols as a word?
What you are trying to achieve is pretty tough with regular expressions, since there is no way to express “replace strings not matching a pattern”. You will have to use a “positive” pattern, telling what to match instead of what not to match.
Furthermore, you want to replace every character with a replacement character, so you have to make sure that your pattern matches exactly one character. Otherwise, you will replace whole strings with a single character, returning a shorter string.
For your toy example, you can use negative lookaheads and lookbehinds to achieve the task, but this may be more difficult for real-world examples with longer or more complex strings, since you will have to consider each character of your string separately, along with its context.
Here is the pattern for “not ‘abc’”:
[^abc]|a(?!bc)|(?<!a)b|b(?!c)|(?<!ab)c
It consists of five sub-patterns, connected with “or” (|), each matching exactly one character:
[^abc] matches every character except a, b or c
a(?!bc) matches a if it is not followed by bc
(?<!a)b matches b if it is not preceded with a
b(?!c) matches b if it is not followed by c
(?<!ab)c matches c if it is not preceded with ab
The idea is to match every character that is not in your target word abc, plus every word character that, according to the context, is not part of your word. The context can be examined using negative lookaheads (?!...) and lookbehinds (?<!...).
You can imagine that this technique will fail once you have a target word containing one character more than once, like example. It is pretty hard to express “match e if it is not followed by x and not preceded by l”.
Especially for dynamic patterns, it is by far easier to do a positive search and then replace every character that did not match in a second pass, as others have suggested here.
[^ ... ] will match one character that is not any of ...
So your pattern "[^(abc)]" is saying "match one character that is not a, b, c or the left or right bracket"; and indeed that is what happens in your test.
It is hard to say "replace all characters that are not part of the string 'abc'" in a single trivial regular expression. What you might do instead to achieve what you want could be some nasty thing like
while the input string still contains "abc"
find the next occurrence of "abc"
append to the output a string containing as many "+"s as there are characters before the "abc"
append "abc" to the output string
skip, in the input string, to a position just after the "abc" found
append to the output a string containing as many "+"s as there are characters left in the input
or possibly if the input alphabet is restricted you could use regular expressions to do something like
replace all occurrences of "abc" with a single character that does not occur anywhere in the existing string
replace all other characters with "+"
replace all occurrences of the target character with "abc"
which will be more readable but may not perform as well
Negating regexps is usually troublesome. I think you might want to use negative lookahead. Something like this might work:
String pattern = "(?<!ab).(?!abc)";
I didn't test it, so it may not really work for degenerate cases. And the performance might be horrible too. It is probably better to use a multistep algorithm.
Edit: No I think this won't work for every case. You will probably spend more time debugging a regexp like this than doing it algorithmically with some extra code.
Try to solve it without regular expressions:
String out = "";
int i;
for(i=0; i<text.length() - pattern.length() + 1; ) {
if (text.substring(i, i + pattern.length()).equals(pattern)) {
out += pattern;
i += pattern.length();
}
else {
out += "+";
i++;
}
}
for(; i<text.length(); i++) {
out += "+";
}
Rather than a single replaceAll, you could always try something like:
#Test
public void testString() {
final String in = "abXYabcXYabcHIH";
final String expected = "xxxxabcxxabcxxx";
String result = replaceUnwanted(in);
assertEquals(expected, result);
}
private String replaceUnwanted(final String in) {
final Pattern p = Pattern.compile("(.*?)(abc)([^a]*)");
final Matcher m = p.matcher(in);
final StringBuilder out = new StringBuilder();
while (m.find()) {
out.append(m.group(1).replaceAll(".", "x"));
out.append(m.group(2));
out.append(m.group(3).replaceAll(".", "x"));
}
return out.toString();
}
Instead of using replaceAll(...), I'd go for a Pattern/Matcher approach:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static String plusOut(String str, String pattern) {
StringBuilder builder = new StringBuilder();
String regex = String.format("((?:(?!%s).)++)|%s", pattern, pattern);
Matcher m = Pattern.compile(regex).matcher(str.toLowerCase());
while(m.find()) {
builder.append(m.group(1) == null ? pattern : m.group().replaceAll(".", "+"));
}
return builder.toString();
}
public static void main(String[] args) {
String text = "abXYabcXYZ";
String pattern = "abc";
System.out.println(plusOut(text, pattern));
}
}
Note that you'll need to use Pattern.quote(...) if your String pattern contains regex meta-characters.
Edit: I didn't see a Pattern/Matcher approach was already suggested by toolkit (although slightly different)...

Categories