Create a string-capable Guava Splitter - java

I would like to create a Guava Splitter for Java that can handle a quoted Java string as one block. For instance, I would like the following assertion to be true:
@Test
public void testSplitter() {
String toSplit = "a,b,\"c,d\\\"\",e";
List<String> expected = ImmutableList.of("a", "b", "c,d\"","e");
Splitter splitter = Splitter.onPattern(...);
List<String> actual = ImmutableList.copyOf(splitter.split(toSplit));
assertEquals(expected, actual);
}
I can write a regex that finds all the elements and ignores the ',' inside quotes, but I can't find a regex that would act as a separator for use with a Splitter.
If it's impossible, please just say so, then I'll build the list from the findAll regex.

This seems like something you should use a CSV library such as opencsv for. Separating values and handling cases like quoted blocks are what they're all about.

This is a Guava feature request: http://code.google.com/p/guava-libraries/issues/detail?id=412

I have the same problem (except that I don't need to support escaping of the quote character). I don't like including another library for such a simple thing. Then I came to the idea that I need a mutable CharMatcher. As with Bart Kiers's solution, it keeps the quote characters.
public static Splitter quotableComma() {
return on(new CharMatcher() {
private boolean inQuotes = false;
@Override
public boolean matches(char c) {
if ('"' == c) {
inQuotes = !inQuotes;
}
if (inQuotes) {
return false;
}
return (',' == c);
}
});
}
@Test
public void testQuotableComma() throws Exception {
String toSplit = "a,b,\"c,d\",e";
List<String> expected = ImmutableList.of("a", "b", "\"c,d\"", "e");
Splitter splitter = Splitters.quotableComma();
List<String> actual = ImmutableList.copyOf(splitter.split(toSplit));
assertEquals(expected, actual);
}

You could split on the following pattern:
\s*,\s*(?=((\\["\\]|[^"\\])*"(\\["\\]|[^"\\])*")*(\\["\\]|[^"\\])*$)
which might look (a bit) friendlier with the (?x) flag:
(?x) # enable comments, ignore space-literals
\s*,\s* # match a comma optionally surrounded by space-chars
(?= # start positive look ahead
( # start group 1
( # start group 2
\\["\\] # match an escaped quote or backslash
| # OR
[^"\\] # match any char other than a quote or backslash
)* # end group 2, and repeat it zero or more times
" # match a quote
( # start group 3
\\["\\] # match an escaped quote or backslash
| # OR
[^"\\] # match any char other than a quote or backslash
)* # end group 3, and repeat it zero or more times
" # match a quote
)* # end group 1, and repeat it zero or more times
( # open group 4
\\["\\] # match an escaped quote or backslash
| # OR
[^"\\] # match any char other than a quote or backslash
)* # end group 4, and repeat it zero or more times
$ # match the end-of-input
) # end positive look ahead
But even in this commented-version, it still is a monster. In plain English, this regex could be explained as follows:
Match a comma that is optionally surrounded by space-chars, only when looking ahead of that comma (all the way to the end of the string!), there are zero or an even number of quotes while ignoring escaped quotes or escaped backslashes.
So, after seeing this, you might agree with ColinD (I do!) that using some sort of a CSV parser is the way to go in this case.
Note that the regex above will leave the quotes around the tokens, i.e., the string a,b,"c,d\"",e (as a literal: "a,b,\"c,d\\\"\",e") will be split as follows:
a
b
"c,d\""
e
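To try the separator regex above without pulling in Guava, here is a minimal sketch that feeds it to plain String.split. The class name LookaheadSplitDemo and the helper split are made up for illustration; Splitter.onPattern should accept the same regex string, but that is not exercised here.

```java
import java.util.Arrays;
import java.util.List;

final class LookaheadSplitDemo {
    // The separator regex from the answer, written as a Java string literal.
    static final String SEPARATOR =
        "\\s*,\\s*(?=((\\\\[\"\\\\]|[^\"\\\\])*\"(\\\\[\"\\\\]|[^\"\\\\])*\")*(\\\\[\"\\\\]|[^\"\\\\])*$)";

    // Hypothetical helper: splits on commas that are outside double quotes.
    static List<String> split(String input) {
        return Arrays.asList(input.split(SEPARATOR));
    }

    public static void main(String[] args) {
        // The input is a,b,"c,d\"",e once the Java escapes are resolved.
        for (String token : split("a,b,\"c,d\\\"\",e")) {
            System.out.println(token); // the quoted token keeps its quotes and backslash
        }
    }
}
```

Since the lookahead rescans the rest of the input for every comma, this is likely to be slow on long lines, which is one more argument for a real CSV parser.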

Improving on @Rage-Steel's answer a bit.
final static CharMatcher notQuoted = new CharMatcher() {
private boolean inQuotes = false;
@Override
public boolean matches(char c) {
if ('"' == c) {
inQuotes = !inQuotes;
}
return !inQuotes;
}
};
final static Splitter SPLITTER = Splitter.on(notQuoted.and(CharMatcher.anyOf(" ,;|"))).trimResults().omitEmptyStrings();
And then,
public static void main(String[] args) {
final String toSplit = "a=b c=d,kuku=\"e=f|g=h something=other\"";
List<String> sputnik = SPLITTER.splitToList(toSplit);
for (String s : sputnik)
System.out.println(s);
}
Pay attention to thread safety: the matcher keeps mutable state (inQuotes), so, to put it simply, there isn't any.
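If the shared mutable matcher is a concern, one alternative is a small hand-written loop that keeps its state in local variables. The sketch below is not a Guava Splitter and not anyone's answer from this thread; the class name QuotedCsvSplitter is made up, and it assumes that quotes delimit a field and that a backslash inside quotes escapes the next character. Under those assumptions it produces the list the question asks for.

```java
import java.util.ArrayList;
import java.util.List;

final class QuotedCsvSplitter {
    // Splits on commas outside double quotes. Quote delimiters are dropped,
    // and inside quotes a backslash escapes the following character.
    static List<String> split(String input) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false; // local state, so concurrent calls do not interfere
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (inQuotes && c == '\\' && i + 1 < input.length()) {
                current.append(input.charAt(++i)); // keep the escaped char, drop the backslash
            } else if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("a,b,\"c,d\\\"\",e")); // prints: [a, b, c,d", e]
    }
}
```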

Related

Regex to identify strings containing a particular symbol?

I have a set of inputs: ++++, ----, +-+-. Out of these inputs I want the strings containing only + symbols.
If you want to see if a String contains nothing but + characters, write a loop to check it:
private static boolean containsOnly(String input, char ch) {
if (input.isEmpty())
return false;
for (int i = 0; i < input.length(); i++)
if (input.charAt(i) != ch)
return false;
return true;
}
Then call it to check:
System.out.println(containsOnly("++++", '+')); // prints: true
System.out.println(containsOnly("----", '+')); // prints: false
System.out.println(containsOnly("+-+-", '+')); // prints: false
UPDATE
If you must do it using regex (worse performance), then you can do any of these:
// escape special character '+'
input.matches("\\++")
// '+' not special in a character class
input.matches("[+]+")
// if "+" is dynamic value at runtime, use quote() to escape for you,
// then use a repeating non-capturing group around that
input.matches("(?:" + Pattern.quote("+") + ")+")
Replace final + with * in each of these, if an empty string should return true.
The regular expression for checking if a string is composed of only one repeated symbol is
^(.)\1*$
If you only want lines composed by '+', then it's
^\++$, or ^++*$ if your regex implementation does not support + (meaning "one or more").
For a sequence of the same symbol, use
(.)\1+
as the regular expression. For example, this will match +++, and --- but not +--.
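A minimal Java sketch of the two ideas above, using String.matches (which already requires the whole input to match, so the ^ and $ anchors are dropped). The class and method names are made up for illustration.

```java
final class RepeatedSymbol {
    // True if the input is one character repeated one or more times.
    static boolean isOneRepeatedSymbol(String input) {
        return input.matches("(.)\\1*");
    }

    // True if the input consists only of '+' characters (at least one).
    static boolean isOnlyPluses(String input) {
        return input.matches("\\++");
    }

    public static void main(String[] args) {
        for (String s : new String[] {"++++", "----", "+-+-"}) {
            System.out.println(s + " repeated=" + isOneRepeatedSymbol(s)
                + " onlyPluses=" + isOnlyPluses(s));
        }
    }
}
```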
Regex pattern: ^[^\+]*?\+[^\+]*$
This will only permit one plus sign per string.
Explanation:
^ #From start of string
[^\+]* #Match 0 or more non plus characters
\+ #Match 1 plus character
[^\+]* #Match 0 or more non plus characters
$ #End of string
Edit: I just read the comments under the question. I didn't actually steal the regex from the comments; it just happens to be intellectual convergence.
Whoops, when using matches, disregard the ^ and $ anchors:
input.matches("[^\\+]*?\\+[^\\+]*")

Java regular expression lookahead

I have strings that I need to use regex to replace a specific character. The strings are in the following format:
"abc.edf" : "abc.abc", "ghi.ghk" : "bbb.bbb" , "qwq.tyt" : "ddd.ddd"
I need to replace the periods, '.', that are between the strings in quotes before the colon but not the strings in quotes after the colon and before the comma. Could someone shed some light?
This pattern will match the entire part that you want to touch: "\w{3}\.\w{3}" : "\w{3}\.\w{3}". Since it includes the colon and the values on both side, it won't match ones where there is a comma between the values. Depending on your needs, you may need to change \w to some other character class.
But, as I'm sure you are aware, you don't want to replace the entire string. You only want to replace the one character. There are two ways to do that. You can either use look-aheads and look-behinds to exclude everything else except the period from the resulting match:
Pattern: (?<="\w{3})\.(?=\w{3}" : "\w{3}\.\w{3}")
Replacement: :
Or, if the look-aheads and look-behinds confuse you, you could just capture the whole thing and include the original values from the captured groups in the replacement value:
Pattern: ("\w{3})\.(\w{3}" : "\w{3}\.\w{3}")
Replacement: $1:$2
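Both variants can be tried in Java with String.replaceAll. This is a sketch under the assumption that every key and value is exactly three word characters on each side of the period, as in the sample input; the class and method names are made up for illustration.

```java
final class ReplaceKeyDots {
    // Lookaround variant: only the period itself is matched.
    static final String LOOKAROUND =
        "(?<=\"\\w{3})\\.(?=\\w{3}\" : \"\\w{3}\\.\\w{3}\")";
    // Capturing variant: the whole pair is matched and rebuilt from groups 1 and 2.
    static final String CAPTURING =
        "(\"\\w{3})\\.(\\w{3}\" : \"\\w{3}\\.\\w{3}\")";

    static String withLookarounds(String s) {
        return s.replaceAll(LOOKAROUND, ":");
    }

    static String withGroups(String s) {
        return s.replaceAll(CAPTURING, "$1:$2");
    }

    public static void main(String[] args) {
        String text = "\"abc.edf\" : \"abc.abc\", \"ghi.ghk\" : \"bbb.bbb\" , \"qwq.tyt\" : \"ddd.ddd\"";
        System.out.println(withLookarounds(text));
        System.out.println(withGroups(text));
    }
}
```

Both calls should leave the values after the colon untouched and change only the period in each key.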
Try the following pattern: /\.(?=[a-z]+)/g
Java Working Demo:
public class StackOverFlow31520446 {
public static String text;
public static String pattern;
public static String replacement;
static {
text = "\"abc.edf\" : \"123.231\", \"ghi.ghk\" : \"456.678\" , \"qwq.tyt\" : \"141.242\"";
pattern = "\\.(?=[a-z]+)";
replacement = ";";
}
public static String replaceMatches(String text, String pattern, String replacement) {
return text.replaceAll(pattern, replacement);
}
public static void main(String[] args) {
System.out.println(replaceMatches(text, pattern, replacement));
}
}
I'm not sure what you intend to do with the string, but this is a way to match the contents of the quotes. The contents are in capture group 1. You could use a callback to replace the dots within the contents, passing that back within the main replacement function.
Find: "([^"]*\.[^"]*)"(?=\s*:)
Replace: " + func( call to replace dots from capt buff 1 ) + "
Formatted:
" # Open quote
( [^"]* \. [^"]* ) # (1), group 1 - contents
" # Close quote
(?= # Lookahead, must be a colon
\s*
:
)
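In Java the "callback" can be written as an explicit find loop with Matcher.appendReplacement and Matcher.appendTail. The sketch below uses the Find pattern from above; the class name KeyDotCallback and the method replaceDotsInKeys are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KeyDotCallback {
    // A quoted string containing a period, followed by optional whitespace and a colon.
    private static final Pattern QUOTED_KEY =
        Pattern.compile("\"([^\"]*\\.[^\"]*)\"(?=\\s*:)");

    static String replaceDotsInKeys(String input, char replacement) {
        Matcher m = QUOTED_KEY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // The "callback": rewrite only the captured contents of the key.
            String rewritten = "\"" + m.group(1).replace('.', replacement) + "\"";
            // quoteReplacement guards against '$' or '\' in the rewritten text.
            m.appendReplacement(out, Matcher.quoteReplacement(rewritten));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "\"abc.edf\" : \"abc.abc\", \"ghi.ghk\" : \"bbb.bbb\" , \"qwq.tyt\" : \"ddd.ddd\"";
        System.out.println(replaceDotsInKeys(text, ':'));
    }
}
```

Unlike the fixed-width \w{3} patterns, this version does not assume a particular key length.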
I would go for a different approach (it may even be faster). In your loop over all strings, first check whether the string matches a number, \d*\.?\d*; if it does not, replace . with : (without any regexp).
Would that solve your problem?
You can do it without look arounds:
str = str.replaceAll("(\\D)\\.(\\D)", "$1:$2");
should be sufficient for the task.

how to split a string by "|"

I want to use regular expression to split this string:
String filter = "(go|add)addition|(sub)subtraction|(mul|into)multiplication|adding(add|go)values|(add|go)(go)(into)multiplication|";
I want to split it by | except when the pipe appears within brackets, in which case it should be ignored, i.e. I am expecting an output like this:
(go|add)addition
(sub)subtraction
(mul|into)multiplication
adding(add|go)values
(add|go)(go)(into)multiplication
Updated
And then I want to move the words within the brackets at the start to the end, something like this:
addition(go|add)
subtraction(sub)
multiplication(mul|into)
adding(add|go)values
multiplication(add|go)(go)(into)
I have tried the regular expression from "Splitting of string for `whitespace` & `and`", but that one relies on quotes and I have not been able to make it work for brackets.
I already saw this question 15 minutes ago. Now that it is asked correctly, here is my proposed answer:
Doing this with a regex is complex because you need to count parentheses. I advise you to parse the string manually, like this:
public static void main(String[] args) {
String filter = "(go|add)addition|(sub)subtraction|(mul|into)multiplication|";
List<String> strings = new LinkedList<>();
int countParenthesis = 0;
StringBuilder current = new StringBuilder();
for(char c : filter.toCharArray()) {
if(c == '(') {countParenthesis ++;}
if(c == ')') {countParenthesis --;}
if(c == '|' && countParenthesis == 0) {
strings.add(current.toString());
current = new StringBuilder();
} else {
current.append(c);
}
}
strings.add(current.toString());
for (String string : strings) {
System.out.println(string+" ");
}
}
Output :
(go|add)addition
(sub)subtraction
(mul|into)multiplication
If you don't have nested parentheses (so not (mul(iple|y)|foo)) you can use:
((?:\([^)]*\))*)([^()|]+(?:\([^)]*\)[^()|]*)*)
( #start first capturing group
(?: # non capturing group
\([^)]*\) # opening bracket, then anything except closing bracket, closing bracket
)* # possibly multiple bracket groups at the beginning
)
( # start second capturing group
[^()|]+ # go to the next bracket group, or the closing |
(?:
\([^)]*\)[^()|]* # bracket group, then go to the next bracket group/closing |
)* # possibly multiple brackets groups
) # close second capturing group
and replace with
\2\1
Explanation
((?:\([^)]*\))*) matches and captures all the parenthesis groups at the beginning
[^()|]* anything except (, ), or |. If there isn't any parenthesis, this will match everything.
(?:\([^)]*\)[^()|]*): (?:...) is a non capturing group, \([^)]*\) matches everything inside parenthesis, [^()|]* gets us up to the next parenthesis group or the | that ends the match.
Code sample:
String testString = "(go|add)addition|(sub)subtraction|(mul|into)multiplication|adding(add|go)values|(add|go)(go)(into)multiplication|";
Pattern p = Pattern.compile("((?:\\([^)]*\\))*)([^()|]+(?:\\([^)]*\\)[^()|]*)*)");
Matcher m = p.matcher(testString);
while (m.find()) {
System.out.println(m.group(2)+m.group(1));
}
Outputs (demo):
addition(go|add)
subtraction(sub)
multiplication(mul|into)
adding(add|go)values
multiplication(add|go)(go)(into)
Your string
"(go|add)addition|(sub)subtraction|(mul|into)multiplication|"
has the pattern |( at every point where you want to split, so you can split on it for this particular string. But this won't give the expected result if a substring contains a parenthesis ( "(" ) in between, for example:
(go|(add))addition.... continue
Hope this helps.
Set up a boolean to keep track of whether you are inside parentheses or not.
boolean isInside = false;
loop through the string:
if char at i == '(' then isInside = true
if char at i == ')' then isInside = false
if char at i == '|' and isInside
keep the | as part of the current token
else if char at i == '|'
split here
Something like this should work, I think.

Regex to extract strings within delimiter

I am trying to extract string occurrences within delimiters (parentheses in this case) but not the ones which are within quotes (single or double). Here is what I have tried. This regex fetches all occurrences within parentheses, including the ones which are within quotes (I don't want the ones within quotes):
public class RegexMain {
static final String PATTERN = "\\(([^)]+)\\)";
static final Pattern CONTENT = Pattern.compile(PATTERN);
/**
* @param args
*/
public static void main(String[] args) {
String testString = "Rhyme (Jack) and (Jill) went up the hill on \"(Peter's)\" request.";
Matcher match = CONTENT.matcher(testString);
while(match.find()) {
System.out.println(match.group()); // prints (Jack), (Jill) and (Peter's)
}
}
}
You could try
public class RegexMain {
static final String PATTERN = "\\(([^)]+)\\)|\"[^\"]*\"";
static final Pattern CONTENT = Pattern.compile(PATTERN);
/**
* @param args
*/
public static void main(String[] args) {
String testString = "Rhyme (Jack) and (Jill) went up the hill on \"(Peter's)\" request.";
Matcher match = CONTENT.matcher(testString);
while(match.find()) {
if(match.group(1) != null) {
System.out.println(match.group(1)); // prints Jack, Jill
}
}
}
}
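This pattern will match quoted strings as well as parenthesized ones, but only the parenthesized ones will put something in group(1). Because the matcher scans from left to right, it reaches the opening quote of "(Peter's)" before the parenthesis inside it, so the quoted alternative consumes the whole quoted string and (Peter's) is never matched on its own.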
This is a case where you can make elegant use of look-behind and look-ahead operators to achieve what you want. Here is a solution in Python (I always use it for trying out stuff quickly on the command line), but the regular expression should be the same in Java code.
This regex matches content that is preceded by an opening parenthesis (positive look-behind) and followed by a closing parenthesis (positive look-ahead). It rejects the match when the opening parenthesis is preceded by a single or double quote (negative look-behind) or when the closing parenthesis is followed by a single or double quote (negative look-ahead).
In [1]: import re
In [2]: s = "Rhyme (Jack) and (Jill) went up the hill on \"(Peter's)\" request."
In [3]: re.findall(r"""
...: (?<= # start of positive look-behind
...: (?<! # start of negative look-behind
...: [\"\'] # avoids matching opening parenthesis preceded by single or double quote
...: ) # end of negative look-behind
...: \( # matches opening parenthesis
...: ) # end of positive look-behind
...: \w+ (?: \'\w* )? # matches whatever your content looks like (configure this yourself)
...: (?= # start of positive look-ahead
...: \) # matches closing parenthesis
...: (?! # start of negative look-ahead
...: [\"\'] # avoids matching closing parenthesis succeeded by single or double quote
...: ) # end of negative look-ahead
...: ) # end of positive look-ahead
...: """,
...: s,
...: flags=re.X)
Out[3]: ['Jack', 'Jill']
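For Java, here is a hedged adaptation rather than a literal translation: instead of nesting the negative look-behind inside a positive look-behind, it consumes the parentheses and reads the content from a capture group. The class name UnquotedParens and the method extract are made up for illustration, and the content part (\w+ with an optional apostrophe suffix) is the same assumption as in the Python version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnquotedParens {
    // "(" not preceded by a quote, captured content, ")" not followed by a quote.
    private static final Pattern CONTENT =
        Pattern.compile("(?<![\"'])\\((\\w+(?:'\\w*)?)\\)(?![\"'])");

    static List<String> extract(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = CONTENT.matcher(input);
        while (m.find()) {
            found.add(m.group(1)); // content only, without the parentheses
        }
        return found;
    }

    public static void main(String[] args) {
        String s = "Rhyme (Jack) and (Jill) went up the hill on \"(Peter's)\" request.";
        System.out.println(extract(s)); // prints: [Jack, Jill]
    }
}
```

Like the original, this only looks at the characters directly next to the parentheses, so something like "x (Peter's) y" inside quotes would still be picked up.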
Note: this is not a complete answer because I'm not familiar with Java, but I believe it can be converted to Java.
The easiest approach, as far as I'm concerned, is to replace the quoted parts in the string with an empty string, then look for the matches. Hoping you're somewhat familiar with PHP, here's the idea.
$str = "Rhyme (Jack) and (Jill) went up the hill on \" (Peter's)\" request.";
preg_match_all(
$pat = '~(?<=\().*?(?=\))~',
// anything inside parentheses
preg_replace('~([\'"]).*?\1~','',$str),
// this replaces quoted strings with ''
$matches
// and assigns the result into this variable
);
print_r($matches[0]);
// $matches[0] returns the matches in preg_match_all
// [0] => Jack
// [1] => Jill

Regex for numeric portion of Java string

I'm trying to write a Java method that will take a string as a parameter and return another string if it matches a pattern, and null otherwise. The pattern:
Starts with a number (1+ digits); then followed by
A colon (":"); then followed by
A single whitespace (" "); then followed by
Any Java string of 1+ characters
Hence, some valid strings that match this pattern:
50: hello
1: d
10938484: 394958558
And some strings that do not match this pattern:
korfed49
: e4949
6
6:
6:sdjjd4
The general skeleton of the method is this:
public String extractNumber(String toMatch) {
// If toMatch matches the pattern, extract the first number
// (everything prior to the colon).
// Else, return null.
}
Here's my best attempt so far, but I know I'm wrong:
public String extractNumber(String toMatch) {
// If toMatch matches the pattern, extract the first number
// (everything prior to the colon).
String regex = "???";
if(toMatch.matches(regex))
return toMatch.substring(0, toMatch.indexOf(":"));
// Else, return null.
return null;
}
Thanks in advance.
Your description is spot on, now it just needs to be translated to a regex:
^ # Starts
\d+ # with a number (1+ digits); then followed by
: # A colon (":"); then followed by
# A single whitespace (" "); then followed by
\w+ # Any word character, one or more times
$ # (followed by the end of input)
Giving, in a Java string:
"^\\d+: \\w+$"
You also want to capture the numbers: put parentheses around \d+, use a Matcher, and capture group 1 if there is a match:
private static final Pattern PATTERN = Pattern.compile("^(\\d+): \\w+$");
// ...
public String extractNumber(String toMatch) {
Matcher m = PATTERN.matcher(toMatch);
return m.find() ? m.group(1) : null;
}
Note: in Java, \w only matches ASCII letters and digits by default (this is not the case for .NET languages, for instance), and it also matches an underscore. If you don't want the underscore, you can use (Java-specific syntax):
[\w&&[^_]]
instead of \w for the last part of the regex, giving:
"^(\\d+): [\\w&&[^_]]+$"
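The question asks for "any Java string of 1+ characters" after the space, which \w+ does not cover (it rejects, for example, a tail containing spaces). A minimal variant, assuming the tail may be anything except a line break, swaps in .+ and uses Matcher.matches so no anchors are needed. The class name NumberPrefix is made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumberPrefix {
    // 1+ digits, a colon, exactly one space, then at least one more character.
    private static final Pattern PATTERN = Pattern.compile("(\\d+): .+");

    static String extractNumber(String toMatch) {
        Matcher m = PATTERN.matcher(toMatch);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractNumber("50: hello"));  // prints: 50
        System.out.println(extractNumber("6:sdjjd4"));   // prints: null
    }
}
```

If the tail may span several lines, Pattern.DOTALL would be needed so that . also matches line terminators.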
Try using the following: \d+: \w+
