how to split a string by "|" - java

I want to use regular expression to split this string:
String filter = "(go|add)addition|(sub)subtraction|(mul|into)multiplication|adding(add|go)values|(add|go)(go)(into)multiplication|";
I want to split it by | except when the pipe appears within brackets, in which case it should be ignored, i.e. I am expecting an output like this:
(go|add)addition
(sub)subtraction
(mul|into)multiplication
adding(add|go)values
(add|go)(go)(into)multiplication
Updated
And then I want to move the bracketed words at the start of each token to its end.
Something like this..
addition(go|add)
subtraction(sub)
multiplication(mul|into)
adding(add|go)values
multiplication(add|go)(go)(into)
I have tried this regular expression: Splitting of string for `whitespace` & `and` but they have used quotes and I have not been able to make it work for brackets.

I already saw this question 15 minutes ago. Now that it is asked correctly, here is my proposed answer:
Doing this with a regex is complex because you need to count parentheses. I advise you to parse the string manually, like this:
public static void main(String[] args) {
String filter = "(go|add)addition|(sub)subtraction|(mul|into)multiplication|";
List<String> strings = new LinkedList<>();
int countParenthesis = 0;
StringBuilder current = new StringBuilder();
for(char c : filter.toCharArray()) {
if(c == '(') {countParenthesis ++;}
if(c == ')') {countParenthesis --;}
if(c == '|' && countParenthesis == 0) {
strings.add(current.toString());
current = new StringBuilder();
} else {
current.append(c);
}
}
strings.add(current.toString());
for (String string : strings) {
System.out.println(string+" ");
}
}
Output :
(go|add)addition
(sub)subtraction
(mul|into)multiplication
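The loop above only covers the split. A minimal sketch of both steps (splitting on top-level pipes, then moving the leading bracket groups to the end) could look like the following. The class and method names are hypothetical, and skipping empty tokens (such as the one after the trailing |) is an assumption about the desired behaviour.

```java
import java.util.ArrayList;
import java.util.List;

final class TopLevelPipeSplitter {

    // Splits on '|' only when the parenthesis depth is zero; empty tokens are skipped.
    static List<String> splitTopLevel(String input) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;
        for (char c : input.toCharArray()) {
            if (c == '(') depth++;
            if (c == ')' && depth > 0) depth--;
            if (c == '|' && depth == 0) {
                if (current.length() > 0) parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) parts.add(current.toString());
        return parts;
    }

    // Moves the parenthesised groups that open a token to the end of that token.
    static String moveLeadingGroups(String token) {
        int depth = 0;
        int end = 0; // index just past the last complete leading group
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth == 0) end = i + 1;
            } else if (depth == 0) {
                break; // first character outside any leading group
            }
        }
        return token.substring(end) + token.substring(0, end);
    }

    public static void main(String[] args) {
        String filter = "(go|add)addition|adding(add|go)values|(add|go)(go)(into)multiplication|";
        for (String token : splitTopLevel(filter)) {
            System.out.println(moveLeadingGroups(token));
        }
    }
}
```

Because it counts depth instead of matching a pattern, this sketch also copes with nested groups such as (mul(iple|y)|foo).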

If you don't have nested parentheses (so nothing like (mul(iple|y)|foo)) you can use:
((?:\([^)]*\))*)([^()|]+(?:\([^)]*\)[^()|]*)*)
( #start first capturing group
(?: # non capturing group
\([^)]*\) # opening bracket, then anything except closing bracket, closing bracket
)* # possibly multiple bracket groups at the beginning
)
( # start second capturing group
[^()|]+ # go to the next bracket group, or the closing |
(?:
\([^)]*\)[^()|]* # bracket group, then go to the next bracket group/closing |
)* # possibly multiple brackets groups
) # close second capturing group
and replace with
\2\1
Explanation
((?:\([^)]*\))*) matches and captures all the parenthesized groups at the beginning.
[^()|]+ matches anything except (, ), or |. If there are no parentheses, this matches everything up to the next |.
(?:\([^)]*\)[^()|]*): (?:...) is a non-capturing group, \([^)]*\) matches a parenthesized group, and [^()|]* gets us up to the next parenthesized group or the | that ends the match.
Code sample:
String testString = "(go|add)addition|(sub)subtraction|(mul|into)multiplication|adding(add|go)values|(add|go)(go)(into)multiplication|";
Pattern p = Pattern.compile("((?:\\([^)]*\\))*)([^()|]+(?:\\([^)]*\\)[^()|]*)*)");
Matcher m = p.matcher(testString);
while (m.find()) {
System.out.println(m.group(2)+m.group(1));
}
Outputs (demo):
addition(go|add)
subtraction(sub)
multiplication(mul|into)
adding(add|go)values
multiplication(add|go)(go)(into)

Your String
"(go|add)addition|(sub)subtraction|(mul|into)multiplication|"
contains the pattern |( at every split point, so you can split on it for this particular String. But this won't give the expected result if a substring contains a parenthesis "(" preceded by a pipe inside a group, e.g.:
(go|(add))addition.... continue
Hope this helps.

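Set up a boolean to keep track of whether you are inside parentheses or not.
boolean isInside = false;
StringBuilder current = new StringBuilder();
for (char c : filter.toCharArray()) {
    if (c == '(') isInside = true;
    if (c == ')') isInside = false;
    if (c == '|' && !isInside) {
        // pipe outside parentheses: the current token is complete
        System.out.println(current);
        current.setLength(0);
    } else {
        // any other character, including a pipe inside parentheses, is kept
        current.append(c);
    }
}
Something like this should work, I think.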

Related

Find dash "-" that's not inside round brackets "()" within String

I'm trying to find/determine if a String contains the character "-" that is not enclosed in round brackets "()".
I've tried the regex
[^\(]*-[^\)]*,
but it's not working.
Examples:
100 - 200 mg -> should match because the "-" is not enclosed in round brackets.
100 (+/-) units -> should NOT match
Do you have to use regex? You could try just iterating over the string and keeping track of the scope like so:
public boolean HasScopedDash(String str)
{
int scope = 0;
boolean foundInScope = false;
for (int i = 0; i < str.length(); i++)
{
char c = str.charAt(i);
if (c == '(')
scope++;
else if (c == '-')
foundInScope = scope != 0;
else if (c == ')' && scope > 0)
{
if (foundInScope)
return true;
scope--;
}
}
return false;
}
Edit: As mentioned in the comments, it might be desirable to exclude cases where the dash comes after an opening parenthesis but no closing parenthesis ever follows. (I.e. "abc(2-xyz") The above edited code accounts for this.
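The method above reports a dash that sits inside parentheses, whereas the question asks for a dash that is not enclosed. A minimal sketch of that inverse check, using the same scope counter, might look like this. The names are hypothetical, and treating everything after an unclosed ( as "inside" is an assumption.

```java
final class DashScanner {

    // Returns true when at least one '-' occurs at parenthesis depth zero.
    static boolean hasDashOutsideParens(String s) {
        int depth = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')' && depth > 0) {
                depth--;
            } else if (c == '-' && depth == 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasDashOutsideParens("100 - 200 mg"));
        System.out.println(hasDashOutsideParens("100 (+/-) units"));
    }
}
```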
You might not want to check for parentheses at all to make this pass. Maybe you could simply check other boundaries instead. This expression, for instance, checks for digits and spaces before and after the dash (or any other characters you wish to allow in the middle), which is much easier to modify:
([0-9]\s+[-]\s+[0-9])
It passes your first input and fails the undesired input. You could simply add other chars to its middle char list using logical ORs.
Demo
Java supports quantified atomic groups, so this works.
It consumes paired parentheses and their contents
without giving anything back, until it finds a dash -.
This is done via the atomic group construct (?> ).
^(?>(?>\(.*?\))|[^-])*?-
https://www.regexplanet.com/share/index.html?share=yyyyd8n1dar
(click on the Java button, check the find() function column)
Readable
^
(?>
(?> \( .*? \) )
|
[^-]
)*?
-
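A minimal sketch of how the atomic-group pattern could be used from Java, assuming find() is the intended call as the linked demo suggests. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class AtomicDashFinder {

    // The pattern from the answer, with backslashes doubled for a Java string literal.
    private static final Pattern DASH_OUTSIDE_PARENS =
            Pattern.compile("^(?>(?>\\(.*?\\))|[^-])*?-");

    // True when a '-' can be reached without entering a closed pair of parentheses.
    static boolean hasDashOutsideParens(String s) {
        return DASH_OUTSIDE_PARENS.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasDashOutsideParens("100 - 200 mg"));
        System.out.println(hasDashOutsideParens("100 (+/-) units"));
    }
}
```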
If you don't mind checking the string with two simple regexes instead of one complicated regex, you can try this instead:
public static boolean match(String input) {
Pattern p1 = Pattern.compile("\\-"); // match dash
Pattern p2 = Pattern.compile("\\(.*\\-.*\\)"); // match dash within bracket
Matcher m1 = p1.matcher(input);
Matcher m2 = p2.matcher(input);
if ( m1.find() && !m2.find() ) {
return true;
} else {
return false;
}
}
Test the string
public static void main(String[] args) {
String input1 = "100 - 200 mg";
String input2 = "100 (+/-) units";
System.out.println(input1 + " : " + ( match(input1) ? "match" : "not match") );
System.out.println(input2 + " : " + ( match(input2) ? "match" : "not match") );
}
The output will be
100 - 200 mg : match
100 (+/-) units : not match
Matcher m = Pattern.compile("\\([^()-]*-[^()]*\\)").matcher(s); return !m.find();
https://ideone.com/YXvuem

Nested regexps and replace

I have strings like this <p0=v0 p1=v1 p2=v2 ....> and I want to swap pX with vX to have something like <v0=p0 v1=p1 v2=p2 ....> using regexps.
I want only pairs in <> to be swapped.
I wrote:
Pattern pattern = Pattern.compile("<(\\w*)=(\\w*)>");
Matcher matcher = pattern.matcher("<p1=v1>");
System.out.println(matcher.replaceAll("$2=$1"));
But it works only with a single pair pX=vX
Could someone explain to me how to write a regexp that works for multiple pairs?
Simple, use groups:
String input = "<p0=v0 p1=v1 p2=v2>";
// |group 1
// ||matches "p" followed by one digit
// || |... followed by "="
// || ||group 2
// || |||... followed by "v", followed by one digit
// || ||| |replaces group 2 with group 1,
// || ||| |re-writes "=" in the middle
System.out.println(input.replaceAll("(p[0-9])=(v[0-9])", "$2=$1"));
Output:
<v0=p0 v1=p1 v2=p2>
You can use this pattern:
"((?:<|\\G(?<!\\A))\\s*)(p[0-9]+)(\\s*=\\s*)(v[0-9]+)"
To ensure that the pairs come after an opening angle bracket, the pattern starts with:
(?:<|\\G(?<!\\A))
which means: an opening angle bracket OR the end of the last match.
\\G is an anchor for the position immediately after the last match or the beginning of the string (in other words, it is the last position of the regex engine in the string, which is zero at the start of the string). To avoid a match at the start of the string I added a negative lookbehind (?<!\\A) -> not preceded by the start of the string.
This trick forces each pair to be preceded by another pair or by a <.
example:
String subject = "p5=v5 <p0=v0 p1=v1 p2=v2 p3=v3> p4=v4";
String pattern = "((?:<|\\G(?<!\\A))\\s*)(p[0-9]+)(\\s*=\\s*)(v[0-9]+)";
String result = subject.replaceAll(pattern, "$1$4$3$2");
If you need p and v to have the same number you can change it to:
String pattern = "((?:<|\\G(?<!\\A))\\s*)(p([0-9]+))(\\s*=\\s*)(v\\3)";
String result = subject.replaceAll(pattern, "$1$5$4$2");
If parts between angle brackets can contain other things (that are not pairs):
String pattern = "((?:<|\\G(?<!\\A))(?:[^\\s>]+\\s*)*?\\s*)(p([0-9]+))(\\s*=\\s*)(v\\3)";
String result = subject.replaceAll(pattern, "$1$5$4$2");
Note: all these patterns only check that there is an opening angle bracket; they don't check for a closing angle bracket. If a closing angle bracket is missing, all pairs will be replaced until there are no more contiguous pairs (for the first two patterns) or until the next closing angle bracket or the end of the string (for the third pattern).
You can check for the presence of a closing angle bracket by adding (?=[^<>]*>) at the end of each pattern. However, adding this will make the pattern perform poorly. It is better to find the parts between angle brackets with (?<=<)[^<>]++(?=>) and to replace the pairs in a callback function. You can take a look at this post to implement it.
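Java has no callback form of String.replaceAll before Java 9, so one way to sketch the "find the tag body, then swap the pairs inside it" idea is an appendReplacement loop. This is an illustration under assumptions: the class and method names are hypothetical, and the inner pair pattern is a simplified version of the ones shown earlier.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagPairSwapper {

    // Content strictly between '<' and '>' (pattern quoted from the answer).
    private static final Pattern TAG_BODY = Pattern.compile("(?<=<)[^<>]++(?=>)");
    // A single pX=vX pair; group 2 keeps the original spacing around '='.
    private static final Pattern PAIR = Pattern.compile("(p[0-9]+)(\\s*=\\s*)(v[0-9]+)");

    static String swapPairsInTags(String subject) {
        Matcher tags = TAG_BODY.matcher(subject);
        StringBuffer out = new StringBuffer();
        while (tags.find()) {
            // The loop body plays the role of the callback for each tag body.
            String swapped = PAIR.matcher(tags.group()).replaceAll("$3$2$1");
            // quoteReplacement keeps any '$' or '\' in the result literal.
            tags.appendReplacement(out, Matcher.quoteReplacement(swapped));
        }
        tags.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(swapPairsInTags("p5=v5 <p0=v0 p1=v1 p2=v2 p3=v3> p4=v4"));
    }
}
```

Since the tag body only matches when both brackets are present, pairs after an unclosed < are left untouched.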
Replacing everything between < and > (let's call it a tag) in a single pass is, in my opinion, not possible if the same pattern can also occur outside the tag.
Instead of replacing everything at once, I'd go for two regexes:
String str = "<p1=v1 p2=v2> p3=v3 <p4=v4>";
Pattern insideTag = Pattern.compile("<(.+?)>");
Matcher m = insideTag.matcher(str);
while(m.find()) {
str = str.replace(m.group(1), m.group(1).replaceAll("(\\w*)=(\\w*)", "$2=$1"));
}
System.out.println(str);
//prints: <v1=p1 v2=p2> p3=v3 <v4=p4>
The matcher grabs everything between < and > and for each match it replaces the content of the first capturing group with the swapped one on the original string, but only if it matches (\w*)=(\w*), of course.
Trying it with
<p1=v1 p2=v2 just some trash> p3=v3 <p4=v4>
gives the output
<v1=p1 v2=p2 just some trash> p3=v3 <v4=p4>
This should work to swap only those pairs between < and >:
String string = "<p0=v0 p1=v1 p2=v2> a=b c=d xyz=abc <foo=bar baz=bat>";
Pattern pattern1 = Pattern.compile("<[^>]+>");
Pattern pattern2 = Pattern.compile("(\\w+)=(\\w+)");
Matcher matcher1 = pattern1.matcher(string);
StringBuffer sbuf = new StringBuffer();
while (matcher1.find()) {
Matcher matcher2 = pattern2.matcher(matcher1.group());
matcher1.appendReplacement(sbuf, matcher2.replaceAll("$2=$1"));
}
matcher1.appendTail(sbuf);
System.out.println(sbuf);
OUTPUT:
<v0=p0 v1=p1 v2=p2> a=b c=d xyz=abc <bar=foo bat=baz>
If Java can do the \G anchor, this will work for unnested <>'s
Find: ((?:(?!\A|<)\G|<)[^<>]*?)(\w+)=(\w+)(?=[^<>]*?>)
Replace (globally): $1$3=$2
Regex explained
( # (1 start)
(?:
(?! \A | < )
\G # Start at last match
|
< # Or, <
)
[^<>]*?
) # (1 end)
( \w+ ) # (2)
=
( \w+ ) # (3)
(?= [^<>]*? > ) # There must be a closing > ahead
Perl test case
$/ = undef;
$str = <DATA>;
$str =~ s/((?:(?!\A|<)\G|<)[^<>]*?)(\w+)=(\w+)(?=[^<>]*?>)/$1$3=$2/g;
print $str;
__DATA__
<p0=v0 p1=v1 p2=v2 ....>
Output >>
<v0=p0 v1=p1 v2=p2 ....>
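Java's java.util.regex does support the \G anchor, so the same find/replace can be sketched in Java. The class and method names below are hypothetical; the pattern is the one above with backslashes doubled for a Java string literal.

```java
import java.util.regex.Pattern;

final class GAnchorSwap {

    // Same pattern as in the Perl test case, escaped for Java.
    private static final Pattern PAIR_IN_TAG = Pattern.compile(
            "((?:(?!\\A|<)\\G|<)[^<>]*?)(\\w+)=(\\w+)(?=[^<>]*?>)");

    // Swaps key and value for every pair found inside an unnested <...>.
    static String swap(String input) {
        return PAIR_IN_TAG.matcher(input).replaceAll("$1$3=$2");
    }

    public static void main(String[] args) {
        System.out.println(swap("<p0=v0 p1=v1 p2=v2 ....>"));
    }
}
```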

How to skip multiline comments with multiple comment closing tags using java

I have a java program which reads a text file and adds and removes some portion of the contents. It works with the inline and multiple line comments also in the text files.
For example the following portion will be skipped
// inline comment
/*multiple
*comment
*/
I am having a problem with a case where multiple comment closing occurs, for example
/**
*This
* is
*/
* a multiple line comment
*/
In this case as soon as the first comment closing tag occurs the skipping of comment is stopped and the rest of the line is printed in the output file.
Here is how I am doing this
boolean commentStart = false;
boolean commentEnd = false;
if(line.trim().indexOf("/*") != -1) { // start
commentStart = true;
}
if(line.trim().indexOf("*/") != -1 && commentStart) { // closed
commentEnd = true;
commentStart = false;
}
if(commentStart || (!commentStart && commentEnd)) {
//skip line
}
Any help? Thank you.
Unless nested comments are involved, you have a malformed file there. If that's OK, then you need to define what a comment is, if it is not only something between /* and */. From your example, it looks like your definition of a comment is any line that starts with */, /* or *. In regex: ^\s*(?:/\*|\*/|\*).
If that works, I'd just skip lines if they match the regular expression.
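A minimal sketch of that line-based skip, assuming the definition "a comment line starts with /*, */ or *" from this answer plus // from the question. The names are hypothetical. Note that this also drops code lines that happen to start with *, such as a wrapped multiplication.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class CommentLineFilter {

    // Optional indentation, then "/*", "*/", "*" or "//".
    private static final Pattern COMMENT_LINE =
            Pattern.compile("^\\s*(?:/\\*|\\*/|\\*|//)");

    static boolean isCommentLine(String line) {
        return COMMENT_LINE.matcher(line).find();
    }

    // Returns the lines that do not look like comment lines, in their original order.
    static List<String> dropCommentLines(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (!isCommentLine(line)) {
                kept.add(line);
            }
        }
        return kept;
    }
}
```

With this definition the stray second */ in the example is skipped as well, because each line is judged on its own rather than by tracking an open/closed state.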
I have a Perl regex that will strip comments from Java taking full account of quoted strings and everything. The only thing it doesn't grok is comments or quotes made with \uXXXX sequences.
sub strip_java_comments_and_quotes
{
s!( (?: \" [^\"\\]* (?: \\. [^\"\\]* )* \" )
| (?: \' [^\'\\]* (?: \\. [^\'\\]* )* \' )
| (?: \/\/ [^\n] *)
| (?: \/\* .*? \*\/)
)
!
my $x = $1;
my $first = substr($x, 0, 1);
if ($first eq '/')
{
# Replace comment with equal number of newlines to keep line count consistent
"\n" x ($x =~ tr/\n//);
}
else
{
# Replace quoted string with equal number of newlines to keep line count consistent
$first . ("\n" x ($x =~ tr/\n//)) . $first;
}
!esxg;
}
I'll have a go at converting it to Java:
Pattern re = Pattern.compile(
"( (?: \" [^\"\\\\]* (?: \\\\. [^\"\\\\]* )* \" )" +
"| (?: ' [^'\\\\]* (?: \\\\. [^'\\\\]* )* ' )" +
"| (?: // [^\\n] *)" +
"| (?: /\\* .*? \\*/)" +
")", Pattern.DOTALL | Pattern.COMMENTS);
Matcher m = re.matcher(entireSourceFile);
StringBuffer replacement = new StringBuffer();
while (m.find())
{
String match = m.group(1);
String first = match.substring(0, 1);
m.appendReplacement(replacement, ""); // Beware of $n in replacement string!!
if (first.equals("/"))
{
// Replace comment with equal number of newlines to keep line count consistent
replacement.append( match.replaceAll("[^\\n]", ""));
}
else
{
// Replace quoted string with equal number of newlines to keep line count consistent
// Although Java quoted strings aren't legally allowed newlines in them
replacement.append(first).append(match.replaceAll("[^\\n]", "")).append(first);
}
}
m.appendTail(replacement);
Something like that!

split string for returning only the latter part

I have a string like this:
abc:def,ghi,jkl;mno:pqr,stu;vwx:yza,aaa,bbb;
I want to split first on ; and then on :
Finally the output should be only the latter part around : i.e. my output should be
def, ghi, jkl, pqr, stu, yza,aaa,bbb
This can be done by calling split twice, i.e. once with ; and then with :, and then pattern matching to find just the right part next to the :. However, is there a better, more optimized solution to achieve this?
So basically you want to fetch the content between ; and :, with : on the left and ; on the right.
You can use this regex: -
"(?<=:)(.*?)(?=;)"
This contains a look-behind for : and a look-ahead for ;. And matches the string preceded by a colon(:) and followed by a semi-colon (;).
Regex Explanation: -
(?<= // Look behind assertion.
: // Check for preceding colon (:)
)
( // Capture group 1
. // Any character except newline
* // repeated 0 or more times
? // Reluctant matching. (Match shortest possible string)
)
(?= // Look ahead assertion
; // Check for string followed by `semi-colon (;)`
)
Here's the working code: -
String str = "abc:def,ghi,jkl;mno:pqr,stu;vwx:yza,aaa,bbb;";
Matcher matcher = Pattern.compile("(?<=:)(.*?)(?=;)").matcher(str);
StringBuilder builder = new StringBuilder();
while (matcher.find()) {
builder.append(matcher.group(1)).append(", ");
}
System.out.println(builder.substring(0, builder.lastIndexOf(",")));
OUTPUT: -
def,ghi,jkl, pqr,stu, yza,aaa,bbb
String[] tabS="abc:def,ghi,jkl;mno:pqr,stu;vwx:yza,aaa,bbb;".split(";");
StringBuilder sb = new StringBuilder();
Pattern patt = Pattern.compile("(.*:)(.*)");
String sep = ",";
for (String string : tabS) {
sb.append(patt.matcher(string).replaceAll("$2 ")); // ' ' after $2 == ';' replaced
sb.append(sep);
}
System.out.println(sb.substring(0,sb.lastIndexOf(sep)));
output
def,ghi,jkl ,pqr,stu ,yza,aaa,bbb
Don't pattern match unless you have to in Java; if you can't have the ':' character in the field name (abc in your example), you can use indexOf(":") to figure out the "right part".
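A minimal sketch of the indexOf approach, assuming the field name never contains ':' and that records without a colon should be skipped. The class and method names are hypothetical.

```java
import java.util.StringJoiner;

final class RightPartExtractor {

    // Splits on ';' and keeps only what follows the first ':' of each record.
    static String rightParts(String input) {
        StringJoiner joined = new StringJoiner(", ");
        for (String record : input.split(";")) {
            int colon = record.indexOf(':');
            if (colon >= 0) {
                joined.add(record.substring(colon + 1));
            }
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        System.out.println(rightParts("abc:def,ghi,jkl;mno:pqr,stu;vwx:yza,aaa,bbb;"));
    }
}
```

String.split drops trailing empty strings, so the final ; does not produce an extra record.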

Create a string-capable Guava Splitter

I would like to create a Guava Splitter for Java that can handle Java strings as one block. For instance, I would like the following assertion to be true:
@Test
public void testSplitter() {
String toSplit = "a,b,\"c,d\\\"\",e";
List<String> expected = ImmutableList.of("a", "b", "c,d\"","e");
Splitter splitter = Splitter.onPattern(...);
List<String> actual = ImmutableList.copyOf(splitter.split(toSplit));
assertEquals(expected, actual);
}
I can write the regex to find all the elements and don't consider the ',' but I can't find the regex that would act as a separator to be used with a Splitter.
If it's impossible, please just say so, then I'll build the list from the findAll regex.
This seems like something you should use a CSV library such as opencsv for. Separating values and handling cases like quoted blocks are what they're all about.
This is a Guava feature request: http://code.google.com/p/guava-libraries/issues/detail?id=412
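If pulling in a CSV library is not an option, the "find all the elements" fallback mentioned in the question can be sketched with java.util.regex alone. This is an illustration under assumptions: the names are hypothetical, empty fields are silently dropped, and quotes are only recognised when they start a field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuotedFieldFinder {

    // Either a double-quoted field (backslash escapes allowed) or a run of non-comma characters.
    private static final Pattern FIELD =
            Pattern.compile("\"((?:\\\\.|[^\"\\\\])*)\"|([^,]+)");

    static List<String> fields(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                // Quoted field: the quotes are already excluded; drop the escaping backslashes.
                result.add(m.group(1).replaceAll("\\\\(.)", "$1"));
            } else {
                result.add(m.group(2));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fields("a,b,\"c,d\\\"\",e"));
    }
}
```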
I have the same problem (except that I don't need to support escaping of the quote character). I didn't want to include another library for such a simple thing, and then I came up with the idea of using a mutable CharMatcher. As with Bart Kiers' solution, it keeps the quote characters.
public static Splitter quotableComma() {
return on(new CharMatcher() {
private boolean inQuotes = false;
@Override
public boolean matches(char c) {
if ('"' == c) {
inQuotes = !inQuotes;
}
if (inQuotes) {
return false;
}
return (',' == c);
}
});
}
@Test
public void testQuotableComma() throws Exception {
String toSplit = "a,b,\"c,d\",e";
List<String> expected = ImmutableList.of("a", "b", "\"c,d\"", "e");
Splitter splitter = Splitters.quotableComma();
List<String> actual = ImmutableList.copyOf(splitter.split(toSplit));
assertEquals(expected, actual);
}
You could split on the following pattern:
\s*,\s*(?=((\\["\\]|[^"\\])*"(\\["\\]|[^"\\])*")*(\\["\\]|[^"\\])*$)
which might look (a bit) friendlier with the (?x) flag:
(?x) # enable comments, ignore space-literals
\s*,\s* # match a comma optionally surrounded by space-chars
(?= # start positive look ahead
( # start group 1
( # start group 2
\\["\\] # match an escaped quote or backslash
| # OR
[^"\\] # match any char other than a quote or backslash
)* # end group 2, and repeat it zero or more times
" # match a quote
( # start group 3
\\["\\] # match an escaped quote or backslash
| # OR
[^"\\] # match any char other than a quote or backslash
)* # end group 3, and repeat it zero or more times
" # match a quote
)* # end group 1, and repeat it zero or more times
( # open group 4
\\["\\] # match an escaped quote or backslash
| # OR
[^"\\] # match any char other than a quote or backslash
)* # end group 4, and repeat it zero or more times
$ # match the end-of-input
) # end positive look ahead
But even in this commented-version, it still is a monster. In plain English, this regex could be explained as follows:
Match a comma that is optionally surrounded by space-chars, only when looking ahead of that comma (all the way to the end of the string!), there are zero or an even number of quotes while ignoring escaped quotes or escaped backslashes.
So, after seeing this, you might agree with ColinD (I do!) that using some sort of a CSV parser is the way to go in this case.
Note that the regex above will leave the quotes around the tokens, i.e., the string a,b,"c,d\"",e (as a literal: "a,b,\"c,d\\\"\",e") will be split as follows:
a
b
"c,d\""
e
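A minimal sketch of that pattern in plain Java, using Pattern.split rather than Guava so it runs without extra dependencies. The class and method names are hypothetical. Presumably the same pattern string could be passed to Splitter.onPattern, but that is not verified here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class LookaheadCommaSplit {

    // The regex from the answer, escaped for a Java string literal.
    private static final Pattern COMMA_OUTSIDE_QUOTES = Pattern.compile(
            "\\s*,\\s*(?=((\\\\[\"\\\\]|[^\"\\\\])*\"(\\\\[\"\\\\]|[^\"\\\\])*\")*"
            + "(\\\\[\"\\\\]|[^\"\\\\])*$)");

    // Splits on commas that are followed by an even number of unescaped quotes.
    static List<String> split(String input) {
        return Arrays.asList(COMMA_OUTSIDE_QUOTES.split(input));
    }

    public static void main(String[] args) {
        for (String token : split("a,b,\"c,d\\\"\",e")) {
            System.out.println(token);
        }
    }
}
```

Every candidate comma triggers a lookahead to the end of the input, so the cost grows quickly on long lines.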
Improving on @Rage-Steel's answer a bit.
final static CharMatcher notQuoted = new CharMatcher() {
private boolean inQuotes = false;
@Override
public boolean matches(char c) {
if ('"' == c) {
inQuotes = !inQuotes;
}
return !inQuotes;
}
};
final static Splitter SPLITTER = Splitter.on(notQuoted.and(CharMatcher.anyOf(" ,;|"))).trimResults().omitEmptyStrings();
And then,
public static void main(String[] args) {
final String toSplit = "a=b c=d,kuku=\"e=f|g=h something=other\"";
List<String> sputnik = SPLITTER.splitToList(toSplit);
for (String s : sputnik)
System.out.println(s);
}
Pay attention to thread safety (or, to simplify - there isn't any)
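One way around the thread-safety problem is to keep the quote state in a local variable instead of inside a shared matcher. The following is a plain-Java sketch of that idea, not a Guava API: the names are hypothetical, and it mirrors the separators " ,;|" plus the trimResults/omitEmptyStrings behaviour only approximately.

```java
import java.util.ArrayList;
import java.util.List;

final class StatelessQuoteAwareSplitter {

    private static final String SEPARATORS = " ,;|";

    static List<String> split(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false; // local state, so concurrent calls cannot interfere
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            }
            if (!inQuotes && SEPARATORS.indexOf(c) >= 0) {
                addIfNotEmpty(tokens, current);
            } else {
                current.append(c); // quotes and quoted separators are kept
            }
        }
        addIfNotEmpty(tokens, current);
        return tokens;
    }

    // Trims the pending token, stores it if non-empty, and resets the buffer.
    private static void addIfNotEmpty(List<String> tokens, StringBuilder current) {
        String token = current.toString().trim();
        if (!token.isEmpty()) {
            tokens.add(token);
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        for (String s : split("a=b c=d,kuku=\"e=f|g=h something=other\"")) {
            System.out.println(s);
        }
    }
}
```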
