Regular expression to extract text containing pipe characters in Java

I have a string and need a regular expression to extract substrings from it.
Example: this is a|b|c|d whatever e|f|g|h
Expected result: a|b|c|d, e|f|g|h
However, the Java code I wrote does not produce that:
Pattern ptyy = Pattern.compile("\\|*.+? ");
Matcher matcher_values = ptyy.matcher("this is a|b|c|d whatever e|f|g|h");
while (matcher_values.find()) {
    String line = matcher_values.group(0);
    System.out.println(line);
}
Result:
this
is
a|b|c|d
whatever
This is not what I hoped for. Any advice?

I think this regex is enough: (.\|)+.
The (.\|)+ part matches each character followed by a pipe (a|b|c|), and the final . matches the last character of the substring.
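To see what (.\|)+. does in practice, here is a minimal sketch (the class and method names are made up for illustration). It suggests the pattern is fine when every token is a single character, but that it cuts longer tokens apart, because each . stands for exactly one character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SingleCharPipeDemo {
    // One character plus a pipe, repeated, then one final character.
    private static final Pattern SINGLE_CHAR_TOKENS = Pattern.compile("(.\\|)+.");

    // Collects every match of the pattern in the input, in order.
    static List<String> findAll(String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = SINGLE_CHAR_TOKENS.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Single-character tokens are found as expected.
        System.out.println(findAll("this is a|b|c|d whatever e|f|g|h")); // [a|b|c|d, e|f|g|h]
        // Longer tokens are split into fragments.
        System.out.println(findAll("Az|09|23|A3"));                      // [z|0, 9|2, 3|A]
    }
}
```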

Your \|*.+? pattern (note the trailing space in your code) matches 0 or more pipes, then 1 or more chars other than a newline, as few as possible, up to the first space. Thus, it matches almost every non-whitespace chunk that is followed by a space, and it misses e|f|g|h because no space follows it.
If a, b and c are just placeholders and there can be any non-whitespace chars, I'd suggest:
[^\s|]+(?:\|[^\s|]+)+
Details:
[^\s|]+ - 1 or more chars other than whitespace and |
(?:\|[^\s|]+)+ - 1 or more sequences of:
    \| - a literal |
    [^\s|]+ - 1 or more chars other than whitespace and |
Java demo:
Pattern ptyy = Pattern.compile("[^\\s|]+(?:\\|[^\\s|]+)+");
Matcher matcher_values = ptyy.matcher("this is a|b|c|d whatever e|f|g|h");
while (matcher_values.find()) {
    String line = matcher_values.group(0);
    System.out.println(line);
}

Based on your advice, I ended up with a regular expression that handles different combinations of pipe-separated values.
Pattern ptyy = Pattern.compile("[^\\s|]+(?:\\|[^\\s|]+)+");
Matcher matcher_values = ptyy.matcher("this is a|b|c|d whater e|f|g|h and Az|09|23|A3 and 22|1212|12121|55555");
while (matcher_values.find()) {
    String line = matcher_values.group(0);
    System.out.println(line);
}
This gives me the result:
a|b|c|d
e|f|g|h
Az|09|23|A3
22|1212|12121|55555
Thanks everyone!
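For reuse, the same idea can be wrapped in a small helper. This is only a sketch: the class name PipeTokens and the method extract are invented here, and it assumes each pipe-separated token may be longer than one character, so the character class after the pipe carries its own + quantifier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PipeTokens {
    // A run of non-space, non-pipe chars followed by one or more "|token" parts.
    private static final Pattern PIPE_TOKEN =
            Pattern.compile("[^\\s|]+(?:\\|[^\\s|]+)+");

    // Returns every pipe-separated chunk found in the input, in order.
    static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = PIPE_TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "this is a|b|c|d whater e|f|g|h and Az|09|23|A3 and 22|1212|12121|55555";
        // Prints [a|b|c|d, e|f|g|h, Az|09|23|A3, 22|1212|12121|55555]
        System.out.println(extract(input));
    }
}
```

Words without a pipe (this, is, and) are skipped because the (?:...)+ group requires at least one pipe.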

Related

Java non-greedy (?) regex to match string

String poolId = "something/something-else/pools[name='test'][scope='lan1']";
String statId = "something/something-else/pools[name='test'][scope='lan1']/stats[base-string='10.10.10.10']";
Pattern pattern = Pattern.compile(".+pools\\[name='.+'\\]\\[scope='.+'\\]$");
What regular expression should be used so that
pattern.matcher(poolId).matches()
returns true whereas
pattern.matcher(statId).matches()
returns false?
Note that:
something/something-else is irrelevant and can be of any length
Both name and scope can contain ANY character, including \, /, [, ] etc.
stats[base-string='10.10.10.10'] is just an example; anything else can follow the /
I tried the non-greedy ? like so: .+pools\\[name='.+'\\]\\[scope='.+?'\\]$ but both matches still return true.

You can use
.+pools\[name='[^']*'\]\[scope='[^']*'\]$
Details:
.+ - one or more chars other than line break chars, as many as possible
pools\[name=' - a pools[name=' string
[^']* - zero or more chars other than a '
'\]\[scope=' - a '][scope=' string
[^']* - zero or more chars other than a '
'\] - a '] substring
$ - end of string.
In Java:
Pattern pattern = Pattern.compile(".+pools\\[name='[^']*']\\[scope='[^']*']$");
Java demo:
//String s = "something/something-else/pools[name='test'][scope='lan1']"; // => Matched!
String s = "something/something-else/pools[name='test'][scope='lan1']/stats[base-string='10.10.10.10']";
Pattern pattern = Pattern.compile(".+pools\\[name='[^']*']\\[scope='[^']*']$");
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
    System.out.println("Matched!");
} else {
    System.out.println("Not Matched!");
}
// => Not Matched!

Wiktor assumed that your values for name and scope cannot contain single quotes. Thus the following:
.../pools[name='tes't']
would not match. This is really the only workable assumption: if unescaped single quotes were allowed, nothing would stop the value of scope from being (for example) the literal value lan1']/stats[base-string='10.10.10.10. The regex in your question has exactly this problem. If you must allow single quotes in these values, they need to be escaped somehow, for example with a backslash. Try the following (an edit of Wiktor's regex), which also accepts \' inside a value:
.+pools\[name='([^']|\\')*'\]\[scope='([^']|\\')*'\]$
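As a rough check of that idea in Java, here is a sketch that assumes quotes inside a value are escaped as \' (that convention is an assumption, as are the names PoolIdMatcher and isPoolId). It uses non-capturing groups, since the captured text is not needed, and matches() as in the question.

```java
import java.util.regex.Pattern;

class PoolIdMatcher {
    // Assumption: a single quote inside a value must be written as \'.
    // In the Java literal, \\\\' becomes the regex \\' (a backslash, then a quote).
    private static final Pattern POOL_ID = Pattern.compile(
            ".+pools\\[name='(?:[^']|\\\\')*'\\]\\[scope='(?:[^']|\\\\')*'\\]$");

    // True only when the whole id ends with the pools[name=...][scope=...] part.
    static boolean isPoolId(String id) {
        return POOL_ID.matcher(id).matches();
    }

    public static void main(String[] args) {
        String poolId = "something/something-else/pools[name='test'][scope='lan1']";
        String statId = poolId + "/stats[base-string='10.10.10.10']";
        String escaped = "x/pools[name='tes\\'t'][scope='lan1']";   // value is tes\'t
        String unescaped = "x/pools[name='tes't'][scope='lan1']";   // bare quote in value

        System.out.println(isPoolId(poolId));    // true
        System.out.println(isPoolId(statId));    // false
        System.out.println(isPoolId(escaped));   // true
        System.out.println(isPoolId(unescaped)); // false
    }
}
```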

Java regex matches but String.replaceAll() doesn't replace matching substrings

public class test {
    public static void main(String[] args) {
        String test1 = "N&oslash;rrebro, Denmark";
        String test2 = "&oslash;";
        String regex = new String("^&\\S*;$");
        String value = test1.replaceAll(regex,"");
        System.out.println(test2.matches(regex));
        System.out.println(value);
    }
}
This gives me the following output:
true
N&oslash;rrebro, Denmark
How is that possible? Why does replaceAll() not register a match?

Your regex includes ^, which anchors the match to the very start of the string.
If you try
test1.matches(regex)
you will get false.
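A small sketch of the difference, assuming the input contains the HTML entity &oslash; (the class and helper names are invented for the example). The anchored pattern only matches when the entity is the whole string, so replaceAll() leaves the longer string untouched, while the unanchored pattern finds the entity inside it.

```java
class AnchorDemo {
    // Removes every match of the given regex from the text.
    static String strip(String text, String regex) {
        return text.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String text = "N&oslash;rrebro, Denmark";
        String entity = "&oslash;";
        String anchored = "^&\\S*;$";   // must span the entire string
        String unanchored = "&\\S*;";   // may match anywhere in the string

        System.out.println(entity.matches(anchored)); // true
        System.out.println(text.matches(anchored));   // false
        System.out.println(strip(text, anchored));    // N&oslash;rrebro, Denmark
        System.out.println(strip(text, unanchored));  // Nrrebro, Denmark
    }
}
```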

You need to understand what ^ and $ mean.
You probably put them in there because you want to say:
At the start of each match, I want a &, then 0 or more non-whitespace characters, then a ; at the end of the match.
However, ^ and $ don't mean the start and end of each match. They mean the start and end of the string.
So you should remove the ^ and $ from your regex:
String regex = "&\\S*;";
Now it outputs:
true
Nrrebro, Denmark
"What character specifies the start and end of the match then?" you might ask. Well, since your regex is basically the pattern you are matching, the start of the regex is the start of the match (unless you have lookbehinds)!

It is possible because the ^&\S*;$ pattern matches the entire &oslash; string but does not match the entire N&oslash;rrebro, Denmark string. The ^ requires the start of the string to be right before & and the $ requires the ; to appear right at the end of the string.
Just removing the ^ and $ anchors may not work, because \S* is a greedy pattern and may overmatch, e.g. in N&oslash;rrebro; where it would run on to the last ;.
You may use the &\w+; or &\S+?; pattern, e.g.:
String test1 = "N&oslash;rrebro, Denmark";
String regex = "&\\w+;";
String value = test1.replaceAll(regex,"");
System.out.println(value); // => Nrrebro, Denmark
The &\w+; pattern matches a &, then 1 or more word chars, and then ;, anywhere inside the string. \S+? matches 1 or more chars other than whitespace, as few as possible.
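To make the overmatching concrete, here is a sketch with a second semicolon in the input (the sample string and the names are made up for illustration). The greedy \S* swallows everything up to the last ; in the run of non-whitespace characters, while \w+ and the lazy \S+? stop at the first one.

```java
class GreedyEntityDemo {
    // Removes every match of the given regex from the text.
    static String strip(String text, String regex) {
        return text.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        // Hypothetical input: the entity is followed by another ';' before the space.
        String text = "N&oslash;rrebro; Denmark";

        System.out.println(strip(text, "&\\S*;"));  // N Denmark        (greedy, overmatches)
        System.out.println(strip(text, "&\\w+;"));  // Nrrebro; Denmark (word chars only)
        System.out.println(strip(text, "&\\S+?;")); // Nrrebro; Denmark (lazy)
    }
}
```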

You can use this regex: &(.*?);
String test1 = "N&oslash;rrebro, Denmark";
String test2 = "&oslash;";
String regex = "&(.*?);";
String value = test1.replaceAll(regex,"");
System.out.println(test2.matches(regex));
System.out.println(value);
Output:
true
Nrrebro, Denmark

find substring using match regex

How can I find a substring in another string using a regex? Here are two strings:
String a= "?drug <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/possibleDiseaseTarget> ?disease .";
String b = "?drug <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/molecularWeightAverage> ?weight . ?drug <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/possibleDiseaseTarget> ?disease";
I want to match only
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/possibleDiseaseTarget>

Since this is not quite HTML, no XML/HTML parser can help here, so you can try a regex. It seems that you want to find text of the form
?drug <someData> ?disease
To describe such text with a regex you need to escape ? (it is a regex special character, the optional quantifier meaning zero or one occurrence), so you need to place \ before it (which in a Java String literal is written as "\\").
The <someData> part can be written as <[^>]+> which means:
<,
one or more non-> characters after it,
and finally >
So a regex to match ?drug <someData> ?disease can be written as
"\\?drug <[^>]+> \\?disease"
But since we are interested only in the <[^>]+> part representing <someData>, we need the regex to capture the matched content in a group. If we surround part of a regex with parentheses, the string matched by that part is stored in a numbered group, from which we can retrieve it later. The final regex can look like
"\\?drug (<[^>]+>) \\?disease"
         ^^^^^^^^^---first group,
and can be used like
String a = "?drug <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/possibleDiseaseTarget> ?disease .";
String b = "?drug <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/molecularWeightAverage> ?weight . ?drug <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/possibleDiseaseTarget> ?disease";
Pattern p = Pattern.compile("\\?drug (<[^>]+>) \\?disease");
Matcher m = p.matcher(a);
while (m.find()) {
    System.out.println(m.group(1));
}
System.out.println("-----------");
m = p.matcher(b);
while (m.find()) {
    System.out.println(m.group(1));
}
which produces the output
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/possibleDiseaseTarget>
-----------
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/possibleDiseaseTarget>
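If escaping the ? characters by hand feels error-prone, Pattern.quote() can build the literal parts instead. The sketch below is one possible variation on the regex above, not a replacement for it; the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DiseaseTargetFinder {
    // Pattern.quote() turns its argument into a literal, so "?" needs no manual escaping.
    private static final Pattern TRIPLE = Pattern.compile(
            Pattern.quote("?drug ") + "(<[^>]+>)" + Pattern.quote(" ?disease"));

    // Returns the <...> part of every "?drug <...> ?disease" occurrence.
    static List<String> findTargets(String query) {
        List<String> targets = new ArrayList<>();
        Matcher m = TRIPLE.matcher(query);
        while (m.find()) {
            targets.add(m.group(1)); // group 1 is the bracketed URL
        }
        return targets;
    }

    public static void main(String[] args) {
        String b = "?drug <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/molecularWeightAverage> ?weight . "
                + "?drug <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/possibleDiseaseTarget> ?disease";
        // Only the possibleDiseaseTarget URL is printed, because only it is followed by ?disease.
        System.out.println(findTargets(b));
    }
}
```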

There's no need for a regex here if you only want to check whether the substring is present:
String substr = "<http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/possibleDiseaseTarget>";
System.out.println(b.contains(substr)); // prints true
System.out.println(a.contains(substr)); // prints true

Splitting strings delimited by [[ ]] in java?

I have an input string of the form "[[Animal rights]] [[Anthropocentrism]] [[Anthropology]]" and I need to extract the tokens "Animal rights", "Anthropocentrism" and so on.
I tried using the split method of the String class but I am not able to find the appropriate regular expression to get the tokens. It would be great if someone could help.
I am basically trying to parse the internal links in a Wikipedia XML file.

You probably shouldn't be using split() here but instead a Matcher:
String input = "[[Animal rights]] [[Anthropocentrism]] [[Anthropology]]";
Matcher m = Pattern.compile("\\[\\[(.*?)\\]\\]").matcher(input);
while (m.find()) {
    System.out.println(m.group(1));
}
Output:
Animal rights
Anthropocentrism
Anthropology

A pattern like this should work:
\[\[(.*?)\]\]
This will match a literal [[ followed by zero or more of any character, non-greedily, captured in group 1, followed by a literal ]].
Don't forget to escape the \ in the Java string literal:
Pattern.compile("\\[\\[(.*?)\\]\\]");

It's pretty easy with regex.
\[\[(.+?)\]\]
I recommend using .+ to make sure there is actually something inside the brackets, so you don't store an empty string in your array.
String[] output = new String[10];
String pattern = "\\[\\[(.+?)\\]\\]";
String input = "[[Animal rights]] [[Anthropocentrism]] [[Anthropology]]";
Matcher m = Pattern.compile(pattern).matcher(input);
int increment = 0;
while (m.find()) {
    output[increment] = m.group(1);
    increment++;
}
Since you said you also wanted to learn regex, I'll break it down.
\[ twice matches the two [ brackets; you need the \ because [ is a regex special character
. matches any character except a newline
+ means one or more of the preceding item
? after + makes the quantifier lazy, so the engine first matches the preceding item only once and then extends the match one character at a time until the rest of the pattern matches
\] matches a literal ]
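If the number of links is not known in advance, a List avoids the fixed-size array. This is a minimal sketch along the same lines; the names WikiLinks and extract are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WikiLinks {
    // [[ then one or more chars (lazy, captured as group 1), then ]]
    private static final Pattern LINK = Pattern.compile("\\[\\[(.+?)\\]\\]");

    // Returns the text between each [[ and the nearest following ]].
    static List<String> extract(String input) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(input);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String input = "[[Animal rights]] [[Anthropocentrism]] [[Anthropology]]";
        System.out.println(extract(input)); // [Animal rights, Anthropocentrism, Anthropology]
    }
}
```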

Try the following:
String str = "[[Animal rights]] [[Anthropocentrism]] [[Anthropology]]";
str = str.replaceAll("(^\\[\\[|\\]\\]$)", "");
String[] array = str.split("\\]\\] \\[\\[");
System.out.println(Arrays.toString(array));
// prints "[Animal rights, Anthropocentrism, Anthropology]"

How to create a java regular expression pattern that would match a string only at certain positon?

I would like to create a regular expression pattern that matches only if the pattern string is not followed by anything else in the input string. Here is what I tried:
Pattern p = Pattern.compile("google.com");//I want to know the right format
String input1 = "mail.google.com";
String input2 = "mail.google.com.co.uk";
Matcher m1 = p.matcher(input1);
Matcher m2 = p.matcher(input2);
boolean found1 = m1.find();
boolean found2 = m2.find();//This should be false because "google.com" is followed by ".co.uk" in input2 string
Any help would be appreciated!

Your pattern should be google\.com$. By default, $ matches at the end of the input (or just before a final line terminator); in MULTILINE mode it matches at the end of each line. Read about regex boundary matchers for details.

Here is how to match and get the non-matching part as well.
Here is the raw regex pattern:
^(.*)google\.com$
^ - matches the beginning of the string
(.*) - captures everything before google.com in a group
google - matches the literal google
\. - matches a literal . (which has to be escaped with \)
com - matches the literal com
$ - matches the end of the string
Note: In a Java String literal the \ has to be escaped as well: ^(.*)google\\.com$
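Here is a short sketch of how that capture group could be read back in Java. The class and method names are invented, and returning an Optional is just one way to signal "no match".

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GooglePrefix {
    private static final Pattern ENDS_WITH_GOOGLE = Pattern.compile("^(.*)google\\.com$");

    // Returns whatever precedes a trailing "google.com", or empty if the host does not end with it.
    static Optional<String> prefixOf(String host) {
        Matcher m = ENDS_WITH_GOOGLE.matcher(host);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(prefixOf("mail.google.com"));       // Optional[mail.]
        System.out.println(prefixOf("mail.google.com.co.uk")); // Optional.empty
    }
}
```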

You should use google\.com$. The $ character matches the end of the input.
Pattern p = Pattern.compile("google\\.com$");
String input2 = "mail.google.com.co.uk";
Matcher m2 = p.matcher(input2);
boolean found2 = m2.find();
System.out.println(found2);
Output: false

Pattern p = Pattern.compile("google\\.com$");
The dollar sign means the pattern has to occur at the end of the line/string being tested. Note too that the unescaped dot in your original pattern matches any character, so to match only a literal dot you need to escape it.
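Both points can be checked with a few inputs. The sketch below uses a made-up helper name and a made-up host (mail.googleXcom) to show the unescaped dot matching a non-dot character. For a fixed suffix like this one, String.endsWith("google.com") would also do the job without any regex.

```java
import java.util.regex.Pattern;

class EndAnchorDemo {
    // True if the regex is found anywhere in the input (anchors still apply).
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String escaped = "google\\.com$";  // literal dot, anchored at the end
        String unescaped = "google.com$";  // dot matches any character

        System.out.println(found(escaped, "mail.google.com"));       // true
        System.out.println(found(escaped, "mail.google.com.co.uk")); // false
        System.out.println(found(unescaped, "mail.googleXcom"));     // true
        System.out.println(found(escaped, "mail.googleXcom"));       // false
    }
}
```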
