Regex parse string in java - java

I am using Java. I need to parse the following line using a regex:
<actions>::=<action><action>|X|<game>|alpha
It should give me the tokens <action>, <action>, X and <game>.
What kind of regex will work?
I was trying something like "<[a-zA-Z]>", but that matches only a single letter between the brackets and doesn't take care of X or alpha.

You can try something like this:
String str="<actions>::=<action><action>|X|<game>|alpha";
str=str.split("=")[1];
Pattern pattern = Pattern.compile("<.*?>|\\|.*?\\|");
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
System.out.println(matcher.group());
}

You should have something like this:
String input = "<actions>::=<action><action>|X|<game>|alpha";
Matcher matcher = Pattern.compile("(<[^>]+>)(<[^>]+>)\\|([^|]+)\\|(<[^|]+>)").matcher(input);
while (matcher.find()) {
System.out.println(matcher.group().replaceAll("\\|", ""));
}
You didn't specify whether you want alpha returned; as written, the regex does not return it.
You can include alpha by appending \\|(\\w+) to the end of the regex I wrote.
Without that addition, the code prints:
<action><action>X<game>

From the original pattern it is not clear whether the <> are literally part of the input; I'll assume they are.
String pattern = "<actions>::=<(.*?)><(.+?)>\\|(.+)\\|<(.*?)>\\|alpha";
For the Java code you can use Pattern and Matcher; here is the basic idea:
Pattern p = Pattern.compile(pattern, Pattern.DOTALL|Pattern.MULTILINE);
Matcher m = p.matcher(text);
// only read the groups if the pattern actually matched
if (m.find()) {
for (int g = 1; g <= m.groupCount(); g++) {
// use your four groups here, e.g. m.group(g)
}
}

You can use following Java regex:
Pattern pattern = Pattern.compile
("::=(<[^>]+>)(<[^>]+>)\\|([^|]+)\\|(<[^>]+>)\\|(\\w+)$");
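If the goal is simply to list every token on the right-hand side, a tokenizing pattern may be simpler than one regex with a fixed number of groups. The sketch below assumes the rule always contains "::=" and that tokens are either <...> names or bare words separated by |. The class and method names (GrammarTokens, tokens) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GrammarTokens {
    // Either a <...> name, or a run of characters containing no '|', '<' or '>'.
    private static final Pattern TOKEN = Pattern.compile("<[^>]+>|[^|<>]+");

    static List<String> tokens(String rule) {
        // Keep only the part after "::=" (assumed to be present in the rule).
        String rhs = rule.substring(rule.indexOf("::=") + 3);
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(rhs);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [<action>, <action>, X, <game>, alpha]
        System.out.println(tokens("<actions>::=<action><action>|X|<game>|alpha"));
    }
}
```

If alpha is not wanted, the caller can drop the last element of the returned list.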

Related

Get Substring from a String in Java

I have the following text:
...,Niedersachsen,NOT IN CHARGE SINCE: 03.2009, CATEGORY:...,
Now I want to extract the date after NOT IN CHARGE SINCE: up to the next comma.
So I need only 03.2009 as the result in my substring.
How can I handle that?
String substr = "not in charge since:";
String before = s.substring(0, s.indexOf(substr));
String after = s.substring(s.indexOf(substr),s.lastIndexOf(","));
EDIT
for (String s : split) {
s = s.toLowerCase();
if (s.contains("ex peps")) {
String substr = "not in charge since:";
String before = s.substring(0, s.indexOf(substr));
String after = s.substring(s.indexOf(substr), s.lastIndexOf(","));
System.out.println(before);
System.out.println(after);
System.out.println("PEP!!!");
} else {
System.out.println("Line ok");
}
}
But that is not the result I want.
You can use a Pattern, for example:
String str = "Niedersachsen,NOT IN CHARGE SINCE: 03.2009, CATEGORY";
Pattern p = Pattern.compile("\\d{2}\\.\\d{4}");
Matcher m = p.matcher(str);
if (m.find()) {
System.out.println(m.group());
}
Output
03.2009
Note: if you want to find every such date in your String, use while instead of if.
Edit
Or you can use :
String str = "Niedersachsen,NOT IN CHARGE SINCE: 03.03.2009, CATEGORY";
Pattern p = Pattern.compile("SINCE:(.*?)\\,");
Matcher m = p.matcher(str);
if (m.find()) {
System.out.println(m.group(1).trim());
}
You can use the ':' character to split the String s:
String substr = "NOT IN CHARGE SINCE:";
String before = s.substring(0, s.indexOf(substr)+1);
String after = s.substring(s.indexOf(':')+1, s.lastIndexOf(','));
Of course, regular expressions give you more ways to do searching/matching, but assuming that the ":" is the key thing you are looking for (and it shows up exactly once in that position) then:
s.substring(s.indexOf(':')+1, s.lastIndexOf(',')).trim();
is the "most simple" and "least overhead" way of fetching that substring.
Hint: as you are searching for a single character, use a character as search pattern; not a string!
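The substring approach above relies on ':' and ',' each appearing in just the right place. When the line contains other colons or a trailing comma, as the sample text in the question does, it may be safer to search from the marker itself. The following is a minimal sketch under that assumption; the class and method names (SinceDate, extract) are hypothetical.

```java
class SinceDate {
    // Returns the text between "NOT IN CHARGE SINCE:" and the next comma,
    // or null if the marker is not present.
    static String extract(String line) {
        String marker = "NOT IN CHARGE SINCE:";
        int start = line.indexOf(marker);
        if (start < 0) {
            return null;
        }
        start += marker.length();
        // Look for the first comma AFTER the marker, not the last one in the line.
        int end = line.indexOf(',', start);
        if (end < 0) {
            end = line.length();
        }
        return line.substring(start, end).trim();
    }

    public static void main(String[] args) {
        String s = "...,Niedersachsen,NOT IN CHARGE SINCE: 03.2009, CATEGORY:...,";
        System.out.println(extract(s)); // prints 03.2009
    }
}
```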
If you have a more generic use case and you know the structure of the text to be matched well, you might profit from using regular expressions:
Pattern pattern = Pattern.compile("NOT IN CHARGE SINCE: ([0-9.]*),");
Matcher matcher = pattern.matcher(line);
if (matcher.find()) {
System.out.println(matcher.group(1)); // group 1 is the date, e.g. 03.2009
}
A more generic way to solve your problem is to use a regex that matches every group between a ':' and the next ',':
Pattern pattern = Pattern.compile("(?<=:)(.*?)(?=,)");
Matcher m = pattern.matcher(str);
You have to create a pattern for it. Try this as a simple regex starting point, and feel free to improvise on it:
String s = "...,Niedersachsen,NOT IN CHARGE SINCE: 03.2009, CATEGORY:....,";
Pattern pattern = Pattern.compile(".*NOT IN CHARGE SINCE: ([\\d\\.]*).*");
Matcher matcher = pattern.matcher(s);
if (matcher.find())
{
System.out.println(matcher.group(1));
}
That should get you whatever group of digits you received as date.

How do I get the matched part of a wildcard in a regular expression?

For example:
Pattern pattern = Pattern.compile("a(.*)b");
Matcher matcher = pattern.matcher("a19203b");
matcher.find();
System.out.println(matcher.group());
This prints out the entire string (a19203b). All I need is 19203. How can I get this in Java?
(for example, in a mod_rewrite rule, I would do something like RewriteRule article/(.*) article.php?id=$1 where $1 is the string I need)
Found the solution. Instead of matcher.group(), use matcher.group(1).
Pattern pattern = Pattern.compile("a(.*)b");
Matcher matcher = pattern.matcher("a19203b");
matcher.find();
System.out.println(matcher.group(1));
Use lookbehinds/lookaheads:
Pattern regex = Pattern.compile("(?<=a).*(?=b)");
Don't capture what you don't want to capture. Here your entire match will be what you want.
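To compare the two approaches side by side, here is a small sketch that runs both patterns from the answers above on the same input. The class and method names (GroupVsLookaround, viaGroup, viaLookaround) are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupVsLookaround {
    // Capturing group: the whole match is "a...b", group 1 is the inner part.
    static String viaGroup(String s) {
        Matcher m = Pattern.compile("a(.*)b").matcher(s);
        return m.find() ? m.group(1) : null;
    }

    // Lookarounds: 'a' and 'b' are asserted but not consumed,
    // so the whole match is already the inner part.
    static String viaLookaround(String s) {
        Matcher m = Pattern.compile("(?<=a).*(?=b)").matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(viaGroup("a19203b"));      // prints 19203
        System.out.println(viaLookaround("a19203b")); // prints 19203
    }
}
```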

Java regex to extract text between tags

I have a file with some custom tags and I'd like to write a regular expression to extract the string between the tags. For example if my tag is:
[customtag]String I want to extract[/customtag]
How would I write a regular expression to extract only the string between the tags. This code seems like a step in the right direction:
Pattern p = Pattern.compile("[customtag](.+?)[/customtag]");
Matcher m = p.matcher("[customtag]String I want to extract[/customtag]");
Not sure what to do next. Any ideas? Thanks.
You're on the right track. Now you just need to extract the desired group, as follows:
final Pattern pattern = Pattern.compile("<tag>(.+?)</tag>", Pattern.DOTALL);
final Matcher matcher = pattern.matcher("<tag>String I want to extract</tag>");
matcher.find();
System.out.println(matcher.group(1)); // Prints String I want to extract
If you want to extract multiple hits, try this:
public static void main(String[] args) {
final String str = "<tag>apple</tag><b>hello</b><tag>orange</tag><tag>pear</tag>";
System.out.println(Arrays.toString(getTagValues(str).toArray())); // Prints [apple, orange, pear]
}
private static final Pattern TAG_REGEX = Pattern.compile("<tag>(.+?)</tag>", Pattern.DOTALL);
private static List<String> getTagValues(final String str) {
final List<String> tagValues = new ArrayList<String>();
final Matcher matcher = TAG_REGEX.matcher(str);
while (matcher.find()) {
tagValues.add(matcher.group(1));
}
return tagValues;
}
However, I agree that regular expressions are not the best answer here. I'd use XPath to find elements I'm interested in. See The Java XPath API for more info.
To be quite honest, regular expressions are not the best idea for this type of parsing. The regular expression you posted will probably work well for simple cases, but if things get more complex you are going to have huge problems (for the same reason that you can't reliably parse HTML with regular expressions). I know you probably don't want to hear this, and I didn't either when I asked the same kind of question, but string parsing became WAY more reliable for me after I stopped trying to use regular expressions for everything.
jTopas is an AWESOME tokenizer that makes it quite easy to write parsers by hand (I STRONGLY suggest jTopas over the standard Java Scanner and similar libraries). If you want to see jTopas in action, here are some parsers I wrote using jTopas to parse this type of file
If you are parsing XML files, you should be using an XML parser library. Don't do it yourself unless you are just doing it for fun; there are plenty of proven options out there.
A generic, simpler and somewhat primitive approach to find the tag, attribute and value:
Pattern pattern = Pattern.compile("<(\\w+)( +.+)*>((.*))</\\1>");
System.out.println(pattern.matcher("<asd> TEST</asd>").find());
System.out.println(pattern.matcher("<asd TEST</asd>").find());
System.out.println(pattern.matcher("<asd attr='3'> TEST</asd>").find());
System.out.println(pattern.matcher("<asd> <x>TEST<x>asd>").find());
System.out.println("-------");
Matcher matcher = pattern.matcher("<as x> TEST</as>");
if (matcher.find()) {
for (int i = 0; i <= matcher.groupCount(); i++) {
System.out.println(i + ":" + matcher.group(i));
}
}
String s = "<B><G>Test</G></B><C>Test1</C>";
String pattern ="\\<(.+)\\>([^\\<\\>]+)\\<\\/\\1\\>";
int count = 0;
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(s);
while(m.find())
{
System.out.println(m.group(2));
count++;
}
Try this:
Pattern p = Pattern.compile("(?<=\\<(any_tag)\\>)(\\s*.*?\\s*)(?=\\<\\/(any_tag)\\>)");
Matcher m = p.matcher(anyString);
For example:
String str = "<TR> <TD>1Q Ene</TD> <TD>3.08%</TD> </TR>";
Pattern p = Pattern.compile("(?<=\\<TD\\>)(\\s*.*?\\s*)(?=\\<\\/TD\\>)");
Matcher m = p.matcher(str);
while(m.find()){
// Log.e is Android's logger; the lazy .*? stops at the nearest </TD>
Log.e("Regex"," Regex result: " + m.group());
}
Output:
1Q Ene
3.08%
final Pattern pattern = Pattern.compile("tag\\](.+?)\\[/tag");
final Matcher matcher = pattern.matcher("[tag]String I want to extract[/tag]");
matcher.find();
System.out.println(matcher.group(1));
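Note that in the pattern from the question, the unescaped [customtag] is read by the regex engine as a character class rather than as a literal tag, which is why the square brackets have to be escaped. One way to avoid escaping by hand is Pattern.quote, as in the sketch below. The class and method names (BracketTags, extract) are hypothetical, and the sketch assumes tags are not nested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BracketTags {
    // Collects the text between every [tagName]...[/tagName] pair.
    static List<String> extract(String text, String tagName) {
        // Pattern.quote makes the brackets and slash match literally.
        String open = Pattern.quote("[" + tagName + "]");
        String close = Pattern.quote("[/" + tagName + "]");
        Pattern p = Pattern.compile(open + "(.+?)" + close, Pattern.DOTALL);
        List<String> values = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        // prints [String I want to extract]
        System.out.println(extract("[customtag]String I want to extract[/customtag]", "customtag"));
    }
}
```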
I prefix this reply with "you shouldn't use a regular expression to parse XML -- it's only going to result in edge cases that don't work right, and a forever-increasing-in-complexity regex while you try to fix it."
That being said, you need to proceed by matching the string and grabbing the group you want:
if (m.matches())
{
String result = m.group(1);
// do something with result
}
This works for me; use it in your main method below the Scanner input. It also works for the HackerRank "Tag Content Extractor" problem.
boolean matchFound = false;
Pattern r = Pattern.compile("<(.+)>([^<]+)</\\1>");
Matcher m = r.matcher(line);
while (m.find()) {
System.out.println(m.group(2));
matchFound = true;
}
if ( ! matchFound) {
System.out.println("None");
}
testCases--;

Parsing an expression containing repeated groups using Java regexp

I'm trying to parse from a string like below
"name1(value1),name2(value2),name3(value3),name4(value4),........" and so it goes
How can I do it recursively with groups?
String s = "name1(value1),name2(value2),name3(value3),name4(value4),";
Pattern p = Pattern.compile(".*?\\((.*?)\\)");
Matcher m = p.matcher(s);
while(m.find()){
System.out.println(m.group(1));
}
I would rather use Java String operations to get at the values, but if you want to use a regex, you could use something like this:
[^\(]*\([^\)]*\),
It should be quite stable.
You can test it here:
http://regexr.com?2u7u3
You can use matcher.find, try something like this:
String input = "name1(value1),name2(value2),name3(value3),name4(value4)";
Matcher matcher = Pattern.compile(".*?[(].*?[)]").matcher(input);
while(matcher.find()) {
System.out.println(matcher.group(0));
}
or just use String.split like this:
String input = "name1(value1),name2(value2),name3(value3),name4(value4)";
String[] split = input.split(",");
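Java regexes do not return every repetition of a repeated group, so the usual approach is a find() loop with one group for the name and one for the value. Here is a minimal sketch, assuming names consist of word characters and values contain no ')'. The class and method names (NameValuePairs, parse) are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameValuePairs {
    // Group 1: the name; group 2: everything inside the parentheses.
    private static final Pattern PAIR = Pattern.compile("(\\w+)\\(([^)]*)\\)");

    static Map<String, String> parse(String input) {
        // LinkedHashMap keeps the pairs in the order they appear.
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            pairs.put(m.group(1), m.group(2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // prints {name1=value1, name2=value2, name3=value3, name4=value4}
        System.out.println(parse("name1(value1),name2(value2),name3(value3),name4(value4)"));
    }
}
```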

Java Regular Expressions

I am trying to write something like this:
Pattern p = Pattern.compile("Mar\\w");
Matcher m = p.matcher("Mary");
String result = m.replaceAll("\\w");
The result would ideally be "y". Any ideas?
Your question is not so clear, but I think you want to use a lookahead:
Pattern p = Pattern.compile("Mar(?=\\w)");
Matcher m = p.matcher("Mary");
String result = m.replaceAll("");
See it online: ideone
Or you could use a capturing group:
Pattern p = Pattern.compile("Mar(\\w)");
Matcher m = p.matcher("Mary");
String result = m.replaceAll("$1");
See it online: ideone
