Regex extract string in java

Regex extract string in java - java

I'm trying to extract a string from a String in Regex Java
Pattern pattern = Pattern.compile("((.|\\n)*).{4}InsurerId>\\S*.{5}InsurerId>((.|\\n)*)");
Matcher matcher = pattern.matcher(abc);
I'm trying to extract the value between
<_1:InsurerId>F2021633_V1</_1:InsurerId>
I'm not sure where am I going wrong but I don't get output for
if (matcher.find())
{
System.out.println(matcher.group(1));
}

You can use:
Pattern pattern = Pattern.compile("<([^:]+:InsurerId)>([^<]*)</\\1>");
Matcher matcher = pattern.matcher(abc);
if (matcher.find()) {
System.out.println(matcher.group(2));
}
RegEx Demo

You may want to use the totally awesome page http://regex101.com/ to test your regular expressions. As you can see at https://regex101.com/r/rV8uM3/1, you only have empty capturing groups, but let me explain to you what you did. :D
((.|\n)*) This matches any character, or a new line, unimportant how often. It is capturing, so your first matching group will always be everything before <_1:InsurerId>, or an empty string. You can match any character instead, it will include new lines: .*. You can even leave it away as it isn't actually part of the String you want to match - using anything here will actually be a problem if you have multiple InsurerIds in your file and want to get them all.
.{4}InsurerId> This matches "InsurerId>" with any four characters in front of it and is exactly what you want. As the first character is probably always an opening angle bracket (and you don't want stuff like "<ExampleInsurerId>"), I'd suggest using <.{3}InsurerId> instead. This still could have some problems (<Test id="<" xInsurerId>), so if you know exactly that it's "_<a digit>:", why not use <_\d:InsurerId>?
\S* matches everything except for whitespaces - probably not the best idea as XML and similar files can be written to not contain any space at all. You want to have everything to the next tag, so use [^<]* - this matches everything except for an opening angle bracket. You also want to get this value later, so you have to use a capturing group: ([^<]*)
.{5}InsurerId> The same thing here: use <\/.{3}InsurerId> or <\/_\d:InsurerId> (forward slashes are actually characters interpreted by other RegEx implementations, so I suggest escaping them)
((.|\n)*) Again the same thing, just leave it away
The resulting Regular Expression would then be the following:
<_\d:InsurerId>([^<]*)<\/_\d:InsurerId>
And as you can see at https://regex101.com/r/mU6zZ3/1 - you have exactly one match, and it's even "F2021633_V1" :D
For Java, you have to escape the backslashes, so the resulting code would look like this:
Pattern pattern = Pattern.compile("<_\\d:InsurerId>([^<]*)<\\/_\\d:InsurerId>");

If you are using Java 7 and above, you can use naming groups to make the Regex a little bit more readable (also see the backreference group \k for close tag to match the openning tag):
Pattern pattern = Pattern.compile("(?:<(?<InsurancePrefix>.+)InsurerId>)(?<id>[A-Z0-9_]+)</\\k<InsurancePrefix>InsurerId>");
Matcher matcher = pattern.matcher("<_1:InsurerId>F2021633_V1</_1:InsurerId>");
if (matcher.matches()) {
System.out.println(matcher.group("id"));
}
Using back reference the matches() fails, for example, on this text
<_1:InsurerId>F2021633_V1</_2:InsurerId>
which is correct
Javadoc has a good explanation: https://docs.oracle.com/javase/8/docs/api/
Also you might consider using a different tool (XML parser) instead of Regex, as well, as other people have to support your code, and complex Regex is usually difficult to understand.

Related

Regex to find XML tag in multiline string

Here is a simple function I wrote to get the value from a tag.
public static String getTagAValue(String xmlAsString) {
Pattern pattern = Pattern.compile("<TagA>(.+)</TagA>");
Matcher matcher = pattern.matcher(xmlAsString);
if (matcher.find()) {
return matcher.group(1);
} else {
return null;
}
}
It is not finding a match and returning null.
XML Sample
<xml>
<sample>
<TagA>result</TagA>
</sample>
</xml>
Note, here I used 4 spaces for tabs, but the real string would contain tabs.

Don't use regular expressions to parse XML: it's the wrong tool for the job.
Classic answer here: RegEx match open tags except XHTML self-contained tags
The answer you have accepted gives wrong answers, for example:
It doesn't accept whitespace in places where whitespace is allowed, such as before ">"
It will match a commented-out element, or one that appears in a CDATA section
It does a greedy match, so it will find the LAST matching end tag, not the first one.
However hard you try, you will never get it 100% right.
And in case you care more about performance than correctness, it's also grossly inefficient because of the need for backtracking.
To do the job properly and professionally, use an XML parser.

You probably want to enable that the RegExp works on multi-line:
Pattern.compile("<TagA>(.+)</TagA>", Pattern.DOTALL);
Documentation explains the parameter Pattern.DOTALL:
Enables dotall mode. In dotall mode, the expression . matches any
character, including a line terminator. By default this expression
does not match line terminators.
Edit: While this works in this particular case, please everyone refer to the answert of Michael Kay if you want to solve such problems professionally, efficiently and right.

Finding whole word only in Java string search

I'm running into the problem of finding a searched pattern within a larger pattern in my Java program. For example, I'll try and find all for loops, but will stumble upon formula. Most of the suggestions I've found talk about using regular expression searches like
String regex = "\\b"+keyword+"\\b";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(searchString);
or some variant of this. The issue I'm running into is that I'm crawling through code, not a book-like text where there are spaces on either side of every word. For example, this will miss for(, which I would like to find. Is there another clever way to find whole words only?
Edit: Thanks for the suggestions. How about cases in which there the keyword starts on the first entry of the string? For example,
class Vec {
public:
...
};
where I'm searching for class (or alternatively public). The patterns suggested by Thanga, Austin Lee, npinti, and Kai Iskratsch do not work in this case. Any ideas?

In your case, the issue is that the \b flag will look for punctuation marks, white spaces and the beginning or end of the string. An opening bracket does not fall within any of these categories, and is thus omitted.
The easiest way to fix this would be to replace "\\b"+keyword+"\\b" with "[\\b(]"+keyword+"[\\b)]".
In regex syntax, the square brackets denote a set of which the regex engine will attempt to match any character it contains.
As per this previous SO question, it would seem that \b and [\b] are not the same. Whilst \b represents a word boundary, [\b] represents a backspace character. To fix this, simply replace "\\b"+keyword+"\\b" with "(\b|\()"+keyword+"(\b|\))".

Regex should match 0 or more chars. The below code change will fix the issue
String regex = ".*("+keyword+").*";

You could modify your regex to search for multiple characters afterwords, for example
[^\w]+"for"+[^\w] using the Pattern class in Java.
For your reference:
https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

Basically you will have to adapt your regex to all the possible patterns it can find. But considering your actually dealing with code, you are better of building a parser/tokenizer for that language, or using one that already exists. Then all you have to do is run through the tokens to find the the ones you want.

get the last portion of the link using java regex

I have an arraylist links. All links having same format abc.([a-z]*)/\\d{4}/
List<String > links= new ArrayList<>();
links.add("abc.com/2012/aa");
links.add("abc.com/2014/dddd");
links.add("abc.in/2012/aa");
I need to get the last portion of every link. ie, the part after domain name. Domain name can be anything(.com, .in, .edu etc).
/2012/aa
/2014/dddd
/2012/aa
This is the output i want. How can i get this using regex?
Thanks

Some people, when confronted with a problem, think “I know, I'll use
regular expressions.” Now they have two problems.
(see here for background)
Why use regex ? Perhaps a simpler solution is to use String.split("/") , which gives you an array of substrings of the original string, split by /. See this question for more info.
Note that String.split() does in fact take a regex to determine the boundaries upon which to split. However you don't need a regex in this case and a simple character specification is sufficient.

Try with below regex and use regex grouping feature that is grouped based on parenthesis ().
\.[a-zA-Z]{2,3}(/.*)
Pattern description :
dot followed by two or three letters followed by forward slash then any characters
DEMO
Sample code:
Pattern pattern = Pattern.compile("\\.[a-zA-Z]{2,3}(/.*)");
Matcher matcher = pattern.matcher("abc.com/2012/aa");
if (matcher.find()) {
System.out.println(matcher.group(1));
}
output:
/2012/aa
Note:
You can make it more precise by using \\.[a-zA-Z]{2,3}(/\\d{4}/.*) if there are always 4 digits in the pattern.

String result = s.replaceAll("^[^/]*","");
s would be the string in your list.

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.
Why not just use the URI class?
output = new URI(link).getPath()

Try this one and use the second capturing group
(.*?)(/.*)

Use foreach loop to iterate over list.
Use substring and indexOf('/').
FOR EXAMPLE
String s="abc.com/2014/dddd";
System.out.println(s.substring(s.indexOf('/')));
OUTPUT
/2014/dddd
Or you can go for split method.
System.out.println(s.split("/",2)[1]);//OUTPUT:2014/dddd --->you need to add /

Java: regex - how do i get the first quote text

As a beginner with regex i believe im about to ask something too simple but ill ask anyway hope it won't bother you helping me..
Lets say i have a text like "hello 'cool1' word! 'cool2'"
and i want to get the first quote's text (which is 'cool1' without the ')
what should be my pattern? and when using matcher, how do i guarantee it will remain the first quote and not the second?
(please suggest a solution only with regex.. )

Use this regular expression:
'([^']*)'
Use as follows: (ideone)
Pattern pattern = Pattern.compile("'([^']*)'");
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
System.out.println(matcher.group(1));
}
Or this if you know that there are no new-line characters in your quoted string:
'(.*?)'
when using matcher, how do i guarantee it will remain the first quote and not the second?
It will find the first quoted string first because it starts seaching from left to right. If you ask it for the next match it will give you the second quoted string.

If you want to find first quote's text without the ' you can/should use Lookahead and Lookbehind mechanism like
(?<=').*?(?=')
for example
System.out.println("hello 'cool1' word! 'cool2'".replaceFirst("(?<=').*?(?=')", "ABC"));
//out -> hello 'ABC' word! 'cool2'
more info

You could just split the string on quotes and get the second piece (which will be between the first and second quotes).
If you insist on regex, try this:
/^.*?'(.*?)'/
Make sure it's set to multiline, unless you know you'll never have newlines in your input. Then, get the subpattern from the result and that will be your string.
To support double quotes too:
/^.*?(['"])(.*?)\1/
Then get subpattern 2.

Is this Regex incorrect? No matches found

I'm trying to parse through a string formatted like this, except with more values:
Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value
The Regex
((Key1)=(.*)),((Key2)=(.*)),((Key3)=(.*)),((Key4)=(.*)),((Key5)=(.*)),((Key6)=(.*)),((Key7)=(.*))
In the actual string, there are about double the amount of key/values, but I'm keeping it short for brevity. I have them in parentheses so I can call them in groups. The keys I have stored as Constants, and they will always be the same. The problem is, it never finds a match which doesn't make sense (unless the Regex is wrong)

Judging by your comment above, it sounds like you're creating the Pattern and Matcher objects and associating the Matcher with the target string, but you aren't actually applying the regex. That's a very common mistake. Here's the full sequence:
String regex = "Key1=(.*),Key2=(.*)"; // etc.
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(targetString);
// Now you have to apply the regex:
if (m.find())
{
String value1 = m.group(1);
String value2 = m.group(2);
// etc.
}
Not only do you have to call find() or matches() (or lookingAt(), but nobody ever uses that one), you should always call it in an if or while statement--that is, you should make sure the regex actually worked before you call any methods like group() that require the Matcher to be in a "matched" state.
Also notice the absence of most of your parentheses. They weren't necessary, and leaving them out makes it easier to (1) read the regex and (2) keep track of the group numbers.

Looks like you'd do better to do:
String[] pairs = data.split(",");
Then parse the key/value pairs one at a time

Your regex is working for me...
If you are always getting an IllegalStateException, I would say that you are trying to do something like:
matcher.group(1);
without having invoked the find() method.
You need to call that method before any attempt to fetch a group (or you will be in an illegal state to call the group() method)
Give this a try:
String test = "Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value";
Pattern pattern = Pattern.compile("((Key1)=(.*)),((Key2)=(.*)),((Key3)=(.*)),((Key4)=(.*)),((Key5)=(.*)),((Key6)=(.*)),((Key7)=(.*))");
Matcher matcher = pattern.matcher(test);
matcher.find();
System.out.println(matcher.group(1));

It's not wrong per se, but it requires a lot of backtracking which might cause the regular expression engine to bail. I would try a split as suggested elsewhere, but if you really need to use a regular expression, try making it non-greedy.
((Key1)=(.*?)),((Key2)=(.*?)),((Key3)=(.*?)),((Key4)=(.*?)),((Key5)=(.*?)),((Key6)=(.*?)),((Key7)=(.*?))
To understand why it requires so much backtracking, understand that for
Key1=(.*),Key2=(.*)
applied to
Key1=x,Key2=y
Java's regular expression engine matches the first (.*) to x,Key2=y and then tries stripping characters off the right until it can get a match for the rest of the regular expression: ,Key2=(.*). It effectively ends up asking,
Does "" match ,Key2=(.*), no so try
Does "y" match ,Key2=(.*), no so try
Does "=y" match ,Key2=(.*), no so try
Does "2=y" match ,Key2=(.*), no so try
Does "y2=y" match ,Key2=(.*), no so try
Does "ey2=y" match ,Key2=(.*), no so try
Does "Key2=y" match ,Key2=(.*), no so try
Does ",Key2=y" match ,Key2=(.*), yes so the first .* is "x" and the second is "y".
EDIT:
In Java, the non-greedy qualifier changes things so that it starts off trying to match nothing and then building from there.
Does "x,Key2=(.*)" match ,Key2=(.*), no so try
Does ",Key2=(.*)" match ,Key2=(.*), yes.
So when you've got 7 keys it doesn't need to unmatch 6 of them which involves unmatching 5 which involves unmatching 4, .... It can do it's job in one forward pass over the input.

I'm not going to say that there's no regex that will work for this, but it's most likely more complicated to write (and more importantly, read, for the next person that has to deal with the code) than it's worth. The closest I'm able to get with a regex is if you append a terminal comma to the string you're matching, i.e, instead of:
"Key1=value1,Key2=value2"
you would append a comma so it's:
"Key1=value1,Key2=value2,"
Then, the regex that got me the closest is: "(?:(\\w+?)=(\\S+?),)?+"...but this doesn't quite work if the values have commas, though.
You can try to continue tweaking that regex from there, but the problem I found is that there's a conflict in the behavior between greedy and reluctant quantifiers. You'd have to specify a capturing group for the value that is greedy with respect to commas up to the last comma prior to an non-capturing group comprised of word characters followed by the equal sign (the next value)...and this last non-capturing group would have to be optional in case you're matching the last value in the sequence, and maybe itself reluctant. Complicated.
Instead, my advice is just to split the string on "=". You can get away with this because presumably the values aren't allowed to contain the equal sign character.
Now you'll have a bunch of substrings, each of which that is a bunch of characters that comprise a value, the last comma in the string, followed by a key. You can easily find the last comma in each substring using String.lastIndexOf(',').
Treat the first and last substrings specially (because the first one does not have a prepended value and the last one has no appended key) and you should be in business.

If you know you always have 7, the hack-of-least resistance is
^Key1=(.+),Key2=(.+),Key3=(.+),Key4=(.+),Key5=(.+),Key6=(.+),Key7=(.+)$
Try it out at http://www.fileformat.info/tool/regex.htm
I'm pretty sure that there is a better way to parse this thing down that goes through .find() rather than .matches() which I think I would recommend as it allows you to move down the string one key=value pair at a time. It moves you into the whole "greedy" evaluation discussion.

Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems. - Jamie Zawinski
The simplest solution is the most robust.
final String data = "Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value";
final String[] pairs = data.split(",");
for (final String pair: pairs)
{
final String[] keyValue = pair.split("=");
final String key = keyValue[0];
final String value = keyValue[1];
}

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.