Regex in java question, multiple matches - java

I am trying to match multiple CSS style code blocks in an HTML document. This code will match the first one but won't match the second. What code would I need to match the second? Can I just get a list of the groups that are inside of my 'style' brackets? Should I call the 'find' method to get the next match?
Here is my regex pattern
^.*(<style type="text/css">)(.*)(</style>).*$
Usage:
final Pattern pattern_css = Pattern.compile(css_pattern_buf.toString(),
Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
final Matcher match_css = pattern_css.matcher(text);
if (match_css.matches() && (match_css.groupCount() >= 3)) {
System.out.println("Woot ==>" + match_css.groupCount());
System.out.println(match_css.group(2));
} else {
System.out.println("No Match");
}

I am trying to match multiple CSS style code blocks in a HTML document.
Standard Answer: don't use regex to parse HTML. Regex cannot parse HTML reliably, no matter how complicated and clever you make your expression. Unless you are absolutely sure the exact format of the target document is totally fixed, string or regex processing is insufficient and you must use an HTML parser.
(<style type="text/css">)(.*)(</style>)
That's a greedy expression. The (.*) in the middle will match as much as it possibly can. If you have two style blocks:
<style type="text/css">1</style> <style type="text/css">2</style>
then it will happily match '1</style> <style type="text/css">2'.
Use (.*?) to get a non-greedy expression, which will allow the trailing (</style>) to match at the first opportunity.
Should I call the 'find' method to get the next match?
Yes, and you should have used it to get the first match too. The usual idiom is:
while (matcher.find()) {
s= matcher.group(n);
}
Note that standard string processing (indexOf, etc) may be a simpler approach for you than regex, since you're only using completely fixed strings. However, the Standard Answer still applies.
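The fixed-string approach mentioned above might look roughly like this. It is only a sketch: the class and method names are made up for illustration, it assumes the opening tag appears exactly as written (same case, same attribute), and it is no more robust against real-world HTML than the regex.

```java
import java.util.ArrayList;
import java.util.List;

public class StyleBlockScanner {
    // Assumption: the tags appear in exactly this form in the document.
    private static final String OPEN = "<style type=\"text/css\">";
    private static final String CLOSE = "</style>";

    /** Collects the text between each OPEN/CLOSE pair using indexOf only. */
    public static List<String> scan(String html) {
        List<String> bodies = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = html.indexOf(OPEN, from);
            if (start < 0) {
                break;                      // no further opening tag
            }
            int bodyStart = start + OPEN.length();
            int end = html.indexOf(CLOSE, bodyStart);
            if (end < 0) {
                break;                      // unterminated block: stop scanning
            }
            bodies.add(html.substring(bodyStart, end));
            from = end + CLOSE.length();    // continue after this block
        }
        return bodies;
    }

    public static void main(String[] args) {
        String html = "<style type=\"text/css\">1</style> <style type=\"text/css\">2</style>";
        System.out.println(scan(html));
    }
}
```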

You can simplify the regex as follows:
(<style type="text/css">)(.*?)(</style>)
And if you don’t need the groups 1 and 3 (probably not), I would drop the parentheses, remaining only:
<style type="text/css">(.*?)</style>
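Putting the reluctant quantifier and the find() loop together, a minimal sketch could look like the following. The class and method names are hypothetical, and the flags are my choice: DOTALL so that a style block may span several lines, CASE_INSENSITIVE as in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StyleBlockExtractor {
    // (.*?) is reluctant, so each match ends at the first </style> it meets.
    private static final Pattern STYLE = Pattern.compile(
            "<style type=\"text/css\">(.*?)</style>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the body of every style block, in document order. */
    public static List<String> extractStyleBodies(String html) {
        List<String> bodies = new ArrayList<>();
        Matcher m = STYLE.matcher(html);
        while (m.find()) {              // each call advances to the next match
            bodies.add(m.group(1));
        }
        return bodies;
    }

    public static void main(String[] args) {
        String html = "<style type=\"text/css\">1</style> <style type=\"text/css\">2</style>";
        System.out.println(extractStyleBodies(html));
    }
}
```

Unlike matches(), find() does not require the pattern to cover the whole input, which is why the leading ^.* and trailing .*$ are no longer needed.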

Related

Regex to find XML tag in multiline string

Here is a simple function I wrote to get the value from a tag.
public static String getTagAValue(String xmlAsString) {
Pattern pattern = Pattern.compile("<TagA>(.+)</TagA>");
Matcher matcher = pattern.matcher(xmlAsString);
if (matcher.find()) {
return matcher.group(1);
} else {
return null;
}
}
It is not finding a match and returning null.
XML Sample
<xml>
<sample>
<TagA>result</TagA>
</sample>
</xml>
Note, here I used 4 spaces for tabs, but the real string would contain tabs.
Don't use regular expressions to parse XML: it's the wrong tool for the job.
Classic answer here: RegEx match open tags except XHTML self-contained tags
The answer you have accepted gives wrong answers, for example:
It doesn't accept whitespace in places where whitespace is allowed, such as before ">"
It will match a commented-out element, or one that appears in a CDATA section
It does a greedy match, so it will find the LAST matching end tag, not the first one.
However hard you try, you will never get it 100% right.
And in case you care more about performance than correctness, it's also grossly inefficient because of the need for backtracking.
To do the job properly and professionally, use an XML parser.
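As one possible illustration of the parser route, here is a sketch using the DOM parser that ships with the JDK. The class name is made up, and wrapping parse failures in an IllegalArgumentException is my own choice rather than anything the question prescribes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TagAReader {
    /** Returns the text of the first TagA element, or null if there is none. */
    public static String getTagAValue(String xmlAsString) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xmlAsString)));
            // Comments are not elements, so a commented-out TagA is never returned.
            NodeList nodes = doc.getElementsByTagName("TagA");
            return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : null;
        } catch (Exception e) {
            // Malformed XML or a parser configuration problem ends up here.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<xml>\n\t<sample>\n\t\t<TagA>result</TagA>\n\t</sample>\n</xml>";
        System.out.println(getTagAValue(xml));
    }
}
```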
You probably want the regex to match across multiple lines:
Pattern.compile("<TagA>(.+)</TagA>", Pattern.DOTALL);
Documentation explains the parameter Pattern.DOTALL:
Enables dotall mode. In dotall mode, the expression . matches any
character, including a line terminator. By default this expression
does not match line terminators.
Edit: While this works in this particular case, please refer to Michael Kay's answer if you want to solve such problems professionally, efficiently, and correctly.
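To see what the flag changes, here is a small sketch that runs the same pattern with and without Pattern.DOTALL. The helper name is hypothetical, and I swapped the greedy (.+) for a reluctant (.+?) so that the match stops at the first closing tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotallDemo {
    /** Returns the first TagA body found under the given flags, or null. */
    public static String firstTagA(String xml, int flags) {
        Matcher m = Pattern.compile("<TagA>(.+?)</TagA>", flags).matcher(xml);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String xml = "<TagA>line1\nline2</TagA>";
        // Without DOTALL the dot cannot cross the line break, so nothing matches.
        System.out.println(firstTagA(xml, 0));
        // With DOTALL the dot also matches line terminators.
        System.out.println(firstTagA(xml, Pattern.DOTALL));
    }
}
```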

Regex extract string in java

I'm trying to extract a substring from a String using a regex in Java
Pattern pattern = Pattern.compile("((.|\\n)*).{4}InsurerId>\\S*.{5}InsurerId>((.|\\n)*)");
Matcher matcher = pattern.matcher(abc);
I'm trying to extract the value between
<_1:InsurerId>F2021633_V1</_1:InsurerId>
I'm not sure where I am going wrong, but I don't get any output for
if (matcher.find())
{
System.out.println(matcher.group(1));
}
You can use:
Pattern pattern = Pattern.compile("<([^:]+:InsurerId)>([^<]*)</\\1>");
Matcher matcher = pattern.matcher(abc);
if (matcher.find()) {
System.out.println(matcher.group(2));
}
You may want to use the totally awesome page http://regex101.com/ to test your regular expressions. As you can see at https://regex101.com/r/rV8uM3/1, you only have empty capturing groups, but let me explain to you what you did. :D
((.|\n)*) This matches any character, or a new line, any number of times. It is capturing, so your first matching group will always be everything before <_1:InsurerId>, or an empty string. You could write .* instead, but note that in Java . only matches line terminators when Pattern.DOTALL is set. You can even leave this part out, as it isn't actually part of the String you want to match; keeping it will cause problems if you have multiple InsurerIds in your file and want to get them all.
.{4}InsurerId> This matches "InsurerId>" with any four characters in front of it and is exactly what you want. As the first character is probably always an opening angle bracket (and you don't want stuff like "<ExampleInsurerId>"), I'd suggest using <.{3}InsurerId> instead. This still could have some problems (<Test id="<" xInsurerId>), so if you know exactly that it's "_<a digit>:", why not use <_\d:InsurerId>?
\S* matches everything except for whitespaces - probably not the best idea as XML and similar files can be written to not contain any space at all. You want to have everything to the next tag, so use [^<]* - this matches everything except for an opening angle bracket. You also want to get this value later, so you have to use a capturing group: ([^<]*)
.{5}InsurerId> The same thing here: use <\/.{3}InsurerId> or <\/_\d:InsurerId> (forward slashes are actually characters interpreted by other RegEx implementations, so I suggest escaping them)
((.|\n)*) Again the same thing, just leave it away
The resulting Regular Expression would then be the following:
<_\d:InsurerId>([^<]*)<\/_\d:InsurerId>
And as you can see at https://regex101.com/r/mU6zZ3/1 - you have exactly one match, and it's even "F2021633_V1" :D
For Java, you have to escape the backslashes, so the resulting code would look like this:
Pattern pattern = Pattern.compile("<_\\d:InsurerId>([^<]*)<\\/_\\d:InsurerId>");
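Used with find() in a loop, that pattern can collect every InsurerId in the input. This is only a sketch: the class and method names are invented, and it assumes the prefix is always an underscore followed by a single digit, as discussed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InsurerIdExtractor {
    // Assumption: the namespace prefix is always "_" plus one digit.
    private static final Pattern INSURER_ID =
            Pattern.compile("<_\\d:InsurerId>([^<]*)</_\\d:InsurerId>");

    /** Returns the text of every InsurerId element, in order of appearance. */
    public static List<String> extractAll(String xml) {
        List<String> ids = new ArrayList<>();
        Matcher m = INSURER_ID.matcher(xml);
        while (m.find()) {
            ids.add(m.group(1));    // group 1 is the text between the tags
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(extractAll("<_1:InsurerId>F2021633_V1</_1:InsurerId>"));
    }
}
```

The forward slash is left unescaped here because it has no special meaning in Java's regex syntax.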
If you are using Java 7 or above, you can use named groups to make the regex a little more readable (also note the \k back reference, which makes the closing tag match the opening tag):
Pattern pattern = Pattern.compile("(?:<(?<InsurancePrefix>.+)InsurerId>)(?<id>[A-Z0-9_]+)</\\k<InsurancePrefix>InsurerId>");
Matcher matcher = pattern.matcher("<_1:InsurerId>F2021633_V1</_1:InsurerId>");
if (matcher.matches()) {
System.out.println(matcher.group("id"));
}
Because of the back reference, matches() fails on text whose opening and closing prefixes differ, for example:
<_1:InsurerId>F2021633_V1</_2:InsurerId>
which is the correct behavior.
Javadoc has a good explanation: https://docs.oracle.com/javase/8/docs/api/
Also you might consider using a different tool (XML parser) instead of Regex, as well, as other people have to support your code, and complex Regex is usually difficult to understand.

java regexp parse partial title tag

Ok, quick question. I'm a bit of a newbie at Java, and I have an assignment in which I have to get the name of a person from the title tag of a page. I know my regex, but I can't (or don't know how to) escape some characters.
Example
<title>Mr. Somebody | Department in which he's in</title>
So, basically I need a regexp that would get me the "Mr. Somebody". I've tried :
Pattern pat = Pattern.compile("<title>(.+?)|");
Matcher mat = pat.matcher(data);
boolean found = false;
while (!found && mat.find()) {
name = mat.group(0);
found = true;
}
System.out.println("Found a name : " + name);
My problem is that, no matter what I've tried, the most I could get was the first character. Do you think that a simpler approach with indexOf and substrings would be better, or is a regexp still viable?
I know that usually regexps are not suitable for parsing html tags, but I'm considering this search more of a string search, because I'm not interested in the whole tag (or other tags that might be contained within).
Any kind of help is greatly appreciated :)
You need to escape the pipe because it's a character with a special meaning in regex. Try:
<title>(.+?)\\|
| means "or", so the regex will try to match either <title>(.+?) or nothing (there's nothing after the |).
When it tries to match with <title>(.+?), it will get only the first character because .+? is lazy (it matches as little as possible).
Alternatively, you can use a negated class:
<title>([^\\|]+)
[^\\|]+ will match any character except a pipe.
It should work
Pattern pat = Pattern.compile("<title>(.*?)\\|");
and use
mat.group(1) instead of mat.group(0);
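Combining the escaped pipe with group(1), a runnable sketch might look like this. The class and method names are hypothetical, and the trim() call is my addition to drop the space that precedes the pipe.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleName {
    // \\| in the Java string becomes \| in the regex: a literal pipe.
    private static final Pattern TITLE_NAME = Pattern.compile("<title>(.*?)\\|");

    /** Returns the text between <title> and the first pipe, or null if absent. */
    public static String extractName(String html) {
        Matcher m = TITLE_NAME.matcher(html);
        // group(0) would be the whole match including <title>; group(1) is the name.
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String data = "<title>Mr. Somebody | Department in which he's in</title>";
        System.out.println("Found a name : " + extractName(data));
    }
}
```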
Here's a way to do it that will avoid using Pattern and Matcher, if you want:
String name = "<title>Mr. Somebody | Department in which he's in</title>";
name = name.substring(7).replaceAll("\\|.*", "");
The substring(7) will remove the first tag, then replaceAll will remove everything from the pipe character onwards (replace with empty string).
Maybe this is what you want:
(?<=<title>)(.+?(?=[|].+?))(?=.+?</title>)
It returns Mr. Somebody. You can test it here for example.
Here is a way :
<\s*title[^>]*>\s*([^\|]+)
Takes away leading white space.
Handles any possible weird attributes that someone may add to a title tag, i.e. <title data-cookies="I hide cookies here :P">I like titles</title>
Handles any whitespace added before title, i.e. < title > is still valid.

RegEx - match the whole <a> tag in java

I'm trying to match this <a href="**something**"> using regex in java using this code:
Pattern regex = Pattern.compile("<([a-z]+) *[^/]*?>");
Matcher matcher = regex.matcher(string);
string= matcher.replaceAll("");
I'm not really familiar with regex. What am I doing wrong? Thanks
If you just want to find the start tag you could use:
"<a(?=[>\\s])[^>]*>"
If you are trying to get the href attribute it would be better to use:
"<a\\s+[^>]*href=(['\"])(.*?)\\1[^>]*>"
This would capture the link into capturing group 2.
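A short sketch of how the href pattern could be used from Java follows. The class and method names are invented for illustration, and the usual caveat about parsing HTML with a regex still applies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefExtractor {
    // Group 1 is the opening quote; \\1 requires the same quote to close the value.
    private static final Pattern A_HREF =
            Pattern.compile("<a\\s+[^>]*href=(['\"])(.*?)\\1[^>]*>");

    /** Returns the href value of every matching <a> start tag. */
    public static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = A_HREF.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(2));  // group 2 holds the link itself
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<p><a class=\"x\" href=\"http://example.com/a/b\">link</a></p>";
        System.out.println(extractHrefs(html));
    }
}
```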
To give you an idea of why people always say "don't try to parse HTML with a regular expression", here's a simplified regex for matching an <a> tag:
<\s*a(?:\s+[a-z]+(?:\s*=\s*(?:[a-z0-9]+|"[^"]*"|'[^']*'))?)*\s*>
It actually is possible to match a tag with a regular expression. It just isn't as easy as most people expect.
All of HTML, on the other hand, is not "regular" and so you can't do it with a regular expression. (The "regex" support in many/most languages is actually more powerful than "regular", but few are powerful enough to deal with balanced structures like those in HTML.)
Here's a breakdown of what the above expression does:
<\s* < and possibly some spaces
a "a"
(?: 0 or more...
\s+ some spaces
[a-z]+ attribute name (simplified)
(?: and maybe...
\s*=\s* an equal sign, possibly with surrounding spaces
(?: and one of:
[a-z0-9]+ - a simple attribute value (simplified)
|"[^"]*" - a double-quoted attr value
|'[^']*' - a single quoted attr value
)
)?
)*
\s*> possibly more spaces and then >
(The comments at the start of each group also talk about the operator at
the end of the group, or even in the group.)
There are possibly other simplifications here -- I wrote this from
memory, not from the spec. Even if you follow the spec to the letter, browsers are even more fault tolerant and will accept all sorts of invalid input.
you can just match against:
"<a[^>]*>"
The * quantifier is greedy by default in Java, so this works.
But you cannot match < a whatever="foo" > with that, because of the whitespace.
The following is better, though harder to read:
"<\\s*a\\s+[^>]*>"
(The double \\ is needed because \ is a special character in Java strings.)
This handles optional whitespace before a and requires at least one whitespace character after a.
So you don't match <abcdef> which is not a correct a tag.
(I assume your a tag stands isolated in one line and you are not working with multiline mode enabled. Else it gets far far more complicated.)
Your trailing *[^/]*?> looks a little strange; maybe that is why it doesn't work.
Ok lets check what you are doing:
<([a-z]+) *[^/]*?>
<([a-z]+)
This matches a < followed by one or more letters in [a-z]. The parentheses capture those letters as group 1.
The space followed by * then matches zero or more spaces; the * applies to the space, not to the group.
[^/]*
This matches any run of characters other than /, including an empty run (because of the *).
The trailing ? makes that * reluctant (lazy), so it matches as few characters as possible.
The final > must then appear as the last element.
To sum up, the expression only matches tags that contain no /, so it fails as soon as the href contains one, as most URLs do :)
Take a look at: http://www.regular-expressions.info/
This is a good starting point.

Using Condition in Regular Expressions

Source:
<TD>
<IMG SRC="/images/home.gif">
<IMG SRC="/images/spacer.gif">
<IMG SRC="/images/search.gif">
<IMG SRC="/images/spacer.gif">
<IMG SRC="/images/help.gif">
</TD>
Regex:
(<[Aa]\s+[^>]+>\s*)?<[Ii][Mm][Gg]\s+[^>]+>(?(1)\s*</[Aa]>)
Result:
<IMG SRC="/images/home.gif">
<IMG SRC="/images/spacer.gif">
<IMG SRC="/images/search.gif">
<IMG SRC="/images/spacer.gif">
<IMG SRC="/images/help.gif">
What does the "?(1)" mean?
When I run it in Java, it causes an exception: java.util.regex.PatternSyntaxException; the
"?(1)" can't be recognized.
The explanation in the book is :
This pattern requires explanation. (<[Aa]\s+[^>]+>\s*)? matches an opening <A> or <a> tag (with any attributes that may be present), if present (the closing ? makes the expression optional). <[Ii][Mm][Gg]\s+[^>]+> then matches the <IMG> tag (regardless of case) with any of its attributes. (?(1)\s*</[Aa]>) starts off with a condition: ?(1) means execute only what comes next if backreference 1 (the opening <A> tag) exists (or in other words, execute only what comes next if the first <A> match was successful). If (1) exists, then \s*</[Aa]> matches any trailing whitespace followed by the closing </A> tag.
The syntax is correct. The strange looking (?....) sets up a conditional. This is the regular expression syntax for an if...then statement. The (1) is a back-reference to the capture group at the beginning of the regex, which matches an html <a> tag, if there is one since that capture group is optional. Since the back-reference to the captured tag follows the "if" part of the regex, what it is doing is making sure that there was an opening <a> tag captured before trying to match the closing one. A pretty clever way of making both tags optional, but forcing both when the first one exists. That's how it's able to match all the lines in the sample text even though some of them just have <img> tags.
As to why it throws an exception in your case, most likely the flavor of regex you're using doesn't support conditionals. Not all do.
EDIT: Here's a good reference on conditionals in regular expressions: http://www.regular-expressions.info/conditional.html
What you're looking at is a conditional construct, as Bryan said, and Java doesn't support them. The parenthesized expression immediately after the question mark can actually be any zero-width assertion, like a lookahead or lookbehind, and not just a reference to a capture group. (I prefer to call those back-assertions, to avoid confusion. A back-reference matches the same thing the capture group did, but a back-assertion just asserts that the capture group matched something.)
I learned about conditionals when I was working in Perl years ago, but I've never missed them in Java. In this case, for example, a simple alternation will do the trick:
(?i)<a\s+[^>]+>\s*<img\s+[^>]+>\s*</a>|<img\s+[^>]+>
One advantage of the conditional version is that you can capture the IMG tag with a single capture group:
(?i)(<a\s+[^>]+>\s*)?(<img\s+[^>]+>)(?(1)\s*</a>)
In the alternation version you have to have a capturing group for each alternative, but that's not as important in Java as it is in Perl, with all its built-in regex magic. Here's how I would pluck the IMG tags in Java:
Pattern p = Pattern.compile(
"<a\\s+[^>]+>\\s*(<img\\s+[^>]+>)\\s*</a>|(<img\\s+[^>]+>)",
Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(s);
while (m.find())
{
System.out.println(m.start(1) != -1 ? m.group(1) : m.group(2));
}
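The claim that Java rejects the conditional form while accepting the alternation can be checked directly. The sketch below simply tries to compile each pattern; the class and method names are made up for the illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class ConditionalSupportCheck {
    /** Returns true if java.util.regex accepts the given pattern. */
    public static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;   // the construct is not part of Java's regex syntax
        }
    }

    public static void main(String[] args) {
        // The conditional version from the book.
        String conditional =
                "(<[Aa]\\s+[^>]+>\\s*)?<[Ii][Mm][Gg]\\s+[^>]+>(?(1)\\s*</[Aa]>)";
        // The alternation that replaces it in Java.
        String alternation =
                "(?i)<a\\s+[^>]+>\\s*<img\\s+[^>]+>\\s*</a>|<img\\s+[^>]+>";
        System.out.println("conditional compiles: " + compiles(conditional));
        System.out.println("alternation compiles: " + compiles(alternation));
    }
}
```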
Could it be a non capturing group as described here:
There is also a special group, group
0, which always represents the entire
expression. This group is not included
in the total reported by groupCount.
Groups beginning with (? are pure,
non-capturing groups that do not
capture text and do not count towards
the group total. (You'll see examples
of non-capturing groups later in the
section Methods of the Pattern Class.)
Java Regex Tutorial
The short answer: it doesn't mean anything. The problem lies in this whole snippet:
(?(1)\s*)
() creates a capturing group, so you can reuse any text matched inside through a back reference. Parentheses also allow you to apply operators to everything inside of them (but this isn't done in your example).
? means that the item before it should be matched if it's there but it is also OK if it's not. This simply doesn't make sense when it appears after (
(?:MoreTextHere)
Can be used to speed up RegExs when you don't need to reuse the matched text. But that still doesn't really make sense, why match a 1 when your input is HTML?
Try:
(?:<[Aa]\s+[^>]+>\s*)?<[Ii][Mm][Gg]\s+[^>]+>
You never said exactly what you were trying to match so if this answer doesn't satisfy you, please explain what you're trying to do with RegEx.
