Cannot figure out regex issue - java

I'm trying to extract the text within the title elements and ignore everything else.
I've looked at these articles, but they don't seem to help :\
Regular expression to extract text between square brackets
String Pattern Matching In Java
Java Regex to get the text from HTML anchor (<a>...</a>) tags
The main problem is I am not able to understand what the responders are saying while trying to hack up my own code.
Here is what I've managed from reading the Java API in the Pattern article.
<title>(.*?)</title>
Here's my code to return the title.
String title = null;
Matcher match = Pattern.compile("[<title>](.*?)[</title>]").matcher(this.webPage);
try {
    title = match.group();
} catch (IllegalStateException e) {
    e.printStackTrace();
}
I am getting the IllegalStateException, which says this:
java.lang.IllegalStateException: No match found
at java.util.regex.Matcher.group(Matcher.java:485)
at java.util.regex.Matcher.group(Matcher.java:445)
at BrowserModal.getWebPageTitle(BrowserModal.java:21)
at BrowserTest.main(BrowserTest.java:7)
Line 21 would be "title = match.group();"

"What are the pros and cons of the leading Java HTML parsers?" lists a bunch of HTML parsers. Parse your HTML to a DOM, then use getElementsByTagName("title") to get the title elements, and grab the text content by looking at their children, which should be text nodes.
title = match.group();
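This throws IllegalStateException because group() is called before any match has been attempted: call find() (or matches()) first and only read the groups if it returns true. Note also that group() returns the entire matched text; group(1) will return just the content of the first parenthetical group.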
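A minimal sketch of that point, assuming a hypothetical helper class GroupAfterFind and a one-line sample page: group() is only legal after find() (or matches()) has succeeded, and group(1) is the captured text rather than the whole match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupAfterFind {
    private static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>");

    // Returns the text inside the first <title> element, or null if none is found.
    static String firstTitle(String html) {
        Matcher m = TITLE.matcher(html);
        // find() must succeed before group() may be called
        return m.find() ? m.group(1) : null;
    }

    // Returns the whole matched text (tags included), or null if none is found.
    static String wholeMatch(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group() : null;
    }

    // Returns true if calling group() without a prior find() throws.
    static boolean groupWithoutFindThrows(String html) {
        Matcher m = TITLE.matcher(html);
        try {
            m.group();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Hello</title></head></html>";
        System.out.println(firstTitle(html));
        System.out.println(wholeMatch(html));
        System.out.println(groupWithoutFindThrows(html));
    }
}
```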
[<title>](.*?)[</title>]
The square brackets are just breaking it. [<title>] will match any single character that is an angle bracket or a letter in the word "title".
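To see what the square brackets do, here is a small sketch (the class name CharClassDemo and the sample input are made up for illustration). Each bracketed part matches exactly one character from its set, so the bracketed pattern grabs a stray fragment instead of the title.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharClassDemo {
    // True if the whole string is matched by the character class [<title>].
    static boolean matchesClass(String s) {
        return s.matches("[<title>]");
    }

    // First match of the bracketed pattern from the question, or null.
    static String firstMatchOfBracketedPattern(String html) {
        Matcher m = Pattern.compile("[<title>](.*?)[</title>]").matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(matchesClass("t"));        // a single letter from the set
        System.out.println(matchesClass("<title>"));  // seven characters, not one
        System.out.println(firstMatchOfBracketedPattern("<html><title>Hi</title>"));
    }
}
```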
<title>(.*?)</title>
is better, but will only match a title that is on one line (since . does not, by default, match newlines), and will not match minor variations like
<title lang=en>Foo</title>
It will also fail to find the title correctly in HTML like
<html>
<head>
<!-- <title>Old commented out title</title> -->
<title>Spiffy new title</title>
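The three failure cases above can be reproduced with a short sketch. NaiveTitleLimits is a hypothetical name, and the inputs are the ones from this answer plus a two-line title.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveTitleLimits {
    private static final Pattern NAIVE = Pattern.compile("<title>(.*?)</title>");

    // Returns group 1 of the first match of the naive pattern, or null.
    static String naiveTitle(String html) {
        Matcher m = NAIVE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Title split over two lines: . does not cross the newline, so no match
        System.out.println(naiveTitle("<title>Foo\nBar</title>"));
        // Attribute on the tag: the literal "<title>" is not present, so no match
        System.out.println(naiveTitle("<title lang=en>Foo</title>"));
        // Commented-out title comes first and is returned instead of the real one
        System.out.println(naiveTitle(
            "<html>\n<head>\n<!-- <title>Old commented out title</title> -->\n"
            + "<title>Spiffy new title</title>"));
    }
}
```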

Try this:-
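import java.util.regex.Matcher;
import java.util.regex.Pattern;

String title = null;
String subjectString = "<title>TextWithinTags</title>";
// DOTALL lets . span newlines; CASE_INSENSITIVE also accepts <TITLE>
Pattern titleFinder = Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
Matcher regexMatcher = titleFinder.matcher(subjectString);
while (regexMatcher.find()) {
    title = regexMatcher.group(1); // if there are several matches, the last one wins
}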
Edit:- Regex explained:-
[^>]* :- Anything but > is acceptable there. This is used as we can have attributes in the tags.
(.*?) :- Dot represents any character other than a newline by default; because Pattern.DOTALL is set here, it matches newlines as well. *? means repeat any number of times, but as few as possible.
For more details on regex, check this out.
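To show what the two flags change, here is a small sketch. DotAllDemo and its method names are hypothetical, and the sample title is made up. It runs the same regex with and without the flags on an upper-case, two-line title.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    private static final String REGEX = "<title[^>]*>(.*?)</title>";

    // Compiles REGEX with the given flags and returns group 1 of the first match, or null.
    static String titleWithFlags(String html, int flags) {
        Matcher m = Pattern.compile(REGEX, flags).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    static String titleDefault(String html) {
        return titleWithFlags(html, 0);
    }

    static String titleIgnoreCaseOnly(String html) {
        return titleWithFlags(html, Pattern.CASE_INSENSITIVE);
    }

    static String titleDotAllIgnoreCase(String html) {
        return titleWithFlags(html, Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        String html = "<TITLE lang=en>Two\nlines</TITLE>";
        System.out.println(titleDefault(html));          // no flags
        System.out.println(titleIgnoreCaseOnly(html));   // case handled, newline still blocks .
        System.out.println(titleDotAllIgnoreCase(html)); // both flags
    }
}
```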

This gets the title in just one line of java code:
String title = html.replaceAll("(?s).*<title>(.*)</title>.*", "$1");
This regex assumes the HTML is "simple", and with the "DOTALL" switch (?s) (which means dots also match new-line chars), it will work with multi-line input, and even multi-line titles.
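A quick sketch of how the one-liner behaves, wrapped in a hypothetical OneLineTitle class. One thing worth knowing: when the input has no title element, replaceAll finds nothing to replace and returns the input unchanged, so the caller cannot tell "no title" from "the page is the title" without an extra check.

```java
class OneLineTitle {
    // The one-liner from above: replaces the whole document with the captured title.
    static String title(String html) {
        return html.replaceAll("(?s).*<title>(.*)</title>.*", "$1");
    }

    public static void main(String[] args) {
        System.out.println(title("<html>\n<head><title>My\nPage</title></head>\n</html>"));
        // No <title> element: the input comes back as-is
        System.out.println(title("<p>x</p>"));
    }
}
```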

Related

Regex <img > Tag parsing with src, width, height

You may react to this saying that HTML Parsing using regex is a totally bad idea, following this for example, and you are right.
But in my case, the following html node is created by our own server so we know that it will always look like this, and as the regex will be in a mobile android library, I don't want to use a library like Jsoup.
What I want to parse: <img src="myurl.jpg" width="12" height="32">
What should be parsed:
match a regular img tag, and group the src attribute value: <img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>
width and height attribute values: (width|height)\s*=\s*['"]([^'"]*)['"]*
So the first regex will have a #1 group with the img url, and the second regex will have two matches with subgroups of their values.
How can I merge both?
Desired output:
img url
width value
height value
To match any img tag with src, height and width attributes that can come in any order and that are in fact optional, you can use
"(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3"
See the regex demo and an IDEONE Java demo:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "<img height=\"132\" src=\"NEW_myurl.jpg\" width=\"112\"><link src=\"/test/test.css\"/><img src=\"myurl.jpg\" width=\"12\" height=\"32\">";
Pattern pattern = Pattern.compile("(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    if (!matcher.group(1).isEmpty()) { // We have a new IMG tag
        System.out.println("\n--- NEW MATCH ---");
    }
    System.out.println(matcher.group(2) + ": " + matcher.group(4));
}
The regex details:
(<img\\b|(?!^)\\G) - the initial boundary matching the <img> tag start or the end of the previous successful match
[^>]*? - match any optional attributes we are not interested in (0+ characters other than > so as to stay inside the tag)
\\b(src|width|height)= - a whole word src=, width= or height=
([\"']?) - a technical 3rd group to check the attribute value delimiter
([^>]*?) - Group 4 containing the attribute value (0+ characters other than a >, as few as possible, up to the first occurrence of the delimiter captured in Group 3)
\\3 - attribute value delimiter matched with the Group 3 (NOTE if a delimiter may be empty, add (?=\\s|/?>) at the end of the pattern)
The logic:
Match the start of img tag
Then, match everything that is inside, but only capture the attributes we need
Since we are going to have multiple matches, not groups, we need to find a boundary for each new img tag. This is done by checking if the first group is not empty (if (!matcher.group(1).isEmpty()))
All there remains to do is to add a list for keeping matches.
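One way to add that list, as a sketch rather than the only option: keep one map per img tag and start a new map whenever group 1 is non-empty. ImgAttributeCollector and collect are hypothetical names. A LinkedHashMap is used so the attributes stay in the order they appear in the tag.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgAttributeCollector {
    private static final Pattern ATTR = Pattern.compile(
        "(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3");

    // Returns one attribute map per <img> tag, in document order.
    static List<Map<String, String>> collect(String html) {
        List<Map<String, String>> tags = new ArrayList<>();
        Map<String, String> current = null;
        Matcher m = ATTR.matcher(html);
        while (m.find()) {
            // A non-empty group 1 means this match started at a new "<img"
            if (current == null || !m.group(1).isEmpty()) {
                current = new LinkedHashMap<>();
                tags.add(current);
            }
            current.put(m.group(2), m.group(4));
        }
        return tags;
    }

    public static void main(String[] args) {
        String s = "<img height=\"132\" src=\"NEW_myurl.jpg\" width=\"112\">"
            + "<link src=\"/test/test.css\"/>"
            + "<img src=\"myurl.jpg\" width=\"12\" height=\"32\">";
        for (Map<String, String> tag : collect(s)) {
            System.out.println(tag);
        }
    }
}
```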
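If you want to combine both, here is a single regex. It assumes the attributes always appear in the order src, width, height, separated by whitespace and quoted with double quotes:
<img\s+src="([^"]+)"\s+width="([^"]+)"\s+height="([^"]+)"
Sample I tested it on:
<img src="rakesh.jpg" width="25" height="45">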
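In Java the fixed-order regex could be used roughly like this. FixedOrderImg and parse are hypothetical names. The sketch returns null when the tag does not match, which includes the case where the attributes come in a different order.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FixedOrderImg {
    private static final Pattern IMG = Pattern.compile(
        "<img\\s+src=\"([^\"]+)\"\\s+width=\"([^\"]+)\"\\s+height=\"([^\"]+)\"");

    // Returns {src, width, height} for the first matching tag, or null if none matches.
    static String[] parse(String html) {
        Matcher m = IMG.matcher(html);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = parse("<img src=\"rakesh.jpg\" width=\"25\" height=\"45\">");
        System.out.println(String.join(", ", parts));
        // Different attribute order: this pattern does not match
        System.out.println(parse("<img width=\"25\" src=\"rakesh.jpg\" height=\"45\">") == null);
    }
}
```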
You may want this :
"(?i)(src|width|height)=\"(.*?)\""
Update:
I misunderstood your question, you need something like :
"(?i)<img\\s+src=\"(.*?)\"\\s+width=\"(.*?)\"\\s+height=\"(.*?)\">"
Regex101 Demo
Update 2
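The regex below is an alternation, so each call to find() matches one attribute and only the group for that attribute is set. Note that, as written, the src branch only matches directly after <img and the height branch only matches directly before >, so it does not cover every possible attribute order: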
"(?i)(?><img\\s+)src=\"(.*?)\"|width=\"(.*?)\"|height=\"(.*?)\">"
Regex101 Demo v2

Java: match all strings that do not end in .htm?

I'm parsing an HTML file in Java using regular expressions and I want to know how to match all href="" attributes whose value does not end in .htm or .html and, if one matches, capture the content between the quotes into a group.
These are the ones I've tried so far:
href\s*[=]\s*"(.+?)(?![.]htm[l]?)"
href\s*[=]\s*"(.*?)(?![.]htm[l]?)"
href\s*[=]\s*"(?![.]htm[l]?)"
I understand that with the first two, the entire string between quotes is being captured into the first group, including the .htm(l) if it is present.
Does anyone know how I can avoid this from happening?
You can just rearrange the expression, and move the negative look-ahead to before the capturing:
href\s*[=]\s*"(?!.+?[.]htm[l]?")(.+?)"
Here is a demo.
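One caveat worth testing: in the lookahead, .+? is free to run past the closing quote, so an href can be rejected because a later href on the same line ends in .html. The sketch below compares the expression above with a variant that keeps the lookahead inside the quotes by using [^"]*. The variant, the class name HrefFilter, and the sample HTML are my own assumptions, not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefFilter {
    // Expression from the answer: .+? in the lookahead may cross the closing quote
    private static final Pattern ORIGINAL =
        Pattern.compile("href\\s*[=]\\s*\"(?!.+?[.]htm[l]?\")(.+?)\"");
    // Assumed variant: [^\"]* cannot leave the attribute value
    private static final Pattern BOUNDED =
        Pattern.compile("href\\s*=\\s*\"(?![^\"]*\\.html?\")([^\"]+)\"");

    private static List<String> hrefs(Pattern p, String html) {
        List<String> result = new ArrayList<>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    static List<String> withOriginal(String html) {
        return hrefs(ORIGINAL, html);
    }

    static List<String> withBounded(String html) {
        return hrefs(BOUNDED, html);
    }

    public static void main(String[] args) {
        String html = "<a href=\"a.pdf\">A</a> <a href=\"b.html\">B</a> "
            + "<a href=\"c.htm\">C</a> <a href=\"d/e.png\">D</a>";
        System.out.println(withOriginal(html));
        System.out.println(withBounded(html));
    }
}
```

If every link sits on its own line, the two behave the same, because . does not cross newlines by default.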
As a side answer, jsoup is a very good API when dealing with html.
Using jsoup:
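import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = Jsoup.parse(html);
for (Element link : doc.select("a")) {
    String linkHref = link.attr("href");
    // the question asks for links that do NOT end in .htm or .html
    if (!(linkHref.endsWith(".htm") || linkHref.endsWith(".html"))) {
        // do something
    }
}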
Try this: .*\.(?!(htm|html)$)
Any character, any number of times (.*), followed by a dot (\.), not followed by htm or html at the end of the string ((?! ... )).
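If the string to test is just the link target (not the whole page), the same negative-lookahead idea can be anchored at the start of the string. This is a sketch of an alternative form, not the expression above: NotHtmlCheck is a hypothetical name, and the pattern is written for use with matches(), which requires the whole string to match.

```java
import java.util.regex.Pattern;

class NotHtmlCheck {
    // The lookahead rejects the string if it ends in .htm or .html; .* then accepts the rest.
    private static final Pattern NOT_HTML = Pattern.compile("(?!.*\\.html?$).*");

    static boolean doesNotEndInHtml(String target) {
        return NOT_HTML.matcher(target).matches();
    }

    public static void main(String[] args) {
        System.out.println(doesNotEndInHtml("page.html"));
        System.out.println(doesNotEndInHtml("page.htm"));
        System.out.println(doesNotEndInHtml("doc.pdf"));
        System.out.println(doesNotEndInHtml("a.html.bak"));
    }
}
```

For a plain suffix test like this, String.endsWith is simpler. The regex form is mainly useful when it has to be embedded in a larger pattern.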

Replacing link tag using regex

I'm using Java, and I want to find and replace the hyperlink and the anchor text of an HTML <a> tag. I know I have to use the replace() method, but I'm pretty bad at regex.
An example:
<a href="http://example.com">anchor text 1</a>
will be replaced by:
<a href="http://anotherweb.com">anchor text 2</a>
Could you show me the regex for that purpose? Thanks a lot.
Don't use regex for this task. You should use some HTML parser like Jsoup:
String str = "<a href='http://example.com'>anchor text 1</a>";
Document doc = Jsoup.parse(str);
str = doc.select("a[href]").attr("href", "http://anotherweb.com").first().toString();
System.out.println(str);
You could perhaps use a replaceAll with the regex:
<a href=\"[^\"]+\">[^<]+</a>
And replace with:
<a href=\"http://anotherweb.com\">anchor text 2</a>
[^\"]+ and [^<]+ are negated character classes and will match one or more characters other than " and < respectively.
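A minimal sketch of the replaceAll approach, assuming the anchor has exactly one double-quoted href attribute and plain text content. AnchorReplaceDemo, its parameters, and the sample URLs are illustrative only. Matcher.quoteReplacement is used so that a $ or \ in the new URL or text is not treated as a group reference.

```java
import java.util.regex.Matcher;

class AnchorReplaceDemo {
    // Replaces every simple <a href="...">text</a> with a new link and anchor text.
    static String replaceAnchors(String html, String newHref, String newText) {
        String replacement = "<a href=\"" + newHref + "\">" + newText + "</a>";
        return html.replaceAll("<a href=\"[^\"]+\">[^<]+</a>",
                               Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://example.com\">anchor text 1</a></p>";
        System.out.println(replaceAnchors(html, "http://anotherweb.com", "anchor text 2"));
    }
}
```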

Help with a regex to parse through and grab contents of <p> tag in html

I have a site I am trying to grab data from, and the content is laid out like this:
<p uri="/someRandomURL.p1" class="">TestData TestData TestData</p>
<p uri="/someRandomURL.p2" class="">TestData1 TestData1 TestData1</p>
I am using Java to grab the webpage's content, and am trying to parse through it like this:
Pattern p = Pattern.compile(".*?p1' class=''>(.*?)<.*");
Matcher m = p.matcher(data);
//Print out regex groups to console
System.out.println(m.group(1)) ;
But then an exception is thrown saying there is no match found...
Is my regex right? What else could possibly be going on? I am getting the html ok, but apparently there is no match for my regex...
Thanks
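Two things stop this from matching regardless of the input. First, the pattern looks for single quotes (p1' class='') while the HTML uses double quotes. Second, group(1) is called without first calling m.find() (or m.matches()) and checking that it returned true. On top of that, if the text elements contain multiple text lines, then it wouldn't find a match, because the dot (.) doesn't match \n (by default).
Give this pattern a try, and call m.find() before reading the group:
Pattern p = Pattern.compile(".*?p1\" class=\"\">(.*?)<.*", Pattern.DOTALL);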
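Putting it together on the sample markup from the question, as a sketch with the hypothetical class name ParagraphText. The leading .*? and trailing .* are dropped here because find() already searches anywhere in the input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParagraphText {
    // Returns the text of the first <p> whose uri ends in p1, or null if there is none.
    static String firstP1Text(String data) {
        Matcher m = Pattern.compile("p1\" class=\"\">(.*?)<", Pattern.DOTALL).matcher(data);
        return m.find() ? m.group(1) : null;
    }

    // The single-quote pattern from the question, for comparison.
    static boolean singleQuotePatternFinds(String data) {
        return Pattern.compile(".*?p1' class=''>(.*?)<.*", Pattern.DOTALL)
                      .matcher(data).find();
    }

    public static void main(String[] args) {
        String data =
            "<p uri=\"/someRandomURL.p1\" class=\"\">TestData TestData TestData</p>\n"
            + "<p uri=\"/someRandomURL.p2\" class=\"\">TestData1 TestData1 TestData1</p>";
        System.out.println(firstP1Text(data));
        System.out.println(singleQuotePatternFinds(data));
    }
}
```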

Java Regex to get the text from HTML anchor (<a>...</a>) tags

I'm trying to get a text within a certain tag. So if I have:
<a href="http://something.com">Found</a>
I want to be able to retrieve the Found text.
I'm trying to do it using regex. I am able to do it if the <a href="http://something.com"> part stays the same, but it doesn't.
So far I have this:
Pattern titleFinder = Pattern.compile( ".*[a-zA-Z0-9 ]* ([a-zA-Z0-9 ]*)</a>.*" );
I think the last two parts - the ([a-zA-Z0-9 ]*)</a>.* - are ok but I don't know what to do for the first part.
As they said, don't use regex to parse HTML. If you are aware of the shortcomings, you might get away with it, though. Try
Pattern titleFinder = Pattern.compile("<a[^>]*>(.*?)</a>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
Matcher regexMatcher = titleFinder.matcher(subjectString);
while (regexMatcher.find()) {
    // matched text: regexMatcher.group(1)
}
This will iterate over all matches in a string.
It won't handle nested <a> tags and ignores all the attributes inside the tag.
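Here is the loop above fleshed out into something runnable, collecting each anchor text into a list. AnchorTexts and extract are hypothetical names, and the sample input is made up. The only change to the pattern is an added \b after <a, an assumption on my part so that tags such as <abbr> are not picked up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorTexts {
    private static final Pattern ANCHOR = Pattern.compile(
        "<a\\b[^>]*>(.*?)</a>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the text between each <a ...> and the next </a>, in document order.
    static List<String> extract(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            texts.add(m.group(1));
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<A HREF=\"http://something.com\">Found</A> and "
            + "<a href='x'>Also\nfound</a> <abbr>not a link</abbr>";
        System.out.println(extract(html));
    }
}
```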
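To strip the tags and keep only the text, remove them with replaceAll. The [^>]* is needed so that opening tags carrying attributes are removed too, and the result has to be assigned because strings are immutable:
str = str.replaceAll("</?a\\b[^>]*>", "");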
Here is online ideone demo
Here is similar topic : How to remove the tags only from a text ?
