Java: match all strings that do not end in .htm?

I'm parsing an HTML file in Java using regular expressions and I want to know how to match all href="" attributes that do not end in .htm or .html and, if one matches, capture the content between the quotes into a group.
These are the ones I've tried so far:
href\s*[=]\s*"(.+?)(?![.]htm[l]?)"
href\s*[=]\s*"(.*?)(?![.]htm[l]?)"
href\s*[=]\s*"(?![.]htm[l]?)"
I understand that with the first two, the entire string between quotes is being captured into the first group, including the .htm(l) if it is present.
Does anyone know how I can prevent this from happening?

You can just rearrange the expression and move the negative look-ahead to before the capturing group:
href\s*[=]\s*"(?!.+?[.]htm[l]?")(.+?)"
Here is a demo.
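For illustration, here is a minimal Java sketch of the same idea. The class and method names are made up for this example. It also swaps .+? for [^"]* inside the look-ahead, on the assumption that the check should stay inside the quoted value: with .+?, a later attribute such as title="c.htm" on the same line appears to cause the href to be rejected too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonHtmlHrefs {
    // Negative look-ahead sits right after the opening quote, before the capturing group.
    // [^"]* keeps the look-ahead from running past the closing quote of this attribute.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"(?![^\"]*\\.html?\")([^\"]+)\"");

    // Returns the href values that do not end in .htm or .html.
    static List<String> find(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a href=\"a.html\">A</a> "
                + "<a href=\"b.pdf\" title=\"c.htm\">B</a> "
                + "<a href = \"d.htm\">D</a>";
        System.out.println(find(html)); // prints [b.pdf]
    }
}
```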

As a side answer, jsoup is a very good API when dealing with html.
Using jsoup:
Document doc = Jsoup.parse(html);
for (Element link : doc.select("a")) {
    String linkHref = link.attr("href");
    // keep only the links that do NOT end in .htm or .html
    if (!(linkHref.endsWith(".htm") || linkHref.endsWith(".html"))) {
        // do something
    }
}

Try this: .*\.(?!(htm|html)$)
Any character in any number (.*), followed by a dot (\.), not followed by htm or html at the end of the string ((?! ... )).
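A hedged note on the pattern above: when used with find(), it seems to accept a name such as a.b.htm, because the engine can settle on the earlier dot, and it never matches a string without a dot. If the goal is a yes/no test on a whole string, one possible sketch (class and method names are invented here) puts the look-ahead at the start and uses matches():

```java
import java.util.regex.Pattern;

class NotHtmlSuffix {
    // The look-ahead scans the whole string once from the start:
    // it fails if the string ends in .htm or .html (case-insensitively).
    private static final Pattern NOT_HTML =
            Pattern.compile("(?!.*\\.html?$).*", Pattern.CASE_INSENSITIVE);

    static boolean isNotHtml(String s) {
        return NOT_HTML.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isNotHtml("report.pdf")); // true
        System.out.println(isNotHtml("index.html")); // false
        System.out.println(isNotHtml("a.b.htm"));    // false
    }
}
```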

Related

Regex for find accent letters that aren't inside html/jsp comments

Hello, I need to find all the accented letters that aren't inside comments in JSP files.
For example:
<%--This jsp comment have accents áóéí--%>
<html>
<!--This html comment have accents áóéí-->
<h1>This text have accents áóí</h1>
<html>
I need to find the accented letters inside the h1 tag but not the ones inside the comments.
So far I have a regex that finds the comments, but I don't know how to negate that part.
This is the regex I have:
\<[!%][ \r\n\t]*(--([^\-]|[\r\n]|-[^\-])*--[ \r\n\t]*)\%*>
I tried:
[ó](?!(\<[!%][ \r\n\t]*(--([^\-]|[\r\n]|-[^\-])*--[ \r\n\t]*)\%*>))
But it didn't work.
How can I negate that expression?
It is not feasible to match the inner text of each and every HTML tag with a regex.
I suggest using a Java HTML parser instead. jsoup is a good one. See jsoup cookbook for more examples.
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
String text = doc.body().text(); // "An example link"
String linkHref = link.attr("href"); // "http://example.com/"
String linkText = link.text(); // "example"
If you need to simply delete them, use Notepad++ regex Find-and-Replace (CHECK the box for ". matches newline"):
Find what:
(--%?>(?:(?!<%--|<!--).)*?)[^-!~##$%^&*()+=.,<>|?/{}\[\]\\""';:\w\s]+
Replace with:
$1
Repeat that Find-and-Replace until it can't find any more matches.
Otherwise, you can just use that regex to find them and deal with them individually.
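If the work has to stay in plain Java regex, one alternative (not what either answer above proposes) is to avoid negating the comment pattern at all: strip the comments first, then search what is left. The sketch below assumes simple, non-nested comments. The class name is invented, and "accented letter" is approximated as "any non-ASCII letter". Note that offsets into the original file are lost once the comments are removed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AccentsOutsideComments {
    // JSP comments <%-- ... --%> and HTML comments <!-- ... -->, possibly spanning lines.
    private static final Pattern COMMENT =
            Pattern.compile("<%--.*?--%>|<!--.*?-->", Pattern.DOTALL);
    // Any letter outside the ASCII range (accented letters, but also other non-ASCII letters).
    private static final Pattern NON_ASCII_LETTER =
            Pattern.compile("[\\p{L}&&[^\\p{ASCII}]]");

    static List<String> find(String source) {
        // Step 1: drop every comment. Step 2: scan the remainder.
        String withoutComments = COMMENT.matcher(source).replaceAll("");
        List<String> found = new ArrayList<>();
        Matcher m = NON_ASCII_LETTER.matcher(withoutComments);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Unicode escapes stand for a-acute, o-acute, e-acute and i-acute.
        String jsp = "<%--comment \u00e1\u00f3--%>\n"
                + "<!--comment \u00e9\u00ed-->\n"
                + "<h1>text \u00e1\u00f3\u00ed</h1>";
        System.out.println(find(jsp)); // only the three letters inside the h1
    }
}
```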

Unable to parse string with a dot in Java with REGEX

When copying and pasting content from a Word document into a Vaadin7 RichTextArea (or any other rich text field), there are plenty of unwanted HTML tags and attributes. Since the width attribute does some funny business in a current project, I'd like to remove them with the following function:
private String cleanUpHTMLcontent(String content) {
    LOG.log(Level.INFO, "Cleaning up that rubbish now");
    content = content.replaceAll("width=\"[0-9]*\"",""); // this works fine
    content = content.replaceAll("width:[0-9]*[\\.|]*[0-9]*pt;",""); // not working
    content = content.replaceAll(";width:[0-9]*[\\.|]*[0-9]*pt",""); // not working
    content = content.replaceAll("width:[0-9]*[\\.|]*[0-9]*pt",""); // not working
    return content;
}
The first line works fine to remove old HTML attributes like width="500"; the other lines go into the style attribute and try to remove properties like width:300.45pt; with different positions of the semicolon.
The code works fine on the test page http://www.regexplanet.com/advanced/java/index.html . I generated my regex strings there, specifically for Java, but it's still not working. Does anyone have an idea?
Here is an example where it doesn't find the width property:
td style="width:453.1pt;border:solid windowtext 1.0pt;
UPDATE
content = content.replaceAll("width:\\s*[.0-9]*pt;",""); // doesn't work
content = content.replaceAll(";width:\\s*[.0-9]*pt",""); // doesn't work
content = content.replaceAll("width:\\s*[.0-9]*pt",""); // works :-)
It appears that I have to escape the semicolon with a backslash as well? I will test that.
To remove any number of digits with a dot you can use a character class, [.\d]* or [.0-9]*:
"\\bwidth:\\s*[.0-9]*pt;"
See the regex demo
The \b is a word boundary (makes sure we only match width as a whole word).
Details:
\b - leading word boundary
width: - literal string width:
\s* - 0+ whitespace symbols
[.0-9]* - 0+ dots or digits
pt; - literal pt;
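As a small, hedged sketch of the replacement in use (class and method names are invented): this variant makes the trailing semicolon optional so one call covers all three positions tried in the question. It also replaces \b with a look-behind, on the assumption that properties such as border-width should be left alone; \b would also match there, since a hyphen is not a word character.

```java
class WidthCleaner {
    // Removes "width:<number>pt" with an optional trailing semicolon from style text.
    // (?<![\w-]) rejects matches preceded by a word character or a hyphen,
    // so "border-width:" and "max-width:" are not touched.
    static String stripWidth(String html) {
        return html.replaceAll("(?<![\\w-])width:\\s*[.0-9]*pt;?", "");
    }

    public static void main(String[] args) {
        String cell = "td style=\"width:453.1pt;border:solid windowtext 1.0pt;";
        System.out.println(stripWidth(cell)); // td style="border:solid windowtext 1.0pt;
    }
}
```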

Replacing link tag using regex

I'm using Java, and I want to find and then replace the hyperlink and anchor text of an HTML <a> tag. I know I must use the replace() method, but I'm pretty bad at regex.
An example:
anchor text 1
will be replaced by:
anchor text 2
Could you show me the regex for that purpose? Thanks a lot.
Don't use regex for this task. You should use some HTML parser like Jsoup:
String str = "<a href='http://example.com'>anchor text 1</a>";
Document doc = Jsoup.parse(str);
str = doc.select("a[href]").attr("href", "http://anotherweb.com").first().toString();
System.out.println(str);
You could perhaps use a replaceAll with the regex:
[^<]+
And replace with:
anchor text 2
[^\"]+ and [^<]+ are negated character classes and will match all characters except " and < respectively.
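The pattern in this answer looks truncated as shown, so the following is only a guess at how the two negated classes might be combined in Java. The full pattern, the class name and the method name are assumptions for illustration, not the answerer's exact code. It only handles the simple double-quoted form with no other attributes.

```java
import java.util.regex.Matcher;

class LinkReplacer {
    // Assumed pattern: [^"]+ spans the old href value, [^<]+ spans the old anchor text.
    private static final String LINK = "<a href=\"[^\"]+\">[^<]+</a>";

    static String replaceLink(String html, String newHref, String newText) {
        String replacement = "<a href=\"" + newHref + "\">" + newText + "</a>";
        // quoteReplacement protects any $ or \ in the new values from being treated specially.
        return html.replaceAll(LINK, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://example.com\">anchor text 1</a></p>";
        System.out.println(replaceLink(html, "http://anotherweb.com", "anchor text 2"));
    }
}
```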

Cannot figure out regex issue

I'm trying to extract the text within the title elements and ignore everything else.
I've looked at these articles, but they don't seem to help :\
Regular expression to extract text between square brackets
String Pattern Matching In Java
Java Regex to get the text from HTML anchor (<a>...</a>) tags
The main problem is I am not able to understand what the responders are saying while trying to hack up my own code.
Here is what I've managed from reading the Java API in the Pattern article.
<title>(.*?)</title>
Here's my code to return the title.
String title = null;
Matcher match = Pattern.compile("[<title>](.*?)[</title>]").matcher(this.webPage);
try {
    title = match.group();
}
catch (IllegalStateException e)
{
    e.printStackTrace();
}
I am getting the IllegalStateException, which says this:
java.lang.IllegalStateException: No match found
at java.util.regex.Matcher.group(Matcher.java:485)
at java.util.regex.Matcher.group(Matcher.java:445)
at BrowserModal.getWebPageTitle(BrowserModal.java:21)
at BrowserTest.main(BrowserTest.java:7)
Line 21 would be "title = match.group();"
"What are the pros and cons of the leading Java HTML parsers?" lists a bunch of HTML parsers. Parse your HTML to a DOM, then use getElementsByTagName("title") to get the title elements, and grab the text content by looking at their children, which should be text nodes.
title = match.group();
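This fails because group() is called before any match has been attempted: find() (or matches()) must succeed first, otherwise group() throws IllegalStateException. Note also that group() returns the entire matched text; group(1) will return just the content of the first parenthetical group.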
[<title>](.*?)[</title>]
The square brackets are just breaking it. [<title>] will match any single character that is an angle bracket or a letter in the word "title".
<title>(.*?)</title>
is better, but will only match a title that is on one line (since . does not, by default, match newlines), and will not match minor variations like
<title lang=en>Foo</title>
It will also fail to find the title correctly in HTML like
<html>
<head>
<!-- <title>Old commented out title</title> -->
<title>Spiffy new title</title>
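A short sketch of that last pitfall, using the plain pattern from above (class and method names are invented): find() is called before group(1), and the first hit is the commented-out title rather than the real one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitlePitfall {
    private static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>");

    // Returns the text of the first <title> match, or null when nothing matches.
    static String firstTitle(String html) {
        Matcher m = TITLE.matcher(html);
        // find() has to succeed before group(1) may be called.
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<html>\n<head>\n"
                + "<!-- <title>Old commented out title</title> -->\n"
                + "<title>Spiffy new title</title>";
        System.out.println(firstTitle(html)); // prints Old commented out title
    }
}
```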
Try this:-
String title = null;
String subjectString = "<title>TextWithinTags</title>";
Pattern titleFinder = Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
Matcher regexMatcher = titleFinder.matcher(subjectString);
while (regexMatcher.find()) {
    title = regexMatcher.group(1);
}
Edit:- Regex explained:-
[^>]* :- Anything but > is acceptable there. This is used as we can have attributes in the tags.
(.*?) :- Dot represents any character other than a newline character, unless Pattern.DOTALL is set (as it is here), in which case it matches newlines too. *? means repeat any number of times, but as few as possible.
For more details on regex, check this out.
This gets the title in just one line of java code:
String title = html.replaceAll("(?s).*<title>(.*)</title>.*", "$1");
This regex assumes the HTML is "simple", and with the "DOTALL" switch (?s) (which means dots also match new-line chars), it will work with multi-line input, and even multi-line titles.

How to use regular expressions to parse HTML in Java?

Please can someone tell me a simple way to find href and src tags in an html file using regular expressions in Java?
And then, how do I get the URL associated with the tag?
Thanks for any suggestion.
Using regular expressions to pull values from HTML is always a mistake. HTML syntax is a lot more complex than it may first appear, and it's very easy for a page to catch out even a very complex regular expression.
Use an HTML Parser instead. See also What are the pros and cons of the leading Java HTML parsers?
The other answers are true. Java Regex API is not a proper tool to achieve your goal. Use efficient, secure and well tested high-level tools mentioned in the other answers.
If your question is more about the Regex API than a real-life problem (for learning purposes, for example), you can do it with the following code:
String html = "foo <a href='link1'>bar</a> baz <a href='link2'>qux</a> foo";
Pattern p = Pattern.compile("<a href='(.*?)'>");
Matcher m = p.matcher(html);
while (m.find()) {
    System.out.println(m.group(0));
    System.out.println(m.group(1));
}
And the output is:
<a href='link1'>
link1
<a href='link2'>
link2
Please note that the lazy/reluctant quantifier *? must be used in order to limit the group to a single tag. Group 0 is the entire match, group 1 is the first capturing group (the first pair of parentheses).
Don't use regular expressions; use NekoHTML or TagSoup, which act as a bridge providing a SAX or DOM approach, as in XML, to visiting an HTML document.
If you want to go down the HTML parsing route, which Dave and I recommend, here's the code to parse a String data for anchor tags and print their href.
Since you're just using anchor tags, you should be OK with just regex, but if you want to do more, go with a parser. The Mozilla HTML Parser is the best out there.
File parserLibraryFile = new File("lib/MozillaHtmlParser/native/bin/MozillaParser" + EnviromentController.getSharedLibraryExtension());
String parserLibrary = parserLibraryFile.getAbsolutePath();
// mozilla.dist.bin directory :
final File mozillaDistBinDirectory = new File("lib/MozillaHtmlParser/mozilla.dist.bin." + EnviromentController.getOperatingSystemName());
MozillaParser.init(parserLibrary, mozillaDistBinDirectory.getAbsolutePath());
MozillaParser parser = new MozillaParser();
Document domDocument = parser.parse(data);
NodeList list = domDocument.getElementsByTagName("a");
for (int i = 0; i < list.getLength(); i++) {
    Node n = list.item(i);
    NamedNodeMap m = n.getAttributes();
    if (m != null) {
        Node attrNode = m.getNamedItem("href");
        if (attrNode != null) {
            System.out.println(attrNode.getNodeValue());
        }
    }
}
I searched the Regular Expression Library (http://regexlib.com/Search.aspx?k=href and http://regexlib.com/Search.aspx?k=src)
The best I found was
((?<html>(href|src)\s*=\s*")|(?<css>url\())(?<url>.*?)(?(html)"|\))
Check out these links for more expressions:
http://regexlib.com/REDetails.aspx?regexp_id=2261
http://regexlib.com/REDetails.aspx?regexp_id=758
http://regexlib.com/REDetails.aspx?regexp_id=774
http://regexlib.com/REDetails.aspx?regexp_id=1437
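One caveat for Java users: the expression quoted above relies on a conditional construct, (?(html)"|\)), which java.util.regex does not support as far as I know. A rough Java-friendly sketch of the same idea uses plain alternation with two differently named groups. The class name, method name and group names below are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlExtractor {
    // Either href="..." / src="..." (group "html") or CSS url(...) (group "css").
    private static final Pattern URL = Pattern.compile(
            "(?:href|src)\\s*=\\s*\"(?<html>[^\"]*)\"|url\\((?<css>[^)]*)\\)");

    static List<String> extract(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            // Only one of the two named groups takes part in any given match.
            String html = m.group("html");
            urls.add(html != null ? html : m.group("css"));
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "<a href=\"a.html\"><img src = \"b.png\"></a> "
                + "<div style=\"background:url(c.gif)\">";
        System.out.println(extract(text)); // prints [a.html, b.png, c.gif]
    }
}
```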
Regular expressions can only parse regular languages, that's why they are called regular expressions. HTML is not a regular language, ergo it cannot be parsed by regular expressions.
HTML parsers, on the other hand, can parse HTML, that's why they are called HTML parsers.
You should use your favorite HTML parser instead.
Contrary to popular opinion, regular expressions are useful tools to extract data from unstructured text (which HTML is).
If you are doing complex HTML data extraction (say, find all paragraphs in a page) then HTML parsing is probably the way to go. But if you just need to get some URLs from HREFs, then a regular expression would work fine and it will be very hard to break it.
Try something like this:
/<a[^>]+href=["']?([^'"> ]+)["']?[^>]*>/i
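The expression above is written in the /.../i literal style, so in Java it would need the delimiters removed and the flag passed separately. A minimal sketch of that translation, with invented class and method names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorHrefs {
    // Same expression without the / delimiters; the trailing i becomes CASE_INSENSITIVE.
    private static final Pattern A_HREF = Pattern.compile(
            "<a[^>]+href=[\"']?([^'\"> ]+)[\"']?[^>]*>", Pattern.CASE_INSENSITIVE);

    static List<String> hrefs(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = A_HREF.matcher(html);
        while (m.find()) {
            result.add(m.group(1)); // group 1 is the URL without its quotes
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "foo <a href='link1'>bar</a> baz "
                + "<A class=\"x\" HREF=\"link2\">qux</A> <a href=link3>z</a>";
        System.out.println(hrefs(html)); // prints [link1, link2, link3]
    }
}
```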
