When copying and pasting content from a Word document into a Vaadin7 RichTextArea (or any other rich text field), there are plenty of unwanted HTML tags and attributes. Since the width attribute causes problems in a current project, I'd like to remove it with the following function:
private String cleanUpHTMLcontent(String content) {
LOG.log(Level.INFO, "Cleaning up that rubbish now");
content = content.replaceAll("width=\"[0-9]*\"",""); // this works fine
content = content.replaceAll("width:[0-9]*[\\.|]*[0-9]*pt;",""); // not working
content = content.replaceAll(";width:[0-9]*[\\.|]*[0-9]*pt",""); // not working
content = content.replaceAll("width:[0-9]*[\\.|]*[0-9]*pt",""); // not working
return content;
}
The first line works fine to remove old HTML attributes like width="500"; the other lines go into the style attribute and try to remove properties like width:300.45pt; with different positions of the semicolon.
The code works fine on the test page http://www.regexplanet.com/advanced/java/index.html . I generated my regex strings here, specially for java, but it's still not working. Anyone any idea?
Here an example where it doesn't find the width property
td style="width:453.1pt;border:solid windowtext 1.0pt;
UPDATE
content = content.replaceAll("width:\\s*[.0-9]*pt;",""); // doesn't work
content = content.replaceAll(";width:\\s*[.0-9]*pt",""); // doesn't work
content = content.replaceAll("width:\\s*[.0-9]*pt",""); // works :-)
It appears that I have to escape the semicolon with a backslash as well? I will test that.
To remove any number of digits and dots you can use a character class, [.\d]* or [.0-9]*:
"\\bwidth:\\s*[.0-9]*pt;"
See the regex demo
The \b is a word boundary (makes sure we only match width as a whole word).
Details:
\b - leading word boundary
width: - literal string width:
\s* - 0+ whitespace symbols
[.0-9]* - 0+ dots or digits
pt; - literal pt;
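A minimal sketch of the cleanup as a whole, assuming the pasted markup only uses double-quoted width attributes and pt units inside style. The class name StyleWidthCleaner and the method clean are hypothetical. The lookbehind is an addition that is not part of the answer above: \b alone also matches the width in max-width or border-width, because a hyphen is not a word character.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Vaadin or the original code.
class StyleWidthCleaner {
    // Legacy attribute such as width="500", including the whitespace before it.
    private static final Pattern WIDTH_ATTR = Pattern.compile("\\s+width=\"[0-9]*\"");

    // Inline declaration such as width:453.1pt; with an optional trailing semicolon.
    // The lookbehind skips longer property names like max-width or border-width.
    private static final Pattern WIDTH_STYLE =
            Pattern.compile("(?<![\\w-])width\\s*:\\s*[.0-9]+pt\\s*;?");

    static String clean(String content) {
        String withoutAttr = WIDTH_ATTR.matcher(content).replaceAll("");
        return WIDTH_STYLE.matcher(withoutAttr).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(clean("<td style=\"width:453.1pt;border:solid windowtext 1.0pt;\">"));
    }
}
```

Because the semicolon is optional in the pattern, one expression covers the three positional variants tried in the question. A semicolon has no special meaning in a Java regex, so it does not need a backslash.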
Hi I’m trying to use a replaceAll in java, to delete some html content of image:
This is my input
String html = " asd<i> qwe qwe<u>qweqwe</u></i><u>wqeqwesd.<img alt=\"vechile\" src=\"urldirectionstring\" style=\"float:left; height:190px; width:400px\" /></u>";
So what I’m trying to do is replace the whole <img ...> tag with just this:
"Image Url: urldirectionstring";
Only the img tag should be replaced; everything else should stay untouched. For now I have this, but it's not enough:
String replaceImg = html.replaceAll("<img[^>]*/>","Image Url: "+$srcImgdirection);
So, as you can see, I don’t have an idea how to get the urldirectionstring as variable in the replace.
----------- LAST EDIT -----------
I found this regex to get the urldirectionstring, but now I don’t know how to replace only the tag and add the text:
String replaceImg = html.replaceAll("<img.*src="(.*)"[^>]*/?>","Image Url: "+$srcImgdirection);
You could use:
String replaceImg = html.replaceAll(".*<img.*src=\"(.*?)\".*", "Image Url: $1");
This replaces the entire string, so the output is only Image Url: urldirectionstring. Note that $1 contains only the part of the match inside the parentheses: each pair of parentheses creates a "group" that can be referenced later, and as the regex contains only one pair, that is the first group, referenced with $1.
If you want to replace only the img tag and keep the other tags intact, you could use:
String replaceImg = html.replaceAll("<img.*src=\"(.*?)\"[^>]*/?>", "Image Url: $1");
In this case, the output will be:
asd<i> qwe qwe<u>qweqwe</u></i><u>wqeqwesd.Image Url: urldirectionstring</u>
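A sketch of the same replacement wrapped in a method, assuming every img tag carries a double-quoted src attribute. The class name ImgTagReplacer is hypothetical. It swaps the answer's .* for [^>]*, since a greedy .* can run from the first <img to the last src= when several images sit on one line.

```java
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class ImgTagReplacer {
    // [^>]* keeps the match inside a single tag; group 1 is the src value.
    private static final Pattern IMG =
            Pattern.compile("<img\\b[^>]*?\\bsrc=\"([^\"]*)\"[^>]*>");

    static String replaceImages(String html) {
        // $1 in the replacement refers to the captured src value.
        return IMG.matcher(html).replaceAll("Image Url: $1");
    }

    public static void main(String[] args) {
        String html = "<img src=\"a.png\"><b>x</b><img alt=\"y\" src=\"b.png\"/>";
        System.out.println(replaceImages(html));
    }
}
```

With two images on one line, each tag is replaced separately and the markup between them is kept.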
You may react to this saying that HTML Parsing using regex is a totally bad idea, following this for example, and you are right.
But in my case, the following html node is created by our own server so we know that it will always look like this, and as the regex will be in a mobile android library, I don't want to use a library like Jsoup.
What I want to parse: <img src="myurl.jpg" width="12" height="32">
What should be parsed:
match a regular img tag, and group the src attribute value: <img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>
width and height attribute values: (width|height)\s*=\s*['"]([^'"]*)['"]*
So the first regex will have a #1 group with the img url, and the second regex will have two matches with subgroups of their values.
How can I merge both?
Desired output:
img url
width value
height value
To match any img tag with src, height and width attributes that can come in any order and that are in fact optional, you can use
"(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3"
See the regex demo and an IDEONE Java demo:
String s = "<img height=\"132\" src=\"NEW_myurl.jpg\" width=\"112\"><link src=\"/test/test.css\"/><img src=\"myurl.jpg\" width=\"12\" height=\"32\">";
Pattern pattern = Pattern.compile("(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3");
Matcher matcher = pattern.matcher(s);
while (matcher.find()){
if (!matcher.group(1).isEmpty()) { // We have a new IMG tag
System.out.println("\n--- NEW MATCH ---");
}
System.out.println(matcher.group(2) + ": " + matcher.group(4));
}
The regex details:
(<img\\b|(?!^)\\G) - the initial boundary matching the <img> tag start or the end of the previous successful match
[^>]*? - match any optional attributes we are not interested in (0+ characters other than > so as to stay inside the tag)
\\b(src|width|height)= - a whole word src=, width= or height=
([\"']?) - a technical 3rd group to check the attribute value delimiter
([^>]*?) - Group 4 containing the attribute value (0+ characters other than >, as few as possible, up to the first occurrence of the delimiter)
\\3 - attribute value delimiter matched with the Group 3 (NOTE if a delimiter may be empty, add (?=\\s|/?>) at the end of the pattern)
The logic:
Match the start of img tag
Then, match everything that is inside, but only capture the attributes we need
Since we are going to have multiple matches, not groups, we need to find a boundary for each new img tag. This is done by checking if the first group is not empty (if (!matcher.group(1).isEmpty()))
All there remains to do is to add a list for keeping matches.
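The last step could look roughly like this: a sketch that keeps one map per img tag, assuming quoted attribute values. The class name ImgAttributeCollector and the choice of a list of maps are assumptions, not part of the answer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical collector; one map of attribute name -> value per <img> tag.
class ImgAttributeCollector {
    private static final Pattern ATTR = Pattern.compile(
            "(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3");

    static List<Map<String, String>> collect(String html) {
        List<Map<String, String>> images = new ArrayList<>();
        Matcher matcher = ATTR.matcher(html);
        while (matcher.find()) {
            // A non-empty group 1 means a new <img tag has started.
            if (!matcher.group(1).isEmpty()) {
                images.add(new LinkedHashMap<>());
            }
            // Otherwise the match continues the current tag via \G.
            images.get(images.size() - 1).put(matcher.group(2), matcher.group(4));
        }
        return images;
    }

    public static void main(String[] args) {
        String s = "<img height=\"132\" src=\"NEW_myurl.jpg\" width=\"112\">"
                + "<link src=\"/test/test.css\"/>"
                + "<img src=\"myurl.jpg\" width=\"12\" height=\"32\">";
        System.out.println(collect(s));
    }
}
```

The first successful match always starts with <img, because the \G alternative is only usable after a previous match, so the list is never empty when an attribute is stored.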
If you want to combine both regexes, here is the answer:
<img\s+src="([^"]+)"\s+width="([^"]+)"\s+height="([^"]+)"
sample I tested
<img src="rakesh.jpg" width="25" height="45">
try this
You may want this :
"(?i)(src|width|height)=\"(.*?)\""
Update:
I misunderstood your question, you need something like :
"(?i)<img\\s+src=\"(.*?)\"\\s+width=\"(.*?)\"\\s+height=\"(.*?)\">"
Regex101 Demo
Update 2
The regex below will capture the img tag attributes in any order:
"(?i)(?><img\\s+)src=\"(.*?)\"|width=\"(.*?)\"|height=\"(.*?)\">"
Regex101 Demo v2
I'm parsing an HTML file in Java using regular expressions and I want to know how to match for all href="" elements that do not end in a .htm or .html, and, if it matches, capture the content between quotes into a group
These are the ones I've tried so far:
href\s*[=]\s*"(.+?)(?![.]htm[l]?)"
href\s*[=]\s*"(.*?)(?![.]htm[l]?)"
href\s*[=]\s*"(?![.]htm[l]?)"
I understand that with the first two, the entire string between quotes is being captured into the first group, including the .htm(l) if it is present.
Does anyone know how I can avoid this from happening?
You can just rearrange the expression, and move the negative look-ahead to before the capturing:
href\s*[=]\s*"(?!.+?[.]htm[l]?")(.+?)"
Here is a demo.
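A sketch of how the rearranged expression might be used from Java, assuming double-quoted href values. The class name NonHtmlLinkFinder is hypothetical. It uses [^"]* inside the look-ahead instead of .+?, a small change so the check cannot run past the closing quote into a later attribute that happens to end in .html.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class NonHtmlLinkFinder {
    // The negative look-ahead rejects values ending in .htm or .html
    // before group 1 captures the text between the quotes.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*\"(?![^\"]*\\.html?\")([^\"]+)\"");

    static List<String> find(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(find("<a href=\"a.html\">x</a> <a href=\"doc.pdf\">y</a>"));
    }
}
```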
As a side answer, jsoup is a very good API when dealing with html.
Using jsoup:
Document doc = Jsoup.parse(html);
for(Element link : doc.select("a")) {
String linkHref = link.attr("href");
if(linkHref.endsWith(".htm") || linkHref.endsWith(".html")) {
// do something
}
}
Try this .*\.(?!(htm|html)$)
any character in any number .* followed by a dot \. not followed by htm or html at the end of the input (?! ... )
I have some HTML (String) that I am putting through Jsoup just so I can add something to all href and src attributes, and that works fine. However, I'm noticing that for some special HTML characters, Jsoup is converting them from, say, &ldquo; to the actual character “. I output the value before and after and I see that change.
Before:
THIS — IS A “TEST”. 5 > 4. trademark:
After:
THIS — IS A “TEST”. 5 > 4. trademark: ?
What the heck is going on? I was specifically converting those special characters to their HTML entities before any Jsoup stuff to avoid this. The quotes changed to the actual quote characters, the greater-than stayed the same, and the trademark changed into a question mark. Aaaaaaa.
FYI, my Jsoup code is doing:
Document document = Jsoup.parse(fileHtmlStr);
//some stuff
String modifiedFileHtmlStr = document.html();
Thanks for any help!
The code below produces output similar to the input markup. It changes the escape mode for specific characters and sets the output charset to ASCII so that the TM sign is escaped for systems which don't support Unicode.
The output:
<p>THIS — IS A “TEST”. 5 > 4. trademark: </p>
The code:
Document doc = Jsoup.parse("" +
"<p>THIS — IS A “TEST”. 5 > 4. trademark: </p>");
Document.OutputSettings settings = doc.outputSettings();
settings.prettyPrint(false);
settings.escapeMode(Entities.EscapeMode.extended);
settings.charset("ASCII");
String modifiedFileHtmlStr = doc.html();
System.out.println(modifiedFileHtmlStr);
I'm trying to extract the text within the title elements and ignore everything else.
I've looked at these articles, but they don't seem to help :\
Regular expression to extract text between square brackets
String Pattern Matching In Java
Java Regex to get the text from HTML anchor (<a>...</a>) tags
The main problem is I am not able to understand what the responders are saying while trying to hack up my own code.
Here is what I've managed from reading the Java API in the Pattern article.
<title>(.*?)</title>
Here's my code to return the title.
String title = null;
Matcher match = Pattern.compile("[<title>](.*?)[</title>]").matcher(this.webPage);
try{
title = match.group();
}
catch(IllegalStateException e)
{
e.printStackTrace();
}
I am getting the IllegalStateException, which says this:
java.lang.IllegalStateException: No match found
at java.util.regex.Matcher.group(Matcher.java:485)
at java.util.regex.Matcher.group(Matcher.java:445)
at BrowserModal.getWebPageTitle(BrowserModal.java:21)
at BrowserTest.main(BrowserTest.java:7)
Line 21 would be "title = match.group();"
The question "What are the pros and cons of the leading Java HTML parsers?" lists a bunch of HTML parsers. Parse your HTML to a DOM, then use getElementsByTagName("title") to get the title elements, and grab the text content by looking at their children, which should be text nodes.
title = match.group();
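This throws IllegalStateException because no match has been attempted yet: call find() (or matches()) on the Matcher before group(). Note also that group() returns the entire matched text; group(1) will return just the content of the first parenthetical group.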
[<title>](.*?)[</title>]
The square brackets are just breaking it. [<title>] will match any single character that is an angle bracket or a letter in the word "title".
<title>(.*?)</title>
is better, but will only match a title that is on one line (since . does not, by default, match newlines), and will not match minor variations like
<title lang=en>Foo</title>
It will also fail to find the title correctly in HTML like
<html>
<head>
<!-- <title>Old commented out title</title> -->
<title>Spiffy new title</title>
Try this:-
String title = null;
String subjectString = "<title>TextWithinTags</title>";
Pattern titleFinder = Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
Matcher regexMatcher = titleFinder.matcher(subjectString);
while (regexMatcher.find()) {
title = regexMatcher.group(1);
}
Edit:- Regex explained:-
[^>]* :- Anything but > is acceptable there. This is used as we can have attributes in the tags.
(.*?) :- Dot represents any character. By default it excludes newline characters, but the Pattern.DOTALL flag used above lets it match them too. *? represents repeat any number of times, but as few as possible.
For more details on regex, check this out.
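Put together as a method, the approach might look like the sketch below. The class name TitleExtractor is hypothetical, and returning null when there is no title is an assumption about what the caller wants. Unlike the loop above, it stops at the first match, and like any regex approach it will still pick up a commented-out title.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class TitleExtractor {
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the text of the first <title> element, or null if none is found.
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        // find() must succeed before group(1) may be called.
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        System.out.println(extractTitle("<html><head><title lang=en>Foo</title></head></html>"));
    }
}
```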
This gets the title in just one line of java code:
String title = html.replaceAll("(?s).*<title>(.*)</title>.*", "$1");
This regex assumes the HTML is "simple". With the "DOTALL" switch (?s), which makes dots also match newline characters, it works with multi-line input and even multi-line titles. If the input has no <title> element, nothing matches and the input string is returned unchanged.