Replace Regex Variable - java

Hi, I'm trying to use replaceAll in Java to replace the HTML of an image tag.
This is my input:
String html = " asd<i> qwe qwe<u>qweqwe</u></i><u>wqeqwesd.<img alt=\"vechile\" src=\"urldirectionstring\" style=\"float:left; height:190px; width:400px\" /></u>";
What I'm trying to do is replace the whole <img ...> tag with this:
"Image Url: urldirectionstring";
Only the img tag should be replaced; everything else should stay untouched. This is what I have so far, but it's not enough:
String replaceImg = html.replaceAll("<img[^>]*/>","Image Url: "+$srcImgdirection);
As you can see, I don't know how to get the urldirectionstring as a variable in the replacement.
----------- LAST EDIT -----------
I found this regex to capture the urldirectionstring, but now I don't know how to replace only the tag and add the text:
String replaceImg = html.replaceAll("<img.*src=\"(.*)\"[^>]*/?>","Image Url: "+$srcImgdirection);

You could use:
String replaceImg = html.replaceAll(".*<img.*src=\"(.*?)\".*", "Image Url: $1");
This replaces the entire string, so the output is only Image Url: urldirectionstring. Note that $1 refers to the part of the match inside the parentheses: each pair of parentheses creates a "group" that can be referenced later, and since the regex contains only one pair, that is the first group, which you reference with $1.
If you want to replace only the img tag and keep the other tags intact, you could use:
String replaceImg = html.replaceAll("<img.*src=\"(.*?)\"[^>]*/?>", "Image Url: $1");
In this case, the output will be:
asd<i> qwe qwe<u>qweqwe</u></i><u>wqeqwesd.Image Url: urldirectionstring</u>
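A minimal sketch of the second variant, in case the input can contain more than one image. It assumes double-quoted src values and swaps the greedy .* for [^>]*, because with .* two img tags on the same line would be swallowed by a single match. The class and method names (ImgTagReplacer, replaceImages) are made up for illustration.

```java
import java.util.regex.Pattern;

class ImgTagReplacer {
    // Hypothetical helper, not from the original answer.
    // [^>]* cannot cross a '>' character, so each match stays inside one tag.
    private static final Pattern IMG_TAG =
            Pattern.compile("<img\\b[^>]*?\\bsrc=\"([^\"]*)\"[^>]*>");

    static String replaceImages(String html) {
        // $1 is the src value captured by the only group in IMG_TAG.
        return IMG_TAG.matcher(html).replaceAll("Image Url: $1");
    }

    public static void main(String[] args) {
        String html = " asd<i> qwe qwe<u>qweqwe</u></i><u>wqeqwesd."
                + "<img alt=\"vechile\" src=\"urldirectionstring\" "
                + "style=\"float:left; height:190px; width:400px\" /></u>";
        System.out.println(replaceImages(html));
    }
}
```

Compiling the pattern once also avoids recompiling it on every call, which replaceAll on a String would do.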

Related

Issue: Jsoup to parse string with < followed by word

I'm using Jsoup to parse a string that contains a substring starting with < followed by a word, but I'm not getting the text back correctly:
String input ="<p>testing with less than <string</p>";
String s = Jsoup.parse(input).text();
After extracting the text, the result is "testing with less than" instead of "testing with less than <string".
String input = "<p>testing with less than <string</p>";
System.out.println(input);
Output:
<p>testing with less than <string</p>
If we print the input we will get the entire string as shown.
String s1 = Jsoup.parse(input).text();
System.out.println(s1);// when we use method text()
Output:
testing with less than
If we use the jsoup text() method, we get the plain text without HTML tags.
But we still do not get the entire input string, because of the "<" character.
The reason is shown in the following example.
String s2 = Jsoup.parse(input).html();
System.out.println(s2);// when we use method html()
Output:
<html>
<head></head>
<body>
<p>testing with less than
<string></string> //the end tag is auto generated by the method
</p>
</body>
</html>
If we use the jsoup html() method, we get the entire formatted HTML code.
Here we can clearly see that the word written after the "<" character, inside another HTML tag, is automatically converted into an HTML tag
(if we only write a start tag, the end tag is created automatically, whether it is valid or not).
This is the reason we do not get the entire input in the first example.

Regex <img > Tag parsing with src, width, height

You may react to this by saying that parsing HTML with regex is a totally bad idea, following this for example, and you are right.
But in my case, the following HTML node is created by our own server, so we know that it will always look like this, and as the regex will be in a mobile Android library, I don't want to use a library like Jsoup.
What I want to parse: <img src="myurl.jpg" width="12" height="32">
What should be parsed:
match a regular img tag, and group the src attribute value: <img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>
width and height attribute values: (width|height)\s*=\s*['"]([^'"]*)['"]*
So the first regex will have a #1 group with the img url, and the second regex will have two matches with subgroups of their values.
How can I merge both?
Desired output:
img url
width value
height value
To match any img tag with src, height and width attributes that can come in any order and that are in fact optional, you can use
"(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3"
See the regex demo and an IDEONE Java demo:
String s = "<img height=\"132\" src=\"NEW_myurl.jpg\" width=\"112\"><link src=\"/test/test.css\"/><img src=\"myurl.jpg\" width=\"12\" height=\"32\">";
Pattern pattern = Pattern.compile("(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    if (!matcher.group(1).isEmpty()) { // We have a new IMG tag
        System.out.println("\n--- NEW MATCH ---");
    }
    System.out.println(matcher.group(2) + ": " + matcher.group(4));
}
The regex details:
(<img\\b|(?!^)\\G) - the initial boundary matching the <img> tag start or the end of the previous successful match
[^>]*? - match any optional attributes we are not interested in (0+ characters other than > so as to stay inside the tag)
\\b(src|width|height)= - a whole word src=, width= or height=
([\"']?) - a technical 3rd group to check the attribute value delimiter
([^>]*?) - Group 4 containing the attribute value (0+ characters other than a >, as few as possible, up to the first occurrence of the delimiter captured in Group 3)
\\3 - attribute value delimiter matched with the Group 3 (NOTE if a delimiter may be empty, add (?=\\s|/?>) at the end of the pattern)
The logic:
Match the start of img tag
Then, match everything that is inside, but only capture the attributes we need
Since we are going to have multiple matches, not groups, we need to find a boundary for each new img tag. This is done by checking if the first group is not empty (if (!matcher.group(1).isEmpty()))
All that remains is to add a list for keeping the matches.
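That last step could look roughly like the sketch below. It reuses the regex from above and stores one map per img tag. The names ImgAttributeCollector and collect are hypothetical, and the choice of a List of LinkedHashMaps is just one convenient way to keep the matches.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgAttributeCollector {
    // Same pattern as in the answer; the helper around it is illustrative only.
    private static final Pattern IMG_ATTR = Pattern.compile(
            "(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3");

    static List<Map<String, String>> collect(String html) {
        List<Map<String, String>> tags = new ArrayList<>();
        Map<String, String> current = null;
        Matcher m = IMG_ATTR.matcher(html);
        while (m.find()) {
            // A non-empty group 1 means the match started at a new <img tag.
            if (current == null || !m.group(1).isEmpty()) {
                current = new LinkedHashMap<>();
                tags.add(current);
            }
            current.put(m.group(2), m.group(4)); // attribute name -> value
        }
        return tags;
    }

    public static void main(String[] args) {
        String s = "<img height=\"132\" src=\"NEW_myurl.jpg\" width=\"112\">"
                + "<link src=\"/test/test.css\"/>"
                + "<img src=\"myurl.jpg\" width=\"12\" height=\"32\">";
        for (Map<String, String> tag : collect(s)) {
            System.out.println(tag);
        }
    }
}
```

Each map holds only the attributes that were actually present, so a tag without width or height simply has fewer entries.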
If you want to combine both things, here is the answer:
<img\s+src="([^"]+)"\s+width="([^"]+)"\s+height="([^"]+)"
The sample I tested:
<img src="rakesh.jpg" width="25" height="45">
Try this.
You may want this:
"(?i)(src|width|height)=\"(.*?)\""
Update:
I misunderstood your question; you need something like:
"(?i)<img\\s+src=\"(.*?)\"\\s+width=\"(.*?)\"\\s+height=\"(.*?)\">"
Regex101 Demo
Update 2
The regex below will capture the img tag attributes in any order:
"(?i)(?><img\\s+)src=\"(.*?)\"|width=\"(.*?)\"|height=\"(.*?)\">"
Regex101 Demo v2

Unable to parse string with a dot in Java with REGEX

When copying and pasting content from a Word document into a Vaadin7 RichTextArea (or any other rich text field), there are plenty of unwanted HTML tags and attributes. Since the width attribute does some funny business in a current project, I'd like to remove them with the following function:
private String cleanUpHTMLcontent(String content) {
    LOG.log(Level.INFO, "Cleaning up that rubbish now");
    content = content.replaceAll("width=\"[0-9]*\"",""); // this works fine
    content = content.replaceAll("width:[0-9]*[\\.|]*[0-9]*pt;",""); // not working
    content = content.replaceAll(";width:[0-9]*[\\.|]*[0-9]*pt",""); // not working
    content = content.replaceAll("width:[0-9]*[\\.|]*[0-9]*pt",""); // not working
    return content;
}
The first line works fine for removing old HTML attributes like width="500"; the other lines go into the style attribute and try to remove properties like width:300.45pt; with the semicolon in different positions.
The code works fine on the test page http://www.regexplanet.com/advanced/java/index.html . I generated my regex strings there, specifically for Java, but it's still not working. Does anyone have an idea?
Here is an example where it doesn't find the width property:
td style="width:453.1pt;border:solid windowtext 1.0pt;
UPDATE
content = content.replaceAll("width:\\s*[.0-9]*pt;",""); // doesn't work
content = content.replaceAll(";width:\\s*[.0-9]*pt",""); // doesn't work
content = content.replaceAll("width:\\s*[.0-9]*pt",""); // works :-)
It appears that I have to escape the semicolon with a backslash as well? I will test that.
To remove any number of digits and dots, you can use a character class, [.\d]* or [.0-9]*:
"\\bwidth:\\s*[.0-9]*pt;"
See the regex demo
The \b is a word boundary (makes sure we only match width as a whole word).
Details:
\b - leading word boundary
width: - literal string width:
\s* - 0+ whitespace symbols
[.0-9]* - 0+ dots or digits
pt; - literal pt;
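A small sketch of how this could be wired into a clean-up helper. One change from the pattern above is my own assumption: the trailing semicolon is made optional (;?) so that a width declaration at the very end of a style attribute is removed too. The names WidthStripper and stripCssWidth are hypothetical.

```java
import java.util.regex.Pattern;

class WidthStripper {
    // Pattern from the answer, with an optional ';' added as an assumption.
    private static final Pattern CSS_WIDTH =
            Pattern.compile("\\bwidth:\\s*[.0-9]*pt;?");

    static String stripCssWidth(String content) {
        return CSS_WIDTH.matcher(content).replaceAll("");
    }

    public static void main(String[] args) {
        String cell = "td style=\"width:453.1pt;border:solid windowtext 1.0pt;";
        System.out.println(stripCssWidth(cell));
    }
}
```

Keep in mind that a hyphen is not a word character, so \b also matches before the width in properties such as border-width or max-width; if those can occur in the pasted content, the pattern needs a stricter prefix.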

Delete tabulation character from text retrieved with Jsoup

I'm parsing an HTML file using Jsoup. When getting the text of an h1, it also retrieves tabs and newlines.
'Name' is what I'm trying to retrieve from here:
<h1>\n\t\t\tNAME\n\t\t</h1>
I'm trying to get rid of these characters this way:
String name = document.select( "div header > h1" ).first().ownText().replaceAll( "[^a-zA-Z]+", "" ).trim().toUpperCase();
But this is the result:
NTTTTNAMETNTTT
How can I get the text without all the tab and newline characters?
It seems that the HTML really contains the strings "\t" and "\n" literally. In that case you should probably clean the source prior to parsing. Something like this should do:
String html = Jsoup.connect(URL).userAgent("Mozilla/5.0").execute().body();
html = html.replaceAll("\\\\[nt]", "");
Document doc = Jsoup.parse(html);
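To see what that replaceAll does without involving Jsoup, here is a standalone sketch. It assumes, as above, that the backslash sequences are literal two-character strings in the source; the class and method names are made up for illustration.

```java
class LiteralEscapeCleaner {
    // Hypothetical helper: removes a literal backslash followed by 'n' or 't'.
    // In Java source, "\\\\[nt]" is the regex \\[nt], i.e. one real backslash
    // followed by one of the letters n or t.
    static String stripLiteralEscapes(String html) {
        return html.replaceAll("\\\\[nt]", "");
    }

    public static void main(String[] args) {
        // Each "\\n" here is two characters: a backslash and the letter n.
        String raw = "<h1>\\n\\t\\t\\tNAME\\n\\t\\t</h1>";
        System.out.println(stripLiteralEscapes(raw));
    }
}
```

This also explains the stray letters in the question's result: [^a-zA-Z]+ removes the backslashes but keeps the n and t letters that follow them, and toUpperCase() then turns them into N and T. Real newline and tab characters are left alone by this pattern.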

Cannot figure out regex issue

I'm trying to extract the text within the title elements and ignore everything else.
I've looked at these articles, but they don't seem to help :\
Regular expression to extract text between square brackets
String Pattern Matching In Java
Java Regex to get the text from HTML anchor (<a>...</a>) tags
The main problem is I am not able to understand what the responders are saying while trying to hack up my own code.
Here is what I've managed from reading the Java API in the Pattern article.
<title>(.*?)</title>
Here's my code to return the title.
String title = null;
Matcher match = Pattern.compile("[<title>](.*?)[</title>]").matcher(this.webPage);
try {
    title = match.group();
} catch (IllegalStateException e) {
    e.printStackTrace();
}
I am getting the IllegalStateException, which says this:
java.lang.IllegalStateException: No match found
at java.util.regex.Matcher.group(Matcher.java:485)
at java.util.regex.Matcher.group(Matcher.java:445)
at BrowserModal.getWebPageTitle(BrowserModal.java:21)
at BrowserTest.main(BrowserTest.java:7)
Line 21 would be "title = match.group();"
The question "What are the pros and cons of the leading Java HTML parsers?" lists a bunch of HTML parsers. Parse your HTML to a DOM, then use getElementsByTagName("title") to get the title elements, and grab the text content by looking at their children, which should be text nodes.
title = match.group();
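This is failing because no match has been attempted yet: group() can only be called after find() or matches() has returned true. Also note that group() returns the entire matched text; group(1) will return just the content of the first parenthetical group.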
[<title>](.*?)[</title>]
The square brackets are just breaking it. [<title>] will match any single character that is an angle bracket or a letter in the word "title".
<title>(.*?)</title>
is better, but will only match a title that is on one line (since . does not, by default, match newlines), and will not match minor variations like
<title lang=en>Foo</title>
It will also fail to find the title correctly in HTML like
<html>
<head>
<!-- <title>Old commented out title</title> -->
<title>Spiffy new title</title>
Try this:-
String title = null;
String subjectString = "<title>TextWithinTags</title>";
Pattern titleFinder = Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
Matcher regexMatcher = titleFinder.matcher(subjectString);
while (regexMatcher.find()) {
title = regexMatcher.group(1);
}
Edit:- Regex explained:-
[^>]* :- Anything but > is acceptable there. This is used as we can have attributes in the tags.
(.*?) :- By default a dot matches any character other than a line terminator; with Pattern.DOTALL, as used above, it matches those too. *? means repeat any number of times, but as few as possible.
For more details on regex, check this out.
This gets the title in just one line of java code:
String title = html.replaceAll("(?s).*<title>(.*)</title>.*", "$1");
This regex assumes the HTML is "simple", and with the "DOTALL" switch (?s) (which means dots also match new-line chars), it will work with multi-line input, and even multi-line titles.
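Pulling the points from the answers above together, a minimal sketch could look like this. It calls find() before group(1), which is the step missing from the code in the question, and returns an empty Optional when there is no title, whereas the replaceAll one-liner returns the whole input unchanged in that case. TitleExtractor and extractTitle are hypothetical names, and the usual caveats about regex on HTML (commented-out titles and so on) still apply.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleExtractor {
    // DOTALL lets '.' span line breaks; CASE_INSENSITIVE accepts <TITLE> too.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    static Optional<String> extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        // group(1) is only valid after find() has returned true.
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String page = "<html><head><title lang=en>\nFoo\n</title></head></html>";
        System.out.println(extractTitle(page).orElse("(no title)"));
    }
}
```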
