Parsing html with Jsoup and removing spans with certain style


I'm writing an app for a friend, but I ran into a problem: the website contains spans like this:
<span style="display:none">&0000000000000217000000</span>
We have no idea what they are, but I need them removed because my app is outputting their value.
Is there any way to check whether one of these is in the Elements and remove it? I have a for-each loop doing the parsing, but I can't figure out how to remove this element effectively.
Thanks

If you want to remove those spans completely based on the style attribute, try this code:
String html = "<span style=\"display:none\">&0000000000000217000000</span>";
html += "<span style=\"display:none\">&1111111111111111111111111</span>";
html += "<p>Test paragraph should not be removed</p>";
Document doc = Jsoup.parse(html);
doc.select("span[style*=display:none]").remove();
System.out.println(doc);
Here is the output:
<html>
<head></head>
<body>
<p>Test paragraph should not be removed</p>
</body>
</html>

Just try this:
//Assuming you have all the data in a Document called doc:
String cleanData = doc.select("query").text();
The text() method strips all HTML tags and decodes entities into human-readable content. There is also the ownText() method, which returns only the text that belongs directly to the element and not to its children. I can't say which will best fit your purposes.

You can use jsoup to access the inner HTML of the elements, remove the escaped characters, and replace the inner HTML:
Elements elements = doc.select("span");
for (Element e : elements) {
    // html() returns serialized markup, where a literal ampersand appears as &amp;
    e.html(e.html().replaceAll("&amp;", ""));
}
In the above example, select all of the span elements, which are the ones that contain the offending character. Afterwards, replace the escaped ampersand with the empty string or whatever character you wish.
Additionally, you should know that &amp; is the escape code for the & character. Without escaping & characters, you may have HTML validation issues. In your case, without additional information, I'm assuming you really just want to eliminate them. If not, this will help get you started. Good luck!
If you need to remove the trailing numbers:
// eliminate the escaped ampersand and all digits that follow it
e.html(e.html().replaceAll("&amp;[0-9]*", ""));
For more information on regular expressions, see the Javadocs on Regex Pattern.
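As a small sketch of what such a replacement does, here is the same idea applied to plain strings, with no jsoup involved. It assumes the input is serialized markup of the kind Element.html() returns, where a literal ampersand in text shows up as &amp;. The class and method names are made up for illustration.

```java
class AmpersandCleanup {
    // Removes "&amp;" plus any digits right after it. Because [0-9]* also
    // matches zero digits, an "&amp;" with no digits behind it is removed too.
    static String stripAmpersandAndDigits(String innerHtml) {
        return innerHtml.replaceAll("&amp;[0-9]*", "");
    }

    // Stricter variant: only removes "&amp;" when at least one digit follows.
    static String stripNumericRunsOnly(String innerHtml) {
        return innerHtml.replaceAll("&amp;[0-9]+", "");
    }

    public static void main(String[] args) {
        // Brackets make the empty result visible.
        System.out.println("[" + stripAmpersandAndDigits("&amp;0000000000000217000000") + "]");
        System.out.println(stripAmpersandAndDigits("Tom &amp; Jerry"));
        System.out.println(stripNumericRunsOnly("Tom &amp; Jerry"));
    }
}
```

If the spans can also hold legitimate ampersands, the stricter variant is probably the safer starting point.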

Related

Get element from multiple div class with colon in css html

There are two divs with the same class:
<div class="website text:middle"> A</div>
<div class="website text:middle"> 1</div>
How do I get A and 1? I tried using getElementById with :eq(0), and it returns null.

The getElementById method looks up an element by its id, not by its class; I'm not sure what you were trying to query with :eq(0) either.
Try:
// String html = ...
Document doc = Jsoup.parse(html);
List<String> result = doc.getElementsByClass("text:middle").eachText();
// result = ["A", "1"]
EDIT
You can query for elements that match multiple classes! See Jsoup select div having multiple classes.
However, a colon (:) is a special character in CSS and needs to be escaped when it appears as part of a class name in a selector query. I don't think jsoup currently supports this; it simply treats everything after a colon as a pseudo-class.

To add to Janez's correct answer - while jsoup's CSS selector (currently) doesn't support escaping a : character in the class name, there are other ways to get it to work if you want to use the select() method instead of getElementsByXXX -- e.g. if you want to combine selectors in one call:
Elements divs = doc.select("div[class=website text:middle]");
That will find div elements with the literal attribute class="website text:middle".
Or:
Elements divs = doc.select("div[class~=text:middle]");
That finds elements whose class attribute matches the regex /text:middle/.
For the presented data, though, I think the getElementsByClass() DOM method is the way to go and the most general. I just wanted to show a couple of alternatives for other cases.

document.querySelectorAll(".website")[0] // 0 is the index of the match in the result list
You should use querySelector; it is supported by every current browser.
Check the browser support details if you need to target older ones.

How to append string with dynamic data in Java

How to append a string with dynamic data
I have an HTML string and want to add !important to the font-size declaration.
Conditions:
The font size is a dynamic value, such as 36pt, 24pt, or 14pt. It is not constant.
The span tags come in no particular order.
I have tried regular expressions and the replace method, but it's not working. Please help me.
Input String
String htmlData = "<span style=\"font-weight: bold;\">First</span><span style=\"font-size: 36pt;\">Second</span><span style=\"font-family: Arial;\">third</span>";
Output
!important added in font-size element
String htmlData = "<span style=\"font-weight: bold;\">First</span><span style=\"font-size: 36pt;!important\">Second</span><span style=\"font-family: Arial;\">third</span>";

It depends on what the input string ends up being. Your list of 'conditions' is not adequate.
Any HTML, really
Okay, then you need a heck of a lot more information about which font-size style needs to be changed. All of them, anywhere in the document? As in, 'take the input HTML, find any and all spans anywhere with a style attribute, then parse those style attributes to find any font-size CSS keys, and add !important to those'?
Then the answer is very, very tricky. HTML is not regular and thus cannot be parsed with regular expressions. You'd need to add a third party dep like JSoup, use that to parse this HTML and change it.
Furthermore, CSS parsing is not exactly trivial either, so on top of this you'd need a CSS parser.
Really, you need to go back to the drawing board. You have systems in place that ended up in 'I need to parse any HTML for CSS in style attributes and modify those', and the solution to that problem lies further up the chain. Use classes instead of style attributes, or find the place that makes these style attributes and fix it there.
It's always this exact form
This will fail horribly unless your input is exactly like this:
A span with a style attribute and nothing else, with plain text content.
Optionally, any number of those spans, but not nested.
If a style is present, it is given via style=, and every style declaration ends in a semicolon, even though the final semicolon is optional in CSS.
The font size is always specified using the exact spelling font-size.
Well, as long as you take great care to write down someplace that the HTML you input into this algorithm is restricted to remain that simple, you could in theory do this with regular expressions, but be aware that any fancying up of this HTML is going to break this, and the only true answer, capable of dealing with future changes, remains JSoup!
input.replaceAll("font-size\\s*:(.*?);", "font-size:$1 !important;");
will do the job.
NB: If font-size appears as text inside the spans, that's bad. You'd have to extend your regex to also check the surrounding context (e.g. that the match sits inside a quoted style attribute), but quoted strings are hard to parse with regexes too (specifically, trying to dance around backslash-escaped quotes is tricky). This all goes back to: you have found yourself in a nasty place; you really should fix this elsewhere in the chain.
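For illustration, here is a minimal, self-contained sketch of that replaceAll call, run once on restricted input of the kind described above and once on text that merely mentions font-size. The class and method names are hypothetical. Note that it places !important before the semicolon, which is where CSS expects it, rather than after it as in the question's sample output.

```java
class FontSizeImportant {
    // Same pattern as in the answer: capture everything between
    // "font-size:" and the next semicolon, then re-emit it with !important.
    static String addImportant(String html) {
        return html.replaceAll("font-size\\s*:(.*?);", "font-size:$1 !important;");
    }

    public static void main(String[] args) {
        String input = "<span style=\"font-weight: bold;\">First</span>"
                + "<span style=\"font-size: 36pt;\">Second</span>";
        System.out.println(addImportant(input));

        // The pitfall from the NB: plain text content is rewritten as well.
        System.out.println(addImportant("<span>Use font-size: 12pt; here</span>"));
    }
}
```

Running the method twice on the same string appends a second !important, so this sketch is only suitable for a single pass over untouched input.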

You can try using replaceAll:
String html = "<span style=\"font-weight: bold;\">First</span><span style=\"font-size: 36pt;\">Second</span><span style=\"font-family: Arial;\">third</span>";
String replaced = html.replaceAll("font-size: ([0-9]*)pt;", "font-size: $1pt !important;");
System.out.println(replaced);
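The pattern above only matches whole-number pt values with exactly one space after the colon. A slightly more tolerant sketch, assuming the sizes may also be decimals or use a few other units (the unit list here is an assumption, not something from the question), could look like this. The class and method names are hypothetical.

```java
class FontSizeUnits {
    // Number (optionally with a decimal part), then one of a few assumed units,
    // then the semicolon. A declaration that already carries !important does
    // not match, because the unit must be followed directly by the semicolon.
    private static final String FONT_SIZE =
            "font-size\\s*:\\s*([0-9]*\\.?[0-9]+)\\s*(pt|px|em|rem|%)\\s*;";

    static String addImportant(String html) {
        return html.replaceAll(FONT_SIZE, "font-size: $1$2 !important;");
    }

    public static void main(String[] args) {
        System.out.println(addImportant("<span style=\"font-size: 36pt;\">Second</span>"));
        System.out.println(addImportant("<span style=\"font-size:1.5em;\">Second</span>"));
        // Already marked, so this one is printed unchanged.
        System.out.println(addImportant("<span style=\"font-size: 36pt !important;\">Second</span>"));
    }
}
```

Like any regex over HTML, this still rewrites matching text outside of style attributes.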

I think you can build a string such as
"font-size: "+value+"pt;" and, in your HTML string, simply find and replace it with "font-size: "+value+"pt;!important" by using htmlData.replaceFirst(string1, string2);

It is better to parse your string as HTML and set the !important flag there (this approach lets you change anything you want in the HTML and CSS).
But if you only need this specific "!important" change, you can take the easy way with:
input.replaceAll

Regular expression for removing HTML tags from a string

I am looking for a regular expression to remove all HTML tags from a string in JSP.
Example 1
sampleString = "test string <i>in italics</i> continues";
Example 2
sampleString = "test string <i>in italics";
Example 3
sampleString = "test string <i";
The HTML tag might be complete, partial (without a closing tag), or itself malformed (missing the closing angle bracket, as in the 3rd example).
Thanks in advance

Case 3 is not possible with regex or a parser. It might represent legitimate content. So forget it.
As to the concrete question, which covers cases 1 and 2, just use an HTML parser. My favourite is Jsoup.
String text = Jsoup.parse(html).text();
That's it. By the way, it also has an HTML cleaner, if that is what you're actually after.
Since you're using JSP, you could also just use JSTL <c:out> or fn:escapeXml() to prevent user-controlled HTML input from being inlined among your HTML (which would open XSS holes).
<c:out value="${bean.property}" />
<input type="text" name="foo" value="${fn:escapeXml(param.foo)}" />
HTML tags will then not be interpreted, but just displayed as plain text.

<\/?font(\s\w+(\=\".*\")?)*\>
I used this little gem about a week ago to strip a variety of 12-year-old HTML tags, and it worked pretty well. Just replace 'font' with whatever tag you're looking for, or with \w* to get rid of all of them.
Edit: I removed the '?' from the end of my string after realizing that it could remove non-tag data from a file. Basically, this will consistently find cases 1 and 2, but if it is used for case 3 (with the '?' appended to the end of the regex), take care to ensure that what is removed really is a tag.
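One thing to watch for with the pattern above: the .* inside the attribute value is greedy, so when a line contains two font tags with attributes, the match can run from the first tag to the last quote on the line and swallow the text in between. The sketch below shows this on a made-up input and compares it with a tightened variant that uses [^"]* for the attribute value. The tightened pattern is a suggested variation, not the answerer's regex, and the class and method names are hypothetical.

```java
class FontTagStripper {
    // The pattern from the answer, minus the escapes Java does not need.
    static final String GREEDY = "</?font(\\s\\w+(=\".*\")?)*>";
    // Variation: an attribute value may not contain a double quote.
    static final String TIGHT = "</?font(\\s\\w+(=\"[^\"]*\")?)*>";

    static String strip(String input, String regex) {
        return input.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String html = "<font color=\"red\">a</font><font size=\"2\">b</font>";
        System.out.println(strip(html, GREEDY)); // prints "b": the text "a" is lost
        System.out.println(strip(html, TIGHT));  // prints "ab"
    }
}
```

With a single attributed tag per line, both patterns behave the same.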

optimize regex which matches two html tags

((<(\\s*?)(object|OBJECT|EMBED|embed))+(.*?)+((object|OBJECT|EMBED|embed)(\\s*?)>))
I need to get the object and embed tags from some HTML files stored locally on disk. I've come up with the above regex to match the tags in Java, and I then use matcher.group(1); to get the entire tag and its contents.
Can anyone improve this? Is there anything that immediately stands out as something I should change?
It does work, by the way; I just want some input on whether it can be better, because I'm fairly new to regex myself.

Yes, here's the improvement:
Download a full-fledged Java HTML parser like Jsoup and put it on the classpath.
Now you can select all <object> and <embed> elements as follows:
Document document = Jsoup.parse(new File("/path/to/file.html"), "UTF-8");
Elements elements = document.select("object,embed");
for (Element element : elements) {
System.out.println(element.outerHtml());
}
See also:
Regular Expressions - Now you have two problems
Parsing HTML - The Cthulhu way
Pros and cons of HTML parsers in Java

get text between html tags

Possible duplicate: RegEx matching HTML tags and extracting text
I need to get the text between HTML tags such as <p></p> or any other tag. My pattern is this:
Pattern pText = Pattern.compile(">([^>|^<]*?)<");
Does anyone know a better pattern? This one is not very useful. I need it to index the content of a web page.
Thanks

SO is about to descend on you. But let me be the first to say, don't use regular expressions to parse HTML. Here is a list of Java HTML Parsers. Look around until you see an API that suits your fancy and use that instead.

It looks like you are trying to use the | operator inside a negated character class, which neither works nor is needed. Just list the characters that you don't want to match:
Pattern pText = Pattern.compile(">([^<>]*?)<");
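To see what the corrected pattern captures, here is a small sketch that collects every non-blank run of text between a closing > and the next <. The sample markup and the class and method names are made up for illustration, and the usual caveat applies: this is a regex over HTML, so it only suits simple, predictable input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BetweenTags {
    private static final Pattern TEXT = Pattern.compile(">([^<>]*?)<");

    // Returns the trimmed text found between tags, skipping empty matches
    // such as the one between "</p>" and "<div>".
    static List<String> textBetweenTags(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = TEXT.matcher(html);
        while (m.find()) {
            String text = m.group(1).trim();
            if (!text.isEmpty()) {
                result.add(text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(textBetweenTags("<p>Hello</p><div>World</div>")); // [Hello, World]
    }
}
```

Text before the first tag or after the last one is not captured, because the pattern needs a > on the left and a < on the right.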

Don't use regular expressions when parsing HTML.
Use XPath instead (if your HTML is well formed). You can reference text nodes using the text() function very easily.
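A minimal sketch of the XPath route using only the JDK's built-in XML APIs, assuming the markup is well-formed XML (XHTML-style); real-world HTML often is not, in which case the parse step throws. The sample document and the class and method names are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathTextDemo {
    // Evaluates an XPath expression that selects text nodes and returns their values.
    static List<String> textNodes(String xhtml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (Exception e) {
            // Malformed markup or a bad expression ends up here.
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><p>First</p><p>Second <b>bold</b></p></body></html>";
        // Selects only the text that sits directly inside each <p>.
        System.out.println(textNodes(xhtml, "//p/text()"));
    }
}
```

Using //p//text() instead would also include text from nested elements such as the <b> above.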
