How to parse html from javascript variables with Jsoup in Java? - java

I'm using Jsoup to parse an HTML file and pull all the visible text from its elements. The problem is that some HTML fragments sit inside JavaScript variables, and Jsoup ignores those. What would be the best way to get those fragments out?
Example:
<!DOCTYPE html>
<html>
<head>
<script>
var html = "<span>some text</span>";
</script>
</head>
<body>
<p>text</p>
</body>
</html>
In this example Jsoup only picks up the text from the p tag, which is what it's supposed to do. How do I pick up the text from the span inside var html? The solution must work across thousands of different pages, so I can't rely on something like the JavaScript variable always having the same name.

You can use Jsoup to parse all the <script>-tags into DataNode-objects.
DataNode
A data node, for contents of style, script tags etc, where contents should not show in text().
Elements scriptTags = doc.getElementsByTag("script");
This will give you all the Elements of tag <script>.
You can then use the getWholeData() method to extract the contents of each node.
// Get the data contents of this node.
String getWholeData()
for (Element tag : scriptTags) {
    for (DataNode node : tag.dataNodes()) {
        System.out.println(node.getWholeData());
    }
}
Jsoup API - DataNode

I'm not entirely sure about this answer, but I have seen a similar situation before here.
According to that answer, you can probably combine Jsoup with some manual parsing to get the text.
I modified that piece of code for your specific case:
Document doc = ...
Element script = doc.select("script").first(); // Get the script part
Pattern p = Pattern.compile("(?is)html = \"(.+?)\""); // Regex for the value of the html variable
Matcher m = p.matcher(script.html()); // you have to use html() here and NOT text(); text() will drop the 'html' part
while (m.find()) {
    System.out.println(m.group());  // the whole match, including "html = "
    System.out.println(m.group(1)); // value only
}
Hope it will be helpful.
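Since the question rules out relying on a fixed variable name, one possible variation is to scan the script text for any quoted string literal that contains something tag-like. The sketch below uses only java.util.regex; the class name ScriptHtmlFinder and the method findHtmlLiterals are made up for illustration, and it assumes the script text has already been extracted (for example with getWholeData()). It does not decode JavaScript escape sequences and will miss concatenated strings and template literals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup.
final class ScriptHtmlFinder {
    // A quoted JavaScript string literal: "..." or '...', allowing backslash escapes.
    private static final Pattern STRING_LITERAL = Pattern.compile(
            "\"((?:[^\"\\\\]|\\\\.)*)\"|'((?:[^'\\\\]|\\\\.)*)'", Pattern.DOTALL);
    // Something that looks like an opening or closing tag, e.g. <span> or </p>.
    private static final Pattern TAG_LIKE = Pattern.compile("</?[a-zA-Z][^>]*>");

    // Returns the contents of every string literal that appears to contain markup.
    static List<String> findHtmlLiterals(String script) {
        List<String> result = new ArrayList<>();
        Matcher m = STRING_LITERAL.matcher(script);
        while (m.find()) {
            String value = m.group(1) != null ? m.group(1) : m.group(2);
            if (TAG_LIKE.matcher(value).find()) {
                result.add(value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String script = "var html = \"<span>some text</span>\"; var n = 'plain';";
        System.out.println(findHtmlLiterals(script));
    }
}
```

Each fragment returned this way could then be passed to Jsoup.parse(fragment).text() to get its visible text.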

Related

JavaFX how to parse HTML String into HTML Element?

I wrote a method to insert a div with text passed as parameter.
And then I noticed I need to add various HTML content into that div. Current method works on these basic 5 lines of instruction:
//engine is the WebEngine object of some WebView object
Node html = engine.getDocument().getChildNodes().item(0);
Node body = html.getChildNodes().item(1);
Element e = engine.getDocument().createElement("div");
e.setTextContent(msg);
body.appendChild(e);
So here comes my question. Is there a way of parsing some HTML content into an Element object, so I can append that element to the document?
Example HTML String: <b>SomeText</b>
I solved the problem with Javascript! I could append any HTML data with JS.
Example:
engine.executeScript("document.body.innerHTML += '<div><b>SomeText</b></div>' ");
I recently created such a tool, I hope it helps a lot
https://github.com/graycatdeveloper/JavaFXHtmlText

Formatting issue with HTML while using JSoup for Java

I'm trying to scrape "text" off of a website with JSoup. I can get the text cleanly (with no formatting at all, just the text), or with all the formatting still attached (i.e. < br > along with < p > and < /p >).
However, I can't seem to get the formatted version to include < br/ > to any extent, and that's the only thing that was specifically requested to go along with the text.
For example, I can get this:
<p><br>Worldwide database</p>
and this:
Worldwide database
but I can't get this, which is my desired result:
Worldwide database<br/>
I don't see any < br / >'s while looking at the HTML code via the FireBug plugin on Firefox, so I'm wondering if that might be the issue? Or maybe there's an issue with the methods I'm using in my code to pull the text?
Anyways, here's my code:
Elements descriptionHTML = doc.select("div[jsname]"); // <-- Get access to the text w/ JSoup
String descText = descriptionHTML.text(); // <-- Get the text w/o any formatting at all
// This prints out the desired text with the <p><br> and </p>, but no <br/>
for (Element link : descriptionHTML) {
    String jsname = link.attr("jsname");
    if (jsname.equals("C4s9Ed")) {
        System.out.println(link);
        break;
    }
}
I'd really appreciate any help with this issue.
Thanks,
Jack
HTML does not define a closing tag for <br> elements. XHTML however requires that the tag is marked as empty: <br />. JSoup parses both, but will print out only normal HTML (<br>).
If you use the XML parser in Jsoup, the <br> tags are not closed and so Jsoup tries to guess where to place matching closing tags </br> which are neither HTML nor XHTML compliant.
If you want to keep the line break info and strip out all other tags, I think you need to program that part outside of Jsoup. You could, for example, replace all <br> and <br /> strings with a unique placeholder string, say "_brSplitPos_", then parse the document with JSoup, print out the text only, and replace each "_brSplitPos_" with <br />:
String html = "<div>This<br>is<br />a<br>test</div>";
html = html.replaceAll("(?i)<br\\s*/?>", "_brSplitPos_"); // matches <br>, <br/> and <br />
Document docH = Jsoup.parse(html);
String onlyText = docH.text();
onlyText = onlyText.replace("_brSplitPos_", "<br />");
System.out.println(onlyText);
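The placeholder step only works if the pattern catches every spelling of the tag. Here is a small stdlib-only sketch of that step in isolation. The class name BrPlaceholder and the method markBreaks are hypothetical, and the pattern is one reasonable choice rather than the only one; it does not handle <br> tags that carry attributes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for the placeholder step only; Jsoup is not involved here.
final class BrPlaceholder {
    // Matches <br>, <br/>, <br /> and upper-case variants.
    private static final Pattern BR = Pattern.compile("<br\\s*/?>", Pattern.CASE_INSENSITIVE);

    // Replaces every line-break tag with the given placeholder text.
    static String markBreaks(String html, String placeholder) {
        return BR.matcher(html).replaceAll(Matcher.quoteReplacement(placeholder));
    }

    public static void main(String[] args) {
        System.out.println(markBreaks("<div>This<br>is<br />a<br/>test</div>", "_brSplitPos_"));
    }
}
```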

Parse html (potentially not well-formed) without downloading the whole in Java

I'm looking for a way to read and parse a partial HTML file from an InputStream.
Say the input is like this:
<html>
<head>
<meta something="something">
The ideal solution would be to store that tag somewhere as soon as it has been seen and then close the connection. In this case the HTML may not be well formed (since we only got part of it), so XML parsers may fail. Is there a way to do that?
You can use JSoup, which tolerates incomplete markup:
String partialHtml = "<html><head><meta something=\"something\">";
Document document = Jsoup.parse(partialHtml);
Elements values = document.getElementsByAttribute("something");
for (Element el : values) {
System.out.println(el.attr("something"));
}
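The snippet above parses a string that is already in memory. To avoid downloading the whole page, the read itself has to stop early. Below is a minimal stdlib-only sketch of that part, assuming UTF-8 input and a case-sensitive marker such as "</head>". The class name PartialHtmlReader and the method readUntil are hypothetical, and the caller is still responsible for closing the stream or connection afterwards.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; assumes UTF-8 and an exact, case-sensitive marker.
final class PartialHtmlReader {
    // Reads characters until the marker has just been seen or maxChars is reached.
    static String readUntil(InputStream in, String marker, int maxChars) {
        StringBuilder sb = new StringBuilder();
        try {
            // Note: the reader may buffer a little more of the stream than is returned.
            Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
            int c;
            while (sb.length() < maxChars && (c = reader.read()) != -1) {
                sb.append((char) c);
                int start = sb.length() - marker.length();
                if (start >= 0 && sb.indexOf(marker, start) >= 0) {
                    break; // marker is the most recent text read, stop here
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

The returned string can then be handed to Jsoup.parse(...) exactly as in the answer.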

Presence of HTML tags using Jsoup

With Jsoup it is easy to count how many times a particular tag occurs in a text. For example, I am trying to see how many times the anchor tag is present in the given text.
String content = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(content);
Elements links = doc.select("a[href]"); // a with href
System.out.println(links.size());
This gives me a count of 4. If I have a sentence and I want to know if the sentence contains any html tags or not, is it possible with Jsoup? Thank you.
You are possibly better off with a regular expression, but if you really want to use JSoup, you can try to match all elements and then subtract 4, as JSoup automatically adds four elements: first the root element, and then an <html>, a <head> and a <body> element.
This might loosely look like:
// attempt to count html elements in string - incorrect code, see below
public static int countHtmlElements(String content) {
    Document doc = Jsoup.parse(content);
    Elements elements = doc.select("*");
    return elements.size() - 4;
}
However this gives a wrong result if the text contains a <html>, <head> or <body>; compare the results of:
// gives a correct count of 2 html elements
System.out.println(countHtmlElements("some <b>text</b> with <i>markup</i>"));
// incorrectly counts 0 elements, as the body is subtracted
System.out.println(countHtmlElements("<body>this gives a wrong result</body>"));
So to make this work, you would have to check for the "magic" tags separately; that is why I feel a regular expression might be simpler.
More failed attempts to make this work: using parseBodyFragment instead of parse does not help, as the input gets normalized by JSoup in the same way. Likewise, counting with doc.select("body *") saves you the trouble of subtracting 4, but it still yields the wrong count if a <body> is involved. It only works under the limitation that your application can be sure no <html>, <head> or <body> elements are present in the strings to be checked.
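The regular-expression route mentioned above could look roughly like the following. This is a sketch, not a full HTML tokenizer: the class name TagDetector and its methods are hypothetical, the pattern ignores comments and doctype declarations, and it counts tags (opening and closing separately) rather than elements.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a heuristic, not a real HTML parser.
final class TagDetector {
    // Opening, closing or self-closing tag such as <b>, </b>, <br/> or <a href='x'>.
    private static final Pattern TAG =
            Pattern.compile("</?[a-zA-Z][a-zA-Z0-9]*(?:\\s[^<>]*)?/?>");

    // True if the text contains at least one tag-like token.
    static boolean containsTags(String text) {
        return TAG.matcher(text).find();
    }

    // Counts tag-like tokens; <b>x</b> counts as two.
    static int countTags(String text) {
        Matcher m = TAG.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(containsTags("some <b>text</b> with <i>markup</i>"));
        System.out.println(containsTags("<body>this is detected as well</body>"));
        System.out.println(containsTags("a < b and c > d"));
    }
}
```

Because nothing is added by a parser here, a <body> in the input is treated like any other tag.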

Using jsoup to escape disallowed tags

I am evaluating jsoup for the functionality which would sanitize (but not remove!) the non-whitelisted tags. Let's say only <b> tag is allowed, so the following input
foo <b>bar</b> <script onLoad='stealYourCookies();'>baz</script>
has to yield the following:
foo <b>bar</b> &lt;script onLoad='stealYourCookies();'&gt;baz&lt;/script&gt;
I see the following problems/questions with jsoup:
document.getAllElements() always assumes <html>, <head> and <body>. Yes, I can call document.body().getAllElements() but the point is that I don't know if my source is a full HTML document or just the body -- and I want the result in the same shape and form as it came in;
how do I replace <script>...</script> with &lt;script&gt;...&lt;/script&gt;? I only want to replace brackets with escaped entities and do not want to alter any attributes, etc. Node.replaceWith sounds like overkill for this.
Is it possible to completely switch off pretty printing (e.g. insertion of new lines, etc.)?
Or maybe I should use another framework? I have peeked at htmlcleaner so far, but the given examples don't suggest my desired functionality is supported.
Answer 1
How do you load / parse your Document with Jsoup? If you use parse() or connect().get(), jsoup will automatically normalize your HTML (inserting html, body and head tags). This ensures you always have a complete HTML document, even if the input isn't complete.
Assuming you only want to clean the input (no further processing), you should use clean() instead of the methods listed above.
Example 1 - Using parse()
final String html = "<b>a</b>";
System.out.println(Jsoup.parse(html));
Output:
<html>
<head></head>
<body>
<b>a</b>
</body>
</html>
Input html is completed to ensure you have a complete document.
Example 2 - Using clean()
final String html = "<b>a</b>";
System.out.println(Jsoup.clean(html, Whitelist.relaxed()));
Output:
<b>a</b>
The input HTML is cleaned, nothing more.
Documentation:
Jsoup
Answer 2
The method replaceWith() does exactly what you need:
Example:
final String html = "<b><script>your script here</script></b>";
Document doc = Jsoup.parse(html);
for (Element element : doc.select("script")) {
    element.replaceWith(TextNode.createFromEncoded(element.toString(), null));
}
System.out.println(doc);
Output:
<html>
<head></head>
<body>
<b>&lt;script&gt;your script here&lt;/script&gt;</b>
</body>
</html>
Or body only:
System.out.println(doc.body().html());
Output:
<b>&lt;script&gt;your script here&lt;/script&gt;</b>
Documentation:
Node.replaceWith(Node in)
TextNode
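For comparison, the "escape the brackets but keep allowed tags" requirement from the question can also be approximated without a parser. The sketch below is a different technique from replaceWith(): it escapes every angle bracket and then restores only the exact, attribute-free forms of the allowed tags. The class name TagEscaper and the method escapeExcept are hypothetical, allowed tags that carry attributes stay escaped, and input that already contains a literal &lt;b&gt; would be turned back into <b>, so treat this as an illustration rather than a vetted sanitizer.

```java
// Hypothetical helper; plain string replacement, no HTML parsing involved.
final class TagEscaper {
    // Escapes all angle brackets, then restores bare <tag> and </tag> for allowed names.
    static String escapeExcept(String input, String... allowedTags) {
        String escaped = input.replace("<", "&lt;").replace(">", "&gt;");
        for (String tag : allowedTags) {
            escaped = escaped
                    .replace("&lt;" + tag + "&gt;", "<" + tag + ">")
                    .replace("&lt;/" + tag + "&gt;", "</" + tag + ">");
        }
        return escaped;
    }

    public static void main(String[] args) {
        String in = "foo <b>bar</b> <script onLoad='stealYourCookies();'>baz</script>";
        System.out.println(escapeExcept(in, "b"));
    }
}
```

Since no document is built, the output keeps the shape of the input: a fragment stays a fragment and no html, head or body tags are added.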
Answer 3
Yes, the prettyPrint() method of Document.OutputSettings does this.
Example:
final String html = "<p>your html here</p>";
Document doc = Jsoup.parse(html);
doc.outputSettings().prettyPrint(false);
System.out.println(doc);
Note: if the outputSettings() method is not available, please update Jsoup.
Output:
<html><head></head><body><p>your html here</p></body></html>
Documentation:
Document.OutputSettings.prettyPrint(boolean pretty)
Answer 4 (no bullet)
No! Jsoup is one of the best and most capable HTML libraries out there!
