Does the document tree returned by JSoup when it parses an HTML document support getComputedStyle on the individual document elements?
What I would like to do is inline the CSS in an HTML fragment so that I can insert the fragment into a larger HTML document, with all of its formatting preserved but without messing with any other formatting in the document.
The research I've done would seem to suggest that I can accomplish this by iterating through all of the elements in the document, calling getComputedStyle on each one, and assigning the result to be the style for the element.
Yes, I realize that this may very well bloat the resulting HTML by putting a bunch of redundant / unnecessary style information on the individual elements, but I'm willing to pay the price of larger HTML, and as far as I can tell, embedding the style inline like this is the only way to preserve the formatting exactly while also making the HTML fragments fully portable. (If you've got another suggestion for accomplishing that purpose, I'm all ears. :-)
Getting back on topic... If I can't use getComputedStyle (or the equivalent) with JSoup, is there another Java HTML+CSS parser that supports getComputedStyle or the equivalent?
Thanks.
That's not possible. Jsoup is just an HTML parser with CSS selector support; it is not an HTML renderer.
You may want to take a look at Lobobrowser, which is a Java-based HTML renderer supporting JavaScript and the like. I do not know, nor can I guarantee, that getComputedStyle() is supported by Lobo.
No other tools come to mind. HtmlUnit comes close, as it can also access and invoke JavaScript, but some Google results suggest that getComputedStyle() doesn't work in HtmlUnit either. After all, it is not a real HTML renderer either.
Related
I want to know whether it is possible to retrieve HTML tags and plain text such as
<p>This is text </p> or <div> or This is text
by using XmlPullParser. I read here that it is not recommended. Is there an alternative, or some simple code that lets you retrieve HTML and plain text as described above? I'm still a beginner in Android. Thank you for your help.
I think your best option (which I have also used) is JSOUP.
JSOUP provides a very convenient API for extracting and manipulating data, using DOM, CSS, and jQuery-like methods. It lets you scrape and parse HTML from a URL, a file, or a string, among other things.
jSoup: https://jsoup.org/
Here is a nice tutorial (not mine):
http://www.androidbegin.com/tutorial/android-basic-jsoup-tutorial/
JSOUP is a great parser and is one of the most commonly used ones.
Another thing that might be helpful is an HTML organizer. A common problem when writing parsers is errors caused by malformed HTML files. This happens more often than you might expect, so an HTML organizer can reduce the number of errors.
A good organizer I have used is Tidy.
We allow users to create rich content using TinyMCE and this includes Javascript and CSS.
However, when the content reaches the server (Java), we want to filter out all XSS or otherwise potentially malicious code: things like document.cookie, eval, etc., whether they appear in CSS, in inline JS, or in JavaScript assembled from strings (e.g. via document.write). Everything else, e.g. changing a color on mouseover or setting a gradient in CSS, is fine.
We want to give our users flexibility, but at the same time we want to keep them secure. We looked into libraries like HTML Purifier and jSoup, but they do not seem smart enough to distinguish potentially malicious JS from safe JS.
We are wondering if there is any way to do it?
Thank you.
Have you looked at Google Caja? It is a compiler for third-party JavaScript that allows it to be safely embedded in another site:
https://developers.google.com/caja/
It sounds like what you are looking for.
You can use JSoup for this job. JSoup has an XSS cleaner that works on a whitelist object (a list of permitted tags). The jsoup whitelist sanitizer works by parsing the input HTML, and then iterating through the parse tree and only allowing known-safe tags and attributes (and values) through into the cleaned output. It does not use regular expressions, which are inappropriate for this task.
jsoup provides a range of Whitelist configurations to suit most requirements; they can be modified if necessary. Read this link for more details [http://jsoup.org/cookbook/cleaning-html/whitelist-sanitizer].
Is this possible? I am using JavaScript, AJAX and JSON to pull data from a Java servlet, and I'd like to display elements of the JavaScript array I create within my HTML (as opposed to creating large chunks of HTML within JavaScript). I know I can muck up my HTML using:
<script>document.write(arrayVar[0].firstName);</script>
But I'd really like to avoid that. In the past I would use JSTL and EL tags when pulling data from the server, but is there a similar way to do this purely with JavaScript? I wouldn't be opposed to using an external library; I just don't know of any because I don't have a ton of experience with JS.
You could use any JavaScript templating library, such as http://mustache.github.com/, the templates that come with underscore.js, or jQuery templates (although I understand that these are deprecated now). JavaScript templates are probably as magical as you will get for this.
I do try to avoid document.write wherever possible as it can mess up pages when this call is returned in HTML through AJAX.
Unfortunately, to answer your question, probably not - at least not in a way better than you're currently doing. document.write can be avoided by instead using the DOM methods, such as:
document.getElementById('test').innerHTML = arrayVar[0].firstName;
With pure JavaScript, I don't see any other way of doing this that would be similar to EL syntax.
Saxon-CE (currently an Alpha release) is an XSLT 2 processor with extensions for browser interaction - it's implemented in JavaScript and deployed in the same way as a JS library.
With this, you could write a template to iterate the JavaScript array (by calling a JS function) and insert each value into an HTML literal-result element within the template. The resulting HTML can then be appended to, or replace, any part of the HTML DOM.
The template can either run on page load, or be linked to any DOM event.
What is a fast and simple way to validate HTML from Java? I’m looking for an open-source/PD class (or set of classes) that describes the various properties of the 100-odd HTML tags, such as:
Is the tag optional? Empty? Is it legal to omit its closing tag?
Which other tags can this tag contain (if any)?
Which attributes are legal for this tag, and what are their types? (not required, but nice to have)
Thanks!
EDIT
I'm looking to do a tag-by-tag analysis of an HTML document, so I'm less interested in whether the document as a whole is valid than in what the specific requirements are for each type of tag.
I could encode the rules based on the W3C spec, but wanted to see which ready-made solutions are available first.
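To illustrate what hand-encoding a few of those rules could look like, here is a minimal sketch. The class name HtmlTagRules and its methods are hypothetical. The tag lists follow the current HTML standard, which is an assumption, since the question does not say which HTML version is targeted. It covers only three per-tag properties and ignores the context conditions under which a tag may actually be omitted.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical lookup table for a few per-tag properties (not a validator).
public class HtmlTagRules {
    // Void ("empty") elements: no content and no end tag.
    private static final Set<String> VOID = new HashSet<>(Arrays.asList(
        "area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"));

    // Elements whose end tag may be omitted in some contexts.
    private static final Set<String> OPTIONAL_END = new HashSet<>(Arrays.asList(
        "html", "head", "body", "li", "dt", "dd", "p", "rt", "rp",
        "optgroup", "option", "colgroup", "caption",
        "thead", "tbody", "tfoot", "tr", "td", "th"));

    // Elements whose start tag may be omitted in some contexts.
    private static final Set<String> OPTIONAL_START = new HashSet<>(Arrays.asList(
        "html", "head", "body", "colgroup", "tbody"));

    private static String norm(String tag) {
        return tag.toLowerCase(Locale.ROOT);
    }

    public static boolean isVoid(String tag) {
        return VOID.contains(norm(tag));
    }

    public static boolean endTagOptional(String tag) {
        return OPTIONAL_END.contains(norm(tag));
    }

    public static boolean startTagOptional(String tag) {
        return OPTIONAL_START.contains(norm(tag));
    }
}
```

A caller would ask, for example, HtmlTagRules.isVoid("br") or HtmlTagRules.endTagOptional("li") while walking the tags of a document. Content models and attribute types would need additional tables.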
If you want to verify that certain tags follow certain specifications, there seems to be no end of Java-based HTML parsers:
Open Source HTML Parsers in Java
In other words, you could parse your HTML, and then inspect the resulting document for the tags you were looking for and determine if they meet the specifications you require. If they don't, you could then just throw an error.
I don't think you'll find an HTML analysis tool which was written with exactly your requirements in mind, mostly because those requirements haven't been voiced and are probably a bit nebulous.
If the parser doesn't do what you want out of the box, at least the parsers on this list are open source, so you can modify one yourself, subject to the terms of its license.
Check JTidy (http://jtidy.sourceforge.net/) and VietSpider HTMLParser (http://sourceforge.net/projects/binhgiang/). Both are Java HTML parsers with some syntax-checking capabilities. Some Eclipse-based HTML editor plugins use JTidy (or a port of Tidy) for syntax checking. Or, as David said, submit the page to w3c.org.
I want to write a Java tool to assess HTML pages of an existing site and if any image has no alt attribute, the tool will insert alt="" to that image. One approach is using an HTML parser (like HtmlCleaner) to generate the DOM then adding the alt attribute to the images in the DOM before writing back the HTML.
However, this approach won't keep the original HTML intact and will probably cause some unpredictable side effects, especially since the number of existing HTML pages is huge and there is no guarantee that they are well-formed.
Is there a safer way to accomplish this (i.e., one that keeps the original HTML intact and only adds the alt attribute)?
Short of writing some horrible mess of regexp or other string manipulation code, I don't believe that there is another way of doing this.
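For reference, the string-manipulation route could be sketched roughly as follows. This is a minimal sketch, not a robust solution: the class name AltInserter is hypothetical, and the regular expressions assume that no attribute value contains a ">" character and that img tags do not occur inside comments or scripts. Everything outside the matched img tags is copied through unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: adds alt="" to <img> tags that have no alt attribute.
public class AltInserter {
    // Group 1: the tag up to (not including) its closing ">" or "/>".
    // Group 2: optional whitespace plus the closing ">" or "/>".
    private static final Pattern IMG_TAG =
        Pattern.compile("(<img\\b[^>]*?)(\\s*/?>)", Pattern.CASE_INSENSITIVE);

    // Heuristic test for an existing alt attribute inside the tag text.
    // It can also match " alt " inside another attribute's value; in that
    // case the tag is simply left unmodified.
    private static final Pattern HAS_ALT =
        Pattern.compile("\\salt(\\s*=|\\s|$)", Pattern.CASE_INSENSITIVE);

    public static String addEmptyAlt(String html) {
        Matcher m = IMG_TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String head = m.group(1);
            String tail = m.group(2);
            // Leave tags that already carry an alt attribute untouched.
            String replacement = HAS_ALT.matcher(head).find()
                ? m.group()
                : head + " alt=\"\"" + tail;
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(addEmptyAlt("<p>Logo: <img src=\"logo.png\"></p>"));
    }
}
```

Because the input is never parsed into a DOM and re-serialized, malformed markup elsewhere in the page passes through byte for byte. The price is exactly the fragility described above.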
I do question why you want to do this. The only reason I can imagine is to pass some sort of automatic validation, but the reason for requiring alt attributes is a matter of usability. Adding empty alt attributes does not help that in any way. You are just hiding the problem.
Instead, I'd suggest writing a bit of JavaScript that throws a red border around any image missing an alt attribute, and making the front-end designers add a meaningful alt attribute to every image thus flagged.
It's kind of pointless to add empty alt attributes to your layout. I second Kris in that it defeats the purpose of having alt attributes in the first place, and I agree with David Dorward's comment.
But, if there is some ulterior motive here, you could do it after the fact in the browser with JavaScript (or, preferably, jQuery). The client's browser certainly won't be able to change the original HTML and is smart enough to parse through it even if it's not perfectly well-formed.
Using jQuery, place this script in the head section of your page:
<script type="text/javascript">
// Once the DOM is ready, give every <img> that lacks an alt attribute an empty one.
$(function() {
    $('img:not([alt])').attr('alt', '');
});
</script>
And make sure you include the jQuery library.
I've used the Jericho HTML Parser library in the past with success for parsing HTML. It's supposed to work well with poorly formed HTML. This would alter the original HTML though.