jsoup to w3c-document: INVALID_CHARACTER_ERR - java

My use case: fetch HTML pages with jsoup and return a w3c DOM for further processing by XML transformations:
...
org.jsoup.nodes.Document document = connection.get();
org.w3c.dom.Document dom = new W3CDom().fromJsoup(document);
...
This works well for most documents, but for some it throws INVALID_CHARACTER_ERR without indicating where.
It seems extremely difficult to find the error. I changed the code to first read the URL into a String and then check for bad characters with a regexp, but that does not help for bad attributes (e.g. attributes without a value) etc.
My current workaround is to minimize the risk by removing elements by tag from the jsoup document (head, img, script ...).
Is there a more elegant solution?

Try setting the outputSettings to 'XML' for your document:
document.outputSettings().syntax(OutputSettings.Syntax.xml);
document.outputSettings().charset("UTF-8");
This should ensure that the resulting XML is valid.
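A sketch of how those settings fit into the OP's pipeline (the URL is illustrative; assumes jsoup 1.x, where OutputSettings lives in org.jsoup.nodes.Document): with the syntax set to xml, jsoup's serialized output can be handed to a standard DOM parser as an alternative route to a w3c DOM.
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.jsoup.Jsoup;
import org.xml.sax.InputSource;

// Parse leniently with jsoup, then serialize as XML-safe markup
org.jsoup.nodes.Document jsoupDoc = Jsoup.connect("http://example.com").get();
jsoupDoc.outputSettings()
        .syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml)
        .charset("UTF-8");
String xml = jsoupDoc.html(); // entities and attributes are now XML-escaped

// Feed the serialized string to a regular w3c DOM parser
org.w3c.dom.Document dom = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)));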

Solution found by OP in reply to nyname00:
Thank you very much; this solved the problem:
Whitelist whiteList = Whitelist.relaxed();
Cleaner cleaner = new Cleaner(whiteList);
jsoupDom = cleaner.clean(jsoupDom);
"relaxed" in deed means relaxed developer...

Related

using jsoup to parse html but not follow/fetch links

What is the "correct" way to use JSoup to parse an HTML string or stream without fetching external data for link/img/area/iframe (and whatever other) tags? Right now I am doing something like this after I fetch a page using Apache HttpComponents:
HttpEntity entity = response.getEntity();
InputStream is = entity.getContent();
Document document = Jsoup.parse(is, null, "");
Which actually works fine. But passing the baseUri as empty just feels wrong, because I am betting JSoup tries to use it, only to fail and move on. I only want to use JSoup as an html parser and DOM manipulation kit, not an http framework. I am also a bit worried that JSoup might try to look for ="/foo" resources in the current directory or something. What does it do with an empty string? I tried passing null as the baseUri, which would be a natural interface for doing what I want, but it dies with an IllegalStateException.
Is there a way to do this, or am I worried about nothing?
... I don't think that JSoup does that. The URL parameter is just for the canonicalization of relative URLs; what you do with them is your responsibility. JSoup will not by itself try to access resources.
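To illustrate (a small sketch with hypothetical HTML): the baseUri only comes into play when you explicitly ask for an absolute URL, and nothing is fetched either way.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String html = "<a href='/foo'>foo</a>";
Document doc = Jsoup.parse(html, "http://example.com/");
Element link = doc.select("a").first();
System.out.println(link.attr("href"));   // "/foo" -- the raw value, untouched
System.out.println(link.absUrl("href")); // "http://example.com/foo" -- pure string resolution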

why can't I load this URL with JDOM? Browser spoofing?

I'm writing some code to load and parse HTML docs from the web.
I'm using JDOM like so:
SAXBuilder parser = new SAXBuilder();
Document document = (Document)parser.build("http://www.google.com");
Element rootNode = document.getRootElement();
/* and so on ...*/
It works fine like that. However, when I change the URL to some other web sites, like "http://www.kijiji.com", for example, the parser.build(...) line hangs.
Any idea why it hangs? I'm wondering if it might be because kijiji knows I'm not a "real" web browser -- perhaps I have to spoof my http request so it looks like it's coming from IE or something like that?
Any ideas are useful, thanks!
Rob
I think a few things may be going on here. The first issue is that you cannot parse regular HTML with JDOM; HTML is not XML....
Secondly, when I run kijiji.com through JDOM I get an immediate HTTP_400 response
When I parse google.com I get an immediate XML error about well-formedness.
If you happen to be parsing XHTML at some point, though, you will likely run into this problem: http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic/
XHTML has a doctype that references other doctypes, etc. These each take 30 seconds to load from w3c.org....
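If the DTD fetching is indeed what makes build() hang, one common workaround (a sketch; assumes the underlying SAX parser is Xerces, which honors this standard feature) is to disable loading of external DTDs:
import org.jdom.Document;
import org.jdom.input.SAXBuilder;

SAXBuilder parser = new SAXBuilder();
// Tell the underlying SAX parser not to fetch external DTDs at all
parser.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
Document document = parser.build("http://www.example.com/some-page.xhtml"); // illustrative URL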

Using XPath in XMLObject to query by namespace

I have a simple XML document
<abc:MyForm xmlns:abc='http://myform.com'>
  <abc:Forms>
    <def:Form1 xmlns:def='http://decform.com'>
    ....
    </def:Form1>
    <ghi:Form2 xmlns:ghi='http://ghiform.com'>
    ....
    </ghi:Form2>
  </abc:Forms>
</abc:MyForm>
I'm using XMLObjects from Apache, and when I try the following XPath expression it works perfectly:
object.selectPath("declare namespace abc='http://myform.com'
abc:Form/abc:Forms/*");
This gives me the two Form nodes (def and ghi). However, I want to be able to query by specifying a namespace; say I only want Form2. I've tried this and it fails:
object.selectPath("declare namespace abc='http://myform.com'
abc:Form/abc:Forms/*
[namespace-uri() = 'http://ghiform.com']");
The selectPath returns 0 nodes. Does anyone know what is going on?
Update:
If I do the following in 2 steps, then I can get the result that I want.
XmlObject forms = object.selectPath("declare namespace abc='http://myform.com'
abc:Form/abc:Forms")[0];
forms.selectPath("*[namespace-uri() = 'http://ghiform.com']");
This gives me the ghi:Form2 node just like it should; I don't understand why it doesn't work as a single XPath expression, though.
Thanks
The simple answer is that you can't. The namespace prefix is just a shorthand for the namespace URI, which is all that matters.
For a namespace-aware parser, your two tags are identical.
If you really want to differentiate using the prefix (although you really, really shouldn't be doing it), you can use a non-namespace-aware parser and just treat the prefix as if it were part of the element name.
But ideally you should read a tutorial on how namespaces work and try to use them as they were designed to be used.
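For example (illustrative snippet): to a namespace-aware parser these two elements are identical, because both prefixes bind to the same URI:
<x:Form2 xmlns:x='http://ghiform.com'/>
<ghi:Form2 xmlns:ghi='http://ghiform.com'/>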

Some help scraping a page in Java

I need to scrape a web page using Java, and I've read that regex is a pretty inefficient way of doing it and that one should load the page into a DOM Document and navigate it from there.
I've tried reading the documentation but it seems too extensive and I don't know where to begin.
Could you show me how to scrape this table into an array? I can try figuring out my way from there. A snippet/example would do just fine too.
Thanks.
You can try jsoup: Java HTML Parser. It is an excellent library with good sample code.
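A minimal sketch of the jsoup approach (URL and selector are illustrative; adjust them to the actual table):
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = Jsoup.connect("http://example.com/page-with-table.html").get();
List<List<String>> table = new ArrayList<List<String>>();
// Each <tr> becomes one row; each <td> becomes one cell value
for (Element row : doc.select("table tr")) {
    List<String> cells = new ArrayList<String>();
    for (Element cell : row.select("td")) {
        cells.add(cell.text());
    }
    table.add(cells);
}
System.out.println(table);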
1. Transform the web page you are trying to scrape into an XHTML document. There are several options to do this with Java, such as JTidy and HTMLCleaner. These tools will also automatically fix malformed HTML (e.g., close unclosed tags). Both work very well, but I prefer JTidy because it integrates better with Java's DOM API.
2. Extract the required information using XPath expressions.
Here is a working example using JTidy and the web page you provided, used to extract all file names from the table.
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public static void main(String[] args) throws Exception {
    // Create a new JTidy instance and set options
    Tidy tidy = new Tidy();
    tidy.setXHTML(true);
    // Parse the HTML page into a DOM document
    URL url = new URL("http://www.cs.grinnell.edu/~walker/fluency-book/labs/sample-table.html");
    Document doc = tidy.parseDOM(url.openStream(), System.out);
    // Use XPath to obtain whatever you want from the (X)HTML
    XPath xpath = XPathFactory.newInstance().newXPath();
    XPathExpression expr = xpath.compile("//td[@valign = 'top']/a/text()");
    NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
    List<String> filenames = new ArrayList<String>();
    for (int i = 0; i < nodes.getLength(); i++) {
        filenames.add(nodes.item(i).getNodeValue());
    }
    System.out.println(filenames);
}
The result will be [Integer Processing:, Image Processing:, A Photo Album:, Run-time Experiments:, More Run-time Experiments:] as expected.
Another cool tool that you can use is Web Harvest. It basically does everything I did above but using an XML file to configure the extraction pipeline.
Regex is definitely the way to go. Building a DOM is overly complicated and itself requires a lot of text parsing.
If all you are doing is scraping a table into a datafile, regex will be just fine, and may be even better than using a DOM document. DOM documents will use up a lot of memory (especially for really large data tables) so you probably want a SAX parser for large documents.
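For comparison, a sketch of the regex route (the pattern is illustrative and assumes simple, flat table markup; nested tags will break it):
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String html = "<table><tr><td>a</td><td>b</td></tr></table>"; // the fetched page body
Pattern cell = Pattern.compile("<td[^>]*>(.*?)</td>", Pattern.DOTALL);
Matcher m = cell.matcher(html);
List<String> cells = new ArrayList<String>();
while (m.find()) {
    cells.add(m.group(1).trim());
}
System.out.println(cells); // [a, b]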

XML parsing issue with '&' in element text

I have the following code:
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(new InputSource(new StringReader(inputXml)));
And the parse step is throwing:
SAXParseException: The entity name must immediately follow
the '&' in the entity reference
due to the following '&' in my inputXml:
<Line1>Day & Night</Line1>
I'm not in control of the inbound XML. How can I safely/correctly parse this?
Quite simply, the input "XML" is not valid XML. The entity should be encoded, i.e.:
<Line1>Day &amp; Night</Line1>
Basically, there's no "proper" way to fix this other than telling the XML supplier that they're giving you garbage and getting them to fix it. If you're in some horrible situation where you've just got to deal with it, then the approach you take will likely depend on what range of values you're expected to receive.
If there are no entities in the document at all, a regex replace of & with &amp; before processing would do the trick. But if they're sending some entities correctly, you'd need to exclude those from the matching. And on the rare chance that they actually wanted to send the entity code (i.e. sent &amp; but meant &amp;amp;), you're going to be completely out of luck.
But hey - it's the supplier's fault anyway, and if your attempt to fix up invalid input isn't exactly what they wanted, there's a simple thing they can do to address that. :-)
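If you do have to patch the input yourself, one common approach (a sketch; it escapes only ampersands that do not already start an entity reference, and will still mangle a supplier who genuinely meant to send entity codes):
String fixed = inputXml.replaceAll("&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#x[0-9a-fA-F]+);)", "&amp;");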
Your input XML isn't valid XML; unfortunately you can't realistically use an XML parser to parse this.
You'll need to pre-process the text before passing it to an XML parser. Although you can do a string replace, replacing '& ' with '&amp; ', this isn't going to catch every occurrence of & in the input, but you may be able to come up with something that does.
I used the Tidy framework before XML parsing:
final StringWriter errorMessages = new StringWriter();
final String res = new TidyChecker().doCheck(html, errorMessages);
...
DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = db.parse(new InputSource(new StringReader(addRoot(html))));
...
And all is OK.
Is inputXML a string? Then use this:
inputXML = inputXML.replaceAll("&\\s+", "&amp; ");
