I'm writing some code to load and parse HTML docs from the web.
I'm using JDOM like so:
SAXBuilder parser = new SAXBuilder();
Document document = (Document)parser.build("http://www.google.com");
Element rootNode = document.getRootElement();
/* and so on ...*/
It works fine like that. However, when I change the URL to some other web sites, like "http://www.kijiji.com", for example, the parser.build(...) line hangs.
Any idea why it hangs? I'm wondering if it might be because kijiji knows I'm not a "real" web browser -- perhaps I have to spoof my HTTP request so it looks like it's coming from IE or something like that?
Any ideas are useful, thanks!
Rob
I think a few things may be going on here. The first issue is that you cannot parse regular HTML with JDOM; HTML is not XML.
Secondly, when I run kijiji.com through JDOM I get an immediate HTTP 400 response.
When I parse google.com I get an immediate XML error about well-formedness.
If you happen to be parsing XHTML at some point, though, you will likely run into this problem: http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic/
XHTML has a doctype that references other DTDs, etc. These each take about 30 seconds to load from w3.org.
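If those slow DTD fetches are what is making build(...) hang, one workaround is to tell the underlying parser not to load external DTDs at all. A rough sketch, assuming a Xerces-backed parser and JDOM 2's org.jdom2 packages (the URL is just a placeholder); note this only helps if the page really is well-formed XML/XHTML in the first place:
import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.input.SAXBuilder;

SAXBuilder builder = new SAXBuilder();
// Xerces-specific feature: do not fetch external DTDs referenced by the doctype
builder.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);

// Placeholder URL; the page must still be well-formed XML/XHTML for this to succeed
Document document = builder.build("http://www.example.org/some-xhtml-page.html");
Element rootNode = document.getRootElement();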
My use case: fetch HTML pages with jsoup and return a W3C DOM for further processing by XML transformations:
...
org.jsoup.nodes.Document document = connection.get();
org.w3c.dom.Document dom = new W3CDom().fromJsoup(document);
...
This works well for most documents, but for some it throws INVALID_CHARACTER_ERR without saying where.
It seems extremely difficult to find the error. I changed the code to first read the URL into a String and then check for bad characters with a regexp. But that does not help with bad attributes (e.g. attributes without a value), etc.
My current solution is to minimize the risk by removing elements by tag in the jsoup document (head, img, script, ...).
Is there a more elegant solution?
Try setting the outputSettings to 'XML' for your document:
document.outputSettings().syntax(OutputSettings.Syntax.xml);
document.outputSettings().charset("UTF-8");
This should ensure that the resulting XML is valid.
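Putting that together with the conversion from the question, a minimal sketch (the URL is just a placeholder):
import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Document.OutputSettings;

// Placeholder URL; fetch the page with jsoup as in the question
Document document = Jsoup.connect("http://www.example.org/page.html").get();
// Per the answer above: ask jsoup to emit XML syntax and UTF-8 before converting
document.outputSettings().syntax(OutputSettings.Syntax.xml);
document.outputSettings().charset("UTF-8");
org.w3c.dom.Document dom = new W3CDom().fromJsoup(document);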
Solution found by OP in reply to nyname00:
Thank you very much; this solved the problem:
Whitelist whiteList = Whitelist.relaxed();
Cleaner cleaner = new Cleaner(whiteList);
jsoupDom = cleaner.clean(jsoupDom);
"relaxed" in deed means relaxed developer...
What is the "correct" way to use JSoup to parse html string or stream without fetching external data for link/img/area/iframe (and whatever other) tags? Right now I am doing something like this after I fetch a page using Apache HttpComponents:
HttpEntity entity = response.getEntity();
InputStream is = entity.getContent();
Document doc = Jsoup.parse(is, null, "");
Which actually works fine. But passing the baseUri as empty just feels wrong, because I am betting JSoup tries to use it, only to fail and move on. I only want to use JSoup as an html parser and DOM manipulation kit, not an http framework. I am also a bit worried that JSoup might try to look for ="/foo" resources in the current directory or something. What does it do with an empty string? I tried passing null as the baseUri, which would be a natural interface for doing what I want, but it dies with an IllegalStateException.
Is there a way to do this, or am I worried about nothing?
... I don't think that JSoup does that. The URL parameter is just for the canonicalization of relative URLs; what you do with them is your responsibility. JSoup will not by itself try to access resources.
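For what it's worth, a small sketch showing that the base URI only matters when you ask for absolute URLs (the HTML string and base URL below are made up):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String html = "<html><body><a href=\"/foo\">link</a></body></html>";

// Parse with an empty base URI: nothing is fetched, relative links simply stay relative
Document doc = Jsoup.parse(html, "");
Element link = doc.select("a").first();
System.out.println(link.attr("href"));    // "/foo"
System.out.println(link.absUrl("href"));  // "" - cannot be resolved without a base

// Parse with a (hypothetical) base URI: absUrl can now resolve the link
Document doc2 = Jsoup.parse(html, "http://www.example.org/");
System.out.println(doc2.select("a").first().absUrl("href")); // "http://www.example.org/foo"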
Everything is okay when I read the data from the web page using an InputStreamReader.
I have a problem with parsing the data into an HTMLDocument.
The main reason is that the HTML source has some special characters which are used incorrectly.
There is an & sign twice ("&&") and I believe that is causing the code to crash.
My code looks like this:
URL url = new URL(PageUrl);
URLConnection conn = url.openConnection();
// ... omitted ...
// parsing
HTMLDocument doc = (HTMLDocument)db.parse(conn.getInputStream());
Since I am making an Android application, I don't use the standard parsing functions, because the resulting HTMLDocument object would be too large.
I found many existing examples of parsing HTML, such as using jsoup, but they are not what I want.
I want to write my own parsing code so that the HTMLDocument object stays small.
Why don't you use one of the HTML parsers that are already available in Java?
They have community support, so they are the best option.
Open Source HTML Parsers in Java
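For example, jsoup lets you pull out just the pieces you need and discard the rest, which keeps the in-memory footprint small; a minimal sketch (the URL and CSS selector are made-up placeholders):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Placeholder page and selector, just to show the idea
Document doc = Jsoup.connect("http://www.example.org/page.html").get();
for (Element item : doc.select("div.listing > h2 > a")) {
    System.out.println(item.text() + " -> " + item.attr("href"));
}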
Trying to figure out a way to strip out specific information (name, description, id, etc.) from an HTML file, leaving behind the unwanted information, and store it in an XML file.
I thought of trying XSLT, since it can do XML to HTML... but it doesn't seem to work the other way around.
I honestly don't know what other language I should try to accomplish this. I know basic Java and JavaScript, but I'm not sure they can do it. I'm kind of lost on getting started.
I'm open to any advice/help, and willing to learn a new language too, as I'm just doing this for fun.
There are a number of Java libraries for handling HTML input that isn't well-formed (according to XML). These libraries also have built-in methods for querying or manipulating the document, but it's important to realize that once you've parsed the document it's usually pretty easy to treat it as though it were XML in the first place (using the standard Java XML interfaces). In other words, you only need these libraries to parse the malformed input; the other utilities they provide are mostly superfluous.
Here's an example that shows parsing HTML using HTMLCleaner and then converting that object into a standard org.w3c.dom.Document:
TagNode tagNode = new HtmlCleaner().clean("<html><div><p>test");
DomSerializer ser = new DomSerializer(new CleanerProperties());
org.w3c.dom.Document doc = ser.createDOM(tagNode);
In Jsoup, simply parse the input and serialize it into a string:
String text = Jsoup.parse("<html><div><p>test").outerHtml();
And convert that string into a W3C Document using one of the methods described here:
How to parse a String containing XML in Java and retrieve the value of the root node?
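For example, a sketch of one such method, feeding the string to a JAXP DocumentBuilder (error handling omitted):
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document.OutputSettings;
import org.xml.sax.InputSource;

org.jsoup.nodes.Document jsoupDoc = Jsoup.parse("<html><div><p>test");
// Emitting XML syntax keeps void elements (<br>, <img>, ...) well-formed for the XML parser
jsoupDoc.outputSettings().syntax(OutputSettings.Syntax.xml);
String text = jsoupDoc.outerHtml();

DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
org.w3c.dom.Document doc = builder.parse(new InputSource(new StringReader(text)));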
You can now use the standard JAXP interfaces to transform this document:
TransformerFactory tFact = TransformerFactory.newInstance();
Transformer transformer = tFact.newTransformer();
Source source = new DOMSource(doc);
Result result = new StreamResult(System.out);
transformer.transform(source, result);
Note: Provide some XSLT source to tFact.newTransformer() to do something more useful than the identity transform.
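For instance, reusing tFact from the snippet above (the stylesheet path is hypothetical):
import java.io.File;
import javax.xml.transform.stream.StreamSource;

// Load a real stylesheet instead of performing the identity transform
Transformer transformer = tFact.newTransformer(new StreamSource(new File("extract.xsl")));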
I would use HTMLAgilityPack or Chris Lovett's SGMLReader.
Or, simply HTML Tidy.
Ideally, you can treat your HTML as XML. If you're lucky, it will already be XHTML, and you can process it as XML. If not, use something like http://nekohtml.sourceforge.net/ (an HTML tag balancer, etc.) to process the HTML into something that is XML compliant so that you can use XSLT.
I have a specific example and some notes around doing this on my personal blog at http://blogger.ziesemer.com/2008/03/scraping-suns-bug-database.html.
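As a rough sketch of the NekoHTML route (assuming its org.cyberneko.html.parsers.DOMParser; the HTML fragment is made up):
import java.io.StringReader;
import org.cyberneko.html.parsers.DOMParser;
import org.xml.sax.InputSource;

DOMParser parser = new DOMParser();
parser.parse(new InputSource(new StringReader("<html><div><p>test")));
org.w3c.dom.Document doc = parser.getDocument();
// doc can now be handed to the JAXP/XSLT machinery as ordinary XML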
TagSoup
JSoup
Beautiful Soup
I need to scrape a web page using Java and I've read that regex is a pretty inefficient way of doing it and one should put it into a DOM Document to navigate it.
I've tried reading the documentation but it seems too extensive and I don't know where to begin.
Could you show me how to scrape this table into an array? I can try figuring out my way from there. A snippet/example would do just fine too.
Thanks.
You can try jsoup: Java HTML Parser. It is an excellent library with good sample code.
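A minimal jsoup sketch for pulling a table into a list of rows (the URL and the "table tr" selector are assumptions about the page layout):
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Placeholder URL; select every row of every table and collect the cell text
Document doc = Jsoup.connect("http://www.example.org/table-page.html").get();
List<String[]> rows = new ArrayList<String[]>();
for (Element row : doc.select("table tr")) {
    List<String> cells = new ArrayList<String>();
    for (Element cell : row.select("td")) {
        cells.add(cell.text());
    }
    rows.add(cells.toArray(new String[0]));
}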
Transform the web page you are trying to scrape into an XHTML document. There are several options to do this in Java, such as JTidy and HTMLCleaner. These tools will also automatically fix malformed HTML (e.g., close unclosed tags). Both work very well, but I prefer JTidy because it integrates better with Java's DOM API.
Extract the required information using XPath expressions.
Here is a working example using JTidy and the web page you provided, used to extract all file names from the table.
public static void main(String[] args) throws Exception {
    // Create a new JTidy instance and set options
    Tidy tidy = new Tidy();
    tidy.setXHTML(true);

    // Parse an HTML page into a DOM document
    URL url = new URL("http://www.cs.grinnell.edu/~walker/fluency-book/labs/sample-table.html");
    Document doc = tidy.parseDOM(url.openStream(), System.out);

    // Use XPath to obtain whatever you want from the (X)HTML
    XPath xpath = XPathFactory.newInstance().newXPath();
    XPathExpression expr = xpath.compile("//td[@valign = 'top']/a/text()");
    NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);

    List<String> filenames = new ArrayList<String>();
    for (int i = 0; i < nodes.getLength(); i++) {
        filenames.add(nodes.item(i).getNodeValue());
    }
    System.out.println(filenames);
}
The result will be [Integer Processing:, Image Processing:, A Photo Album:, Run-time Experiments:, More Run-time Experiments:] as expected.
Another cool tool that you can use is Web Harvest. It basically does everything I did above but using an XML file to configure the extraction pipeline.
Regex is definitely the way to go. Building a DOM is overly complicated and itself requires a lot of text parsing.
If all you are doing is scraping a table into a datafile, regex will be just fine, and may be even better than using a DOM document. DOM documents will use up a lot of memory (especially for really large data tables) so you probably want a SAX parser for large documents.
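If you do go that route, a small sketch (this assumes very regular markup, i.e. simple <td>value</td> cells with no nested tags, and that the page source has already been read into a String named html):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Capture the text inside each <td>...</td>; DOTALL lets cells span line breaks
Pattern cell = Pattern.compile("<td[^>]*>(.*?)</td>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Matcher m = cell.matcher(html);
while (m.find()) {
    System.out.println(m.group(1).trim());
}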