I've got an arbitrary XHTML document which are usually not well formed, since websites can be made like that and browser will show it. How can I support XSLT translation for not well formed XHTML code? Is there a way that it can avoid those parts which are not well formed?
I have this code in Java, but as I've said it's not supporting not well formed XHTML:
try {
TransformerFactory tFactory=TransformerFactory.newInstance();
Source xslDoc=new StreamSource("path1");
Source xmlDoc=new StreamSource("path2");
String outputFileName="path3";
OutputStream htmlFile=new FileOutputStream(outputFileName);
Transformer trasform=tFactory.newTransformer(xslDoc);
trasform.transform(xmlDoc, new StreamResult(htmlFile));
}
catch (Exception e) {...}
You can use JSoup library to parse and fix your HTML and then use XSLT.
You can try to use an HTML parser like http://about.validator.nu/htmlparser/ or like TagSoup.
Related
Everything is okay when I read the data from webpage using InputStreamReader.
I have problem with parsing data to DocumentHTML.
Main reason is that the HTML script has some special characters which are used incorrectly.
There is an & sign twice ( "&&" ) and I believe that is causing the code to crash.
My code looks like this:
URL url = new URL(PageUrl);
URLConnection conn = url.openConnection();
// ... omitted ...
// parsing
HTMLDocument doc = (HTMLDocument)db.parse(conn.getInputStream());
Since I am making an Android application, I don't use standard parsing functions since the DocumentHTML object is going to be too large.
I found many existing examples of parsing HTML like using jsoup but they are not what I want.
I want to write my own code for parsing so that the HTMLDocument object will be kept small.
Why dont you use all the available Html parsers that are available in java?
They have community support they so are the best option.
Open Source HTML Parsers in Java
Hi Can anyone help me in html to xml conversion using xslt in java.I converted xml to html using xslt in java.This is the code i used for that converstion:
import javax.xml.transform.*;
import java.net.*;
import java.io.*;
public class HowToXSLT {
public static void main(String[] args) {
try {
TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer =
tFactory.newTransformer
(new javax.xml.transform.stream.StreamSource
("howto.xsl"));
transformer.transform
(new javax.xml.transform.stream.StreamSource
("howto.xml"),
new javax.xml.transform.stream.StreamResult
( new FileOutputStream("howto.html")));
}
catch (Exception e) {
e.printStackTrace( );
}
}
}
But i dont know the reverse process of this program that is to convert html to xml? Is there is any jar files available to do that? please help me...
Generally, it isn't possible to "reverse" a transformation, because a transformation in the general case isn't a 1:1 mapping.
For example, if the transformation does this:
<xsl:value-of select= "/x * /x"/>
and we get as result: 16
(and we know that the source XML document had only one element),
it isn't possible to determine from the value 16 whether the source XML document was:
<x>4</x>
or whether it was:
<x>-4</x>
And the above was only a simple example! :)
This will depend on what you wish to do exactly.
Apparently, howto.xsl contains the rules to be applied on the xml to get the html.
You will have to write another xsl file to do the reverse.
I believe it is not possible. XLST input must be XML conforming and HTML is not conforming to XML (unless you talk about XHTML).
May be you need to first make your html xhtml complaint, then use a xsl (reverse of the original xsl)which has instruction to convert the xhtml file to xml.
Its not possible, you can use Microsoft.XMLDOM for converting from HTML to XML.
Trying to figure out a way to strip out specific information(name,description,id,etc) from an html file leaving behind the un-wanted information and storing it in an xml file.
I thought of trying using xslt since it can do xml to html... but it doesn't seem to work the other way around.
I honestly don't know what other language i should try to accomplish this. i know basic java and javascript but not to sure if it can do it.. im kind of lost on getting this started.
i'm open to any advice/help. willing to learn a new language too as i'm just doing this for fun.
There are a number of Java libraries for handling HTML input that isn't well-formed (according to XML). These libraries also have built-in methods for querying or manipulating the document, but it's important to realize that once you've parsed the document it's usually pretty easy to treat it as though it were XML in the first place (using the standard Java XML interfaces). In other words, you only need these libraries to parse the malformed input; the other utilities they provide are mostly superfluous.
Here's an example that shows parsing HTML using HTMLCleaner and then converting that object into a standard org.w3c.dom.Document:
TagNode tagNode = new HtmlCleaner().clean("<html><div><p>test");
DomSerializer ser = new DomSerializer(new CleanerProperties());
org.w3c.dom.Document doc = ser.createDOM(tagNode);
In Jsoup, simply parse the input and serialize it into a string:
String text = Jsoup.parse("<html><div><p>test").outerHtml();
And convert that string into a W3C Document using one of the methods described here:
How to parse a String containing XML in Java and retrieve the value of the root node?
You can now use the standard JAXP interfaces to transform this document:
TransformerFactory tFact = TransformerFactory.newInstance();
Transformer transformer = tFact.newTransformer();
Source source = new DOMSource(doc);
Result result = new StreamResult(System.out);
transformer.transform(source, result);
Note: Provide some XSLT source to tFact.newTransformer() to do something more useful than the identity transform.
I would use HTMLAgilityPack or Chris Lovett's SGMLReader.
Or, simply HTML Tidy.
Ideally, you can treat your HTML as XML. If you're lucky, it will already be XHTML, and you can process it as HTML. If not, use something like http://nekohtml.sourceforge.net/ (a HTML tag balancer, etc.) to process the HTML into something that is XML compliant so that you can use XSLT.
I have a specific example and some notes around doing this on my personal blog at http://blogger.ziesemer.com/2008/03/scraping-suns-bug-database.html.
TagSoup
JSoup
Beautiful Soup
I need to scrape a web page using Java and I've read that regex is a pretty inefficient way of doing it and one should put it into a DOM Document to navigate it.
I've tried reading the documentation but it seems too extensive and I don't know where to begin.
Could you show me how to scrape this table in to an array? I can try figuring out my way from there. A snippet/example would do just fine too.
Thanks.
You can try jsoup: Java HTML Parser. It is an excellent library with good sample codes.
Transform the web page you are trying to scrap into an XHTML document. There are several options to do this with Java, such as JTidy and HTMLCleaner. These tools will also automatically fix malformed HTML (e.g., close unclosed tags). Both work very well, but I prefer JTidy because it integrates better with Java's DOM API;
Extract required information using XPath expressions.
Here is a working example using JTidy and the Web Page you provided, used to extract all file names from the table.
public static void main(String[] args) throws Exception {
// Create a new JTidy instance and set options
Tidy tidy = new Tidy();
tidy.setXHTML(true);
// Parse an HTML page into a DOM document
URL url = new URL("http://www.cs.grinnell.edu/~walker/fluency-book/labs/sample-table.html");
Document doc = tidy.parseDOM(url.openStream(), System.out);
// Use XPath to obtain whatever you want from the (X)HTML
XPath xpath = XPathFactory.newInstance().newXPath();
XPathExpression expr = xpath.compile("//td[#valign = 'top']/a/text()");
NodeList nodes = (NodeList)expr.evaluate(doc, XPathConstants.NODESET);
List<String> filenames = new ArrayList<String>();
for (int i = 0; i < nodes.getLength(); i++) {
filenames.add(nodes.item(i).getNodeValue());
}
System.out.println(filenames);
}
The result will be [Integer Processing:, Image Processing:, A Photo Album:, Run-time Experiments:, More Run-time Experiments:] as expected.
Another cool tool that you can use is Web Harvest. It basically does everything I did above but using an XML file to configure the extraction pipeline.
Regex is definitely the way to go. Building a DOM is overly complicated and itself requires a lot of text parsing.
If all you are doing is scraping a table into a datafile, regex will be just fine, and may be even better than using a DOM document. DOM documents will use up a lot of memory (especially for really large data tables) so you probably want a SAX parser for large documents.
I have a Java String with SGML, something like this...
<misspell></misspell><plain>I</plain> <plain>know</plain> <plain>you</plain> <suggestion>ducky</suggestion> <plain>suck</plain> <plain>and</plain> <plain>I</plain> <plain>rocky</plain> <plain>rock</plain>
How do I parse it to get for instance say the text inside <suggestion> </suggestion>so as to get "ducky" out??
Will javax.swing.text.html.parser.Parse can be of any help? or I can only parse HTML docs with it?
The string you show is not HTML, but it could be parsed by an XML parser.
The SAX API is part of the JDK and AFAIK most XML parsers implement it.
try an html parser, they are (by necessity) quite forgiving of malformed markup and html is by nature based on SGML.
e.g. http://htmlparser.sourceforge.net/