How to parse JSP pages into an XML file? - java

I am trying to convert a JSP page into an XML file. I have been using jsoup, and it reads the whole content well except for the server tags, but I can't understand how the whole HTML can be converted to XML tags. I mean, how can I fetch the data line by line?
My Code:
File Html=new File("genXML.jsp");
Document doc=Jsoup.parse(Html,"UTF-8","http://www.example.com");
System.out.println(doc.html());
Any assistance would be great

First of all, converting JSP to XML is not the same as converting HTML to XML. I suppose you want to translate the HTML generated from a JSP into XML. Second, you don't want to do this line by line: an HTML block usually does not begin and end on a single line.
Anyway, you could use a tool like TagSoup to convert HTML code to XHTML, which is actually XML. I don't know if it has a useful API, but at the very least it can be called from your code as an external process using something like this:
Process tr = Runtime.getRuntime().exec(new String[]{ "..." } );
Then, if you want to transform it to a target XML schema, you could apply an XSLT transformation using a tool like the ones found online (check this and this). You could also apply the XSLT transformation programmatically using JAXP.
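As a rough illustration of the JAXP route, here is a minimal sketch using only the JDK's javax.xml.transform classes. The class name XsltSketch, the stylesheet, and the <items>/<item> target format are made up for the example; it also assumes the input is already well-formed and carries no XHTML namespace (real TagSoup output would need namespace-aware match patterns).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltSketch {
    // Hypothetical stylesheet: copies the text of every <p> into an <item>
    // inside a made-up <items> root element.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<items><xsl:for-each select='//p'>"
      + "<item><xsl:value-of select='.'/></item>"
      + "</xsl:for-each></items>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the given stylesheet to a well-formed XML/XHTML string.
    static String transform(String xhtml, String xslt) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Assumed input: XHTML without a namespace or DOCTYPE.
        String xhtml = "<html><body><p>first</p><p>second</p></body></html>";
        System.out.println(transform(xhtml, XSLT));
    }
}
```

In practice the stylesheet would live in its own .xsl file and be loaded with new StreamSource(new File(...)) instead of a string constant.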
Hope I helped!

Related

Extract the first page content from docx file by XML parsing

I need to extract the first page content from a docx file and save it as a separate document. I need everything from the first page (images, tables, text) to be saved as it is in a new docx file.
What I tried:
I looked into the XML of the unzipped docx file. Since a Word document is reflowable, I couldn't find a page break at the end of each page, so I couldn't locate the page boundaries via document.xml.
Is there any way to get the XML content of only the first page of the document using the Java XML DOM parser?
Do not write a new parser; there are tons of existing tools for that (e.g., what if your input changes from XML to binary Word files?).
Use Apache POI, for example, as @JFB suggested.

docx4j convert docx in wrong html format

I have some problems with the docx4j samples. I need to convert a file from docx to HTML and back. I'm trying to compile the ConvertInXHTMLDocument.java sample. It creates the HTML file fine, but converting it back into docx throws an exception about missing close tags (META, img, etc.). Has anyone encountered this problem?
XHTMLImporter requires its input to be well-formed XML. So you need to ensure you don't have missing close tags (META, img etc); if you do, run JTidy or similar first.
docx4j's (X)HTML output can either be HTML or XML. From 3.0, the property Convert.Out.HTML.OutputMethodXML will control which.

Parsing HTML and get all the nodes

I need to parse an HTML file in Java. Unlike XML, there are no repetitive tags, so I need code that can parse the HTML file and reach all nodes, including nested tags. The HTML code is not fixed. In other words, given any HTML code, I need to reach all the tags in it.
Try this HTML Parser:
http://htmlparser.sourceforge.net/samples.html
I think you need this...
// JavaScript, run in the browser: list the name of every element in the page
var els = document.getElementsByTagName("*");
for (var i = 0; i < els.length; i++) document.write(els[i].nodeName + "<br />");
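Since the question asks for Java, a comparable walk can be sketched with the JDK's built-in DOM parser. This is only a sketch under a strong assumption: the markup must already be well-formed (XHTML, or HTML cleaned up by a tidying tool first), because the XML parser rejects ordinary tag-soup HTML. The class and method names (TagWalker, walk, visit) are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TagWalker {
    // Parses well-formed markup and returns one entry per element,
    // indented by two spaces per nesting level.
    static List<String> walk(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        List<String> out = new ArrayList<>();
        visit(doc.getDocumentElement(), 0, out);
        return out;
    }

    // Depth-first traversal; text nodes, comments, etc. are skipped.
    static void visit(Node node, int depth, List<String> out) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return;
        }
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            line.append("  ");
        }
        out.add(line.append(node.getNodeName()).toString());
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            visit(children.item(i), depth + 1, out);
        }
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><div><p>x</p></div><br/></body></html>";
        for (String line : walk(xhtml)) {
            System.out.println(line);
        }
    }
}
```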

Extraction of HTML Tags using Java

I want to extract the various HTML tags from the source code of a web page. Is there any method in Java to do that, or do HTML parsers support this?
I want to separate all the HTML tags.
Java comes with an XML parser with similar methods to the DOM in JavaScript:
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(html); // html: a File, InputStream, InputSource, or URI string
doc.getElementById("someId");
doc.getElementsByTagName("div");
doc.getChildNodes();
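The document builder can take many different inputs (a file, an input stream, a URI, or an InputSource wrapping a StringReader for raw markup held in a string). Note that parse(String) treats its argument as a URI, not as markup.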
http://download.oracle.com/javase/1.5.0/docs/api/org/w3c/dom/Document.html
The cyber neko parser is also good if you need more.
Check out CyberNeko HTML Parser.
You can use regular expressions.
If your HTML is valid XML, you can use an XML parser.
I've used HTMLParser in one project, was pretty happy with it.
Edit: If you check the samples page, the parser sample does pretty much what you're asking for.
You can write your own utility method to extract tags.
Look for < and the matching /> or > to find each complete tag, and write those tags to another file.
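The regular-expression and hand-rolled suggestions above could look roughly like the following sketch. It is deliberately naive: it assumes no > appears inside attribute values and does not treat script or style content specially, so a real HTML parser remains the safer choice. The names TagNameExtractor and tagNames are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagNameExtractor {
    // Matches an opening or self-closing tag and captures its name.
    // Closing tags, comments and doctypes do not match because the
    // character after '<' must be a letter.
    private static final Pattern OPEN_TAG =
        Pattern.compile("<([A-Za-z][A-Za-z0-9]*)[^>]*>");

    // Returns the distinct tag names in order of first appearance, lower-cased.
    static Set<String> tagNames(String html) {
        Set<String> names = new LinkedHashSet<>();
        Matcher m = OPEN_TAG.matcher(html);
        while (m.find()) {
            names.add(m.group(1).toLowerCase());
        }
        return names;
    }

    public static void main(String[] args) {
        String html = "<HTML><body><p class=\"a\">hi<br/>there</p><!-- c --><P>x</P></body></HTML>";
        System.out.println(tagNames(html));
    }
}
```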

Is there a simple java program that can extract URL & title of html files

Hi, I am looking for a simple URL and title extractor for HTML files in Java. I am trying to parse bookmarks.html (IE, Firefox, etc.) and add the title and URL to a database. I need to do this in Java with no third-party libraries allowed, so I probably have to use SAX, DOM, or regex.
You can load the file into a DOM document and then use an XPath expression to find all the instances of the A tag. Extracting the HREF attribute and the tag contents should do what you want. The XPath would probably be something as simple as '//A'.
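A minimal sketch of that approach with the JDK's DOM and XPath classes follows. It assumes the input is well-formed XML, which real browser bookmark exports generally are not, so some cleanup would be needed first. Because XML names are case-sensitive, the sketch checks both A/HREF and a/href. The names LinkExtractor and links are made up for the example.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LinkExtractor {
    // Returns a map of URL -> link text for every anchor element, in document order.
    static Map<String, String> links(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        // XML is case-sensitive, so match both upper- and lower-case anchors.
        NodeList anchors = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate("//A | //a", doc, XPathConstants.NODESET);
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            String url = a.hasAttribute("HREF") ? a.getAttribute("HREF") : a.getAttribute("href");
            result.put(url, a.getTextContent().trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Assumed, already well-formed sample input.
        String xml = "<DL><DT><A HREF=\"http://example.com/\">Example</A></DT>"
                   + "<DT><a href=\"http://example.org/\">Org</a></DT></DL>";
        for (Map.Entry<String, String> e : links(xml).entrySet()) {
            System.out.println(e.getValue() + " -> " + e.getKey());
        }
    }
}
```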
