I need to parse non well-formed xml data (HTML) - java

I have some non-well-formed XML (HTML) data in Java. I tried JAXP DOM, but it complains.
The question is: is there any way to use JAXP to parse such documents?
I have a file containing data such as:
<employee>
<name value="ahmed" > <!-- note, this element is not closed, So it is not well-formed xml-->
</employee>

You could try running your document through the jtidy API first - that has the ability to convert html into valid xhtml: http://jtidy.sourceforge.net/howto.html
Tidy tidy = new Tidy();
tidy.setXHTML(true);
tidy.parse(......)...

You could use TagSoup. I have used it with great success. It is completely compatible with the Java XML APIs, including SAX, DOM, XSLT, and StAX. For example, here is how I used it to apply XSLT transforms to particularly poor HTML:
public static void transform(InputStream style, InputStream data)
throws SAXException, TransformerException {
XMLReader reader =
XMLReaderFactory.createXMLReader("org.ccil.cowan.tagsoup.Parser");
Source input = new SAXSource(reader, new InputSource(data));
Source xsl = new StreamSource(style);
Transformer transformer =
TransformerFactory.newInstance().newTransformer(xsl);
transformer.transform(input, new StreamResult(System.out));
}

Not really. JAXP wants well-formed markup. Have you considered the Cyberneko HTML Parser? We've been very successful with it at our shop.
EDIT: I see you are wanting to parse XML too. Hrmm.... Cyberneko works well for HTML but I don't know about others. It has a tag balancer that would close some tags off, but I don't know if you can train it to recognize tags that are not HTML.
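To see the complaint concretely, here is a minimal sketch that feeds the question's markup to a plain JAXP DocumentBuilder and reports the parser's error. The class and method names (WellFormedCheck, parseError) are made up for illustration; only standard JDK classes are used.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any library.
final class WellFormedCheck {

    /** Returns null if the markup parses, otherwise the parser's error message. */
    static String parseError(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // The unclosed <name> element from the question.
        System.out.println(parseError("<employee><name value=\"ahmed\" ></employee>"));
        // The same data with <name> closed parses cleanly, so this prints null.
        System.out.println(parseError("<employee><name value=\"ahmed\" /></employee>"));
    }
}
```

A check like this can decide whether the input needs to go through a repairing parser first.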

Related

How to use java to get element of xml document, but in xml string format?

I have read some links on parsing xml document like below:
<inventory>
<book year="2000">
<title>Snow Crash</title>
<author>Neal Stephenson</author>
<publisher>Spectra</publisher>
<isbn>0553380958</isbn>
<price>14.95</price>
</book>
<book year="2005">
<title>Burning Tower</title>
<author>Larry Niven</author>
<author>Jerry Pournelle</author>
<publisher>Pocket</publisher>
<isbn>0743416910</isbn>
<price>5.99</price>
</book>
<!-- more books... -->
</inventory>
using DOM parsing:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(<uri_as_string>);
XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
XPathExpression expr = xpath.compile(<xpath_expression>);
However, these mostly aim to get the VALUE of some node(s) by tag or by attribute from the document.
My purpose is to get the entire XML STRING of the node(s) back. For example, using the XPath /inventory/book[@year='2005'], I want to get the following XML back in a single string, i.e.
<book year="2005">
<title>Burning Tower</title>
<author>Larry Niven</author>
<author>Jerry Pournelle</author>
<publisher>Pocket</publisher>
<isbn>0743416910</isbn>
<price>5.99</price>
</book>
What is the API used for this purpose? And do i even need the DOM parsing in this case? Thanks,
COMMENT:
Maybe I should emphasize that I am asking this question as a XML related one, not a text file processing question. Concepts like 'tag', 'attribute', 'Xpath' still apply. The DOM model is not totally irrelevant. It's just that instead of getting the 'element' or value of a node, i want to get the whole node.
The given answers can not solve problems like: how to get a node in xml string format, given the node's Xpath representation, such as //book or /inventory/book[1]?
DOM parsers are designed to get values out of the document, not the actual file content.
You can use a simple file reader instead of an XML parser.
Read the file line by line using a simple FileReader and check each line against your condition; once the condition is met, start concatenating the content you read until the end of the node.
You can do it as
if (lineReadFromFile.equals("Your String Condition")) {
// collect the desired file content here until the end of the node is found
}
You can simply read the XML from the file (treating it as a normal text file) using FileReader. Simply apply the condition, for example:
if (line.equals("<book year=\"2005\"><title>Burning Tower</title>")) {
// retrieve/save the required content
}
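For the XPath-driven case the question's comment asks about, one option that stays within the JDK is to select the node with XPath and serialize it with an identity Transformer. This is a sketch under that assumption, not the only way to do it; the class and method names (NodeToString, selectAsXml) are invented for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class NodeToString {

    /** Returns the first node matching the XPath as an XML string, or null if nothing matches. */
    static String selectAsXml(String xml, String xpathExpression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Node node = (Node) XPathFactory.newInstance().newXPath()
                .evaluate(xpathExpression, doc, XPathConstants.NODE);
        if (node == null) {
            return null;
        }
        // A Transformer created without a stylesheet copies its input unchanged.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<inventory>"
                + "<book year=\"2000\"><title>Snow Crash</title></book>"
                + "<book year=\"2005\"><title>Burning Tower</title></book>"
                + "</inventory>";
        System.out.println(selectAsXml(xml, "/inventory/book[@year='2005']"));
    }
}
```

Unlike the line-matching approach, this does not depend on how the file happens to be broken into lines.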

multiple html as output from 1 xsl with java

I want to know how I can generate multiple outputs (HTML) from one XML file using Java and XSL.
For example, having this xml:
<ARTICLE>
<SECT>
<PARA>The First 1st Major Section</PARA>
</SECT>
<SECT>
<PARA>The Second 2nd Major Section</PARA>
</SECT>
</ARTICLE>
For each child element "SECT" from "ARTICLE" I would like to have one ".html" as an output, example of the output:
sect1.html
<html>
<body>
<div>
<h1>The First 1st Major Section</h1>
</div>
</body>
</html>
sect2.html
<html>
<body>
<div>
<h1>The Second 2nd Major Section</h1>
</div>
</body>
</html>
I've been working in Java to transform the .xml document with the following code:
File stylesheet = new File(argv[0]);
File datafile = new File(argv[1]);
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(datafile);
// Use a Transformer for output
TransformerFactory tFactory = TransformerFactory.newInstance();
StreamSource stylesource = new StreamSource(stylesheet);
Transformer transformer = tFactory.newTransformer(stylesource);
DOMSource source = new DOMSource(document);
OutputStream result=new FileOutputStream("sections.html");
transformer.transform(source, new StreamResult(result));
The problem is that I get only one output. Could you help me write the .xslt document, and tell me how to get more than one output?
To create more than one result document, you need an XSLT Processor which supports multiple result documents. The feature of multiple result documents was introduced in XSLT 2.0. Some XSLT Processors which do not yet implement XSLT 2.0 or newer feature multiple result documents as a proprietary extension.
Creating multiple result documents is, unlike the primary result document, not controlled directly from the Java source code. Instead, the XSLT code needs to contain the XSLT elements that create the multiple result documents.
In XSLT 2.0 and newer, the <xsl:result-document/> element is used to create multiple result documents. See XSLT 2.0, <xsl:result-document/> for more information and examples.
As far as I am aware, the XSLT Processor shipped with Java is Xalan-J, and Xalan-J does not yet support XSLT 2.0 or newer (according to their website http://xml.apache.org/xalan-j/). You might want to use Saxon instead, which supports XSLT 3.0. Or as described in this previous question Xalan XSLT multiple output files? you could use the Redirect extension.
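If switching processors is not an option, a different workaround (not xsl:result-document, and not the Redirect extension) is to keep an XSLT 1.0 stylesheet and run it once per section from Java, passing the section number as a parameter. The sketch below assumes the ARTICLE/SECT/PARA layout from the question; the class name, the method name, and the "index" parameter are made up for illustration, and the pages are returned in a map rather than written to disk.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class SectionPages {

    // XSLT 1.0 stylesheet that renders only the SECT selected by the "index" parameter.
    static final String STYLESHEET =
            "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
          + "<xsl:output method=\"html\"/>"
          + "<xsl:param name=\"index\"/>"
          + "<xsl:template match=\"/\">"
          + "<html><body><div><h1>"
          + "<xsl:value-of select=\"/ARTICLE/SECT[position() = number($index)]/PARA\"/>"
          + "</h1></div></body></html>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    /** Returns a map from output file name (sect1.html, sect2.html, ...) to page content. */
    static Map<String, String> renderSections(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // XSLT processors expect a namespace-aware DOM
        Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        // Counts every SECT in the document; this assumes sections are not nested.
        int count = doc.getElementsByTagName("SECT").getLength();

        // Compile the stylesheet once and reuse it for every section.
        Templates templates = TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(STYLESHEET)));
        Map<String, String> pages = new LinkedHashMap<>();
        for (int i = 1; i <= count; i++) {
            Transformer transformer = templates.newTransformer();
            transformer.setParameter("index", Integer.toString(i));
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            pages.put("sect" + i + ".html", out.toString());
        }
        return pages;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<ARTICLE><SECT><PARA>The First 1st Major Section</PARA></SECT>"
                + "<SECT><PARA>The Second 2nd Major Section</PARA></SECT></ARTICLE>";
        renderSections(xml).forEach((name, html) -> System.out.println(name + ": " + html));
    }
}
```

To produce real files, each map entry could be written out with a FileOutputStream-backed StreamResult instead of a StringWriter. The trade-off is one full transformation per section, which xsl:result-document avoids.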

java dom xml parser get html tags(<p color="something">some text</p>) from xml

I have an xml file with html tags like:
<?xml version="1.0" encoding="utf-8" ?>
<blog>
<blogid>49</blogid>
<title>[FIXED] Job requests page broken</title>
<fulltext>
<img title="page broken" src="images/west/blog/site-broken.jpg" alt="page broken" />
<p><span style="background-color: #ccffcc;">Update 28/05/2011</span>: Job requests page seems to be working OK now. If you find any issues please use the contact page to notify us. Thank you for your patience!</p>
<p>&#160;</p>
<p>Well, what can I say? Why does it always have to be that way? You are trying to create something new and something else gets broken on the way...</p>
</fulltext>
Now I want the whole HTML part between the <fulltext> tags as it is.
What I get right now is blank, as I think DOM is parsing the HTML tags as well.
I tried XPath, but it is not working on Android.
I don't think you can get this not well-formed XML into a DOM as-is. (EDIT: or is it well-formed?)
You would need to either a) escape the characters - making the XML well-formed and parseable (but probably not into a DOM you want, I guess you want to display the HTML in a different system) or b) parse it using a stream processor or c) fix it using string manipulation (wrap the content in <![CDATA[ ... ]]>) and then parse it into a DOM.
HTH
The HTML inside <fulltext> is written as well-formed XML here (XHTML-style, without getting into the details of how HTML and XHTML differ). Therefore, there is no reason for the DOM parser not to treat those inner tags as XML elements.
Maybe what you're looking for is a way to flatten what's inside <fulltext>?
Use a library like Jsoup for this purpose.
public static void main(String args[]){
String html = "<?xml version=\"1.0\"?><foo>" +
"<bar>Some text — invalid!</bar></foo>";
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
for (Element e : doc.select("bar")) {
System.out.println(e);
}
}
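If the markup inside <fulltext> is well-formed, as it appears to be in the question, another option is to let DOM parse it and then serialize the children of <fulltext> back to a string. The sketch below uses only javax.xml classes and assumes javax.xml.transform is available on the target platform; the class and method names (InnerXml, innerXml, fullTextOf) are invented for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class InnerXml {

    /** Serializes every child of the element (text and nested tags), but not the element itself. */
    static String innerXml(Element element) throws Exception {
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            // Each child is copied to the same writer, so the pieces end up concatenated.
            serializer.transform(new DOMSource(children.item(i)), new StreamResult(out));
        }
        return out.toString();
    }

    /** Parses the document and returns the markup inside its first <fulltext> element. */
    static String fullTextOf(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element fullText = (Element) doc.getElementsByTagName("fulltext").item(0);
        return innerXml(fullText);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<blog><blogid>49</blogid>"
                + "<fulltext><p>Hello <span>world</span></p></fulltext></blog>";
        System.out.println(fullTextOf(xml));
    }
}
```

Calling getTextContent() on <fulltext> would instead return only the text with the tags stripped.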

How to put String text without converting content to xml file in Java?

I need to put String content to xml in Java. I use this kind of code to insert information in xml:
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File ("file.xml"));
DOMSource source = new DOMSource (doc);
Node cards = doc.getElementsByTagName ("cards").item (0);
Element card = doc.createElement ("card");
cards.appendChild(card);
Element question = doc.createElement("question");
question.appendChild(doc.createTextNode("This <b>is</b> a test."));
card.appendChild (question);
StreamResult result = new StreamResult (new File (file));
Transformer tf = TransformerFactory.newInstance().newTransformer();
tf.setOutputProperty(OutputKeys.INDENT, "yes");
tf.transform(source, result);
But string is converted in xml like this:
<cards>
<card>
<question>This &lt;b&gt;is&lt;/b&gt; a test.</question>
</card>
</cards>
It should be like this:
<cards>
<card>
<question>This <b>is</b> a test.</question>
</card>
</cards>
I tried to use the CDATA method, changing the code like this:
// I changed this code
question.appendChild(doc.createTextNode("This <b>is</b> a test."));
// to this
question.appendChild(doc.createCDATASection("This <b>is</b> a test."));
This code produces an XML file that looks like:
<cards>
<card>
<question><![CDATA[This <b>is</b> a test.]]></question>
</card>
</cards>
I hope that somebody can help me to put String content in the xml file exactly with same content.
Thanks in advance!
This would be expected behaviour.
Consider if the brackets were kept as you put them, the end result would essentially be:
<cards>
<card>
<question>
This
<b>
is
</b>
a test.
</question>
</card>
</cards>
Basically, it would result in the <b> being an additional node in the XML tree. Encoding the brackets to &lt; and &gt; ensures that when displayed by any XML parser, the brackets will be displayed, and not confused as being an additional node.
If you really wanted them to display as you say you do, you will need to create elements named b. This will not only be awkward, it will also not display quite as you've written above - it would display as additional nested nodes as I've shown above. So you would need to amend the xml writer to output inline for those tags.
Nasty.
Check this solution: how to unescape XML in java
Maybe you could solve it in this way (code only for <question> tag part):
Element question = doc.createElement("question");
question.appendChild(doc.createTextNode("This "));
Element b = doc.createElement("b");
b.appendChild(doc.createTextNode("is"));
question.appendChild(b);
question.appendChild(doc.createTextNode(" a test."));
card.appendChild(question);
What you are effectively trying to do is to insert XML into the middle of a DOM without parsing it. You can't do this since the DOM APIs don't support it.
You have three choices:
You could serialize the DOM and then insert the String at the appropriate point. The end result may or may not be well-formed XML ... depending on what is in the String that you inserted.
You could create and insert DOM nodes representing the text and the <b>...</b> element. This requires you to know the XML structure of the stuff that you are inserting. @bluish's answer gives an example.
You could wrap the String in some container element, parse it using an XML parser to give a second DOM, find the nodes of interest, and add them to the original DOM. This requires that the String is well-formed XML when wrapped in the container element.
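The third choice can be sketched roughly as follows. Here the container element is simply the <question> element itself, and Document.importNode copies the parsed subtree into the original DOM. The class and method names (FragmentInsert, addQuestion) are made up for illustration, and the string must be well-formed once wrapped or the parse will fail.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class FragmentInsert {

    /** Adds <card><question>fragment</question></card> to the root element and returns the new XML. */
    static String addQuestion(String cardsXml, String fragment) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(cardsXml)));

        // Wrap the string in a container element and parse it into a second DOM.
        Document parsed = builder.parse(
                new InputSource(new StringReader("<question>" + fragment + "</question>")));

        // importNode makes a deep copy that belongs to the target document.
        Element card = doc.createElement("card");
        card.appendChild(doc.importNode(parsed.getDocumentElement(), true));
        doc.getDocumentElement().appendChild(card);

        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(addQuestion("<cards/>", "This <b>is</b> a test."));
    }
}
```

Because the fragment goes through a real parser, the <b> element ends up as a proper child node rather than escaped text.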
Or, since you're already using a Transformation, why not go all the way?
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()" />
</xsl:copy>
</xsl:template>
<xsl:template match="cards">
<card>
<question>This <b>is</b> a test</question>
</card>
</xsl:template>
</xsl:stylesheet>

XML Editor in java(jsp,servlet)

I am developing an XML editor using JSP and servlets, with a DOM parser.
I have one problem in the XML editor:
how do I edit the following XML file without losing elements?
eg:
<book id="b1">
<bookbegin id="bb1">
<para id="p1">This is<b>first</b>line</para>
<para id="p2">This is<b>second</b>line</para>
<para id="p3">This is<b>third</b>line</para>
</bookbegin>
</book>
I tried to edit the above XML file, using a DTD, with JSP and servlets, but when I read the text value from the XML, it returns only first, second, third. How do I read the 'This is' and 'line' parts? And how do I then store the result back to the XML file using XPath?
Thanks in advance.
The <b> tag inside the <para> tag is another element, not a formatting tag (in XML). Therefore, you need to traverse down to it.
Like @JRL says, the <b> tags are considered well-formed XML and, as a consequence, split out into separate nodes by your DOM processor.
I think you fail to read the other text nodes because you only read text when an XML node has no child nodes, which is not the case here.
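A small sketch of what the DOM looks like for one of those <para> elements: the text before and after <b> are separate text nodes, siblings of the <b> element, so walking getChildNodes() reaches all three. The class and method names (MixedContent, parts) are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class MixedContent {

    /** Lists the direct children of the document's root element, in document order. */
    static List<String> parts(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element para = doc.getDocumentElement();
        List<String> parts = new ArrayList<>();
        NodeList children = para.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                parts.add("text:" + child.getNodeValue());
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                // getTextContent() concatenates all text below the element.
                parts.add("element <" + child.getNodeName() + ">:" + child.getTextContent());
            }
        }
        return parts;
    }

    public static void main(String[] args) throws Exception {
        // Prints [text:This is, element <b>:first, text:line]
        System.out.println(parts("<para id=\"p1\">This is<b>first</b>line</para>"));
    }
}
```

To edit the paragraph without losing the <b> element, change the value of the individual text nodes with setNodeValue rather than replacing the whole content of <para>.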
