Reading XML online and Storing It (Using Java)

I followed an example from Stack Overflow (http://stackoverflow.com/questions/2310139/how-to-read-xml-response-from-a-url-in-java) showing how to read an XML file from a URL (see my code below). Now that the program reads the XML, how do I get it to store it? Ideally I would save the information to an XML file that is part of the project. For example, I have a blank XML file in the project; the program runs, reads the XML from the URL, and writes all of it into that blank file. Is this possible?
If anything is unclear, just ask me to clarify what I'm looking for.
Here is my code so far:
package xml.parsing.example;
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
public class XmlParser {
public static void main (String[] args) throws IOException, ParserConfigurationException, SAXException, TransformerException {
URL url = new URL("http://totheriver.com/learn/xml/code/employees.xml");
URLConnection conn = url.openConnection();
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(conn.getInputStream());
TransformerFactory tfactory = TransformerFactory.newInstance();
Transformer xform = tfactory.newTransformer();
// that’s the default xform; use a stylesheet to get a real one
xform.transform(new DOMSource(doc), new StreamResult(System.out));
}
}

Very simply:
File myOutput = new File("c:\\myDirectory\\myOutput.xml");
xform.transform(new DOMSource(doc), new StreamResult(myOutput));

This page has some great examples of how to serialize the DOM object to a neatly formatted XML file.
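
For completeness, here is a minimal, self-contained sketch of the save step. To stay runnable offline it parses an inline string instead of the URL from the question; the class name SaveXmlSketch, the helper names parse and save, and the temp-file location are illustrative assumptions, not part of the original code.

```java
import java.io.File;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class SaveXmlSketch {

    // Parses XML text into a DOM Document
    // (stand-in for builder.parse(conn.getInputStream()) in the question).
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Writes the Document to the given file with the default (identity) transformer.
    static void save(Document doc, File target) throws Exception {
        Transformer xform = TransformerFactory.newInstance().newTransformer();
        xform.transform(new DOMSource(doc), new StreamResult(target));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<employees><employee>Ann</employee></employees>");
        // Hypothetical output location; replace with your own path.
        File target = File.createTempFile("employees", ".xml");
        save(doc, target);
        // Read the file back to show what was stored.
        System.out.println(new String(Files.readAllBytes(target.toPath()), StandardCharsets.UTF_8));
    }
}
```

One thing to keep in mind: a file packaged inside a JAR cannot be overwritten this way, so the target should be an ordinary path on disk (for example, relative to the working directory).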

Related

How to parse the full content of a XML Tag in java

I have some kind of complex XML data structure. The structure contains different fragments like in the following example:
<data>
<content-part-1>
<h1>Hello <strong>World</strong>. This is some text.</h1>
<h2>.....</h2>
</content-part-1>
....
</data>
The h1 tag within the 'content-part-1' tag is the one of interest. I want to get the full content of the XML tag 'h1'.
In Java, I used javax.xml.parsers.DocumentBuilder and tried something like this:
String my_content="<h1>Hello <strong>World</strong>. This is some text.</h1>";
// parse h1 tag..
DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = documentBuilder.parse(new InputSource(new StringReader(my_content)));
Node node = doc.importNode(doc.getDocumentElement(), true);
if (node != null && node.getNodeName().equals("h1")) {
return node.getTextContent();
}
But the method 'getTextContent()' will return:
Hello World. This is some text.
The tag "strong" is removed by the xml parser (as it is the documented behavior).
My question is: how can I extract the full content of a single XML node within an org.w3c.dom.Document without any further parsing of the node content?

Although the Java DOM parser can handle mixed content, in this particular case it may be more convenient to use the Jsoup library. With it, the code to extract the h1 element content would be as follows:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
String text = "<data>\n"
+ " <content-part1>\n"
+ " <h1>Hello <strong>World</strong>. This is some text.</h1>\n"
+ " <h2></h2>\n"
+ " </content-part1>\n"
+ "</data>";
Document doc = Jsoup.parse(text);
Elements h1Elements = doc.select("h1");
for (Element h1 : h1Elements) {
System.out.println(h1.html());
}
Output in this case will be "Hello <strong>World</strong>. This is some text."

What you probably want is to generate XML from a subnode of your document.
With a slightly modified nodeToString from an earlier answer to a similar question, I can propose generating the text <h1>Hello <strong>World</strong>. This is some text.</h1>. Some extra effort might be needed to get rid of <h1> and </h1>.
package com.github.vtitov.test;
import org.junit.Test;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.io.StringReader;
import java.io.StringWriter;
import static org.hamcrest.MatcherAssert.*;
import static org.hamcrest.Matchers.*;
public class XmlTest {
@Test
public void buildXml() throws Exception {
String my_content="<h1>Hello <strong>World</strong>. This is some text.</h1>";
// parse h1 tag..
DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = documentBuilder.parse(new InputSource(new StringReader(my_content)));
Node node = doc.importNode(doc.getDocumentElement(), true);
String h1Content = null;
if (node != null && node.getNodeName().equals("h1")) {
h1Content = nodeToString(node);
}
assertThat("h1", h1Content, equalTo("<h1>Hello <strong>World</strong>. This is some text.</h1>"));
}
private static String nodeToString(Node node) throws TransformerException {
StringWriter sw = new StringWriter();
Transformer t = TransformerFactory.newInstance().newTransformer();
t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
t.setOutputProperty(OutputKeys.INDENT, "no");
t.transform(new DOMSource(node), new StreamResult(sw));
return sw.toString();
}
}
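
One possible way to get rid of the surrounding <h1> and </h1> is to serialize the element's child nodes one by one instead of the element itself. The sketch below assumes the JDK's default Transformer accepts text nodes as a DOMSource; the class name InnerXmlSketch and the helper innerXml are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class InnerXmlSketch {

    // Serializes every child of the element in document order,
    // so the element's own start and end tags are left out.
    static String innerXml(Element element) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter sw = new StringWriter();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            // Each transform appends to the same writer.
            t.transform(new DOMSource(children.item(i)), new StreamResult(sw));
        }
        return sw.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<h1>Hello <strong>World</strong>. This is some text.</h1>";
        Element h1 = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        System.out.println(innerXml(h1));
    }
}
```

The result keeps nested markup such as <strong> but drops the wrapping tag, which is the difference from getTextContent().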

Parse String to XML Document jdom2

I have a problem: I want to convert a string to an XML document.
But this code throws an exception:
Exception in thread "main" java.lang.ClassCastException: com.sun.org.apache.xerces.internal.dom.DeferredDocumentImpl cannot be cast to org.jdom2.Document
at ru.unicus.osp.Test.main(Test.java:17)
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.jdom2.Document;
import org.xml.sax.InputSource;
public class Test {
public static void main(String[] args) throws Exception {
DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
InputSource is = new InputSource();
is.setCharacterStream(new StringReader("<root><node1></node1></root>"));
Document doc = (Document) db.parse(is);
}
}
Got any ideas?

You mixed two libraries here. The simplest thing that can be done is to change the import from:
import org.jdom2.Document;
to
import org.w3c.dom.Document;
So the code should look like:
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
public class Test {
public static void main(String[] args) throws Exception {
DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
InputSource is = new InputSource();
is.setCharacterStream(new StringReader("<root><node1></node1></root>"));
Document doc = (Document) db.parse(is);
}
}

You are mixing two libraries here: DOM and JDOM.
Your variable is declared as a JDOM Document, but you are trying to cast a W3C DOM Document instance to it, and the cast fails.
If you want a JDOM Document as your output, then your code should likely be (SAXBuilder is in org.jdom2.input):
SAXBuilder builder = new SAXBuilder();
Document doc = builder.build(new StringReader("<root><node1></node1></root>"));
For a DOM document output, just use a DOM Document (org.w3c.dom.Document) instead of org.jdom2.Document.

StaX parsing: Transformer.transform method moves cursor automatically, not always nice

I am using XMLStreamReader to split an XML file. It mostly works, but still does not give the desired result. My aim is to extract every "nextTag" node from an input file:
<?xml version="1.0" encoding="UTF-8"?>
<firstTag>
<nextTag>1</nextTag>
<nextTag>2</nextTag>
</firstTag>
The outcome should look like this:
<?xml version="1.0" encoding="UTF-8"?><nextTag>1</nextTag>
<?xml version="1.0" encoding="UTF-8"?><nextTag>2</nextTag>
Referring to Split 1GB Xml file using Java I achieved my goal with this code:
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;
public class Demo4 {
public static void main(String[] args) throws Exception {
InputStream inputStream = new FileInputStream("input.xml");
BufferedReader in = new BufferedReader(new InputStreamReader(inputStream));
XMLInputFactory factory = XMLInputFactory.newInstance();
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
XMLStreamReader streamReader = factory.createXMLStreamReader(in);
while (streamReader.hasNext()) {
streamReader.next();
if (streamReader.getEventType() == XMLStreamReader.START_ELEMENT
&& "nextTag".equals(streamReader.getLocalName())) {
StringWriter writer = new StringWriter();
t.transform(new StAXSource(streamReader), new StreamResult(
writer));
String output = writer.toString();
System.out.println(output);
}
}
}
}
This is actually very simple. But my real input file consists of a single line:
<?xml version="1.0" encoding="UTF-8"?><firstTag><nextTag>1</nextTag><nextTag>2</nextTag></firstTag>
With this input, my Java code no longer produces the desired output, only this result:
<?xml version="1.0" encoding="UTF-8"?><nextTag>1</nextTag>
After spending hours on this, I am fairly sure I have found the reason:
t.transform(new StAXSource(streamReader), new StreamResult(writer));
After the transform method has executed, the cursor has automatically moved forward to the next event. And my code contains this fragment:
while (streamReader.hasNext()) {
streamReader.next();
...
t.transform(new StAXSource(streamReader), new StreamResult(writer));
...
}
After the first transform, the streamReader is effectively advanced twice:
1. by the transform method
2. by the next() call in the while loop
So, for this single-line XML, the cursor never stops on the second opening <nextTag> tag.
By contrast, if the input XML is pretty-printed, the cursor can reach the second <nextTag>, because there is a whitespace event after the first closing tag.
Unfortunately, I could not find any setting that stops the transformer from automatically advancing to the next event after the transform method runs. This is so frustrating.
Does anybody have an idea how I can deal with this? Comments on the approach itself are also very welcome. Thank you so much.
Regards,
Ratna
PS. I could write a workaround for this problem (pretty-print the XML document before transforming it), but that would mean modifying the input XML beforehand, which is not allowed.

As you worked out, the transformation step advances the cursor to the next event, so an element gets skipped when the element nodes directly follow each other.
To deal with this, you can rewrite your code using nested while loops, like this:
while(reader.next() != XMLStreamConstants.END_DOCUMENT) {
while(reader.getEventType() == XMLStreamConstants.START_ELEMENT && reader.getLocalName().equals("nextTag")) {
StringWriter writer = new StringWriter();
// transforms the current node to a String and leaves the cursor on the event that follows it
t.transform(new StAXSource(reader), new StreamResult(writer));
System.out.println(writer.toString());
}
}
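
A self-contained version of the same idea, as a rough sketch: it reads the XML from an in-memory string rather than input.xml and collects the fragments in a list. The class name StaxSplitSketch and the helper split are invented for this example, and the behaviour relies on the cursor movement described in the question.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;

class StaxSplitSketch {

    // Returns one serialized fragment per element named tagName.
    static List<String> split(String xml, String tagName) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        Transformer t = TransformerFactory.newInstance().newTransformer();
        List<String> fragments = new ArrayList<>();
        while (reader.hasNext()) {
            reader.next();
            // transform() already advances the cursor past the element,
            // so re-check the current event before calling next() again.
            while (reader.getEventType() == XMLStreamConstants.START_ELEMENT
                    && tagName.equals(reader.getLocalName())) {
                StringWriter writer = new StringWriter();
                t.transform(new StAXSource(reader), new StreamResult(writer));
                fragments.add(writer.toString());
            }
        }
        reader.close();
        return fragments;
    }

    public static void main(String[] args) throws Exception {
        String singleLine = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<firstTag><nextTag>1</nextTag><nextTag>2</nextTag></firstTag>";
        for (String fragment : split(singleLine, "nextTag")) {
            System.out.println(fragment);
        }
    }
}
```

The outer loop uses hasNext() instead of comparing against END_DOCUMENT, so it also stops cleanly if a transform leaves the cursor at the very end of the document.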

In case your XML file fits in memory, you can try the jOOX library, imported in Gradle like:
compile 'org.jooq:joox:1.3.0'
And the main class, like:
import java.io.File;
import java.io.IOException;
import org.joox.JOOX;
import org.joox.Match;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import static org.joox.JOOX.$;
public class Main {
public static void main(String[] args)
throws IOException, SAXException, TransformerException {
DocumentBuilder builder = JOOX.builder();
Document document = builder.parse(new File(args[0]));
Transformer transformer =
TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty("omit-xml-declaration", "no");
final Match $m = $(document);
$m.find("nextTag").forEach(tag -> {
try {
transformer.transform(
new DOMSource(tag),
new StreamResult(System.out));
System.out.println();
}
catch (TransformerException e) {
System.exit(1);
}
});
}
}
It yields:
<?xml version="1.0" encoding="UTF-8"?><nextTag>1</nextTag>
<?xml version="1.0" encoding="UTF-8"?><nextTag>2</nextTag>

Printing XML in Java without the xml file tag

Is there a way to print the XML content without the XML header tag in Java?
For example if I have an XML like this:
<?xml version='1.0' encoding='UTF-8'?>
<rootElement>
<childElement>Text</childElement>
</rootElement>
I just want to print
<rootElement>
<childElement>Text</childElement>
</rootElement>
This is very similar to what I am doing so far:
http://sacrosanctblood.blogspot.com/2008/07/convert-xml-file-to-xml-string-in-java.html
I cannot give out the exact source code but the above link example should give you some idea. Here's that code with imports:
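import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;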
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Text;
public String convertXMLFileToString(String fileName)
{
try{
DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
InputStream inputStream = new FileInputStream(new File(fileName));
org.w3c.dom.Document doc = documentBuilderFactory.newDocumentBuilder().parse(inputStream);
StringWriter stw = new StringWriter();
Transformer serializer = TransformerFactory.newInstance().newTransformer();
serializer.transform(new DOMSource(doc), new StreamResult(stw));
return stw.toString();
}
catch (Exception e) {
e.printStackTrace();
}
return null;
}

Transformer serializer = TransformerFactory.newInstance().newTransformer();
serializer.setOutputProperty("omit-xml-declaration", "yes");
serializer.transform(new DOMSource(doc), new StreamResult(stw));
Good old XSL ;).
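
Put together as a runnable sketch (parsing from a string instead of a file to keep it self-contained; the class name OmitDeclarationSketch and the helper withoutDeclaration are made up for this example). It uses the OutputKeys.OMIT_XML_DECLARATION constant, which stands for the same "omit-xml-declaration" property name:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class OmitDeclarationSketch {

    // Parses the XML text and serializes it again without the <?xml ...?> header.
    static String withoutDeclaration(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter stw = new StringWriter();
        serializer.transform(new DOMSource(doc), new StreamResult(stw));
        return stw.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version='1.0' encoding='UTF-8'?>"
                + "<rootElement><childElement>Text</childElement></rootElement>";
        System.out.println(withoutDeclaration(xml));
    }
}
```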

Standardise on a XML reader methodology

In an open source project I maintain, we have at least three different ways of reading, processing and writing XML files, and I would like to standardise on a single method for ease of maintenance and stability.
Currently all of the project files, from the configuration to the stored data, use XML. We're hoping to migrate to a simple database at some point in the future, but will still need to read and write some form of XML files.
The data is stored in an XML format, which we then transform into the final HTML files using an XSLT engine (Saxon).
We currently utilise these methods:
- XMLEventReader/XMLOutputFactory (javax.xml.stream)
- DocumentBuilderFactory (javax.xml.parsers)
- JAXBContext (javax.xml.bind)
Are there any obvious pros and cons to each of these?
Personally, I like the simplicity of DOM (Document Builder), but I'm willing to convert to one of the others if it makes sense in terms of performance or other factors.
Edited to add:
There can be a significant number of files read and written when the project runs: between 100 and 10,000 individual files of around 5 KB each.

It depends on what you are doing with the data.
If you are simply performing XSLT transforms on XML files to produce HTML files then you may not need to touch a parser directly:
import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
public class Demo {
public static void main(String[] args) throws Exception {
TransformerFactory tf = TransformerFactory.newInstance();
StreamSource xsltTransform = new StreamSource(new File("xslt.xml"));
Transformer transformer = tf.newTransformer(xsltTransform);
StreamSource source = new StreamSource(new File("source.xml"));
StreamResult result = new StreamResult(new File("result.html"));
transformer.transform(source, result);
}
}
If you need to make changes to the input document before you transform it, DOM is a convenient mechanism for doing this:
import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
public class Demo {
public static void main(String[] args) throws Exception {
TransformerFactory tf = TransformerFactory.newInstance();
StreamSource xsltTransform = new StreamSource(new File("xslt.xml"));
Transformer transformer = tf.newTransformer(xsltTransform);
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document document = db.parse(new File("source.xml"));
// modify the document
DOMSource source = new DOMSource(document);
StreamResult result = new StreamResult(new File("result.html"));
transformer.transform(source, result);
}
}
If you prefer a typed model to make changes to the data then JAXB is a perfect fit:
import java.io.File;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.util.JAXBSource;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
public class Demo {
public static void main(String[] args) throws Exception {
TransformerFactory tf = TransformerFactory.newInstance();
StreamSource xsltTransform = new StreamSource(new File("xslt.xml"));
Transformer transformer = tf.newTransformer(xsltTransform);
JAXBContext jc = JAXBContext.newInstance("com.example.model");
Unmarshaller unmarshaller = jc.createUnmarshaller();
// Model is a placeholder for your own JAXB-annotated root class in com.example.model
Model model = (Model) unmarshaller.unmarshal(new File("source.xml"));
// modify the domain model
JAXBSource source = new JAXBSource(jc, model);
StreamResult result = new StreamResult(new File("result.html"));
transformer.transform(source, result);
}
}

This is a fairly subjective topic. It mainly depends on how you are going to use the XML and on its size. If the XML is always small enough to be loaded into memory, you don't have to worry about the memory footprint and can use a DOM parser. If you need to parse a 150 MB XML file, you may want to consider SAX instead.
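
To give a feel for the SAX style, here is a small sketch that counts elements by name without building a tree in memory. The class name SaxCountSketch, the helper countElements, and the sample input are assumptions made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxCountSketch {

    // Streams through the input and counts start tags whose qualified name matches.
    static int countElements(InputSource input, String name) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                if (name.equals(qName)) {
                    count[0]++;
                }
            }
        };
        // The default factory is not namespace-aware, so qName carries the element name.
        SAXParserFactory.newInstance().newSAXParser().parse(input, handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<firstTag><nextTag>1</nextTag><nextTag>2</nextTag></firstTag>";
        int n = countElements(new InputSource(new StringReader(xml)), "nextTag");
        System.out.println(n);
    }
}
```

For a real large file you would pass an InputSource over a FileInputStream instead of a string; the handler only ever sees one event at a time, which is what keeps memory use flat.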
