How to modify a huge XML file by StAX? - java

I have a huge XML (~2GB) and I need to add new Elements and modify the old ones. For example, I have:
<books>
<book>....</book>
...
<book>....</book>
</books>
And want to get:
<books>
<book>
<index></index>
....
</book>
...
<book>
<index></index>
....
</book>
</books>
I used the following code:
XMLInputFactory inFactory = XMLInputFactory.newInstance();
XMLEventReader eventReader = inFactory.createXMLEventReader(new FileInputStream(file));
XMLOutputFactory factory = XMLOutputFactory.newInstance();
XMLStreamWriter writer = factory.createXMLStreamWriter(new FileWriter(file, true));
while (eventReader.hasNext()) {
XMLEvent event = eventReader.nextEvent();
if (event.getEventType() == XMLEvent.START_ELEMENT) {
if (event.asStartElement().getName().toString().equalsIgnoreCase("book")) {
writer.writeStartElement("index");
writer.writeEndElement();
}
}
}
writer.close();
But the result was the following:
<books>
<book>....</book>
....
<book>....</book>
</books><index></index>
Any ideas?

Try this:
XMLInputFactory inFactory = XMLInputFactory.newInstance();
XMLEventReader eventReader = inFactory.createXMLEventReader(new FileInputStream("1.xml"));
XMLOutputFactory factory = XMLOutputFactory.newInstance();
// write to a different file than the one being read, and do not append
XMLEventWriter writer = factory.createXMLEventWriter(new FileWriter(file));
XMLEventFactory eventFactory = XMLEventFactory.newInstance();
while (eventReader.hasNext()) {
    XMLEvent event = eventReader.nextEvent();
    // copy every event from the input to the output
    writer.add(event);
    if (event.getEventType() == XMLEvent.START_ELEMENT) {
        if (event.asStartElement().getName().getLocalPart().equals("book")) {
            // insert an empty <index> element right after each <book> start tag
            writer.add(eventFactory.createStartElement("", "", "index"));
            writer.add(eventFactory.createEndElement("", "", "index"));
        }
    }
}
writer.close();
Notes
new FileWriter(file, true) appends to the end of the file, which is almost certainly not what you want here.
equalsIgnoreCase("book") is a bad idea because XML is case-sensitive.

Well it is pretty clear why it behaves the way it does. What you are actually doing is opening the existing file in output append mode and writing elements at the end. That clearly contradicts what you are trying to do.
(Aside: I'm surprised that it works as well as it does, given that the input side is likely to see the elements that the output side is adding to the end of the file. And indeed exceptions like the ones Evgeniy Dorofeev's example gives are the sort of thing I'd expect. The problem is that if you attempt to read and write a text file at the same time, and either the reader or the writer uses any form of buffering, explicit or implicit, the reader is liable to see partial states.)
To fix this you have to start by reading from one file and writing to a different file. Appending won't work. Then you have to arrange that the elements, attributes, content etc that are read from the input file are copied to the output file. Finally, you need to add the extra elements at the appropriate points.
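The three steps above (read from one file, copy every event to a second file, add the extra elements at the right points) can be sketched as follows. This is only a sketch under assumptions: the class name AddIndexToBooks, the method addIndex and the temporary-file handling are made up for illustration, and replacing the original with Files.move at the end is one possible choice, not something the question requires.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class AddIndexToBooks {
    // Copies every event from source to target and adds an empty <index>
    // element right after each <book> start tag (hypothetical helper).
    static void addIndex(Path source, Path target) throws IOException, XMLStreamException {
        XMLEventFactory events = XMLEventFactory.newInstance();
        try (InputStream in = Files.newInputStream(source);
             OutputStream out = Files.newOutputStream(target)) {
            XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
            XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out, "UTF-8");
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                writer.add(event); // copy the event unchanged
                if (event.isStartElement()
                        && "book".equals(event.asStartElement().getName().getLocalPart())) {
                    writer.add(events.createStartElement("", "", "index"));
                    writer.add(events.createEndElement("", "", "index"));
                }
            }
            writer.close();
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // Small sample input; a real run would point at the 2 GB file instead.
        Path source = Files.createTempFile("books", ".xml");
        Files.write(source, "<books><book>a</book><book>b</book></books>"
                .getBytes(StandardCharsets.UTF_8));
        // Write to a second file in the same directory, then swap it in.
        Path temp = Files.createTempFile(source.getParent(), "books", ".tmp");
        addIndex(source, temp);
        Files.move(temp, source, StandardCopyOption.REPLACE_EXISTING);
        System.out.println(new String(Files.readAllBytes(source), StandardCharsets.UTF_8));
    }
}
```

Because only one event is held at a time, memory use should stay roughly constant regardless of the input size. Whether the empty element is written as a start/end pair or as a self-closing tag depends on the StAX implementation in use.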
And is there any possibility to open the XML file in mode like RandomAccessFile, but write in it by StAX methods?
No. That is theoretically impossible. In order to be able to navigate around an XML file's structure in a "random" fashion, you'd first need to parse the whole thing and build an index of where all the elements are. Even when you've done that, the XML is still stored as characters in a file, and random access does not allow you to insert and remove characters in the middle of a file.
Maybe your best bet would be combining XSL and a SAX style parser; e.g. something along the lines of this IBM article: http://ibm.com/developerworks/xml/library/x-tiptrax

Maybe this StAX Read-and-Write Example in JavaEE tutorial helps: http://docs.oracle.com/javaee/5/tutorial/doc/bnbfl.html#bnbgq
You can download the tutorial examples here: https://java.net/projects/javaeetutorial/downloads

Related

How to parse XML in java to save values in db using STAX?

I am new to the Java StAX parser and I have to parse an XML file to populate my database table.
While trying to read XML file using STAX I came across this problem.
In an XML file, I may have child nodes with the same name in different root nodes. I couldn't quite figure out how to read specific child nodes from root nodes.
XML File Sample:-
<?xml version="1.0" encoding="UTF-8"?>
<DOC xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="XML.xsd">
<FMTR>
<TITLEPG>
<TITLENUM>Title 1</TITLENUM>
<SUBJECT>Test 1</SUBJECT>
</TITLEPG>
<BTITLE>
<P></P>
</BTITLE>
<TOC>
<EXPL>
<SUBJECT>Explanation</SUBJECT>
</EXPL>
<TITLENO>
<CHAPTI>
<SUBJECT>Chapter I—Test 1</SUBJECT>
</CHAPTI>
</TITLENO>
<FAIDS>
<SUBJECT>Table of Titles and Chapters</SUBJECT>
<SUBJECT>Alphabetical List</SUBJECT>
</FAIDS>
</TOC>
</FMTR>
</DOC>
For example, I have to read the SUBJECT tag inside the TITLEPG tag and populate the database table accordingly.
Can we get the child nodes of a given parent node using StAX?
What is the best approach to parse it: StAX or JDOM?
The XMLEventReader class in Java StAX is an Iterator based API for reading XML files. It will let you move from event to event in the XML, allowing you to decide when to move to the next event.
You are looking for events here. Please bear in mind that StAX and XPath are very different things: StAX lets you parse a streaming XML document in the forward direction only.
You can create an XMLEventReader via the javax.xml.stream.XMLInputFactory class. Try running the demo code below (adapt it as required) and you will see the objects you want.
XMLInputFactory factory = XMLInputFactory.newInstance();
// get a Reader connected to the XML input from somewhere (getXmlReader() is a placeholder)
Reader reader = getXmlReader();
// declared outside the try block so that the loop further down can use it
XMLEventReader eventReader = null;
try {
    eventReader = factory.createXMLEventReader(reader);
} catch (XMLStreamException e) {
    e.printStackTrace();
}
Now play around with this eventReader to get what you want: iterate over it and inspect the events in debug mode.
while (eventReader.hasNext()) {
    // this is what you want...
    XMLEvent event = eventReader.nextEvent(); // may throw XMLStreamException
    if (event.getEventType() == XMLStreamConstants.START_ELEMENT) {
        StartElement startElement = event.asStartElement();
        System.out.println(startElement.getName().getLocalPart());
    }
    // handle more event types here...
}
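To pick out SUBJECT only when it sits directly inside TITLEPG, one option is to remember which elements are currently open while iterating. The following is a minimal sketch, assuming the wanted element contains only text; the class name SubjectReader and the method childText are hypothetical, and the XML in main is a shortened version of the sample from the question.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class SubjectReader {
    // Returns the text of every <child> element whose direct parent is <parent>.
    static List<String> childText(Reader xml, String parent, String child)
            throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(xml);
        Deque<String> open = new ArrayDeque<>(); // names of the currently open elements
        List<String> result = new ArrayList<>();
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                String name = event.asStartElement().getName().getLocalPart();
                if (name.equals(child) && parent.equals(open.peek())) {
                    // getElementText() also consumes the matching end tag,
                    // so this element is never pushed onto the stack
                    result.add(reader.getElementText());
                } else {
                    open.push(name);
                }
            } else if (event.isEndElement()) {
                open.pop();
            }
        }
        reader.close();
        return result;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<DOC><FMTR><TITLEPG><TITLENUM>Title 1</TITLENUM>"
                + "<SUBJECT>Test 1</SUBJECT></TITLEPG>"
                + "<TOC><EXPL><SUBJECT>Explanation</SUBJECT></EXPL></TOC></FMTR></DOC>";
        // Only the SUBJECT under TITLEPG is collected, not the one under EXPL.
        System.out.println(childText(new StringReader(xml), "TITLEPG", "SUBJECT"));
    }
}
```

The collected strings can then be handed to whatever code populates the database table; that part is outside the scope of this sketch.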

XMLStreamReader - What happens at the end of the file?

When traversing an XML document like so
while(streamReader.hasNext()){
streamReader.next();
if(streamReader.getEventType() == XMLStreamReader.START_ELEMENT){
System.out.println(streamReader.getLocalName());
}
}
Do I need to create a new streamReader if I need to traverse the XML document again, like so?
XMLStreamReader streamReader =
factory.createXMLStreamReader(reader);
I don't see a method like 'reset()' to move the cursor back to the start of the XML file
Yes, you should create a new reader at that point.
If you need to traverse the document multiple times, do you definitely want to parse it in a streaming fashion in the first place, rather than loading it into a DOM of some description?
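A small sketch of the "create a new reader" advice: each pass gets its own XMLStreamReader over a fresh input. The class name TwoPasses and the method countElements are made up for illustration, and a String is used as input only to keep the example self-contained; with a file you would reopen the stream for each pass.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class TwoPasses {
    // Counts start tags; every call opens a fresh reader over a fresh input.
    static int countElements(XMLInputFactory factory, String xml) throws XMLStreamException {
        XMLStreamReader streamReader = factory.createXMLStreamReader(new StringReader(xml));
        int count = 0;
        try {
            while (streamReader.hasNext()) {
                if (streamReader.next() == XMLStreamConstants.START_ELEMENT) {
                    count++;
                }
            }
        } finally {
            streamReader.close();
        }
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance(); // the factory can be reused
        String xml = "<books><book/><book/></books>";
        System.out.println(countElements(factory, xml)); // first pass
        System.out.println(countElements(factory, xml)); // second pass, new reader
    }
}
```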

Generating Output in JAVA

Can we generate an .html doc using Java? Usually we get output in the cmd prompt when we run Java programs. I want to generate output in .html or .doc format. Is there a way to do it in Java?
For HTML
Just write data into .html file (they are simply text files with .html extension), using raw file io operation
For Example :
StringBuilder sb = new StringBuilder();
sb.append("<html>");
sb.append("<head>");
sb.append("<title>Title Of the page");
sb.append("</title>");
sb.append("</head>");
sb.append("<body> <b>Hello World</b>");
sb.append("</body>");
sb.append("</html>");
FileWriter fstream = new FileWriter("MyHtml.html");
BufferedWriter out = new BufferedWriter(fstream);
out.write(sb.toString());
out.close();
For word document
This thread answers it
HTML is simply plain text with a bunch of tags, as others have answered. My suggestion, if you are doing something that is more complex than just outputting a basic HTML snippet, is to use a template engine such as StringTemplate.
StringTemplate lets you create a text file (actually, a HTML file) that looks like this:
<html>
<head>
<title>Example</title>
</head>
<body>
<p>Hello $name$</p>
</body>
</html>
That is your template. Then in your Java code, you would fill in the $name$ placeholder like this and then output the resulting HTML page:
StringTemplate page = group.getInstanceOf("page");
page.setAttribute("name", "World");
System.out.println(page.toString());
This will print out the following result on your screen:
<html>
<head>
<title>Example</title>
</head>
<body>
<p>Hello World</p>
</body>
</html>
Of course, the above example Java code isn't the complete code, but it illustrates how to use a template that's still valid HTML (makes it easier to edit in a HTML editor) while keeping your Java code simple (by avoiding having a bunch of HTML tags in your System.out.println statements).
As for MS Office .doc format, that is more complex and you can look into Apache POI for that.
I felt that need in the past and ended up developing a Java library, HtmlFlow (deployed at Maven Central Repository), that provides a simple API to write HTML in a fluent style. Check it here: https://github.com/fmcarvalho/HtmlFlow.
You can use HtmlFlow with or without data binding, but here I present an example of binding the properties of a Task object to HTML elements. Consider a Task Java class with three properties: Title, Description and Priority. We can then produce an HTML document for a Task object in the following way:
import htmlflow.HtmlView;
import model.Priority;
import model.Task;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;
public class App {
private static HtmlView<Task> taskDetailsView(){
HtmlView<Task> taskView = new HtmlView<>();
taskView
.head()
.title("Task Details")
.linkCss("https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css");
taskView
.body().classAttr("container")
.heading(1, "Task Details")
.hr()
.div()
.text("Title: ").text(Task::getTitle)
.br()
.text("Description: ").text(Task::getDescription)
.br()
.text("Priority: ").text(Task::getPriority);
return taskView;
}
public static void main(String [] args) throws IOException{
HtmlView<Task> taskView = taskDetailsView();
Task task = new Task("Special dinner", "Have dinner with someone!", Priority.Normal);
try(PrintStream out = new PrintStream(new FileOutputStream("Task.html"))){
taskView.setPrintStream(out).write(task);
Runtime.getRuntime().exec("explorer Task.html");
}
}
}
Output is just output. What it means and how you use it is entirely up to you.
If you System.out.println("<p>Hello world!</p>"); you just produced HTML.
The .doc format is obviously a bit trickier, since it's not a simple matter of putting in tags, but there are libraries to get the job done. Google can suggest more than a few.
HTML is just plain text. Just write the HTML code to a file or standard out.
Word files are more complicated. Have a look at libraries such as Apache POI.
I don't know why you say this:
Usually we get output in the cmd prompt when we run Java programs.
I've been running some Java programs today, but they do not do anything with a cmd prompt. If you use System.out.println, yes, but most advanced programs have a little bit more for communication. Like an interface :)
What you want to do is look into file handling. Open (or create) a file, write content to that file, and close it. Then you have a file. You can write anything you want to that file, so obviously also something that would make it an HTML file or a doc. It's easy to find how-tos on file writing.
Check this:
// try-with-resources closes the writer even if write() fails
try (BufferedWriter out = new BufferedWriter(new FileWriter("outfilename.html"))) {
    out.write(aString); // pass your output here
} catch (IOException e) {
    e.printStackTrace(); // do not silently swallow the error
}
You will need to import BufferedWriter, FileWriter and IOException, which are in java.io.
aString should be a String variable that holds the HTML code or the doc XML.
Sure.
The general approach: You create the document in memory, namely in a StringBuilder and write the content of that builder to a file.
StringBuilder htmlBuilder = new StringBuilder();
htmlBuilder.append("<html><body>");
htmlBuilder.append("Hello world!");
htmlBuilder.append("</body></html>\n");
FileWriter writer = new FileWriter(System.getProperty("user.home") + "/hello.html");
writer.write(htmlBuilder.toString());
writer.close();
Put this in a main method, execute and you'll find a html file in your home directory
To generate an HTML document, you should write to a file. Since HTML is a text format, you would write to a text file. Doing this requires these classes
java.io.File - this represents locations in your file system
java.io.FileWriter - this establishes a connection from your program to a file
java.io.BufferedWriter - this enables buffered writing of text, which is much faster
java.io.IOException - one of these nasties is thrown if there is a problem writing to the file. It is a checked (vs. runtime) exception and you must handle it.
The Head First Java book contains very nice coverage of these classes and shows you how to use them. To use them you must first know about exception handling, which is also covered in Head First Java.
I hope this gets you started.
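As a rough illustration of how the four classes listed above fit together, here is a minimal sketch. The class name HtmlFileWriter, the method writeHtml and the file name hello.html are made up for this example, and the title and body strings are inserted as-is without any HTML escaping.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class HtmlFileWriter {
    // Writes a tiny HTML page to the given file; the caller handles IOException.
    static void writeHtml(File target, String title, String body) throws IOException {
        // try-with-resources flushes and closes the writer, even on failure
        try (BufferedWriter out = new BufferedWriter(new FileWriter(target))) {
            out.write("<html><head><title>" + title + "</title></head>");
            out.newLine();
            out.write("<body>" + body + "</body></html>");
            out.newLine();
        }
    }

    public static void main(String[] args) {
        File target = new File(System.getProperty("user.home"), "hello.html");
        try {
            writeHtml(target, "Title Of the page", "<b>Hello World</b>");
            System.out.println("Wrote " + target.getAbsolutePath());
        } catch (IOException e) {
            // checked exception: it has to be caught or declared
            e.printStackTrace();
        }
    }
}
```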
A very straightforward and reliable approach to creation of plain HTML may be based on a SAX handler and default XSLT transformer, the latter having intrinsic capability of HTML output:
String encoding = "UTF-8";
FileOutputStream fos = new FileOutputStream("myfile.html");
OutputStreamWriter writer = new OutputStreamWriter(fos, encoding);
StreamResult streamResult = new StreamResult(writer);
SAXTransformerFactory saxFactory =
(SAXTransformerFactory) TransformerFactory.newInstance();
TransformerHandler tHandler = saxFactory.newTransformerHandler();
tHandler.setResult(streamResult);
Transformer transformer = tHandler.getTransformer();
transformer.setOutputProperty(OutputKeys.METHOD, "html");
transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
writer.write("<!DOCTYPE html>\n");
writer.flush();
tHandler.startDocument();
tHandler.startElement("", "", "html", new AttributesImpl());
tHandler.startElement("", "", "head", new AttributesImpl());
tHandler.startElement("", "", "title", new AttributesImpl());
tHandler.characters("Hello".toCharArray(), 0, 5);
tHandler.endElement("", "", "title");
tHandler.endElement("", "", "head");
tHandler.startElement("", "", "body", new AttributesImpl());
tHandler.startElement("", "", "p", new AttributesImpl());
tHandler.characters("5 > 3".toCharArray(), 0, 5); // note '>' character
tHandler.endElement("", "", "p");
tHandler.endElement("", "", "body");
tHandler.endElement("", "", "html");
tHandler.endDocument();
writer.close();
Note that XSLT transformer will release you from the burden of escaping special characters like >, as it takes necessary care of it by itself.
And it is easy to wrap SAX methods like startElement() and characters() to something more convenient to one's taste...
And it may be worth noting that dealing without templates and document allocation in memory (e.g. DOM) gives you more freedom in terms of the resulting document size...
If you have some document-like (structured) data, I'd suggest using DOM (Document Object Model) and then converting it to the desired format (XML, HTML, doc, whatever). But if you just have some application output, you can easily wrap it with HTML. Not necessarily within Java: you can also store your program's output in a plain text file and convert it to HTML later (add body, paragraphs, headers and other HTML elements).

How to create XML file?

I have some data which my program discovers after observing a few things about files.
For instance, i know file name, time file was last changed, whether file is binary or ascii text, file content (assuming it is properties) and some other stuff.
i would like to store this data in XML format.
How would you go about doing it?
Please provide example.
If you want something quick and relatively painless, use XStream, which lets you serialise Java Objects to and from XML. The tutorial contains some quick examples.
Use StAX; it's so much easier than SAX or DOM to write an XML file (DOM is probably the easiest to read an XML file but requires you to have the whole thing in memory), and is built into Java SE 6.
A good demo is found here on p.2:
OutputStream out = new FileOutputStream("data.xml");
XMLOutputFactory factory = XMLOutputFactory.newInstance();
// pass the encoding to the factory so the bytes written match the XML declaration
XMLStreamWriter writer = factory.createXMLStreamWriter(out, "ISO-8859-1");
writer.writeStartDocument("ISO-8859-1", "1.0");
writer.writeStartElement("greeting");
writer.writeAttribute("id", "g1");
writer.writeCharacters("Hello StAX");
writer.writeEndDocument(); // also closes the still-open greeting element
writer.flush();
writer.close();
out.close();
The standard option is the W3C DOM API that ships with the JDK.
final Document docToSave = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
final Element fileInfo = docToSave.createElement("fileInfo");
docToSave.appendChild(fileInfo);
final Element fileName = docToSave.createElement("fileName");
// setNodeValue() has no effect on an Element; setTextContent() adds the text child
fileName.setTextContent("filename.bin");
fileInfo.appendChild(fileName);
return docToSave;
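The snippet above only builds the Document in memory; it still has to be written out. One way to do that with the JDK alone is an identity Transformer. This is a sketch, not part of the original answer: the class name FileInfoXml and the methods build and serialize are hypothetical, and it serializes to a String so that it can run without touching the file system.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class FileInfoXml {
    // Builds <fileInfo><fileName>...</fileName></fileInfo> in memory.
    static Document build(String name) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element fileInfo = doc.createElement("fileInfo");
        doc.appendChild(fileInfo);
        Element fileName = doc.createElement("fileName");
        fileName.setTextContent(name); // text content of the element
        fileInfo.appendChild(fileName);
        return doc;
    }

    // Serializes the document with an identity transformer (no stylesheet).
    static String serialize(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize(build("filename.bin")));
    }
}
```

To write to disk instead, the StreamResult can wrap a java.io.File rather than a StringWriter.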
XML is almost never the easiest thing to do.
You can use SAX or DOM to do that; see this link: https://web.archive.org/web/1/http://articles.techrepublic%2ecom%2ecom/5100-10878_11-1044810.html
I think that is what you want.

SAXReader does not re-escape characters

I'm reading a XML file with dom4j. The file looks like this:
...
<Field>
hello, world...</Field>
...
I read the file with SAXReader into a Document. When I use getText() on the node I obtain the following String:
\r\n hello, world...
I do some processing and then write another file using asXml(). But the characters are not escaped as in the original file which results in error in the external system which uses the file.
How can I escape the special character and have
when writing the file?
You cannot easily. Those aren't 'escapes', they are 'character entities'. They are a fundamental part of XML. Xerces has some very complex support for 'unparsed entities', but I doubt that it applies to these, as opposed to the species that are defined in a DTD.
It depends on what you're getting and what you want (see my previous comment.)
The SAX reader is doing nothing wrong: your XML is giving you a literal newline character. If you control this XML, then instead of the newline characters, you will need to insert a \ (backslash) character followed by the "r" or "n" characters (or both).
If you do not control this XML, then you will need to do a literal conversion of the newline character to "\r\n" after you've gotten your string back. In C# it would be something like:
myString = myString.Replace("\r\n", "\\r\\n");
XML entities are abstracted away in DOM. Content is exposed with String without the need to bother about the encoding -- which in most of the case is what you want.
But SAX has some support for controlling how entities are processed. You could try to create an XMLReader with a custom EntityResolver#resolveEntity, and pass it as a parameter to the SAXReader. But I fear it may not work:
The Parser will call this method
before opening any external entity
except the top-level document entity
(including the external DTD subset,
external entities referenced within
the DTD, and external entities
referenced within the document
element)
Otherwise you could try to configure a LexicalHandler for SAX in a way to be notified when an entity is encountered. Javadoc for LexicalHandler#startEntity says:
Report the beginning of some internal
and external XML entities.
You will not be able to change the resolving, but that may still help.
EDIT
You must read and write XML with the SAXReader and XMLWriter provided by dom4j. See reading a XML file and writing an XML file. Don't use asXml() and dump the file yourself.
FileOutputStream fos = new FileOutputStream("simple.xml");
OutputFormat format = OutputFormat.createPrettyPrint();
XMLWriter writer = new XMLWriter(fos, format);
writer.write(doc);
writer.flush();
You can pre-process the input stream to replace & with, e.g., [$AMPERSAND_CHARACTER$], then do the processing with dom4j, and post-process the output stream to apply the reverse substitution.
Example (using streamflyer):
import com.github.rwitzel.streamflyer.util.ModifyingReaderFactory;
import com.github.rwitzel.streamflyer.util.ModifyingWriterFactory;
// Pre-process
Reader originalReader = new InputStreamReader(myInputStream, "utf-8");
Reader modifyingReader = new ModifyingReaderFactory().createRegexModifyingReader(originalReader, "&", "[\\$AMPERSAND_CHARACTER\\$]");
// Read and modify XML via dom4j
SAXReader xmlReader = new SAXReader();
Document xmlDocument = xmlReader.read(modifyingReader);
// ...
// Post-process
Writer originalWriter = new OutputStreamWriter(myOutputStream, "utf-8");
Writer modifyingWriter = new ModifyingWriterFactory().createRegexModifyingWriter(originalWriter, "\\[\\$AMPERSAND_CHARACTER\\$\\]", "&");
// Write to output stream
OutputFormat xmlOutputFormat = OutputFormat.createPrettyPrint();
XMLWriter xmlWriter = new XMLWriter(modifyingWriter, xmlOutputFormat);
xmlWriter.write(xmlDocument);
xmlWriter.close();
You can also use FilterInputStream/FilterOutputStream, PipedInputStream/PipedOutputStream, or ProxyInputStream/ProxyOutputStream for pre- and post-processing.
