I am using InputSource to parse a large XML file (1.2 MB). I need to keep this InputSource in memory; I do not want to keep reloading it. What's the best way to do this? I have tried using a singleton, but the SAX parser complains that the document is missing an end tag after the second time the object reference is accessed.
Any suggestions would be greatly appreciated.
InputStream ins = getResources().openRawResource(R.raw.cfr_title_index);
InputSource xmlSource = new InputSource(ins);
MySinglton.xmlInput = xmlSource;
Thanks
Streams are "read once". You should not plan to hang onto it. If you need its contents in memory, read them into an object or data structure and cache it.
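A minimal sketch of that idea, assuming your singleton holds a byte array (the xmlBytes field here is illustrative): cache the bytes once, then build a fresh InputSource over them for every parse:

// Read the raw resource fully, once.
InputStream ins = getResources().openRawResource(R.raw.cfr_title_index);
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
byte[] chunk = new byte[8192];
int n;
while ((n = ins.read(chunk)) != -1) {
    buffer.write(chunk, 0, n);
}
ins.close();
MySinglton.xmlBytes = buffer.toByteArray();

// Later, for each parse, wrap the cached bytes in a brand-new InputSource.
InputSource xmlSource = new InputSource(new ByteArrayInputStream(MySinglton.xmlBytes));

Because each parse gets its own stream over the same cached bytes, the parser never sees a half-consumed stream, which is what produced the "missing end tag" error.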
Related
I am running into some out of memory exceptions when reading in very very large XML strings and converting them into a Document object.
The way I am doing this is I am opening a URL stream to the XML file, wrapping that in an InputStreamReader, then wrapping that in a BufferedReader.
Then I read from the BufferedReader and append to a StringBuffer:
StringBuffer doc = new StringBuffer();
BufferedReader in = new BufferedReader(new InputStreamReader(downloadURL.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
doc.append(inputLine);
}
Now this is the part I am having an issue with. I am using toString on the StringBuffer to be able to get the bytes to create a byte array which is then used to create a ByteArrayInputStream. I believe that this step is causing me to have the same data in memory twice, is that right?
Here is what I am doing:
byte[] xmlBytes = doc.toString().getBytes();
ByteArrayInputStream is = new ByteArrayInputStream(xmlBytes);
XMLReader xmlReader = XMLReaderFactory.createXMLReader();
Builder xmlBuilder = new Builder(xmlReader,false);
Document d = xmlBuilder.build(is);
Is there a way I can avoid creating duplicate memory (if I am doing so in the first place) or is there a way to convert the BufferedReader straight into a ByteArrayInputStream?
Thanks
Here is how you can consume an InputStream to create a Document using a DOM parser:
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document document = builder.parse(inputStream);
This creates fewer intermediate copies. However, if the XML document is very large, instead of parsing it completely in memory, the best solution is to use a StAX parser.
With a StAX parser, you don't load the entire parsed document in memory. Instead, you handle each element found sequentially (and the element is thrown away immediately).
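A minimal cursor-style sketch (the file name is a placeholder):

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.FileInputStream;

XMLInputFactory factory = XMLInputFactory.newInstance();
XMLStreamReader reader = factory.createXMLStreamReader(new FileInputStream("large.xml"));
while (reader.hasNext()) {
    // Only the current event is in memory; the document is never fully loaded.
    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
        System.out.println(reader.getLocalName());
    }
}
reader.close();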
Here is a good explanation: Java: Parsing XML files: DOM, SAX or StAX?
There are also SAX parsers, but it's much easier to use StAX. Discussion here: When should I choose SAX over StAX?
If your XML (or JSON) file is large, then it is not a good idea to load the whole content into memory, because, as you mentioned, the parsing process consumes a huge amount of memory.
This issue can be more serious with more users (I mean more than one thread). Just imagine what will happen if your application needs to serve two, ten, or more parallel requests...
The best way is to process the huge file as a stream; after you have read the payload you need, you can close the stream without reading it to the end. This is a faster and more memory-friendly solution.
Apache Commons IO can help you to do the job:
LineIterator it = FileUtils.lineIterator(theFile, "UTF-8");
try {
    while (it.hasNext()) {
        String line = it.nextLine();
        // do something with line
    }
} finally {
    LineIterator.closeQuietly(it);
}
Another way to handle this issue is to split your XML file into parts and then process the smaller parts without any issue.
I have to run a performance test on the VTD-XML library, doing not just simple parsing but an additional transformation during the parsing.
So I have a 30 MB input XML, which I transform with custom logic into another XML.
So I want to remove everything on my side that slows down the whole process (coming from poor use of the VTD library).
I tried to search for optimization tips but could not find any.
I noticed that:
0. What is better to use for selection: selectXPath or selectElement?
1. Parsing without namespace awareness is much faster.
File file = new File(fileName);
FileInputStream fis = new FileInputStream(file);
byte[] bytes = new byte[(int) file.length()];
fis.read(bytes);
fis.close();
VTDGen vtdGen = new VTDGen();
vtdGen.setDoc_BR(bytes);
vtdGen.parse(false); // false = namespace-unaware parsing
2. Read from a byte array, or pass the file to VTDGen?
final VTDGen vg = new VTDGen();
vg.parseFile("books.xml", false);
or
// open a file and read the content into a byte array
File f = new File("books.xml");
FileInputStream fis = new FileInputStream(f);
byte[] b = new byte[(int) f.length()];
fis.read(b);
VTDGen vg = new VTDGen();
vg.setDoc(b);
vg.parse(true);
Using the second approach is about 0.01 times faster... (which could come from anything).
What is the difference? With parseFile the file is limited to 2 GB with namespace awareness set to true, and 1 GB without, but what is the limit for the byte-array approach?
Reuse buffers
You can ask VTDGen to reuse VTD buffers for the next parsing task. Otherwise, by default, VTDGen will allocate a new buffer for each parsing run.
Can you give an example for that?
Adjust LC level to 5
By default, it is 3, but you can set it to 5. When your XML is deeply nested, setting the LC level to 5 results in better XPath performance. But it increases memory usage and parsing time very slightly.
VTDGen vg = new VTDGen();
vg.selectLcDepth(5);
But I get a runtime exception; it only works with 3.
Indexing
Use VTD+XML indexing - instead of parsing XML files at the time of the processing request, you can pre-index your XML into the VTD+XML format and dump it to disk. When the processing request commences, simply load the VTD+XML into memory and voila, parsing is no longer needed!
VTDGen vg = new VTDGen();
if (vg.parseFile(inputName, true)) {
    vg.writeIndex(new FileOutputStream(outputName));
}
Does anyone know how to use it? What happens if the file changes; how do I trigger re-indexing? And if there is a 10 KB change in a 3 GB file, will parsing take as long as parsing the whole new file, or just the changed lines?
overwrite feature
The overwrite feature, aka data templating - because VTD-XML retains the XML in memory as is, you can actually create a template XML file (pre-indexed in VTD+XML) whose value fields are left blank and let your app fill in the blanks, thus creating XML data that never needs to be parsed.
I think you should look at the examples bundled with the vtd-xml release... and build up the expertise gradually... fortunately, vtd-xml is in my view one of the easiest XML APIs by a large margin... so the learning curve won't be of the SAX/StAX kind of difficulty.
My answers to your numbered list above...
selectXPath is for XPath evaluation. selectElement is similar to getElementByTag().
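For illustration, a rough sketch of both styles (assuming vn is a VTDNav obtained from a successful parse; the tag name and XPath are placeholders):

AutoPilot ap = new AutoPilot(vn);

// XPath evaluation: visit every node matching an expression.
ap.selectXPath("/books/book/title");
while (ap.evalXPath() != -1) {
    int t = vn.getText(); // index of the text under the current element
    if (t != -1) System.out.println(vn.toString(t));
}

// Element selection: iterate elements by tag name, similar to getElementByTag().
vn.toElement(VTDNav.ROOT);
ap.selectElement("title");
while (ap.iterate()) {
    int t = vn.getText();
    if (t != -1) System.out.println(vn.toString(t));
}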
Turning on namespace awareness has little to no effect on parsing performance whatsoever... can you reference the source of your 100x slowdown claim?
you can read from bytes or read from files directly... here is a link to a blog post
https://ximpleware.wordpress.com/2016/06/02/parsefile-vs-parse-a-quick-comparison/
3. Buffer reuse is somewhat of an advanced feature... let's get to that at a later time.
4. If you get the latest version (2.13), you will not get a runtime exception with that method call...
5. To parse an XML document larger than 2 GB, you need to switch to the extended edition of vtd-xml, which is a separate API bundled with standard vtd-xml...
There are examples bundled with vtd-xml distribution that you might want to look at first... here is an article on this subject
http://www.codeproject.com/Articles/24663/Index-XML-Documents-with-VTD-XML
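For completeness, a rough sketch of loading a pre-built index back, assuming VTDGen's loadIndex method as covered in that article (the index file name is a placeholder):

// Load the pre-built VTD+XML index instead of re-parsing the document.
VTDGen vg = new VTDGen();
VTDNav vn = vg.loadIndex(new FileInputStream("books.vxl"));
// vn can now be navigated exactly as if the XML had just been parsed.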
When traversing an XML document like so
while (streamReader.hasNext()) {
    streamReader.next();
    if (streamReader.getEventType() == XMLStreamReader.START_ELEMENT) {
        System.out.println(streamReader.getLocalName());
    }
}
Do I need to create a new streamReader if I need to traverse the XML document again, like so?
XMLStreamReader streamReader = factory.createXMLStreamReader(reader);
I don't see a method like reset() to move the cursor back to the start of the XML file.
Yes, you should create a new reader at that point.
If you need to traverse the document multiple times, do you definitely want to parse it in a streaming fashion in the first place, rather than loading it into a DOM of some description?
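A quick sketch of the two-pass approach, assuming the source is a re-openable file (the name is a placeholder):

XMLInputFactory factory = XMLInputFactory.newInstance();

// First pass.
XMLStreamReader streamReader = factory.createXMLStreamReader(new FileReader("doc.xml"));
// ... traverse as above ...
streamReader.close();

// Second pass: the old reader cannot be rewound, so open a fresh one
// over the same underlying file.
streamReader = factory.createXMLStreamReader(new FileReader("doc.xml"));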
I have some data which my program discovers after observing a few things about files.
For instance, I know the file name, the time the file was last changed, whether the file is binary or ASCII text, the file content (assuming it is properties), and some other things.
I would like to store this data in XML format.
How would you go about doing it?
Please provide an example.
If you want something quick and relatively painless, use XStream, which lets you serialise Java Objects to and from XML. The tutorial contains some quick examples.
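A rough sketch, assuming a hypothetical FileInfo class holding the observed data (field names are illustrative):

import com.thoughtworks.xstream.XStream;

public class FileInfo {
    String fileName;
    long lastChanged;
    boolean binary;
}

XStream xstream = new XStream();
xstream.alias("fileInfo", FileInfo.class); // emit <fileInfo> instead of the full class name

FileInfo info = new FileInfo();
info.fileName = "filename.bin";

String xml = xstream.toXML(info);                // serialise to XML
FileInfo back = (FileInfo) xstream.fromXML(xml); // and deserialise back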
Use StAX; it's so much easier than SAX or DOM to write an XML file (DOM is probably the easiest to read an XML file but requires you to have the whole thing in memory), and is built into Java SE 6.
A good demo is found here on p.2:
OutputStream out = new FileOutputStream("data.xml");
XMLOutputFactory factory = XMLOutputFactory.newInstance();
XMLStreamWriter writer = factory.createXMLStreamWriter(out);
writer.writeStartDocument("ISO-8859-1", "1.0");
writer.writeStartElement("greeting");
writer.writeAttribute("id", "g1");
writer.writeCharacters("Hello StAX");
writer.writeEndDocument();
writer.flush();
writer.close();
out.close();
The standard option is the W3C DOM libraries, built into the JDK.
final Document docToSave = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
final Element fileInfo = docToSave.createElement("fileInfo");
docToSave.appendChild(fileInfo);
final Element fileName = docToSave.createElement("fileName");
// setNodeValue is a no-op on elements; set the text content instead.
fileName.setTextContent("filename.bin");
fileInfo.appendChild(fileName);
return docToSave;
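That builds the tree in memory but never writes it out; to serialise the Document to a file, one common route is the JAXP Transformer API. A minimal sketch (the output file name is a placeholder):

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.io.File;

// Serialise the in-memory DOM tree to disk.
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform(new DOMSource(docToSave), new StreamResult(new File("fileinfo.xml")));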
XML is almost never the easiest thing to do.
You can use SAX or DOM to do that; review this link: https://web.archive.org/web/1/http://articles.techrepublic%2ecom%2ecom/5100-10878_11-1044810.html
I think that is what you want.
I'm trying to find a way to validate a large XML file against an XSD. I saw the question ...best way to validate an XML... but the answers all pointed to using the Xerces library for validation. The only problem is, when I use that library to validate a 180 MB file, I get an OutOfMemoryError.
Are there any other tools, libraries, or strategies for validating a larger-than-normal XML file?
EDIT: The SAX solution worked for Java validation, but the other two suggestions for the libxml tooling were very helpful as well, for validation outside of Java.
Instead of using a DOMParser, use a SAXParser. This reads from an input stream or reader so you can keep the XML on disk instead of loading it all into memory.
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setValidating(true);
factory.setNamespaceAware(true);

SAXParser parser = factory.newSAXParser();
// Tell the parser to validate against XML Schema rather than a DTD.
parser.setProperty("http://java.sun.com/xml/jaxp/properties/schemaLanguage",
        "http://www.w3.org/2001/XMLSchema");

XMLReader reader = parser.getXMLReader();
reader.setErrorHandler(new SimpleErrorHandler()); // your own ErrorHandler implementation
reader.parse(new InputSource(new FileReader("document.xml")));
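Alternatively, the javax.xml.validation API also validates in a streaming fashion without building a tree; a sketch (file names are placeholders):

import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import java.io.File;

SchemaFactory schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Schema schema = schemaFactory.newSchema(new File("schema.xsd"));
Validator validator = schema.newValidator();
// StreamSource lets the validator read straight from disk.
validator.validate(new StreamSource(new File("document.xml")));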
Use libxml, which performs validation and has a streaming mode.
Personally I like to use XMLStarlet which has a command line interface, and works on streams. It is a set of tools built on Libxml2.
SAX and libxml will help, as already mentioned. You could also try increasing the maximum heap size for the JVM using the -Xmx option. E.g., to set the maximum heap size to 512 MB: java -Xmx512m com.foo.MyClass