I have to run a performance test on the VTD-XML library, doing not just simple parsing but an additional transformation during the parsing.
So I have a 30 MB input XML, and I transform it with custom logic into another XML.
So I want to remove everything that slows down the whole process on my side (because of poor use of the VTD library).
I tried to search for optimization tips but could not find any.
I noticed that:
0. What is better to use for selection: selectXPath or selectElement?
Parsing without namespace awareness is much faster.
File file = new File(fileName);
byte[] bytes = new byte[(int) file.length()];
FileInputStream in = new FileInputStream(file);
in.read(bytes); // fill the buffer - the original passed an empty array to the parser
in.close();
VTDGen vtdGen = new VTDGen();
vtdGen.setDoc_BR(bytes);
vtdGen.parse(false); // false = namespace-unaware
Should I read from a byte array, or pass the file to VTDGen?
final VTDGen vg = new VTDGen();
vg.parseFile("books.xml", false);
or
// open a file and read the content into a byte array
File f = new File("books.xml");
FileInputStream fis = new FileInputStream(f);
byte[] b = new byte[(int) f.length()];
fis.read(b); // note: a single read() is not guaranteed to fill the array
fis.close();
VTDGen vg = new VTDGen();
vg.setDoc(b);
vg.parse(true); // true = namespace-aware
Using the second approach was about 1% faster... (which could be down to anything).
What is the difference? With parseFile the file is limited to 2 GB with namespace awareness on and 1 GB without - but what is the limit for the byte-array approach?
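A note on the byte-array approach: a single `fis.read(b)` call is not guaranteed to fill the array. On Java 7+, a safer stdlib way to load the whole file before handing it to the parser is `Files.readAllBytes` - a minimal sketch (the temp file here is only illustrative):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadAll {
    public static void main(String[] args) throws IOException {
        // stand-in for "books.xml": a small temp file
        Path tmp = Files.createTempFile("books", ".xml");
        Files.write(tmp, "<root/>".getBytes(StandardCharsets.UTF_8));

        byte[] b = Files.readAllBytes(tmp); // reads the entire file, no partial reads
        System.out.println(b.length); // 7
        // then hand b to the parser, e.g. vg.setDoc(b); vg.parse(true);
        Files.delete(tmp);
    }
}
```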
Reuse buffers
You can ask VTDGen to reuse VTD buffers for the next parsing task.
Otherwise, by default, VTDGen will allocate new buffer for each
parsing run.
Can you give an example for that?
Adjust LC level to 5
By default, it is 3. But you can set it to 5. When your XML are deeply
nested, setting LC level to 5 results in better XPath performance. But
it increases memory usage and parsing time very slightly.
VTDGen vg = new VTDGen();
vg.selectLcDepth(5);
But I get a runtime exception with this; it only works with 3.
Indexing
Use VTD+XML indexing- Instead of parsing XML files at the time of
processing request, you can pre-index your XML into VTD+XML format and
dump them on disk. When the processing request commences, simply load
VTD+xml in memory and voila, parsing is no longer needed!!
VTDGen vg = new VTDGen();
if (vg.parseFile(inputName, true)) {
    vg.writeIndex(new FileOutputStream(outputName));
}
Does anyone know how to use it? What happens if the file changes - how do I trigger re-indexing? And if there is a 10 KB change in a 3 GB file, will parsing take as long as for a whole new file, or only for the changed lines?
overwrite feature
The overwrite feature aka. data templating- Because VTD-XML retains
XML in memory as is, you can actually create a template XML file
(pre-indexed in vtd+xml) whose value fields are left blank and let
your app fill in the blank, thus creating XML data that never need to
be parsed.
I think you should look at the examples bundled with the vtd-xml release... and build up the expertise gradually... fortunately, vtd-xml is in my view one of the easiest XML APIs by a large margin... so the learning curve won't be SAX/StAX kind of difficult.
My answers to your numbered list above...
selectXPath is for XPath evaluation. selectElement is similar to getElementByTag().
Turning on namespace awareness has little/no effect on parsing performance whatsoever... can you reference the source of your slowdown claim?
you can read from bytes or read from files directly... here is a link to a blog post
https://ximpleware.wordpress.com/2016/06/02/parsefile-vs-parse-a-quick-comparison/
3. Buffer reuse is somewhat an advanced feature... let's get to that at a later time.
4.If you get the latest version (2.13), you will not get runtime exception with that method call...
To parse an XML doc larger than 2 GB, you need to switch to the extended edition of vtd-xml, which is a separate API bundled with standard vtd-xml...
There are examples bundled with vtd-xml distribution that you might want to look at first... here is an article on this subject
http://www.codeproject.com/Articles/24663/Index-XML-Documents-with-VTD-XML
Related
I am developing a font converter app which will convert Unicode font text to Krutidev/Shree Lipi (Marathi/Hindi) font text. In the original docx file there are formatted words (i.e. color, font, size of the text, hyperlinks, etc.).
I want to keep format of the final docx same as the original docx after converting words from Unicode to another font.
PFA.
Here is my Code
try {
    fileInputStream = new FileInputStream("StartDoc.docx");
    document = new XWPFDocument(fileInputStream);
    Converter data = new Converter();
    for (XWPFParagraph p : document.getParagraphs()) {
        for (XWPFRun r : p.getRuns()) {
            String string2 = r.getText(0);
            // use the converted result - the original discarded the return value
            // (assuming uniToShree returns the converted string)
            r.setText(data.uniToShree(string2), 0);
        }
    }
    // Write the document to the file system
    FileOutputStream out = new FileOutputStream(new File("Output.docx"));
    document.write(out);
    out.close();
    System.out.println("Output.docx written successfully");
}
catch (IOException e) {
    System.out.println("We had an error while reading the Word Doc");
}
Thank you for your question.
I worked with POI some years ago, though on Excel workbooks, but I'll still try to help you reach the root cause of your error.
The exception handed to you carries good debugging information in itself!
A good first step to disambiguate the error is to not overwrite the exception message provided to you.
Try printing the results of e.getLocalizedMessage() or e.getMessage() and see what you get.
Getting the stack trace using the printStackTrace method is also often useful to pinpoint where your error lies!
Share your findings from the above method calls so we can help you debug the issue further.
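For example, a catch block along these lines surfaces the underlying cause instead of a fixed message (the file name is only illustrative):

```java
import java.io.FileInputStream;
import java.io.IOException;

public class ShowTheRealError {
    public static void main(String[] args) {
        try {
            // Hypothetical file; the point is what we do in the catch block
            new FileInputStream("StartDoc.docx").close();
        } catch (IOException e) {
            System.out.println(e.getMessage());          // the real cause
            System.out.println(e.getLocalizedMessage()); // localized variant
            e.printStackTrace();                         // where it happened
        }
    }
}
```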
[EDIT 1:]
So it seems, you are able to process the file just right with respect to the font conversion of the data, but you are not able to reconstruct the formatting of the original data in the converted data file.
(thus, "We had an error while reading the Word Doc", is a lie getting printed ;) )
Now, there are 2 elements to a Word document:
Content
Structure or Schema
You are able to convert the data as you are working only on the content of your respective doc files.
In order to be able to retain the formatting of the contents, your solution needs to be aware of the formatting of the doc files as well and take care of that.
MS Word, which defined the doc files and their extension (.docx), follows a particular set of schemas that define the rules of formatting. These schemas are defined in Microsoft's XML namespace packages [1].
You can obtain the XML (HTML) form of the doc file you want quite easily (see the steps in [1] or the code in link [2]). You can even apply different schemas, or possibly your own schema definitions based on the definitions provided by MS's namespaces. You can do this programmatically, for which you need to get versed in XML, XSL and XSLT concepts (w3schools [3] is a good starting point), though this method is no less complex than writing your own version of MS Word; or you can use MS Word's built-in tools, as shown in [1].
[1]. https://www.microsoftpressstore.com/articles/article.aspx?p=2231769&seqNum=4#:~:text=During%20conversion%2C%20Word%20tags%20the,you%20can%20an%20HTML%20file.
[2]. https://svn.apache.org/repos/asf/poi/trunk/src/scratchpad/testcases/org/apache/poi/hwpf/converter/TestWordToHtmlConverter.java
[3]. https://www.w3schools.com/xml/
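As a small illustration of the programmatic route, the JDK ships its own XSLT processor. A minimal sketch using an identity stylesheet (the stylesheet and input below are toy examples, not Word's actual schemas):

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltDemo {
    public static void main(String[] args) throws Exception {
        // An identity stylesheet: copies the input XML unchanged
        String xslt =
            "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
          + "<xsl:template match=\"@*|node()\">"
          + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
          + "</xsl:template></xsl:stylesheet>";

        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader("<doc><p>hello</p></doc>")),
                    new StreamResult(out));
        System.out.println(out); // the copied document
    }
}
```

A real converter would replace the identity template with templates that rewrite runs while copying the formatting nodes through unchanged.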
My answer provides you with a cursory overview of how to achieve what you want to, but depending on your inclination and time availability, you may want to use your discretion before you decide to head onto one path than the other.
Hope it helps!
I am trying to create a Saxon XPathCompiler. I have the same code in Java & .NET, each calling the appropriate Saxon library. The code is:
protected void ctor(InputStream xmlData, InputStream schemaFile, boolean preserveWhiteSpace) throws SAXException, SchemaException, SaxonApiException {
this.rootNode = makeDataSourceNode(null);
XMLReader reader = XMLReaderFactory.createXMLReader();
InputSource xmlSource = new InputSource(xmlData);
SAXSource saxSource = new SAXSource(reader, xmlSource);
Source schemaSource = new StreamSource(schemaFile);
Configuration config = createEnterpriseConfiguration();
config.addSchemaSource(schemaSource);
// ...
In the case of .NET, the InputStreams are instances of a class that wraps a .NET Stream and makes it a Java InputStream. For Java the above code works fine. But in .NET, the last line, config.addSchemaSource(schemaSource), throws:
$exception {"Content is not allowed in
prolog."} org.xml.sax.SAXParseException
In both Java & .NET it works fine if there is no schema.
The files it is using are http://www.thielen.com/test/SouthWind.xml & http://www.thielen.com/test/SouthWind.xsd
It does not appear to be any of the issues in this question. And if that were the issue, shouldn't both Java and .NET have the same problem?
I'm thinking maybe it's the wrapper around the .NET Stream to make it a Java InputStream, but we use that class everywhere without any other issues.
The "Content is not allowed in prolog" exception is absolutely infuriating - if only it told you which bytes it is complaining about! One diagnostic technique is to display the initial bytes delivered by the InputStream (note: InputStream has read(), not next()): do a few calls on
System.err.println(schemaFile.read())
My first guess as to the cause would be something to do with byte order marks, but rather than speculate, I would focus on diagnostics to see what the parser is seeing in that InputStream that it doesn't like.
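For instance, a small helper along these lines dumps the first few bytes as hex, which makes a UTF-8 byte order mark (EF BB BF) easy to spot (the sample bytes are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class PeekBytes {
    // Print the first n bytes of a stream as hex, e.g. to spot a BOM
    static String peek(InputStream in, int n) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            int b = in.read();
            if (b < 0) break; // end of stream
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', '?'};
        System.out.println(peek(new ByteArrayInputStream(withBom), 5));
        // EF BB BF 3C 3F -> a UTF-8 BOM before the XML declaration
    }
}
```

Note that peeking consumes the stream, so for a real diagnosis you would buffer or reopen it before handing it to Saxon.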
I'm getting the error java.lang.OutOfMemoryError: GC overhead limit exceeded when I try to execute the pipeline if the GATE Document I use is slightly large.
The code works fine if the GATE Document is small.
My Java code is something like this:
TestGate Class:
public void gateProcessor(Section section) throws Exception {
Gate.init();
Gate.getCreoleRegister().registerDirectories(....
SerialAnalyserController pipeline .......
pipeline.add(All the language analyzers)
pipeline.add(My Jape File)
Corpus corpus = Factory.newCorpus("Gate Corpus");
Document doc = Factory.newDocument(section.getContent());
corpus.add(doc);
pipeline.setCorpus(corpus);
pipeline.execute();
}
The Main Class Contains:
StringBuilder body = new StringBuilder();
int character;
FileInputStream file = new FileInputStream(
        new File("filepath\\out.rtf")); // the document in question
while (true) {
    character = file.read();
    if (character == -1) break;
    body.append((char) character); // note: treats each byte as a char, ignoring encoding
}
file.close();
Section section = new Section(body.toString()); //Creating object of Type Section with content field = body.toString()
TestGate testgate = new TestGate();
testgate.gateProcessor(section);
Interestingly, this fails in the GATE Developer tool as well; the tool basically gets stuck if the document exceeds a specific limit, say more than one page.
This proves that my code is logically correct but my approach is wrong. How do we deal with large chunks of data in a GATE Document?
You need to call
corpus.clear();
Factory.deleteResource(doc);
after each document, otherwise you'll eventually get an OutOfMemoryError on any size of docs if you run it enough times (although, by the way you initialize GATE in the method, it seems like you really need to process a single document only once).
Besides that, annotations and features usually take up a lot of memory. If you have an annotation-intensive pipeline, i.e. you generate lots of annotations with lots of features and values, you may run out of memory. Make sure you don't have a processing resource that generates annotations exponentially - for instance, a JAPE or Groovy resource that generates n-to-the-power-of-W annotations, where W is the number of words in your doc. Or if you have a feature for each possible word combination in your doc, that would generate factorial-of-W strings.
Creating the pipeline object every time is why it takes huge amounts of memory. That's why, every time you use ANNIE, clean up:
pipeline.cleanup();
pipeline=null;
I am using an InputSource to parse a large XML file (1.2 MB). I need to keep this InputSource in memory; I do not want to keep reloading it. What's the best way to do this? I have tried using a singleton, but the SAX parser complains that the document is missing an end tag after the second time the object reference is accessed.
Any suggestions would be greatly appreciated.
InputStream ins = getResources().openRawResource(R.raw.cfr_title_index);
InputSource xmlSource = new InputSource(ins);
MySinglton.xmlInput = xmlSource;
Thanks
Streams are "read once". You should not plan to hang onto it. If you need its contents in memory, read them into an object or data structure and cache it.
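For example, one way to do that is to read the raw resource once into a byte array and hand out a fresh InputSource per parse. A minimal sketch (class and method names are made up for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import org.xml.sax.InputSource;

public class CachedXml {
    private static byte[] cached; // read once, reused for every parse

    // Drain the stream into the cache (call once, e.g. with openRawResource(...))
    static void load(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) buf.write(chunk, 0, n);
        cached = buf.toByteArray();
    }

    // Each parse gets a fresh InputSource over the same cached bytes
    static InputSource newSource() {
        return new InputSource(new ByteArrayInputStream(cached));
    }

    public static void main(String[] args) throws IOException {
        load(new ByteArrayInputStream("<a/>".getBytes()));
        // two independent sources over the same cached bytes
        System.out.println(newSource().getByteStream().available()); // 4
        System.out.println(newSource().getByteStream().available()); // 4
    }
}
```

Each call to newSource() starts reading from the beginning, so the "missing end tag" symptom of a half-consumed stream goes away.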
I have some data which my program discovers after observing a few things about files.
For instance, I know the file name, the time the file was last changed, whether the file is binary or ASCII text, the file content (assuming it is properties), and some other things.
I would like to store this data in XML format.
How would you go about doing it?
Please provide an example.
If you want something quick and relatively painless, use XStream, which lets you serialise Java Objects to and from XML. The tutorial contains some quick examples.
Use StAX; it's so much easier than SAX or DOM to write an XML file (DOM is probably the easiest to read an XML file but requires you to have the whole thing in memory), and is built into Java SE 6.
A good demo is found here on p.2:
OutputStream out = new FileOutputStream("data.xml");
XMLOutputFactory factory = XMLOutputFactory.newInstance();
XMLStreamWriter writer = factory.createXMLStreamWriter(out);
writer.writeStartDocument("ISO-8859-1", "1.0");
writer.writeStartElement("greeting");
writer.writeAttribute("id", "g1");
writer.writeCharacters("Hello StAX");
writer.writeEndDocument();
writer.flush();
writer.close();
out.close();
The standard choice is the W3C DOM libraries built into the JDK.
final Document docToSave = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
final Element fileInfo = docToSave.createElement("fileInfo");
docToSave.appendChild(fileInfo);
final Element fileName = docToSave.createElement("fileName");
fileName.setTextContent("filename.bin"); // setNodeValue() is a no-op on an Element
fileInfo.appendChild(fileName);
return docToSave;
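The snippet above builds the Document but never serializes it. A minimal sketch of writing it out with the JDK's built-in Transformer (element names reused from the snippet):

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DomToXml {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element fileInfo = doc.createElement("fileInfo");
        doc.appendChild(fileInfo);
        Element fileName = doc.createElement("fileName");
        fileName.setTextContent("filename.bin");
        fileInfo.appendChild(fileName);

        // Serialize the DOM tree to a string
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        System.out.println(out);
        // <fileInfo><fileName>filename.bin</fileName></fileInfo>
    }
}
```

To write to disk instead, pass a FileOutputStream to StreamResult.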
XML is almost never the easiest thing to do.
You can use SAX or DOM to do that; review this link: https://web.archive.org/web/1/http://articles.techrepublic%2ecom%2ecom/5100-10878_11-1044810.html
I think that is what you want.