How do we deal with a large GATE Document - java

I'm getting the error java.lang.OutOfMemoryError: GC overhead limit exceeded when I try to execute the pipeline if the GATE Document I use is slightly large.
The code works fine if the GATE Document is small.
My Java code is something like this:
TestGate Class:
public void gateProcessor(Section section) throws Exception {
Gate.init();
Gate.getCreoleRegister().registerDirectories(....
SerialAnalyserController pipeline .......
pipeline.add(All the language analyzers)
pipeline.add(My Jape File)
Corpus corpus = Factory.newCorpus("Gate Corpus");
Document doc = Factory.newDocument(section.getContent());
corpus.add(doc);
pipeline.setCorpus(corpus);
pipeline.execute();
}
The Main Class Contains:
StringBuilder body = new StringBuilder();
int character;
FileInputStream file = new FileInputStream(
new File(
"filepath\\out.rtf")); //The Document in question
while (true)
{
character = file.read();
if (character == -1) break;
body.append((char) character);
}
Section section = new Section(body.toString()); //Creating object of Type Section with content field = body.toString()
TestGate testgate = new TestGate();
testgate.gateProcessor(section);
Interestingly, this fails in the GATE Developer tool as well: the tool basically gets stuck if the document exceeds a specific limit, say more than one page.
This suggests that my code is logically correct but my approach is wrong. How do we deal with large chunks of data in a GATE Document?

You need to call
corpus.clear();
Factory.deleteResource(doc);
after each document, otherwise you'll eventually get an OutOfMemoryError with documents of any size if you run it enough times (although, given the way you initialize GATE inside the method, it looks like you really need to process a single document only once).
Besides that, annotations and features usually take a lot of memory. If you have an annotation-intensive pipeline, i.e. you generate lots of annotations with lots of features and values, you may run out of memory. Make sure you don't have a processing resource that generates annotations exponentially - for instance a JAPE or Groovy script that generates n to the power of W annotations, where W is the number of words in your document. Or if you have a feature for each possible word combination in your document, that would generate factorial of W strings.
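For illustration, a rough sketch of that per-document cleanup built around the question's own Factory/corpus calls ('sections' is a placeholder for whatever input you iterate over; exception handling is omitted, as in the question, so assume a surrounding method that declares throws Exception):
// Initialize GATE and build the pipeline once, then reuse it for every document.
Corpus corpus = Factory.newCorpus("Gate Corpus");
pipeline.setCorpus(corpus);
for (Section section : sections) { // 'sections' is a placeholder for your input
    Document doc = Factory.newDocument(section.getContent());
    corpus.add(doc);
    try {
        pipeline.execute();
        // ... read the annotations you need from doc here ...
    } finally {
        corpus.clear();              // detach the document from the corpus
        Factory.deleteResource(doc); // release the document's memory
    }
}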

A pipeline object is created every time, which is why it takes so much memory. So clean up the ANNIE pipeline after every use:
pipeline.cleanup();
pipeline=null;

Related

Replacing text in XWPFParagraph without changing format of the docx file

I am developing a font converter app which converts Unicode font text to Krutidev/Shree Lipi (Marathi/Hindi) font text. In the original docx file there are formatted words (i.e. color, font, text size, hyperlinks, etc.).
I want to keep the format of the final docx the same as the original docx after converting the words from Unicode to the other font.
Here is my Code
try {
fileInputStream = new FileInputStream("StartDoc.docx");
document = new XWPFDocument(fileInputStream);
XWPFWordExtractor extractor = new XWPFWordExtractor(document);
List<XWPFParagraph> paragraph = document.getParagraphs();
Converter data = new Converter() ;
for(XWPFParagraph p :document.getParagraphs())
{
for(XWPFRun r :p.getRuns())
{
String string2 = r.getText(0);
String converted = data.uniToShree(string2); // assuming uniToShree returns the converted text
r.setText(converted, 0);
}
}
//Write the Document in file system
FileOutputStream out = new FileOutputStream(new File("Output.docx");
document.write(out);
out.close();
System.out.println("Output.docx written successully");
}
catch (IOException e) {
System.out.println("We had an error while reading the Word Doc");
}
Thanks for the question.
I worked with POI some years ago, though on Excel workbooks, but I'll still try to help you reach the root cause of your error.
The exception itself already carries good debugging information!
A good first step to disambiguate the error is not to swallow the exception message you are given.
Try printing the results of e.getLocalizedMessage() or e.getMessage() and see what you get.
Getting the stack trace with printStackTrace() is also often useful to pinpoint where the error lies!
Share the output of those calls so we can help debug the issue further.
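For example, the catch block from the question could be adjusted along these lines (a sketch) so the real cause becomes visible:
try {
    // ... the POI code from the question ...
} catch (IOException e) {
    // Print the actual exception details instead of a fixed message
    System.out.println("Error while processing the Word doc: " + e.getLocalizedMessage());
    e.printStackTrace();
}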
[EDIT 1:]
So it seems you are able to process the file correctly as far as the font conversion goes, but you are not able to reproduce the formatting of the original file in the converted one.
(thus, "We had an error while reading the Word Doc" is a lie being printed ;) )
Now, there are 2 elements to a Word document:
Content
Structure or Schema
You are able to convert the data because you are working only on the content of your doc files.
In order to retain the formatting of the contents, your solution needs to be aware of the formatting of the doc files as well and take care of that.
MS Word, which defines the doc files and their extension (.docx), follows a particular set of schemas that define the formatting rules. These schemas are defined in Microsoft's XML namespace packages [1].
You can obtain the XML (HTML) form of the doc file you want quite easily (see the steps in [1] or the code in [2]) and even apply different schemas, or your own schema definitions based on the ones provided by MS's namespaces. You can do this programmatically, for which you need to get versed in XML, XSL and XSLT concepts (w3schools [3] is a good starting point), although that method is no less complex than writing your own version of MS Word; or you can use MS Word's built-in tools as shown in [1].
[1]. https://www.microsoftpressstore.com/articles/article.aspx?p=2231769&seqNum=4#:~:text=During%20conversion%2C%20Word%20tags%20the,you%20can%20an%20HTML%20file.
[2]. https://svn.apache.org/repos/asf/poi/trunk/src/scratchpad/testcases/org/apache/poi/hwpf/converter/TestWordToHtmlConverter.java
[3]. https://www.w3schools.com/xml/
My answer gives you a cursory overview of how to achieve what you want, but depending on your inclination and available time, use your discretion before deciding which path to head down.
Hope it helps!

Using StAX to create index for XML for quick access

Is there a way to use StAX and JAX-B to create an index and then get quick access to an XML file?
I have a large XML file and I need to find information in it. This is used in a desktop application, so it should work on systems with little RAM.
So my idea is this: Create an index and then quickly access data from the large file.
I can't just split the file because it's an official federal database that I want to use unaltered.
Using a XMLStreamReader I can quickly find some element and then use JAXB for unmarshalling the element.
final XMLStreamReader r = xf.createXMLStreamReader(filename, new FileInputStream(filename));
final JAXBContext ucontext = JAXBContext.newInstance(Foo.class);
final Unmarshaller unmarshaller = ucontext.createUnmarshaller();
r.nextTag();
while (r.hasNext()) {
final int eventType = r.next();
if (eventType == XMLStreamConstants.START_ELEMENT && r.getLocalName().equals("foo")
&& Long.parseLong(r.getAttributeValue(null, "bla")) == bla
) {
// JAX-B works just fine:
final JAXBElement<Foo> foo = unmarshaller.unmarshal(r,Foo.class);
System.out.println(foo.getValue().getName());
// But how do I get the offset?
// cache.put(r.getAttributeValue(null, "id"), r.getCursor()); // ???
break;
}
}
But I can't get the offset. I'd like to use this to prepare an index:
(id of element) -> (offset in file)
Then I should be able to use the offset to just unmarshal from there: open the file stream, skip that many bytes, unmarshal.
I can't find a library that does this. And I can't do it on my own without knowing the position of the file cursor. The javadoc clearly states that there is a cursor, but I can't find a way of accessing it.
Edit:
I'm just trying to offer a solution that will work on old hardware so people can actually use it. Not everyone can afford a new and powerful computer. Using StAX I can get the data in about 2 seconds, which is a bit long, but it hardly requires any RAM. Just using JAX-B requires 300 MB of RAM. Using some embedded DB system would just be a lot of overhead for such a simple task. I'll use JAX-B anyway; anything else would be useless for me since the wsimport-generated classes are already perfect. I just don't want to load 300 MB of objects when I only need a few.
I can't find a DB that just needs an XSD to create an in-memory DB, which doesn't use that much RAM. It's all made for servers or it's required to define a schema and map the XML. So I assume it just doesn't exist.
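For illustration, the retrieval step described above might look roughly like this (a sketch; offsetIndex is the hypothetical id-to-offset map, the offsets are treated as character offsets read through a Reader, and checked exceptions are omitted):
long offset = offsetIndex.get(id); // character offset recorded while indexing
try (Reader in = new FileReader(filename)) {
    in.skip(offset); // reads forward to the element start (may need a loop in real code)
    XMLStreamReader r = XMLInputFactory.newFactory().createXMLStreamReader(in);
    r.nextTag(); // position on the <foo> start element
    Foo foo = unmarshaller.unmarshal(r, Foo.class).getValue();
}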
You could work with a generated XML parser using ANTLR4.
The following works very well on a ~17 GB Wikipedia dump (/20170501/dewiki-20170501-pages-articles-multistream.xml.bz2), but I had to increase the heap size using -Xmx6g.
1. Get XML Grammar
cd /tmp
git clone https://github.com/antlr/grammars-v4
2. Generate Parser
cd /tmp/grammars-v4/xml/
mvn clean install
3. Copy Generated Java files to your Project
cp -r target/generated-sources/antlr4 /path/to/your/project/gen
4. Hook in with a Listener to collect character offsets
package stack43366566;
import java.util.ArrayList;
import java.util.List;
import org.antlr.v4.runtime.ANTLRFileStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTreeWalker;
import stack43366566.gen.XMLLexer;
import stack43366566.gen.XMLParser;
import stack43366566.gen.XMLParser.DocumentContext;
import stack43366566.gen.XMLParserBaseListener;
public class FindXmlOffset {
List<Integer> offsets = null;
String searchForElement = null;
public class MyXMLListener extends XMLParserBaseListener {
public void enterElement(XMLParser.ElementContext ctx) {
String name = ctx.Name().get(0).getText();
if (searchForElement.equals(name)) {
offsets.add(ctx.start.getStartIndex());
}
}
}
public List<Integer> createOffsets(String file, String elementName) {
searchForElement = elementName;
offsets = new ArrayList<>();
try {
XMLLexer lexer = new XMLLexer(new ANTLRFileStream(file));
CommonTokenStream tokens = new CommonTokenStream(lexer);
XMLParser parser = new XMLParser(tokens);
DocumentContext ctx = parser.document();
ParseTreeWalker walker = new ParseTreeWalker();
MyXMLListener listener = new MyXMLListener();
walker.walk(listener, ctx);
return offsets;
} catch (Exception e) {
throw new RuntimeException(e);
}
}
public static void main(String[] arg) {
System.out.println("Search for offsets.");
List<Integer> offsets = new FindXmlOffset().createOffsets("/tmp/dewiki-20170501-pages-articles-multistream.xml",
"page");
System.out.println("Offsets: " + offsets);
}
}
5. Result
Prints:
Offsets: [2441, 10854, 30257, 51419 ....
6. Read from Offset Position
To test the code I've written a class that reads each Wikipedia page into a Java object
@JacksonXmlRootElement
class Page {
public Page(){};
public String title;
}
using basically this code
private Page readPage(Integer offset, String filename) {
try (Reader in = new FileReader(filename)) {
in.skip(offset);
ObjectMapper mapper = new XmlMapper();
mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
Page object = mapper.readValue(in, Page.class);
return object;
} catch (Exception e) {
throw new RuntimeException(e);
}
}
Find complete example on github.
I just had to solve this problem, and spent way too much time figuring it out. Hopefully the next poor soul who comes looking for ideas can benefit from my suffering.
The first problem to contend with is that most XMLStreamReader implementations provide inaccurate results when you ask them for their current offsets. Woodstox however seems to be rock-solid in this regard.
The second problem is the actual type of offset you use. You have to use char offsets if you need to work with a multi-byte charset, which means the random-access retrieval from the file using the provided offsets is not going to be very efficient - you can't just set a pointer into the file at your offset and start reading, you have to read through until you get to the offset (that's what skip does under the covers in a Reader), then start extracting. If you're dealing with very large files, that means retrieval of content near the end of the file is too slow.
I ended up writing a FilterReader that keeps a buffer of byte offset to char offset mappings as the file is read. When we need to get the byte offset, we first ask Woodstox for the char offset, then get the custom reader to tell us the actual byte offset for the char offset. We can get the byte offset from the beginning and end of the element, giving us what we need to go in and surgically extract the element from the file by opening it as a RandomAccessFile, which means it's super fast at any point in the file.
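A rough sketch of the indexing half of that approach, using only the standard StAX Location API (Woodstox is assumed on the classpath since, as noted above, other implementations' offsets are not reliable; imports omitted):
static Map<String, Integer> buildIndex(String filename) throws Exception {
    Map<String, Integer> index = new HashMap<>();
    XMLInputFactory xif = XMLInputFactory.newFactory(); // picks up Woodstox if it is on the classpath
    try (Reader in = new FileReader(filename)) {
        XMLStreamReader r = xif.createXMLStreamReader(in);
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.START_ELEMENT && "foo".equals(r.getLocalName())) {
                // character offset the reader reports for the current event
                index.put(r.getAttributeValue(null, "id"), r.getLocation().getCharacterOffset());
            }
        }
    }
    return index;
}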
I created a library for this, it's on GitHub and Maven Central. If you just want to get the important bits, the party trick is in the ByteTrackingReader.
Some people have commented about how this whole thing is a bad idea and why would you want to do it? XML is a transport mechanism, you should just import it to a DB and work with the data with more appropriate tools. For most cases this is true, but if you're building applications or integrations that communicate via XML, you need tooling to analyze and operate on the files that are exchanged. I get daily requests to verify feed contents, having the ability to quickly extract a specific set of items from a massive file and verify not only the contents, but the format itself is essential.
Anyhow, hopefully this can save someone a few hours, or at least get them closer to a solution.

Optimization on VTD-XML parse?

I have to run a performance test on the VTD-XML library, doing not just simple parsing but an additional transformation during the parsing.
So I have a 30 MB input XML which I transform with custom logic into another XML.
So I want to remove everything on my side that slows down the whole process (caused by not using the VTD library well).
I tried to search for optimization tips but cannot find any.
I noticed that:
0. What is better to use for selection: selectXPath or selectElement?
Parsing without namespace awareness is much faster.
File file = new File(fileName);
VTDGen vtdGen = new VTDGen();
vtdGen.setDoc_BR(new byte[(int) file.length()]);
vtdGen.parse(false);
Read from a byte array or pass the file to VTDGen?
final VTDGen vg = new VTDGen();
vg.parseFile("books.xml", false);
or
// open a file and read the content into a byte array
File f = new File("books.xml");
FileInputStream fis = new FileInputStream(f);
byte[] b = new byte[(int) f.length()];
fis.read(b);
VTDGen vg = new VTDGen();
vg.setDoc(b);
vg.parse(true);
Using the second approach is about 0.01 times faster... (could be due to anything).
What is the difference with parseFile? The file is limited to 2 GB with namespace awareness on and 1 GB without, but what is the limit for the byte approach?
Reuse buffers
You can ask VTDGen to reuse VTD buffers for the next parsing task.
Otherwise, by default, VTDGen will allocate new buffer for each
parsing run.
Can you give an example for that?
Adjust LC level to 5
By default, it is 3. But you can set it to 5. When your XML are deeply
nested, setting LC level to 5 results in better XPath performance. But
it increases memory usage and parsing time very slightly.
VTDGen vg = new VTDGen();
vg.selectLcDepth(5);
But I get a runtime exception; it only works with 3.
Indexing
Use VTD+XML indexing- Instead of parsing XML files at the time of
processing request, you can pre-index your XML into VTD+XML format and
dump them on disk. When the processing request commences, simply load
VTD+xml in memory and voila, parsing is no longer needed!!
VTDGen vg = new VTDGen();
if (vg.parseFile(inputName,true)){
vg.writeIndex(new FileOutputStream(outputName));
}
Does anyone know how to use it? What happens if the file changes - how do I trigger re-indexing? And if there is a 10 KB change in a 3 GB file, will parsing take as long as re-parsing the whole file, or just the changed lines?
overwrite feature
The overwrite feature aka. data templating- Because VTD-XML retains
XML in memory as is, you can actually create a template XML file
(pre-indexed in vtd+xml) whose value fields are left blank and let
your app fill in the blank, thus creating XML data that never need to
be parsed.
I think you should look at the examples bundled with the vtd-xml release... and build up the expertise gradually... fortunately, vtd-xml is in my view one of the easiest XML APIs by a large margin... so the learning curve won't be SAX/StAX kind of difficult.
My answers to your numbered list above:
selectXPath is for XPath evaluation. selectElement is similar to getElementByTag().
turning on Namespace awareness has little/no effect on parsing performance whatsoever... can you reference the source of your 100x slowdown claim?
you can read from bytes or read from files directly... here is a link to a blog post
https://ximpleware.wordpress.com/2016/06/02/parsefile-vs-parse-a-quick-comparison/
3. Buffer reuse is somewhat an advanced feature... let's get to that at a later time.
4. If you get the latest version (2.13), you will not get a runtime exception with that method call...
To parse XML documents larger than 2 GB, you need to switch to the extended edition of vtd-xml, which is a separate API bundled with standard vtd-xml...
There are examples bundled with the vtd-xml distribution that you might want to look at first... here is an article on this subject:
http://www.codeproject.com/Articles/24663/Index-XML-Documents-with-VTD-XML
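For orientation, a minimal parse-and-navigate sketch in the style of those bundled examples (file name and XPath are placeholders; assumes import com.ximpleware.*; and a surrounding method that declares throws Exception):
VTDGen vg = new VTDGen();
if (vg.parseFile("books.xml", false)) { // false = namespace awareness off
    VTDNav vn = vg.getNav();
    AutoPilot ap = new AutoPilot(vn);
    ap.selectXPath("/books/book/title"); // placeholder XPath
    int i;
    while ((i = ap.evalXPath()) != -1) { // cursor now sits on each matched element
        int t = vn.getText();
        if (t != -1) System.out.println(vn.toNormalizedString(t));
    }
}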

LingPipe POS tagger runs out of memory

I'm having trouble using the LingPipe POS tagger to count the most frequently used parts of speech in a large (~180MB) corpus of e-mails. Specifically, it consumes enormous amounts of memory (at least 4GB), such that no matter how much memory I give the JVM it fails with an OutOfMemoryError. Before I give up and try a different tagging library, I thought I'd ask to see if anyone here was familiar enough with LingPipe to know what I'm doing wrong.
I start by reading a HiddenMarkovModel object in from the file pos-en-general-brown.HiddenMarkovModel, included with the LingPipe library, which is boilerplate Java serialization code. Then I try to use it like this:
HmmDecoder decoder = new HmmDecoder(hmm, new FastCache<String, double[]>(1000),
new FastCache<String, double[]>(1000));
List<Email> emails = FileUtil.loadMLPosts(new File(args[1]));
Multiset<String> rHelpTagCounts = countTagsInEmails(decoder, emails);
Where countTagsInEmails is defined as follows:
static TokenizerFactory TOKENIZER_FACTORY = IndoEuropeanTokenizerFactory.INSTANCE;
public static Multiset<String> countTagsInEmails(HmmDecoder decoder, List<Email> emails) {
Multiset<String> tagCounts = HashMultiset.create();
for(Email email : emails) {
char[] bodyChars = email.body.toCharArray();
Tokenizer tokenizer = TOKENIZER_FACTORY.tokenizer(bodyChars, 0, bodyChars.length);
List<String> bodyTokens = new ArrayList<>();
tokenizer.tokenize(bodyTokens, new ArrayList<String>()); //Throw away the whitespaces list, we don't care
Tagging<String> taggedTokens = decoder.tag(bodyTokens);
tagCounts.addAll(taggedTokens.tags());
}
return tagCounts;
}
I don't think the details of FileUtil.loadMLPosts() are important; this just creates a list of Email objects from my 180MB e-mail archive file, where the body field of each Email is a String containing the e-mail's body. Note that the Multiset is Guava's implementation.
If I watch Java's memory usage while running my program, it starts at 1GB (already surprisingly high), then steadily climbs the more e-mails are tagged. At a few points it jumps dramatically, by several hundred megabytes at once. Before it can finish tagging the corpus, it reaches 4GB (the amount of memory I gave my JVM) and crashes.
Is LingPipe's HmmDecoder supposed to be this memory-inefficient? Or am I using it wrong? I notice that the example given on LingPipe's (rather sparse) documentation page for POS tagging always shows the decoder tagging one sentence at a time, so is it a mistake to pass the entire e-mail body to decoder.tag()?

Lucene Document creation in while loop slows down more and more

I have some efficiency problems. I'm developing an enterprise application that is deployed on a JBoss EAP 6.1 server as an EAR archive. I create new objects based on entities in a while loop and write them to a file. I get those entities (with the help of an EJB DAO) in limited batches (for example 2000 per step). The problem is that I need to process millions of objects, and the first million goes quite smoothly, but the further the loop advances the slower it works. Can anyone tell me why this gets slower and slower as the loop advances? How can I make it work smoothly all the way through? Here are some crucial parts of the code:
public void createFullIndex(int stepSize) {
int logsNumber = systemLogDao.getSystemLogsNumber();
int counter = 0;
while (counter < logsNumber) {
for (SystemLogEntity systemLogEntity : systemLogDao.getLimitedSystemLogs(counter, stepSize)) {
addDocument(systemLogEntity);
}
counter = counter + stepSize;
}
commitIndex();
}
public void addDocument(SystemLogEntity systemLogEntity) {
try {
Document document = new Document();
document.add(new NumericField("id", Field.Store.YES, true).setIntValue(systemLogEntity.getId()));
document.add(new Field("resource", (systemLogEntity.getResource() == null ? "" : systemLogEntity
.getResource().getResourceCode()), Field.Store.YES, Field.Index.ANALYZED));
document.add(new Field("operationType", (systemLogEntity.getOperationType() == null ? "" : systemLogEntity
document.add(new Field("comment",
(systemLogEntity.getComment() == null ? "" : systemLogEntity.getComment()), Field.Store.YES,
Field.Index.ANALYZED));
indexWriter.addDocument(document);
} catch (CorruptIndexException e) {
LOGGER.error("Failed to add the following log to Lucene index:\n" + systemLogEntity.toString(), e);
} catch (IOException e) {
LOGGER.error("Failed to add the following log to Lucene index:\n" + systemLogEntity.toString(), e);
}
}
I would appreciate your help!
As far as I can see, you do not write your data to the file as you receive it. Instead you try to create a full DOM object and then flush it to a file. This strategy is good for a limited number of objects. In your case, when you have to deal with millions of them (as you said), you should not use DOM. Instead, you should create your XML fragments and write them to the file while you are receiving the data. This will reduce your memory consumption and hopefully improve performance.
I would try re-using the Document object. I've had looping issues with the garbage collection where my loops are too fast for the gc to reasonably keep up, and re-use of objects solved all my issues. I haven't tried re-using the Document object personally, but if it's possible, it may work for you.
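A rough sketch of that reuse idea against the Lucene 3.x API used in the question (field set trimmed; 'entities' stands for one batch from the DAO; setValue/setIntValue are assumed to be available in that version; exception handling omitted):
// Create Document and Field instances once, refill them per entity,
// and hand the same Document to the writer each time.
Document document = new Document();
NumericField idField = new NumericField("id", Field.Store.YES, true);
Field resourceField = new Field("resource", "", Field.Store.YES, Field.Index.ANALYZED);
Field commentField = new Field("comment", "", Field.Store.YES, Field.Index.ANALYZED);
document.add(idField);
document.add(resourceField);
document.add(commentField);
for (SystemLogEntity e : entities) {
    idField.setIntValue(e.getId());
    resourceField.setValue(e.getResource() == null ? "" : e.getResource().getResourceCode());
    commentField.setValue(e.getComment() == null ? "" : e.getComment());
    indexWriter.addDocument(document); // field values are copied at add time
}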
Logging should be easy. Using Guava, appending to a text file looks like:
File to = new File("C:/Logs/log.txt");
CharSequence from = "Your data as string\n";
Files.append(from, to, Charsets.UTF_8);
A few of my notes:
I am not sure whether your log entities are garbage collected.
It is not clear whether the file content is kept in memory.
If the log is in XML format, the whole XML DOM might need to be parsed whenever a new element is added.
