ANTLR4 techniques for LARGE files - breaking up the parsing tree - java

I have a working ANTLR4 compiler that handles files up to ~300 MB if I set the Java VM size to 8 GB with -Xmx8G. Larger files, however, crash the parser/compiler with a heap out-of-memory error. I have been advised to check my code for memory consumption outside of the ANTLR4 process (data below). I'm using a token factory and unbuffered character and token streams (UnbufferedCharStream / UnbufferedTokenStream).
One strategy I'm working with is to test the size of the input file/stream, if knowable; in my case it is. If the file is small, I parse using my top-level rule, which generates a parse tree that is large but works for small files.
If the file is larger than an arbitrary threshold, I attempt to divide the parsing into chunks by selecting a sub-rule. So for small files I parse the rule patFile (existing working code); for large files I'm exploring breaking things up by parsing the sub-rule "patFileHeader", followed by repeatedly parsing the rule "bigPatternRec", which replaces the "patterns+" portion of the former rule.
In this way my expectation is that I can control how much of the token stream is read in.
At the moment this looks promising, but I see issues with controlling how much ANTLR4 parses when processing the header. I likely have a grammar rule that causes patFileHeader to consume all available input tokens before exiting. Other cases seem to work, but I'm still testing. I'm just not sure that this approach to "large file" parsing is viable.
SMALL file Example Grammar:
patFile : patFileHeader patterns+
// {System.out.println("parser encountered patFile");}
;
patFileHeader : SpecialDirective? includes* gbl_directives* patdef
;
patterns : patdata+ patEnd
// {System.out.println("parser encountered patterns");}
;
bigPatternRec : patdata
| patEnd
;
...
In my case for a small file, I create the parse tree with:
parser = new myParser(tokens);
tree = parser.patFile(); // rule that parses to EOF
walker = walk(mylisteners, tree);
Which will parse the entire file to EOF.
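The token stream above comes from the unbuffered setup mentioned at the start; it looks roughly like this (a sketch only: myLexer stands in for my generated lexer class, and the file name is just a placeholder):
// Unbuffered setup (classes from org.antlr.v4.runtime); sketch only.
CharStream input = new UnbufferedCharStream(new FileInputStream("big.pat"));
myLexer lexer = new myLexer(input);
lexer.setTokenFactory(new CommonTokenFactory(true)); // tokens copy their text out of the rolling buffer
TokenStream tokens = new UnbufferedTokenStream<CommonToken>(lexer);
myParser parser = new myParser(tokens);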
For larger files I considered the following technique:
// Process the first few lines of the file
tree = parser.patFileHeader(); // sub rule that does not parse to EOF
walker=walk(mylisteners,tree);
//
// Process remaining lines one line/record at a time
//
while (inFile.available() > 0) {
    parser = new myParser(tokens);
    tree = parser.bigPatternRec();
    walker = walk(mylisteners, tree);
}
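A variation of that loop, keyed off the token stream's EOF token instead of inFile.available() (a sketch only, reusing the walk helper from above):
// Sketch: stop when the token stream reports EOF instead of polling inFile.available().
tree = parser.patFileHeader();        // header rule, parsed once
walker = walk(mylisteners, tree);
while (tokens.LA(1) != Token.EOF) {   // Token from org.antlr.v4.runtime
    tree = parser.bigPatternRec();    // one record at a time
    walker = walk(mylisteners, tree);
}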
In response to a suggestion that I profile the behavior, I generated a JVMonitor screenshot of the "whole file" run during processing of my project.
One thing of interest to me was the three Context sets of ~398 MB. In my grammar, vec is a component of vecdata, so it appears that some context data is getting replicated. I may play with that. It's possible that the char[] entry is my code outside of ANTLR4; I'd have to disable my listeners and run to generate the parse tree without my code active to be sure. I do other things that consume memory (MappedByteBuffers for high-speed file I/O on output), which will contribute to exceeding the 8 GB image.
What is interesting, however, is what happens to the memory image if I break the calls up and JUST process sub-rules. The memory consumption is ~10% of the full size, and the ANTLR4 objects are not even on the radar in that case.

Related

Can't write large owl file with Jena

I'm trying to convert the data contained in a database table into a set of triples, so I'm writing an OWL file using the Jena Java library.
I have successfully done it with a small number of table records (100), which corresponds to nearly 20,000 lines in the .owl file, and I'm happy with it.
To write the owl file I have used the following code (m is an OntModel object):
BufferedWriter out = null;
try {
    out = new BufferedWriter(new FileWriter(FILENAME));
    m.write(out);
    out.close();
} catch (IOException e) {}
Unfortunately, when I try to do the same with the entire result set of the table (800,000 records), the Eclipse console shows me the exception:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
the exception is raised by
m.write(out);
I'm absolutely sure the model is correctly filled because I tried to execute the program without creating the owl file and all worked fine.
To fix it, I tried to increase the heap memory by setting -Xmx4096M in Run -> Configurations -> VM arguments, but the error still appears.
I'm executing the application on a MacBook, so I don't have unlimited memory. Is there any chance of completing the task? Maybe there is a more efficient way to store the model?
The default format, RDF/XML, is a "pretty" form, but to calculate the "pretty" output quite a lot of work needs to be done before writing starts. This includes building up internal data structures. Some shapes of data cause quite extensive work searching for the "most pretty" variation.
RDF/XML in pretty form is the most expensive format. Even the pretty Turtle form is cheaper, though it still involves some preparation calculations.
To write in RDF/XML in a simpler format, with no complex pretty features:
RDFDataMgr.write(System.out, m, RDFFormat.RDFXML_PLAIN);
Output streams are preferred, and the output will be UTF-8 - "new BufferedWriter (new FileWriter(FILENAME));" will use the platform default character set.
See the documentation for other formats and variations:
https://jena.apache.org/documentation/io/rdf-output.html
such as RDFFormat.TURTLE_BLOCKS.
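For example, a minimal sketch writing straight to an OutputStream (the file name is a placeholder; m is the OntModel from the question):
// Stream the model out in a simple, non-pretty format; output is UTF-8.
try (OutputStream out = new FileOutputStream("model.owl")) {
    RDFDataMgr.write(out, m, RDFFormat.RDFXML_PLAIN);  // or RDFFormat.TURTLE_BLOCKS
} catch (IOException e) {
    e.printStackTrace();
}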

Extract part of XML file [duplicate]

I need an XML parser to parse a file that is approximately 1.8 GB,
so the parser should not load the whole file into memory.
Any suggestions?
Aside from the recommended SAX parsing, you could use the StAX API (a kind of SAX evolution), included in the JDK (package javax.xml.stream).
StAX Project Home: http://stax.codehaus.org/Home
Brief introduction: http://www.xml.com/pub/a/2003/09/17/stax.html
Javadoc: https://docs.oracle.com/javase/8/docs/api/javax/xml/stream/package-summary.html
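A minimal StAX sketch using the cursor API (the file and element names are placeholders):
// Pulls one event at a time, so the 1.8 GB file is never fully loaded.
// Uses javax.xml.stream.* and java.io.FileInputStream.
XMLInputFactory factory = XMLInputFactory.newInstance();
try (FileInputStream in = new FileInputStream("records.xml")) {
    XMLStreamReader reader = factory.createXMLStreamReader(in);
    while (reader.hasNext()) {
        if (reader.next() == XMLStreamConstants.START_ELEMENT
                && "item".equals(reader.getLocalName())) {
            System.out.println(reader.getElementText()); // handle one element at a time
        }
    }
    reader.close();
}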
Use a SAX based parser that presents you with the contents of the document in a stream of events.
StAX API is easier to deal with compared to SAX. Here is a short tutorial
Try VTD-XML. I've found it to be more performant, and more importantly, easier to use than SAX.
As others have said, use a SAX parser, as it is a streaming parser. Using the various events, you extract your information as necessary and then, on the fly store it someplace else (database, another file, what have you).
You can even store it in memory if you truly just need a minor subset, or if you're simply summarizing the file. Depends on the use case of course.
If you're spooling to a DB, make sure you take some care to make your process restartable or whatever. A lot can happen in 1.8GB that can fail in the middle.
Stream the file into a SAX parser and read it into memory in chunks.
SAX gives you a lot of control, and being event-driven makes sense. The API is a little hard to get a grip on; you have to pay attention to some things, like when the characters() method is called, but the basic idea is that you write a content handler that gets called when the start and end of each XML element is read. So you can keep track of the current xpath in the document, identify which paths have the data you're interested in, and identify which path marks the end of a chunk that you want to save, hand off, or otherwise process.
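A minimal sketch of such a content handler (the element name "name" and the file name are placeholders):
// SAX handler sketch: collect the text of each element and react at its end tag.
// characters() may be called several times per element, hence the StringBuilder.
class RecordHandler extends org.xml.sax.helpers.DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String local, String qName, org.xml.sax.Attributes atts) {
        text.setLength(0);                        // reset buffer at each start tag
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("name".equals(qName)) {
            System.out.println("name = " + text); // store / hand off the value here
        }
    }
}
// usage: SAXParserFactory.newInstance().newSAXParser().parse(new File("big.xml"), new RecordHandler());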
Use almost any SAX Parser to stream the file a bit at a time.
I had a similar problem - I had to read a whole XML file and create a data structure in memory. On this data structure (the whole thing had to be loaded) I had to do various operations. A lot of the XML elements contained text (which I had to output in my output file, but wasn't important for the algorithm).
Firstly, as suggested here, I used SAX to parse the file and build up my data structure. My file was 4 GB and I had an 8 GB machine, so I figured maybe 3 GB of the file was just text, and java.lang.String would probably need 6 GB for that text, given its UTF-16 representation.
If the JVM takes up more space than the computer has physical RAM, then the machine will swap. Doing a mark+sweep garbage collection will result in the pages getting accessed in a random-order manner and also objects getting moved from one object pool to another, which basically kills the machine.
So I decided to write all my strings out to disk in a file (the FS can obviously handle a sequential write of the 3 GB just fine, and when reading it in, the OS will use available memory for a file-system cache; there might still be random-access reads, but fewer than during a GC in Java). I created a little helper class which you are more than welcome to download if it helps you: StringsFile javadoc | Download ZIP.
StringsFile file = new StringsFile();
StringInFile str = file.newString("abc"); // writes string to file
System.out.println("str is: " + str.toString()); // fetches string from file
+1 for StAX. It's easier to use than SAX because you don't need to write callbacks (you essentially just loop over all elements of the file until you're done) and it has (AFAIK) no limit on the size of the files it can process.

Marking chunks of data in a binary file

I'm writing save logic for an application and part of it will save a dynamic list of "chunks" of data to a single file. Some of those chunks might have been provided by a plugin though (which would have included logic to read it back), so I need to find a way to properly skip unrecognized chunks of data if the plugin which created it has been removed.
My current solution is to write a length (int32) before each "chunk" so if there's an error the reader can skip past it and continue reading the next chunk.
However, this requires calculating the length of the data before writing any of it, and since our system is somewhat dynamic and allows nested data types, I'd rather avoid the overhead of caching everything just to measure it.
I'm considering using file markers somehow - I could scan the file for a very specific byte sequence that separates chunks. That could be written after each chunk rather than before.
Are there other options I'm not thinking of? My goal is to find a way to write the data immediately, without the need for caching and measuring it first.
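For reference, a rough sketch of the read side of the current length-prefixed layout (the chunk id, the DataInputStream in, and the helpers are hypothetical):
// If a chunk's id isn't recognized (e.g. its plugin was removed), skip exactly
// 'length' bytes to land on the next chunk. skipBytes may need a loop for very
// large chunks, since it can skip fewer bytes than requested.
int chunkId = in.readInt();
int length = in.readInt();
if (knownChunkIds.contains(chunkId)) {
    readChunk(chunkId, in, length);   // plugin-provided reader
} else {
    in.skipBytes(length);             // unknown chunk: jump past it
}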

Parsing messy texts with Stanford Parser

I am running Stanford Parser on a large chunk of texts. The parser terminates when it hits a sentence it cannot parse, and gives the following runtime error. Is there a way to make Stanford Parser ignore the error, and move on to parsing the next sentence?
One way is to break down the text into a myriad of one-sentence documents, and parse each document and record the output. However, this involves loading the Stanford Parser many many times (each time a document is parsed, the Stanford Parser has to be reloaded). Loading the parser takes a lot of time, but parsing takes much shorter time. It would be great to find a way to avoid having to reload the parser on every sentence.
Another solution might be to reload the parser once it hits an error, pick up the text where it stopped, and continue parsing from there. Does anyone know of a good way to implement this solution?
Last but not least, does there exist any Java wrapper that ignores errors and keeps a Java program running until the program terminates naturally?
Thanks!
Exception in thread "main" java.lang.RuntimeException: CANNOT EVEN CREATE ARRAYS OF ORIGINAL SIZE!!
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.considerCreatingArrays(ExhaustivePCFGParser.java:2190)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.parse(ExhaustivePCFGParser.java:347)
at edu.stanford.nlp.parser.lexparser.LexicalizedParserQuery.parseInternal(LexicalizedParserQuery.java:258)
at edu.stanford.nlp.parser.lexparser.LexicalizedParserQuery.parse(LexicalizedParserQuery.java:536)
at edu.stanford.nlp.parser.lexparser.LexicalizedParserQuery.parseAndReport(LexicalizedParserQuery.java:585)
at edu.stanford.nlp.parser.lexparser.ParseFiles.parseFiles(ParseFiles.java:213)
at edu.stanford.nlp.parser.lexparser.ParseFiles.parseFiles(ParseFiles.java:73)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.main(LexicalizedParser.java:1535)
This error is basically an out-of-memory error. It likely occurs because there are long stretches of text with no sentence-terminating punctuation (periods, question marks), and so the parser is trying to parse a huge list of words that it regards as a single sentence.
The parser in general tries to continue after a parse failure, but can't in this case, because it both failed to create data structures for parsing a longer sentence and then failed to recreate the data structures it was using previously. So you need to do something.
Choices are:
- Indicate sentence/short document boundaries yourself. This does not require loading the parser many times (and you should avoid that). From the command line you can put each sentence in a file and give the parser many documents to parse, asking it to save them in different files (see the -writeOutputFiles option).
- Alternatively (and perhaps better), you can do this keeping everything in one file, either by making the sentences one per line or by using simple XML/SGML-style tags surrounding each sentence, and then using the -sentences newline or -parseInside ELEMENT options.
- Or you can just avoid this problem by specifying a maximum sentence length. Longer stretches that are not sentence-divided will be skipped. (This is great for runtime too!) You can do this with -maxLength 80.
If you are writing your own program, you could catch this Exception and try to resume. But it will only be successful if sufficient memory is available, unless you take the steps in the earlier bullet points.
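A rough sketch of that last option: load the parser once, parse sentence by sentence, and catch failures so one bad "sentence" does not kill the run (the model path is the standard English PCFG; the sentences list is assumed to be split already):
// Uses edu.stanford.nlp.parser.lexparser.LexicalizedParser and the PTB tokenizer.
LexicalizedParser lp = LexicalizedParser.loadModel(
        "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz", "-maxLength", "80");
TokenizerFactory<CoreLabel> tf = PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
for (String sentence : sentences) {          // 'sentences': your pre-split input
    try {
        List<CoreLabel> words = tf.getTokenizer(new StringReader(sentence)).tokenize();
        Tree tree = lp.apply(words);
        // record the parse for this sentence here
    } catch (RuntimeException e) {
        System.err.println("Skipping unparseable sentence: " + e.getMessage());
    }
}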
Parsers are known to be slow. You can try using a shallow parser, which will be relatively faster than the full-blown version. If you just need POS tags, then consider using a tagger. Create a static instance of the parser and use it over and over rather than reloading it.

java - analyzing big text files

I need to analyze a log file at runtime with Java.
What I need is to be able to take a big text file and search for a certain string or regex within a certain range of lines.
The range itself is deduced by another search.
For example, I want to search the string "operation ended with failure" in the file, but not the whole file, only starting with the line which says "starting operation".
Of course I can do this with plain InputStream and file reading, but is there a library or a tool that will help do it more conveniently?
If the file is really huge, then either well-written Java or any *nix tool will be almost equally slow (the task will be I/O bound). In that case you won't avoid reading the whole file line by line, and a few lines of Java code will do the job (see the sketch below). But rather than a one-off search, I'd think about splitting the file at generation time, which might be much more efficient. You could redirect the log file to another program/script (either awk or Python would be perfect for it) and split the file on the fly, as it is generated, rather than after the fact.
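A minimal sketch of that line-by-line approach, using the markers from the question (the log file name is a placeholder):
// Only test for the failure string once the "starting operation" line has been seen.
try (BufferedReader reader = new BufferedReader(new FileReader("app.log"))) {
    boolean inRange = false;
    String line;
    int lineNo = 0;
    while ((line = reader.readLine()) != null) {
        lineNo++;
        if (line.contains("starting operation")) {
            inRange = true;                  // the range of interest starts here
        } else if (inRange && line.contains("operation ended with failure")) {
            System.out.println("match at line " + lineNo + ": " + line);
        }
    }
}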
Check this one out - http://johannburkard.de/software/stringsearch/
Hope that helps ;)
