Parsing messy texts with Stanford Parser - java

I am running Stanford Parser on a large chunk of texts. The parser terminates when it hits a sentence it cannot parse, and gives the following runtime error. Is there a way to make Stanford Parser ignore the error, and move on to parsing the next sentence?
One way is to break the text into a myriad of one-sentence documents, parse each document, and record the output. However, this involves loading the Stanford Parser many, many times (the parser has to be reloaded each time a document is parsed). Loading the parser takes a lot of time, while parsing takes much less. It would be great to find a way to avoid reloading the parser for every sentence.
Another solution might be to reload the parser once it hits an error, pick up the text where it stopped, and continue parsing from there. Does anyone know of a good way to implement this solution?
Last but not least, does there exist any Java wrapper that ignores errors and keeps a Java program running until the program terminates naturally?
Thanks!
Exception in thread "main" java.lang.RuntimeException: CANNOT EVEN CREATE ARRAYS OF ORIGINAL SIZE!!
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.considerCreatingArrays(ExhaustivePCFGParser.java:2190)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.parse(ExhaustivePCFGParser.java:347)
at edu.stanford.nlp.parser.lexparser.LexicalizedParserQuery.parseInternal(LexicalizedParserQuery.java:258)
at edu.stanford.nlp.parser.lexparser.LexicalizedParserQuery.parse(LexicalizedParserQuery.java:536)
at edu.stanford.nlp.parser.lexparser.LexicalizedParserQuery.parseAndReport(LexicalizedParserQuery.java:585)
at edu.stanford.nlp.parser.lexparser.ParseFiles.parseFiles(ParseFiles.java:213)
at edu.stanford.nlp.parser.lexparser.ParseFiles.parseFiles(ParseFiles.java:73)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.main(LexicalizedParser.java:1535)

This error is basically an out-of-memory error. It likely occurs because there are long stretches of text with no sentence-terminating punctuation (periods, question marks), so the parser is trying to parse a huge list of words that it regards as a single sentence.
The parser in general tries to continue after a parse failure, but in this case it cannot, because it failed to create the data structures for parsing a longer sentence and then also failed to recreate the data structures it was using previously. So, you need to do something.
Choices are:
Indicate sentence/short document boundaries yourself. This does not require loading the parser many times (and you should avoid that). From the command line, you can put each sentence in a file, give the parser many documents to parse, and ask it to save them in different files (see the -writeOutputFiles option).
Alternatively (and perhaps better), you can keep everything in one file by either putting one sentence per line or surrounding each sentence with simple XML/SGML-style tags, and then using the -sentences newline or -parseInside ELEMENT options.
Or you can simply avoid this problem by specifying a maximum sentence length. Longer stretches that are not divided into sentences will be skipped. (This is great for runtime too!) You can do this with -maxLength 80.
If you are writing your own program, you could catch this Exception and try to resume. But it will only be successful if sufficient memory is available, unless you take the steps in the earlier bullet points. A sketch of this approach follows below.
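A minimal sketch, assuming a reasonably recent LexicalizedParser API (the model path and input file name are placeholders): load the model once with -maxLength, let DocumentPreprocessor do the sentence splitting, and catch per-sentence failures so one bad stretch of text cannot end the whole run.

import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.trees.Tree;

public class RobustParsing {
    public static void main(String[] args) {
        // Load the model once and cap the sentence length the parser will attempt
        LexicalizedParser lp = LexicalizedParser.loadModel(
                "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz", "-maxLength", "80");

        // DocumentPreprocessor handles sentence splitting of the input file
        for (List<HasWord> sentence : new DocumentPreprocessor("input.txt")) {
            try {
                Tree parse = lp.apply(sentence);
                System.out.println(parse);
            } catch (RuntimeException e) {
                // Log the failure and move on to the next sentence instead of terminating
                System.err.println("Skipping unparsable sentence: " + e.getMessage());
            }
        }
    }
}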

Parsers are known to be slow. You can try using a shallow parser, which will be relatively faster than the full-blown version. If you just need POS tags, consider using a tagger instead. Create a static instance of the parser and use it over and over rather than reloading it, for example:
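(This is only an illustrative sketch of the reuse idea; the model path shown is the usual English PCFG model and may differ in your setup.)

import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

// Load the parser model exactly once and hand out the same instance everywhere
public final class ParserHolder {
    private static final LexicalizedParser PARSER =
            LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");

    private ParserHolder() { }

    public static LexicalizedParser get() {
        return PARSER;
    }
}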

Related

ANTLR4 techniques for LARGE files - breaking up the parsing tree

I have a working ANTLR4 compiler which works well for files up to ~300 MB if I set the Java VM heap size to 8 GB with -Xmx8G. However, larger files crash the parser/compiler with a heap out-of-memory message. I have been advised to check my code for memory consumption outside of the ANTLR4 process (data below). I'm using a token factory and unbuffered character and token streams.
One strategy I'm working with is to test the size of the input file/stream (if knowable; in my case it is). If the file is small, I parse using my top-level rule, which generates a parse tree that is large but works for small files.
If the file is larger than an arbitrary threshold, I attempt to divide the parsing into chunks by selecting a sub-rule. So for small files I parse the rule patFile (existing working code); for large files I'm exploring breaking things up by parsing the sub-rule "patFileHeader", followed by parsing the rule "bigPatternRec", which replaces the "patterns+" portion of the former rule.
In this way my expectation is that I can control how much of the token stream is read in.
At the moment this looks promising, but I see issues with controlling how much ANTLR4 parses when processing the header. I likely have a grammar rule that causes patFileHeader to consume all available input tokens before exiting. Other cases seem to work, but I'm still testing. I'm just not sure that this approach to solving "large file" parsing is viable.
SMALL file Example Grammar:
patFile : patFileHeader patterns+
// {System.out.println("parser encountered patFile");}
;
patFileHeader : SpecialDirective? includes* gbl_directives* patdef
;
patterns : patdata+ patEnd
// {System.out.println("parser encountered patterns");}
;
bigPatternRec : patdata
| patEnd
;
...
In my case for a small file, I create the parse tree with:
parser = new myParser(tokens);
tree = parser.patFile(); // rule that parses to EOF
walker=walk(mylisteners,tree);
Which will parse the entire file to EOF.
For larger files I considered the following technique:
// Process the first few lines of the file
tree = parser.patFileHeader(); // sub rule that does not parse to EOF
walker=walk(mylisteners,tree);
//
// Process remaining lines one line/record at a time
//
while (inFile.available() > 0) {
parser = new myParser(tokens);
tree = parser.bigPatternRec();
walker=walk(mylisteners,tree);
}
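For reference, the unbuffered setup mentioned above looks roughly like this (simplified; myLexer/myParser are the classes generated from my grammar, and the other classes come from org.antlr.v4.runtime):

// Simplified sketch of the unbuffered stream setup
CharStream chars = new UnbufferedCharStream(inFile);
myLexer lexer = new myLexer(chars);
lexer.setTokenFactory(new CommonTokenFactory(true)); // copy token text; unbuffered streams discard characters otherwise
TokenStream tokens = new UnbufferedTokenStream<CommonToken>(lexer);
parser = new myParser(tokens);
parser.setBuildParseTree(true); // the listeners above still need a tree to walk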
In response to a suggestion that I profile the behavior, I have generated this screenshot of JVMonitor on the "whole file" during processing of my project.
One thing of interest to me was the three Context sets of ~398 MB. In my grammar vec is a component of vecdata, so it appears that some context data is getting replicated. I may play with that. It's possible that the char[] entry is my code outside of ANTLR4. I'd have to disable my listeners and run to generate the parse tree without my code active to be sure. I do other things that consume memory (MappedByteBuffers) for high-speed file I/O on output, which will contribute to exceeding the 8 GB image.
What is interesting, however, is what happens to the memory image if I break the calls up and JUST process sub-rules. The memory consumption is ~10% of the full size, and the ANTLR4 objects are not even on the radar in that case.

Defining a manual Split algorithm for File Input

I'm new to Spark and the Hadoop ecosystem and already fell in love with it.
Right now, I'm trying to port an existing Java application over to Spark.
This Java application is structured the following way:
Read the file(s) one by one with a BufferedReader and a custom Parser class that does some heavy computing on the input data. The input files are 1 to at most 2.5 GB each.
Store data in memory (in a HashMap<String, TreeMap<DateTime, List<DataObjectInterface>>>)
Write out the in-memory datastore as JSON. These JSON files are smaller in size.
I wrote a Scala application that processes my files with a single worker, but that is obviously not the best performance I can get out of Spark.
Now to my problem with porting this over to Spark:
The input files are line-based. I usually have one message per line. However, some messages depend on preceding lines to form an actual valid message in the Parser. For example it could happen that I get data in the following order in an input file:
{timestamp}#0x033#{data_bytes} \n
{timestamp}#0x034#{data_bytes} \n
{timestamp}#0x035#{data_bytes} \n
{timestamp}#0x0FE#{data_bytes}\n
{timestamp}#0x036#{data_bytes} \n
To form an actual message out of the "composition message" 0x036, the parser also needs the lines from messages 0x033, 0x034 and 0x035. Other messages could also appear in between this set of needed messages. Most messages can be parsed by reading a single line, though.
Now finally my question:
How do I get Spark to split my file correctly for my purposes? The files cannot be split "randomly"; they must be split in a way that ensures all my messages can be parsed and the Parser will not wait for input that it will never get. This means that each composition message (messages that depend on preceding lines) needs to be in one split.
I guess there are several ways to achieve a correct output but I'll throw some ideas that I had into this post as well:
Define a manual split algorithm for the file input? This would check that the last few lines of a split do not contain the start of a "big" message [0x033, 0x034, 0x035].
Split the file however Spark wants, but also add a fixed number of lines (let's say 50, which will do the job for sure) from the previous split to the next split. Duplicate data would be handled correctly by the Parser class and would not introduce any issues.
The second way might be easier, however I have no clue how to implement this in Spark. Can someone point me in the right direction?
Thanks in advance!
I saw your comment on my blogpost on http://blog.ae.be/ingesting-data-spark-using-custom-hadoop-fileinputformat/ and decided to give my input here.
First of all, I'm not entirely sure what you're trying to do. Help me out here: your file contains lines with the 0x033, 0x034, 0x035 and 0x036 records, so Spark will process them separately, while actually these lines need to be processed together?
If this is the case, you shouldn't interpret this as a "corrupt split". As you can read in the blog post, Spark splits files into records that it can process separately. By default it does this by splitting records on newlines. In your case, however, your "record" is actually spread over multiple lines. So yes, you can use a custom FileInputFormat. I'm not sure this will be the easiest solution, however.
You can try to solve this using a custom FileInputFormat that does the following: instead of handing out the file line by line like the default FileInputFormat does, you parse the file and keep track of the records you encounter (0x033, 0x034, etc.). In the meantime you may filter out records like 0x0FE (not sure if you want to use them elsewhere). The result will be that Spark gets all these physical records as one logical record. A rough skeleton of this idea is sketched below.
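To make that concrete, here is an untested skeleton of such an input format; the class names and the isPartOfComposition() check are placeholders you would adapt to your message layout, and records that straddle split boundaries still need extra care:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hands Spark one *logical* record (a composition message plus the lines it depends on)
// instead of one physical line.
public class CompositionInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
        return new CompositionRecordReader();
    }

    static class CompositionRecordReader extends RecordReader<LongWritable, Text> {
        private final LineRecordReader lines = new LineRecordReader();
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext ctx) throws IOException {
            lines.initialize(split, ctx);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            StringBuilder record = new StringBuilder();
            boolean first = true;
            while (lines.nextKeyValue()) {
                if (first) {
                    key.set(lines.getCurrentKey().get()); // byte offset of the first physical line
                    first = false;
                }
                String line = lines.getCurrentValue().toString();
                record.append(line).append('\n');
                // Placeholder: keep accumulating while the line is part of an unfinished composition
                if (!isPartOfComposition(line)) {
                    break;
                }
            }
            if (first) {
                return false; // no more lines in this split
            }
            value.set(record.toString());
            return true;
        }

        // Placeholder predicate: does this line belong to a composition that needs more lines?
        private boolean isPartOfComposition(String line) {
            return line.contains("#0x033#") || line.contains("#0x034#") || line.contains("#0x035#");
        }

        @Override public LongWritable getCurrentKey() { return key; }
        @Override public Text getCurrentValue() { return value; }
        @Override public float getProgress() throws IOException { return lines.getProgress(); }
        @Override public void close() throws IOException { lines.close(); }
    }
}

You would then plug it into Spark with something like sc.newAPIHadoopFile(path, CompositionInputFormat.class, LongWritable.class, Text.class, new Configuration()).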
On the other hand, it might be easier to read the file line by line and map the records using a functional key (e.g. [object 33, 0x033], [object 33, 0x034], ...). This way you can combine these lines using the key you chose.
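In Java Spark terms that could look something like the sketch below, where groupKeyFor() is a made-up helper that derives the shared key (e.g. "object 33") from a line's id/timestamp fields:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class MessageGrouping {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("message-grouping"));
        JavaRDD<String> lines = sc.textFile("input.txt");

        // Key every physical line by the logical message it belongs to, then pull the groups together;
        // each group can afterwards be handed to the existing Parser class in one piece.
        JavaPairRDD<String, Iterable<String>> messages = lines
                .mapToPair(line -> new Tuple2<>(groupKeyFor(line), line))
                .groupByKey();

        System.out.println("logical messages: " + messages.count());
        sc.stop();
    }

    // Made-up helper: derive the grouping key from the line's id/timestamp fields
    private static String groupKeyFor(String line) {
        return line.split("#", 2)[0]; // e.g. group by timestamp prefix; adapt to your format
    }
}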
There are certainly other options. Whichever you choose depends on your use case.

Parsing XML in Java line by line

I'd like to parse an XML file in Java line by line, because the structure of the file that I got is a bit different than usual. It is not nested; each tag is on its own line.
Part of XML file:
<sentence><flag>3</flag></sentence>
<word><text>Zdravo</text></word>
<phoneme><onephoneme>z</onephoneme></phoneme>
<phoneme><onephoneme>d</onephoneme></phoneme>
<phoneme><onephoneme>r</onephoneme></phoneme>
<phoneme><onephoneme>"a:</onephoneme></phoneme>
<phoneme><onephoneme>v</onephoneme></phoneme>
<phoneme><onephoneme>O</onephoneme></phoneme>
<sentence><flag>0</flag></sentence>
<word><text>moje</text></word>
...
I searched and found a lot of different ways to parse an XML file, but all of them scan the whole file and I don't want that, because my file is almost 100k lines and for now (and maybe even later) I only need the first 800 lines, so it would be much faster to just parse line by line. I don't know in advance how many lines I really need, but I'd like to count how many times I reach the <sentence> tag and stop at a certain count (for now it's 17 - that's around line 800).
Tutorials that I found:
nested XML: http://www.mkyong.com/java/how-to-read-xml-file-in-java-dom-parser/
nested XML: http://www.javacodegeeks.com/2013/05/parsing-xml-using-dom-sax-and-stax-parser-in-java.html
single line XML with attributes: Read single XML line with Java
Each sentence is then separated into words and each word into phonemes, so in the end I'd have 3 ArrayLists: flags, words and phonemes.
I hope I gave you enough information.
Thank you.
Lines are not really relevant for XML; you could have all 100K lines' worth of your XML on one single line. What you need to do is count the elements/nodes you parse. Use a SAX parser: it is event-based and will notify you when an element starts and when it ends. Whenever you get an element you are interested in, increment the counter. This assumes you know the elements you are interested in; from your example, those would be the following (see the sketch after the list):
<sentence>
<word>
<phoneme>
etc.
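A minimal sketch of such a counting handler, assuming <sentence> is the element you count and stopping after 17 of them by throwing from the callback (the usual way to abort a SAX parse early); the file name is a placeholder:

import java.io.File;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SentenceCounter extends DefaultHandler {
    private static final int MAX_SENTENCES = 17;
    private int sentenceCount = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) throws SAXException {
        if ("sentence".equals(qName)) {
            sentenceCount++;
            if (sentenceCount > MAX_SENTENCES) {
                throw new SAXException("done"); // abort parsing once enough sentences have been seen
            }
        }
        // collect <flag>, <text> and <onephoneme> content via characters() into your three lists
    }

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try {
            parser.parse(new File("input.xml"), new SentenceCounter());
        } catch (SAXException expected) {
            // thrown by the handler above once the 18th <sentence> starts; earlier data is already collected
        }
    }
}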
Andrew Stubbs suggested SAX and StAX, but if your file is going to be really big I would use VTD-XML; it is at least 3 times faster than SAX and much more flexible. Processing 2 GB XML files is not a problem at all.
If you want to read a file line by line, it has nothing to do with XML. Simply use a BufferedReader, as it provides a readLine method.
With a simple counter you can check how many lines you have already read and quit the loop after you hit the 800 mark.
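A small sketch of that loop (the file name and the per-line handling are placeholders):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LineLimitedReader {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader("input.xml"))) {
            String line;
            int lineCount = 0;
            while ((line = reader.readLine()) != null && lineCount < 800) {
                lineCount++;
                // hand the line to whatever per-line handling you need, e.g. check for <sentence>
            }
        }
    }
}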
As said by @Korashen, if you can guarantee that the files you will be dealing with follow a flat, line-by-line structure, then you are probably best off pretending that the files are not XML at all and using a normal BufferedReader.
However, if you need to parse it as XML, then a streaming XML reader should be able to do what you want. According to Java XML Parser for huge files, SAX or StAX are the standard choices.
You can use a SAX parser. The XML is traversed sequentially and appropriate events are triggered. In addition, you could use org.xml.sax.Locator to identify the line number and throw an exception when you encounter line 800 to stop parsing.

java - analyzing big text files

I need to analyze a log file at runtime with Java.
What I need is, to be able to take a big text file, and search for a certain string or regex within a certain range of lines.
The range itself is deduced by another search.
For example, I want to search the string "operation ended with failure" in the file, but not the whole file, only starting with the line which says "starting operation".
Of course I can do this with plain InputStream and file reading, but is there a library or a tool that will help do it more conveniently?
If the file is really huge, then in your case either well-written Java or any *nix tool solution will be almost equally slow (it will be I/O bound). In such a case you won't avoid reading the whole file line by line, and a few lines of Java code would do the job (see the sketch below). But rather than a once-off search, I'd think about splitting the file at generation time, which might be much more efficient. You could redirect the log output to another program/script (either awk or Python would be perfect for it) and split the file on the fly, as it is generated, rather than after the fact.
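For the once-off search, a small sketch of the Java approach (the file name and the exact marker strings are placeholders taken from your example):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Pattern;

public class RangeSearch {
    public static void main(String[] args) throws IOException {
        Pattern target = Pattern.compile("operation ended with failure");
        boolean inRange = false;
        long lineNo = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader("app.log"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lineNo++;
                if (!inRange) {
                    inRange = line.contains("starting operation"); // start of the interesting range
                } else if (target.matcher(line).find()) {
                    System.out.println("match at line " + lineNo + ": " + line);
                }
            }
        }
    }
}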
Check this one out - http://johannburkard.de/software/stringsearch/
Hope that helps ;)

What does the org.apache.xmlbeans.XmlException with a message of "Unexpected element: CDATA" mean?

I'm trying to parse and load an XML document; however, I'm getting this exception when I call the parse method on the class that extends XmlObject. Unfortunately, it gives me no idea of which element is unexpected, which is my problem.
I am not able to share the code for this, but I can try to provide more information if necessary.
Not being able to share code or input data, you may consider the following approach. It's a very common dichotomic approach to diagnosis, I'm afraid, and indeed you may already have started it...
Try to reduce the size of the input XML by removing parts of it, ensuring that the underlying XML document remains well formed and possibly valid (if validity is required in your parser's setup). If you maintain validity, this may require altering [a copy of] the schema (DTD or other), as mandatory elements might be removed during the cut-and-try approach... BTW, the error message seems to hint more at a validation issue than at a basic well-formedness issue.
Unless one has a particular hunch as to the area that triggers the parser's complaint, we typically remove (or re-add, when things start working) about half of what was previously cut or re-added.
You may also start by trying a mostly empty file, to assert that the parser works at all... There again the idea is to "divide and conquer": is the issue in the XML input or in the parser? (Remembering that there could be two issues, one in the input and one in the parser, and that such issues could even be unrelated...)
Sorry to belabor basic diagnostic techniques which you may well be fluent with...
You should check the arguments you are passing to the parse() method;
that is, whether you are directly passing a String, a File, or an InputStream, and whether you are using the matching overload (File/InputStream/String).
The exception is caused by the length of the XML file. If you add or remove one character from the file, the parser will succeed.
The problem occurs within the third-party PiccoloLexer library that XMLBeans relies on. It has been fixed in revision 959082 but has not been applied to the xbean 2.5 jar.
XMLBeans - Problem with XML files if length is exactly 8193 bytes
Issue reported on XMLBean Jira
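Given that, a hedged workaround sketch until you can move past the xbean 2.5 jar: nudge the document length by appending a trailing newline (legal after the root element) before parsing, so the content no longer lands exactly on the problematic boundary. The file name is a placeholder, and you would substitute your generated type's Factory for XmlObject.Factory if you need typed access.

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.xmlbeans.XmlObject;

public class ParseWorkaround {
    public static void main(String[] args) throws Exception {
        String xml = new String(Files.readAllBytes(Paths.get("input.xml")), StandardCharsets.UTF_8);
        // Appending whitespace after the root element keeps the document well formed and
        // shifts the length off the 8193-byte boundary that triggers the PiccoloLexer bug.
        XmlObject doc = XmlObject.Factory.parse(xml + "\n");
        System.out.println(doc.xmlText().length());
    }
}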
