I am trying to parse out sentences from a huge amount of text using Java. I started off with NLP tools like OpenNLP and Stanford's Parser.
But here is where I get stuck: though both these parsers are pretty great, they fail when it comes to non-uniform text.
For example, in my text most sentences are delimited by a period, but in some cases, like bullet points, they aren't. Here both the parsers fail miserably.
I even tried setting the option in the Stanford parser for multiple sentence terminators, but the output was not much better!
Any ideas??
Edit: To make it simpler, I am looking to parse text where the delimiter is either a new line ("\n") or a period (".") ...
First you have to clearly define the task. What, precisely, is your definition of 'a sentence?' Until you have such a definition, you will just wander in circles.
Second, cleaning dirty text is usually a rather different task from 'sentence splitting'. The various NLP sentence chunkers assume relatively clean input text. Getting from HTML, or extracted PowerPoint, or other noise, to text is another problem.
Third, Stanford and other large caliber devices are statistical. So, they are guaranteed to have a non-zero error rate. The less your data looks like what they were trained on, the higher the error rate.
Write a custom sentence splitter. You could use something like the Stanford splitter as a first pass and then write a rule based post-processor to correct mistakes.
I did something like this for biomedical text I was parsing. I used the GENIA splitter and then fixed stuff after the fact.
EDIT: If your input is HTML, then you should preprocess it first, for example handling bulleted lists and such. Then apply your splitter.
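For what it's worth, here is a minimal sketch of that two-pass idea in plain Java. The terminator regex is my own assumption about the text (period, question mark, exclamation mark), not something taken from the Stanford or GENIA splitters: newlines are treated as hard boundaries first, then each line is split on the ordinary terminators.

import java.util.ArrayList;
import java.util.List;

public class SimpleSentenceSplitter {

    // First pass: a newline is always a boundary (bullet points rarely end in a period).
    // Second pass: split each line after ., ! or ? followed by whitespace.
    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        for (String line : text.split("\\r?\\n")) {
            for (String part : line.split("(?<=[.!?])\\s+")) {
                String s = part.trim();
                if (!s.isEmpty()) {
                    sentences.add(s);
                }
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "First sentence. Second sentence!\n- a bullet point without a period\nAnother line.";
        split(text).forEach(System.out::println);
    }
}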
There's one more excellent toolkit for natural language processing: GATE. It has a number of sentence splitters, including the standard ANNIE sentence splitter (which doesn't fit your needs completely) and the RegEx sentence splitter. Use the latter for any tricky splitting.
The exact pipeline for your purpose is:
Document Reset PR.
ANNIE English Tokenizer.
ANNIE RegEx Sentence Splitter.
Also, you can use GATE's JAPE rules for even more flexible pattern searching. (See the Tao for the full GATE documentation.)
If you would like to stick with Stanford NLP or OpenNLP, then you'd better retrain the model. Almost all of the tools in these packages are machine-learning based. Only with customized training data can they give you an ideal model and performance.
Here is my suggestion: manually split the sentences based on your criteria. I guess a couple of thousand sentences is enough. Then call the API or command line to retrain the sentence splitter. Then you're done!
But first of all, one thing you need to figure out is, as said above: "First you have to clearly define the task. What, precisely, is your definition of 'a sentence'?"
I'm using Stanford NLP and OpenNLP in my project, Dishes Map, a delicious-dish discovery engine based on NLP and machine learning. They're working very well!
For a similar case, what I did was separate the text into different lines (split on new lines) based on where I wanted the text to split. In your case that is text starting with bullets (or, more precisely, text with a line-break tag at the end). This also solves the similar problem that may occur if you are working with HTML.
After separating the text into different lines, you can send the individual lines for sentence detection, which will be more accurate.
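As a rough illustration of those two steps (assuming OpenNLP, which the question already uses, and a pre-trained en-sent.bin model on disk; both are assumptions about your setup):

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class LineThenSentenceSplit {
    public static void main(String[] args) throws Exception {
        try (InputStream modelIn = new FileInputStream("en-sent.bin")) {
            SentenceDetectorME detector = new SentenceDetectorME(new SentenceModel(modelIn));

            String text = "A normal sentence. Another one.\n- a bullet without a period\nLast line.";
            for (String line : text.split("\\r?\\n")) {            // a line break is a hard boundary
                for (String sentence : detector.sentDetect(line)) { // statistical split inside the line
                    System.out.println(sentence.trim());
                }
            }
        }
    }
}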
I wrote a custom tokenizer for Solr. When I first add records to Solr, they go through my tokenizer and other filters; while they are going through my tokenizer, I call a web service and add the needed attributes. After that I can search without sending requests to the web service. But when I search with highlighting, the data goes through my tokenizer again. What should I do so it doesn't go through the tokenizer again?
When the highlighter is run on the text to highlight, the analyzer and tokenizer for the field are re-run on the text to score the different tokens against the submitted text, to determine which fragment is the best match for the query produced. You can see this code around line #62 of Highlighter.java in Lucene.
There are however a few options that might help for negating the need to re-parse the document text, all given as options on the community wiki for Highlighting:
For the standard highlighter:
It does not require any special datastructures such as termVectors,
although it will use them if they are present. If they are not, this
highlighter will re-analyze the document on-the-fly to highlight it.
This highlighter is a good choice for a wide variety of search
use-cases.
There are also two other Highlighter-implementations you might want to look at, as either one uses other support structures that might avoid doing the retokenizing / analysis of the field (I think testing it will be a lot quicker for you than for me right now).
FastVector Highlighter: The FastVector Highlighter requires term vector options (termVectors, termPositions, and termOffsets) on the field.
Postings Highlighter: The Postings Highlighter requires storeOffsetsWithPositions to be configured on the field. This is a much more compact and efficient structure than term vectors, but is not appropriate for huge numbers of query terms.
You can switch the highlighting implementation by using hl.useFastVectorHighlighter=true or adding <highlighting class="org.apache.solr.highlight.PostingsSolrHighlighter"/> to your searchComponent definition.
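If it helps, here is a hedged SolrJ sketch of making that switch per request rather than in solrconfig.xml. The core name and field name are invented, the Builder-style client assumes SolrJ 6 or later, and the FastVector highlighter will still only work if the field has termVectors, termPositions and termOffsets enabled in the schema:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class HighlightQueryDemo {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client =
                new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        SolrQuery query = new SolrQuery("text:example");
        query.setHighlight(true);
        query.set("hl.fl", "text");                          // field to highlight (hypothetical name)
        query.set("hl.useFastVectorHighlighter", "true");    // needs term vector options on the field

        QueryResponse response = client.query(query);
        System.out.println(response.getHighlighting());      // doc id -> field -> snippets
        client.close();
    }
}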
I am running Stanford Parser on a large chunk of texts. The parser terminates when it hits a sentence it cannot parse, and gives the following runtime error. Is there a way to make Stanford Parser ignore the error, and move on to parsing the next sentence?
One way is to break the text down into a myriad of one-sentence documents and parse each document, recording the output. However, this involves loading the Stanford Parser many, many times (each time a document is parsed, the Stanford Parser has to be reloaded). Loading the parser takes a lot of time, while parsing takes much less. It would be great to find a way to avoid having to reload the parser on every sentence.
Another solution might be to reload the parser once it hits an error, pick up the text where it stopped, and continue parsing from there. Does anyone know of a good way to implement this solution?
Last but not least, does there exist any Java wrapper that ignores errors and keeps a Java program running until the program terminates naturally?
Thanks!
Exception in thread "main" java.lang.RuntimeException: CANNOT EVEN CREATE ARRAYS OF ORIGINAL SIZE!!
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.considerCreatingArrays(ExhaustivePCFGParser.java:2190)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.parse(ExhaustivePCFGParser.java:347)
at edu.stanford.nlp.parser.lexparser.LexicalizedParserQuery.parseInternal(LexicalizedParserQuery.java:258)
at edu.stanford.nlp.parser.lexparser.LexicalizedParserQuery.parse(LexicalizedParserQuery.java:536)
at edu.stanford.nlp.parser.lexparser.LexicalizedParserQuery.parseAndReport(LexicalizedParserQuery.java:585)
at edu.stanford.nlp.parser.lexparser.ParseFiles.parseFiles(ParseFiles.java:213)
at edu.stanford.nlp.parser.lexparser.ParseFiles.parseFiles(ParseFiles.java:73)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.main(LexicalizedParser.java:1535)
This error is basically an out-of-memory error. It likely occurs because there are long stretches of text with no sentence-terminating punctuation (periods, question marks), so the parser is trying to parse a huge list of words that it regards as a single sentence.
The parser in general tries to continue after a parse failure, but can't in this case because it both failed to create data structures for parsing a longer sentence and then failed to recreate the data structures it was using previously. So, you need to do something.
Choices are:
Indicate sentence/short document boundaries yourself. This does not require loading the parser many times (and you should avoid that). From the command line you can put each sentence in a file and give the parser many documents to parse, asking it to save them in different files (see the -writeOutputFiles option).
Alternatively (and perhaps better) you can keep everything in one file by either making the sentences one per line, or using simple XML/SGML-style tags surrounding each sentence, and then using the -sentences newline or -parseInside ELEMENT options.
Or you can just avoid this problem by specifying a maximum sentence length. Longer things that are not sentence divided will be skipped. (This is great for runtime too!) You can do this with -maxLength 80.
If you are writing your own program, you could catch this Exception and try to resume. But it will only be successful if sufficient memory is available, unless you take the steps in the earlier bullet points.
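To make that last option concrete, here is a minimal sketch along the lines of the ParserDemo that ships with the parser (the model path is the standard English PCFG; adjust the path and imports for your version). The parser is loaded once, sentence length is capped, and anything that still blows up is skipped rather than killing the run:

import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.Tokenizer;
import edu.stanford.nlp.process.TokenizerFactory;
import edu.stanford.nlp.trees.Tree;

public class ResilientParsing {
    public static void main(String[] args) {
        // Load once; reloading the parser per sentence is what makes things slow.
        LexicalizedParser parser = LexicalizedParser.loadModel(
                "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz",
                "-maxLength", "80");   // skip absurdly long "sentences"

        TokenizerFactory<CoreLabel> tokenizerFactory =
                PTBTokenizer.factory(new CoreLabelTokenFactory(), "");

        List<String> sentences = Arrays.asList(
                "This sentence parses fine.",
                "Another ordinary sentence.");

        for (String sentence : sentences) {
            try {
                Tokenizer<CoreLabel> tok =
                        tokenizerFactory.getTokenizer(new StringReader(sentence));
                Tree tree = parser.apply(tok.tokenize());
                System.out.println(tree);
            } catch (RuntimeException e) {
                // A failed sentence should not kill the whole run.
                System.err.println("Skipping unparseable sentence: " + e.getMessage());
            }
        }
    }
}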
Parsers are known to be slow. You can try using a shallow parser, which will be relatively faster than the full-blown version. If you just need POS tags, then consider using a tagger. Create a static instance of the parser and use it over and over rather than reloading it.
I have a small working text editor written in Java using JTextPane, and I would like to color specific words. It would work in the same fashion that Java keywords are colored within Eclipse: if a word is a keyword, it is highlighted after the user is done typing it. I am new to text editors implemented in Java. Any ideas?
In computer science, lexical analysis is the process of converting a
sequence of characters into a sequence of tokens. A program or
function that performs lexical analysis is called a lexical analyzer,
lexer, tokenizer,[1] or scanner. A lexer often exists as a single
function which is called by a parser or another function, or can be
combined with the parser in scannerless parsing.
Having said that, it is no trivial task. You need a high-level library to do it, which will ease your task. What is the way out?
Use ANTLR. Here is what its site says:
ANTLR is a powerful parser generator that you can use to read,
process, execute, or translate structured text or binary files. It’s
widely used in academia and industry to build all sorts of languages,
tools, and frameworks....
NetBeans IDE parses C++ with ANTLR.
There, problem solved. The author of ANTLR also has a book on how to use ANTLR, which you may want to buy if you want to learn how to use it.
Having given you enough brain melt, there is an out-of-the-box solution available for you: JSyntaxPane. Just like any JComponent, you initialize it and pop it into a JFrame. It works like a charm, and it supports a whole lot of languages apart from Java.
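If you just want to see the Swing mechanics before reaching for a lexer or JSyntaxPane, here is a minimal sketch of the brute-force approach. It colors keywords with a plain substring scan, so it is only illustrative (for example, "class" inside "classes" would also get colored), and the keyword list is just a sample:

import javax.swing.*;
import javax.swing.text.*;
import java.awt.Color;
import java.util.Arrays;
import java.util.List;

public class KeywordPaneDemo {
    private static final List<String> KEYWORDS = Arrays.asList("public", "class", "void");

    public static void main(String[] args) throws Exception {
        JTextPane pane = new JTextPane();
        pane.setText("public class Foo { void bar() {} }");
        StyledDocument doc = pane.getStyledDocument();

        // Attributes applied to keywords; everything else keeps the default style.
        SimpleAttributeSet keywordStyle = new SimpleAttributeSet();
        StyleConstants.setForeground(keywordStyle, Color.BLUE);
        StyleConstants.setBold(keywordStyle, true);

        String text = doc.getText(0, doc.getLength());
        for (String kw : KEYWORDS) {
            int idx = text.indexOf(kw);
            while (idx >= 0) {
                doc.setCharacterAttributes(idx, kw.length(), keywordStyle, true);
                idx = text.indexOf(kw, idx + kw.length());
            }
        }

        JFrame frame = new JFrame("Keyword highlighting sketch");
        frame.add(new JScrollPane(pane));
        frame.setSize(400, 200);
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.setVisible(true);
    }
}

In a real editor you would run the same scan from a DocumentListener, deferred with SwingUtilities.invokeLater (you cannot mutate the document inside the listener notification), or better, use an incremental lexer as described above.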
Is there a quick way to find all the commented-out code across Java files in Eclipse?
Any option in Search, perhaps, or any add-on that can do this?
It should be able to find only code which is commented out, but not ordinary comments.
In Eclipse, I just do a file search with the regular expression checkbox turned on:
(/\*.*;.*\*/)|(//.*;)
It will find semicolons in
// These;
and /* these; */
Works for me.
Sonar can do it: http://www.sonarsource.org/commented-out-code-eradication-with-sonar/
You can mark your own commented code with a task tag. You can create your own task tags in Eclipse.
From the menu, go to Window -> Preferences. In the Preferences dialog, go to General -> Editors -> Structured Text Editors -> Task Tags.
Add an appropriate task tag, like COMMENTED. Set the priority to Low.
Then, any code you comment out, you can mark with the COMMENTED task tag. A list of these task tags, along with their locations, appears in the Tasks view.
@Jorn said:
I think [the OP] wants to find code that is commented out, not code that has a comment.
If the intention is to find commented out code, then I don't think it is possible in general. The problem is that it is impossible to distinguish between comments that were written as code or pseudo-code, and code that is commented out. Making that distinction requires human intelligence.
Now, IDEs typically have a "toggle comments" function that comments out code in a particular way. It would be feasible to write a tool / plugin that matches the style produced by a particular IDE. But that's probably not good enough, especially since reformatting the code typically gets rid of the characteristics that made the commented-out code recognizable.
If the problem is to find commented-out code, what is needed is a way to find comments, and a way to decide if a comment might contain code.
A simple way to do this is to search for comments that contain code-like things. I'd be tempted to hunt for comments containing a ";" character (or some other rare indicator such as "="); it is pretty hard to write any interesting commented-out code that doesn't contain one, and in my experience people rarely write ordinary comments that do. A regexp search for this should be pretty straightforward, even if it picks up a few additional false positives (e.g. // in a string literal).
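For a one-off run outside Eclipse, a tiny program using exactly that heuristic might look like the sketch below. The regex mirrors the Eclipse search shown earlier and shares its limitations: block comments only match when the semicolon and both delimiters sit on one line, and // inside a string literal is a false positive:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class CommentedCodeFinder {
    // Same heuristic as the Eclipse search: a comment containing a semicolon.
    private static final Pattern SUSPICIOUS =
            Pattern.compile("(/\\*.*;.*\\*/)|(//.*;)");

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        try (Stream<Path> files = Files.walk(root)) {
            files.filter(p -> p.toString().endsWith(".java"))
                 .forEach(CommentedCodeFinder::scan);
        }
    }

    private static void scan(Path file) {
        try {
            String source = new String(Files.readAllBytes(file));
            Matcher m = SUSPICIOUS.matcher(source);
            while (m.find()) {
                System.out.println(file + ": " + m.group().trim());
            }
        } catch (IOException e) {
            System.err.println("Could not read " + file + ": " + e.getMessage());
        }
    }
}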
A more sophisticated way to accomplish this is to use a Java lexer or parser. If you have a lexer that returns comments as tokens (not all of them do; Java compilers aren't interested in comments), then you can simply scan the lexemes for a comment and do the semicolon check I described above. You won't get any false-positive hits for comment-like things in string literals with this approach.
If you have a re-engineering parser that captures comments as part of the AST (such as our SD Java Front End), you can mechanically scan the parse tree for comments, feed the comment content back to the parser to see if it is code-like, and report any comment that passes that test modulo some size-dependent error rate (10 errors in 15 characters implies "really is a comment"). Now, the "code-like" test requires that the re-engineering parser be willing to recognize any substring of the (Java) language. Our DMS Software Reengineering Toolkit underlying the Java Front End can actually do that, using access to the grammar buried in the front end, as it is willing to start a parse for any language (non)terminal, and the question becomes "can you find a sequence of (non)terminals that consumes the string?".
The lexer and parser approaches are small and big sledgehammers respectively. If the OP is going to do this just once, he can stick to the manual regex search. If the problem is to vet the code base repeatedly (as needed in big organizations), he'd want a tool that can be run on a regular basis.
You can do a search in Eclipse.
All you need to search for is /* and //
However, you will only find the files which contain that expression, and not the actual content which I believe you are after.
However, if you are using Linux you can easily get all the comments with a one-liner.
Bit of a random one: I want to have a play with some NLP stuff, and I would like to:
Get all the text that will be displayed to the user in a browser from HTML.
My ideal output would not have any tags in it and would only have full stops (and any other punctuation used) and newline characters, though I can tolerate a fairly reasonable amount of failure in this (random other stuff ending up in the output).
If there were a way of inserting a newline or full stop in situations where the content is likely not to continue on, that would be considered an added bonus, e.g.:
items in an ul or option tag could be separated by full stops (or to be honest just ignored).
I am working in Java, but would be interested in seeing any code that does this.
I can (and will if required) come up with something to do this, just wondered if there was anything out there like this already, as it would probably be better than what I come up with in an afternoon ;-).
An example of the code I might write if I do end up doing this would be to use a SAX parser to find content in p tags, strip it of any span or strong etc. tags, and add a full stop if I hit a div or another p without having had a full stop.
Any pointers or suggestions very welcome.
Hmmm ... almost any HTML parser could be used to create the effect you want -- just run through all of the tags and emit only the text elements, and emit a LF for the closing tag of every block element. As you say, a SAX implementation would be simple and straight-forward.
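A minimal sketch of that approach using only the JDK's built-in (and fairly lenient) HTML parser, so no external library is needed; HTMLCleaner or NekoHTML would slot into the same role. Text nodes are emitted as-is and a newline is appended whenever a block-level element closes:

import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class HtmlToSentences {
    public static void main(String[] args) throws Exception {
        String html = "<html><body><p>First paragraph</p><ul><li>a bullet</li></ul></body></html>";
        StringBuilder out = new StringBuilder();
        Set<HTML.Tag> blocks = new HashSet<>(Arrays.asList(
                HTML.Tag.P, HTML.Tag.DIV, HTML.Tag.LI, HTML.Tag.TR,
                HTML.Tag.H1, HTML.Tag.H2, HTML.Tag.H3));

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                out.append(data).append(' ');       // emit only the text elements
            }
            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                if (blocks.contains(t)) {
                    out.append('\n');               // block closed: likely end of a "sentence"
                }
            }
        };
        new ParserDelegator().parse(new StringReader(html), callback, true);
        System.out.println(out);
    }
}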
I would just strip out everything in <> tags, and if you want a full stop at the end of every sentence, you check for closing tags and place a full stop there.
If you have
<strong> test </strong>
(and other tags that only change the look of the text), you could add conditions not to place a full stop there.
HTML parsers seem to be a reasonable starting point for this.
There are a number of them; for example, HTMLCleaner and NekoHTML seem to work fine.
They are good because they fix up the tags, allowing you to process them more consistently, even if you are just removing them.
But as it turns out, you probably want to get rid of script tags, metadata, etc., and in that case you are better off working with well-formed XML, which these tools produce for you from "wild" HTML.
There are many SO questions relating to this (like this one); you should search for "HTML parsing" though ;-)