I use the XOM library to parse and process .docx documents. MS Word stores text content in runs (<w:r>) inside paragraph tags (<w:p>), and often breaks the text into several runs; sometimes every word and every space between them ends up in a separate run. When I load a run containing only a space, the parser drops that space and treats the element as empty, so the output contains the text without spaces. How can I force the parser to keep all the spaces? I would prefer to keep this parser, but if there is no solution, could you recommend an alternative?
This is how I call the parser:
StreamingPathFilter filter = new StreamingPathFilter("/w:document/w:body/*:*", prefixes);
Builder builder = new Builder(filter.createNodeFactory(null, contentTransform));
builder.build(documentFile);
...
StreamingTransform contentTransform = new StreamingTransform() {
    @Override
    public Nodes transform(nu.xom.Element node) {
        <...process XML and output text...>
    }
};
In the meantime, I found the solution to this issue, thanks to a hint from Elliotte Rusty Harold on the XOM mailing list.
First, the StreamingPathFilter is in fact not part of the nu.xom package; it belongs to nux.xom.
Second, the issue was caused by the StreamingPathFilter. When I changed the code to use the default Builder constructor, the missing spaces appeared in the output.
Just for documentation, the new code looks like the following:
Builder builder = new Builder();
nu.xom.Document doc = builder.build(documentFile);
context = XPathContext.makeNamespaceContext(doc.getRootElement());
Nodes nodes = doc.getRootElement().query("w:body/*", context);
for (int i = 0; i < nodes.size(); i++) {
    transform((nu.xom.Element) nodes.get(i));
}
...
private void transform(nu.xom.Element node) {
    // process nodes
    ...
}
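Just to illustrate where the recovered spaces end up, a transform body along these lines (purely a hypothetical sketch, not my actual processing) prints each paragraph's text, spaces included:
// Hypothetical illustration: collect the text of every w:t element under the node.
// getValue() returns the text nodes verbatim, including runs that contain only a space.
private void transform(nu.xom.Element node) {
    Nodes texts = node.query(".//w:t", context);
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < texts.size(); i++) {
        sb.append(texts.get(i).getValue());
    }
    System.out.println(sb.toString());
}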
I am using the XStream library (1.4.10) and the Dom4jDriver to generate XML content from a Java object. The problem is that it appends a new line at the beginning of the content. Is there any way to turn this off?
Dom4JDriver dom4JDriver = new Dom4JDriver();
dom4JDriver.getOutputFormat().setSuppressDeclaration(true);
XStream xStream = new XStream(dom4JDriver);
xStream.processAnnotations(MyClass.class);
String myContent = xStream.toXML(myClassInstance); //extra '\n' appended at the start of the string
MyClass.class:
@XStreamAlias("myClass")
public class MyClass {
    private String something;
    private String somethingElse;
    ...........
Generated xml:
\n<myClass>\n <something>blabla</something>\n......
I know that I can just use myContent.substring(...) to get rid of the first character, but that doesn't seem very clean to me. I am also doing this for a lot of operations, so I would rather not have that line in the first place, for performance's sake. Any advice? Thank you :)
Have you tried DomDriver in place of Dom4JDriver?
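For reference, the swap would look roughly like this (reusing the setup from the question; as far as I remember, toXML() with DomDriver does not emit an XML declaration, so the setSuppressDeclaration call should not be needed, but verify on your version):
import com.thoughtworks.xstream.XStream;
import com.thoughtworks.xstream.io.xml.DomDriver;

// Same setup as in the question, but with DomDriver instead of Dom4JDriver.
XStream xStream = new XStream(new DomDriver());
xStream.processAnnotations(MyClass.class);
String myContent = xStream.toXML(myClassInstance); // check whether the leading '\n' is gone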
I am currently in the process of upgrading a search engine application from Lucene 3.5.0 to version 4.10.3. There have been some substantial API changes in version 4 that break backward compatibility. I have managed to fix most of them, but a few issues remain that I could use some help with:
"cannot override final method from Analyzer"
The original code extended the Analyzer class and overrode tokenStream(...).
@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
    CharStream charStream = CharReader.get(reader);
    return new LowerCaseFilter(version,
        new SeparationFilter(version,
            new WhitespaceTokenizer(version,
                new HTMLStripFilter(charStream))));
}
But this method is final now and I am not sure how to understand the following note from the change log:
ReusableAnalyzerBase has been renamed to Analyzer. All Analyzer implementations must now use Analyzer.TokenStreamComponents, rather than overriding .tokenStream() and .reusableTokenStream() (which are now final).
There is another problem in the method quoted above:
"The method get(Reader) is undefined for the type CharReader"
There seem to have been some considerable changes here, too.
"TermPositionVector cannot be resolved to a type"
This class is gone now in Lucene 4. Are there any simple fixes for this? From the change log:
The term vectors APIs (TermFreqVector, TermPositionVector, TermVectorMapper) have been removed in favor of the above flexible indexing APIs, presenting a single-document inverted index of the document from the term vectors.
Probably related to this:
"The method getTermFreqVector(int, String) is undefined for the type IndexReader."
Both problems occur here, for instance:
TermPositionVector termVector = (TermPositionVector) reader.getTermFreqVector(...);
("reader" is of Type IndexReader)
I would appreciate any help with these issues.
I found core developer Uwe Schindler's response to your question on the Lucene mailing list. It took me some time to wrap my head around the new API, so I need to write down something before I forget.
These notes apply to Lucene 4.10.3.
Implementing an Analyzer (1-2)
new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new WhitespaceTokenizer(new HTMLStripCharFilter(reader));
        TokenStream sink = new LowerCaseFilter(source);
        return new TokenStreamComponents(source, sink);
    }
};
The constructor of TokenStreamComponents takes a source and a sink. The sink is the end result of your token stream, returned by Analyzer.tokenStream(), so set it to your filter chain. The source is the token stream before you apply any filters.
HTMLStripCharFilter, despite its name, is actually a subclass of java.io.Reader which removes HTML constructs, so you no longer need CharReader.
Term vector replacements (3-4)
Term vectors work differently in Lucene 4, so there are no straightforward method swaps. The specific answer depends on what your requirements are.
If you want positional information, you have to index your fields with positional information in the first place:
Document doc = new Document();
FieldType f = new FieldType();
f.setIndexed(true);
f.setStoreTermVectors(true);
f.setStoreTermVectorPositions(true);
doc.add(new Field("text", "hello", f));
Finally, in order to get at the frequency and positional info of a field of a document, you drill down the new API like this (adapted from this answer):
// IndexReader ir;
// int docID = 0;
Terms terms = ir.getTermVector(docID, "text");
terms.hasPositions(); // should be true if you set the field to store positions
TermsEnum termsEnum = terms.iterator(null);
BytesRef term = null;
// Explore the terms for this field
while ((term = termsEnum.next()) != null) {
    // Enumerate through documents, in this case only one
    DocsAndPositionsEnum docsEnum = termsEnum.docsAndPositions(null, null);
    int docIdEnum;
    while ((docIdEnum = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
        for (int i = 0; i < docsEnum.freq(); i++) {
            System.out.println(term.utf8ToString() + " " + docIdEnum + " "
                    + docsEnum.nextPosition());
        }
    }
}
It'd be nice if Terms.iterator() returned an actual Iterable.
Hi guys,
I have a piece of code that searches for a set of similar .ser files and loads them into a list.
The files are rulesIncr1.ser, rulesIncr2.ser, rulesIncr3.ser, and so on.
To load all the files I have written the following logic:
String defaultfilename = "rulesincr";
int i = 1;
String incrFile;
// THE FOLLOWING CODE WILL CHECK FOR ANY NUMBER OF INCR RULES FILES IN THE LOCATION AND ADD THEM TO A RULE MODEL LIST
do {
    String tempincr = new Integer(i).toString();
    incrFile = defaultfilename.concat(tempincr).concat(".ser");
    FileInputStream fis = new FileInputStream(filePath.concat(incrFile));
    ObjectInputStream inStreamIncr = new ObjectInputStream(fis);
    myRulesIncr = (List<RuleModel>) inStreamIncr.readObject();
    i++;
} while (new File(filePath.concat(incrFile)).isFile());
Now the problem I'm facing is that myRulesIncr is overwritten on every iteration, so only the last file's contents remain at the end. I need the contents of all the loaded files. Please advise.
Thanks
The line
myRulesIncr = (List<RuleModel>)inStreamIncr.readObject();
in your loop always overwrites the list that the myRulesIncr variable points to. If you want to add all those RuleModel instances to myRulesIncr, you need something like:
List<RuleModel> myRulesIncr = new ArrayList<RuleModel>();
do {
    // your existing loop body, but replace the
    // myRulesIncr = (List<RuleModel>) inStreamIncr.readObject(); line with:
    myRulesIncr.addAll((List<RuleModel>) inStreamIncr.readObject());
} while (/* same condition as before */);
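Spelled out in full, the corrected loop could look like this (just a sketch: it checks for the next file before opening it and closes the streams; wrap it in a method that declares IOException and ClassNotFoundException):
List<RuleModel> myRulesIncr = new ArrayList<RuleModel>();
int i = 1;
File next = new File(filePath.concat("rulesincr" + i + ".ser"));
while (next.isFile()) {
    ObjectInputStream inStreamIncr = new ObjectInputStream(new FileInputStream(next));
    try {
        // each file holds a List<RuleModel>; accumulate instead of overwriting
        myRulesIncr.addAll((List<RuleModel>) inStreamIncr.readObject());
    } finally {
        inStreamIncr.close();
    }
    i++;
    next = new File(filePath.concat("rulesincr" + i + ".ser"));
}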
I'm unfamiliar with the standard List object, but the problem appears to be here:
myRulesIncr = (List<RuleModel>)inStreamIncr.readObject();
You appear to be assigning a new list every time, and even if not, I believe you need to move on to the next node, i.e. myRulesIncr = myRulesIncr.next().
My data is already tokenized with an external resource and I'd like to use that data within Lucene. My first idea would be to join those strings with a \x01 and use a WhitespaceTokenizer to split them again. Is there a better idea? (The input is in XML.)
As a bonus, the annotated data also contains synonyms; how would I inject them (they are represented as XML tags)?
Lucene allows you to provide your own stream of tokens to the field, bypassing the tokenization step. To do that you can create your own subclass of TokenStream implementing incrementToken() and then call field.setTokenStream(new MyTokenStream(yourTokens)):
public class MyTokenStream extends TokenStream {
    CharTermAttribute charTermAtt;
    OffsetAttribute offsetAtt;
    final Iterator<MyToken> listOfTokens;

    MyTokenStream(Iterator<MyToken> tokenList) {
        listOfTokens = tokenList;
        charTermAtt = addAttribute(CharTermAttribute.class);
        offsetAtt = addAttribute(OffsetAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (listOfTokens.hasNext()) {
            super.clearAttributes();
            MyToken myToken = listOfTokens.next();
            charTermAtt.setLength(0);
            charTermAtt.append(myToken.getText());
            offsetAtt.setOffset(myToken.begin(), myToken.end());
            return true;
        }
        return false;
    }
}
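A rough usage sketch, assuming Lucene 4.x APIs (the field and writer setup are illustrative only):
// Hypothetical wiring: hand the pre-built token stream to an indexed, tokenized field.
Document doc = new Document();
Field field = new TextField("text", "", Field.Store.NO);
field.setTokenStream(new MyTokenStream(myTokens.iterator())); // myTokens: your pre-tokenized data
doc.add(field);
indexWriter.addDocument(doc);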
WhitespaceTokenizer is unfit for strings joined with 0x01. Instead, derive from CharTokenizer, overriding isTokenChar.
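A minimal sketch of such a tokenizer, assuming a recent Lucene 4.x (older versions also take a Version argument; the class name is made up):
import java.io.Reader;
import org.apache.lucene.analysis.util.CharTokenizer;

// Splits only on the 0x01 separator used to join the pre-tokenized strings,
// so spaces inside tokens survive.
public class SeparatorTokenizer extends CharTokenizer {
    public SeparatorTokenizer(Reader input) {
        super(input);
    }

    @Override
    protected boolean isTokenChar(int c) {
        return c != 0x01; // everything except the separator belongs to a token
    }
}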
The main problem with this approach is that joining and then splitting again might be expensive; if it turns out to be too expensive, you can implement a trivial TokenStream that just emits the tokens from its input.
If by synonyms you mean that a term like "programmer" is expanded to a set of terms, say, {"programmer", "developer", "hacker"}, then I recommend emitting these at the same position. You can use a PositionIncrementAttribute to control this.
For an example of PositionIncrementAttribute usage, see my lemmatizing TokenStream which emits both word forms found in full text and their lemmas at the same position.
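In case it helps, here is a bare-bones illustration of the same-position idea (the class and the synonym map are made up; Lucene's own SynonymFilter does this properly):
import java.io.IOException;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class SimpleSynonymFilter extends TokenFilter {
    private final Map<String, List<String>> synonyms; // e.g. "programmer" -> ["developer", "hacker"]
    private final LinkedList<String> pending = new LinkedList<String>();
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);

    public SimpleSynonymFilter(TokenStream input, Map<String, List<String>> synonyms) {
        super(input);
        this.synonyms = synonyms;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!pending.isEmpty()) {
            // Emit a queued synonym at the same position as the original term.
            termAtt.setEmpty().append(pending.poll());
            posIncrAtt.setPositionIncrement(0);
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        List<String> syns = synonyms.get(termAtt.toString());
        if (syns != null) {
            pending.addAll(syns);
        }
        return true;
    }
}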
I am working on a project that involves NLP (natural language processing). I am using the Stanford parser.
I create a thread pool that takes sentences and runs the parser on them.
With one thread everything works fine, but with more threads I get errors.
The "test" method finds words that have some connection to each other.
If I make it synchronized, it should effectively run like a single thread, but I still get errors.
The errors occur in this code:
public synchronized String test(String s, LexicalizedParser lp) {
    if (s.isEmpty()) return "";
    if (s.length() > 80) return "";
    System.out.println(s);
    String[] sent = s.split(" ");
    Tree parse = (Tree) lp.apply(Arrays.asList(sent));
    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    Collection tdl = gs.typedDependenciesCollapsed();
    List list = new ArrayList(tdl);
    //for (int i = 0; i < list.size(); i++)
    //    System.out.println(list.get(1).toString());
    // remove scopes and numbers like sbj(screen-4,good-6) -> screen good
    Pattern p = Pattern.compile(".*\\((.*?)\\-\\d+,(.*?)\\-\\d+\\).*");
    if (list.size() > 2) {
        // Split input with the pattern
        Matcher m = p.matcher(list.get(1).toString());
        // check if the result has more than one group
        if (m.find() && m.groupCount() > 1) {
            if (m.groupCount() > 1) {
                System.out.println(list);
                return m.group(1) + m.group(2);
            }
        }
    }
    return "";
}
The errors that I have are:
at blogsOpinions.ParserText.(ParserText.java:47)
at blogsOpinions.ThreadPoolTest$1.run(ThreadPoolTest.java:50)
at blogsOpinions.ThreadPool$PooledThread.run(ThreadPoolTest.java:196)
Recovering using fall through strategy: will construct an (X ...) tree.
Exception in thread "PooledThread-21" java.lang.ClassCastException: java.lang.String cannot be cast to edu.stanford.nlp.ling.HasWord
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.apply(LexicalizedParser.java:289)
at blogsOpinions.ParserText.test(ParserText.java:174)
at blogsOpinions.ParserText.insertDb(ParserText.java:76)
at blogsOpinions.ParserText.(ParserText.java:47)
at blogsOpinions.ThreadPoolTest$1.run(ThreadPoolTest.java:50)
at blogsOpinions.ThreadPool$PooledThread.run(ThreadPoolTest.java:196)
Also, how can I get the description of the subject? For a sentence like "the screen is very good", I want to get "screen good" from the returned list, rather than relying on list.get(1).
You can't call LexicalizedParser.parse on a List of Strings; it expects a list of HasWord objects. It's much easier to call the apply method on your input string. This will also run a proper tokenizer on your input (instead of your simple split on spaces).
To get relations such as subjectness out of the returned Tree, call its dependencies member.
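A quick sketch of what that looks like (method names as described above; exact signatures vary between parser releases, so treat this as an approximation):
// Let the parser tokenize the raw sentence itself instead of splitting on spaces.
Tree parse = (Tree) lp.apply(s);
// Untyped dependencies, including the subject relations you are after.
System.out.println(parse.dependencies());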
Hm, I witnessed the same stack trace. It turned out I was loading two instances of the LexicalizedParser in the same JVM, and that seemed to be the problem. When I made sure only one instance was created, I was able to call lp.apply(Arrays.asList(sent)) just fine.
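For what it's worth, the simplest way I know to guarantee a single instance is a shared holder along these lines (loadModel is the factory method in more recent parser releases; older ones used a constructor taking the model path):
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

// One parser for the whole JVM, handed to every pooled thread.
public final class ParserHolder {
    public static final LexicalizedParser PARSER =
            LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");

    private ParserHolder() {}
}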