Lucene Highlighter TokenStream exception - java

I have a problem with the Lucene Highlighter. I found some code on Stack Overflow and elsewhere, but it does not work in my program. Below is the method where I try to search for and highlight words; when I search for something, the program throws an exception.
Method:
private static void useIndex(String query, String field, String option)
        throws ParseException, CorruptIndexException, IOException, InvalidTokenOffsetsException {
    // StandardAnalyzer analyzer = new StandardAnalyzer();
    Query q = new QueryParser(field, analyzer).parse(query);
    int hitsPerPage = 5;
    IndexReader reader = DirectoryReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage);
    searcher.search(q, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;
    SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();
    Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(q));
    // display results
    System.out.println("Found " + hits.length + " hits for " + query);
    for (int i = 0; i < hits.length; ++i) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        String docURL = d.get("url");
        String docContent = d.get("content");
        TokenStream tokenStream = TokenSources.getAnyTokenStream(reader, docId, "content", analyzer);
        TextFragment[] frag = highlighter.getBestTextFragments(tokenStream, docContent, false, 4);
        String docFrag = "";
        if ((frag[0] != null) && (frag[0].getScore() > 0)) {
            docFrag = frag[0].toString();
        }
        model.addRow(new Object[] { docURL, findSilimar(docId), docFrag });
    }
    reader.close();
}
Exception:
Exception in thread "AWT-EventQueue-0" java.lang.NoClassDefFoundError: org/apache/lucene/index/memory/MemoryIndex
Caused by: java.lang.ClassNotFoundException: org.apache.lucene.index.memory.MemoryIndex
I tried everything, but I don't know what is wrong.
P.S. Sorry for my English.

A NoClassDefFoundError means that the class isn't on your classpath, so you need to figure out which jar to add to get it. MemoryIndex lives in: lucene-memory-x.x.x.jar
By the way, at a glance, it doesn't appear that this exception would be thrown by the code you've provided.
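If you build with Maven, a dependency along these lines should pull the class in (a sketch: the artifact name follows Lucene's standard module layout, and the version placeholder should match your other Lucene jars):

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-memory</artifactId>
    <version>${lucene.version}</version>
</dependency>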

Related

General way to get a value from an org.w3c.dom.Attr in Java 1.8?

I'm looking for a completely general way to get the value from an XML attribute using Java 1.8's org.w3c.dom.Attr.
Here is what I tried:
import org.w3c.dom.Attr;
String getAttributeValue(Attr attr) {
    if (attr.getSpecified()) {
        return attr.getValue();
    } else {
        throw new RuntimeException("I don't know how to get an attribute value "
                + "when it is not specified with the attribute, because my "
                + "implementation does not consider default values in "
                + "the schema.");
    }
}
According to the JavaDoc for org.w3c.dom.Attr, my solution is incomplete because it doesn't cover the case where a default value is given in the schema of the XML document. How do I complete the algorithm so that it covers every situation for which there is a correct solution, and throws an exception when a solution is not possible?
My Attr instance comes from a sequence of calls starting with javax.imageio.ImageIO.getImageReaders(ImageInputStream stream). It goes like this:
void processImage(final ImageInputStream stream) throws IOException {
    Iterator<ImageReader> readers = ImageIO.getImageReaders(stream);
    while (readers.hasNext()) {
        ImageReader reader = readers.next();
        reader.setInput(stream, true);
        IIOMetadata metadata = reader.getImageMetadata(0);
        String[] names = metadata.getMetadataFormatNames();
        int length = names.length;
        for (int i = 0; i < length; ++i) {
            Node node = metadata.getAsTree(names[i]);
            short type = node.getNodeType();
            if (type == Node.ATTRIBUTE_NODE) {
                System.out.print(names[i]);
                System.out.print(" = ");
                String attributeValue = getAttributeValue((Attr) node);
                System.out.println(attributeValue);
            }
        }
    }
}
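For what it's worth, a minimal sketch of the simpler route (this is my reading of the DOM spec, not code from the original post): Attr.getValue() returns the effective value in both cases, whether the attribute was written in the document or defaulted from the DTD/schema, and getSpecified() merely reports which case applies. So the general accessor can be:

import org.w3c.dom.Attr;

String getAttributeValue(Attr attr) {
    // Per the DOM spec, getValue() returns the attribute's value whether it
    // was explicitly specified or supplied as a DTD/schema default (in the
    // latter case getSpecified() is false).
    return attr.getValue();
}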

Unable to identify error in Lucene MoreLikeThis

I need to use Lucene MoreLikeThis to find similar documents given a paragraph of text. I am new to Lucene and followed the code here.
I have already indexed the documents at the directory - "C:\Users\lucene_index_files\v2"
I am using "They are computer engineers and they like to develop their own tools. The program in languages like Java, CPP." as the document to which I want to find similar documents.
public class LuceneSearcher2 {
    public static void main(String[] args) throws IOException {
        LuceneSearcher2 m = new LuceneSearcher2();
        System.out.println("1");
        m.start();
        System.out.println("2");
        //m.writerEntries();
        m.findSilimar("They are computer engineers and they like to develop their own tools. The program in languages like Java, CPP.");
        System.out.println("3");
    }

    private Directory indexDir;
    private StandardAnalyzer analyzer;
    private IndexWriterConfig config;

    public void start() throws IOException {
        //analyzer = new StandardAnalyzer(Version.LUCENE_42);
        //config = new IndexWriterConfig(Version.LUCENE_42, analyzer);
        analyzer = new StandardAnalyzer();
        config = new IndexWriterConfig(analyzer);
        config.setOpenMode(OpenMode.CREATE_OR_APPEND);
        indexDir = new RAMDirectory(); //don't write on disk
        //https://stackoverflow.com/questions/36542551/lucene-in-java-method-not-found?rq=1
        indexDir = FSDirectory.open(FileSystems.getDefault().getPath("C:\\Users\\lucene_index_files\\v2")); //write on disk
        //System.out.println(indexDir);
    }

    private void findSilimar(String searchForSimilar) throws IOException {
        IndexReader reader = DirectoryReader.open(indexDir);
        IndexSearcher indexSearcher = new IndexSearcher(reader);
        System.out.println("2a");
        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setMinTermFreq(0);
        mlt.setMinDocFreq(0);
        mlt.setFieldNames(new String[]{"title", "content"});
        mlt.setAnalyzer(analyzer);
        System.out.println("2b");
        StringReader sReader = new StringReader(searchForSimilar);
        //Query query = mlt.like(sReader, null);
        //Throws error - The method like(String, Reader...) in the type MoreLikeThis is not applicable for the arguments (StringReader, null)
        Query query = mlt.like("computer");
        System.out.println("2c");
        System.out.println(query.toString());
        TopDocs topDocs = indexSearcher.search(query, 10);
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            Document aSimilar = indexSearcher.doc(scoreDoc.doc);
            String similarTitle = aSimilar.get("title");
            String similarContent = aSimilar.get("content");
            System.out.println("====similar finded====");
            System.out.println("title: " + similarTitle);
            System.out.println("content: " + similarContent);
        }
        System.out.println("2d");
    }
}
I am unsure what is causing the program to produce no output.
What is your output? I am assuming you're not finding similar documents. The reason could be that the query you are creating is empty.
First of all, to run your code in a meaningful way, this line
Query query = mlt.like(sReader, null);
needs the field name as its first argument (the compiler error you quoted gives the signature as like(String fieldName, Reader... readers)), so it should look like this:
Query query = mlt.like("content", sReader);
Now, in order to use MoreLikeThis in Lucene, your stored fields have to store term vectors. Set setStoreTermVectors(true) when creating the fields, for instance like this:
FieldType fieldType = new FieldType();
fieldType.setStored(true);
fieldType.setStoreTermVectors(true);
fieldType.setTokenized(true);
Field contentField = new Field("contents", this.getBlurb(), fieldType);
doc.add(contentField);
Leaving this out could result in an empty query and consequently no results for the query.
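Putting the two points together, here is a minimal end-to-end sketch (it assumes an open IndexWriter writer, an IndexReader reader, and an Analyzer analyzer; the field name "content" and the setIndexOptions call are my additions, since term vectors require the field to actually be indexed):

// Index a document whose field stores term vectors.
FieldType ft = new FieldType();
ft.setStored(true);
ft.setTokenized(true);
ft.setStoreTermVectors(true);
ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
Document doc = new Document();
doc.add(new Field("content", "They are computer engineers and they like to develop their own tools.", ft));
writer.addDocument(doc);
writer.commit();

// Build the similarity query against the same field.
MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setAnalyzer(analyzer);
mlt.setMinTermFreq(0);
mlt.setMinDocFreq(0);
Query query = mlt.like("content", new StringReader("computer engineers developing tools"));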

Exception in thread "main" java.lang.NullPointerException - HBase indexing data

I am parsing a PDF and storing the title, author, etc. in variables, and I need to index those values in HBase, so the data for the HBase table comes from the variables I created in the project. The program throws a NullPointerException when I use those variables to index into the HBase table.
Exception in thread "main" java.lang.NullPointerException
at java.lang.String.<init>(String.java:154)
at testSolr.Testt.Parsing(Testt.java:50)
at testSolr.Testt.main(Testt.java:94)
I tried two different approaches and neither of them worked.
String title = new String(metadata.get("title"));
and
String title = metadata.get("title");
Here are the relevant parts of my code:
Random rand = new Random();
int min = 1, max = 5000;
int randomNumber = rand.nextInt((max - min) + 1) + min;

//parsing part
String title = new String(metadata.get("title"));
String nPage = new String(metadata.get("xmpTPg:NPage"));
String author = new String(metadata.get("Author"));
String content = new String(handler.toString());

//hbase part (the part where I am getting the error)
Put p = new Put(Bytes.toBytes(randomNumber));
p.add(Bytes.toBytes("book"), Bytes.toBytes("title"), Bytes.toBytes(title));
p.add(Bytes.toBytes("book"), Bytes.toBytes("author"), Bytes.toBytes(author));
p.add(Bytes.toBytes("book"), Bytes.toBytes("pageNumber"), Bytes.toBytes(nPage));
p.add(Bytes.toBytes("book"), Bytes.toBytes("content"), Bytes.toBytes(content));
hTable.put(p);
Should I set the variables to null at the beginning of parsing? I don't think that makes sense. What should I do to fix the error?
Update:
Full code
public static String location = "/home/alican/Downloads/solr-4.10.2/example/solr/senior/PDFs/solr-word.pdf";

public static void Parsing(String location) throws IOException, SAXException, TikaException, SolrServerException {
    // random number generator for ids
    Random rand = new Random();
    int min = 1, max = 5000;
    int randomNumber = rand.nextInt((max - min) + 1) + min;
    // random number generator for ids ends

    // pdf parser
    BodyContentHandler handler = new BodyContentHandler(-1);
    FileInputStream inputstream = new FileInputStream(location);
    Metadata metadata = new Metadata();
    ParseContext pcontext = new ParseContext();
    PDFParser pdfparser = new PDFParser();
    pdfparser.parse(inputstream, handler, metadata, pcontext);
    String title = new String(metadata.get("title"));
    String nPage = metadata.get("xmpTPg:NPage");
    String author = new String(metadata.get("Author"));
    String content = new String(handler.toString());
    System.out.println("Title: " + metadata.get("title"));
    System.out.println("Number of Page(s): " + metadata.get("xmpTPg:NPages"));
    System.out.println("Author(s): " + metadata.get("Author"));
    System.out.println("Content of the PDF :" + handler.toString());
    // pdf parser ends

    // solr indexing
    SolrClient server = new HttpSolrClient(url);
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", randomNumber);
    doc.addField("author", author);
    doc.addField("title", title);
    doc.addField("pageNumber", nPage);
    doc.addField("content", content);
    server.add(doc);
    System.out.println("solr commiiitt......");
    server.commit();
    // solr indexing ends

    // hbase indexing
    Configuration config = HBaseConfiguration.create();
    HTable hTable = new HTable(config, "books");
    Put p = new Put(Bytes.toBytes(randomNumber));
    p.add(Bytes.toBytes("book"), Bytes.toBytes("title"), Bytes.toBytes(title));
    p.add(Bytes.toBytes("book"), Bytes.toBytes("author"), Bytes.toBytes(author));
    p.add(Bytes.toBytes("book"), Bytes.toBytes("pageNumber"), Bytes.toBytes(nPage));
    p.add(Bytes.toBytes("book"), Bytes.toBytes("content"), Bytes.toBytes(content));
    hTable.put(p);
    System.out.println("hbase commiiitttt..");
    hTable.close();
    // hbase indexing ends
}
Output of title, author, number of pages, and content:
Title: solr-word
Number of Page(s): 1
Author(s): Grant Ingersoll
Content of the PDF :
This is a test of PDF and Word extraction in Solr, it is only a test. Do not panic.
The HBase part behaves as if the nPage variable were null. Actually it is not; the printed value of nPage is 1. (Note, though, that the code reads the key "xmpTPg:NPage", while the println that prints 1 uses "xmpTPg:NPages".)
p.add(Bytes.toBytes("book"), Bytes.toBytes("pageNumber"), Bytes.toBytes(nPage));
Solution:
metadata.get("xmpTPg:NPage") was returning null when assigned to the variable, for some reason. I realized it was because of the parser; I changed my parser and there are no null variables any more.
- Apache PDFBox (my new parser) worked better for me here than Apache Tika (my old parser).
Your metadata.get("title") is returning null; passing that to new String(...) is what throws the NullPointerException. See the Javadoc for more details.
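As a defensive sketch (my addition, not the original code): Bytes.toBytes(String) cannot take null, so guarding each metadata value with a helper like the hypothetical safe() below avoids the crash even when a key is missing:

String safe(String value) {
    // Tika's metadata.get() returns null for absent keys; fall back to an
    // empty string so Bytes.toBytes() never sees null.
    return value != null ? value : "";
}

// Usage:
p.add(Bytes.toBytes("book"), Bytes.toBytes("title"),
        Bytes.toBytes(safe(metadata.get("title"))));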

How can I get the terms of a Lucene document field after they are analyzed?

I'm using Lucene 5.1.0. After analyzing and indexing a document, I would like to get a list of all the indexed terms that belong to this specific document.
{
    File[] files = FILES_TO_INDEX_DIRECTORY.listFiles();
    for (File file : files) {
        Document document = new Document();
        Reader reader = new FileReader(file);
        document.add(new TextField("fieldname", reader));
        iwriter.addDocument(document);
    }
    iwriter.close();

    IndexReader indexReader = DirectoryReader.open(directory);
    int maxDoc = indexReader.maxDoc();
    for (int i = 0; i < maxDoc; i++) {
        Document doc = indexReader.document(i);
        String[] terms = doc.getValues("fieldname");
    }
}
The terms come back null. Is there a way to get the saved terms per document?
Here is sample code for the answer, using a TokenStream:
TokenStream ts = analyzer.tokenStream("myfield", reader);
// The Analyzer class will construct the Tokenizer, TokenFilter(s), and CharFilter(s),
// and pass the resulting Reader to the Tokenizer.
OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttribute = ts.addAttribute(CharTermAttribute.class);
try {
    ts.reset(); // Resets this stream to the beginning. (Required)
    while (ts.incrementToken()) {
        // Use AttributeSource.reflectAsString(boolean)
        // for token stream debugging.
        System.out.println("token: " + ts.reflectAsString(true));
        String term = charTermAttribute.toString();
        System.out.println(term);
    }
    ts.end(); // Perform end-of-stream operations, e.g. set the final offset.
} finally {
    ts.close(); // Release resources associated with this stream.
}
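Alternatively, if the field was indexed with term vectors (plain TextField does not store them; this sketch assumes a custom FieldType with setStoreTermVectors(true) at indexing time), the terms can be read back per document without re-analyzing, along these lines:

Terms terms = indexReader.getTermVector(docId, "fieldname");
if (terms != null) { // null when no term vector was stored for this field
    TermsEnum termsEnum = terms.iterator();
    BytesRef term;
    while ((term = termsEnum.next()) != null) {
        System.out.println(term.utf8ToString());
    }
}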

Open Microsoft Word in Java

I'm trying to open an MS Word 2003 document in Java, search for a specified string, and replace it with a new string. I use Apache POI to do that. My code looks like the following:
public void searchAndReplace(String inputFilename, String outputFilename,
        HashMap<String, String> replacements) {
    File outputFile = null;
    File inputFile = null;
    FileInputStream fileIStream = null;
    FileOutputStream fileOStream = null;
    BufferedInputStream bufIStream = null;
    BufferedOutputStream bufOStream = null;
    POIFSFileSystem fileSystem = null;
    HWPFDocument document = null;
    Range docRange = null;
    Paragraph paragraph = null;
    CharacterRun charRun = null;
    Set<String> keySet = null;
    Iterator<String> keySetIterator = null;
    int numParagraphs = 0;
    int numCharRuns = 0;
    String text = null;
    String key = null;
    String value = null;
    try {
        // Create an instance of the POIFSFileSystem class and
        // attach it to the Word document using an InputStream.
        inputFile = new File(inputFilename);
        fileIStream = new FileInputStream(inputFile);
        bufIStream = new BufferedInputStream(fileIStream);
        fileSystem = new POIFSFileSystem(bufIStream);
        document = new HWPFDocument(fileSystem);
        docRange = document.getRange();
        numParagraphs = docRange.numParagraphs();
        keySet = replacements.keySet();
        for (int i = 0; i < numParagraphs; i++) {
            paragraph = docRange.getParagraph(i);
            text = paragraph.text();
            numCharRuns = paragraph.numCharacterRuns();
            for (int j = 0; j < numCharRuns; j++) {
                charRun = paragraph.getCharacterRun(j);
                text = charRun.text();
                System.out.println("Character Run text: " + text);
                keySetIterator = keySet.iterator();
                while (keySetIterator.hasNext()) {
                    key = keySetIterator.next();
                    if (text.contains(key)) {
                        value = replacements.get(key);
                        charRun.replaceText(key, value);
                        docRange = document.getRange();
                        paragraph = docRange.getParagraph(i);
                        charRun = paragraph.getCharacterRun(j);
                        text = charRun.text();
                    }
                }
            }
        }
        bufIStream.close();
        bufIStream = null;
        outputFile = new File(outputFilename);
        fileOStream = new FileOutputStream(outputFile);
        bufOStream = new BufferedOutputStream(fileOStream);
        document.write(bufOStream);
    } catch (Exception ex) {
        System.out.println("Caught an: " + ex.getClass().getName());
        System.out.println("Message: " + ex.getMessage());
        System.out.println("Stacktrace follows.............");
        ex.printStackTrace(System.out);
    }
}
I call this function with the following arguments:
HashMap<String, String> replacements = new HashMap<String, String>();
replacements.put("AAA", "BBB");
searchAndReplace("C:/Test.doc", "C:/Test1.doc", replacements);
When the Test.doc file contains a simple line like "AAA EEE", it works successfully; but when I use a more complicated file, it reads the content successfully and generates the Test1.doc file, yet when I try to open that file, Word gives me the following error:
Word unable to read this document. It may be corrupt.
Try one or more of the following:
* Open and repair the file.
* Open the file with Text Recovery converter.
(C:\Test1.doc)
Please tell me what to do, because I'm a beginner with POI and I have not found a good tutorial for it.
First of all, you should be closing your document.
Besides that, what I suggest doing is resaving your original Word document as a Word XML document, then changing the extension manually from .xml to .doc. Then look at the XML of the actual document you're working with and trace the content to make sure you're not accidentally editing hexadecimal values (AAA and EEE could be hex values in other fields).
Without seeing the actual Word document it's hard to say what's going on.
There is not much documentation about POI at all, especially for Word documents, unfortunately.
I don't know if it's OK to answer my own question, but just to share the knowledge, I'll answer it myself.
After searching the web, the final solution I found is:
The library called docx4j is very good for dealing with MS docx files. Its documentation is still thin and its forum is still in its early stages, but overall it helped me do what I needed.
Thanks to all who helped me.
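For anyone landing here later, a minimal docx4j sketch of the same search-and-replace (my addition, not the poster's code; it assumes a .docx input whose placeholders use docx4j's ${...} variable syntax, e.g. ${AAA}):

WordprocessingMLPackage pkg = WordprocessingMLPackage.load(new File("C:/Test.docx"));
MainDocumentPart main = pkg.getMainDocumentPart();
VariablePrepare.prepare(pkg); // merge split runs so ${AAA} is not broken across runs
HashMap<String, String> mappings = new HashMap<String, String>();
mappings.put("AAA", "BBB");
main.variableReplace(mappings); // replaces ${AAA} with BBB
pkg.save(new File("C:/Test1.docx"));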
You could try the OpenOffice API, but there aren't many resources out there to tell you how to use it.
You can also try this one: http://www.dancrintea.ro/doc-to-pdf/
Looks like this could be the issue.
