Exception in thread "main" java.lang.NullPointerException - HBase indexing data - java

I am parsing a PDF and storing the title, author, etc. in variables, and I need to index those values in HBase. The data for the HBase table comes from the variables I created in the project. The program throws a NullPointerException when I use the variables to index into the HBase table.
Exception in thread "main" java.lang.NullPointerException
at java.lang.String.<init>(String.java:154)
at testSolr.Testt.Parsing(Testt.java:50)
at testSolr.Testt.main(Testt.java:94)
I tried two different types and none of them worked.
String title = new String(metadata.get("title"));
and
String title = metadata.get("title");
Here are the significant parts of my code:
Random rand = new Random();
int min=1, max=5000;
int randomNumber = rand.nextInt((max - min) + 1) + min;
//parsing part
String title = new String(metadata.get("title"));
String nPage = new String(metadata.get("xmpTPg:NPage"));
String author = new String(metadata.get("Author"));
String content = new String(handler.toString());
//hbase part(the part where I am getting the error.)
Put p = new Put(Bytes.toBytes(randomNumber));
p.add(Bytes.toBytes("book"),
Bytes.toBytes("title"),Bytes.toBytes(title));
p.add(Bytes.toBytes("book"),
Bytes.toBytes("author"),Bytes.toBytes(author));
p.add(Bytes.toBytes("book"),
Bytes.toBytes("pageNumber"),Bytes.toBytes(nPage));
p.add(Bytes.toBytes("book"),
Bytes.toBytes("content"),Bytes.toBytes(content));
hTable.put(p);
Should I make variables null in the beginning of parsing? I think that does not make any sense. What should I do to fix the error?
Update:
Full code
public static String location = "/home/alican/Downloads/solr-4.10.2/example/solr/senior/PDFs/solr-word.pdf";
public static void Parsing(String location) throws IOException, SAXException, TikaException, SolrServerException {
// random number generator for ids
Random rand = new Random();
int min=1, max=5000;
int randomNumber = rand.nextInt((max - min) + 1) + min;
// random number generator for ids ends
// pdf Parser
BodyContentHandler handler = new BodyContentHandler(-1);
FileInputStream inputstream = new FileInputStream(location);
Metadata metadata = new Metadata();
ParseContext pcontext = new ParseContext();
PDFParser pdfparser = new PDFParser();
pdfparser.parse(inputstream, handler, metadata, pcontext);
String title = new String(metadata.get("title"));
String nPage = metadata.get("xmpTPg:NPage");
String author = new String(metadata.get("Author"));
String content = new String(handler.toString());
System.out.println("Title: " + metadata.get("title"));
System.out.println("Number of Page(s): " + metadata.get("xmpTPg:NPages"));
System.out.println("Author(s): " + metadata.get("Author"));
System.out.println("Content of the PDF :" + handler.toString());
// pdf Parser ends
// solr Indexing
SolrClient server = new HttpSolrClient(url);
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", randomNumber);
doc.addField("author", author);
doc.addField("title", title);
doc.addField("pageNumber", nPage);
doc.addField("content", content);
server.add(doc);
System.out.println("solr commiiitt......");
server.commit();
// solr Indexing ends
// hbase Indexing
Configuration config = HBaseConfiguration.create();
HTable hTable = new HTable(config, "books");
Put p = new Put(Bytes.toBytes(randomNumber));
p.add(Bytes.toBytes("book"),
Bytes.toBytes("title"),Bytes.toBytes(title));
p.add(Bytes.toBytes("book"),
Bytes.toBytes("author"),Bytes.toBytes(author));
p.add(Bytes.toBytes("book"),
Bytes.toBytes("pageNumber"),Bytes.toBytes(nPage));
p.add(Bytes.toBytes("book"),
Bytes.toBytes("content"),Bytes.toBytes(content));
hTable.put(p);
System.out.println("hbase commiiitttt..");
hTable.close();
// hbase Indexing ends
}
Output of title, author, number of page and content:
Title: solr-word
Number of Page(s): 1
Author(s): Grant Ingersoll
Content of the PDF :
This is a test of PDF and Word extraction in Solr, it is only a test. Do not panic.
The HBase part treats the nPage variable as null, but it is not: the value of nPage is 1.
p.add(Bytes.toBytes("book"),
Bytes.toBytes("pageNumber"),Bytes.toBytes(nPage));
Solution:
metadata.get("xmpTPg:NPage") returns null when it is assigned to a variable, for some reason. I realized that it is because of the parser. I changed my parser and there are no null variables anymore.
- Apache PDFBox (my new parser) worked better for me than Apache Tika (my old parser).

Your metadata.get("title") is returning null, therefore, a NullPointerException is thrown. See Javadoc for more details.
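A minimal sketch of the failure and a null-safe alternative. A plain Map stands in for Tika's Metadata here (an assumption, so the example runs without Tika on the classpath), and the helper name valueOrDefault is hypothetical. Note also that the question's code reads the key "xmpTPg:NPage" while its print statement reads "xmpTPg:NPages", so a mismatched key may be why that lookup returns null.

```java
import java.util.HashMap;
import java.util.Map;

public class NullSafeMetadata {

    // Hypothetical helper: return the stored value, or a fallback when the key is absent.
    static String valueOrDefault(Map<String, String> metadata, String key, String fallback) {
        String value = metadata.get(key);
        return value != null ? value : fallback;
    }

    public static void main(String[] args) {
        // Stand-in for Tika's Metadata, which also returns null for a missing key.
        Map<String, String> metadata = new HashMap<>();
        metadata.put("title", "solr-word");
        metadata.put("xmpTPg:NPages", "1");

        // The key without the trailing "s" is absent, so get() returns null.
        // Passing that null to new String(...) is what throws in the question.
        String missing = metadata.get("xmpTPg:NPage");
        System.out.println("raw lookup: " + missing);

        String nPage = valueOrDefault(metadata, "xmpTPg:NPage", "");
        String nPages = valueOrDefault(metadata, "xmpTPg:NPages", "");
        System.out.println("with fallback: '" + nPage + "', correct key: '" + nPages + "'");
    }
}
```

Whether an empty string is an acceptable fallback depends on the index; skipping the column when the value is missing is another reasonable choice.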

Related

In iText 7, how to change the display order of attached files to the time they were added

I want to change the order of the attached files in the PDF I create. Attachments are displayed by name by default.
How can I have them displayed by the time they were added?
This is my implementation:
@Override
public boolean attachFile(String src, String dest, List<SysItemfile> attachmentpaths) {
try {
PdfName name = new PdfName(src);
PdfDocument pdfDoc = new PdfDocument(new PdfReader(src), new PdfWriter(dest));
List<String> descs = new ArrayList<String>();
int i = 0;
int j = 1;
for (SysItemfile attachmentpath : attachmentpaths) {
String filename = attachmentpath.getFilename();
//test for the file name
System.out.println("filename:"+filename);
if (descs.contains(attachmentpath.getFilename())) {
//get the file suffix
String suffix = filename.substring(filename.lastIndexOf(".") + 1);
String realname = filename.substring(0,filename.lastIndexOf("."));
filename = realname+i+"."+suffix;
i++;
} else {
descs.add(attachmentpath.getFilename());
}
PdfFileSpec spec = PdfFileSpec.createEmbeddedFileSpec(pdfDoc, attachmentpath.getFileurl(),
filename, filename, name, name);
// the first parameter is the description
pdfDoc.addFileAttachment(filename, spec);
}
pdfDoc.close();
} catch (IOException e) {
logger.error("attachFile unsuccess!");
logger.error(e.getLocalizedMessage());
return false;
}
return true;
}
After that, when I add attachments to my PDF, I can't change the order in which they are displayed.
What should I do?
As long as you only add attachments, the PDF standard does not allow for prescribing the sort order a PDF viewer uses when displaying the attachments.
If, on the other hand, you make the PDF a portable collection (aka a Portfolio), you can prescribe a schema (i.e. the fields in the detail list) and the sort order (by one or a combination of those fields).
You can quite easily make your PDF with attachments a portable collection with the name and modification date sorted by the latter like this:
try ( PdfReader reader = new PdfReader(...);
PdfWriter writer = new PdfWriter(...);
PdfDocument document = new PdfDocument(reader, writer)) {
PdfCollection collection = new PdfCollection();
document.getCatalog().setCollection(collection);
PdfCollectionSchema schema = new PdfCollectionSchema();
PdfCollectionField field = new PdfCollectionField("File Name", PdfCollectionField.FILENAME);
field.setOrder(0);
schema.addField("Name", field);
field = new PdfCollectionField("Modification Date", PdfCollectionField.MODDATE);
field.setOrder(1);
schema.addField("Modified", field);
collection.setSchema(schema);
PdfCollectionSort sort = new PdfCollectionSort("Modified");
collection.setSort(sort);
}
(SortAttachments test testAttachLikeGeologistedWithCollection)
You can even define a custom field of type PdfCollectionField.TEXT, PdfCollectionField.DATE, or PdfCollectionField.NUMBER by which to sort. You can set the value of such a custom field on a PdfFileSpec via its setCollectionItem method.

Unable to identify error in Lucene MoreLikeThis

I need to use Lucene MoreLikeThis to find similar documents given a paragraph of text. I am new to Lucene and followed the code here
I have already indexed the documents at the directory - "C:\Users\lucene_index_files\v2"
I am using "They are computer engineers and they like to develop their own tools. The program in languages like Java, CPP." as the document to which I want to find similar documents.
public class LuceneSearcher2 {
public static void main(String[] args) throws IOException {
LuceneSearcher2 m = new LuceneSearcher2();
System.out.println("1");
m.start();
System.out.println("2");
//m.writerEntries();
m.findSilimar("They are computer engineers and they like to develop their own tools. The program in languages like Java, CPP.");
System.out.println("3");
}
private Directory indexDir;
private StandardAnalyzer analyzer;
private IndexWriterConfig config;
public void start() throws IOException{
//analyzer = new StandardAnalyzer(Version.LUCENE_42);
//config = new IndexWriterConfig(Version.LUCENE_42, analyzer);
analyzer = new StandardAnalyzer();
config = new IndexWriterConfig(analyzer);
config.setOpenMode(OpenMode.CREATE_OR_APPEND);
indexDir = new RAMDirectory(); //don't write on disk
//https://stackoverflow.com/questions/36542551/lucene-in-java-method-not-found?rq=1
indexDir = FSDirectory.open(FileSystems.getDefault().getPath("C:\\Users\\lucene_index_files\\v2")); //write on disk
//System.out.println(indexDir);
}
private void findSilimar(String searchForSimilar) throws IOException {
IndexReader reader = DirectoryReader.open(indexDir);
IndexSearcher indexSearcher = new IndexSearcher(reader);
System.out.println("2a");
MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setMinTermFreq(0);
mlt.setMinDocFreq(0);
mlt.setFieldNames(new String[]{"title", "content"});
mlt.setAnalyzer(analyzer);
System.out.println("2b");
StringReader sReader = new StringReader(searchForSimilar);
//Query query = mlt.like(sReader, null);
//Throws error - The method like(String, Reader...) in the type MoreLikeThis is not applicable for the arguments (StringReader, null)
Query query = mlt.like("computer");
System.out.println("2c");
System.out.println(query.toString());
TopDocs topDocs = indexSearcher.search(query,10);
for ( ScoreDoc scoreDoc : topDocs.scoreDocs ) {
Document aSimilar = indexSearcher.doc( scoreDoc.doc );
String similarTitle = aSimilar.get("title");
String similarContent = aSimilar.get("content");
System.out.println("====similar finded====");
System.out.println("title: "+ similarTitle);
System.out.println("content: "+ similarContent);
}
System.out.println("2d");
}}
I am unsure what is causing the system to not generate any output.
What is your output? I am assuming you're not finding similar documents. The reason could be that the query you are creating is empty.
First of all to run your code in a meaningful way this line
Query query = mlt.like(sReader, null);
needs a String[] of field names as the argument, so it should work like this
Query query = mlt.like(sReader, new String[]{"title", "content"});
Now, in order to use MoreLikeThis in Lucene, your stored fields have to store term vectors, i.e. setStoreTermVectors(true) has to be set on the field type when creating the fields, for instance like this:
FieldType fieldType = new FieldType();
fieldType.setStored(true);
fieldType.setStoreTermVectors(true);
fieldType.setTokenized(true);
Field contentField = new Field("contents", this.getBlurb(), fieldType);
doc.add(contentField);
Leaving this out could result in an empty query string and consequently no results for the query.

How to save new and/or modified metadata with Apache Tika?

I found this code sample. However, it does not save the new metadata. How can I save the new metadata in the same file?
I tried IOUtils copy, but the problem is that "the parser implementation will consume this stream but will not close it" (https://tika.apache.org/1.1/parser.html).
I need sample code that saves the changes.
public void SetMedata(File param_File) throws IOException, SAXException, TikaException {
// parameters of parse() method
Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(param_File);
ParseContext context = new ParseContext();
// Parsing the given file
parser.parse(inputstream, handler, metadata, context);
// list of metadata elements
System.out.println("===Before=== metadata elements and values of the given file :");
String[] metadataNamesb4 = metadata.names();
for (String name : metadataNamesb4) {
System.out.println(name + ": " + metadata.get(name));
}
// setting date meta data
metadata.set(TikaCoreProperties.CREATED, new Date());
// setting multiple values to author property
metadata.set(TikaCoreProperties.TITLE, "ram ,raheem ,robin ");
// printing all the meta data elements with new elements
System.out.println("===After=== List of all the metadata elements after adding new elements ");
String[] metadataNamesafter = metadata.names();
for (String name : metadataNamesafter) {
System.out.println(name + ": " + metadata.get(name));
}
//=======================================
//How to save the metadata? =============
}
Thank you in advance for your answers, examples and help.

What is the best way to generate a unique and short file name in Java

I don't necessarily want to use UUIDs since they are fairly long.
The file just needs to be unique within its directory.
One thought which comes to mind is to use File.createTempFile(String prefix, String suffix), but that seems wrong because the file is not temporary.
The case of two files created in the same millisecond needs to be handled.
Well, you could use the 3-argument version: File.createTempFile(String prefix, String suffix, File directory) which will let you put it where you'd like. Unless you tell it to, Java won't treat it differently than any other file. The only drawback is that the filename is guaranteed to be at least 8 characters long (minimum of 3 characters for the prefix, plus 5 or more characters generated by the function).
If that's too long for you, I suppose you could always just start with the filename "a", and loop through "b", "c", etc until you find one that doesn't already exist.
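A minimal sketch of the three-argument File.createTempFile call described above. The class name, the helper createUnique, and the "img"/".dat" values are placeholders chosen for illustration; the scratch directory is only there so the example does not touch real data.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class UniqueInDirectory {

    // Creates an empty, uniquely named file in the given directory and returns it.
    // File.createTempFile requires the prefix to be at least three characters long.
    static File createUnique(File directory, String prefix, String suffix) throws IOException {
        return File.createTempFile(prefix, suffix, directory);
    }

    public static void main(String[] args) throws IOException {
        // Scratch directory for the demonstration only.
        File dir = Files.createTempDirectory("demo").toFile();
        File first = createUnique(dir, "img", ".dat");
        File second = createUnique(dir, "img", ".dat");
        // The two names differ in the part generated between prefix and suffix.
        System.out.println(first.getName());
        System.out.println(second.getName());
    }
}
```

The file is not deleted automatically; it only goes away if the code deletes it or calls deleteOnExit().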
I'd use Apache Commons Lang library (http://commons.apache.org/lang).
There is a class org.apache.commons.lang.RandomStringUtils that can be used to generate random strings of given length. Very handy not only for filename generation!
Here is the example:
String ext = "dat";
File dir = new File("/home/pregzt");
String name = String.format("%s.%s", RandomStringUtils.randomAlphanumeric(8), ext);
File file = new File(dir, name);
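If adding Commons Lang is not an option, roughly the same idea can be sketched with the standard library alone. This is a stand-in, not the RandomStringUtils implementation; the class name and alphabet are assumptions made for the example.

```java
import java.security.SecureRandom;

public class RandomName {

    private static final String ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
    private static final SecureRandom RANDOM = new SecureRandom();

    // Returns a string of the requested length drawn uniformly from ALPHABET.
    static String randomAlphanumeric(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Eight random characters plus an extension, as in the answer above.
        System.out.println(randomAlphanumeric(8) + ".dat");
    }
}
```

A random name is only very likely to be unique, so the caller still needs an existence check if a collision must never happen.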
I use the timestamp
i.e
new File( simpleDateFormat.format( new Date() ) );
And have the simpleDateFormat initialized to something like this (the literal text is quoted so it is not read as pattern letters):
new SimpleDateFormat("'File-'ddMMyy-HHmmss.SSS'.txt'");
EDIT
What about
new File(String.format("%s.%s", sdf.format( new Date() ),
random.nextInt(9)));
Unless the number of files created in the same second is too high.
If that's the case and the name doesn't matter
new File( "file."+count++ );
:P
This works for me:
String generateUniqueFileName() {
String filename = "";
long millis = System.currentTimeMillis();
String datetime = new Date().toGMTString();
datetime = datetime.replace(" ", "");
datetime = datetime.replace(":", "");
String rndchars = RandomStringUtils.randomAlphanumeric(16);
filename = rndchars + "_" + datetime + "_" + millis;
return filename;
}
// USE:
String newFile;
do{
newFile=generateUniqueFileName() + "." + FileExt;
}
while(new File(basePath+newFile).exists());
Output filenames should look like :
2OoBwH8OwYGKW2QE_4Sep2013061732GMT_1378275452253.Ext
Look at the File javadoc, the method createNewFile will create the file only if it doesn't exist, and will return a boolean to say if the file was created.
You may also use the exists() method:
int i = 0;
String filename = Integer.toString(i);
File f = new File(filename);
while (f.exists()) {
i++;
filename = Integer.toString(i);
f = new File(filename);
}
f.createNewFile();
System.out.println("File in use: " + f);
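The exists() loop above leaves a small window between the check and the creation in which another thread or process can take the same name. A sketch that relies on the return value of createNewFile instead, which the File javadoc describes as an atomic check-and-create; the class and method names are made up for the example.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class NextFreeName {

    // Tries 0, 1, 2, ... and returns the first file that this call itself created.
    // createNewFile() returns false when the file already exists, so a name
    // claimed by someone else is simply skipped.
    static File createNext(File directory) throws IOException {
        for (int i = 0; ; i++) {
            File candidate = new File(directory, Integer.toString(i));
            if (candidate.createNewFile()) {
                return candidate;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Scratch directory for the demonstration only.
        File dir = Files.createTempDirectory("demo").toFile();
        System.out.println("File in use: " + createNext(dir));
        System.out.println("File in use: " + createNext(dir));
    }
}
```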
If you have access to a database, you can create and use a sequence in the file name.
select mySequence.nextval from dual;
It will be guaranteed to be unique and shouldn't get too large (unless you are pumping out a ton of files).
//Generating Unique File Name
public String getFileName() {
String timeStamp = new SimpleDateFormat("yyyy-MM-dd_HH:mm:ss").format(new Date());
return "PNG_" + timeStamp + "_.png";
}
I use current milliseconds with random numbers
i.e
Random random=new Random();
String ext = ".jpeg";
File dir = new File("/home/pregzt");
String name = String.format("%s%s",System.currentTimeMillis(),random.nextInt(100000)+ext);
File file = new File(dir, name);
Combining other answers, why not use the ms timestamp with a random value appended; repeat until no conflict, which in practice will be almost never.
For example: File-ccyymmdd-hhmmss-mmm-rrrrrr.txt
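A sketch of that combination, assuming the name layout from the example above with a six-digit random suffix. The class and method names are hypothetical, and Files.createFile is used for the "repeat until no conflict" step because it fails if the file already exists.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ThreadLocalRandom;

public class TimestampRandomName {

    // Date, time and milliseconds, separated by dashes.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss-SSS");

    // Builds a candidate name: File-<date>-<time>-<millis>-<6 random digits>.txt
    static String candidateName() {
        int random = ThreadLocalRandom.current().nextInt(1_000_000);
        return String.format("File-%s-%06d.txt", LocalDateTime.now().format(STAMP), random);
    }

    // Keeps generating candidates until one can be created in the directory.
    static Path createUnique(Path directory) throws IOException {
        while (true) {
            Path candidate = directory.resolve(candidateName());
            try {
                return Files.createFile(candidate);
            } catch (FileAlreadyExistsException e) {
                // Name already taken: loop and try a fresh candidate.
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("demo");
        System.out.println(createUnique(dir).getFileName());
    }
}
```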
Why not just use something based on a timestamp..?
Problem is synchronization. Separate out regions of conflict.
Name the file as : (server-name)_(thread/process-name)_(millisecond/timestamp).(extension)
example : aws1_t1_1447402821007.png
How about generating the name from a timestamp rounded to the nearest millisecond, or whatever accuracy you need, and then using a lock to synchronize access to the function?
If you store the last generated file name, you can append sequential letters or further digits to it as needed to make it unique.
Or if you'd rather do it without locks, use a timestamp plus a thread ID, and make sure that the function takes longer than a millisecond, or waits so that it does.
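A sketch of the lock-based variant. Rather than appending letters to a repeated stamp, this version remembers the last stamp and steps one past it when the clock has not advanced, which is a simplification of the idea above. The names are hypothetical, and uniqueness only holds within a single JVM.

```java
public class MonotonicStampNames {

    // Last stamp handed out; guarded by the class lock taken by nextName().
    private static long lastStamp = 0L;

    // synchronized acts as the lock: one thread at a time reads and updates lastStamp.
    static synchronized String nextName(String extension) {
        long now = System.currentTimeMillis();
        // If the clock has not moved past the last stamp, use lastStamp + 1 instead.
        lastStamp = Math.max(now, lastStamp + 1);
        return lastStamp + "." + extension;
    }

    public static void main(String[] args) {
        // Two calls in the same millisecond still produce different names.
        System.out.println(nextName("png"));
        System.out.println(nextName("png"));
    }
}
```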
It looks like you've got a handful of solutions for creating a unique filename, so I'll leave that alone. I would test the filename this way:
String filePath = null;
boolean fileNotFound = true;
while (fileNotFound) {
String testPath = generateFilename();
try {
RandomAccessFile f = new RandomAccessFile(
new File(testPath), "r");
} catch (Exception e) {
// exception thrown by RandomAccessFile if
// testPath doesn't exist (ie: it can't be read)
filePath = testPath;
fileNotFound = false;
}
}
//now create your file with filePath
This also works
String logFileName = new SimpleDateFormat("yyyyMMddHHmm'.txt'").format(new Date());
logFileName = "loggerFile_" + logFileName;
I understand that I am late in replying to this question, but I think I should post this as it seems different from the other solutions.
We can concatenate the thread name and the current timestamp to form the file name. There is one issue with this: some thread names contain special characters like "\", which can cause problems when creating the file. So we can remove the special characters from the thread name and then concatenate the thread name and the timestamp:
fileName = threadName (after removing special characters) + currentTimeStamp
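A sketch of that approach. Which characters count as "special" is an assumption here: everything except letters, digits, '-' and '_' is dropped. The class and method names are hypothetical.

```java
public class ThreadStampName {

    // Keeps letters, digits, '-' and '_' and removes every other character.
    static String sanitize(String threadName) {
        return threadName.replaceAll("[^A-Za-z0-9_-]", "");
    }

    // Joins the cleaned thread name and the timestamp into a file name.
    static String fileName(String threadName, long timestampMillis, String extension) {
        return sanitize(threadName) + "_" + timestampMillis + "." + extension;
    }

    public static void main(String[] args) {
        String name = fileName(Thread.currentThread().getName(),
                System.currentTimeMillis(), "log");
        System.out.println(name);
    }
}
```

Two different thread names can collapse to the same cleaned name, so this is only unique as long as the cleaned names stay distinct or the timestamps differ.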
Why not use synchronized to handle multiple threads?
Here is my solution. It generates a short file name, and the name is unique.
private static synchronized String generateFileName(){
String name = make(index);
index ++;
return name;
}
private static String make(int index) {
if(index == 0) return "";
return String.valueOf(chars[index % chars.length]) + make(index / chars.length);
}
private static int index = 1;
private static char[] chars = {'a','b','c','d','e','f','g',
'h','i','j','k','l','m','n',
'o','p','q','r','s','t',
'u','v','w','x','y','z'};
Below is a main function for testing. It works.
public static void main(String[] args) {
// synchronized list (java.util.Collections) because many threads append at once
List<String> names = Collections.synchronizedList(new ArrayList<>());
List<Thread> threads = new ArrayList<>();
for (int i = 0; i < 100; i++) {
Thread thread = new Thread(new Runnable() {
@Override
public void run() {
for (int i = 0; i < 1000; i++) {
String name = generateFileName();
names.add(name);
}
}
});
// start() runs the task on a new thread; run() would execute it on the calling thread
thread.start();
threads.add(thread);
}
// wait for every thread, not only the first ten
for (Thread t : threads) {
try {
t.join();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
System.out.println(names);
System.out.println(names.size());
}

Open Microsoft Word in Java

I'm trying to open MS Word 2003 document in java, search for a specified String and replace it with a new String. I use APACHE POI to do that. My code is like the following one:
public void searchAndReplace(String inputFilename, String outputFilename,
HashMap<String, String> replacements) {
File outputFile = null;
File inputFile = null;
FileInputStream fileIStream = null;
FileOutputStream fileOStream = null;
BufferedInputStream bufIStream = null;
BufferedOutputStream bufOStream = null;
POIFSFileSystem fileSystem = null;
HWPFDocument document = null;
Range docRange = null;
Paragraph paragraph = null;
CharacterRun charRun = null;
Set<String> keySet = null;
Iterator<String> keySetIterator = null;
int numParagraphs = 0;
int numCharRuns = 0;
String text = null;
String key = null;
String value = null;
try {
// Create an instance of the POIFSFileSystem class and
// attach it to the Word document using an InputStream.
inputFile = new File(inputFilename);
fileIStream = new FileInputStream(inputFile);
bufIStream = new BufferedInputStream(fileIStream);
fileSystem = new POIFSFileSystem(bufIStream);
document = new HWPFDocument(fileSystem);
docRange = document.getRange();
numParagraphs = docRange.numParagraphs();
keySet = replacements.keySet();
for (int i = 0; i < numParagraphs; i++) {
paragraph = docRange.getParagraph(i);
text = paragraph.text();
numCharRuns = paragraph.numCharacterRuns();
for (int j = 0; j < numCharRuns; j++) {
charRun = paragraph.getCharacterRun(j);
text = charRun.text();
System.out.println("Character Run text: " + text);
keySetIterator = keySet.iterator();
while (keySetIterator.hasNext()) {
key = keySetIterator.next();
if (text.contains(key)) {
value = replacements.get(key);
charRun.replaceText(key, value);
docRange = document.getRange();
paragraph = docRange.getParagraph(i);
charRun = paragraph.getCharacterRun(j);
text = charRun.text();
}
}
}
}
bufIStream.close();
bufIStream = null;
outputFile = new File(outputFilename);
fileOStream = new FileOutputStream(outputFile);
bufOStream = new BufferedOutputStream(fileOStream);
document.write(bufOStream);
} catch (Exception ex) {
System.out.println("Caught an: " + ex.getClass().getName());
System.out.println("Message: " + ex.getMessage());
System.out.println("Stacktrace follows.............");
ex.printStackTrace(System.out);
}
}
I call this function with following arguments:
HashMap<String, String> replacements = new HashMap<String, String>();
replacements.put("AAA", "BBB");
searchAndReplace("C:/Test.doc", "C:/Test1.doc", replacements);
When the Test.doc file contains a simple line like "AAA EEE", it works. With a more complicated file, the content is read successfully and Test1.doc is generated, but when I try to open it, Word gives me the following error:
Word unable to read this document. It may be corrupt.
Try one or more of the following:
* Open and repair the file.
* Open the file with Text Recovery converter.
(C:\Test1.doc)
Please tell me what to do, because I'm a beginner in POI and I have not found a good tutorial for it.
First of all you should be closing your document.
Besides that, what I suggest doing is resaving your original Word document as a Word XML document, then changing the extension manually from .XML to .doc . Then look at the XML of the actual document you're working with and trace the content to make sure you're not accidentally editing hexadecimal values (AAA and EEE could be hex values in other fields).
Without seeing the actual Word document it's hard to say what's going on.
Unfortunately, there is not much documentation about POI, especially for Word documents.
I don't know whether it is OK to answer my own question, but I'll do so to share the knowledge.
After searching the web, the final solution I found is this:
The library docx4j is very good for dealing with MS docx files. Its documentation is still thin and its forum is just getting started, but overall it helped me do what I needed.
Thanks to all who helped me.
You could try the OpenOffice API, but there aren't many resources out there that tell you how to use it.
You can also try this one: http://www.dancrintea.ro/doc-to-pdf/
Looks like this could be the issue.
