I found this code sample. However, it does not save the new metadata. How can I save the new metadata to the same file?
I tried IOUtils.copy, but the problem is that "the parser implementation will consume this stream but will not close it" (https://tika.apache.org/1.1/parser.html).
I need sample code to save the changes.
public void setMetadata(File param_File) throws IOException, SAXException, TikaException {
    // parameters of the parse() method
    Parser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    FileInputStream inputstream = new FileInputStream(param_File);
    ParseContext context = new ParseContext();

    // parsing the given file
    parser.parse(inputstream, handler, metadata, context);
    // Tika does not close the stream, so the caller has to
    inputstream.close();

    // list of metadata elements
    System.out.println("===Before=== metadata elements and values of the given file:");
    String[] metadataNamesb4 = metadata.names();
    for (String name : metadataNamesb4) {
        System.out.println(name + ": " + metadata.get(name));
    }

    // setting the created-date metadata
    metadata.set(TikaCoreProperties.CREATED, new Date());
    // setting the title property (several names in one string)
    metadata.set(TikaCoreProperties.TITLE, "ram ,raheem ,robin ");

    // printing all the metadata elements with the new values
    System.out.println("===After=== List of all the metadata elements after adding new elements ");
    String[] metadataNamesafter = metadata.names();
    for (String name : metadataNamesafter) {
        System.out.println(name + ": " + metadata.get(name));
    }

    // =======================================
    // How to save the metadata? =============
}
Thank you in advance for your answers, examples and help.
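Tika's parsers only read metadata: the Metadata object is an in-memory copy, and the parse() call never writes anything back, which is why copying the consumed stream with IOUtils does not help. Saving requires a format-specific writer library. Below is a minimal sketch for the PDF case, assuming Apache PDFBox 2.x on the classpath (the method name saveMetadata is made up here, the title and date values are carried over from the example above, and exception handling is reduced to the throws clause):
import java.io.File;
import java.io.IOException;
import java.util.Calendar;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;

public void saveMetadata(File param_File) throws IOException {
    try (PDDocument document = PDDocument.load(param_File)) {
        // the document information dictionary holds title, author, dates, ...
        PDDocumentInformation info = document.getDocumentInformation();
        info.setTitle("ram ,raheem ,robin ");
        info.setCreationDate(Calendar.getInstance());
        // write the modified document back to the same file
        document.save(param_File);
    }
}
For other formats (MP3 tags, Office properties, ...) a corresponding tag-writing library is needed; Tika itself stays read-only.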
Related
I am parsing a PDF and storing the title, author, etc. in variables, and I need to index the values in HBase. So I am filling the HBase table from the variables that I created in the project. The program throws a NullPointerException when I use the variables for indexing in the HBase table.
Exception in thread "main" java.lang.NullPointerException
at java.lang.String.<init>(String.java:154)
at testSolr.Testt.Parsing(Testt.java:50)
at testSolr.Testt.main(Testt.java:94)
I tried two different approaches and neither of them worked.
String title = new String(metadata.get("title"));
and
String title = metadata.get("title");
Here are the relevant parts of my code:
Random rand = new Random();
int min=1, max=5000;
int randomNumber = rand.nextInt((max - min) + 1) + min;
//parsing part
String title = new String(metadata.get("title"));
String nPage = new String(metadata.get("xmpTPg:NPage"));
String author = new String(metadata.get("Author"));
String content = new String(handler.toString());
//hbase part (the part where I get the error)
Put p = new Put(Bytes.toBytes(randomNumber));
p.add(Bytes.toBytes("book"), Bytes.toBytes("title"), Bytes.toBytes(title));
p.add(Bytes.toBytes("book"), Bytes.toBytes("author"), Bytes.toBytes(author));
p.add(Bytes.toBytes("book"), Bytes.toBytes("pageNumber"), Bytes.toBytes(nPage));
p.add(Bytes.toBytes("book"), Bytes.toBytes("content"), Bytes.toBytes(content));
hTable.put(p);
Should I initialize the variables to null at the beginning of parsing? I don't think that makes any sense. What should I do to fix the error?
Update:
Full code
public static String location = "/home/alican/Downloads/solr-4.10.2/example/solr/senior/PDFs/solr-word.pdf";

public static void Parsing(String location) throws IOException, SAXException, TikaException, SolrServerException {
    // random number generator for ids
    Random rand = new Random();
    int min = 1, max = 5000;
    int randomNumber = rand.nextInt((max - min) + 1) + min;
    // random number generator for ids ends

    // pdf Parser
    BodyContentHandler handler = new BodyContentHandler(-1);
    FileInputStream inputstream = new FileInputStream(location);
    Metadata metadata = new Metadata();
    ParseContext pcontext = new ParseContext();
    PDFParser pdfparser = new PDFParser();
    pdfparser.parse(inputstream, handler, metadata, pcontext);
    String title = new String(metadata.get("title"));
    String nPage = metadata.get("xmpTPg:NPage");
    String author = new String(metadata.get("Author"));
    String content = new String(handler.toString());
    System.out.println("Title: " + metadata.get("title"));
    System.out.println("Number of Page(s): " + metadata.get("xmpTPg:NPages"));
    System.out.println("Author(s): " + metadata.get("Author"));
    System.out.println("Content of the PDF :" + handler.toString());
    // pdf Parser ends

    // solr Indexing
    SolrClient server = new HttpSolrClient(url);
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", randomNumber);
    doc.addField("author", author);
    doc.addField("title", title);
    doc.addField("pageNumber", nPage);
    doc.addField("content", content);
    server.add(doc);
    System.out.println("solr commiiitt......");
    server.commit();
    // solr Indexing ends

    // hbase Indexing
    Configuration config = HBaseConfiguration.create();
    HTable hTable = new HTable(config, "books");
    Put p = new Put(Bytes.toBytes(randomNumber));
    p.add(Bytes.toBytes("book"), Bytes.toBytes("title"), Bytes.toBytes(title));
    p.add(Bytes.toBytes("book"), Bytes.toBytes("author"), Bytes.toBytes(author));
    p.add(Bytes.toBytes("book"), Bytes.toBytes("pageNumber"), Bytes.toBytes(nPage));
    p.add(Bytes.toBytes("book"), Bytes.toBytes("content"), Bytes.toBytes(content));
    hTable.put(p);
    System.out.println("hbase commiiitttt..");
    hTable.close();
    // hbase Indexing ends
}
Output of title, author, number of pages, and content:
Title: solr-word
Number of Page(s): 1
Author(s): Grant Ingersoll
Content of the PDF :
This is a test of PDF and Word extraction in Solr, it is only a test. Do not panic.
The HBase part behaves as if the nPage variable were null. Actually it should not be: the page count printed above is 1.
p.add(Bytes.toBytes("book"), Bytes.toBytes("pageNumber"), Bytes.toBytes(nPage));
Solution:
metadata.get("xmpTPg:NPage") was returning null when assigned to a variable (note that the assignment reads the key "xmpTPg:NPage", while the println that prints 1 reads "xmpTPg:NPages"). I concluded it was because of the parser; I changed my parser and there are no null variables anymore.
- Apache PDFBox (my new parser) works better for me than Apache Tika (my old parser).
Your metadata.get("title") is returning null, so a NullPointerException is thrown when the result is passed to new String(...). See the Javadoc for more details.
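As a minimal defensive sketch (the fallback values are assumptions; pick defaults that fit your schema), check every value read from Metadata before wrapping or serializing it:
// Metadata.get() returns null for keys the parser did not fill in
String title = metadata.get("title");
String author = metadata.get("Author");
String nPage = metadata.get("xmpTPg:NPages"); // note the trailing "s", as in the working println

// new String((String) null) and Bytes.toBytes((String) null) both throw
// NullPointerException, so substitute defaults before indexing
if (title == null) title = "";
if (author == null) author = "";
if (nPage == null) nPage = "0";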
I extracted metadata through the Apache Tika library, made some changes, and now I want to write the changed information back to the file.
The code snippet for extraction is here:
InputStream input = new FileInputStream(new File(...));
ContentHandler handler = new DefaultHandler();
Metadata metadata = new Metadata();
Parser parser = new Mp3Parser();
ParseContext parseCtx = new ParseContext();
parser.parse(input, handler, metadata, parseCtx);
input.close();
String[] tags = metadata.names();
Then I make some further changes, like:
for (String tagName : tags) {
    metadata.remove(tagName);
}
And finally I want to write out the modified version. How can I do this?
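Tika cannot do it: Mp3Parser only extracts tags, and changes to the Metadata object stay in memory. A tag-writing library is needed instead. Here is a minimal sketch assuming the mp3agic library (com.mpatric:mp3agic), with placeholder file names and checked-exception handling omitted:
import com.mpatric.mp3agic.ID3v2;
import com.mpatric.mp3agic.Mp3File;

Mp3File mp3 = new Mp3File("input.mp3");
if (mp3.hasId3v2Tag()) {
    ID3v2 tag = mp3.getId3v2Tag();
    // apply your changes here instead of metadata.set(...) / metadata.remove(...)
    tag.setTitle("new title");
    tag.setArtist("new artist");
}
// mp3agic writes to a new file rather than modifying the original in place
mp3.save("output.mp3");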
I am trying to extract text from a large PDF, but I only get the first pages. I need all of the text to be passed to a string variable.
This is the code:
public class ParsePDF {
    public static void main(String args[]) throws Exception {
        try {
            File file = new File("C:/vlarge.pdf");
            String content = new Tika().parseToString(file);
            System.out.println("The Content: " + content);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
From the Javadocs:
To avoid unpredictable excess memory use, the returned string contains
only up to getMaxStringLength() first characters extracted from the
input document. Use the setMaxStringLength(int) method to adjust this
limitation.
Calling setMaxStringLength(-1) will disable this limit.
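For example, a minimal sketch using the path from the question (parseToString() still throws IOException and TikaException, to be handled by the caller):
import java.io.File;
import org.apache.tika.Tika;

Tika tika = new Tika();
tika.setMaxStringLength(-1); // -1 disables the default 100,000-character cap
String content = tika.parseToString(new File("C:/vlarge.pdf"));
System.out.println("The Content: " + content);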
Try the Apache Tika API; it works for large PDFs as well.
Sample:
InputStream input = new FileInputStream("sample.pdf");
ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
Metadata metadata = new Metadata();
new PDFParser().parse(input, handler, metadata, new ParseContext());
String plainText = handler.toString();
System.out.println(plainText);
I have a DB2 database with an XML column. I would like to read the data from it and save each XML document to a separate file.
Here is a part of my code:
final List<Map<String, Object>> myList = dbcManager.createQuery(query).getResultList();
int i = 0;
for (final Map<String, Object> element : myList) {
    i++;
    String filePath = "C://elements//elem_" + i + ".xml";
    File file = new File(filePath);
    if (!file.exists()) {
        file.createNewFile();
    }
    BufferedWriter out = new BufferedWriter(new FileWriter(filePath));
    out.write(element.get("columnId"));
    out.close();
}
Now I get an error on the line out.write(element.get("columnId"));, because element.get("columnId") returns an Object, while write() expects, for example, a String.
And my question is: to which type should I convert (cast) element.get("columnId") in order to save it to an XML file?
You should use the ResultSet.getSQLXML() method to read the XML column value, then use an appropriate method of the SQLXML class, e.g. getString() or getCharacterStream(). More info here.
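A minimal sketch over plain JDBC (assuming an open java.sql.Connection named connection; the query and the column name "columnId" are taken from the question):
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.ResultSet;
import java.sql.SQLXML;
import java.sql.Statement;

try (Statement stmt = connection.createStatement();
     ResultSet rs = stmt.executeQuery(query)) {
    int i = 0;
    while (rs.next()) {
        i++;
        SQLXML xml = rs.getSQLXML("columnId"); // JDBC 4.0 accessor for XML columns
        try (BufferedWriter out = new BufferedWriter(
                new FileWriter("C://elements//elem_" + i + ".xml"))) {
            out.write(xml.getString()); // serialize the XML value as a String
        }
        xml.free(); // release the resources held by the SQLXML object
    }
}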
In http://www.newegg.com/Siteindex_USA.xml, lots of URLs of .gz files are given, like this:
<loc>
http://www.newegg.com//Sitemap/USA/newegg_sitemap_product01.xml.gz
</loc>
I want to extract these dynamically. I don't want to store them locally; I just want to extract them and store the contained data in a database.
Update:
I am getting an exception in this method:
private void processGzip(URL url, byte[] response)
        throws MalformedURLException, IOException, UnknownFormatException {
    if (DEBUG) System.out.println("Processing gzip");
    InputStream is = new ByteArrayInputStream(response);
    // remove the .gz ending
    String xmlUrl = url.toString().replaceFirst("\\.gz$", "");
    if (DEBUG) System.out.println("XML url = " + xmlUrl);
    InputStream decompressed = new GZIPInputStream(is);
    InputSource in = new InputSource(decompressed);
    in.setSystemId(xmlUrl);
    processXml(url, in);
    decompressed.close();
}
Simply wrap the input stream in GZIPInputStream, and it'll decompress the data as you're reading it.
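For example, a minimal sketch that streams one of the sitemap URLs from the question and decompresses it on the fly, without saving anything locally (storing the data in a database is left as a placeholder):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

URL url = new URL("http://www.newegg.com//Sitemap/USA/newegg_sitemap_product01.xml.gz");
try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(url.openStream()), StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        // hand each decompressed line (or the whole reader) to your
        // XML parser / database layer instead of printing it
        System.out.println(line);
    }
}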