I am trying to get a file's MIME type by using its extension. I created a list of accepted MIME types using the https://www.iana.org/assignments/media-types/media-types.xml page.
For example, I need to get the "audio/opus" value from a file like file.opus. However, I get "application/ogg" when I use the following approach:
public static void validateFileExtension(InputStream inputStream, String fileId) {
    try {
        BufferedInputStream bis = new BufferedInputStream(inputStream);
        AutoDetectParser parser = new AutoDetectParser();
        Detector detector = parser.getDetector();
        Metadata md = new Metadata();
        md.add(Metadata.RESOURCE_NAME_KEY, fileId);
        MediaType mediaType = detector.detect(bis, md);
        String fileTypeFromFile = mediaType.toString();
        // code omitted for brevity
    } catch (IOException e) {
        throw new UploadFailedException(fileId);
    }
}
I have also searched for this issue and found some workarounds, but they do not completely fix the problem. So, how can I get the correct MIME type from a file using its extension, or some other way, in Java?
Another thing I am not sure about: is it unsafe to add application/octet-stream to the allowed file types/extensions? I ask because I also got an "application/octet-stream" value when using other approaches. Any ideas?
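For reference, here is a minimal sketch of the name-only detection I have been experimenting with, using the org.apache.tika.Tika facade (assuming your Tika version's MIME registry maps the .opus extension to audio/opus; content sniffing reports the Ogg container type instead, which seems to be where application/ogg comes from):
Tika tika = new Tika();
// detect(String) looks at the file name only; no stream is opened,
// so the Ogg container magic cannot override the extension mapping
String mimeType = tika.detect("file.opus");
System.out.println(mimeType); // "audio/opus" if the mapping is registered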
The code below works perfectly for converting HTML to plain text...
URL url = new URL(your_url);
InputStream is = url.openStream();
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
parser.parse(is, textHandler, metadata, context);
System.out.println("Body: " + textHandler.toString());
My question is:
How can I retain/keep specific elements (like links) in the output, or prevent specific elements (like links) from being removed, during the HTML to plain text conversion?
Thanks and best regards...
There are many ways you can use Apache Tika for this kind of work.
As Gagravarr says, the ContentHandler you use is the key here. There are a number of useful ones that could help, or you can build your own custom one.
Since I'm not sure what you are looking to do, I've tried to share some examples of common approaches, particularly for HTML content.
MatchingContentHandler
A common route is to use a MatchingContentHandler to filter the content you are interested in:
URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
// Only select <a> tags to be output
XPathParser xhtmlParser = new XPathParser("xhtml", XHTMLContentHandler.XHTML);
Matcher divContentMatcher = xhtmlParser.parse("//xhtml:a/descendant::node()");
MatchingContentHandler handler = new MatchingContentHandler(new ToHTMLContentHandler(), divContentMatcher);
// Parse based on original question
HtmlParser parser = new HtmlParser();
Metadata metadata = new Metadata();
parser.parse(is, handler, metadata, new ParseContext());
System.out.println("Links: " + handler.toString());
It's worth noting this is for inclusion only, and it only supports a subset of XPath. See XPathParser for details.
LinkContentHandler
If you just want to extract links, the LinkContentHandler is a great option:
URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
LinkContentHandler linkHandler = new LinkContentHandler();
Metadata metadata = new Metadata();
HtmlParser parser = new HtmlParser();
parser.parse(is, linkHandler, metadata, new ParseContext());
System.out.println("Links: " + linkHandler.getLinks());
Its code is also a great example of how to build a custom handler.
BoilerpipeContentHandler
The BoilerpipeContentHandler uses the Boilerpipe library underneath, allowing you to use one of its defined extractors to process the content.
URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
ExtractorBase extractor = ArticleSentencesExtractor.getInstance();
BoilerpipeContentHandler textHandler = new BoilerpipeContentHandler(new BodyContentHandler(), extractor);
Metadata metadata = new Metadata();
HtmlParser parser = new HtmlParser();
parser.parse(is, textHandler, metadata, new ParseContext());
System.out.println(textHandler.getTextDocument().getTextBlocks());
These can be really useful if you are mainly interested in the content inside, as the extractors help you focus on it.
Custom ContentHandler or ContentHandlerDecorator
You can build your own ContentHandler to do custom processing and get exactly what you want out from a file.
In some cases this could be writing out specific content, as in the example below; in other cases it could be processing such as collecting links and making them available, as in the LinkContentHandler.
Using custom ContentHandler instances is really powerful, and there are a ton of examples available in the Apache Tika code base, as well as in other open source projects.
Below is a bit of a contrived example, just trying to emit part of the HTML:
URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
StringWriter sw = new StringWriter();
SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
handler.setResult(new StreamResult(sw));
ContentHandlerDecorator h2Handler = new ContentHandlerDecorator(handler) {
    private final List<String> elementsToInclude = List.of("h2");
    private boolean processElement = false;

    @Override
    public void startElement(String uri, String local, String name, Attributes atts)
            throws SAXException {
        if (elementsToInclude.contains(name)) {
            processElement = true;
            super.startElement(uri, local, name, atts);
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Skip whitespace
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (!processElement) {
            return;
        }
        super.characters(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String name) throws SAXException {
        if (elementsToInclude.contains(name)) {
            processElement = false;
            super.endElement(uri, local, name);
        }
    }
};
HtmlParser parser = new HtmlParser();
parser.parse(is, h2Handler, new Metadata(), new ParseContext());
System.out.println("Heading Level 2s: " + sw.toString());
As you can see in some of these examples, one or more ContentHandler instances can be chained together, but it's worth noting that some expect the output to be well-formed, so check the Javadocs. XHTMLContentHandler is also useful if you want to map different files into your own common format.
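As a small illustration of chaining (a minimal sketch, assuming you want both the plain text and the links from a single parse), a TeeContentHandler can fan the SAX events out to several handlers at once:
URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
// TeeContentHandler forwards every SAX event to each wrapped handler,
// so a single parse can feed both the text and link handlers
BodyContentHandler textHandler = new BodyContentHandler();
LinkContentHandler linkHandler = new LinkContentHandler();
TeeContentHandler teeHandler = new TeeContentHandler(textHandler, linkHandler);
HtmlParser parser = new HtmlParser();
parser.parse(is, teeHandler, new Metadata(), new ParseContext());
System.out.println("Body: " + textHandler.toString());
System.out.println("Links: " + linkHandler.getLinks());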
JSoup :)
Another route is JSoup: either directly (skip the Tika part and use Jsoup.connect()) if you are processing HTML anyway, or chained with Apache Tika if you want to read HTML generated from different file types.
URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
ToHTMLContentHandler html = new ToHTMLContentHandler();
HtmlParser parser = new HtmlParser();
parser.parse(is, html, new Metadata(), new ParseContext());
Document doc = Jsoup.parse(html.toString());
Elements h2List = doc.select("h2");
for (Element headline : h2List) {
    System.out.println(headline.text());
}
Once you've parsed it, you can query the document with Jsoup. It is not the most efficient route compared to a ContentHandler built for the job, but it can be useful for messy content sets.
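For completeness, the direct route mentioned above skips Tika entirely (a minimal sketch, assuming the source really is HTML served over HTTP):
// Fetch and parse the page in one step with Jsoup; no Tika involved
Document doc = Jsoup.connect("http://tika.apache.org").get();
for (Element headline : doc.select("h2")) {
    System.out.println(headline.text());
}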
I have an input stream of a PDF document available to me. I would like to add subject metadata to the document and then save it. I'm not sure how to do this.
I came across a sample recipe here: https://pdfbox.apache.org/1.8/cookbook/workingwithmetadata.html
However, it is still fuzzy to me. Below is what I'm trying, with my questions marked at the places I'm unsure about:
PDDocument doc = PDDocument.load(myInputStream);
PDDocumentCatalog catalog = doc.getDocumentCatalog();
InputStream newXMPData = ...; // what goes here? How can I add a subject tag?
PDMetadata newMetadata = new PDMetadata(doc, newXMPData, false);
catalog.setMetadata(newMetadata);
// does anything else need to happen to save the document??
// I would like an OutputStream of the document (with metadata) so that I can save it to an S3 bucket
The following code sets the title of a PDF document, but it should be adaptable to work with other properties as well:
public static byte[] insertTitlePdf(byte[] documentBytes, String title) {
    try {
        PDDocument document = PDDocument.load(documentBytes);
        PDDocumentInformation info = document.getDocumentInformation();
        info.setTitle(title);
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        document.save(baos);
        document.close();
        return baos.toByteArray();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}
Apache PDFBox is needed, so add it to your project, e.g. via Maven:
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.6</version>
</dependency>
Add a title with:
byte[] documentBytesWithTitle = insertTitlePdf(documentBytes, "Some fancy title");
Display it in the browser with (JSF example):
<object class="pdf" data="data:application/pdf;base64,#{myBean.getDocumentBytesWithTitleAsBase64()}" type="application/pdf">Document could not be loaded</object>
Another much easier way to do this would be to use the built-in Document Information object:
PDDocument inputDoc = // your doc
inputDoc.getDocumentInformation().setCreator("Some meta");
inputDoc.getDocumentInformation().setCustomMetadataValue("fieldName", "fieldValue");
This also has the benefit of not requiring the xmpbox library.
This answer uses xmpbox and comes from the AddMetadataFromDocInfo example in the source code download:
XMPMetadata xmp = XMPMetadata.createXMPMetadata();
DublinCoreSchema dc = xmp.createAndAddDublinCoreSchema();
dc.setDescription("descr");
XmpSerializer serializer = new XmpSerializer();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
serializer.serialize(xmp, baos, true);
PDMetadata metadata = new PDMetadata(doc);
metadata.importXMPMetadata(baos.toByteArray());
doc.getDocumentCatalog().setMetadata(metadata);
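On the save question from the original post: nothing else is needed beyond PDDocument.save(). A minimal sketch, assuming doc is the document from the snippet above and that a byte array is convenient for the S3 upload:
// save() serializes the document, metadata included
ByteArrayOutputStream out = new ByteArrayOutputStream();
doc.save(out);
doc.close();
// wrap out.toByteArray() in a ByteArrayInputStream for the S3 client
byte[] pdfBytes = out.toByteArray();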
I extracted meta information through the Apache Tika library, made some changes, and now I want to write the changed information out to the file.
The code snippet for extracting is here:
InputStream input = new FileInputStream(new File(...));
ContentHandler handler = new DefaultHandler();
Metadata metadata = new Metadata();
Parser parser = new Mp3Parser();
ParseContext parseCtx = new ParseContext();
parser.parse(input, handler, metadata, parseCtx);
input.close();
String[] tags = metadata.names();
I then make some further changes, like:
for (String tagName : tags) {
    metadata.remove(tagName);
}
And finally, I want to write out the modified version. How can I do this?
ForkParser is a new Tika parser that was introduced in Tika version 0.9, located in org.apache.tika.fork. The new parser forks off a new JVM process to analyze the passed file stream. I figured this may be a good way to constrain how much memory I'm willing to devote to Tika's metadata extraction process. However, the Metadata object is not being populated with the appropriate metadata properties as it would be when using an AutoDetectParser. Tests have shown that the BodyContentHandler object is not null.
Why is the Metadata object not being populated with anything (except the manually added RESOURCE_NAME_KEY)?
public static Metadata getMetadata(File f) {
    Metadata metadata = new Metadata();
    try {
        FileInputStream fis = new FileInputStream(f);
        BodyContentHandler contentHandler = new BodyContentHandler(-1);
        ParseContext context = new ParseContext();
        ForkParser parser = new ForkParser();
        parser.setJavaCommand("/usr/local/java6/bin/java -Xmx64m");
        metadata.set(Metadata.RESOURCE_NAME_KEY, f.getName());
        parser.parse(fis, contentHandler, metadata, context);
        fis.close();
        String contentType = metadata.get(Metadata.CONTENT_TYPE);
        logger.error("contentHandler: " + contentHandler.toString());
        logger.error("metadata: " + metadata.toString());
        return metadata;
    } catch (Throwable e) {
        logger.error("Exception while analyzing file\n" +
                "CAUTION: metadata may still have useful content in it!\n" +
                "Exception: " + e, e);
        return metadata;
    }
}
The ForkParser class in Tika 1.0 unfortunately does not support metadata extraction since for now the communication channel to the forked parser process only supports passing back SAX events but not metadata entries. I suggest you file a TIKA improvement issue to get this fixed.
One workaround you might want to consider is getting the extracted metadata from the <meta> tags in the <head> section of the XHTML document returned by the forked parser. Those should be available and contain most of the metadata entries normally returned in the Metadata object.
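A minimal sketch of that workaround, reusing the Jsoup approach from earlier to read the <meta> tags out of the forked parser's XHTML output (ToXMLContentHandler is an assumption here; any handler that preserves the markup should do):
ForkParser parser = new ForkParser();
ToXMLContentHandler xhtmlHandler = new ToXMLContentHandler();
parser.parse(new FileInputStream(f), xhtmlHandler, new Metadata(), new ParseContext());
parser.close();
// The forked process streams back SAX events, so the XHTML markup
// (including the <meta> entries in <head>) survives the round trip
Document doc = Jsoup.parse(xhtmlHandler.toString());
for (Element meta : doc.select("head meta")) {
    System.out.println(meta.attr("name") + " = " + meta.attr("content"));
}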