I am trying to get a file's MIME type by using its extension. I created a list of accepted MIME types using the https://www.iana.org/assignments/media-types/media-types.xml page.
For example, I need to get the "audio/opus" value from a file like file.opus. However, I get "application/ogg" when I use the following approach:
public static void validateFileExtension(InputStream inputStream, String fileId) {
    try {
        BufferedInputStream bis = new BufferedInputStream(inputStream);
        AutoDetectParser parser = new AutoDetectParser();
        Detector detector = parser.getDetector();
        Metadata md = new Metadata();
        md.add(Metadata.RESOURCE_NAME_KEY, fileId);
        MediaType mediaType = detector.detect(bis, md);
        String fileTypeFromFile = mediaType.toString();
        // code omitted for brevity
    } catch (IOException e) {
        throw new UploadFailedException(fileId);
    }
}
I have also searched for this issue and found some workarounds, but they do not completely fix the problem. So, how can I get the correct MIME type from a file using its extension, or some other way, in Java?
Another thing I am not sure about: is it unsafe to add application/octet-stream to the allowed file types/extensions? I ask because I also got an "application/octet-stream" value when using other approaches. Any ideas?
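For reference, here is a minimal sketch of the name-only detection I have been experimenting with, using the org.apache.tika.Tika facade (assuming your Tika version's MIME registry maps the .opus extension to audio/opus; content sniffing reports the Ogg container type instead, which seems to be where application/ogg comes from):
Tika tika = new Tika();
// detect(String) looks at the file name only; no stream is opened,
// so the Ogg container magic cannot override the extension mapping
String mimeType = tika.detect("file.opus");
System.out.println(mimeType); // "audio/opus" if the mapping is registered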
The code below works perfectly for converting HTML to plain text...
URL url = new URL(your_url);
InputStream is = url.openStream();
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
parser.parse(is, textHandler, metadata, context);
System.out.println("Body: " + textHandler.toString());
My question is:
How can I retain/keep specific elements (like links) in the output, or prevent specific elements (like links) from being removed, during the HTML to plain text conversion?
Thanks and best regards...
There are many ways you can use Apache Tika for this kind of work.
As Gagravarr says, the ContentHandler you use is the key here. There are a number of useful ones that could help, or you can build your own custom one.
Since I'm not sure what you are looking to do, I've tried to share some examples of common approaches, particularly for HTML content.
MatchingContentHandler
A common route is to use a MatchingContentHandler to filter the content you are interested in:
URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
// Only select <a> tags to be output
XPathParser xhtmlParser = new XPathParser("xhtml", XHTMLContentHandler.XHTML);
Matcher divContentMatcher = xhtmlParser.parse("//xhtml:a/descendant::node()");
MatchingContentHandler handler = new MatchingContentHandler(new ToHTMLContentHandler(), divContentMatcher);
// Parse based on original question
HtmlParser parser = new HtmlParser();
Metadata metadata = new Metadata();
parser.parse(is, handler, metadata, new ParseContext());
System.out.println("Links: " + handler.toString());
It's worth noting this is for inclusion only, and it only supports a subset of XPath. See XPathParser for details.
LinkContentHandler
If you just want to extract links, the LinkContentHandler is a great option:
URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
LinkContentHandler linkHandler = new LinkContentHandler();
Metadata metadata = new Metadata();
HtmlParser parser = new HtmlParser();
parser.parse(is, linkHandler, metadata, new ParseContext());
System.out.println("Links: " + linkHandler.getLinks());
Its code is also a great example of how to build a custom handler.
BoilerpipeContentHandler
The BoilerpipeContentHandler uses the Boilerpipe library underneath, allowing you to use one of its defined extractors to process the content.
URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
ExtractorBase extractor = ArticleSentencesExtractor.getInstance();
BoilerpipeContentHandler textHandler = new BoilerpipeContentHandler(new BodyContentHandler(), extractor);
Metadata metadata = new Metadata();
HtmlParser parser = new HtmlParser();
parser.parse(is, textHandler, metadata, new ParseContext());
System.out.println(textHandler.getTextDocument().getTextBlocks());
These can be really useful if you are mainly interested in the content inside, as the extractors help you focus on it.
Custom ContentHandler or ContentHandlerDecorator
You can build your own ContentHandler to do custom processing and get exactly what you want out from a file.
In some cases this could be writing out specific content, as in the example below; in other cases it could be processing such as collecting links and making them available, as in the LinkContentHandler.
Using custom ContentHandler instances is really powerful, and there are a ton of examples available in the Apache Tika code base, as well as in other open source projects.
Below is a bit of a contrived example, just trying to emit part of the HTML:
URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
StringWriter sw = new StringWriter();
SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
handler.setResult(new StreamResult(sw));
ContentHandlerDecorator h2Handler = new ContentHandlerDecorator(handler) {
    private final List<String> elementsToInclude = List.of("h2");
    private boolean processElement = false;

    @Override
    public void startElement(String uri, String local, String name, Attributes atts)
            throws SAXException {
        if (elementsToInclude.contains(name)) {
            processElement = true;
            super.startElement(uri, local, name, atts);
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Skip whitespace
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (!processElement) {
            return;
        }
        super.characters(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String name) throws SAXException {
        if (elementsToInclude.contains(name)) {
            processElement = false;
            super.endElement(uri, local, name);
        }
    }
};
HtmlParser parser = new HtmlParser();
parser.parse(is, h2Handler, new Metadata(), new ParseContext());
System.out.println("Heading Level 2s: " + sw.toString());
As you can see in some of these examples, one or more ContentHandler instances can be chained together, but it's worth noting that some expect the output to be well-formed, so check the Javadocs. XHTMLContentHandler is also useful if you want to map different files into your own common format.
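As a small illustration of chaining (a minimal sketch, assuming you want both the plain text and the links from a single parse), a TeeContentHandler can fan the SAX events out to several handlers at once:
URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
// TeeContentHandler forwards every SAX event to each wrapped handler,
// so a single parse can feed both the text and link handlers
BodyContentHandler textHandler = new BodyContentHandler();
LinkContentHandler linkHandler = new LinkContentHandler();
TeeContentHandler teeHandler = new TeeContentHandler(textHandler, linkHandler);
HtmlParser parser = new HtmlParser();
parser.parse(is, teeHandler, new Metadata(), new ParseContext());
System.out.println("Body: " + textHandler.toString());
System.out.println("Links: " + linkHandler.getLinks());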
JSoup :)
Another route is JSoup: either directly (skip the Tika part and use Jsoup.connect()) if you are processing HTML anyway, or chained with Apache Tika if you want to read HTML generated from different file types.
URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
ToHTMLContentHandler html = new ToHTMLContentHandler();
HtmlParser parser = new HtmlParser();
parser.parse(is, html, new Metadata(), new ParseContext());
Document doc = Jsoup.parse(html.toString());
Elements h2List = doc.select("h2");
for (Element headline : h2List) {
    System.out.println(headline.text());
}
Once you've parsed it, you can query the document with Jsoup. It is not the most efficient route compared to a ContentHandler built for the job, but it can be useful for messy content sets.
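For completeness, the direct route mentioned above skips Tika entirely (a minimal sketch, assuming the source really is HTML served over HTTP):
// Fetch and parse the page in one step with Jsoup; no Tika involved
Document doc = Jsoup.connect("http://tika.apache.org").get();
for (Element headline : doc.select("h2")) {
    System.out.println(headline.text());
}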
I have an input stream of a PDF document available to me. I would like to add subject metadata to the document and then save it. I'm not sure how to do this.
I came across a sample recipe here: https://pdfbox.apache.org/1.8/cookbook/workingwithmetadata.html
However, it is still fuzzy to me. Below is what I'm trying, with my questions marked at the places I'm unsure about:
PDDocument doc = PDDocument.load(myInputStream);
PDDocumentCatalog catalog = doc.getDocumentCatalog();
InputStream newXMPData = ...; // what goes here? How can I add a subject tag?
PDMetadata newMetadata = new PDMetadata(doc, newXMPData, false);
catalog.setMetadata(newMetadata);
// does anything else need to happen to save the document??
// I would like an OutputStream of the document (with metadata) so that I can save it to an S3 bucket
The following code sets the title of a PDF document, but it should be adaptable to work with other properties as well:
public static byte[] insertTitlePdf(byte[] documentBytes, String title) {
    try {
        PDDocument document = PDDocument.load(documentBytes);
        PDDocumentInformation info = document.getDocumentInformation();
        info.setTitle(title);
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        document.save(baos);
        document.close();
        return baos.toByteArray();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}
Apache PDFBox is needed, so add it to your project, e.g. via Maven:
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.6</version>
</dependency>
Add a title with:
byte[] documentBytesWithTitle = insertTitlePdf(documentBytes, "Some fancy title");
Display it in the browser with (JSF example):
<object class="pdf" data="data:application/pdf;base64,#{myBean.getDocumentBytesWithTitleAsBase64()}" type="application/pdf">Document could not be loaded</object>
Another much easier way to do this would be to use the built-in Document Information object:
PDDocument inputDoc = // your doc
inputDoc.getDocumentInformation().setCreator("Some meta");
inputDoc.getDocumentInformation().setCustomMetadataValue("fieldName", "fieldValue");
This also has the benefit of not requiring the xmpbox library.
This answer uses xmpbox and comes from the AddMetadataFromDocInfo example in the source code download:
XMPMetadata xmp = XMPMetadata.createXMPMetadata();
DublinCoreSchema dc = xmp.createAndAddDublinCoreSchema();
dc.setDescription("descr");
XmpSerializer serializer = new XmpSerializer();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
serializer.serialize(xmp, baos, true);
PDMetadata metadata = new PDMetadata(doc);
metadata.importXMPMetadata(baos.toByteArray());
doc.getDocumentCatalog().setMetadata(metadata);
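On the save question from the original post: nothing else is needed beyond PDDocument.save(). A minimal sketch, assuming doc is the document from the snippet above and that a byte array is convenient for the S3 upload:
// save() serializes the document, metadata included
ByteArrayOutputStream out = new ByteArrayOutputStream();
doc.save(out);
doc.close();
// wrap out.toByteArray() in a ByteArrayInputStream for the S3 client
byte[] pdfBytes = out.toByteArray();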
I extracted meta information through the Apache Tika library, made some changes, and now I want to write the changed information out to the file.
The code snippet for extracting is here:
InputStream input = new FileInputStream(new File(...));
ContentHandler handler = new DefaultHandler();
Metadata metadata = new Metadata();
Parser parser = new Mp3Parser();
ParseContext parseCtx = new ParseContext();
parser.parse(input, handler, metadata, parseCtx);
input.close();
String[] tags = metadata.names();
I then make some further changes, like:
for (String tagName : tags) {
    metadata.remove(tagName);
}
And finally, I want to write out the modified version. How can I do this?
ForkParser is a new Tika parser that was introduced in Tika version 0.9, located in org.apache.tika.fork. The new parser forks off a new JVM process to analyze the passed file stream. I figured this may be a good way to constrain how much memory I'm willing to devote to Tika's metadata extraction process. However, the Metadata object is not being populated with the appropriate metadata properties as it would be when using an AutoDetectParser. Tests have shown that the BodyContentHandler object is not null.
Why is the Metadata object not being populated with anything (except the manually added RESOURCE_NAME_KEY)?
public static Metadata getMetadata(File f) {
    Metadata metadata = new Metadata();
    try {
        FileInputStream fis = new FileInputStream(f);
        BodyContentHandler contentHandler = new BodyContentHandler(-1);
        ParseContext context = new ParseContext();
        ForkParser parser = new ForkParser();
        parser.setJavaCommand("/usr/local/java6/bin/java -Xmx64m");
        metadata.set(Metadata.RESOURCE_NAME_KEY, f.getName());
        parser.parse(fis, contentHandler, metadata, context);
        fis.close();
        String contentType = metadata.get(Metadata.CONTENT_TYPE);
        logger.error("contentHandler: " + contentHandler.toString());
        logger.error("metadata: " + metadata.toString());
        return metadata;
    } catch (Throwable e) {
        logger.error("Exception while analyzing file\n" +
                "CAUTION: metadata may still have useful content in it!\n" +
                "Exception: " + e, e);
        return metadata;
    }
}
The ForkParser class in Tika 1.0 unfortunately does not support metadata extraction since for now the communication channel to the forked parser process only supports passing back SAX events but not metadata entries. I suggest you file a TIKA improvement issue to get this fixed.
One workaround you might want to consider is getting the extracted metadata from the <meta> tags in the <head> section of the XHTML document returned by the forked parser. Those should be available and contain most of the metadata entries normally returned in the Metadata object.
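A minimal sketch of that workaround, reusing the Jsoup approach from earlier to read the <meta> tags out of the forked parser's XHTML output (ToXMLContentHandler is an assumption here; any handler that preserves the markup should do):
ForkParser parser = new ForkParser();
ToXMLContentHandler xhtmlHandler = new ToXMLContentHandler();
parser.parse(new FileInputStream(f), xhtmlHandler, new Metadata(), new ParseContext());
parser.close();
// The forked process streams back SAX events, so the XHTML markup
// (including the <meta> entries in <head>) survives the round trip
Document doc = Jsoup.parse(xhtmlHandler.toString());
for (Element meta : doc.select("head meta")) {
    System.out.println(meta.attr("name") + " = " + meta.attr("content"));
}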