My application allows users to download files. While creating headers I am using Tika to set extension as shown below.
This works fine for pdf files. Fails for DOC and EXCEL files.
private HttpHeaders getHeaderData(byte[] fileBytes) throws IOException, MimeTypeException {
final HttpHeaders headers = new HttpHeaders();
TikaInputStream tikaStream = TikaInputStream.get(fileBytes);
Tika tika = new Tika();
String mimeType = tika.detect(tikaStream);
headers.setContentType(MediaType.valueOf(mimeType));
MimeTypes defaultMimeTypes = MimeTypes.getDefaultMimeTypes();
String extension = defaultMimeTypes.forName(mimeType).getExtension();
headers.add("file-ext", extension);
return headers;
}
I see that the mimeType is resolved to "application/pdf" for pdf files but resolves to "application/x-tika-ooxml" for excel and word files which is the problem.
How can I get the Word (.docx) and Excel (.xls, .xlsx) formats if I only have the file as bytes?
Why does this work for pdf?
Summary
The short answer is: You have to use Tika's detector with its MediaType class - not MimeTypes.
The slightly longer answer is: Even that will not get you all the way, because of how older MS-Office files are structured. For those you have to also parse the files, and inspect their metadata.
The term "media type" has replaced the term "MIME type" - see here:
[RFC2046] specifies that Media Types (formerly known as MIME types) and Media
Subtypes will be assigned and listed by the IANA.
Office 97-2003
When Tika inspects Excel and Word 97-2003 files using its detector, it will return a media type of application/x-tika-msoffice. I assume (perhaps incorrectly) that this is its way of handling a file-type group, where the detector cannot determine the specific flavor of MS-Office 97-2003 file, based on its analysis. This is similar to the application/x-tika-ooxml in your question.
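To illustrate why detection based on leading bytes can stop at a file-type group: every Office 97-2003 file is an OLE2 compound document and starts with the same 8-byte signature, and every OOXML file is a ZIP archive, so the first bytes identify only the container, whereas a PDF announces itself directly. The following is a minimal sketch using only the JDK; the names ContainerSniffer and family are hypothetical, and it is not meant to mirror Tika's internals.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ContainerSniffer {
    // OLE2 compound file signature, shared by .doc and .xls (Office 97-2003).
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header signature, shared by .docx and .xlsx.
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};
    // PDF files start with the ASCII text "%PDF-".
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns the container family suggested by the leading bytes.
    static String family(byte[] data) {
        if (startsWith(data, OLE2)) return "ole2";
        if (startsWith(data, ZIP)) return "zip";
        if (startsWith(data, PDF)) return "pdf";
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) return false;
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(family(pdf)); // prints "pdf"
    }
}
```

A result of "ole2" or "zip" says nothing about whether the file is a Word or an Excel document; that distinction needs a look inside the container.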
Expected Results
Based on the IANA list here, and a Mozilla list here, these are the media types we expect to get for the following file types:
.pdf :: application/pdf
.xls :: application/vnd.ms-excel
.doc :: application/msword
.xlsx :: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
.docx :: application/vnd.openxmlformats-officedocument.wordprocessingml.document
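Once detection yields one of the specific media types listed above, turning it into a file extension can be a plain lookup. The sketch below is a hand-written illustration using only the JDK; the class ExtensionLookup and the method extensionFor are hypothetical names, and the table covers only the five types in this list.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

final class ExtensionLookup {
    private static final Map<String, String> EXTENSIONS = new HashMap<>();
    static {
        EXTENSIONS.put("application/pdf", ".pdf");
        EXTENSIONS.put("application/vnd.ms-excel", ".xls");
        EXTENSIONS.put("application/msword", ".doc");
        EXTENSIONS.put(
            "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", ".xlsx");
        EXTENSIONS.put(
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document", ".docx");
    }

    // Returns the extension for a media type, or empty for unknown types
    // (including group types such as application/x-tika-ooxml).
    static Optional<String> extensionFor(String mediaType) {
        if (mediaType == null) return Optional.empty();
        // Drop parameters such as "; charset=binary" and normalise case.
        String base = mediaType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return Optional.ofNullable(EXTENSIONS.get(base));
    }

    public static void main(String[] args) {
        System.out.println(extensionFor("application/msword").orElse("")); // prints ".doc"
    }
}
```

An empty result for application/x-tika-ooxml or application/x-tika-msoffice is a signal that detection stopped at the container level.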
The Program
The program shown below uses the following Maven dependencies:
<dependencies>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>1.23</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers</artifactId>
<version>1.23</version>
</dependency>
<dependency>
<groupId>javax.ws.rs</groupId>
<artifactId>javax.ws.rs-api</artifactId>
<version>2.1.1</version>
</dependency>
</dependencies>
The program (just for this demo - not production ready) is shown below. Specifically, look at the tikaDetect() and tikaParse() methods.
import java.io.IOException;
import java.io.File;
import java.io.FileInputStream;
import java.io.BufferedInputStream;
import java.util.Set;
import java.util.HashSet;
import org.apache.tika.mime.MediaType;
import org.apache.tika.mime.MimeTypeException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.detect.Detector;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.SAXException;
import org.xml.sax.ContentHandler;
public class Main {
private final Set<File> msOfficeFiles = new HashSet<>();
public static void main(String[] args) throws IOException, MimeTypeException,
SAXException, TikaException {
Main main = new Main();
main.doFileDetection();
}
private void doFileDetection() throws IOException, MimeTypeException, SAXException, TikaException {
File file1 = new File("C:/tmp/foo.pdf");
File file2 = new File("C:/tmp/baz.xlsx");
File file3 = new File("C:/tmp/bat.docx");
// Excel 97-2003 format:
File file4 = new File("C:/tmp/bar.xls");
// Word 97-2003 format:
File file5 = new File("C:/tmp/daz.doc");
Set<File> files = new HashSet<>();
files.add(file1);
files.add(file2);
files.add(file3);
files.add(file4);
files.add(file5);
for (File file : files) {
try (BufferedInputStream bis = new BufferedInputStream(
new FileInputStream(file))) {
tikaDetect(file, bis);
} catch (IOException e) {
e.printStackTrace();
}
}
for (File file : msOfficeFiles) {
tikaParse(file);
}
}
private void tikaDetect(File file, BufferedInputStream bis)
throws IOException, SAXException, TikaException {
Detector detector = new DefaultDetector();
Metadata metadata = new Metadata();
MediaType mediaType = detector.detect(bis, metadata);
if (mediaType.toString().equals("application/x-tika-msoffice")) {
msOfficeFiles.add(file);
} else {
System.out.println("Media Type for " + file.getName()
+ " is: " + mediaType.toString());
}
}
private void tikaParse(File file) throws SAXException, TikaException {
Parser parser = new AutoDetectParser();
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
try (BufferedInputStream bis = new BufferedInputStream(
new FileInputStream(file))) {
parser.parse(bis, handler, metadata, context);
} catch (IOException e) {
e.printStackTrace();
}
System.out.println("Media Type for " + file.getName()
+ " is: " + metadata.get("Content-Type"));
}
}
Actual Results
The program generates some warnings and information messages. If we ignore these for this exercise, we get the following print statements:
Media Type for bat.docx is: application/vnd.openxmlformats-officedocument.wordprocessingml.document
Media Type for baz.xlsx is: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Media Type for foo.pdf is: application/pdf
Media Type for bar.xls is: application/vnd.ms-excel
Media Type for daz.doc is: application/msword
These match the expected official media (MIME) types.
Tika official usages:
https://tika.apache.org/1.26/detection.html
Tika supported formats:
https://tika.apache.org/1.26/formats.html
You could get the answers by simply reading the above 2 pages.
Here are some key quotes:
Microsoft Office and some related applications produce documents in the generic OLE 2 Compound Document and Office Open XML (OOXML) formats. The older OLE 2 format was introduced in Microsoft Office version 97 and was the default format until Office version 2007 and the new XML-based OOXML format. The OfficeParser and OOXMLParser classes use Apache POI libraries to support text and metadata extraction from both OLE2 and OOXML documents.
That means you also need to include the Apache POI jars or Maven dependencies for MS Office files.
Tika provides a wrapping detector in the form of org.apache.tika.detect.DefaultDetector. This uses the service loader to discover all available detectors, including any available container aware ones, and tries them in turn. For container aware detection, include the Tika Parsers jar and its dependencies in your project, then use DefaultDetector along with a TikaInputStream.
That means you need to include the Tika Parsers jar or Maven dependencies.
Then use
new DefaultDetector().detect(TikaInputStream.get(file), new Metadata());
Related
I'm having problems with Apache POI and File Mime Type.
I use a template file (Microsoft Word DOCX) and modify some values in it with Apache POI.
The original file has the MIME type "application/vnd.openxmlformats-officedocument.wordprocessingml.document" (on Linux: file -i {filename}), but after I process the file with POI and save it again I get "application/octet-stream". I would like to keep the original MIME type.
I opened both the original and the modified file in a hex editor, and both have the same magic number (50 4B 03 04), but the file sizes differ even though the text is the same.
Is it possible to fix this? Has anyone had the same problem? I checked with LibreOffice and it appears to behave the same way as Apache POI.
Any help or information is appreciated.
As you already stated in a comment, the way Apache POI rearranges the Office Open XML ZIP package leads some tools to misinterpret the content type. An Office Open XML file (*.docx, *.xlsx, *.pptx) is a ZIP archive, but there must be something special about how Microsoft Office packs that archive. I have not found out what exactly it is, though.
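One way to investigate what differs between two packages is to list their ZIP entries in physical order and compare. The sketch below does this with the JDK's ZipInputStream, which walks the local file headers in the order they are stored. It rests on an assumption that is not confirmed here: that signature-based tools look at the names and order of the first entries. The class name PackageEntryOrder is hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class PackageEntryOrder {
    // Returns entry names in the order their local file headers appear.
    static List<String> entryNames(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    // Usage: java PackageEntryOrder Document.docx NewDocument.docx
    public static void main(String[] args) throws IOException {
        for (String file : args) {
            try (InputStream in = Files.newInputStream(Paths.get(file))) {
                System.out.println(file + ": " + entryNames(in));
            }
        }
    }
}
```

Running it on the file saved by Word and on the file rewritten by POI shows whether the entry order changed between the two.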
Example:
Start having a Document.docx having some simple content, which was saved by Microsoft Word.
For this, file -i produces:
axel@arichter:~/Dokumente/JAVA/poi/poi-4.0.1$ file -i Document.docx
Document.docx: application/vnd.openxmlformats-officedocument.wordprocessingml.document; charset=binary
Now run that code:
import java.io.FileOutputStream;
import java.io.FileInputStream;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
public class WordReadAndReWrite {
public static void main(String[] args) throws Exception {
String inFilePath = "Document.docx";
String outFilePath = "NewDocument.docx";
XWPFDocument doc = new XWPFDocument(new FileInputStream(inFilePath));
doc.createParagraph().createRun().setText("new text inserted");
FileOutputStream out = new FileOutputStream(outFilePath);
doc.write(out);
out.close();
doc.close();
}
}
For the resulting NewDocument.docx, file -i produces:
axel@arichter:~/Dokumente/JAVA/poi/poi-4.0.1$ file -i NewDocument.docx
NewDocument.docx: application/octet-stream; charset=binary
But if we are doing the same without using Apache POI's ZipPackage but instead using FileSystem for getting the XML out of the Office Open XML ZIP package using following code:
import java.nio.file.Files;
import java.nio.file.FileSystems;
import java.nio.file.FileSystem;
import java.nio.file.Paths;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.dom.DOMSource;
public class WordReadAndReWriteFileSystem {
public static void main(String[] args) throws Exception {
String inFilePath = "Document.docx";
String outFilePath = "NewDocument.docx";
FileSystem fileSystem = FileSystems.newFileSystem(Paths.get(inFilePath), (ClassLoader) null);
Path wordDocumentXml = fileSystem.getPath("/word/document.xml");
DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document xmlDocument = documentBuilder.parse(Files.newInputStream(wordDocumentXml, StandardOpenOption.READ));
Node p = xmlDocument.createElement("w:p");
Node r = xmlDocument.createElement("w:r");
p.appendChild(r);
Node t = xmlDocument.createElement("w:t");
r.appendChild(t);
Node text = xmlDocument.createTextNode("new text inserted");
t.appendChild(text);
Node body = xmlDocument.getElementsByTagName("w:body").item(0);
Node sectPr = xmlDocument.getElementsByTagName("w:sectPr").item(0);
body.insertBefore(p, sectPr);
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
DOMSource domSource = new DOMSource(xmlDocument);
Path tmpDoc = Files.createTempFile("wordDocument", "tmp");
tmpDoc.toFile().deleteOnExit();
StreamResult streamResult = new StreamResult(Files.newOutputStream(tmpDoc, StandardOpenOption.WRITE));
transformer.transform(domSource, streamResult);
fileSystem.close();
Path tmpZip = Files.createTempFile("zipDocument", "tmp");
tmpZip.toFile().deleteOnExit();
Path path = Files.copy(Paths.get(inFilePath), tmpZip, StandardCopyOption.REPLACE_EXISTING);
fileSystem = FileSystems.newFileSystem(path, (ClassLoader) null);
wordDocumentXml = fileSystem.getPath("/word/document.xml");
Files.copy(tmpDoc, wordDocumentXml, StandardCopyOption.REPLACE_EXISTING);
fileSystem.close();
Files.copy(tmpZip, Paths.get(outFilePath), StandardCopyOption.REPLACE_EXISTING);
}
}
Then for the resulting NewDocument.docx, file -i produces:
axel@arichter:~/Dokumente/JAVA/poi/poi-4.0.1$ file -i NewDocument.docx
NewDocument.docx: application/vnd.openxmlformats-officedocument.wordprocessingml.document; charset=binary
This code shows the correct MIME type for every file I tested:
public static void main(String[] args) {
String fileName = "model_libreoffice.docx";
// String fileName = "model_poi.docx";
// String fileName = "model_msoffice.docx";
// String fileName = "model_repacked_bz2.docx";
try {
InputStream is = Main.class.getResourceAsStream("/" + fileName);
Tika t = new Tika();
String mime = t.detect(is, fileName);
System.out.println("----> " + mime);
} catch (IOException e) {
e.printStackTrace();
}
}
After long debugging and testing, I think the problem lies in how third-party tools validate the files.
This simple code gives me the correct MIME type for every file I tried: files modified by Microsoft Office, LibreOffice, and Apache POI, and files whose contents were unzipped and zipped again (then renamed to DOCX).
So I think this problem can be marked as "solved".
Currently, Tika processes ZIP files by looking inside them.
I'd like to disable this feature and just get the application/zip MIME type.
I'm using this code right now:
public String getMimeType(InputStream is) throws IOException {
TikaConfig tikaConfig = TikaConfig.getDefaultConfig();
Detector detector = tikaConfig.getDetector(); //new DefaultDetector();
Metadata metadata = new Metadata();
MediaType mediaType = detector.detect(TikaInputStream.get(is), metadata);
return mediaType.toString();
}
This code returns the MIME type of the zipped content instead.
Any ideas?
Based on your example I wrote a dummy app. I then used a large ZIP file that doesn't have the .zip extension. I did not see Tika parsing the whole file.
I looked with a debugger. Tika only reads 65536 bytes to determine the file type.
See: Tika MimeTypes.class:154
public MediaType detect(InputStream input, Metadata metadata) throws IOException {
List<MimeType> possibleTypes = null;
// Get type based on magic prefix
if (input != null) {
input.mark(getMinLength());
try {
byte[] prefix = readMagicHeader(input);
possibleTypes = getMimeType(prefix);
} finally {
input.reset();
}
}
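The excerpt above relies on the mark/reset pattern: remember the current position, read a bounded prefix, then rewind so later consumers still see the whole stream. Below is a minimal JDK-only sketch of that pattern. HeaderPeek and peek are hypothetical names, and this is an illustration of the technique rather than Tika's code.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

final class HeaderPeek {
    // Reads up to maxBytes from the current position, then rewinds the stream.
    static byte[] peek(InputStream in, int maxBytes) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(maxBytes);
        try {
            byte[] buffer = new byte[maxBytes];
            int total = 0;
            int n;
            // A single read may return fewer bytes than requested, so loop.
            while (total < maxBytes
                    && (n = in.read(buffer, total, maxBytes - total)) != -1) {
                total += n;
            }
            return Arrays.copyOf(buffer, total);
        } finally {
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {0x50, 0x4B, 0x03, 0x04, 1, 2, 3};
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(data))) {
            byte[] header = peek(in, 4);
            System.out.println(header.length + " header bytes, first read after peek: "
                    + in.read());
        }
    }
}
```

Because only a bounded prefix is read, the cost of this step does not grow with the size of the file.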
dummy app
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
public class Main {
public static void main(String[] args) throws IOException {
System.out.println(new Main().getMimeType(new FileInputStream(new File("C:\\temp\\apache-tomcat-8.0.47-windows-x64"))));
}
public String getMimeType(InputStream is) throws IOException {
final TikaConfig tikaConfig = TikaConfig.getDefaultConfig();
Detector detector = tikaConfig.getDetector(); //new DefaultDetector();
Metadata metadata = new Metadata();
MediaType mediaType = detector.detect(TikaInputStream.get(is), metadata);
return mediaType.toString();
}
}
Maven:
<dependencies>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>1.18</version>
</dependency>
</dependencies>
Can you show me an example that reads and parses the whole ZIP file? I can understand that this would be a problem when files exceed a certain size.
Unfortunately, I can't help you if I can't reproduce the problem.
How can I check in Java whether a user is actually uploading an APK file?
Checking only the file extension is not enough, because a user can rename any file (e.g. a TXT file) to .apk and upload it.
I tried checking the content type of the uploaded file as shown below:
if(apkFile.getContentType().equals("application/vnd.android.package-archive"))
The problem is that some browsers send the content type application/vnd.android.package-archive, while others send application/octet-stream.
What is the proper way to check this?
Apache Tika's parser can detect the file type even if the extension was changed by renaming, although the parse method is somewhat expensive. After parsing, the file's metadata is stored in the Metadata object. You can follow the code below:
import org.apache.tika.metadata.Metadata;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.parser.AutoDetectParser;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
public class Test {
public static void main(String[] args) {
File file = null;
InputStream is = null;
String contentType = null;
try {
//suppose f.apk file you renamed it f.txt
file = new File("path\\to\\your\\file\\f.txt");
is = new FileInputStream(file);
Metadata metadata = new Metadata();
BodyContentHandler handler = new BodyContentHandler();
AutoDetectParser parser = new AutoDetectParser();
try {
// This step takes some time.
parser.parse(is, handler, metadata);
} catch (Exception ex) {
ex.printStackTrace();
} finally {
is.close();
}
contentType = metadata.get("Content-Type");
System.out.println("Content Type = " + contentType);
} catch (Exception e) {
e.printStackTrace();
}
}
}
Required maven dependency
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>1.16</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers</artifactId>
<version>1.16</version>
</dependency>
An APK is a ZIP archive, so in addition to checking the file extension you could check whether the upload is a ZIP file, which lowers the chance of accepting something that is not an APK. After all, not every troublemaker will have renamed a ZIP file to .apk.
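The ZIP check can be taken one step further by looking for an entry that an APK is expected to contain. The sketch below uses only the JDK and treats an upload as APK-like if it is a readable ZIP archive with a top-level AndroidManifest.xml entry. That criterion is an assumption added for illustration, and it is a heuristic, not proof of a valid APK; ApkCheck and looksLikeApk are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class ApkCheck {
    // True if the stream is a ZIP archive with a top-level AndroidManifest.xml.
    // Input that is not a ZIP archive yields no entries and therefore false.
    static boolean looksLikeApk(InputStream in) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("AndroidManifest.xml")) {
                    return true;
                }
            }
        }
        return false;
    }

    // Usage: java ApkCheck path/to/upload
    public static void main(String[] args) throws IOException {
        for (String file : args) {
            try (InputStream in = Files.newInputStream(Paths.get(file))) {
                System.out.println(file + ": " + looksLikeApk(in));
            }
        }
    }
}
```

Unlike a full parse, this stops at the first matching entry name and never interprets the entry contents.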
I'm using a web service that always sends me a file labelled text/plain. However, the file can be either a ZIP or a CSV, and I'm not told which beforehand.
Is there a way to determine the file type programmatically by looking at its content? One is binary and the other is readable text.
I've already thought of looking for lots of commas in the file content, but that seems inaccurate.
You can use java.util.zip.ZipFile, if the constructor throws a ZipException, it's not a zip file...
try(ZipFile zip = new ZipFile(filename)) {
// It's a zip file
}
catch(ZipException e) {
// Not a valid zip
}
You could make use of the ZIP file structure.
As per the file header specification, each ZIP file should start with the bytes 0x50 0x4B 0x03 0x04 (the local file header signature 0x04034b50, stored little-endian).
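In file order, a ZIP local file header begins with the bytes 0x50 0x4B 0x03 0x04 ("PK" followed by 3 and 4), so a payload that is already in memory can be checked without touching the file system. A minimal sketch, with the hypothetical names ZipSignature and isZip:

```java
import java.nio.charset.StandardCharsets;

final class ZipSignature {
    // True if the payload starts with the ZIP local file header signature.
    // Caveat: an empty archive has no local file header and is not matched.
    static boolean isZip(byte[] payload) {
        return payload != null
                && payload.length >= 4
                && payload[0] == 0x50   // 'P'
                && payload[1] == 0x4B   // 'K'
                && payload[2] == 0x03
                && payload[3] == 0x04;
    }

    public static void main(String[] args) {
        byte[] csv = "id,name\n1,foo\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(isZip(csv)); // prints "false"
    }
}
```

For the ZIP-or-CSV case this is enough to pick the right branch; anything that is not a ZIP can then be handed to the CSV reader.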
You could also use a MIME detection library such as Apache Tika:
import org.apache.tika.Tika;
import org.apache.tika.mime.MediaType;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
public class Detect {
/**
* Resolves the MediaType using Tika and prints it to the standard output.
* @param file the path of the file to probe.
* @throws IOException whenever an I/O exception occurs.
*/
private void detect(Path file) throws IOException {
Tika tika = new Tika();
try(InputStream is = Files.newInputStream(file)){
MediaType mediaType = MediaType.parse(tika.detect(is));
System.out.println(mediaType);
}
}
public static void main(String[] args) throws IOException {
Detect d = new Detect();
d.detect(Paths.get("zip_file"));
d.detect(Paths.get("csv_file"));
}
}
The strings I'm (programmatically) getting from MS Word files when using Apache POI are not the same text I can look at when I open the files with MS Word.
When using the following code:
File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
HWPFDocument wordDoc = new HWPFDocument(inputStrm);
System.out.println(wordDoc.getText());
the output is a single line with many 'invalid' characters (yes, the 'boxes'), and many unwanted strings, like "FORMTEXT", "HYPERLINK \l "_Toc##########"" ('#' being numeric digits), "PAGEREF _Toc########## \h 4", etc.
The following code "fixes" the single-line problem, but maintains all the invalid characters and unwanted text:
File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
WordExtractor wordExtractor = new WordExtractor(inputStrm);
for(String paragraph:wordExtractor.getParagraphText()){
System.out.println(paragraph);
}
I don't know if I'm using the wrong method for extracting the text, but that's what I've come up with when looking at POI's quick-guide. If I am, what is the correct approach?
If that output is correct, is there a standard way for getting rid of the unwanted text, or will I have to write a filter of my own?
This class can read both .doc and .docx files in Java. For this I'm using tika-app-1.2.jar:
/*
* This class is used to read .doc and .docx files
*
* @author Developer
*
*/
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.net.URL;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
class TextExtractor {
private OutputStream outputstream;
private ParseContext context;
private Detector detector;
private Parser parser;
private Metadata metadata;
private String extractedText;
public TextExtractor() {
context = new ParseContext();
detector = new DefaultDetector();
parser = new AutoDetectParser(detector);
context.set(Parser.class, parser);
outputstream = new ByteArrayOutputStream();
metadata = new Metadata();
}
public void process(String filename) throws Exception {
URL url;
File file = new File(filename);
if (file.isFile()) {
url = file.toURI().toURL();
} else {
url = new URL(filename);
}
InputStream input = TikaInputStream.get(url, metadata);
ContentHandler handler = new BodyContentHandler(outputstream);
parser.parse(input, handler, metadata, context);
input.close();
}
public void getString() {
//Get the text into a String object
extractedText = outputstream.toString();
//Do whatever you want with this String object.
System.out.println(extractedText);
}
public static void main(String args[]) throws Exception {
if (args.length == 1) {
TextExtractor textExtractor = new TextExtractor();
textExtractor.process(args[0]);
textExtractor.getString();
} else {
throw new Exception();
}
}
}
To compile:
javac -cp ".:tika-app-1.2.jar" TextExtractor.java
To run:
java -cp ".:tika-app-1.2.jar" TextExtractor SomeWordDocument.doc
There are two options, one provided directly in Apache POI, the other via Apache Tika (which uses Apache POI internally).
The first option is to use WordExtractor, but wrap it in a call to stripFields(String) when calling it. That will remove the text based fields included in the text, things like HYPERLINK that you've seen. Your code would become:
NPOIFSFileSystem fs = new NPOIFSFileSystem(file);
WordExtractor extractor = new WordExtractor(fs.getRoot());
for(String rawText : extractor.getParagraphText()) {
String text = extractor.stripFields(rawText);
System.out.println(text);
}
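For readers who want to see what removing fields involves, here is a JDK-only sketch. It assumes the Word binary format's field delimiter characters: 0x13 starts a field, 0x14 separates the field instruction (such as HYPERLINK or PAGEREF) from its visible result, and 0x15 ends the field. These are likely the "boxes" in the raw output. This is not POI's stripFields implementation, and FieldStripper and stripFieldCodes are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class FieldStripper {
    private static final char FIELD_BEGIN = 0x13;
    private static final char FIELD_SEPARATOR = 0x14;
    private static final char FIELD_END = 0x15;

    // Drops field instructions and delimiter characters, keeps field results.
    static String stripFieldCodes(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        // One flag per open field: true once its separator has been seen.
        Deque<Boolean> inResult = new ArrayDeque<>();
        int hidden = 0; // open fields that are still in their instruction part
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == FIELD_BEGIN) {
                inResult.push(false);
                hidden++;
            } else if (c == FIELD_SEPARATOR) {
                if (!inResult.isEmpty() && !inResult.peek()) {
                    inResult.pop();
                    inResult.push(true);
                    hidden--;
                }
            } else if (c == FIELD_END) {
                // A field closed without a separator was instruction-only.
                if (!inResult.isEmpty() && !inResult.pop()) {
                    hidden--;
                }
            } else if (hidden == 0) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "See " + FIELD_BEGIN + " PAGEREF _Toc1 \\h " + FIELD_SEPARATOR
                + "4" + FIELD_END + " for details";
        System.out.println(stripFieldCodes(raw)); // prints "See 4 for details"
    }
}
```

Nested fields are handled by the stack: text is emitted only when no enclosing field is still in its instruction part.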
The other option is to use Apache Tika. Tika provides text extraction, and metadata, for a wide variety of files, so the same code will work for .doc, .docx, .pdf and many others too. To get clean, plain text of your word document (you can also get XHTML if you'd rather), you'd do something like:
TikaConfig tika = TikaConfig.getDefaultConfig();
TikaInputStream stream = TikaInputStream.get(file);
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
tika.getParser().parse(stream, handler, metadata, new ParseContext());
String text = handler.toString();
Try this; it works for me and is purely a POI solution. HWPFDocument handles the Word 97-2003 (.doc) format, so for such files you will have to look for the HWPFDocument counterpart; for .docx files use XWPFDocument like I do.
InputStream inputstream = new FileInputStream(m_filepath);
//read the file
XWPFDocument adoc= new XWPFDocument(inputstream);
//and place it in a xwpf format
aString = new XWPFWordExtractor(adoc).getText();
//gets the full text
Now, if you want only certain parts, you can use getParagraphText, but don't use the text extractor; call it directly on the paragraph, like this:
for (XWPFParagraph p : adoc.getParagraphs())
{
System.out.println(p.getParagraphText());
}