tika: disable looking inside zip files - java

Currently, Tika looks inside zip files when it processes them.
I'd like to disable this feature so that it only returns the application/zip MIME type.
I'm using this code right now:
public String getMimeType(InputStream is) throws IOException {
TikaConfig tikaConfig = TikaConfig.getDefaultConfig();
Detector detector = tikaConfig.getDetector(); //new DefaultDetector();
Metadata metadata = new Metadata();
MediaType mediaType = detector.detect(TikaInputStream.get(is), metadata);
return mediaType.toString();
}
This code returns the MIME type of the zipped file rather than application/zip.
Any ideas?

Based on your example I wrote a dummy app and ran it against a large zip file that doesn't have the .zip extension. I do not see Tika parsing the whole file.
I stepped through with a debugger: Tika only reads 65536 bytes to determine the file type.
See: Tika MimeTypes.class:154
public MediaType detect(InputStream input, Metadata metadata) throws IOException {
List<MimeType> possibleTypes = null;
// Get type based on magic prefix
if (input != null) {
input.mark(getMinLength());
try {
byte[] prefix = readMagicHeader(input);
possibleTypes = getMimeType(prefix);
} finally {
input.reset();
}
}
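The excerpt above relies on mark/reset so that reading the magic prefix does not consume the stream. Here is a minimal sketch of the same idea using only the JDK. `HeaderPeek` and `peek` are hypothetical names for illustration, not part of Tika:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class HeaderPeek {
    // Reads up to maxBytes from the start of the stream, then rewinds it.
    // The stream must support mark/reset (BufferedInputStream does).
    static byte[] peek(InputStream in, int maxBytes) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(maxBytes);
        try {
            byte[] buffer = new byte[maxBytes];
            int total = 0;
            int n;
            // A single read() may return fewer bytes than requested, so loop.
            while (total < maxBytes
                    && (n = in.read(buffer, total, maxBytes - total)) != -1) {
                total += n;
            }
            return Arrays.copyOf(buffer, total);
        } finally {
            in.reset(); // the caller sees the stream from its start again
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "PK-example-payload".getBytes(StandardCharsets.US_ASCII);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        byte[] head = peek(in, 2);
        System.out.println(new String(head, StandardCharsets.US_ASCII)); // PK
        System.out.println((char) in.read()); // P, because the stream was rewound
    }
}
```

Only the first few bytes are touched, which is why the size of the zip file should not matter for this kind of detection.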
Dummy app:
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
public class Main {
public static void main(String[] args) throws IOException {
System.out.println(new Main().getMimeType(new FileInputStream(new File("C:\\temp\\apache-tomcat-8.0.47-windows-x64"))));
}
public String getMimeType(InputStream is) throws IOException {
final TikaConfig tikaConfig = TikaConfig.getDefaultConfig();
Detector detector = tikaConfig.getDetector(); //new DefaultDetector();
Metadata metadata = new Metadata();
MediaType mediaType = detector.detect(TikaInputStream.get(is), metadata);
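// toString() gives the full type, e.g. "application/zip"; getType() would only give "application"
return mediaType.toString();
}
}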
Maven:
<dependencies>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>1.18</version>
</dependency>
</dependencies>
Can you show me an example that reads and parses the whole zip file? I can see how that would be a problem once files exceed a certain size.
Unfortunately, I can't help you if I can't reproduce the problem.

Related

application/octet-stream for file Upload

In a Spring Boot app, I am trying to filter file types when a user uploads a file, using the following approach:
public static void validateFile(InputStream inputStream, String fileId) throws FileUploadException {
try {
BufferedInputStream bis = new BufferedInputStream(inputStream);
AutoDetectParser parser = new AutoDetectParser();
Detector detector = parser.getDetector();
Metadata md = new Metadata();
md.add(Metadata.RESOURCE_NAME_KEY, fileId);
MediaType mediaType = detector.detect(bis, md);
// check if mediaType.toString() is in my valid file types ("audio/opus", "audio/x-aac", ...)
} catch (IOException e) {
throw new FileUploadException(fileId);
}
}
However, when I upload a file with the .aac extension, the detected file type is application/octet-stream.
According to Common MIME types, application/octet-stream is the default value for all other cases (I think it is used when the file type is not detected).
So, in this situation, should I add application/octet-stream to my list of valid file types? I know it may be dangerous in some cases, but how else can I filter the file types for audio/video?
Note: I used some pages while building my valid file type list e.g. https://www.iana.org/assignments/media-types/media-types.xml
The first thing I will do is give a short and concise run-through/implementation of Tika.
Below is the code for a method you can use to find the type of the document:
public static String detectDocTypeUsingDetector(InputStream stream)
throws IOException {
Detector detector = new DefaultDetector();
Metadata metadata = new Metadata();
MediaType mediaType = detector.detect(stream, metadata);
return mediaType.toString();
}
This Java code returns the document type. Even if the extension of the file has been changed, Tika can still identify the correct type using the magic bytes it finds at the beginning of the file.
Another easier way to do this is through the facade Tika class:
public static String detectDocTypeUsingFacade(InputStream stream)
throws IOException {
Tika tika = new Tika();
String mediaType = tika.detect(stream);
return mediaType;
}
Next, let us extract the file’s contents using a parser.
public static String extractContentUsingParser(InputStream stream)
throws IOException, TikaException, SAXException {
Parser parser = new AutoDetectParser();
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
parser.parse(stream, handler, metadata, context);
return handler.toString();
}
This is a higher level approach that shows what is going on but again, we can use the Tika class for an easier approach:
public static String extractContentUsingFacade(InputStream stream)
throws IOException, TikaException {
Tika tika = new Tika();
String content = tika.parseToString(stream);
return content;
}
Next, let’s extract the metadata using the parser as well:
public static Metadata extractMetadataUsingParser(InputStream stream)
throws IOException, SAXException, TikaException {
Parser parser = new AutoDetectParser();
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
parser.parse(stream, handler, metadata, context);
return metadata;
}
And again, we can make it easier:
public static Metadata extractMetadataUsingFacade(InputStream stream)
throws IOException, TikaException {
Tika tika = new Tika();
Metadata metadata = new Metadata();
tika.parse(stream, metadata);
return metadata;
}
Importing packages
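Here are the imports you need for these methods:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;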

How to determine extension from fileBytes

My application allows users to download files. While creating headers I am using Tika to set extension as shown below.
This works fine for PDF files but fails for Word and Excel files.
private HttpHeaders getHeaderData(byte[] fileBytes) throws IOException, MimeTypeException {
final HttpHeaders headers = new HttpHeaders();
TikaInputStream tikaStream = TikaInputStream.get(fileBytes);
Tika tika = new Tika();
String mimeType = tika.detect(tikaStream);
headers.setContentType(MediaType.valueOf(mimeType));
MimeTypes defaultMimeTypes = MimeTypes.getDefaultMimeTypes();
String extension = defaultMimeTypes.forName(mimeType).getExtension();
headers.add("file-ext", extension);
return headers;
}
I see that the mimeType is resolved to "application/pdf" for pdf files but resolves to "application/x-tika-ooxml" for excel and word files which is the problem.
How can I get the Word (.docx) and Excel (.xls, .xlsx) formats if I only have the file as bytes?
Why does this work for pdf?
Summary
The short answer is: You have to use Tika's detector with its MediaType class - not MimeTypes.
The slightly longer answer is: Even that will not get you all the way, because of how older MS-Office files are structured. For those you have to also parse the files, and inspect their metadata.
The term "media type" has replaced the term "MIME type". As the IANA registry puts it:
"[RFC2046] specifies that Media Types (formerly known as MIME types) and Media Subtypes will be assigned and listed by the IANA."
Office 97-2003
When Tika inspects Excel and Word 97-2003 files using its detector, it will return a media type of application/x-tika-msoffice. I assume (perhaps incorrectly) that this is its way of handling a file-type group, where the detector cannot determine the specific flavor of MS-Office 97-2003 file, based on its analysis. This is similar to the application/x-tika-ooxml in your question.
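One way to see why a header-only check can report no more than a generic type: every OLE2 compound file (.doc, .xls, .ppt) starts with the same eight-byte signature, and every OOXML file is a ZIP archive that starts with `PK`. The sketch below illustrates that point and is not Tika's implementation. `ContainerSniffer` and `container` are hypothetical names:

```java
import java.util.Arrays;

class ContainerSniffer {
    // Signature shared by all OLE2 compound documents (.doc, .xls, .ppt, ...).
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Local file header signature at the start of a non-empty ZIP
    // (and therefore of .docx and .xlsx files).
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    // Returns the container family only; it cannot tell Word from Excel.
    static String container(byte[] fileBytes) {
        if (startsWith(fileBytes, OLE2)) {
            return "OLE2";
        }
        if (startsWith(fileBytes, ZIP)) {
            return "ZIP";
        }
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] legacyOffice = {
            (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
            (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1, 0x00
        };
        byte[] ooxml = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        System.out.println(container(legacyOffice)); // OLE2
        System.out.println(container(ooxml));        // ZIP
    }
}
```

Telling a Word file from an Excel file then requires looking inside the container, which is consistent with the extra parse step used later in this answer.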
Expected Results
Based on the IANA media types list and Mozilla's list of common MIME types, these are the media types we expect to get for the following file types:
.pdf :: application/pdf
.xls :: application/vnd.ms-excel
.doc :: application/msword
.xlsx :: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
.docx :: application/vnd.openxmlformats-officedocument.wordprocessingml.document
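If all you need is the file extension for these five types, the list above can be written as a plain lookup table. This is a sketch with a hypothetical `ExtensionLookup` helper that covers only the types listed here; it is not a replacement for Tika's own registry:

```java
import java.util.Map;
import java.util.Optional;

class ExtensionLookup {
    // Media type -> extension, limited to the five types listed above.
    private static final Map<String, String> EXTENSIONS = Map.of(
        "application/pdf", ".pdf",
        "application/vnd.ms-excel", ".xls",
        "application/msword", ".doc",
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", ".xlsx",
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document", ".docx"
    );

    // Returns an empty Optional for null or unlisted media types,
    // e.g. the generic application/x-tika-ooxml.
    static Optional<String> extensionFor(String mediaType) {
        if (mediaType == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(EXTENSIONS.get(mediaType));
    }

    public static void main(String[] args) {
        System.out.println(extensionFor("application/msword").orElse("?"));       // .doc
        System.out.println(extensionFor("application/x-tika-ooxml").orElse("?")); // ?
    }
}
```

The lookup only helps once detection has produced one of the specific media types; the generic `x-tika-*` types still need the detection steps described in this answer.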
The Program
The program shown below uses the following Maven dependencies:
<dependencies>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>1.23</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers</artifactId>
<version>1.23</version>
</dependency>
<dependency>
<groupId>javax.ws.rs</groupId>
<artifactId>javax.ws.rs-api</artifactId>
<version>2.1.1</version>
</dependency>
</dependencies>
The program (just for this demo - not production ready) is shown below. Specifically, look at the tikaDetect() and tikaParse() methods.
import java.io.IOException;
import java.io.File;
import java.io.FileInputStream;
import java.io.BufferedInputStream;
import java.util.Set;
import java.util.HashSet;
import org.apache.tika.mime.MediaType;
import org.apache.tika.mime.MimeTypeException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.detect.Detector;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.SAXException;
import org.xml.sax.ContentHandler;
public class Main {
private final Set<File> msOfficeFiles = new HashSet<>();
public static void main(String[] args) throws IOException, MimeTypeException,
SAXException, TikaException {
Main main = new Main();
main.doFileDetection();
}
private void doFileDetection() throws IOException, MimeTypeException, SAXException, TikaException {
File file1 = new File("C:/tmp/foo.pdf");
File file2 = new File("C:/tmp/baz.xlsx");
File file3 = new File("C:/tmp/bat.docx");
// Excel 97-2003 format:
File file4 = new File("C:/tmp/bar.xls");
// Word 97-2003 format:
File file5 = new File("C:/tmp/daz.doc");
Set<File> files = new HashSet<>();
files.add(file1);
files.add(file2);
files.add(file3);
files.add(file4);
files.add(file5);
for (File file : files) {
try (BufferedInputStream bis = new BufferedInputStream(
new FileInputStream(file))) {
tikaDetect(file, bis);
} catch (IOException e) {
e.printStackTrace();
}
}
for (File file : msOfficeFiles) {
tikaParse(file);
}
}
private void tikaDetect(File file, BufferedInputStream bis)
throws IOException, SAXException, TikaException {
Detector detector = new DefaultDetector();
Metadata metadata = new Metadata();
MediaType mediaType = detector.detect(bis, metadata);
if (mediaType.toString().equals("application/x-tika-msoffice")) {
msOfficeFiles.add(file);
} else {
System.out.println("Media Type for " + file.getName()
+ " is: " + mediaType.toString());
}
}
private void tikaParse(File file) throws SAXException, TikaException {
Parser parser = new AutoDetectParser();
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
try (BufferedInputStream bis = new BufferedInputStream(
new FileInputStream(file))) {
parser.parse(bis, handler, metadata, context);
} catch (IOException e) {
e.printStackTrace();
}
System.out.println("Media Type for " + file.getName()
+ " is: " + metadata.get("Content-Type"));
}
}
Actual Results
The program generates some warnings and information messages. If we ignore these for this exercise, we get the following print statements:
Media Type for bat.docx is: application/vnd.openxmlformats-officedocument.wordprocessingml.document
Media Type for baz.xlsx is: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Media Type for foo.pdf is: application/pdf
Media Type for bar.xls is: application/vnd.ms-excel
Media Type for daz.doc is: application/msword
These match the expected official media (MIME) types.
Tika official usages:
https://tika.apache.org/1.26/detection.html
Tika supported formats:
https://tika.apache.org/1.26/formats.html
You could get the answers by simply reading the above 2 pages.
Here are some key quotes:
Microsoft Office and some related applications produce documents in the generic OLE 2 Compound Document and Office Open XML (OOXML) formats. The older OLE 2 format was introduced in Microsoft Office version 97 and was the default format until Office version 2007 and the new XML-based OOXML format. The OfficeParser and OOXMLParser classes use Apache POI libraries to support text and metadata extraction from both OLE2 and OOXML documents.
That means you also need to include the Apache POI jars or Maven dependencies for MS Office files.
Tika provides a wrapping detector in the form of org.apache.tika.detect.DefaultDetector. This uses the service loader to discover all available detectors, including any available container aware ones, and tries them in turn. For container aware detection, include the Tika Parsers jar and its dependencies in your project, then use DefaultDetector along with a TikaInputStream.
That means you need to include the Tika Parsers jar or Maven dependencies.
Then use
new DefaultDetector().detect(TikaInputStream.get(file), new Metadata());

Content type check for uploading APK file in JAVA

How can I check in Java whether a user is actually uploading an APK file?
Checking only the file extension is not enough, because a user can rename any file (e.g. a TXT file) to .apk and upload it.
I tried checking the content type of the file as shown below:
if(apkFile.getContentType().equals("application/vnd.android.package-archive"))
The problem is that when the upload comes from some browsers the content type is application/vnd.android.package-archive, but from others it's application/octet-stream.
What is the proper way to check this?
Apache Tika's parser can detect the file type even if the file extension has been changed by renaming, though the parse method is a bit expensive. After parsing, the file's metadata is stored in the Metadata object. You can follow the code below:
import org.apache.tika.metadata.Metadata;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.parser.AutoDetectParser;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
public class Test {
public static void main(String[] args) {
File file = null;
InputStream is = null;
String contentType = null;
try {
//suppose f.apk file you renamed it f.txt
file = new File("path\\to\\your\\file\\f.txt");
is = new FileInputStream(file);
Metadata metadata = new Metadata();
BodyContentHandler handler = new BodyContentHandler();
AutoDetectParser parser = new AutoDetectParser();
try {
// This step takes some time.
parser.parse(is, handler, metadata);
} catch (Exception ex) {
ex.printStackTrace();
} finally {
is.close();
}
contentType = metadata.get("Content-Type");
System.out.println("Content Type = " + contentType);
} catch (Exception e) {
e.printStackTrace();
}
}
}
Required maven dependency
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>1.16</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers</artifactId>
<version>1.16</version>
</dependency>
An APK is a ZIP archive, so in addition to checking the file extension you could check whether the upload is a valid ZIP file. That reduces the chance of accepting something that is not an APK. After all, not every troublemaker will have renamed a ZIP file to .apk.
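The ZIP check can be taken one step further with the JDK alone. The sketch below is a heuristic that rests on one assumption: an APK contains a top-level `AndroidManifest.xml` entry. `ApkCheck`, `looksLikeApk` and `zipWith` are hypothetical names, and a deliberately crafted ZIP would still pass:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ApkCheck {
    // Heuristic (assumption): an upload is treated as an APK if it is a readable
    // ZIP that contains a top-level AndroidManifest.xml entry.
    // Note: this consumes and closes the given stream.
    static boolean looksLikeApk(InputStream upload) {
        try (ZipInputStream zip = new ZipInputStream(upload)) {
            ZipEntry entry;
            // getNextEntry() returns null right away when the data is not a ZIP.
            while ((entry = zip.getNextEntry()) != null) {
                if ("AndroidManifest.xml".equals(entry.getName())) {
                    return true;
                }
            }
            return false;
        } catch (IOException e) {
            return false; // corrupt or unreadable archive
        }
    }

    // Builds a tiny in-memory ZIP with a single entry, for demonstration only.
    static byte[] zipWith(String entryName) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write("demo".getBytes(StandardCharsets.US_ASCII));
            zip.closeEntry();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] fakeApk = zipWith("AndroidManifest.xml");
        byte[] plainZip = zipWith("readme.txt");
        byte[] text = "just some text".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeApk(new ByteArrayInputStream(fakeApk)));  // true
        System.out.println(looksLikeApk(new ByteArrayInputStream(plainZip))); // false
        System.out.println(looksLikeApk(new ByteArrayInputStream(text)));     // false
    }
}
```

Unlike the browser-supplied content type, this looks at the uploaded bytes themselves, so it does not depend on which browser sent the file.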

Java - Differentiate ZIP file from CSV file

I'm using a web service that always sends me a text/plain file. However, that file can be either a ZIP or a CSV, and I'm not informed of its type beforehand.
Is there a way to determine the file type programmatically by looking at its content? One is binary and the other is readable text.
I've already thought of looking for lots of commas in the file content, but that seems inaccurate.
You can use java.util.zip.ZipFile: if the constructor throws a ZipException, it's not a zip file.
try(ZipFile zip = new ZipFile(filename)) {
// It's a zip file
}
catch(ZipException e) {
// Not a valid zip
}
You could make use of the ZIP file structure.
As per the local file header, each non-empty ZIP file should start with the bytes 0x50 0x4b 0x03 0x04 (the signature is often written as the little-endian integer 0x04034b50).
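A minimal sketch of that signature check, assuming the whole response is already available as a byte array. A non-empty ZIP begins with 0x50 0x4B 0x03 0x04 (`PK` followed by 3 and 4), and an empty archive begins with the end-of-central-directory signature 0x50 0x4B 0x05 0x06. `ZipSignature` and `looksLikeZip` are hypothetical names:

```java
import java.nio.charset.StandardCharsets;

class ZipSignature {
    // True if the content starts with a ZIP local file header signature
    // or with the end-of-central-directory signature of an empty archive.
    static boolean looksLikeZip(byte[] content) {
        if (content == null || content.length < 4) {
            return false;
        }
        boolean pk = content[0] == 0x50 && content[1] == 0x4B; // "PK"
        boolean localHeader = content[2] == 0x03 && content[3] == 0x04;
        boolean emptyArchive = content[2] == 0x05 && content[3] == 0x06;
        return pk && (localHeader || emptyArchive);
    }

    public static void main(String[] args) {
        byte[] zipStart = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        byte[] csv = "id,name\n1,Ann\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(looksLikeZip(zipStart)); // true
        System.out.println(looksLikeZip(csv));      // false
    }
}
```

A CSV file that happens to start with these exact bytes would be misclassified, but that is far less likely than a false positive from counting commas.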
You could also use a MIME detection library such as Apache Tika:
import org.apache.tika.Tika;
import org.apache.tika.mime.MediaType;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
public class Detect {
/**
* Resolves the MediaType using Tika and prints it to the standard output.
* @param file the path of the file to probe.
* @throws IOException whenever an I/O exception occurs.
*/
private void detect(Path file) throws IOException {
Tika tika = new Tika();
try(InputStream is = Files.newInputStream(file)){
MediaType mediaType = MediaType.parse(tika.detect(is));
System.out.println(mediaType);
}
}
public static void main(String[] args) throws IOException {
Detect d = new Detect();
d.detect(Paths.get("zip_file"));
d.detect(Paths.get("csv_file"));
}
}

Java: Apache POI: Can I get clean text from MS Word (.doc) files?

The strings I'm (programmatically) getting from MS Word files when using Apache POI are not the same text I can look at when I open the files with MS Word.
When using the following code:
File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
HWPFDocument wordDoc = new HWPFDocument(inputStrm);
System.out.println(wordDoc.getText());
the output is a single line with many 'invalid' characters (yes, the 'boxes'), and many unwanted strings, like "FORMTEXT", "HYPERLINK \l "_Toc##########"" ('#' being numeric digits), "PAGEREF _Toc########## \h 4", etc.
The following code "fixes" the single-line problem, but maintains all the invalid characters and unwanted text:
File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
WordExtractor wordExtractor = new WordExtractor(inputStrm);
for(String paragraph:wordExtractor.getParagraphText()){
System.out.println(paragraph);
}
I don't know if I'm using the wrong method for extracting the text, but that's what I've come up with when looking at POI's quick-guide. If I am, what is the correct approach?
If that output is correct, is there a standard way for getting rid of the unwanted text, or will I have to write a filter of my own?
This class can read both .doc and .docx files in Java. For this I'm using tika-app-1.2.jar:
/*
* This class is used to read .doc and .docx files
*
* @author Developer
*
*/
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.net.URL;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
class TextExtractor {
private OutputStream outputstream;
private ParseContext context;
private Detector detector;
private Parser parser;
private Metadata metadata;
private String extractedText;
public TextExtractor() {
context = new ParseContext();
detector = new DefaultDetector();
parser = new AutoDetectParser(detector);
context.set(Parser.class, parser);
outputstream = new ByteArrayOutputStream();
metadata = new Metadata();
}
public void process(String filename) throws Exception {
URL url;
File file = new File(filename);
if (file.isFile()) {
url = file.toURI().toURL();
} else {
url = new URL(filename);
}
InputStream input = TikaInputStream.get(url, metadata);
ContentHandler handler = new BodyContentHandler(outputstream);
parser.parse(input, handler, metadata, context);
input.close();
}
public void getString() {
//Get the text into a String object
extractedText = outputstream.toString();
//Do whatever you want with this String object.
System.out.println(extractedText);
}
public static void main(String args[]) throws Exception {
if (args.length == 1) {
TextExtractor textExtractor = new TextExtractor();
textExtractor.process(args[0]);
textExtractor.getString();
} else {
throw new Exception();
}
}
}
To compile:
javac -cp ".:tika-app-1.2.jar" TextExtractor.java
To run:
java -cp ".:tika-app-1.2.jar" TextExtractor SomeWordDocument.doc
There are two options, one provided directly in Apache POI, the other via Apache Tika (which uses Apache POI internally).
The first option is to use WordExtractor, but wrap it in a call to stripFields(String) when calling it. That will remove the text based fields included in the text, things like HYPERLINK that you've seen. Your code would become:
NPOIFSFileSystem fs = new NPOIFSFileSystem(file);
WordExtractor extractor = new WordExtractor(fs.getRoot());
for(String rawText : extractor.getParagraphText()) {
String text = extractor.stripFields(rawText);
System.out.println(text);
}
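To illustrate what stripping fields involves, here is a hand-rolled sketch. It is not POI's implementation. It assumes that field codes in the extracted text are delimited by the control characters U+0013 (field begin), U+0014 (separator between the field instructions and the visible result) and U+0015 (field end). `FieldStripper` is a hypothetical name:

```java
import java.util.ArrayDeque;
import java.util.Deque;

class FieldStripper {
    private static final char FIELD_BEGIN = '\u0013';
    private static final char FIELD_SEPARATOR = '\u0014';
    private static final char FIELD_END = '\u0015';

    // Drops field instructions (e.g. HYPERLINK ...) and keeps the visible result text.
    static String strip(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: true while still inside its instruction part.
        Deque<Boolean> openFields = new ArrayDeque<>();
        int inInstructions = 0; // number of open fields whose instructions are being read
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                openFields.push(true);
                inInstructions++;
            } else if (c == FIELD_SEPARATOR) {
                if (!openFields.isEmpty() && openFields.peek()) {
                    openFields.pop();
                    openFields.push(false); // now in the visible result part
                    inInstructions--;
                }
            } else if (c == FIELD_END) {
                if (!openFields.isEmpty() && openFields.pop()) {
                    inInstructions--; // field ended without a result part
                }
            } else if (inInstructions == 0) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "See \u0013 HYPERLINK \\l \"_Toc1\" \u0014Chapter 1\u0015 now";
        System.out.println(strip(raw)); // See Chapter 1 now
    }
}
```

In practice the built-in stripFields call shown above is the safer choice; the sketch only shows where the unwanted HYPERLINK and PAGEREF strings come from.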
The other option is to use Apache Tika. Tika provides text extraction, and metadata, for a wide variety of files, so the same code will work for .doc, .docx, .pdf and many others too. To get clean, plain text of your word document (you can also get XHTML if you'd rather), you'd do something like:
TikaConfig tika = TikaConfig.getDefaultConfig();
TikaInputStream stream = TikaInputStream.get(file);
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
tika.getParser().parse(stream, handler, metadata, new ParseContext());
String text = handler.toString();
Try this; it works for me and is purely a POI solution. You will have to look for the HWPFDocument counterpart though: HWPFDocument reads the Word 97-2003 (.doc) format, while XWPFDocument, which I use here, reads .docx.
InputStream inputstream = new FileInputStream(m_filepath);
//read the file
XWPFDocument adoc= new XWPFDocument(inputstream);
//and place it in a xwpf format
String aString = new XWPFWordExtractor(adoc).getText();
//gets the full text
Now, if you want certain parts you can use getParagraphText(), but don't use the text extractor; call it directly on the paragraph like this:
for (XWPFParagraph p : adoc.getParagraphs())
{
System.out.println(p.getParagraphText());
}
