I'm using a web service that always sends me a file with the content type text/plain. However, the file can be either a ZIP or a CSV, and I'm not told which one beforehand.
Is there a way to determine the file type programmatically by inspecting its content? One is binary data and the other is actual readable text.
I've already thought of looking for lots of commas in the file content, but that seems inaccurate.
You can use java.util.zip.ZipFile: if the constructor throws a ZipException, it's not a ZIP file.
try (ZipFile zip = new ZipFile(filename)) {
// It's a zip file
}
catch (ZipException e) {
// Not a valid zip
}
catch (IOException e) {
// The file could not be opened or read at all
}
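You could make use of the ZIP file structure.
As per the local file header, a ZIP file that contains at least one entry starts with the bytes 0x50 0x4b 0x03 0x04 (the ASCII characters "PK" followed by 0x03 0x04). Read as a little-endian 32-bit integer, this signature is 0x04034b50.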
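The signature check can be sketched with the JDK alone. This is only a minimal illustration: the class and method names are made up for this example, and it assumes the response body is already available as a byte array. An empty archive starts with a different signature (0x50 0x4b 0x05 0x06), so this check covers only archives with at least one entry.

```java
import java.nio.charset.StandardCharsets;

class ZipSniffer {
    // Local file header signature: 'P', 'K', 0x03, 0x04.
    private static final byte[] ZIP_MAGIC = {0x50, 0x4b, 0x03, 0x04};

    /** Returns true if the data begins with the ZIP local file header signature. */
    static boolean looksLikeZip(byte[] data) {
        if (data == null || data.length < ZIP_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < ZIP_MAGIC.length; i++) {
            if (data[i] != ZIP_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] zipLike = {0x50, 0x4b, 0x03, 0x04, 0x14, 0x00};
        byte[] csvLike = "id,name\n1,foo\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeZip(zipLike)); // true
        System.out.println(looksLikeZip(csvLike)); // false
    }
}
```

Anything that fails the check can then be treated as CSV, which avoids guessing from the number of commas.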
You could also use a MIME detection library such as Apache Tika:
import org.apache.tika.Tika;
import org.apache.tika.mime.MediaType;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
public class Detect {
/**
* Resolves the MediaType using Tika and prints it to the standard output.
* @param file the path of the file to probe.
* @throws IOException whenever an I/O exception occurs.
*/
private void detect(Path file) throws IOException {
Tika tika = new Tika();
try(InputStream is = Files.newInputStream(file)){
MediaType mediaType = MediaType.parse(tika.detect(is));
System.out.println(mediaType);
}
}
public static void main(String[] args) throws IOException {
Detect d = new Detect();
d.detect(Paths.get("zip_file"));
d.detect(Paths.get("csv_file"));
}
}
Related
My application allows users to download files. When creating the headers, I use Tika to set the extension, as shown below.
This works fine for PDF files but fails for Word and Excel files.
private HttpHeaders getHeaderData(byte[] fileBytes) throws IOException, MimeTypeException {
final HttpHeaders headers = new HttpHeaders();
TikaInputStream tikaStream = TikaInputStream.get(fileBytes);
Tika tika = new Tika();
String mimeType = tika.detect(tikaStream);
headers.setContentType(MediaType.valueOf(mimeType));
MimeTypes defaultMimeTypes = MimeTypes.getDefaultMimeTypes();
String extension = defaultMimeTypes.forName(mimeType).getExtension();
headers.add("file-ext", extension);
return headers;
}
I see that the mimeType resolves to "application/pdf" for PDF files, but it resolves to "application/x-tika-ooxml" for Excel and Word files, which is the problem.
How can I get the Word (.docx) and Excel (.xls, .xlsx) formats if I have the file as bytes?
Why does this work for PDF?
Summary
The short answer is: You have to use Tika's detector with its MediaType class - not MimeTypes.
The slightly longer answer is: Even that will not get you all the way, because of how older MS-Office files are structured. For those you have to also parse the files, and inspect their metadata.
The term "media type" has replaced the term "MIME type" - see here:
[RFC2046] specifies that Media Types (formerly known as MIME types) and Media
Subtypes will be assigned and listed by the IANA.
Office 97-2003
When Tika inspects Excel and Word 97-2003 files using its detector, it will return a media type of application/x-tika-msoffice. I assume (perhaps incorrectly) that this is its way of handling a file-type group, where the detector cannot determine the specific flavor of MS-Office 97-2003 file, based on its analysis. This is similar to the application/x-tika-ooxml in your question.
Expected Results
Based on the IANA media types list and a Mozilla list of common MIME types, these are the media types we expect to get for the following file types:
.pdf :: application/pdf
.xls :: application/vnd.ms-excel
.doc :: application/msword
.xlsx :: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
.docx :: application/vnd.openxmlformats-officedocument.wordprocessingml.document
The Program
The program shown below uses the following Maven dependencies:
<dependencies>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>1.23</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers</artifactId>
<version>1.23</version>
</dependency>
<dependency>
<groupId>javax.ws.rs</groupId>
<artifactId>javax.ws.rs-api</artifactId>
<version>2.1.1</version>
</dependency>
</dependencies>
The program (just for this demo - not production ready) is shown below. Specifically, look at the tikaDetect() and tikaParse() methods.
import java.io.IOException;
import java.io.File;
import java.io.FileInputStream;
import java.io.BufferedInputStream;
import java.util.Set;
import java.util.HashSet;
import org.apache.tika.mime.MediaType;
import org.apache.tika.mime.MimeTypeException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.detect.Detector;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.SAXException;
import org.xml.sax.ContentHandler;
public class Main {
private final Set<File> msOfficeFiles = new HashSet<>();
public static void main(String[] args) throws IOException, MimeTypeException,
SAXException, TikaException {
Main main = new Main();
main.doFileDetection();
}
private void doFileDetection() throws IOException, MimeTypeException, SAXException, TikaException {
File file1 = new File("C:/tmp/foo.pdf");
File file2 = new File("C:/tmp/baz.xlsx");
File file3 = new File("C:/tmp/bat.docx");
// Excel 97-2003 format:
File file4 = new File("C:/tmp/bar.xls");
// Word 97-2003 format:
File file5 = new File("C:/tmp/daz.doc");
Set<File> files = new HashSet<>();
files.add(file1);
files.add(file2);
files.add(file3);
files.add(file4);
files.add(file5);
for (File file : files) {
try (BufferedInputStream bis = new BufferedInputStream(
new FileInputStream(file))) {
tikaDetect(file, bis);
} catch (IOException e) {
e.printStackTrace();
}
}
for (File file : msOfficeFiles) {
tikaParse(file);
}
}
private void tikaDetect(File file, BufferedInputStream bis)
throws IOException, SAXException, TikaException {
Detector detector = new DefaultDetector();
Metadata metadata = new Metadata();
MediaType mediaType = detector.detect(bis, metadata);
if (mediaType.toString().equals("application/x-tika-msoffice")) {
msOfficeFiles.add(file);
} else {
System.out.println("Media Type for " + file.getName()
+ " is: " + mediaType.toString());
}
}
private void tikaParse(File file) throws SAXException, TikaException {
Parser parser = new AutoDetectParser();
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
try (BufferedInputStream bis = new BufferedInputStream(
new FileInputStream(file))) {
// Parsing populates the metadata, including the specific Content-Type
parser.parse(bis, handler, metadata, context);
} catch (IOException e) {
e.printStackTrace();
}
System.out.println("Media Type for " + file.getName()
+ " is: " + metadata.get("Content-Type"));
}
}
Actual Results
The program generates some warnings and information messages. If we ignore these for this exercise, we get the following print statements:
Media Type for bat.docx is: application/vnd.openxmlformats-officedocument.wordprocessingml.document
Media Type for baz.xlsx is: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Media Type for foo.pdf is: application/pdf
Media Type for bar.xls is: application/vnd.ms-excel
Media Type for daz.doc is: application/msword
These match the expected official media (MIME) types.
Tika official usages:
https://tika.apache.org/1.26/detection.html
Tika supported formats:
https://tika.apache.org/1.26/formats.html
The two pages above answer this question.
Here are some key quotes:
Microsoft Office and some related applications produce documents in the generic OLE 2 Compound Document and Office Open XML (OOXML) formats. The older OLE 2 format was introduced in Microsoft Office version 97 and was the default format until Office version 2007 and the new XML-based OOXML format. The OfficeParser and OOXMLParser classes use Apache POI libraries to support text and metadata extraction from both OLE2 and OOXML documents.
That means you also need to include the Apache POI jars or Maven dependencies for MS Office files.
Tika provides a wrapping detector in the form of org.apache.tika.detect.DefaultDetector. This uses the service loader to discover all available detectors, including any available container aware ones, and tries them in turn. For container aware detection, include the Tika Parsers jar and its dependencies in your project, then use DefaultDetector along with a TikaInputStream.
That means you need to include the Tika Parsers jar or its Maven dependency.
Then use
new DefaultDetector().detect(TikaInputStream.get(file), new Metadata());
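To illustrate the idea behind container-aware detection, here is a JDK-only sketch. It is not Tika's implementation. It assumes the conventional OOXML part names (word/document.xml for Word, xl/workbook.xml for Excel), and the class, method, and helper names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class OoxmlSniffer {
    static final String DOCX =
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
    static final String XLSX =
            "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
    static final String ZIP = "application/zip";

    /** Guesses the media type of ZIP data from the names of its entries. */
    static String classify(byte[] zipBytes) {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.equals("word/document.xml")) {
                    return DOCX;
                }
                if (name.equals("xl/workbook.xml")) {
                    return XLSX;
                }
            }
            return ZIP; // a ZIP container without a recognised main part
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper: builds a tiny in-memory ZIP containing a single entry. */
    static byte[] zipWithEntry(String entryName) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write("<x/>".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(classify(zipWithEntry("word/document.xml")));
        System.out.println(classify(zipWithEntry("xl/workbook.xml")));
        System.out.println(classify(zipWithEntry("readme.txt")));
    }
}
```

This shows why a detector that only looks at the leading bytes can report no more than a generic container type: the information that separates .docx from .xlsx lives inside the archive.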
Currently, Tika processes ZIP files by looking inside them.
I'd like to disable this feature and just get the application/zip MIME type.
I'm using this code right now:
public String getMimeType(InputStream is) throws IOException {
TikaConfig tikaConfig = TikaConfig.getDefaultConfig();
Detector detector = tikaConfig.getDetector(); //new DefaultDetector();
Metadata metadata = new Metadata();
MediaType mediaType = detector.detect(TikaInputStream.get(is), metadata);
return mediaType.toString();
}
This code returns the MIME type of the zipped file.
Any ideas?
Based on your example, I wrote a dummy app and tested it with a large ZIP file that doesn't have the .zip extension. I don't see Tika parsing the whole file.
I checked with a debugger: Tika only reads 65536 bytes to determine the file type.
See Tika's MimeTypes.class, line 154:
public MediaType detect(InputStream input, Metadata metadata) throws IOException {
List<MimeType> possibleTypes = null;
// Get type based on magic prefix
if (input != null) {
input.mark(getMinLength());
try {
byte[] prefix = readMagicHeader(input);
possibleTypes = getMimeType(prefix);
} finally {
input.reset();
}
}
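The mark/reset pattern in the excerpt above can be tried out with the JDK alone. The sketch below is not Tika code; the class and method names are hypothetical. It reads a bounded header from a stream that supports mark/reset and then rewinds it, so later consumers still see the stream from its first byte.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

class HeaderPeek {
    /** Reads up to maxBytes from the start of the stream, then rewinds it. */
    static byte[] peek(InputStream in, int maxBytes) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            in.mark(maxBytes);
            try {
                byte[] buf = new byte[maxBytes];
                int total = 0;
                int n;
                // A single read may return fewer bytes than requested, so loop.
                while (total < maxBytes
                        && (n = in.read(buf, total, maxBytes - total)) != -1) {
                    total += n;
                }
                return Arrays.copyOf(buf, total);
            } finally {
                in.reset(); // rewind to the marked position
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {0x50, 0x4b, 0x03, 0x04, 1, 2, 3};
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        byte[] header = peek(in, 4);
        System.out.println(header.length); // 4
        System.out.println(in.read());     // 80 (0x50): the stream was rewound
    }
}
```

Only the requested prefix is read, which matches the observation that the size of the file does not matter for this kind of detection.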
Dummy app:
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
public class Main {
public static void main(String[] args) throws IOException {
System.out.println(new Main().getMimeType(new FileInputStream(new File("C:\\temp\\apache-tomcat-8.0.47-windows-x64"))));
}
public String getMimeType(InputStream is) throws IOException {
final TikaConfig tikaConfig = TikaConfig.getDefaultConfig();
Detector detector = tikaConfig.getDetector(); //new DefaultDetector();
Metadata metadata = new Metadata();
MediaType mediaType = detector.detect(TikaInputStream.get(is), metadata);
// toString() gives the full type, e.g. "application/zip"; getType() gives only "application"
return mediaType.toString();
}
}
Maven:
<dependencies>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>1.18</version>
</dependency>
</dependencies>
Can you show me an example that reads and parses the whole ZIP file? I can understand that this would be a problem when files exceed a certain size.
Unfortunately, I can't help you if I can't reproduce the problem.
I'm trying to create a new Excel file with just "hello" in it.
Here's my code:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.swing.JFileChooser;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
/**
*
* @author kamal
*/
public class JavaApplication4 {
private static String dir = "";
/**
* @param args the command line arguments
*/
public static void main(String[] args) {
try {
// TODO code application logic here
JFileChooser jc = new JFileChooser();
jc.setFileSelectionMode(JFileChooser.DIRECTORIES_ONLY);
int output = jc.showOpenDialog(null);
if(output == JFileChooser.APPROVE_OPTION){
File f = jc.getSelectedFile();
String directory = f.getAbsolutePath();
setDir(directory);
}
FileOutputStream out = new FileOutputStream(new File(getDir()+"\\Book2.xlsx"));
FileInputStream in = new FileInputStream(new File(getDir()+"\\Book2.xlsx"));
org.apache.poi.ss.usermodel.Workbook workbook = new XSSFWorkbook(in);
org.apache.poi.ss.usermodel.Sheet sheet = workbook.getSheetAt(0);
sheet.createRow(0).createCell(0).setCellValue("hello");
workbook.write(out);
workbook.close();
} catch (FileNotFoundException ex) {
ex.printStackTrace();
} catch (IOException ex) {
Logger.getLogger(JavaApplication4.class.getName()).log(Level.SEVERE, null, ex);
}
}
/**
* @return the dir
*/
public static String getDir() {
return dir;
}
/**
* @param directory the dir to set
*/
public static void setDir(String directory) {
dir = directory;
}
}
When I run it, I get the following error:
Exception in thread "main" org.apache.poi.openxml4j.exceptions.NotOfficeXmlFileException: No valid entries or contents found, this is not a valid OOXML (Office Open XML) file
at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:286)
at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:758)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:327)
at org.apache.poi.util.PackageHelper.open(PackageHelper.java:37)
at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:291)
at javaapplication4.JavaApplication4.main(JavaApplication4.java:46)
C:\Users\kamal\AppData\Local\NetBeans\Cache\8.2\executor-snippets\run.xml:53: Java returned: 1
BUILD FAILED (total time: 7 seconds)
I looked up this code on YouTube and it's the same, so I'm not sure why I'm getting the error. Can you help me with this?
I think the most likely explanations are that either the file is corrupt, or it is an older format spreadsheet file that XSSFWorkbook does not understand.
It is unlikely anyone can give you a definite diagnosis without looking at the file itself.
I encountered the same problem today. The server ran Linux, and the Excel file had been copied from Windows to Linux with WinSCP. WinSCP offers several transfer modes, such as binary and text. When the Excel file was copied in text mode, I got the same error you mention; copying it in binary mode resolved the error. To summarize: in my case the file was corrupted when it was copied from Windows to Linux. If you use WinSCP, make sure you transfer the file in binary mode so that it is copied correctly.
I was facing the same problem. I had created the Excel file by right-clicking inside the folder and choosing New -> Microsoft Excel Worksheet.
As a trial, I removed this file and created a new Excel file through Start Menu -> Microsoft Office -> Excel.
That worked for me; hopefully it will work for you too.
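One further possibility, not raised in the answers above, is that the question's own code empties the file before reading it: it opens a FileOutputStream on Book2.xlsx and then a FileInputStream on the same path. Opening a FileOutputStream without append mode truncates an existing file to zero bytes, so the reader would see an empty file. Whether this is what happened in the question is an assumption. The JDK-only sketch below (hypothetical class and method names, a temporary file standing in for Book2.xlsx) just demonstrates the truncation.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class TruncationDemo {
    /** Returns {size before, size after} opening a FileOutputStream on a 4-byte file. */
    static long[] sizesBeforeAndAfter() {
        try {
            Path tmp = Files.createTempFile("Book2", ".xlsx");
            try {
                Files.write(tmp, new byte[] {0x50, 0x4b, 0x03, 0x04});
                long before = Files.size(tmp);
                long after;
                try (FileOutputStream out = new FileOutputStream(tmp.toFile())) {
                    // Nothing is written: opening without append mode already truncates.
                    after = Files.size(tmp);
                }
                return new long[] {before, after};
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] sizes = sizesBeforeAndAfter();
        System.out.println(sizes[0]); // 4
        System.out.println(sizes[1]); // 0
    }
}
```

If the goal is a brand-new workbook, Apache POI also lets you start from an empty one with new XSSFWorkbook() and createSheet(...), and open the FileOutputStream only when writing the result.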
I have to move files from one directory to another.
I'm using a property file, so the source and destination paths are stored there.
I also have a property reader class.
My source directory contains lots of files. Each file should be moved to the other directory once its processing is complete.
The files are larger than 500 MB.
import java.io.File;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;
import static java.nio.file.StandardCopyOption.*;
public class Main1
{
public static String primarydir="";
public static String secondarydir="";
public static void main(String[] argv)
throws Exception
{
primarydir=PropertyReader.getProperty("primarydir");
System.out.println(primarydir);
secondarydir=PropertyReader.getProperty("secondarydir");
File dir = new File(primarydir);
secondarydir=PropertyReader.getProperty("secondarydir");
String[] children = dir.list();
if (children == null)
{
System.out.println("does not exist or is not a directory");
}
else
{
for (int i = 0; i < children.length; i++)
{
String filename = children[i];
System.out.println(filename);
try
{
File oldFile = new File(primarydir,children[i]);
System.out.println( "Before Moving"+oldFile.getName());
if (oldFile.renameTo(new File(secondarydir+oldFile.getName())))
{
System.out.println("The file was moved successfully to the new folder");
}
else
{
System.out.println("The File was not moved.");
}
}
catch (Exception e)
{
e.printStackTrace();
}
}
System.out.println("ok");
}
}
}
My code is not moving the files to the correct path.
This is my property file:
primarydir=C:/Desktop/A
secondarydir=D:/B
The files should end up in D:/B. How can I do this? Can anyone help me?
Change this:
oldFile.renameTo(new File(secondarydir+oldFile.getName()))
To this:
oldFile.renameTo(new File(secondarydir, oldFile.getName()))
It's best not to use string concatenation to join path segments, as the proper way to do it may be platform-dependent.
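Edit: If you can use JDK 1.7 APIs, you can use Files.move() instead of File.renameTo(). According to its Javadoc, renameTo() might not be able to move a file from one file system to another, which matters here because the source is on C: and the destination on D:.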
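A minimal sketch of the Files.move() approach follows. The class and method names are made up for illustration, and the demo uses temporary directories as stand-ins for primarydir and secondarydir. Path.resolve() joins the directory and file name without manual string concatenation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class MoveExample {
    /** Moves source into targetDir, keeping its file name, and returns the new path. */
    static Path moveInto(Path source, Path targetDir) {
        try {
            Files.createDirectories(targetDir);
            Path target = targetDir.resolve(source.getFileName());
            return Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo with temporary directories; returns true if the file ended up in the target. */
    static boolean demo() {
        try {
            Path primary = Files.createTempDirectory("primary");
            Path secondary = Files.createTempDirectory("secondary");
            Path file = Files.write(primary.resolve("data.bin"), new byte[] {1, 2, 3});
            Path moved = moveInto(file, secondary);
            boolean ok = Files.exists(moved) && Files.notExists(file);
            // Clean up the temporary files and directories.
            Files.delete(moved);
            Files.delete(primary);
            Files.delete(secondary);
            return ok;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // true
    }
}
```

In the question's loop, the call would look like moveInto(oldFile.toPath(), Paths.get(secondarydir)), and a failure surfaces as an exception with a reason instead of a bare false.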
Code - a java method:
/**
* copy by transfer, use this for cross partition copy,
* @param sFile source file,
* @param tFile target file,
* @throws IOException
*/
public static void copyByTransfer(File sFile, File tFile) throws IOException {
// try-with-resources closes the streams and their channels even if the copy fails
try (FileInputStream fInput = new FileInputStream(sFile);
FileOutputStream fOutput = new FileOutputStream(tFile);
FileChannel fReadChannel = fInput.getChannel();
FileChannel fWriteChannel = fOutput.getChannel()) {
long position = 0;
long size = fReadChannel.size();
// transferTo may copy fewer bytes than requested, so loop until everything is copied
while (position < size) {
position += fReadChannel.transferTo(position, size - position, fWriteChannel);
}
}
}
The method uses NIO, which can take advantage of underlying OS operations to improve performance.
Here are the imports:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
If you are in Eclipse, just press Ctrl+Shift+O.
I am trying to create a TFTP client using Apache Commons Net to put a file on a server (AIX) on which the TFTP service is running. No exception is raised when I run the code below and everything seems to be OK, but the file is not put on the server.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.net.SocketException;
import java.net.UnknownHostException;
import org.apache.commons.net.tftp.TFTP;
import org.apache.commons.net.tftp.TFTPClient;
public class Test {
/**
* @param args
* @throws IOException
* @throws SocketException
*/
public static void main(String[] args) throws SocketException, IOException {
int timeout=5000;
String host="192.168.1.20";
int port=22;
TFTPClient tftpClient=new TFTPClient();
tftpClient.setDefaultTimeout(60000);
tftpClient.open(69);
tftpClient.setSoTimeout(timeout);
System.out.println("DONE");
FileInputStream input = null;
File file;
file = new File("D:\\project.ear");
input = new FileInputStream(file);
try{
tftpClient.sendFile("/home/dev/project.ear", TFTP.BINARY_MODE, input, host);
}
catch (UnknownHostException e)
{
System.err.println("Error: could not resolve hostname.");
System.err.println(e.getMessage());
System.exit(1);
}
System.out.println("DONE2");
tftpClient.close();
}
}
The output of the above code was:
DONE
DONE2
which suggests that everything is OK, but I didn't find the file in the directory specified in the code.
Please advise.
If you still need help, I think you should try calling the tftpClient.sendFile method this way:
tftpClient.sendFile("/home/dev/project.ear", TFTP.BINARY_MODE, input, InetAddress.getByName(host));
InetAddress.getByName(host) should determine your host's IP address from either the string representation of an IP address or a host name, as its Javadoc says. I hope it works this way.
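The behaviour of InetAddress.getByName() with an IP literal can be checked in isolation. This is a small JDK-only sketch with a hypothetical class name; it does not touch the TFTP client. For a literal address, only the format is validated and no name lookup takes place.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

class HostResolution {
    /** Resolves an IP literal or host name to its textual IP address. */
    static String resolve(String host) {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("cannot resolve " + host, e);
        }
    }

    public static void main(String[] args) {
        // The address from the question; an IP literal needs no DNS lookup.
        System.out.println(resolve("192.168.1.20")); // 192.168.1.20
    }
}
```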