Let Tika suggest a file-extension [duplicate] - java

I am uploading files to an Amazon s3 bucket and have access to the InputStream and a String containing the MIME Type of the file but not the original file name. It's up to me to actually create the file name and extension before pushing the file up to S3. Is there a library or convenient way to determine the appropriate extension to use from the MIME Type?
I've seen some references to the Apache Tika library, but that seems like overkill and I haven't been able to get it to successfully detect file extensions yet. From what I've been able to gather, this code should work, but I'm just getting an empty string when my type variable is "image/jpeg":
MimeType mimeType = null;
try {
mimeType = new MimeTypes().forName(type);
} catch (MimeTypeException e) {
Logger.error("Couldn't Detect Mime Type for type: " + type, e);
}
if (mimeType != null) {
String extension = mimeType.getExtension();
//do something with the extension
}

As some of the commenters have pointed out, there is no universal 1:1 mapping between mimetypes and file extensions... Some mimetypes have more than one possible extension, many extensions are shared by multiple mimetypes, and some mimetypes have no extension.
Wherever possible, you're much better off storing the mimetype and using that going forward, and forgetting about the extension.
That said, if you do want to get the most common file extension for a given mimetype, then Tika is a good way to go. Apache Tika has a very large set of mimetypes it knows about, and for many of these it also knows mime magic for detection, common extensions, descriptions etc.
If you want to get the most common extension for a JPEG file, then as shown in this Apache Tika unit test you just need to do something like:
MimeTypes allTypes = MimeTypes.getDefaultMimeTypes();
MimeType jpeg = allTypes.forName("image/jpeg");
String jpegExt = jpeg.getExtension(); // .jpg
assertEquals(".jpg", jpeg.getExtension());
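The key thing is that you need to load the XML file that's bundled in the Tika jar to get the definitions of all the mimetypes. A bare new MimeTypes() starts out with no definitions, which is why getExtension() returned an empty string in the question. If you might be dealing with custom mimetypes too, Tika supports those as well; change the first line to: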
TikaConfig config = TikaConfig.getDefaultConfig();
MimeTypes allTypes = config.getMimeRepository();
By using the TikaConfig method to get the MimeTypes, Tika will also check your classpath for custom mimetype definitions, and include those too.
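If you only accept a handful of types and pulling in Tika really is overkill, a small hand-maintained table is one alternative. The following is a minimal sketch, not Tika's behaviour: the class name MimeExtensions, the method extensionFor, and the entries in the table are all hypothetical and only illustrate the "pick one preferred extension per type" idea described above.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: a tiny hand-maintained MIME-type-to-extension table. */
final class MimeExtensions {
    // Illustrative entries only; extend this for the types your application accepts.
    private static final Map<String, String> PREFERRED = Map.of(
            "image/jpeg", ".jpg",
            "image/png", ".png",
            "image/gif", ".gif",
            "application/pdf", ".pdf",
            "text/plain", ".txt");

    private MimeExtensions() {}

    /** Returns the preferred extension (with leading dot), or empty if the type is unknown. */
    static Optional<String> extensionFor(String mimeType) {
        if (mimeType == null) {
            return Optional.empty();
        }
        // Drop parameters such as "; charset=UTF-8" and normalise case before the lookup.
        int semicolon = mimeType.indexOf(';');
        String base = (semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType)
                .trim()
                .toLowerCase(Locale.ROOT);
        return Optional.ofNullable(PREFERRED.get(base));
    }
}
```

Returning an Optional makes the "some mimetypes have no extension" case explicit, so the caller has to decide what to do for unknown types (for example, store the object without an extension).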

Related

java multipart file how to find mime type if the file extension modified?

In a Java file upload, getContentType() always returns a MIME type based on the extension of the file name that was passed in.
How can I check whether someone is passing a JavaScript file whose extension has been changed to .pdf?
1. MimetypesFileTypeMap mimeTypesMap = new MimetypesFileTypeMap();
doc.setMimeType(mimeTypesMap.getContentType(file.getOriginalFilename()));
2. file.getContentType()
// works based on extension
FilenameUtils can be used to obtain the file extension.
// It will return an extension such as png, jpg, jpeg, etc.
String fileExtension = FilenameUtils.getExtension(multipartFile.getOriginalFilename());
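Since both approaches above trust the file name, one way to catch a renamed file is to look at the first bytes of the content instead. The sketch below checks only for the strict "%PDF-" signature at the very start of the stream; the class name PdfSniffer and the method looksLikePdf are hypothetical, and a signature match does not prove the rest of the file is a valid PDF.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper: compares the start of a stream with the PDF signature. */
final class PdfSniffer {
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfSniffer() {}

    /** Reads (and consumes) the first five bytes of the stream and compares them to "%PDF-". */
    static boolean looksLikePdf(InputStream in) throws IOException {
        byte[] head = in.readNBytes(PDF_MAGIC.length); // InputStream.readNBytes(int) needs Java 11+
        return Arrays.equals(head, PDF_MAGIC);
    }
}
```

Because the check consumes bytes, open a fresh stream on the upload before processing the file any further. A JavaScript file renamed to .pdf fails this test, since its content does not start with the signature.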

is it possible to set custom metadata on files, using Java?

Is it possible to get and set custom metadata on File instances? I want to use the files that I process through my system as some kind of a very simple database, where every file should contain additional custom metadata, such as the email of the sender, some timestamps, etc.
It is for an internal system, so security is not an issue.
In Java 7 and later you can do this using the Path class and UserDefinedFileAttributeView.
Here is the example taken from there:
A file's MIME type can be stored as a user-defined attribute by using this code snippet:
Path file = ...;
UserDefinedFileAttributeView view = Files
.getFileAttributeView(file, UserDefinedFileAttributeView.class);
view.write("user.mimetype",
Charset.defaultCharset().encode("text/html"));
To read the MIME type attribute, you would use this code snippet:
Path file = ...;
UserDefinedFileAttributeView view = Files
.getFileAttributeView(file,UserDefinedFileAttributeView.class);
String name = "user.mimetype";
ByteBuffer buf = ByteBuffer.allocate(view.size(name));
view.read(name, buf);
buf.flip();
String value = Charset.defaultCharset().decode(buf).toString();
You should always check whether the filesystem supports UserDefinedFileAttributeView for the specific file you want to set attributes on.
You can simply invoke this:
Files.getFileStore(Paths.get(path_to_file)).supportsFileAttributeView(UserDefinedFileAttributeView.class);
In my experience, UserDefinedFileAttributeView is not supported on FAT* and HFS+ (Mac) filesystems.
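Putting the support check and the read/write snippets together might look like the sketch below. The class name FileMetadata and its method names are hypothetical, and it uses UTF-8 rather than the default charset (an assumption on my part) so that a value written on one machine decodes the same way on another.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.UserDefinedFileAttributeView;
import java.util.Optional;

/** Hypothetical helper around UserDefinedFileAttributeView. */
final class FileMetadata {
    private FileMetadata() {}

    /** True if the file store holding this file supports user-defined attributes. */
    static boolean isSupported(Path file) throws IOException {
        return Files.getFileStore(file)
                .supportsFileAttributeView(UserDefinedFileAttributeView.class);
    }

    /** Stores value under the given attribute name, replacing any previous value. */
    static void write(Path file, String name, String value) throws IOException {
        view(file).write(name, StandardCharsets.UTF_8.encode(value));
    }

    /** Reads the attribute back, or returns empty if it has never been set. */
    static Optional<String> read(Path file, String name) throws IOException {
        UserDefinedFileAttributeView view = view(file);
        if (!view.list().contains(name)) {
            return Optional.empty();
        }
        ByteBuffer buf = ByteBuffer.allocate(view.size(name));
        view.read(name, buf);
        buf.flip();
        return Optional.of(StandardCharsets.UTF_8.decode(buf).toString());
    }

    private static UserDefinedFileAttributeView view(Path file) {
        UserDefinedFileAttributeView view =
                Files.getFileAttributeView(file, UserDefinedFileAttributeView.class);
        if (view == null) {
            // getFileAttributeView returns null when the view type is not available.
            throw new UnsupportedOperationException("No user-defined attributes for " + file);
        }
        return view;
    }
}
```

A caller would check FileMetadata.isSupported(path) once, then use write(path, "sender.email", ...) and read(path, "sender.email"); the attribute name here is just an example taken from the question. Keep in mind that such attributes can be lost when a file is copied to a filesystem without support for them.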

Test if a file is an image file

I am using some file IO and want to know if there is a method to check if a file is an image?
This works pretty well for me. Hope I could help
import javax.activation.MimetypesFileTypeMap;
import java.io.File;
class Untitled {
public static void main(String[] args) {
String filepath = "/the/file/path/image.jpg";
File f = new File(filepath);
String mimetype= new MimetypesFileTypeMap().getContentType(f);
String type = mimetype.split("/")[0];
if(type.equals("image"))
System.out.println("It's an image");
else
System.out.println("It's NOT an image");
}
}
// ImageIO.read returns null when no registered reader can decode the stream
if (ImageIO.read(inputStream) == null) {
// not an image
}
And also there is an answer: How to check a uploaded file whether it is a image or other file?
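If you only need a yes/no answer, the same ImageIO machinery can be asked whether any registered reader recognises the stream, without decoding the pixels. This is a minimal sketch under that assumption; the class name ImageSniffer and the method hasImageReader are hypothetical, and it only recognises formats for which an ImageIO plugin is installed.

```java
import java.io.IOException;
import java.io.InputStream;
import javax.imageio.ImageIO;
import javax.imageio.stream.ImageInputStream;

/** Hypothetical helper: asks ImageIO whether any reader claims the stream. */
final class ImageSniffer {
    private ImageSniffer() {}

    /** True if at least one registered ImageReader recognises the start of the stream. */
    static boolean hasImageReader(InputStream in) throws IOException {
        // Closing the ImageInputStream does not close the underlying InputStream.
        try (ImageInputStream iis = ImageIO.createImageInputStream(in)) {
            // createImageInputStream returns null if no stream provider accepts the input.
            return iis != null && ImageIO.getImageReaders(iis).hasNext();
        }
    }
}
```

The readers decide based on the file header, so a truncated or corrupt image can still pass; call ImageIO.read if you need proof that the image actually decodes.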
In Java 7, there is the java.nio.file.Files.probeContentType() method. On Windows, this uses the file extension and the registry (it does not probe the file content). You can then check the first part of the MIME type, i.e. whether it has the form image/<X>.
You may try something like this:
String pathname = "abc\\xyz.png";
File file=new File(pathname);
String mimetype = Files.probeContentType(file.toPath());
//mimetype should be something like "image/png"
if (mimetype != null && mimetype.split("/")[0].equals("image")) {
System.out.println("it is an image");
}
You may try something like this:
import javax.activation.MimetypesFileTypeMap;
File myFile = ...; // the file to test
String mimeType = new MimetypesFileTypeMap().getContentType(myFile);
// mimeType should now be something like "image/png"
if(mimeType.substring(0,5).equalsIgnoreCase("image")){
// it's an image
}
This should work, although it doesn't seem to be the most elegant version.
There are a variety of ways to do this; see other answers and the links to related questions. (The Java 7 approach seems the most attractive to me, because it uses platform specific conventions by default, and you can supply your own scheme for file type determination.)
However, I'd just like to point out that no mechanism is entirely infallible:
Methods that rely on the file suffix will be tricked if the suffix is non-standard or wrong.
Methods that rely on file attributes (e.g. in the file system) will be tricked if the file has an incorrect content type attribute or none at all.
Methods that rely on looking at the file signature can be tricked by binary files which just happen to have the same signature bytes.
Even simply attempting to read the file as an image can be tricked if you are unlucky ... depending on the image format(s) that you try.
Other answers suggest loading the full image into memory (ImageIO.read) or using standard JDK methods (MimetypesFileTypeMap and Files.probeContentType).
The first way is inefficient if you don't need the decoded image and only want to test whether the file is an image (and perhaps save its content type so it can be set in the Content-Type response header when the image is served later).
The built-in JDK approaches usually just test the file extension and don't give you a result you can trust.
The way that works for me is to use the Apache Tika library.
private final Tika tika = new Tika();
private MimeType detectImageContentType(InputStream inputStream, String fileExtension) {
Assert.notNull(inputStream, "InputStream must not be null");
String fileName = fileExtension != null ? "image." + fileExtension : "image";
MimeType detectedContentType = MimeType.valueOf(tika.detect(inputStream, fileName));
log.trace("Detected image content type: {}", detectedContentType);
if (!validMimeTypes.contains(detectedContentType)) {
throw new InvalidImageContentTypeException(detectedContentType);
}
return detectedContentType;
}
The type detection is based on the content of the given document stream and the name of the document. Only a limited number of bytes are read from the stream.
I pass fileExtension just as a hint for Tika. It works without it, but according to the documentation it helps detection in some cases.
The main advantage of this method over ImageIO.read is that Tika doesn't read the full file into memory, only the first bytes.
The main advantage over the JDK's MimetypesFileTypeMap and Files.probeContentType is that Tika actually reads the first bytes of the file, while the JDK only checks the file extension in its current implementation.
TLDR
If you plan to do something with the image once it is read (like resize/crop/rotate it), then use ImageIO.read from Krystian's answer.
If you just want to check (and maybe store) the real Content-Type, then use Tika (this answer).
If you work in a trusted environment and you are 100% sure that the file extension is correct, then use Files.probeContentType from prunge's answer.
Here's my code based on the answer using tika.
private static final Tika TIKA = new Tika();
public boolean isImageMimeType(File src) {
try (FileInputStream fis = new FileInputStream(src)) {
String mime = TIKA.detect(fis, src.getName());
return mime.contains("/")
&& mime.split("/")[0].equalsIgnoreCase("image");
} catch (IOException e) {
throw new RuntimeException(e);
}
}

Files, URIs, and URLs conflicting in Java

I am getting some strange behavior when trying to convert between Files and URLs, particularly when a file/path has spaces in its name. Is there any safe way to convert between the two?
My program has a file saving functionality where the actual "Save" operation is delegated to an outside library that requires a URL as a parameter. However, I also want the user to be able to pick which file to save to. The issue is that when converting between File and URL (using URI), spaces show up as "%20" and mess up various operations. Consider the following code:
//...user has selected file
File userFile = myFileChooser.getSelectedFile();
URL userURL = userFile.toURI().toURL();
System.out.println(userFile.getPath());
System.out.println(userURL);
File myFile = new File(userURL.getFile());
System.out.println(myFile.equals(userFile));
This will return false (due to the "%20" symbols), and is causing significant issues in my program because Files and URLs are handed off and often operations have to be performed with them (like getting parent/subdirectories). Is there a way to make File/URL handling safe for paths with whitespace?
P.S. Everything works fine if my paths have no spaces in them (and the paths look equal), but that is a user restriction I cannot impose.
The problem is that you use URL to construct the second file:
File myFile = new File(userURL.getFile());
If you stick to the URI, you are better off:
URI userURI = userFile.toURI();
URL userURL = userURI.toURL();
...
File myFile = new File(userURI);
or
File myFile = new File( userURL.toURI() );
Both ways worked for me, when testing file names with blanks.
Use this instead:
System.out.println(myFile.toURI().toURL().equals(userURL));
That should return true.
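The difference between the two conversions can be seen in a small round-trip sketch. The class name FileUrls and its methods are hypothetical; the point is only that going back through a URI decodes "%20" again, whereas URL.getFile() hands back the still-encoded path.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URISyntaxException;
import java.net.URL;

/** Hypothetical helper: converts between File and URL via URI in both directions. */
final class FileUrls {
    private FileUrls() {}

    /** File to URL; spaces in the path become %20 in the resulting URL. */
    static URL toUrl(File file) throws MalformedURLException {
        return file.toURI().toURL();
    }

    /** URL back to File; new File(URI) decodes %20 back into spaces. */
    static File toFile(URL url) throws URISyntaxException {
        // new File(URI) throws IllegalArgumentException for anything that is not
        // an absolute, hierarchical "file:" URI.
        return new File(url.toURI());
    }
}
```

With this, FileUrls.toFile(FileUrls.toUrl(f)) equals f.getAbsoluteFile() for a path such as "my dir/file name.txt", while new File(url.getFile()) does not, because its path still contains the literal characters "%20".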

How to get a desired substring from a String in java or jsf?

I am developing an application using JSF in Eclipse IDE with Derby as database. I have a feature to upload files to the database. But the file name is getting stored as "C:\Documents and Settings\Angeline\Desktop\test.txt" instead of "test.txt". How do I get to store only "test.txt" as file name in the database?
This is my code in JSF:
File to Upload:
<t:inputFileUpload id="fileupload" value="#{employeeBean.upFile}" storage="file"/>
Java Bean Code:
String fileName=upFile.getName();
The value of this fileName=C:\Documents and Settings\Angeline\Desktop\test.txt.
int lastSlashIndex = name.lastIndexOf("\\");
if (lastSlashIndex == -1) {
lastSlashIndex = name.lastIndexOf("/"); //unix client
}
String shortName = name;
if (lastSlashIndex != -1) {
shortName = name.substring(lastSlashIndex + 1); // skip the separator itself
}
Note that this won't work if a filename on *nix contains a \.
new java.io.File(myPath).getName();
You could probably do something more efficient with only String operations but depending on the application load and other operations, it might not be worth it.
Tomahawk t:inputFileUpload is built on top of Apache Commons FileUpload and Apache Commons IO. In the FileUpload FAQ you can find an entry titled "Why does FileItem.getName() return the whole path, and not just the file name? " which contains the following answer:
Internet Explorer provides the entire path to the uploaded file and not just the base file name. Since FileUpload provides exactly what was supplied by the client (browser), you may want to remove this path information in your application. You can do that using the following method from Commons IO (which you already have, since it is used by FileUpload).
String fileName = item.getName();
if (fileName != null) {
fileName = FilenameUtils.getName(fileName);
}
In short, just use FilenameUtils#getName() to get rid of the complete path which has been unnecessarily prepended by MSIE (all the other real/normal web browsers don't add the complete client-side path, but just provide the sole filename as per the HTML forms specs).
So, all you basically need to do is replace
String fileName = upFile.getName();
by
String fileName = FilenameUtils.getName(upFile.getName());
I think it would be safer to see this not as a string manipulation problem, but as a path name parsing problem:
String filename = new File(pathname).getName();
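One caveat with java.io.File here: getName() splits on the separator of the platform the code runs on, so on a Unix server a Windows client path such as "C:\Documents and Settings\Angeline\Desktop\test.txt" comes back unchanged. If Commons IO is not available, a separator-agnostic version is a few lines of plain Java. This is a minimal sketch; the class name UploadNames and the method baseName are hypothetical.

```java
/** Hypothetical helper: strips any client-side directory part from an uploaded file name. */
final class UploadNames {
    private UploadNames() {}

    /** Returns the text after the last '/' or '\', or the whole string if neither occurs. */
    static String baseName(String clientPath) {
        int cut = Math.max(clientPath.lastIndexOf('/'), clientPath.lastIndexOf('\\'));
        // cut is -1 when there is no separator, so substring(0) keeps the full name.
        return clientPath.substring(cut + 1);
    }
}
```

As with the lastIndexOf answer above, this misreads a Unix file name that legitimately contains a backslash; for upload handling that trade-off is usually acceptable.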
