Tika could not delete temporary files - java

In our application we are processing files using Apache Tika. But there are some files (e.g. *.mov, *.mp4) which Tika cannot process and leaves the corresponding *.tmp file in the user's Temp folder. After some research I found that it is a known bug: https://issues.apache.org/jira/browse/TIKA-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
In the last comment a user talks about a workaround but it does not work for me:
final Tika tika = new Tika();
final TikaInputStream fileStream = TikaInputStream.get(/*some InputStream*/);
try {
    final String extractedString = tika.parseToString(fileStream);
    //do something with the string
} finally {
    CloseUtils.close(fileStream);
}
Using the code above still leaves temp files in the Temp folder. What could be a solution to this?

The get() method should be called with a File object instead of an InputStream:
final File file = new File("c:/some_file.mov");
final TikaInputStream fileStream = TikaInputStream.get(file);
Tika still cannot process the file, but it does manage to delete the corresponding .tmp file.

Another workaround is to disable org.apache.tika.parser.mp4.MP4Parser. There are two ways to do this: with configuration, or with code.

Related

Use Apache Commons VFS RAM file to avoid using file system with API requiring a file

There is a highly upvoted comment on this post:
how to create new java.io.File in memory?
where Sorin Postelnicu mentions using an Apache Commons VFS RAM file as a way to have an in memory file to pass to an API that requires a java.io.File (I am paraphrasing... I hope I haven't missed the point).
Based on reading related posts I have come up with this sample code:
@Test
public void working() throws IOException {
    DefaultFileSystemManager manager = new DefaultFileSystemManager();
    manager.addProvider("ram", new RamFileProvider());
    manager.init();
    final String rootPath = "ram://virtual";
    manager.createVirtualFileSystem(rootPath);
    String hello = "Hello, World!";
    FileObject testFile = manager.resolveFile(rootPath + "/test.txt");
    testFile.createFile();
    OutputStream os = testFile.getContent().getOutputStream();
    os.write(hello.getBytes());
    os.close(); // close the stream before closing the file
    //FileContent test = testFile.getContent();
    testFile.close();
    manager.close();
}
So, I think that I have an in memory file called ram://virtual/test.txt with contents "Hello, World!"
My question is: how could I use this file with an API that requires a java.io.File?
Java's File API always works with the native file system, so there is no way to convert a VFS FileObject to a File unless the file is actually present on the native file system.
There is a way if your API can also work with an InputStream, though. Many libraries have overloaded methods that accept an InputStream. In that case, the following should work:
InputStream is = testFile.getContent().getInputStream();
SampleApi api = new SampleApi(is);
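If the API really only accepts a java.io.File, one fallback is to copy the stream into a temporary file on the native file system, which gives up the "in memory" property. Below is a minimal sketch using only the JDK; the class name StreamToTempFile and the method toTempFile are made up for illustration and are not part of Commons VFS.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of any library.
class StreamToTempFile {

    // Copies everything from the stream into a new temp file and returns it.
    static File toTempFile(InputStream in, String suffix) throws IOException {
        // createTempFile already creates an empty file, hence REPLACE_EXISTING below.
        Path tmp = Files.createTempFile("vfs-copy-", suffix);
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        // Ask the JVM to remove the file on normal exit; delete it earlier if you can.
        tmp.toFile().deleteOnExit();
        return tmp.toFile();
    }
}
```

With the VFS example above, this would be called as toTempFile(testFile.getContent().getInputStream(), ".txt") before the file object and manager are closed.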

How can I give a file a name

I'm using Java Docker API and I'm trying to send my text file to the docker container but the file doesn't appear there. I imagine this happens because the file has no title? How can I give the input stream a title?
final String configDir = "C:/teste/configuration.txt";
File file = new File(configDir);
InputStream input = new FileInputStream(file);
TarArchiveInputStream tarArchiveInputStream = new TarArchiveInputStream(input,"UTF-8");
dockerClient.copyArchiveToContainerCmd("1025c61de603")
    .withRemotePath("/tmp/")
    .withTarInputStream(tarArchiveInputStream)
    .exec();
EDIT: I don't get any error in my catch block. Everything seems to work, but the file is not created. If you know an easy way to send a file to a Docker container in Java, please tell me.
Use Files.copy(source, target, REPLACE_EXISTING) to copy the file under a new name, or Files.move(source, target, REPLACE_EXISTING) to rename it.

How to access file inside JAR file as a string

I packaged some classes and libraries into a single JAR file. But the current code cannot access the files inside the JAR file as it is.
String scenarioFile = "netlogo/Altruism.nlogo";
// InputStream is = this.getClass().getResourceAsStream(scenarioFile);
simulator = HeadlessWorkspace.newInstance();
simulator.open(scenarioFile);
The open method expects a String path, but I have read that files inside a JAR have to be read through an InputStream, so this does not work. Is there a workaround?
With the help of Tunaki I found a way to do it, and it worked!
I downloaded the Commons IO JAR and imported it:
import org.apache.commons.io.*;
Then I read the file through an InputStream, converted it to a String, and passed that to the openFromSource method of HeadlessWorkspace, as Tunaki suggested.
InputStream is = this.getClass().getResourceAsStream(NetlogoFile);
String scenarioFile = IOUtils.toString(is, "UTF-8");
simulator = HeadlessWorkspace.newInstance();
simulator.openFromSource(scenarioFile);
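If you would rather not add Commons IO just for this, the stream can also be read with the JDK alone. This is a minimal sketch; ResourceText and readUtf8 are names invented for this example. Note that Class.getResourceAsStream resolves a name without a leading "/" relative to the class's package and returns null when the resource is not found, so the sketch checks for that.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of NetLogo or Commons IO.
class ResourceText {

    // Reads the whole stream as UTF-8 text and closes it afterwards.
    static String readUtf8(InputStream in) throws IOException {
        if (in == null) {
            // getResourceAsStream returns null for a missing resource
            throw new IOException("resource not found");
        }
        try (InputStream src = in) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = src.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

The result of readUtf8(getClass().getResourceAsStream(...)) could then be passed to openFromSource in the same way as the IOUtils.toString result above.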

Replace specific file inside Zip archive without extracting the whole archive in Java

I'm trying to get a specific file inside a ZIP archive, extract it, encrypt it, and then put it back into the archive, replacing the original one.
Here's what I've tried so far:
public static boolean encryptXML(File ZipArchive, String key) throws ZipException, IOException, Exception {
    ZipFile zipFile = new ZipFile(ZipArchive);
    List<FileHeader> fileHeaderList = zipFile.getFileHeaders();
    for (FileHeader fh : fileHeaderList)
    {
        if (fh.getFileName().equals("META-INF/file.xml"))
        {
            Path tempdir = Files.createTempDirectory("Temp");
            zipFile.extractFile(fh, tempdir.toString());
            File XMLFile = new File(tempdir.toFile(), fh.getFileName());
            // Encrypting XMLFile, Ignore this part
            // Here, Replace the original XMLFile inside ZipArchive with the encrypted one <<<<<<<<
            return true;
        }
    }
    return false;
}
I'm stuck at the replacing part of the code. Is there any way I can do this without having to extract the whole ZIP archive?
Any help is appreciated, thanks in advance.
Not sure if this will help you as you are using a different library but the solution in ZT Zip would be the following.
ZipUtil.unpackEntry(new File("/tmp/demo.zip"), "foo.txt", new File("foo.txt"));
// encrypt the foo.txt
ZipUtil.replaceEntry(new File("/tmp/demo.zip"), "foo.txt", new File("foo.txt"));
This will unpack the foo.txt file and then after you encrypt it you can replace the previous entry with the new one.
You may use the ZipFilesystem (as of Java 7) as explained in the Oracle documentation to read/write within a zip file as if it were its own file system.
However, on my machine, this unpacks and re-packs the zip file under the hood anyway (tested with 7 and 8). I am not sure if there is a way to reliably change zip files like you describe.
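For reference, the zip file system approach might look roughly like the sketch below. It uses only the JDK (Java 7 or later); ZipEntryReplacer, replaceEntry and readEntry are names made up for this example. As noted above, the provider may still rewrite the archive under the hood when the file system is closed, so this is about convenience rather than a guaranteed in-place update.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;

// Hypothetical helper built on the JDK zip file system provider.
class ZipEntryReplacer {

    // Opens the archive as a file system; the "jar:" scheme selects the zip provider.
    private static FileSystem open(Path zip) throws IOException {
        URI uri = URI.create("jar:" + zip.toUri());
        return FileSystems.newFileSystem(uri, new HashMap<String, String>());
    }

    // Overwrites (or creates) one entry; changes are persisted when the file system closes.
    static void replaceEntry(Path zip, String entryName, byte[] newContent) throws IOException {
        try (FileSystem fs = open(zip)) {
            Files.write(fs.getPath(entryName), newContent);
        }
    }

    // Reads one entry without unpacking the other entries to disk.
    static byte[] readEntry(Path zip, String entryName) throws IOException {
        try (FileSystem fs = open(zip)) {
            return Files.readAllBytes(fs.getPath(entryName));
        }
    }
}
```

For the question's case, the flow would be readEntry(zip, "META-INF/file.xml"), encrypt the bytes in memory, then replaceEntry with the encrypted bytes.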
Bingo!
I was able to do it this way:
ZipParameters parameters = new ZipParameters();
parameters.setIncludeRootFolder(true);
zipFile.removeFile(fh);
zipFile.addFolder(new File(tempdir.toFile(), "META-INF"), parameters);

Reading files from an embedded ZIP archive

I have a ZIP archive that's embedded inside a larger file. I know the archive's starting offset within the larger file and its length.
Are there any Java libraries that would enable me to directly read the files contained within the archive? I am thinking along the lines of ZipFile.getInputStream(). Unfortunately, ZipFile doesn't work for this use case since its constructors require a standalone ZIP file.
For performance reasons, I cannot copy the ZIP archive into a separate file before opening it.
edit: Just to be clear, I do have random access to the file.
I've come up with a quick hack (which needs to get sanitized here and there), but it reads the contents of files from a ZIP archive which is embedded inside a TAR. It uses Java6, FileInputStream, ZipEntry and ZipInputStream. 'Works on my local machine':
final FileInputStream ins = new FileInputStream("archive.tar");
// Zip starts at 0x1f6400, size is not needed
long toSkip = 0x1f6400;
// Safe skipping: skip() may skip fewer bytes than requested
while(toSkip > 0)
    toSkip -= ins.skip(toSkip);
final ZipInputStream zipin = new ZipInputStream(ins);
ZipEntry ze;
while((ze = zipin.getNextEntry()) != null)
{
    // Note: getSize() returns -1 when the entry's size is not known up front
    final byte[] content = new byte[(int)ze.getSize()];
    int offset = 0;
    while(offset < content.length)
    {
        final int read = zipin.read(content, offset, content.length - offset);
        if(read == -1)
            break;
        offset += read;
    }
    // DEBUG: print out ZIP entry name and filesize
    System.out.println(ze + ": " + offset);
}
zipin.close();
1. Create a FileInputStream: FileInputStream fis = new FileInputStream(..);
2. Position it at the start of the embedded ZIP file: fis.skip(offset);
3. Open a ZipInputStream on it: new ZipInputStream(fis)
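The steps above can be sketched as a small self-contained helper. This is only an illustration under a few assumptions: EmbeddedZipReader and readEntries are invented names, the stream is positioned through its channel instead of skip() (possible here because the file allows random access), and each entry is read into a growing buffer so that an unknown entry size (getSize() returning -1) is not a problem. ZipInputStream reads entries sequentially and stops when it no longer finds a local file header, so the archive length is not needed.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of any library.
class EmbeddedZipReader {

    // Returns entry name -> content for a ZIP archive starting at zipOffset inside container.
    static Map<String, byte[]> readEntries(File container, long zipOffset) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (FileInputStream in = new FileInputStream(container)) {
            // Moving the channel position also moves the stream's read position.
            in.getChannel().position(zipOffset);
            ZipInputStream zip = new ZipInputStream(new BufferedInputStream(in));
            byte[] buf = new byte[8192];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                // read() returns -1 at the end of the current entry
                while ((n = zip.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                entries.put(entry.getName(), out.toByteArray());
            }
        }
        return entries;
    }
}
```

For large entries you would process the data inside the loop instead of collecting everything into a map; the map just keeps the example short.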
I suggest using TrueZIP, it provides file system access to many kinds of archives. It worked well for me in the past.
If you're using Java SE 7, it provides a zip file system that lets you read and write files in the ZIP directly: http://docs.oracle.com/javase/7/docs/technotes/guides/io/fsp/zipfilesystemprovider.html
I think Apache Commons Compress may help you.
It has a class, org.apache.commons.compress.archivers.zip.ZipArchiveEntry, which extends java.util.zip.ZipEntry.
Its getDataOffset() method returns the offset of the entry's data stream within the archive file.
7-zip-JavaBinding is a Java wrapper for the 7-zip C++ library.
The code snippets page in particular has some nice examples including printing a list of items in an archive, extracting a single file and opening multi-part archives.
Check whether zip4j helps you or not.
You can try PartInputStream to read zip file as per your use case.
I think it is better to create a temporary ZIP file and then access it.
