Copied DocumentFile has different size and hash to original - java

I'm attempting to copy / duplicate a DocumentFile in an Android application, but upon inspecting the created duplicate, it does not appear to be exactly the same as the original (which is causing a problem, because I need to do an MD5 check on both files the next time a copy is called, so as to avoid overwriting the same files).
The process is as follows:
User selects a file from an ACTION_OPEN_DOCUMENT_TREE
Source file's type is obtained
New DocumentFile in target location is initialised
Contents of first file is duplicated into second file
The initial stages are done with the following code:
// Get the source file's type
String sourceFileType = MimeTypeMap.getSingleton().getExtensionFromMimeType(contextRef.getContentResolver().getType(file.getUri()));
// Create the new (empty) file
DocumentFile newFile = targetLocation.createFile(sourceFileType, file.getName());
// Copy the file
CopyBufferedFile(new BufferedInputStream(contextRef.getContentResolver().openInputStream(file.getUri())), new BufferedOutputStream(contextRef.getContentResolver().openOutputStream(newFile.getUri())));
The main copy process is done using the following snippet:
void CopyBufferedFile(BufferedInputStream bufferedInputStream, BufferedOutputStream bufferedOutputStream)
{
// Duplicate the contents of the temporary local File to the DocumentFile
try
{
byte[] buf = new byte[1024];
bufferedInputStream.read(buf);
do
{
bufferedOutputStream.write(buf);
}
while(bufferedInputStream.read(buf) != -1);
}
catch (IOException e)
{
e.printStackTrace();
}
finally
{
try
{
if (bufferedInputStream != null) bufferedInputStream.close();
if (bufferedOutputStream != null) bufferedOutputStream.close();
}
catch (IOException e)
{
e.printStackTrace();
}
}
}
The problem that I'm facing, is that although the file copies successfully and is usable (it's a picture of a cat, and it's still a picture of a cat in the destination), it is slightly different.
The file size has changed from 2261840 to 2262016 (+176)
The MD5 hash has changed completely
Is there something wrong with my copying code that is causing the file to change slightly?
Thanks in advance.

Your copying code is incorrect. It assumes that each call to read will either fill the whole buffer (buf.length bytes) or return -1. In fact, read may return any number of bytes from 1 up to buf.length.
What you should do is capture the number of bytes read in a variable each time, and then write exactly that number of bytes. Your code for closing the streams is verbose and (in theory; see footnote 1) buggy as well.
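To see that this explains your numbers, here is a small stand-alone sketch. It is plain Java with an in-memory stream standing in for the DocumentFile, and the class and method names are made up for illustration. It runs the same loop shape over 2261840 bytes: the last read returns only 848 bytes, but a full 1024-byte buffer is written each time, which is where the extra 176 bytes would come from.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class PaddedCopyDemo {
    // Same loop shape as the question: the read count is thrown away,
    // so every write emits the full 1024-byte buffer.
    public static void buggyCopy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[1024];
        in.read(buf);
        do {
            out.write(buf);
        } while (in.read(buf) != -1);
    }

    // Runs the buggy loop over an in-memory source and returns the copy's size.
    public static int copiedSize(int sourceSize) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        buggyCopy(new ByteArrayInputStream(new byte[sourceSize]), out);
        return out.size();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(copiedSize(2261840)); // prints 2262016
    }
}
```

The result is always rounded up to a multiple of 1024, and the padding is whatever was left in the buffer from the previous read, which is why the picture still opens but the hash no longer matches.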
Here is a rewrite that addresses both of those issues, and some others as well.
void copyBufferedFile(BufferedInputStream bufferedInputStream,
BufferedOutputStream bufferedOutputStream)
throws IOException
{
try (BufferedInputStream in = bufferedInputStream;
BufferedOutputStream out = bufferedOutputStream)
{
byte[] buf = new byte[1024];
int nosRead;
while ((nosRead = in.read(buf)) != -1) // read this carefully ...
{
out.write(buf, 0, nosRead);
}
}
}
As you can see, I have gotten rid of the bogus "catch and squash exception" handlers, and fixed the resource leak using Java 7+ try-with-resources.
There are still a couple of issues:
It is better for the copy function to take file name strings (or File or Path objects) as parameters and be responsible for opening the streams.
Given that you are doing block reads and writes, there is little value in using buffered streams. (Indeed, it might conceivably be making the I/O slower.) It would be better to use plain streams and make the buffer the same size as the default buffer size used by the Buffered* classes .... or larger.
If you are really concerned about performance, try using transferFrom as described here:
https://www.journaldev.com/861/java-copy-file
1 - In theory, if the bufferedInputStream.close() throws an exception, the bufferedOutputStream.close() call will be skipped. In practice, it is unlikely that closing an input stream will throw an exception. But either way, the try-with-resources approach deals with this correctly, and far more concisely.
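For what it's worth, the remaining suggestions could be sketched roughly like this. It assumes you have plain File objects, which a Storage Access Framework content URI does not give you, so treat it as an illustration for ordinary files rather than a drop-in for DocumentFile. The class and method names are hypothetical. The copy method opens the files itself and uses FileChannel.transferFrom, and md5Hex is the kind of check you describe running on both files afterwards.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.FileChannel;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ChannelCopy {
    // Copies source to target; the method opens and closes both files itself.
    public static void copy(File source, File target) throws IOException {
        try (FileChannel in = new FileInputStream(source).getChannel();
             FileChannel out = new FileOutputStream(target).getChannel()) {
            long size = in.size();
            long position = 0;
            // transferFrom may move fewer bytes than requested, so loop.
            while (position < size) {
                long moved = out.transferFrom(in, position, size - position);
                if (moved <= 0) {
                    throw new IOException("Unexpected end of " + source);
                }
                position += moved;
            }
        }
    }

    // Hashes a file in 8 KiB chunks, feeding only the bytes actually read.
    public static String md5Hex(File file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = new FileInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Note that the hashing loop has the same shape as the corrected copy loop: the return value of read decides how many bytes are used.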

Related

Read/Write Bytes to and From a File Using Only Java.IO

How can we write a byte array to a file (and read it back from that file) in Java?
Yes, we all know there are already lots of questions like that, but they get very messy and subjective due to the fact that there are so many ways to accomplish this task.
So let's reduce the scope of the question:
Domain:
Android / Java
What we want:
Fast (as possible)
Bug-free (in a rigidly meticulous way)
What we are not doing:
Third-party libraries
Any libraries that require Android API later than 23 (Marshmallow)
(So, that rules out Apache Commons, Google Guava, java.nio, and leaves us with good ol' java.io)
What we need:
Byte array is always exactly the same (content and size) after going through the write-then-read process
Write method only requires two arguments: File file, and byte[] data
Read method returns a byte[] and only requires one argument: File file
In my particular case, these methods are private (not a library) and are NOT responsible for the following, (but if you want to create a more universal solution that applies to a wider audience, go for it):
Thread-safety (file will not be accessed by more than one process at once)
File being null
File pointing to non-existent location
Lack of permissions at the file location
Byte array being too large
Byte array being null
Dealing with any "index," "length," or "append" arguments/capabilities
So... we're sort of in search of the definitive bullet-proof code that people in the future can assume is safe to use because your answer has lots of up-votes and there are no comments that say, "That might crash if..."
This is what I have so far:
Write Bytes To File:
private void writeBytesToFile(final File file, final byte[] data) {
try {
FileOutputStream fos = new FileOutputStream(file);
fos.write(data);
fos.close();
} catch (Exception e) {
Log.i("XXX", "BUG: " + e);
}
}
Read Bytes From File:
private byte[] readBytesFromFile(final File file) {
RandomAccessFile raf;
byte[] bytesToReturn = new byte[(int) file.length()];
try {
raf = new RandomAccessFile(file, "r");
raf.readFully(bytesToReturn);
} catch (Exception e) {
Log.i("XXX", "BUG: " + e);
}
return bytesToReturn;
}
From what I've read, the possible Exceptions are:
FileNotFoundException : Am I correct that this should not happen as long as the file path being supplied was derived using Android's own internal tools and/or if the app was tested properly?
IOException : I don't really know what could cause this... but I'm assuming that there's no way around it if it does.
So with that in mind... can these methods be improved or replaced, and if so, with what?
It looks like these are going to be core utility/library methods which must run on Android API 23 or later.
Concerning library methods, I find it best to make no assumptions on how applications will use these methods. In some cases the applications may want to receive checked IOExceptions (because data from a file must exist for the application to work), in other cases the applications may not even care if data is not available (because data from a file is only cache that is also available from a primary source).
When it comes to I/O operations, there is never a guarantee that operations will succeed (e.g. user dropping phone in the toilet). The library should reflect that and give the application a choice on how to handle errors.
To optimize I/O performance, always assume the "happy path" and catch errors to figure out what went wrong. This is counterintuitive compared to normal programming but essential when dealing with storage I/O. For example, just checking whether a file exists before reading from it can make your application twice as slow; all these kinds of I/O actions add up fast and slow your application down. Just assume the file exists, and only if you get an error check whether the file exists.
So given those ideas, the main functions could look like:
public static void writeFile(File f, byte[] data) throws FileNotFoundException, IOException {
try (FileOutputStream out = new FileOutputStream(f)) {
out.write(data);
}
}
public static int readFile(File f, byte[] data) throws FileNotFoundException, IOException {
try (FileInputStream in = new FileInputStream(f)) {
return in.read(data);
}
}
Notes about the implementation:
The methods can also throw runtime-exceptions like NullPointerExceptions - these methods are never going to be "bug free".
I do not think buffering is needed/wanted in the methods above since only one native call is done
(see also here).
The application now also has the option to read only the beginning of a file.
To make it easier for an application to read a file, an additional method can be added. But note that it is up to the library to detect any errors and report them to the application since the application itself can no longer detect those errors.
public static byte[] readFile(File f) throws FileNotFoundException, IOException {
int fsize = verifyFileSize(f);
byte[] data = new byte[fsize];
int read = readFile(f, data);
verifyAllDataRead(f, data, read);
return data;
}
private static int verifyFileSize(File f) throws IOException {
long fsize = f.length();
if (fsize > Integer.MAX_VALUE) {
throw new IOException("File size (" + fsize + " bytes) for " + f.getName() + " too large.");
}
return (int) fsize;
}
public static void verifyAllDataRead(File f, byte[] data, int read) throws IOException {
if (read != data.length) {
throw new IOException("Expected to read " + data.length
+ " bytes from file " + f.getName() + " but got only " + read + " bytes from file.");
}
}
This implementation adds another hidden point of failure: OutOfMemory at the point where the new data array is created.
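One caveat worth spelling out: a single InputStream.read(byte[]) call is allowed to return fewer bytes than the array holds, so the length check above can fail on a perfectly healthy file. If you would rather have the read keep going until the array is full, a possible variant (my own sketch, with hypothetical names, still java.io only) uses DataInputStream.readFully, which loops internally and throws EOFException if the file ends early:

```java
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class ReadFullyExample {
    // Reads the whole file; readFully keeps reading until the array is full
    // and throws EOFException if the file turns out to be shorter.
    public static byte[] readAll(File f) throws IOException {
        long size = f.length();
        if (size > Integer.MAX_VALUE) {
            throw new IOException("File too large: " + size + " bytes");
        }
        byte[] data = new byte[(int) size];
        try (DataInputStream in = new DataInputStream(new FileInputStream(f))) {
            in.readFully(data);
        }
        return data;
    }
}
```

The trade-off is that this may take more than one native call, so it gives up the "only one native call" property in exchange for not depending on how much a single read returns.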
To accommodate applications further, additional methods can be added to help with different scenarios. For example, let's say the application really does not want to deal with checked exceptions:
public static void writeFileData(File f, byte[] data) {
try {
writeFile(f, data);
} catch (Exception e) {
fileExceptionToRuntime(e);
}
}
public static byte[] readFileData(File f) {
try {
return readFile(f);
} catch (Exception e) {
fileExceptionToRuntime(e);
}
return null;
}
public static int readFileData(File f, byte[] data) {
try {
return readFile(f, data);
} catch (Exception e) {
fileExceptionToRuntime(e);
}
return -1;
}
private static void fileExceptionToRuntime(Exception e) {
if (e instanceof RuntimeException) { // e.g. NullPointerException
throw (RuntimeException)e;
}
RuntimeException re = new RuntimeException(e.toString());
re.setStackTrace(e.getStackTrace());
throw re;
}
The method fileExceptionToRuntime is a minimal implementation, but it shows the idea here.
The library could also help an application to troubleshoot when an error does occur. For example, a method canReadFile(File f) could check if a file exists and is readable and is not too large. The application could call such a function after a file-read fails and check for common reasons why a file cannot be read. The same can be done for writing to a file.
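A minimal sketch of such a troubleshooting helper might look like the following. Only the name canReadFile comes from the paragraph above; explainReadFailure and the exact set of checks are assumptions for illustration. It is meant to be called after a read has already failed, not before every read.

```java
import java.io.File;

public class FileDiagnostics {
    // Returns null when nothing obvious is wrong, otherwise a short reason.
    public static String explainReadFailure(File f) {
        if (f == null) {
            return "file is null";
        }
        if (!f.exists()) {
            return "file does not exist: " + f;
        }
        if (f.isDirectory()) {
            return "path is a directory: " + f;
        }
        if (!f.canRead()) {
            return "file is not readable: " + f;
        }
        if (f.length() > Integer.MAX_VALUE) {
            return "file too large for a byte[]: " + f.length() + " bytes";
        }
        return null;
    }

    // Convenience wrapper: true when none of the common problems apply.
    public static boolean canReadFile(File f) {
        return explainReadFailure(f) == null;
    }
}
```

An application could log the returned reason next to the original IOException, which keeps the happy path free of extra file system calls.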
Although you can't use third party libraries, you can still read their code and learn from their experience. In Google Guava for example, you usually read a file into bytes like this:
FileInputStream reader = new FileInputStream("test.txt");
byte[] result = ByteStreams.toByteArray(reader);
The core implementation of this is toByteArrayInternal. Before calling this, you should check:
A non-null file is passed (NullPointerException)
The file exists (FileNotFoundException)
After that, it is reduced to handling an InputStream, and this is where IOExceptions come from. When reading streams, a lot of things outside the control of your application can go wrong (bad sectors and other hardware issues, malfunctioning drivers, OS access rights) and manifest themselves as an IOException.
I am copying here the implementation:
private static final int BUFFER_SIZE = 8192;
/** Max array length on JVM. */
private static final int MAX_ARRAY_LEN = Integer.MAX_VALUE - 8;
private static byte[] toByteArrayInternal(InputStream in, Queue<byte[]> bufs, int totalLen)
throws IOException {
// Starting with an 8k buffer, double the size of each successive buffer. Buffers are retained
// in a deque so that there's no copying between buffers while reading and so all of the bytes
// in each new allocated buffer are available for reading from the stream.
for (int bufSize = BUFFER_SIZE;
totalLen < MAX_ARRAY_LEN;
bufSize = IntMath.saturatedMultiply(bufSize, 2)) {
byte[] buf = new byte[Math.min(bufSize, MAX_ARRAY_LEN - totalLen)];
bufs.add(buf);
int off = 0;
while (off < buf.length) {
// always OK to fill buf; its size plus the rest of bufs is never more than MAX_ARRAY_LEN
int r = in.read(buf, off, buf.length - off);
if (r == -1) {
return combineBuffers(bufs, totalLen);
}
off += r;
totalLen += r;
}
}
// read MAX_ARRAY_LEN bytes without seeing end of stream
if (in.read() == -1) {
// oh, there's the end of the stream
return combineBuffers(bufs, MAX_ARRAY_LEN);
} else {
throw new OutOfMemoryError("input is too large to fit in a byte array");
}
}
As you can see most of the logic has to do with reading the file in chunks. This is to handle situations, where you don't know the size of the InputStream, before starting reading. In your case, you only need to read files and you should be able to know the length beforehand, so this complexity could be avoided.
The other thing to check for is OutOfMemoryError. On a desktop JVM the array-size limit is large enough that you rarely hit it, but on Android the heap available to an app is much smaller. Before trying to read the file, you should check that there is enough memory available.
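One rough way to do that check, using only java.lang.Runtime, is sketched below. The class name, method names and the headroom parameter are my own, and the figure is only an estimate: other threads can allocate between the check and the read, so an OutOfMemoryError is still possible.

```java
public class HeapCheck {
    // Rough estimate of how many more bytes the heap could still hand out.
    public static long availableHeap() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return rt.maxMemory() - used;
    }

    // True if an array of byteCount bytes plausibly fits while leaving
    // 'headroom' bytes free for everything else the app is doing.
    public static boolean fitsInHeap(long byteCount, long headroom) {
        return byteCount >= 0 && byteCount <= availableHeap() - headroom;
    }
}
```

A caller could run something like fitsInHeap(file.length(), someMargin) before allocating the array and fall back to chunked processing when it returns false.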

How do I pass in the string from a text file into my JUnit code?

I have the following file, which contains a binary representation of an .MSG file:
binaryMessage.txt
And I put it in my Eclipse workspace, in the following folder: src/main/resources/test
I want to use the string in this text file within the following JUnit code, so I tried the following:
request.setContent("src/main/resources/test/binaryMessage");
mockMvc.perform(post(EmailController.PATH__METADATA_EXTRACTION_OPERATION)
.contentType(MediaType.APPLICATION_JSON)
.content(json(request)))
.andExpect(status().is2xxSuccessful());
}
But this doesn't work. Is there a way I can pass in the string from the file directly, without using IO code?
You can't read a file without using IO code (or libraries that use IO code). That said, it's not that difficult to read the file into memory so you can send it.
To read a binary file into a byte[] you can use this method:
private byte[] readToByteArray(InputStream is) throws IOException {
try {
ByteArrayOutputStream baos = new ByteArrayOutputStream();
byte[] buffer = new byte[1024];
int len;
while ((len = is.read(buffer)) != -1) {
baos.write(buffer, 0, len);
}
return baos.toByteArray();
} finally {
if (is != null) {
is.close();
}
}
}
Then you can do
request.setContent(readToByteArray(getClass().getResourceAsStream("test/binaryMessage")));
In addition to my comment on Samuel's answer, I just noticed that you depend on your concrete execution directory. I personally don't like that and normally use the class loader's functions to find resources.
Thus, to be independent of your working directory, you can use
getClass().getResource("/test/binaryMessage")
Convert this to URI and Path, then use Files.readAllBytes to fetch the contents:
Path resourcePath = Paths.get(getClass().getResource("/test/binaryMessage").toURI());
byte[] content = Files.readAllBytes(resourcePath);
... or even roll that into a single expression.
But to get back to your original question: no, this is I/O code, and you need it. But since the dawn of Java 7 (in 2011!) this does not need to be painful anymore.
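One pitfall with both class loader approaches above: getResource and getResourceAsStream return null when the resource is not on the classpath, and that only shows up later as a NullPointerException. A small java.io-only helper that fails with a clearer message might look like this (a sketch; the class and method names are hypothetical, and the resource name in the usage note is taken from the answers above):

```java
import java.io.ByteArrayOutputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

public class ResourceBytes {
    // Drains a stream into a byte array using only java.io.
    public static byte[] toByteArray(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int len;
        while ((len = in.read(buffer)) != -1) {
            out.write(buffer, 0, len);
        }
        return out.toByteArray();
    }

    // Loads a classpath resource; a leading '/' makes the name absolute,
    // otherwise it is resolved relative to the owner's package.
    public static byte[] read(Class<?> owner, String name) throws IOException {
        try (InputStream in = owner.getResourceAsStream(name)) {
            if (in == null) {
                throw new FileNotFoundException("Resource not on classpath: " + name);
            }
            return toByteArray(in);
        }
    }
}
```

In a test this would be called as ResourceBytes.read(getClass(), "/test/binaryMessage"), with whatever file name and extension the resource actually has.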

Java 7 Deflating Files

I have a piece of code which uses the deflate algorithm to compress a file:
public static File compressOld(File rawFile) throws IOException
{
File compressed = new File(rawFile.getCanonicalPath().split("\\.")[0]
+ "_compressed." + rawFile.getName().split("\\.")[1]);
InputStream inputStream = new FileInputStream(rawFile);
OutputStream compressedWriter = new DeflaterOutputStream(new FileOutputStream(compressed));
byte[] buffer = new byte[1000];
int length;
while ((length = inputStream.read(buffer)) > 0)
{
compressedWriter.write(buffer, 0, length);
}
inputStream.close();
compressedWriter.close();
return compressed;
}
However, I'm not happy with the OutputStream copying loop since it's the "outdated" way of writing to streams. Instead, I want to use a Java 7 API method such as Files.copy:
public static File compressNew(File rawFile) throws IOException
{
File compressed = new File(rawFile.getCanonicalPath().split("\\.")[0]
+ "_compressed." + rawFile.getName().split("\\.")[1]);
OutputStream compressedWriter = new DeflaterOutputStream(new FileOutputStream(compressed));
Files.copy(compressed.toPath(), compressedWriter);
compressedWriter.close();
return compressed;
}
The latter method, however, does not work correctly: the compressed file is messed up and only a few bytes are copied. How come?
I see mainly two problems.
You copy from the target instead of the source. I think the copying has to be changed to Files.copy(rawFile.toPath(), compressedWriter);.
The Javadoc of copy says: "Note that if the given output stream is Flushable then its flush method may need to invoked after this method completes so as to flush any buffered output." So, you have to call the flush-method of the OutputStream after copy.
Additionally there is one more point. The Javadoc of copy says:
It is strongly recommended that the output stream be promptly closed if an I/O error occurs.
You can close the OutputStream in a finally-block to make sure it happens in case of an error. Another possibility is to use try with resources that was introduced in Java 7.
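Putting those points together, a corrected version could look something like this. It is a sketch: I pass the target file in as a parameter instead of deriving its name, and the class name is made up. It copies from the source path and lets try-with-resources close the stream, which for a DeflaterOutputStream also finishes the compression and writes any buffered output.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.util.zip.DeflaterOutputStream;

public class DeflateWithFilesCopy {
    // Deflates rawFile into target, reading from the source path (not the target).
    public static void compress(File rawFile, File target) throws IOException {
        try (OutputStream out = new DeflaterOutputStream(new FileOutputStream(target))) {
            Files.copy(rawFile.toPath(), out);
        } // close() runs here even if copy throws, finishing the deflater
    }
}
```

The output can be read back with an InflaterInputStream wrapped around a FileInputStream on the target file.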

How to run methods against the entire bytes of a large file

My program needs to do calculations against the entire bytes of a file and it breaks whenever the file gets above a certain size.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
I know I can allocate the amount of memory to my program using command line switches, but I'm wondering if there is a more effective way of handling this in my program?
I'm basically trying to figure out a way to read the file in chunks and pass those chunks to another method and essentially rebuild the file in that method.
This is the problem method. I need these bytes to be used in another method.
This method converts the stream to a byte array:
private byte[] inputStreamToByteArray(InputStream inputStream) {
BufferedInputStream bis = null;
ByteArrayOutputStream baos = null;
try {
bis = new BufferedInputStream(inputStream);
baos = new ByteArrayOutputStream();
byte[] buffer = new byte[1024];
int nRead;
while((nRead = bis.read(buffer)) != -1) {
baos.write(buffer, 0, nRead);
}
} catch(IOException ioe) {
ioe.printStackTrace();
}
return baos.toByteArray();
}
This method checks the file type:
private final boolean isMyFileType(byte[] bytes) {
// do stuff
return theBoolean;
}
The reason it is breaking makes sense to me - the byte array ends up being gigantic if I have a gigantic file AND I'm passing around a gigantic byte array.
My goal, I want to read the bytes from a file, determine what type of file it is using another method I wrote, run compression/decompression method against those bytes after determining the file type.
I have most of my goal completed, I just don't know how to handle file streams and large byte arrays effectively.
You are already using a BufferedInputStream. Use the "mark" method to place a mark in the stream. Make sure the "readlimit" argument to "mark" is large enough for you to detect the file type. Read the first X bytes from the stream (but not more than readlimit) and try to figure out the content. Then call reset() to set the stream back to the beginning and continue with whatever you want to do with the stream.
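As a rough sketch of that idea (the class and method names are made up, and how many bytes you need depends on your own file-type check):

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.util.Arrays;

public class HeaderPeek {
    // Reads up to 'count' leading bytes, then rewinds so the caller
    // still sees the stream from its first byte.
    public static byte[] peek(BufferedInputStream in, int count) throws IOException {
        in.mark(count);
        try {
            byte[] header = new byte[count];
            int off = 0;
            while (off < count) {
                int n = in.read(header, off, count - off);
                if (n == -1) {
                    break; // stream is shorter than 'count'
                }
                off += n;
            }
            return Arrays.copyOf(header, off);
        } finally {
            in.reset();
        }
    }
}
```

You would pass the small header array to isMyFileType and then hand the same stream, still positioned at the start, to the compression code, reading it in fixed-size chunks instead of building one giant byte array.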

Best way to detect if a stream is zipped in Java

What is the best way to find out if a java.io.InputStream contains zipped data?
Introduction
Since all the answers are 5 years old, I feel a duty to write down what's going on today. I seriously doubt one should read the magic bytes of the stream! That's low-level code, and it should be avoided in general.
Simple answer
miku writes:
If the Stream can be read via ZipInputStream, it should be zipped.
Yes, but in case of ZipInputStream "can be read" means that first call to .getNextEntry() returns a non-null value. No exception catching et cetera. So instead of magic bytes parsing you can just do:
boolean isZipped = new ZipInputStream(yourInputStream).getNextEntry() != null;
And that's it!
General unzipping thoughts
In general, it turns out to be much more convenient to work with files than with streams when [un]zipping. There are several useful libraries, plus ZipFile has more functionality than ZipInputStream. Handling of zip files is discussed here: What is a good Java library to zip/unzip files? So if you can work with files, you had better do so!
Code sample
I needed in my application to work with streams only. So that's the method I wrote for unzipping:
import org.apache.commons.io.IOUtils;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
public boolean unzip(InputStream inputStream, File outputFolder) throws IOException {
ZipInputStream zis = new ZipInputStream(inputStream);
ZipEntry entry;
boolean isEmpty = true;
while ((entry = zis.getNextEntry()) != null) {
isEmpty = false;
File newFile = new File(outputFolder, entry.getName());
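if (entry.isDirectory()) {
newFile.mkdirs();
continue;
}
// mkdirs() returns false when the folder already exists, so its result must not gate the copy
newFile.getParentFile().mkdirs();
FileOutputStream fos = new FileOutputStream(newFile);
IOUtils.copy(zis, fos);
IOUtils.closeQuietly(fos);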
}
IOUtils.closeQuietly(zis);
return !isEmpty;
}
The magic bytes for the ZIP format are 50 4B. You could test the stream (using mark and reset - you may need to buffer) but I wouldn't expect this to be a 100% reliable approach. There would be no way to distinguish it from a US-ASCII encoded text file that began with the letters PK.
The best way would be to provide metadata on the content format prior to opening the stream and then treat it appropriately.
You could check that the first four bytes of the stream are the local file header signature that starts the local file header that precedes every file in a ZIP file, as shown in the spec here to be 50 4B 03 04.
A little test code shows this to work:
byte[] buffer = new byte[4];
try {
ZipOutputStream zos = new ZipOutputStream(new FileOutputStream("so.zip"));
ZipEntry ze = new ZipEntry("HelloWorld.txt");
zos.putNextEntry(ze);
zos.write("Hello world".getBytes());
zos.close();
FileInputStream is = new FileInputStream("so.zip");
is.read(buffer);
is.close();
}
catch(IOException e) {
e.printStackTrace();
}
for (byte b : buffer) {
System.out.printf("%H ",b);
}
Gave me this output:
50 4B 3 4
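If you want that check as a reusable method that leaves the stream untouched, something along these lines should work. It is a sketch with hypothetical names and it needs a stream that supports mark/reset (wrap it in a BufferedInputStream if it does not). Keep in mind that an empty ZIP archive has no local file header and starts with 50 4B 05 06 instead, so this reports false for it.

```java
import java.io.IOException;
import java.io.InputStream;

public class ZipSignature {
    // Local file header signature from the ZIP spec: 50 4B 03 04.
    private static final byte[] LOCAL_HEADER = {0x50, 0x4B, 0x03, 0x04};

    // Peeks at the first four bytes and rewinds; needs a mark-capable stream.
    public static boolean startsWithLocalHeader(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(LOCAL_HEADER.length);
        try {
            for (byte expected : LOCAL_HEADER) {
                // read() returns -1 at end of stream, which never matches
                if (in.read() != (expected & 0xFF)) {
                    return false;
                }
            }
            return true;
        } finally {
            in.reset();
        }
    }
}
```

Comparing all four bytes rather than just PK at least rules out a text file that happens to begin with those two letters.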
Not very elegant, but reliable:
If the Stream can be read via ZipInputStream, it should be zipped.
Checking the magic number may not be the right option.
Docx files have the same magic number, 50 4B 3 4.
Since .zip and .xlsx files share that magic number, I couldn't tell a genuine zip file from a renamed Office document.
So I used Apache Tika to find the exact document type.
Even if the file has been renamed to .zip, Tika finds its actual type.
Reference: https://www.baeldung.com/apache-tika
I combined the answers from @McDowell and @Innokenty into a small lib function that you can paste into your project:
public static boolean isZipStream(InputStream inputStream) {
if (inputStream == null || !inputStream.markSupported()) {
throw new IllegalArgumentException("InputStream must support mark-reset. Use BufferedInputstream()");
}
boolean isZipped = false;
try {
inputStream.mark(2048);
isZipped = new ZipInputStream(inputStream).getNextEntry() != null;
inputStream.reset();
} catch (IOException ex) {
// cannot be opened as zip.
}
return isZipped;
}
You can use the lib like this:
public static void main(String[] args) {
InputStream inputStream = new BufferedInputStream(...);
if (isZipStream(inputStream)) {
// do zip processing using inputStream
} else {
// do non-zip processing using inputStream
}
}
