Java Read File Larger than 2 GB (Using Chunking)

Java Read File Larger than 2 GB (Using Chunking) - java

I'm implementing a file transfer server, and I've run into an issue with sending a file larger than 2 GB over the network. The issue starts when I get the File I want to work with and try to read its contents into a byte[]. I have a for loop :
for(long i = 0; i < fileToSend.length(); i += PACKET_SIZE){
fileBytes = getBytesFromFile(fileToSend, i);
where getBytesFromFile() reads a PACKET_SIZE amount of bytes from fileToSend which is then sent to the client in the for loop. getBytesFromFile() uses i as an offset; however, the offset variable in FileInputStream.read() has to be an int. I'm sure there is a better way to read this file into the array, I just haven't found it yet.
I would prefer to not use NIO yet, although I will switch to using that in the future. Indulge my madness :-)

It doesn't look like you're reading data from the file properly. When reading data from a stream in Java, it's standard practice to read data into a buffer. The size of the buffer can be your packet size.
File fileToSend = //...
InputStream in = new FileInputStream(fileToSend);
OutputStream out = //...
byte buffer[] = new byte[PACKET_SIZE];
int read;
while ((read = in.read(buffer)) != -1){
out.write(buffer, 0, read);
}
in.close();
out.close();
Note that, the size of the buffer array remains constant. But-- if the buffer cannot be filled (like when it reaches the end of the file), the remaining elements of the array will contain data from the last packet, so you must ignore these elements (this is what the out.write() line in my code sample does)

Umm, realize that your handling of the variable i is not correct..
Iteration 0: i=0
Iteration 1: i=PACKET_SIZE
...
...
Iteration n: i=PACKET_SIZE*n

Related

Downloaded files are corrupted when buffer length is > 1

I'm trying to write a function which downloads a file at a specific URL. The function produces a corrupt file unless I make the buffer an array of size 1 (as it is in the code below).
The ternary statement above the buffer initialization (which I plan to use) along with hard-coded integer values other than 1 will manufacture a corrupted file.
Note: MAX_BUFFER_SIZE is a constant, defined as 8192 (2^13) in my code.
public static void downloadFile(String webPath, String localDir, String fileName) {
try {
File localFile;
FileOutputStream writableLocalFile;
InputStream stream;
url = new URL(webPath);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
int size = connection.getContentLength(); //File size in bytes
int read = 0; //Bytes read
localFile = new File(localDir);
//Ensure that directory exists, otherwise create it.
if (!localFile.exists())
localFile.mkdirs();
//Ensure that file exists, otherwise create it.
//Note that if we define the file path as we do below initially and call mkdirs() it will create a folder with the file name (I.e. test.exe). There may be a better alternative, revisit later.
localFile = new File(localDir + fileName);
if (!localFile.exists())
localFile.createNewFile();
writableLocalFile = new FileOutputStream(localFile);
stream = connection.getInputStream();
byte[] buffer;
int remaining;
while (read != size) {
remaining = size - read; //Bytes still to be read
//remaining > MAX_BUFFER_SIZE ? MAX_BUFFER_SIZE : remaining
buffer = new byte[1]; //Adjust buffer size according to remaining data (to be read).
read += stream.read(buffer); //Read buffer-size amount of bytes from the stream.
writableLocalFile.write(buffer, 0, buffer.length); //Args: Bytes to read, offset, number of bytes
}
System.out.println("Read " + read + " bytes.");
writableLocalFile.close();
stream.close();
} catch (Throwable t) {
t.printStackTrace();
}
}
The reason I've written it this way is so I may provide a real time progress bar to the user as they are downloading. I've removed it from the code to reduce clutter.

len = stream.read(buffer);
read += len;
writableLocalFile.write(buffer, 0, len);
You must not use buffer.length as the bytes read, you need to use the return value of the read call. Because it might return a short read and then your buffer contains junk (0 bytes or data from previous reads) after the read bytes.
And besides calculating the remaining and using dynamic buffers just go for 16k or something like that. The last read will be short, which is fine.

InputStream.read() may read number of bytes fewer than you requested. But you always append whole buffer to the file. You need to capture actual number of read bytes and append only those bytes to the file.
Additionally:
Watch for InputStream.read() to return -1 (EOF)
Server may return incorrect size. As such, the check read != size is dangerous. I would advise not to rely on the Content-Length HTTP field altogether. Instead, just keep reading from the input stream until you hit EOF.

FileInputStream and DataOutputStream - handling byte[] buffer

I've been working on an app to move files between two hosts and while I got the transfer process to work (code is still really messy so sorry for that, I'm still fixing it) I'm kinda left wondering how exactly it handles the buffer. I'm fairly new to networking in java so I just don't want to end up with "meh i got it to work so let's move on" attitude.
File sending code.
public void sendFile(String filepath, DataOutputStream dos) throws Exception{
if (new File(filepath).isFile()&&dos!=null){
long size = new File(filepath).length();
String strsize = Long.toString(size) +"\n";
//System.out.println("File size in bytes: " + strsize);
outToClient.writeBytes(strsize);
FileInputStream fis = new FileInputStream(filepath);
byte[] filebuffer = new byte[8192];
while(fis.read(filebuffer) > 0){
dos.write(filebuffer);
dos.flush();
}
File recieving code
public void saveFile() throws Exception{
String size = inFromServer.readLine();
long longsize = Long.parseLong(size);
//System.out.println(longsize);
String tmppath = currentpath + "\\" + tmpdownloadname;
DataInputStream dis = new DataInputStream(clientSocket.getInputStream());
FileOutputStream fos = new FileOutputStream(tmppath);
byte[] filebuffer = new byte[8192];
int read = 0;
int remaining = (int)longsize;
while((read = dis.read(filebuffer, 0, Math.min(filebuffer.length, remaining))) > 0){
//System.out.println(Math.min(filebuffer.length, remaining));
//System.out.println(read);
//System.out.println(remaining);
remaining -= read;
fos.write(filebuffer,0, read);
}
}
I'd like to know how exactly buffers on both sides are handled to avoid writing wrong bytes. (ik how receiving code avoids that but i'd still like to know how byte array is handled)
Does fis/dis always wait for buffers to fill up fully? In receiving code it always writes full array or remaining length if it's less than filebuffer.length but what about fis from sending code.

In fact, your code could have a subtle bug, exactly because of the way you handle buffers.
When you read a buffer from the original file, the read(byte[]) method returns the number of bytes actually read. There is no guarantee that, in fact, all 8192 bytes have been read.
Suppose you have a file with 10000 bytes. Your first read operation reads 8192 bytes. Your second read operation, however, will only read 1808 bytes. The third operation will return -1.
In the first read, you write exactly the bytes that you have read, because you read a full buffer. But in the second read, your buffer actually contains 1808 correct bytes, and the remaining 6384 bytes are wrong - they are still there from the previous read.
In this case you are lucky, because this only happens in the last buffer that you write. Thus, the fact that you stop reading on your client side when you reach the pre-sent length causes you to skip those 6384 wrong bytes which you shouldn't have sent anyway.
But in fact, there is no actual guarantee that reading from the file will return 8192 bytes even if the end was not reached yet. The method's contract does not guarantee that, and it's up to the OS and underlying file system. It could, for example, send you 5000 bytes in your first read, and 5000 in your second read. In this case, you would be sending 3192 wrong bytes in the middle of the file.
Therefore, your code should actually look like:
byte[] filebuffer = new byte[8192];
int read = 0;
while(( read = fis.read(filebuffer)) > 0){
dos.write(filebuffer,0,read);
dos.flush();
}
much like the code you have on the receiving side. This guarantees that only the actual bytes read will be written.
So there is nothing actually magical about the way buffers are handled. You give the stream a buffer, you tell it how much of the buffer it's allowed to fill, but there is no guarantee it will fill all of it. It may fill less and you have to take care and use only the portion it tells you it fills.
Another grave mistake you are making, though, is to just convert the long that you received into an int in this line:
int remaining = (int)longsize;
Files may be longer than an integer contains. Especially things like long videos etc. This is why you get that number as a long in the first place. Don't truncate it like that. Keep the remaining as long and change it to int only after you have taken the minimum (because you know the minimum will always be in the range of an int).
long remaining = longsize;
long fileBufferLen = filebuffer.length;
while((read = dis.read(filebuffer, 0, (int)Math.min(fileBufferLen, remaining))) > 0){
...
}
By the way, there is no real reason to use a DataOutputStream and DataInputStream for this. The read(byte[]), read(byte[],int,int), write(byte[]), and write(byte[],int,int) are inherited from the underlying InputStream and there is no reason not to use the socket's OutputStream/InputStream directly, or use a BufferedOutputStream/BufferedOutputStream to wrap it. There is also no need to use flush until you have finished writing/reading.
Also, do not forget to close at least your file input/output streams when you are done with them. You may want to keep the socket input/output streams open for continued communication, but there is no need to keep the files themselves open, it may cause problems. Use a try-with-resources to guarantee that they are closed.

File writing in Java using FileOutputStream

String remoteFile2 = "/test/song.mp3";
File downloadFile2 = new File("D:/Downloads/song.mp3");
OutputStream outputStream2 = new BufferedOutputStream(new FileOutputStream(downloadFile2));
InputStream inputStream = ftpClient.retrieveFileStream(remoteFile2);
byte[] bytesArray = new byte[4096];
int bytesRead = -1;
while ((bytesRead = inputStream.read(bytesArray)) != -1) {
outputStream2.write(bytesArray, 0, bytesRead);
}
This is a sample file writing code in java,
byte[] bytesArray = new byte[4096];
In this line what exactly 4096 means, what is the possibility of changing this value?

When deal with stream, you often read bytes in chunk.
If you read / write byte one by one then there are lots of overhead (like init the array to store the byte, put the byte to stream, remember the current position in file... etc) for each byte.
So if you read a group of bytes, you still have those overhead but lesser (For example if you have 4000 bytes, you have 4000x overhead. But if you read 100 bytes per time, you have 4000/100 = 40x overhead only)
The length of chunk is often choosen to balance between the time to read/write the chunk and the size of chunk.
Its often set to 2k or 4k. Might be related with disk sector (512 bytes, 2048 bytes...)

Here 4096 is the buffer size. So whenever you loop is going on it first read 4096 bytes and after that it will go inside the loop.

Load text file to memory in Java

I have wiki.txt file and its size is 50 MB.
I need to do several things on the file and so I thought that the best way in terms of performance is to load the file to memory, is that correct?
This is the code that I written:
File file = new File("wiki.txt");
FileInputStream fileInputStream = new FileInputStream(file);
FileChannel fileChannel = fileInputStream.getChannel();
MappedByteBuffer mapByteBuffer = fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, file.length());
System.out.println((char)mapByteBuffer.get());
I get error on this code: mapByteBuffer.get().
I tried the get() function a few options but all of them I get error and didn't even get an error on e.getMessage() I just got null.
Another important thing to note, my text file contains English words and actions I need to do is search, if expressed is exist in this text file.
Thank you.

I would suggest using a MemoryMappedFile, to read the file directly from the disk instead of loading it in memory.
RandomAccessFile file = new RandomAccessFile("wiki.txt", "r");
FileChannel channel = file.getChannel();
MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_WRITE, 0, 1024*50);
And then you can read the buffer as usual.

My answers for point (1):
It depends on what you want to do with the file. If your processing doesn't involve rewind operation (looking what was read behind/before), it's best to just read as a stream and process it in one go (instead of loading all into memory).
Even if you need random access across the file, you may also be interested in doing block file operation, because your solution may not scale well when the file size change to bigger size.
RandomAccessFile if you are on Java 1.4 or above.
For random access, the operating system usually handles the file buffer caching quite well you don't have to handle yourself.

It is important to read the whole error, not just the message. Often the real information is in the exception's name not the text associated with it.
You will get an error if the file is empty as there is no first byte.
Note: the approach you are using assumes ASCII 7-bit characters. If you want to assume ISO-8859-1 characters you can use (char) (byteBuffer.get() & 0xFF)
However, if you have plan text you may find that using strings is simpler to use and not much slower. e.g. you can read a 50 MB file as text in less than a second. I would only use a memory mapped file if this is far too long.

I would suggest to use BufferedReader. It is much faster and requires relatively less resources.
First read number of lines:
InputStream is = new BufferedInputStream(new FileInputStream(filename));
byte[] chars = new byte[1024];
int numberOfChars = 0;
while ((numberOfChars = is.read(chars)) != -1)
{
for (int i = 0; i < numberOfChars; ++i)
{
if (chars[i] == '\n' && numberOfChars - i != 1)
{
++count;
}
}
}
count++
return count; // number of lines
Then read the lines:
BufferedReader in = new BufferedReader(new FileReader(fileName));
for (int i = 0; i < endLine; i++)
{
String oneLine = in.readLine();
}
In this strings you can even do search for what you need.

Reading and writing binary file in Java (seeing half of the file being corrupted)

I have some working code in python that I need to convert to Java.
I have read quite a few threads on this forum but could not find an answer. I am reading in a JPG image and converting it into a byte array. I then write this buffer it to a different file. When I compare the written files from both Java and python code, the bytes at the end do not match. Please let me know if you have a suggestion. I need to use the byte array to pack the image into a message that needs to be sent over to a remote server.
Java code (Running on Android)
Reading the file:
File queryImg = new File(ImagePath);
int imageLen = (int)queryImg.length();
byte [] imgData = new byte[imageLen];
FileInputStream fis = new FileInputStream(queryImg);
fis.read(imgData);
Writing the file:
FileOutputStream f = new FileOutputStream(new File("/sdcard/output.raw"));
f.write(imgData);
f.flush();
f.close();
Thanks!

InputStream.read is not guaranteed to read any particular number of bytes and may read less than you asked it to. It returns the actual number read so you can have a loop that keeps track of progress:
public void pump(InputStream in, OutputStream out, int size) {
byte[] buffer = new byte[4096]; // Or whatever constant you feel like using
int done = 0;
while (done < size) {
int read = in.read(buffer);
if (read == -1) {
throw new IOException("Something went horribly wrong");
}
out.write(buffer, 0, read);
done += read;
}
// Maybe put cleanup code in here if you like, e.g. in.close, out.flush, out.close
}
I believe Apache Commons IO has classes for doing this kind of stuff so you don't need to write it yourself.

Your file length might be more than int can hold and than you end up having wrong array length, hence not reading entire file into the buffer.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Java Read File Larger than 2 GB (Using Chunking) - java

Umm, realize that your handling of the variable i is not correct.. Iteration 0: i=0 Iteration 1: i=PACKET_SIZE ... ... Iteration n: i=PACKET_SIZE*n

Related

Downloaded files are corrupted when buffer length is > 1

FileInputStream and DataOutputStream - handling byte[] buffer

File writing in Java using FileOutputStream

Load text file to memory in Java

Reading and writing binary file in Java (seeing half of the file being corrupted)

Categories

Resources