How can I efficiently read from a large file and write bulk data to a file using the Java NIO framework?
I'm working with ByteBuffer and FileChannel and have tried something like this:
public static void main(String[] args) {
    String inFileStr = "screen.png";
    String outFileStr = "screen-out.png";
    long startTime, elapsedTime;
    int bufferSizeKB = 4;
    int bufferSize = bufferSizeKB * 1024;

    // Check file length
    File fileIn = new File(inFileStr);
    System.out.println("File size is " + fileIn.length() + " bytes");
    System.out.println("Buffer size is " + bufferSizeKB + " KB");
    System.out.println("Using FileChannel with an indirect ByteBuffer of " + bufferSizeKB + " KB");

    try (FileChannel in = new FileInputStream(inFileStr).getChannel();
         FileChannel out = new FileOutputStream(outFileStr).getChannel()) {
        // Allocate an indirect ByteBuffer
        ByteBuffer bytebuf = ByteBuffer.allocate(bufferSize);

        startTime = System.nanoTime();
        int bytesCount = 0;
        // Read data from file into ByteBuffer
        while ((bytesCount = in.read(bytebuf)) > 0) {
            // Flip the buffer, which sets the limit to the current position and the position to 0
            bytebuf.flip();
            out.write(bytebuf); // Write data from ByteBuffer to file
            bytebuf.clear(); // For the next read
        }
        elapsedTime = System.nanoTime() - startTime;
        System.out.println("Elapsed Time is " + (elapsedTime / 1000000.0) + " msec");
    } catch (IOException ex) {
        ex.printStackTrace();
    }
}
Can anybody tell me: should I follow the same procedure if my file size is more than 2 GB?
And what should I do if I want to perform similar bulk write operations?
Note that you can simply use Files.copy(Paths.get(inFileStr), Paths.get(outFileStr), StandardCopyOption.REPLACE_EXISTING) to copy the file as your example code does, just likely faster and with only one line of code.
Otherwise, if you have already opened the two file channels, you can just use
in.transferTo(0, in.size(), out) to transfer the entire contents of the in channel to the out channel. Note that this method allows you to specify a range within the source file that will be transferred to the target channel's current position (which is initially zero), and that there's also a method for the opposite direction, i.e. out.transferFrom(in, 0, in.size()), to transfer data from the source channel's current position to an absolute range within the target file.
Together, they allow almost every imaginable nontrivial bulk transfer to be done efficiently, without copying the data into a Java-side buffer. If that doesn't solve your problem, you'll have to be more specific in your question.
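For example, a minimal sketch of a loop-based transfer (channel names as in your code; it loops because transferTo may transfer fewer bytes than requested in a single call):
long position = 0;
long size = in.size();
while (position < size) {
    position += in.transferTo(position, size - position, out);
}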
By the way, since Java 7 you can open a FileChannel directly, without the FileInputStream/FileOutputStream detour.
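A minimal sketch of that, reusing the file names from your example (Paths and StandardOpenOption come from java.nio.file):
try (FileChannel in = FileChannel.open(Paths.get(inFileStr), StandardOpenOption.READ);
     FileChannel out = FileChannel.open(Paths.get(outFileStr),
             StandardOpenOption.WRITE, StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {
    in.transferTo(0, in.size(), out);
}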
while ((bytesCount = in.read(bytebuf)) > 0) {
    // Flip the buffer, which sets the limit to the current position and the position to 0
    bytebuf.flip();
    out.write(bytebuf); // Write data from ByteBuffer to file
    bytebuf.clear(); // For the next read
}
Your copy loop is not correct: a single write() is not guaranteed to drain the buffer completely, and clear() would then discard any bytes that were not written. Use compact() instead, and keep looping while the buffer still holds data. It should be:
while ((bytesCount = in.read(bytebuf)) > 0 || bytebuf.position() > 0) {
    // Flip the buffer, which sets the limit to the current position and the position to 0
    bytebuf.flip();
    out.write(bytebuf); // Write data from ByteBuffer to file
    bytebuf.compact(); // Keep any unwritten bytes for the next read
}
Can anybody tell me: should I follow the same procedure if my file size is more than 2 GB?
Yes. The file size doesn't make any difference: FileChannel positions and sizes are longs, so the same loop works for files larger than 2 GB.
I'm trying to write a function which downloads a file at a specific URL. The function produces a corrupt file unless I make the buffer an array of size 1 (as it is in the code below).
Using the ternary expression above the buffer initialization (which I plan to use), or any hard-coded size other than 1, produces a corrupted file.
Note: MAX_BUFFER_SIZE is a constant, defined as 8192 (2^13) in my code.
public static void downloadFile(String webPath, String localDir, String fileName) {
    try {
        File localFile;
        FileOutputStream writableLocalFile;
        InputStream stream;

        URL url = new URL(webPath);
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        int size = connection.getContentLength(); // File size in bytes
        int read = 0; // Bytes read

        localFile = new File(localDir);
        // Ensure that the directory exists, otherwise create it.
        if (!localFile.exists())
            localFile.mkdirs();

        // Ensure that the file exists, otherwise create it.
        // Note that if we define the file path as we do below initially and call mkdirs(),
        // it will create a folder with the file name (i.e. test.exe).
        // There may be a better alternative; revisit later.
        localFile = new File(localDir + fileName);
        if (!localFile.exists())
            localFile.createNewFile();

        writableLocalFile = new FileOutputStream(localFile);
        stream = connection.getInputStream();

        byte[] buffer;
        int remaining;
        while (read != size) {
            remaining = size - read; // Bytes still to be read
            // remaining > MAX_BUFFER_SIZE ? MAX_BUFFER_SIZE : remaining
            buffer = new byte[1]; // Adjust buffer size according to remaining data (to be read).
            read += stream.read(buffer); // Read buffer-size amount of bytes from the stream.
            writableLocalFile.write(buffer, 0, buffer.length); // Args: buffer, offset, number of bytes to write
        }

        System.out.println("Read " + read + " bytes.");
        writableLocalFile.close();
        stream.close();
    } catch (Throwable t) {
        t.printStackTrace();
    }
}
The reason I've written it this way is so I may provide a real time progress bar to the user as they are downloading. I've removed it from the code to reduce clutter.
len = stream.read(buffer);
read += len;
writableLocalFile.write(buffer, 0, len);
You must not use buffer.length as the number of bytes written; you need to use the return value of the read call, because read may return a short read, and then your buffer contains junk (zero bytes or data from previous reads) after the bytes actually read.
And instead of calculating the remaining bytes and using dynamically sized buffers, just go for a fixed 16 KB buffer or something like that. The last read will be short, which is fine.
InputStream.read() may read fewer bytes than you requested, but you always append the whole buffer to the file. You need to capture the actual number of bytes read and append only those bytes to the file.
Additionally:
Watch for InputStream.read() to return -1 (EOF)
Server may return incorrect size. As such, the check read != size is dangerous. I would advise not to rely on the Content-Length HTTP field altogether. Instead, just keep reading from the input stream until you hit EOF.
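Putting both answers together, a minimal sketch of a fixed loop (reusing the variable names from the question; MAX_BUFFER_SIZE as defined there):
byte[] buffer = new byte[MAX_BUFFER_SIZE];
int len;
while ((len = stream.read(buffer)) != -1) {
    writableLocalFile.write(buffer, 0, len); // write only the bytes actually read
    read += len;                             // progress counter for the UI
}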
I am trying to send chunks of files from a server to more than one client. When I try to send a file of size 700 MB, I get an "OutOfMemoryError: Java heap space". I am using NetBeans 7.1.2.
I also tried setting the VM options in the project properties, but the same error still happens. I think there is some problem with reading the entire file. The code below works for files up to 300 MB. Please give me some suggestions.
Thanks in advance
public class SplitFile {
    static int fileid = 0;

    public static DataUnit[] getUpdatableDataCode(File fileName) throws FileNotFoundException, IOException {
        int i = 0;
        DataUnit[] chunks = new DataUnit[UAProtocolServer.singletonServer.cloudhosts.length];
        FileInputStream fis;
        long Chunk_Size = (fileName.length()) / chunks.length;
        int cursor = 0;
        long fileSize = (long) fileName.length();
        int nChunks = 0, read = 0;
        long readLength = Chunk_Size;
        byte[] byteChunk;
        try {
            fis = new FileInputStream(fileName);
            // StupidTest.size = (int) fileName.length();
            while (fileSize > 0) {
                System.out.println("loop" + i);
                if (fileSize <= Chunk_Size) {
                    readLength = (int) fileSize;
                }
                byteChunk = new byte[(int) readLength];
                read = fis.read(byteChunk, 0, (int) readLength);
                fileSize -= read;
                // cursor += read;
                assert (read == byteChunk.length);
                long aid = fileid;
                aid = aid << 32 | nChunks;
                chunks[i] = new DataUnit(byteChunk, aid);
                // Lister.add(chunks[i]);
                nChunks++;
                ++i;
            }
            fis.close();
            fis = null;
        } catch (Exception e) {
            System.out.println("File splitting exception");
            e.printStackTrace();
        }
        return chunks;
    }
}
Reading the whole file into memory will definitely trigger an OutOfMemoryError as file sizes grow. Tuning -Xmx1024M may be good as a temporary fix, but it's definitely not the right, scalable solution. Also, no matter how you move your variables around (like creating the buffer outside of the loop instead of inside it), you will get an OutOfMemoryError sooner or later. The only way to avoid it is not to read the complete file into memory.
If you have to work purely in memory, then one approach is to send each chunk off to the client as it is read, so you don't have to keep all the chunks in memory:
instead of:
chunks[i] = new DataUnit(byteChunk,aid);
do:
sendChunkToClient(new DataUnit(byteChunk, aid));
But the above solution has the drawback that if some error happens in between chunk sends, you may have a hard time trying to resume/recover from the error point.
Saving the chunks to temporary files like Ross Drew suggested is probably better and more reliable.
How about creating the
byteChunk = new byte[(int)readLength];
outside of the loop and just reusing it, instead of creating a new byte array over and over, if it's always the same size?
Alternatively
You could write the incoming data to a temporary file as it comes in, instead of maintaining that huge array, and then process it once it has all arrived.
Also
If you are using it multiple times as an int, you should probably just cast readLength to an int outside the loop as well:
int len = (int)readLength;
And Chunk_Size is a variable, right? It should begin with a lower-case letter.
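Putting those suggestions together, a minimal sketch (variable names taken from the question; it assumes each chunk is handed off before the next read overwrites the buffer):
int len = (int) readLength;       // cast once, outside the loop
byte[] byteChunk = new byte[len]; // allocate once and reuse
int read;
while ((read = fis.read(byteChunk, 0, len)) > 0) {
    // hand off the first `read` bytes here, e.g. write them to a temporary
    // file or send them to the client, before the next iteration reuses byteChunk
}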
I wish to write data to a file at different offsets. For example, at the 0th position, at the (size/2)th position, at the (size/4)th position, etc., where size is the size of the file to be created. Is this possible without creating different file parts and joining them?
Well, you can write anywhere you like in a file using RandomAccessFile: just use seek to get to the right place, and start writing.
However, this won't insert bytes at those places; it will just overwrite them (or add data at the end if you're writing past the end of the current file length, of course). It's not clear whether that's what you want or not.
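A minimal sketch, assuming a file named data.bin and that overwriting (not inserting) is what you want:
try (RandomAccessFile raf = new RandomAccessFile("data.bin", "rw")) {
    long size = raf.length();
    raf.seek(0);        // 0th position
    raf.write(0x42);
    raf.seek(size / 4); // (size/4)th position
    raf.write(0x42);
    raf.seek(size / 2); // (size/2)th position
    raf.write(0x42);
}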
What you are looking for are random access files. From the official Sun Java tutorial site:
Random access files permit nonsequential, or random, access to a
file's contents. To access a file randomly, you open the file, seek a
particular location, and read from or write to that file.
This functionality is possible with the SeekableByteChannel interface.
The SeekableByteChannel interface extends channel I/O with the notion
of a current position. Methods enable you to set or query the
position, and you can then read the data from, or write the data to,
that location. The API consists of a few easy-to-use methods:
position – Returns the channel's current position
position(long) – Sets the channel's position
read(ByteBuffer) – Reads bytes into the buffer from the channel
write(ByteBuffer) – Writes bytes from the buffer to the channel
truncate(long) – Truncates the file (or other entity) connected to the channel
and an example, which is provided there:
String s = "I was here!\n";
byte data[] = s.getBytes();
ByteBuffer out = ByteBuffer.wrap(data);

ByteBuffer copy = ByteBuffer.allocate(12);

try (FileChannel fc = FileChannel.open(file, READ, WRITE)) {
    // Read the first 12 bytes of the file.
    int nread;
    do {
        nread = fc.read(copy);
    } while (nread != -1 && copy.hasRemaining());

    // Write "I was here!" at the beginning of the file.
    // Note how the position moves back to the beginning of the file.
    fc.position(0);
    while (out.hasRemaining())
        fc.write(out);
    out.rewind();

    // Move to the end of the file. Copy the first 12 bytes to
    // the end of the file. Then write "I was here!" again.
    long length = fc.size();
    fc.position(length); // the original snippet used length-1, which would overwrite the last byte
    copy.flip();
    while (copy.hasRemaining())
        fc.write(copy);
    while (out.hasRemaining())
        fc.write(out);
} catch (IOException x) {
    System.out.println("I/O Exception: " + x);
}
If the file isn't huge, you can read the entire thing into memory and then edit the array:
public String read(String fileName) throws IOException {
    BufferedReader br = new BufferedReader(new FileReader(fileName));
    try {
        StringBuilder sb = new StringBuilder();
        String line = br.readLine();
        while (line != null) {
            sb.append(line);
            sb.append("\n");
            line = br.readLine();
        }
        return sb.toString();
    } finally {
        br.close();
    }
}

public String edit(String fileContent, byte b, int offset) {
    byte[] bytes = fileContent.getBytes();
    bytes[offset] = b;
    return new String(bytes);
}
and then write it back to the file (or just delete the old one and write the byte array to a new file with the same name).
I have a module that is responsible for reading, processing, and writing bytes to disk. The bytes come in over UDP and, after the individual datagrams are assembled, the final byte array that gets processed and written to disk is typically between 200 bytes and 500,000 bytes. Occasionally, there will be byte arrays that, after assembly, are over 500,000 bytes, but these are relatively rare.
I'm currently using FileOutputStream's write(byte[]) method. I'm also experimenting with wrapping the FileOutputStream in a BufferedOutputStream, including using the constructor that accepts a buffer size as a parameter.
It appears that using the BufferedOutputStream is tending toward slightly better performance, but I've only just begun to experiment with different buffer sizes. I only have a limited set of sample data to work with (two data sets from sample runs that I can pipe through my application). Is there a general rule-of-thumb that I might be able to apply to try to calculate the optimal buffer sizes to reduce disk writes and maximize the performance of the disk writing given the information that I know about the data I'm writing?
BufferedOutputStream helps when the writes are smaller than the buffer size, e.g. 8 KB. For larger writes it doesn't help, nor does it make things much worse. If ALL your writes are larger than the buffer size, or you always flush() after every write, I would not use a buffer. However, if a good portion of your writes are smaller than the buffer size and you don't flush() every time, it's worth having.
You may find increasing the buffer size to 32 KB or larger gives you a marginal improvement, or makes things worse. YMMV.
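For example, a rough sketch of the case where buffering pays off (file name assumed): many one-byte writes get coalesced into 8 KB flushes instead of hitting the OS on every call.
try (OutputStream out = new BufferedOutputStream(new FileOutputStream("out.bin"), 8 * 1024)) {
    for (int i = 0; i < 100_000; i++) {
        out.write(i & 0xFF); // each 1-byte write lands in the buffer, not on disk
    }
} // close() flushes whatever is left in the buffer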
You might find the code for BufferedOutputStream.write useful:
/**
 * Writes <code>len</code> bytes from the specified byte array
 * starting at offset <code>off</code> to this buffered output stream.
 *
 * <p> Ordinarily this method stores bytes from the given array into this
 * stream's buffer, flushing the buffer to the underlying output stream as
 * needed. If the requested length is at least as large as this stream's
 * buffer, however, then this method will flush the buffer and write the
 * bytes directly to the underlying output stream. Thus redundant
 * <code>BufferedOutputStream</code>s will not copy data unnecessarily.
 *
 * @param b the data.
 * @param off the start offset in the data.
 * @param len the number of bytes to write.
 * @exception IOException if an I/O error occurs.
 */
public synchronized void write(byte b[], int off, int len) throws IOException {
    if (len >= buf.length) {
        /* If the request length exceeds the size of the output buffer,
           flush the output buffer and then write the data directly.
           In this way buffered streams will cascade harmlessly. */
        flushBuffer();
        out.write(b, off, len);
        return;
    }
    if (len > buf.length - count) {
        flushBuffer();
    }
    System.arraycopy(b, off, buf, count, len);
    count += len;
}
I have lately been trying to explore IO performance. From what I have observed, directly writing to a FileOutputStream leads to better results, which I have attributed to FileOutputStream's native call for write(byte[], int, int). Moreover, I have also observed that when BufferedOutputStream's latency begins to converge towards that of the direct FileOutputStream, it fluctuates a lot more, i.e. it can abruptly even double (I haven't yet been able to find out why).
P.S. I am using Java 8 and will not be able to comment right now on whether my observations will hold for previous Java versions.
Here's the code I tested, where my input was a ~10 KB file:
public class WriteCombinationsOutputStreamComparison {

    private static final Logger LOG = LogManager.getLogger(WriteCombinationsOutputStreamComparison.class);

    public static void main(String[] args) throws IOException {
        final BufferedInputStream input = new BufferedInputStream(new FileInputStream("src/main/resources/inputStream1.txt"), 4 * 1024);
        final ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
        int data = input.read();
        while (data != -1) {
            byteArrayOutputStream.write(data); // everything comes in memory
            data = input.read();
        }
        final byte[] bytesRead = byteArrayOutputStream.toByteArray();
        input.close();

        /*
         * 1. WRITE USING A STREAM DIRECTLY with entire byte array --> FileOutputStream directly uses a native call and writes
         */
        try (OutputStream outputStream = new FileOutputStream("src/main/resources/outputStream1.txt")) {
            final long begin = System.nanoTime();
            outputStream.write(bytesRead);
            outputStream.flush();
            final long end = System.nanoTime();
            LOG.info("Total time taken for file write, writing entire array [nanos=" + (end - begin) + "], [bytesWritten=" + bytesRead.length + "]");
            if (LOG.isDebugEnabled()) {
                LOG.debug("File reading result was: \n" + new String(bytesRead, Charset.forName("UTF-8")));
            }
        }

        /*
         * 2. WRITE USING A BUFFERED STREAM, write entire array
         */
        // changed the buffer size to different combinations --> write latency fluctuates a lot for the same buffer size over multiple runs
        try (BufferedOutputStream outputStream = new BufferedOutputStream(new FileOutputStream("src/main/resources/outputStream1.txt"), 16 * 1024)) {
            final long begin = System.nanoTime();
            outputStream.write(bytesRead);
            outputStream.flush();
            final long end = System.nanoTime();
            LOG.info("Total time taken for buffered file write, writing entire array [nanos=" + (end - begin) + "], [bytesWritten=" + bytesRead.length + "]");
            if (LOG.isDebugEnabled()) {
                LOG.debug("File reading result was: \n" + new String(bytesRead, Charset.forName("UTF-8")));
            }
        }
    }
}
OUTPUT:
2017-01-30 23:38:59.064 [INFO] [main] [WriteCombinationsOutputStream] - Total time taken for file write, writing entire array [nanos=100990], [bytesWritten=11059]
2017-01-30 23:38:59.086 [INFO] [main] [WriteCombinationsOutputStream] - Total time taken for buffered file write, writing entire array [nanos=142454], [bytesWritten=11059]
Situation: I have an ArrayList<String> containing a bunch of links to images (http://www.foo.com/bar/image1.jpg, http://www.foo.com/bar/image2.png, ... etc.)
I have found a working piece of code in order to download them one by one:
public void run() {
    try {
        int counter = 1;
        for (String image : imagesList) {
            controller.setDownloadStatusTextArea("Downloading image " + counter + " of " + imagesList.size());
            URL u = new URL(image);
            URLConnection uc = u.openConnection();
            String contentType = uc.getContentType();
            int contentLength = uc.getContentLength();

            InputStream raw = uc.getInputStream();
            InputStream in = new BufferedInputStream(raw);
            byte[] data = new byte[contentLength];
            int bytesRead;
            int offset = 0;
            while (offset < contentLength) {
                bytesRead = in.read(data, offset, data.length - offset);
                if (bytesRead == -1)
                    break;
                offset += bytesRead;
            }
            in.close();

            if (offset != contentLength) {
                throw new IOException("Only read " + offset + " bytes; Expected " + contentLength + " bytes");
            }

            String[] tmp = image.split("/");
            String filename = tmp[tmp.length - 1];
            FileOutputStream out = new FileOutputStream(filename);
            out.write(data);
            out.flush();
            out.close();
            counter++;
        }
        controller.setDownloadStatusTextArea("Download complete");
    } catch (Exception ex) {
        controller.setDownloadStatusTextArea("Download failed");
    }
}
This is the first time I'm doing something like this in Java, and I have a feeling this code could be more efficient if some of the variables were moved outside of the for loop. But I'm not sure which ones can safely be moved out without affecting the functionality and/or performance (negatively or positively). Any insight into this situation would be greatly appreciated.
Also: can I specify where the files get downloaded to? Right now they just appear in the project folder; I want the user to be able to change the download folder.
Thanks in advance.
This code can't be made much more time-efficient.
Think of it this way: even if you polished every last dispensable opcode out of it, the time it takes the JVM to execute this portion of code is not significant at all. The real delay will be in waiting for the data to arrive through the network.
It could be more space-efficient, but I don't think it's necessary.
Edit: what you can do is download multiple images at the same time, using threads. If the code above looks complicated though, I would advise against it: take some more time to learn your way around the language.
You don't need to allocate a byte array for the whole image; you only need a small buffer, e.g. 8 kB.
Then, in a loop, read up to 8 kB from the connection and write it to the FileOutputStream.
To make the whole code simpler (and kick out the hand-written loops), you can use a library such as Commons IO, e.g. its IOUtils.copy(InputStream, OutputStream) method.
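A minimal sketch of that loop (stream variable names assumed; only the bytes actually read are written):
byte[] buffer = new byte[8 * 1024];
int n;
while ((n = in.read(buffer)) != -1) {
    out.write(buffer, 0, n);
}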
In a Swing application, to let the user select a directory, instantiate a JFileChooser and call setFileSelectionMode(JFileChooser.DIRECTORIES_ONLY).
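A minimal sketch (the parent component frame and the filename variable are assumed from the surrounding code):
JFileChooser chooser = new JFileChooser();
chooser.setFileSelectionMode(JFileChooser.DIRECTORIES_ONLY);
if (chooser.showOpenDialog(frame) == JFileChooser.APPROVE_OPTION) {
    File downloadDir = chooser.getSelectedFile();
    // then save each image into the chosen directory:
    FileOutputStream out = new FileOutputStream(new File(downloadDir, filename));
}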
You could move all the variable declarations outside of the loop as long as you ensure they are properly initialized on each iteration. You won't save a lot of time relative to the time it takes to download and save the file, though.