I was given a DataInputStream from an HDFS client for a large file (around 2 GB), and I need to store it as a file on my host.
I was thinking about using Apache Commons IOUtils and doing something like this...
File temp = getTempFile(localPath);
DataInputStream dis = HDFSClient.open(filepath); // around 2 GB file (zipped)
InputStream in = new BufferedInputStream(dis);
OutputStream out = new FileOutputStream(temp);
IOUtils.copy(in, out);
I am looking for other solutions that might work better than this approach. My main concern is that buffering happens both in the input stream and inside IOUtils.copy ...
For files larger than 2 GB it is recommended to use IOUtils.copyLarge() (if we are speaking about the same IOUtils: org.apache.commons.io.IOUtils).
The copy in IOUtils uses a default buffer size of 4 KB (although you can specify another buffer size as a parameter).
The difference between copy() and copyLarge() is the return value.
For copy(), if the stream is bigger than 2 GB the copy will succeed, but the result is -1.
For copyLarge() the result is exactly the number of bytes copied.
See more in the documentation here:
http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/IOUtils.html#copyLarge(java.io.InputStream,%20java.io.OutputStream)
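For illustration, a minimal sketch of that approach against the question's code, assuming commons-io 2.2+ for the copyLarge overload that takes an explicit buffer (the 64 KB size is just an illustrative choice):

import java.io.*;
import org.apache.commons.io.IOUtils;

// Minimal sketch: copy a large (>2 GB) HDFS stream to a local file.
// 'dis' is the DataInputStream obtained from HDFSClient.open(filepath) in the question.
static long copyToLocal(InputStream dis, File temp) throws IOException {
    try (InputStream in = new BufferedInputStream(dis);
         OutputStream out = new BufferedOutputStream(new FileOutputStream(temp))) {
        byte[] buffer = new byte[64 * 1024];           // explicit buffer instead of the 4 KB default
        return IOUtils.copyLarge(in, out, buffer);     // returns the exact byte count, also beyond 2 GB
    }
}

The BufferedInputStream wrapper is largely redundant here, since copyLarge already reads in buffer-sized blocks, which also addresses the double-buffering concern from the question.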
I am currently developing a REST service whose request contains a field holding a file in Base64 format (a string of "n" characters). Within the service logic I convert that character string to a File and save it to a predetermined path.
The problem is that when the file is large (around 3 MB) the service becomes slow and takes a long time to respond.
This is the code I am using:
String filename = "TEXT.DOCX";
BufferedOutputStream stream = null;
// THE FIELD base64file IS THE BASE64 STRING THAT COMES IN THE REQUEST
byte[] fileByteArray = java.util.Base64.getDecoder().decode(base64file);
// VALIDATE FILE SIZE
if (1 * 1024 * 1024 < fileByteArray.length) {
    logger.info("The file [" + filename + "] is too large");
} else {
    stream = new BufferedOutputStream(new FileOutputStream(new File("C:\\" + filename)));
    stream.write(fileByteArray);
}
How can I avoid this problem, so that my service does not take so long to convert the string to a File?
Buffering does not improve performance here, since all you are doing is writing the file as fast as possible. The code generally looks fine; change it to use the FileOutputStream directly and see if that improves things:
try (FileOutputStream stream = new FileOutputStream(path)) {
stream.write(bytes);
}
Alternatively you could also try using something like Apache Commons to do the task for you:
FileUtils.writeByteArrayToFile(new File(path), bytes);
Try the following; it also works for large files.
Path outPath = Paths.get(filename);
try (InputStream in = Base64.getDecoder().wrap(base64file)) {
Files.copy(in, outPath);
}
This keeps only a small buffer in memory. Your code is probably slow because it holds the whole decoded file in memory at once.
Note that wrap takes an InputStream, which you should provide, rather than the entire String.
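For illustration, a minimal sketch of that idea; the helper name is hypothetical, and the String-to-stream conversion is only there because the question already holds the payload as a String (ideally you would wrap the request body stream directly):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.Base64;

// Sketch: decode Base64 while streaming, so the decoded bytes never exist as one big array in memory.
static void saveBase64ToFile(String base64file, Path outPath) throws IOException {
    InputStream rawBase64 =
            new ByteArrayInputStream(base64file.getBytes(StandardCharsets.US_ASCII));
    try (InputStream decoded = Base64.getDecoder().wrap(rawBase64)) {
        Files.copy(decoded, outPath, StandardCopyOption.REPLACE_EXISTING);
    }
}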
From a network point of view:
Both JSON and XML can carry large amounts of data, and 3 MB is not really huge. But there is a limit on how much a browser can handle (if this call comes from a user interface).
Also, a web server like Tomcat limits POST bodies to 2 MB by default (check maxPostSize: http://tomcat.apache.org/tomcat-6.0-doc/config/http.html#Common_Attributes).
You can also try chunking the request payload (although it shouldn't be required for a 3 MB file).
From an implementation point of view:
The write operation to your disk could be slow; it also depends on your OS.
If your file size is really large, you can use Java's FileChannel class with a ByteBuffer.
To find the cause of the slowness (network delay or code), compare the performance of a simple standalone Java program with that of the web service call.
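As a rough illustration of the FileChannel/ByteBuffer suggestion (assuming Java 7+; the method name is hypothetical and the byte array comes from the question's decoding step):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Sketch: write the decoded byte[] to disk through a FileChannel.
static void writeWithChannel(byte[] fileByteArray, String path) throws IOException {
    try (FileChannel channel = FileChannel.open(Paths.get(path),
            StandardOpenOption.CREATE, StandardOpenOption.WRITE,
            StandardOpenOption.TRUNCATE_EXISTING)) {
        ByteBuffer buffer = ByteBuffer.wrap(fileByteArray);   // wraps the array, no extra copy
        while (buffer.hasRemaining()) {
            channel.write(buffer);                             // write() may not consume everything at once
        }
    }
}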
I am trying to read and write large files (larger than 100 MB) using BufferedInputStream & BufferedOutputStream. I am getting heap memory issues and an OOM exception.
The code looks like :
BufferedInputStream buffIn = new BufferedInputStream(iStream);
/** iStream is the InputStream object **/
BufferedOutputStream buffOut=new BufferedOutputStream(new FileOutputStream(file));
byte []arr = new byte [1024 * 1024];
int available = -1;
while((available = buffIn.read(arr)) > 0) {
buffOut.write(arr, 0, available);
}
buffOut.flush();
buffOut.close();
My question is: when we use the BufferedOutputStream, does it hold the data in memory until the full file is written out?
What is the best way to write large files using BufferedOutputStream?
There is nothing wrong with the code you have provided; your memory issues must lie elsewhere. The buffered streams have a fixed memory footprint (their internal buffer).
The easiest way to determine what caused the OOME is to have the OOME generate a heap dump and then examine that heap dump in a memory profiler.
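If it helps: the simplest route is to start the JVM with -XX:+HeapDumpOnOutOfMemoryError. As a sketch, a dump can also be triggered programmatically through the HotSpot-specific (non-standard) com.sun.management API:

import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;

// Sketch: write an .hprof heap dump that a memory profiler (VisualVM, Eclipse MAT, ...) can open.
static void dumpHeap(String outputFile) throws IOException {
    HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
            ManagementFactory.getPlatformMBeanServer(),
            "com.sun.management:type=HotSpotDiagnostic",
            HotSpotDiagnosticMXBean.class);
    bean.dumpHeap(outputFile, true);   // true = dump only live (reachable) objects
}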
The Problem
I would like to upload very large files (up to 5 or 6 GB) to a web server using the HttpClient class (4.1.2) from Apache. Before sending these files, I break them into smaller chunks (100 MB, for example). Unfortunately, all of the examples I see for doing a multi-part POST using HttpClient appear to buffer the file contents before sending them (typically, a small file size is assumed). Here is such an example:
HttpClient httpclient = new DefaultHttpClient();
HttpPost post = new HttpPost("http://www.example.com/upload.php");
MultipartEntity mpe = new MultipartEntity();
// Here are some plain-text fields as a part of our multi-part upload
mpe.addPart("chunkIndex", new StringBody(Integer.toString(chunkIndex)));
mpe.addPart("fileName", new StringBody(somefile.getName()));
// Now for a file to include; looks like we're including the whole thing!
FileBody bin = new FileBody(new File("/path/to/myfile.bin"));
mpe.addPart("myFile", bin);
post.setEntity(mpe);
HttpResponse response = httpclient.execute(post);
In this example, it looks like we create a new FileBody object and add it to the MultipartEntity. In my case, where the file could be 100 MB in size, I'd rather not buffer all of that data at once. I'd like to be able to write out that data in smaller chunks (4 MB at a time, for example), eventually writing all 100 MB. I'm able to do this using the HTTPURLConnection class from Java (by writing directly to the output stream), but that class has its own set of problems, which is why I'm trying to use the Apache offerings.
My Question
Is it possible to write 100 MB of data to an HttpClient, but in smaller, iterative chunks? I don't want the client to have to buffer up to 100 MB of data before actually doing the POST. None of the examples I see seem to allow you to write directly to the output stream; they all appear to pre-package things before the execute() call.
Any tips would be appreciated!
--- Update ---
For clarification, here's what I did previously with the HTTPURLConnection class. I'm trying to figure out how to do something similar in HttpClient:
// Get the connection's output stream
out = new DataOutputStream(conn.getOutputStream());
// Write some plain-text multi-part data
out.writeBytes(fieldBuffer.toString());
// Figure out how many loops we'll need to write the 100 MB chunk
int bufferLoops = (dataLength + (bufferSize - 1)) / bufferSize;
// Open the local file (~5 GB in size) to read the data chunk (100 MB)
raf = new RandomAccessFile(file, "r");
raf.seek(startingOffset); // Position the pointer to the beginning of the chunk
// Keep track of how many bytes we have left to read for this chunk
int bytesLeftToRead = dataLength;
// Write the file data block to the output stream
for(int i=0; i<bufferLoops; i++)
{
// Create an appropriately sized mini-buffer (max 4 MB) for the pieces
// of this chunk we have yet to read
byte[] buffer = (bytesLeftToRead < bufferSize) ?
new byte[bytesLeftToRead] : new byte[bufferSize];
int bytes_read = raf.read(buffer); // Read ~4 MB from the local file
out.write(buffer, 0, bytes_read); // Write that bit to the stream
bytesLeftToRead -= bytes_read;
}
// Write the final boundary
out.writeBytes(finalBoundary);
out.flush();
If I'm understanding your question correctly, your concern is loading the whole file into memory (right?). If that is the case, you should employ streams (such as a FileInputStream). That way, the whole file doesn't get pulled into memory at once.
If that doesn't help, and you still want to divide the file up into chunks, you could code the server to deal with multiple POSTs, concatenating the data as it arrives, and then manually split up the bytes of the file.
Personally, I prefer my first answer, but either way (or neither way if these don't help), Good luck!
Streams are definitely the way to go, I remember doing something similar a while back with some bigger files and it worked perfectly.
All you need is to wrap your custom content generation logic into an HttpEntity implementation. This will give you complete control over the process of content generation and content streaming.
And for the record: MultipartEntity shipped with HttpClient does not buffer file parts in memory prior to writing them out to the connection socket.
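As a rough sketch of that idea, assuming HttpClient/HttpCore 4.x's AbstractHttpEntity (the class name and the offset/length parameters are hypothetical, chosen to match the question's 100 MB chunks written in 4 MB pieces):

import org.apache.http.entity.AbstractHttpEntity;
import java.io.*;

// Sketch: an entity that streams one chunk of a large file in small pieces.
class FileChunkEntity extends AbstractHttpEntity {
    private final File file;
    private final long offset;     // where this chunk starts in the file
    private final long length;     // chunk size, e.g. 100 MB

    FileChunkEntity(File file, long offset, long length) {
        this.file = file;
        this.offset = offset;
        this.length = length;
    }

    @Override public boolean isRepeatable() { return true; }
    @Override public long getContentLength() { return length; }
    @Override public boolean isStreaming() { return false; }

    @Override
    public InputStream getContent() throws IOException {
        throw new UnsupportedOperationException();   // not needed for outgoing requests
    }

    @Override
    public void writeTo(OutputStream out) throws IOException {
        byte[] buffer = new byte[4 * 1024 * 1024];   // 4 MB pieces, as in the question
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(offset);
            long remaining = length;
            while (remaining > 0) {
                int read = raf.read(buffer, 0, (int) Math.min(buffer.length, remaining));
                if (read == -1) break;
                out.write(buffer, 0, read);
                remaining -= read;
            }
        }
    }
}

Passing an instance via post.setEntity(...) means writeTo is only invoked while HttpClient writes the request to the socket, so at most one 4 MB piece is held in memory at a time.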
I have a web application with an applet that copies a file packaged with the applet to the client machine.
When I deploy it to the web server and use: InputStream in = getClass().getResourceAsStream("filename");
in.available() always returns 8192 bytes for every file I tried, which means the file is corrupted when it is copied to the client computer.
The InputStream is of type HttpInputStream (sun.net.protocol.http.HttpUrlConnection$httpInputStream). But when I test the applet in the applet viewer, the files are copied fine, and the InputStream returned is of type BufferedInputStream, which reports the file's size in bytes. I guess that getResourceAsStream uses a BufferedInputStream for the file system and an HttpInputStream over the HTTP protocol.
How will I copy the file completely? Is there a size limit for HttpInputStream?
Thanks a lot.
in.available() tells you how many bytes you can read without blocking, not the total number of bytes you can read from a stream.
Here's an example of copying an InputStream to an OutputStream from org.apache.commons.io.IOUtils:
public static long copyLarge(InputStream input, OutputStream output)
throws IOException {
byte[] buffer = new byte[DEFAULT_BUFFER_SIZE];
long count = 0;
int n = 0;
while (-1 != (n = input.read(buffer))) {
output.write(buffer, 0, n);
count += n;
}
return count;
}
in.available() always returns 8192 bytes for every file I tried, which means the file is corrupted when it is copied to the client computer.
It does not mean that at all!
The in.available() method returns the number of bytes that can be read without blocking. It is not the length of the stream. In general, there is no way to determine the length of an InputStream apart from reading (or skipping) all bytes in the stream.
(You may have observed that new FileInputStream("someFile").available() usually gives you the file size. But that behaviour is not guaranteed by the spec, and is certainly untrue for some kinds of file, and possibly for some kinds of file system as well. A better way to get the size of a file is new File("someFile").length(), but even that doesn't work in some cases.)
See #tdavies answer for example code for copying an entire stream's contents. There are also third party libraries that can do this kind of thing; e.g. org.apache.commons.net.io.Util.
EDIT:
Got the directory live. Now there's another issue in sight:
the files in the storage are stored with their DB id as a prefix
to their file names. Of course I don't want the users to see those.
Is there a way to combine the response.redirect and the header settings
for filename and size?
best,
A
Hi again,
new approach:
Is it possible to create an IIS-like virtual directory within Tomcat in order
to avoid streaming and only make use of a header redirect? I played around with
contexts but couldn't get it going...
any ideas?
thx
A
Hi %,
I'm facing a weird issue with the Java heap space which is close
to bringing me to the ropes.
The short version is:
I've written a ContentManagementSystem which needs to handle
huge files (>600 MB) too. Tomcat heap settings:
-Xmx700m
-Xms400m
The issue is that uploading huge files works, even though it's
slow. Downloading files results in a Java heap space exception.
Trying to download a 370 MB file makes Tomcat jump to 500 MB of heap
(which should be OK) and end in a Java heap space exception.
I don't get it, why does upload work and download not?
Here's my download code:
byte[] byt = new byte[1024*1024*2];
int read;
response.setHeader("Content-Disposition", "attachment;filename=\"" + fileName + "\"");
FileInputStream fis = null;
OutputStream os = null;
fis = new FileInputStream(new File(filePath));
os = response.getOutputStream();
BufferedInputStream buffRead = new BufferedInputStream(fis);
while ((read = buffRead.read(byt)) > 0)
{
    os.write(byt, 0, read);
    os.flush();
}
buffRead.close();
os.close();
If I'm getting it right the buffered reader should take care of any
memory issue, right?
Any help would be highly appreciated since I ran out of ideas
Best regards,
W
If I'm getting it right the buffered reader should take care of any memory issue, right?
No, that has nothing to do with memory issues, it's actually unnecessary since you're already using a buffer to read the file. Your problem is with writing, not with reading.
I can't see anything immediately wrong with your code. It looks as though Tomcat is buffering the entire response instead of streaming it. I'm not sure what could cause that.
What does response.getBufferSize() return? And you should try setting response.setContentLength() to the file's size; I vaguely remember that a web container under certain circumstances buffers the entire response in order to determine the content length, so maybe that's what's happening. It's good practice to do it anyway since it enables clients to display the download size and give an ETA for the download.
Try using the setBufferSize and flushBuffer methods of the ServletResponse.
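A minimal sketch combining those suggestions, assuming the question's servlet environment (filePath and fileName as in the question; the 8 KB sizes are illustrative, and setContentLength takes an int, so files beyond 2 GB would need the Content-Length header set directly):

import java.io.*;
import javax.servlet.http.HttpServletResponse;

// Sketch: stream a file download with an explicit content length and a modest response buffer.
static void streamDownload(HttpServletResponse response, String filePath, String fileName)
        throws IOException {
    File download = new File(filePath);
    response.setBufferSize(8 * 1024);                                // must be set before writing the body
    response.setContentType("application/octet-stream");
    response.setHeader("Content-Disposition", "attachment;filename=\"" + fileName + "\"");
    response.setContentLength((int) download.length());              // lets the container stream instead of buffering
    try (InputStream in = new BufferedInputStream(new FileInputStream(download));
         OutputStream out = response.getOutputStream()) {
        byte[] buf = new byte[8 * 1024];
        int read;
        while ((read = in.read(buf)) > 0) {
            out.write(buf, 0, read);
        }
        out.flush();
    }
}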
You'd better use java.nio for that, so you can read resources in parts and free what has already been streamed.
Otherwise, you can end up with memory problems despite the settings you've given the JVM.
My suggestions:
The Quick-n-easy: Use a smaller array! Yes, it loops more, but this will not be a problem. 5 kilobytes is just fine. You'll know if this works adequately for you in minutes.
byte[] byt = new byte[1024*5];
A little bit harder: If you have access to sendfile (like in Tomcat with the Http11NioProtocol -- documentation here), then use it.
A little bit harder, again: Switch your code to Java NIO's FileChannel. I have very similar code running on equally large files with hundreds of concurrent connections and similar memory settings with no problem. NIO is faster than plain old Java streams in these situations because FileChannel.transferTo() can hand the copy to the OS (sendfile), so the data moves from the file to the socket without being copied through user-space buffers. Here is a snippet from my own code base; I've stripped out a lot to show the basics. FileChannel.transferTo() is not guaranteed to send every byte, so it is called in a loop.
WritableByteChannel destination = Channels.newChannel(response.getOutputStream());
FileChannel source = file.getFileInputStream().getChannel();
while (total < length) {
long sent = source.transferTo(start + total, length - total, destination);
total += sent;
}
The following code is able to stream data to the client, allocating only a small buffer (BUFFER_SIZE; a tuning point you may want to adjust):
private static final int OUTPUT_SIZE = 1024 * 1024 * 50; // 50 Mb
private static final int BUFFER_SIZE = 4096;
@Override
protected void doGet(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
String fileName = "42.txt";
// build response headers
response.setStatus(200);
response.setContentLength(OUTPUT_SIZE);
response.setContentType("text/plain");
response.setHeader("Content-Disposition",
"attachment;filename=\"" + fileName + "\"");
response.flushBuffer(); // write HTTP headers to the client
// streaming result
InputStream fileInputStream = new InputStream() { // fake input stream
int i = 0;
@Override
public int read() throws IOException {
if (i++ < OUTPUT_SIZE) {
return 42;
} else {
return -1;
}
}
};
ReadableByteChannel input = Channels.newChannel(fileInputStream);
WritableByteChannel output = Channels.newChannel(
response.getOutputStream());
ByteBuffer buffer = ByteBuffer.allocate(BUFFER_SIZE);
while (input.read(buffer) != -1) {
buffer.flip();
output.write(buffer);
buffer.clear();
}
input.close();
output.close();
}
Are you required to serve files using Tomcat? For this kind of task we have used a separate download mechanism: we chained Apache -> Tomcat -> storage and then added rewrite rules for downloads. That way you bypass Tomcat and Apache serves the file to the client (Apache -> storage). But this works only if you have the files stored as plain files; if you read from a DB or another type of non-file storage, this solution cannot be used. The overall scenario is that you generate download links for files such as domain/binaries/xyz... and write a redirect rule for domain/files using Apache mod_rewrite.
Do you have any filters in the application, or do you use the tcnative library? You could try profiling it with jvisualvm.
Edit: Small remark: note that you open yourself to an HTTP response splitting attack in setHeader if you do not sanitize fileName.
Why don't you use Tomcat's own FileServlet?
It can surely serve files much better than you could possibly imagine.
A 2 MB buffer is way too large! A few KB should be ample. Megabyte-sized objects are a real issue for the garbage collector, since they often need to be treated separately from "normal" objects (normal == much smaller than a heap generation). To optimize I/O, your buffer only needs to be slightly larger than your I/O buffer size, i.e. at least as large as a disk block or network packet.