I wanted to use Base64.java to encode and decode files. Base64.Encoder.wrap(OutputStream) and Base64.Decoder.wrap(InputStream) worked but ran slowly, so I used the following code.
public static void decodeFile(String inputFileName,
                              String outputFileName)
        throws FileNotFoundException, IOException {
    Base64.Decoder decoder = Base64.getDecoder();
    InputStream in = new FileInputStream(inputFileName);
    OutputStream out = new FileOutputStream(outputFileName);
    byte[] inBuff = new byte[BUFF_SIZE]; // final int BUFF_SIZE = 1024;
    byte[] outBuff = null;
    while (in.read(inBuff) > 0) {
        outBuff = decoder.decode(inBuff);
        out.write(outBuff);
    }
    out.flush();
    out.close();
    in.close();
}
However, it always throws
Exception in thread "AWT-EventQueue-0" java.lang.IllegalArgumentException: Input byte array has wrong 4-byte ending unit
at java.util.Base64$Decoder.decode0(Base64.java:704)
at java.util.Base64$Decoder.decode(Base64.java:526)
at Base64Coder.JavaBase64FileCoder.decodeFile(JavaBase64FileCoder.java:69)
...
After I changed final int BUFF_SIZE = 1024; to final int BUFF_SIZE = 3*1024;, the code worked. Since BUFF_SIZE is also used to encode the file, I believe there was something wrong with the encoded file (1024 % 3 = 1, which means padding is added in the middle of the file).
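To make that concrete, here is a small self-contained demo (written for illustration; it is not part of the original code). Encoding in 4-byte chunks puts "=" padding in the middle of the output, while encoding everything in one pass pads only at the end:
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class PaddingDemo {
    public static void main(String[] args) {
        byte[] data = "ABCDEFGH".getBytes(StandardCharsets.US_ASCII); // 8 bytes
        Base64.Encoder enc = Base64.getEncoder();
        // 4 is not a multiple of 3, so each chunk gets its own padding:
        String chunked = enc.encodeToString(Arrays.copyOfRange(data, 0, 4))
                       + enc.encodeToString(Arrays.copyOfRange(data, 4, 8));
        // Encoding everything at once pads only the final group:
        String whole = enc.encodeToString(data);
        System.out.println(chunked); // QUJDRA==RUZHSA==  -> "=" in the middle
        System.out.println(whole);   // QUJDREVGR0g=      -> "=" only at the end
    }
}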
Also, as @Jon Skeet and @Tagir Valeev mentioned, I should not ignore the return value from InputStream.read(). So I modified the code as below.
(However, I have to mention that this code does run much faster than using wrap(). I noticed the speed difference because I had written and intensively used my own Base64 encodeFile()/decodeFile() long before JDK 8 was released. Now my fixed JDK 8 code runs as fast as my original code, so I do not know what is going on with wrap()...)
public static void decodeFile(String inputFileName,
                              String outputFileName)
        throws FileNotFoundException, IOException {
    Base64.Decoder decoder = Base64.getDecoder();
    InputStream in = new FileInputStream(inputFileName);
    OutputStream out = new FileOutputStream(outputFileName);
    byte[] inBuff = new byte[BUFF_SIZE];
    byte[] outBuff = null;
    int bytesRead = 0;
    while (true) {
        bytesRead = in.read(inBuff);
        if (bytesRead == BUFF_SIZE) {
            outBuff = decoder.decode(inBuff);
        } else if (bytesRead > 0) {
            byte[] tempBuff = new byte[bytesRead];
            System.arraycopy(inBuff, 0, tempBuff, 0, bytesRead);
            outBuff = decoder.decode(tempBuff);
        } else {
            out.flush();
            out.close();
            in.close();
            return;
        }
        out.write(outBuff);
    }
}
Special thanks to @Jon Skeet and @Tagir Valeev.
I strongly suspect that the problem is that you're ignoring the return value from InputStream.read, other than to check for the end of the stream. So this:
while (in.read(inBuff) > 0) {
// This always decodes the *complete* buffer
outBuff = decoder.decode(inBuff);
out.write(outBuff);
}
should be
int bytesRead;
while ((bytesRead = in.read(inBuff)) > 0) {
outBuff = decoder.decode(inBuff, 0, bytesRead);
out.write(outBuff);
}
I wouldn't expect this to be any faster than using wrap though.
Try using decoder.wrap(new BufferedInputStream(new FileInputStream(inputFileName))). With buffering it should be at least as fast as your manually crafted version.
As for why your code doesn't work: the last chunk is likely to be shorter than 1024 bytes, but you try to decode the whole byte[] array. See @Jon Skeet's answer for details.
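For reference, a minimal sketch of that wrap()-based approach could look like the following (the 8 KB copy buffer and the try-with-resources structure are my own choices, not from the answer):
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Base64;

public class Base64WrapSketch {
    public static void decodeFile(String inputFileName, String outputFileName) throws IOException {
        try (InputStream in = Base64.getDecoder()
                     .wrap(new BufferedInputStream(new FileInputStream(inputFileName)));
             OutputStream out = new FileOutputStream(outputFileName)) {
            byte[] buffer = new byte[8192];
            int bytesRead;
            while ((bytesRead = in.read(buffer)) > 0) {
                out.write(buffer, 0, bytesRead); // bytes are already decoded by wrap()
            }
        }
    }
}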
Well, I changed
"final int BUFF_SIZE = 1024;"
into
"final int BUFF_SIZE = 1024 * 3;"
It worked!
So, I guess there is probably something wrong with the padding... I mean, when encoding the file (since 1024 % 3 = 1) there must be padding, and that might cause problems when decoding...
You should record the number of bytes you have read. Besides this,
you should make sure your buffer size is divisible by 3, because in Base64 every 3 input bytes produce 4 output characters (64 is 2^6, and 3*8 equals 4*6). That way you avoid padding problems (your output will not have a stray "=" ending in the middle).
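To illustrate the encode side, here is a rough sketch (not the original poster's code; the class name and buffer size are placeholders) that fills a buffer whose length is a multiple of 3 before encoding, so "=" can only appear at the very end of the output:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;
import java.util.Base64;

public class Base64EncodeSketch {
    // A multiple of 3, so a completely filled chunk never needs padding.
    private static final int BUFF_SIZE = 3 * 1024;

    public static void encodeFile(String inputFileName, String outputFileName) throws IOException {
        Base64.Encoder encoder = Base64.getEncoder();
        try (InputStream in = new FileInputStream(inputFileName);
             OutputStream out = new FileOutputStream(outputFileName)) {
            byte[] inBuff = new byte[BUFF_SIZE];
            while (true) {
                int filled = 0;
                while (filled < BUFF_SIZE) { // read() may return fewer bytes, so fill the chunk
                    int n = in.read(inBuff, filled, BUFF_SIZE - filled);
                    if (n == -1) {
                        break;
                    }
                    filled += n;
                }
                if (filled == 0) {
                    break; // nothing left to encode
                }
                // Only the final, short chunk can produce "=" padding.
                out.write(encoder.encode(Arrays.copyOf(inBuff, filled)));
                if (filled < BUFF_SIZE) {
                    break; // EOF reached
                }
            }
        }
    }
}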
Related
I have a code block to read the given number of bytes from an InputStream and return a byte[] using a ByteArrayOutputStream. When I write that byte[] array to a file, the resulting file on the filesystem seems broken. Can anyone help me find the problem in the code block below?
public byte[] readWrite(long bytes, InputStream in) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    int maxReadBufferSize = 8 * 1024; // 8KB
    long numReads = bytes / maxReadBufferSize;
    long numRemainingRead = bytes % maxReadBufferSize;
    for (int i = 0; i < numReads; i++) {
        byte bufr[] = new byte[maxReadBufferSize];
        int val = in.read(bufr, 0, bufr.length);
        if (val != -1) {
            bos.write(bufr);
        }
    }
    if (numRemainingRead > 0) {
        byte bufr[] = new byte[(int) numRemainingRead];
        int val = in.read(bufr, 0, bufr.length);
        if (val != -1) {
            bos.write(bufr);
        }
    }
    return bos.toByteArray();
}
My understanding of the problem statement
Read bytes number of bytes from the given InputStream in a ByteArrayOutputStream.
Finally, return a byte array.
Key observations
A lot of work is done to make sure bytes are read in chunks of 8KB.
Also, the last remaining chunk of odd size is read separately.
A lot of work is also done to make sure we are reading from the correct offset.
My views
Unless we are reading a very large file (>10MB) I don't see a valid reason for reading in chunks of 8KB.
Let Java libraries do all the hard work of maintaining offset and making sure we don't read outside limits.
E.g. we don't have to pass an offset; simply call inputStream.read(b) over and over, and the next chunk of up to b.length bytes will be read. Similarly, we can simply write to the outputStream.
Code
public byte[] readWrite(long bytes, InputStream in) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    byte[] buffer = new byte[(int) bytes];
    in.read(buffer); // note: a single read() may return fewer bytes than requested
    bos.write(buffer);
    return bos.toByteArray();
}
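If a single read() returning fewer bytes than requested is a concern (it can be, e.g. with sockets), a variant along these lines keeps the same spirit while guaranteeing the requested count. This sketch assumes Java 9+ for InputStream.readNBytes and is not part of the original answer:
import java.io.IOException;
import java.io.InputStream;

public class ReadExactSketch {
    public static byte[] readWrite(long bytes, InputStream in) throws IOException {
        byte[] buffer = new byte[(int) bytes];
        // readNBytes keeps reading until 'bytes' bytes arrive or EOF is hit,
        // so a short read cannot silently truncate the result.
        int read = in.readNBytes(buffer, 0, buffer.length);
        if (read < bytes) {
            throw new IOException("Stream ended after " + read + " of " + bytes + " bytes");
        }
        return buffer;
    }
}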
References
About InputStreams
Byte Array to Human Readable Format
I am trying several ways to decode the bytes of a file into characters.
Using java.io.Reader and Channels.newReader(...)
public static void decodeWithReader() throws Exception {
    FileInputStream fis = new FileInputStream(FILE);
    FileChannel channel = fis.getChannel();
    CharsetDecoder decoder = Charset.defaultCharset().newDecoder();
    Reader reader = Channels.newReader(channel, decoder, -1);
    final char[] buffer = new char[4096];
    for (;;) {
        if (-1 == reader.read(buffer)) {
            break;
        }
    }
    fis.close();
}
Using buffers and a decoder manually:
public static void readWithBuffers() throws Exception {
    FileInputStream fis = new FileInputStream(FILE);
    FileChannel channel = fis.getChannel();
    CharsetDecoder decoder = Charset.defaultCharset().newDecoder();
    final long fileLength = channel.size();
    long position = 0;
    final int bufferSize = 1024 * 1024; // 1MB
    CharBuffer cbuf = CharBuffer.allocate(4096);
    while (position < fileLength) {
        MappedByteBuffer bbuf = channel.map(MapMode.READ_ONLY, position, Math.min(bufferSize, fileLength - position));
        for (;;) {
            CoderResult res = decoder.decode(bbuf, cbuf, false);
            if (CoderResult.OVERFLOW == res) {
                cbuf.clear();
            } else if (CoderResult.UNDERFLOW == res) {
                break;
            }
        }
        position += bbuf.position();
    }
    fis.close();
}
For a 200MB text file, the first approach consistently takes 300ms to complete. The second approach consistently takes 700ms. Do you have any idea why the reader approach is so much faster?
Can it run even faster with another implementation?
The benchmark is performed on Windows 7, and JDK7_07.
For comparison, can you try the following?
public static void readWithBuffersISO_8859_1() throws Exception {
    FileInputStream fis = new FileInputStream(FILE);
    FileChannel channel = fis.getChannel();
    MappedByteBuffer bbuf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    while (bbuf.remaining() > 0) {
        char ch = (char) (bbuf.get() & 0xFF);
    }
    fis.close();
}
This assumes ISO-8859-1. If you want maximum speed, treating the text like a binary format can help, if that's an option.
As @EJP points out, you are changing a number of things at once, so you need to start with the simplest comparable example and see how much difference each element adds.
Here is a third implementation that does not use mapped buffers. Under the same conditions as before, it runs consistently in 220ms. The default charset on my machine is "windows-1252"; if I force the simpler "ISO-8859-1" charset the decoding is even faster (about 150ms).
It looks like using native features such as mapped buffers actually hurts performance (for this particular use case). Also interesting: if I allocate direct buffers instead of heap buffers (see the commented lines), performance drops (a run then takes around 400ms).
So far the answer seems to be: to decode characters as fast as possible in Java (provided you can't enforce the use of a single charset), use a decoder manually, write the decode loop with heap buffers, and do not use mapped or direct buffers. I have to admit that I don't really know why that is.
public static void readWithBuffers() throws Exception {
    FileInputStream fis = new FileInputStream(FILE);
    FileChannel channel = fis.getChannel();
    CharsetDecoder decoder = Charset.defaultCharset().newDecoder();
    // CharsetDecoder decoder = Charset.forName("ISO-8859-1").newDecoder();
    ByteBuffer bbuf = ByteBuffer.allocate(4096);
    // ByteBuffer bbuf = ByteBuffer.allocateDirect(4096);
    CharBuffer cbuf = CharBuffer.allocate(4096);
    // CharBuffer cbuf = ByteBuffer.allocateDirect(2 * 4096).asCharBuffer();
    for (;;) {
        if (-1 == channel.read(bbuf)) {
            decoder.decode(bbuf, cbuf, true);
            decoder.flush(cbuf);
            break;
        }
        bbuf.flip();
        CoderResult res = decoder.decode(bbuf, cbuf, false);
        if (CoderResult.OVERFLOW == res) {
            cbuf.clear();
        } else if (CoderResult.UNDERFLOW == res) {
            bbuf.compact();
        }
    }
    fis.close();
}
I have an InputStream that I want written to a HttpServletResponse.
There's this approach, which takes too long due to the use of byte[]
InputStream is = getInputStream();
int contentLength = getContentLength();
byte[] data = new byte[contentLength];
is.read(data);
//response here is the HttpServletResponse object
response.setContentLength(contentLength);
response.getOutputStream().write(data);
I was wondering what could possibly be the best way to do it, in terms of speed and efficiency.
Just write in blocks instead of copying it entirely into Java's memory first. The below basic example writes it in blocks of 10KB. This way you end up with a consistent memory usage of only 10KB instead of the complete content length. Also the enduser will start getting parts of the content much sooner.
response.setContentLength(getContentLength());
byte[] buffer = new byte[10240];
try (
InputStream input = getInputStream();
OutputStream output = response.getOutputStream();
) {
for (int length = 0; (length = input.read(buffer)) > 0;) {
output.write(buffer, 0, length);
}
}
As the crème de la crème with regard to performance, you could use NIO channels and a directly allocated ByteBuffer. Create the following utility/helper method in some custom utility class, e.g. Utils:
public static long stream(InputStream input, OutputStream output) throws IOException {
try (
ReadableByteChannel inputChannel = Channels.newChannel(input);
WritableByteChannel outputChannel = Channels.newChannel(output);
) {
ByteBuffer buffer = ByteBuffer.allocateDirect(10240);
long size = 0;
while (inputChannel.read(buffer) != -1) {
buffer.flip();
size += outputChannel.write(buffer);
buffer.clear();
}
return size;
}
}
Which you then use as below:
response.setContentLength(getContentLength());
Utils.stream(getInputStream(), response.getOutputStream());
BufferedInputStream in = null;
BufferedOutputStream out = null;
OutputStream os;
os = new BufferedOutputStream(response.getOutputStream());
in = new BufferedInputStream(new FileInputStream(file));
out = new BufferedOutputStream(os);
byte[] buffer = new byte[1024 * 8];
int j = -1;
while ((j = in.read(buffer)) != -1) {
out.write(buffer, 0, j);
}
I think that is very close to the best way, but I would suggest the following change: use a fixed-size buffer (say 20 KB) and do the read/write in a loop.
For the loop do something like
byte[] buffer=new byte[20*1024];
outputStream=response.getOutputStream();
while(true) {
int readSize=is.read(buffer);
if(readSize==-1)
break;
outputStream.write(buffer,0,readSize);
}
PS: Your program will not always work as-is, because read() doesn't always fill up the entire array you give it.
I am getting an OutOfMemory exception. Why? I am using this code for logging. Is this approach correct?
Exceptions and closing of streams are handled in parent methods.
private static void writeToFile(File file, FileWriter out, String message) throws IOException {
    if (file.exists() && file.isFile()) {
        if ((file.length() + message.getBytes().length) <= FILE_MAX_SIZE_B) {
            out.write(message);
        } else {
            int cutLenght = (int) (file.length() + message.getBytes().length - FILE_MAX_SIZE_B);
            FileInputStream fileInputStream = new FileInputStream(file);
            BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(fileInputStream));
            char[] buf = new char[1024];
            int numRead = 0;
            StringBuffer text = new StringBuffer(1000);
            while ((numRead = bufferedReader.read(buf)) != -1) {
                text.append(buf, 0, numRead);
            }
            String result = new String(text).substring(cutLenght);
            result += message;
            FileWriter fileWriter = new FileWriter(file, appendToFile);
            writeToFile(file, fileWriter, result);
            bufferedReader.close();
        }
    }
}
EDIT:
I am using this method to write my logs to a file, so in one second it may be called 10 times. I am getting the error on these lines:
while ((numRead=bufferedReader.read(buf)) != -1) {
text.append(buf,0,numRead);
}
My guess is that you are getting the OutOfMemoryError because you are reading the entire contents of the log file back into memory once it has gotten too close to its maximum size.
You could instead read and write it in smaller chunks, but that could be tricky since you have to avoid overwriting something you haven't already read.
Overall, this technique seems like a very inefficient method of maintaining the log data. Some alternative approaches off the top of my head:
(1) maintain a set of n log files, each with maximum size FILE_MAX_SIZE_B/n. When the first log fills up, open the next one for writing, and so on; when the last one fills up, go back to the first one. In this way you are discarding some of the oldest log data each time you switch files, but not all of it, while still maintaining your overall size limit. (A rough sketch of this idea follows the list.)
(2) rotate the data within a single file. After each write, add a marker that indicates this is the end of the log stream. When the file has reached its maximum size, just start again at the beginning, overwriting the data that is there. The marker will tell you where the latest message is.
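A hypothetical sketch of approach (1); the RotatingLogWriter name, file naming scheme, and size handling are illustrative assumptions, not taken from this answer:
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

class RotatingLogWriter {
    private final File[] files;
    private final long perFileLimit;
    private int current = 0;

    RotatingLogWriter(String baseName, int n, long maxTotalSize) {
        files = new File[n];
        for (int i = 0; i < n; i++) {
            files[i] = new File(baseName + "." + i + ".log");
        }
        perFileLimit = maxTotalSize / n;
    }

    synchronized void write(String message) throws IOException {
        File f = files[current];
        if (f.length() + message.getBytes().length > perFileLimit) {
            current = (current + 1) % files.length; // move on to the next slot
            f = files[current];
            f.delete();                             // discard that slot's old contents
        }
        try (FileWriter out = new FileWriter(f, true)) { // append
            out.write(message);
        }
    }
}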
Try something like this:
void appendToFile(File f, CharSequence message, Charset cs, long maximumSize) throws IOException {
    long available = maximumSize - f.length();
    if (available > 0) {
        FileOutputStream fos = new FileOutputStream(f, true);
        try {
            CharBuffer chars = CharBuffer.wrap(message);
            ByteBuffer bytes = ByteBuffer.allocate(8 * 1024); // Re-used when encoding the string
            CharsetEncoder enc = cs.newEncoder();
            CoderResult res;
            do {
                res = enc.encode(chars, bytes, true);
                bytes.flip();
                long len = Math.min(available, bytes.remaining());
                available -= len;
                fos.write(bytes.array(), bytes.position(), (int) len);
                bytes.clear();
            } while (res == CoderResult.OVERFLOW && available > 0);
        } finally {
            fos.close();
        }
    }
}
Testable with this:
File f = new File(getCacheDir(), "tmp.txt");
f.delete();
// Or whatever charset you want.
Charset cs = Charset.forName("UTF-8");
int maxlen = 2 * 1024; // For this test, 2kb
try {
for (int i = 0; i < maxlen / 20; i++) {
// Write 30 characters for maxlen/20 times == guaranteed overflow
appendToFile(f, "123456789012345678901234567890", cs, maxlen);
System.out.println("Length=" + f.length());
}
} catch (Throwable t) {
t.printStackTrace();
}
f.delete();
Well, you're getting OOM because you're trying to load a huge file into memory.
Did you try opening it with append option instead?
You get the OOME because you load the whole file and then take part of the string. Instead, do a skip() on your input stream and read from there.
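To make the "skip, then read" idea concrete, here is a rough sketch (the method and variable names are illustrative, not from the original answer): it skips the bytes that are to be cut off, streams the surviving tail into a temporary file in small chunks, appends the new message, and then swaps the files, so the whole log never has to fit in memory.
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class LogTrimSketch {
    static void trimAndAppend(File file, long cutLength, String message) throws IOException {
        File tmp = new File(file.getAbsolutePath() + ".tmp");
        try (InputStream in = new BufferedInputStream(new FileInputStream(file));
             OutputStream out = new BufferedOutputStream(new FileOutputStream(tmp))) {
            long remaining = cutLength;
            while (remaining > 0) {             // skip() may skip fewer bytes than asked
                long n = in.skip(remaining);
                if (n <= 0) {
                    break;
                }
                remaining -= n;
            }
            byte[] buf = new byte[8 * 1024];
            int read;
            while ((read = in.read(buf)) > 0) { // copy the surviving tail in chunks
                out.write(buf, 0, read);
            }
            out.write(message.getBytes());      // append the new message
        }
        if (!file.delete() || !tmp.renameTo(file)) {
            throw new IOException("Could not replace " + file);
        }
    }
}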
I'm dealing with the following code that is used to split a large file into a set of smaller files:
try {
    FileInputStream input = new FileInputStream(this.fileToSplit);
    BufferedInputStream iBuff = new BufferedInputStream(input);
    int i = 0;
    FileOutputStream output = new FileOutputStream(fileArr[i]);
    BufferedOutputStream oBuff = new BufferedOutputStream(output);
    int buffSize = 8192;
    byte[] buffer = new byte[buffSize];
    while (true) {
        if (iBuff.available() < buffSize) {
            byte[] newBuff = new byte[iBuff.available()];
            iBuff.read(newBuff);
            oBuff.write(newBuff);
            oBuff.flush();
            oBuff.close();
            break;
        }
        int r = iBuff.read(buffer);
        if (fileArr[i].length() >= this.partSize) {
            oBuff.flush();
            oBuff.close();
            ++i;
            output = new FileOutputStream(fileArr[i]);
            oBuff = new BufferedOutputStream(output);
        }
        oBuff.write(buffer);
    }
} catch (Exception e) {
    e.printStackTrace();
}
This is the weird behavior I'm seeing... when I run this code using a 3GB file, the initial iBuff.available() call returns a value of approximately 2,100,000,000 and the code works fine. When I run this code on a 12GB file, the initial iBuff.available() call only returns a value of 200,000,000 (which is smaller than the split file size of 500,000,000 and causes the processing to go awry).
I'm thinking this discrepancy in behavior has something to do with the fact that this is on 32-bit Windows. I'm going to run a couple more tests on a 4.5 GB file and a 3.5 GB file. If the 3.5 GB file works and the 4.5 GB one doesn't, that will further confirm the theory that it's a 32-bit vs 64-bit issue, since 4 GB would then be the threshold.
Well if you read the javadoc it quite clearly states:
Returns the number of bytes that can be read from this input stream without blocking (emphasis added by me)
So it's quite clear that this method does not offer what you want. And depending on the underlying InputStream you may run into problems much earlier (e.g. a stream over the network where the server doesn't report the file size: to return the "correct" available() count you'd have to read and buffer the complete file, which would take a lot of time; what if you only want to read a header?).
So the correct way to handle this is to change your parsing method so it can handle the file in pieces. Personally, I don't see much reason to use available() here at all; just calling read() and stopping as soon as read() returns -1 should work fine. It can be made more complicated if you want to ensure that every file really contains blockSize bytes; just add an internal loop if that scenario is important (a sketch of that variant follows the snippet below).
int blockSize = XXX;
byte[] buffer = new byte[blockSize];
int i = 0;
int read = in.read(buffer);
while(read != -1) {
out[i++].write(buffer, 0, read);
read = in.read(buffer);
}
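For completeness, here is a rough sketch of the "internal loop" variant mentioned above, which keeps reading until the buffer is completely full so that every output file except possibly the last contains exactly blockSize bytes. It reuses the in and out names from the snippet above and is not part of the original answer:
int blockSize = 8192;
byte[] buffer = new byte[blockSize];
int i = 0;
while (true) {
    int filled = 0;
    while (filled < blockSize) {       // inner loop: fill the buffer completely
        int read = in.read(buffer, filled, blockSize - filled);
        if (read == -1) {
            break;                     // EOF
        }
        filled += read;
    }
    if (filled == 0) {
        break;                         // nothing left to write
    }
    out[i++].write(buffer, 0, filled);
    if (filled < blockSize) {
        break;                         // last, short fragment written
    }
}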
There are few correct uses of available(), and this isn't one of them. You don't need all that junk. Memorize this:
int count;
byte[] buffer = new byte[8192]; // or more
while ((count = in.read(buffer)) > 0)
out.write(buffer, 0, count);
That's the canonical way to copy a stream in Java.
You should not use the InputStream.available() function at all. It is only needed in very special circumstances.
You should also not create byte arrays that are larger than 1 MB. It's a waste of memory. The commonly accepted way is to read a small block (4 kB up to 1 MB) from the source file and then store only as many bytes as you have read in the destination file. Do that until you have reached the end of the source file.
available() isn't a measure of how much is still to be read, but rather of how much you are guaranteed to be able to read before the stream might hit EOF or block waiting for input.
Also, put the close() calls in finally blocks.
BufferedInputStream iBuff = new BufferedInputStream(input);
int i = 0;
FileOutputStream output;
BufferedOutputStream oBuff = null;
try {
    int buffSize = 8192;
    int offset = 0;
    byte[] buffer = new byte[buffSize];
    while (true) {
        int len = iBuff.read(buffer, offset, buffSize - offset);
        if (len == -1) { // EOF: write out the last, partial chunk
            if (offset > 0) {
                try {
                    oBuff = new BufferedOutputStream(new FileOutputStream(fileArr[i]));
                    oBuff.write(buffer, 0, offset);
                } finally {
                    if (oBuff != null) oBuff.close();
                }
            }
            break;
        }
        offset += len;
        if (offset == buffSize) { // buffer full: write it out to the next file
            try {
                output = new FileOutputStream(fileArr[i]);
                oBuff = new BufferedOutputStream(output);
                oBuff.write(buffer);
            } finally {
                oBuff.close();
            }
            ++i;
            offset = 0;
        }
    } // while
} finally {
    iBuff.close();
}
Here is some code that splits a file. If performance is critical to you, you can experiment with the buffer size.
package so6164853;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Formatter;
public class FileSplitter {
private static String printf(String fmt, Object... args) {
Formatter formatter = new Formatter();
formatter.format(fmt, args);
return formatter.out().toString();
}
/**
* @param outputPattern see {@link Formatter}
*/
public static void splitFile(String inputFilename, long fragmentSize, String outputPattern) throws IOException {
InputStream input = new FileInputStream(inputFilename);
try {
byte[] buffer = new byte[65536];
int outputFileNo = 0;
OutputStream output = null;
long writtenToOutput = 0;
try {
while (true) {
int bytesToRead = buffer.length;
if (bytesToRead > fragmentSize - writtenToOutput) {
bytesToRead = (int) (fragmentSize - writtenToOutput);
}
int bytesRead = input.read(buffer, 0, bytesToRead);
if (bytesRead != -1) {
if (output == null) {
String outputName = printf(outputPattern, outputFileNo);
outputFileNo++;
output = new FileOutputStream(outputName);
writtenToOutput = 0;
}
output.write(buffer, 0, bytesRead);
writtenToOutput += bytesRead;
}
if (output != null && (bytesRead == -1 || writtenToOutput == fragmentSize)) {
output.close();
output = null;
}
if (bytesRead == -1) {
break;
}
}
} finally {
if (output != null) {
output.close();
}
}
} finally {
input.close();
}
}
public static void main(String[] args) throws IOException {
splitFile("d:/backup.zip", 1440 << 10, "d:/backup.zip.part%04d");
}
}
Some remarks:
Only those bytes that have actually been read from the input file are written to one of the output files.
I left out the BufferedInputStream and BufferedOutputStream since their buffer size is only 8192 bytes, which is less than the buffer I use in the code.
As soon as I open a file, I make sure that it will be closed at the end, no matter what happens. (The finally blocks.)
The code contains only one call to input.read and only one call to output.write. This makes it easier to check for correctness.
The code for splitting a file does not catch the IOException, since it doesn't know what to do in such a case. It is just passed to the caller; maybe the caller knows how to handle it.
Both @ratchet and @Voo are correct.
As for what is happening.
int max value is 2,147,483,647 (http://download.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html).
14 gigabytes is 15,032,385,536, which clearly doesn't fit in an int.
See that according to the API Javadoc (http://download.oracle.com/javase/6/docs/api/java/io/BufferedInputStream.html#available%28%29), and as stated by @Voo, this doesn't break the method contract at all (it just isn't what you are looking for).
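To make the overflow concrete, here is a tiny demo (not from the original answer) showing what happens when a 14 GB length is narrowed to an int:
public class IntOverflowDemo {
    public static void main(String[] args) {
        long fourteenGiB = 14L * 1024 * 1024 * 1024; // 15,032,385,536
        System.out.println(fourteenGiB);             // 15032385536
        // Narrowing keeps only the low 32 bits:
        System.out.println((int) fourteenGiB);       // -2147483648
        System.out.println(Integer.MAX_VALUE);       // 2147483647
    }
}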