Here is how I compressed the string into a file:
public static void compressRawText(File outFile, String src) {
FileOutputStream fo = null;
GZIPOutputStream gz = null;
try {
fo = new FileOutputStream(outFile);
gz = new GZIPOutputStream(fo);
gz.write(src.getBytes());
gz.flush();
} catch (Exception e) {
e.printStackTrace();
} finally {
try {
gz.close();
fo.close();
} catch (Exception e) {
e.printStackTrace();
}
}
}
Here is how I decompressed it:
static int BUFFER_SIZE = 8 * 1024;
static int STRING_SIZE = 2 * 1024 * 1024;
public static String decompressRawText(File inFile) {
InputStream in = null;
InputStreamReader isr = null;
StringBuilder sb = new StringBuilder(STRING_SIZE);//constant resizing is costly, so set the STRING_SIZE
try {
in = new FileInputStream(inFile);
in = new BufferedInputStream(in, BUFFER_SIZE);
in = new GZIPInputStream(in, BUFFER_SIZE);
isr = new InputStreamReader(in);
char[] cbuf = new char[BUFFER_SIZE];
int length = 0;
while ((length = isr.read(cbuf)) != -1) {
sb.append(cbuf, 0, length);
}
} catch (Exception e) {
e.printStackTrace();
} finally {
try {
in.close();
} catch (Exception e1) {
e1.printStackTrace();
}
}
return sb.toString();
}
The decompression seems to take forever to do. I have got a feeling that I am doing too much redundant steps in the decompression bit. any idea of how I could speed it up?
EDIT: have modified the code to the above based on the following given recommendations,
1. I chaged the pattern, so to simply my code a bit, but if I couldn't use IOUtils is this still ok to use this pattern?
2. I set the StringBuilder buffer to be of 2M, as suggested by entonio, should I set it to be a little bit more? the memory is still OK, I still have around 10M available as it is suggested by the heap monitor from eclipse
3. I cut the BufferedReader and added a BufferedInputStream, but I am still not sure about the BUFFER_SIZE, any suggestions?
The above modification has improved the time taken to loop all my 30 2M files from almost 30 seconds to around 14, but I need to reduce it to under 10, is it even possible on android? Ok, basically, I need to process a text file in all 60M, I have divided them up into 30 2M, and before I start processing on each strings, I did the above timing on the time cost for me just to loop all the files and get the String in the file into my memory. Since I don't have much experience, will it be better, if I use 60 of 1M files instead? or any other improvement should I adopt? Thanks.
ALSO: Since physical IO is quite time consuming, and since my compressed version of files are all quite small(around 2K from 2M of text), is it possible for me to still do the above, but on a file that is already mapped to memory? possibly using java NIO? Thanks
The BufferedReader's only purpose is the readLine() method you don't use, so why not just read from the InputStreamReader? Also, maybe decreasing the buffer size may be helpful. Also, you should probably specify the encoding while both reading and writing, though that shouldn't have an impact on performance.
edit: more data
If you know the size of the string ahead, you should add a length parameter to decompressRawText and use it to initialise the StringBuilder. Otherwise it will be constantly resized in order to accomodate the result, and that's costly.
edit: clarification
2MB implies a lot of resizes. There is no harm if you specify a capacity higher than the length you end up with after reading (other than temporarily using more memory, of course).
You should wrap the FileInputStream with a BufferedInputStream before wrapping with a GZipInputStream, rather than using a BufferedReader.
The reason is that, depending on implementation, any of the various input classes in your decoration hierarchy could decide to read on a byte-by-byte basis (and I'd say the InputStreamReader is most likely to do this). And that would translate into many read(2) calls once it gets to the FileInputStream.
Of course, this may just be superstition on my part. But, if you're running on Linux, you can always test with strace.
Edit: once nice pattern to follow when building up a bunch of stream delegates is to use a single InputStream variable. Then, you only have one thing to close in your finally block (and can use Jakarta Commons IOUtils to avoid lots of nested try-catch-finally blocks).
InputStream in = null;
try
{
in = new FileInputStream("foo");
in = new BufferedInputStream(in);
in = new GZIPInputStream(in);
// do something with the stream
}
finally
{
IOUtils.closeQuietly(in);
}
Add a BufferedInputStream between the FileInputStream and the GZIPInputStream.
Similarly when writing.
Related
I am working on the application which reads large amounts of data from a file. Basically, I have a huge file (around 1.5 - 2 gigs) containing different objects (~5 to 10 millions of them per file). I need to read all of them and put them to different maps in the app. The problem is that the app runs out of memory while reading the objects at some point. Only when I set it to use -Xmx4096m - it can handle the file. But if the file will be larger, it won't be able to do that anymore.
Here's the code snippet:
String sampleFileName = "sample.file";
FileInputStream fileInputStream = null;
ObjectInputStream objectInputStream = null;
try{
fileInputStream = new FileInputStream(new File(sampleFileName));
int bufferSize = 16 * 1024;
objectInputStream = new ObjectInputStream(new BufferedInputStream(fileInputStream, bufferSize));
while (true){
try{
Object objectToRead = objectInputStream.readUnshared();
if (objectToRead == null){
break;
}
// doing something with the object
}catch (EOFException eofe){
eofe.printStackTrace();
break;
} catch (Exception e) {
e.printStackTrace();
continue;
}
}
} catch (Exception e){
e.printStackTrace();
}finally{
if (objectInputStream != null){
try{
objectInputStream.close();
}catch (Exception e2){
e2.printStackTrace();
}
}
if (fileInputStream != null){
try{
fileInputStream.close();
}catch (Exception e2){
e2.printStackTrace();
}
}
}
First of all, I was using objectInputStream.readObject() instead of objectInputStream.readUnshared(), so it solved the issue partially. When I increased the memory from 2048 to 4096, it started parsing the file. BufferedInputStream is already in use. From the web I've found only examples how to read lines or bytes, but nothing regarding objects, performance wise.
How can I read the file without increasing the memory for JVM and avoiding the OutOfMemory exception? Is there any way to read objects from the file, not keeping anything else in the memory?
When reading big files, parsing objects and keeping them in memory there are several solutions with several tradeoffs:
You can fit all parsed objects into memory for that app deployed on one server. It either requires to store all objects in very zipped way, for example using byte or integer to store 2 numbers or some kind of shifting in other data structures. In other words fitting all objects in possible minimum space. Or increase memory for that server(scale vertically)
a) However reading the files can take too much memory, so you have to read them in chunks. For example this is what I was doing with json files:
JsonReader reader = new JsonReader(new InputStreamReader(in, "UTF-8"));
if (reader.hasNext()) {
reader.beginObject();
String name = reader.nextName();
if ("content".equals(name)) {
reader.beginArray();
parseContentJsonArray(reader, name2ContentMap);
reader.endArray();
}
name = reader.nextName();
if ("ad".equals(name)) {
reader.beginArray();
parsePrerollJsonArray(reader, prerollMap);
reader.endArray();
}
}
The idea is to have a way to identify when certain object starts and ends and read only that part.
b) You can also split files to smaller ones at the source if you can, then it will be easier to read them.
You can't fit all parsed objects for that app on one server. In this case you have to shard based on some object property. For example split data based on US state into multiple servers.
Hopefully it helps in your solution.
I was trying to read a file into an array by using FileInputStream, and an ~800KB file took about 3 seconds to read into memory. I then tried the same code except with the FileInputStream wrapped into a BufferedInputStream and it took about 76 milliseconds. Why is reading a file byte by byte done so much faster with a BufferedInputStream even though I'm still reading it byte by byte? Here's the code (the rest of the code is entirely irrelevant). Note that this is the "fast" code. You can just remove the BufferedInputStream if you want the "slow" code:
InputStream is = null;
try {
is = new BufferedInputStream(new FileInputStream(file));
int[] fileArr = new int[(int) file.length()];
for (int i = 0, temp = 0; (temp = is.read()) != -1; i++) {
fileArr[i] = temp;
}
BufferedInputStream is over 30 times faster. Far more than that. So, why is this, and is it possible to make this code more efficient (without using any external libraries)?
In FileInputStream, the method read() reads a single byte. From the source code:
/**
* Reads a byte of data from this input stream. This method blocks
* if no input is yet available.
*
* #return the next byte of data, or <code>-1</code> if the end of the
* file is reached.
* #exception IOException if an I/O error occurs.
*/
public native int read() throws IOException;
This is a native call to the OS which uses the disk to read the single byte. This is a heavy operation.
With a BufferedInputStream, the method delegates to an overloaded read() method that reads 8192 amount of bytes and buffers them until they are needed. It still returns only the single byte (but keeps the others in reserve). This way the BufferedInputStream makes less native calls to the OS to read from the file.
For example, your file is 32768 bytes long. To get all the bytes in memory with a FileInputStream, you will require 32768 native calls to the OS. With a BufferedInputStream, you will only require 4, regardless of the number of read() calls you will do (still 32768).
As to how to make it faster, you might want to consider Java 7's NIO FileChannel class, but I have no evidence to support this.
Note: if you used FileInputStream's read(byte[], int, int) method directly instead, with a byte[>8192] you wouldn't need a BufferedInputStream wrapping it.
A BufferedInputStream wrapped around a FileInputStream, will request data from the FileInputStream in big chunks (512 bytes or so by default, I think.) Thus if you read 1000 characters one at a time, the FileInputStream will only have to go to the disk twice. This will be much faster!
It is because of the cost of disk access. Lets assume you will have a file which size is 8kb. 8*1024 times access disk will be needed to read this file without BufferedInputStream.
At this point, BufferedStream comes to the scene and acts as a middle man between FileInputStream and the file to be read.
In one shot, will get chunks of bytes default is 8kb to memory and then FileInputStream will read bytes from this middle man.
This will decrease the time of the operation.
private void exercise1WithBufferedStream() {
long start= System.currentTimeMillis();
try (FileInputStream myFile = new FileInputStream("anyFile.txt")) {
BufferedInputStream bufferedInputStream = new BufferedInputStream(myFile);
boolean eof = false;
while (!eof) {
int inByteValue = bufferedInputStream.read();
if (inByteValue == -1) eof = true;
}
} catch (IOException e) {
System.out.println("Could not read the stream...");
e.printStackTrace();
}
System.out.println("time passed with buffered:" + (System.currentTimeMillis()-start));
}
private void exercise1() {
long start= System.currentTimeMillis();
try (FileInputStream myFile = new FileInputStream("anyFile.txt")) {
boolean eof = false;
while (!eof) {
int inByteValue = myFile.read();
if (inByteValue == -1) eof = true;
}
} catch (IOException e) {
System.out.println("Could not read the stream...");
e.printStackTrace();
}
System.out.println("time passed without buffered:" + (System.currentTimeMillis()-start));
}
I made a small program to download data and write it to a file.
Here is the code:
public void run()
{
byte[] bytes = new byte[1024];
int bytes_read;
URLConnection urlc = null;
RandomAccessFile raf = null;
InputStream i = null;
try
{
raf = new RandomAccessFile("file1", "rw");
}
catch(Exception e)
{
e.printStackTrace();
return;
}
try
{
urlc = new URL(link).openConnection();
i = urlc.getInputStream();
}
catch(Exception e)
{
e.printStackTrace();
return;
}
while(canDownload())
{
try
{
bytes_read = i.read(bytes);
}
catch(Exception e)
{
e.printStackTrace();
return;
}
if(bytes_read != -1)
{
try
{
raf.write(bytes, 0, bytes_read);
}
catch(Exception e)
{
e.printStackTrace();
return;
}
}
else
{
try
{
i.close();
raf.close();
return;
}
catch(Exception e)
{
e.printStackTrace();
return;
}
}
}
}
The problem is that when I download big files, I get few bytes missing in the end of the file.
I tried to change the byte array size to 2K, and the problem was solved. But when I downloaded a bigger file (500 MB) , I got few bytes missing again.
I said "Ok, let's try with 4K size". And I changed the byte array size to 4K. It worked!
Nice, but then I downloaded a 4 GB file, I got bytes missing in the end again!
I said "Cool, let's try with 8K size". And then I changed the byte array size to 8K. Worked.
My first question is: Why this happens? (when I change buffer size, the file doesn't get corrupted).
Ok, in theory, the file corrupted problem can be solved changing the byte array size to bigger values.
But there's another problem: how can I measure the download speed (in one second interval) with big byte array sizes?
For example: Let's say that my download speed is 2 KB/s. And the byte array size is 4 K.
My second question is: How can I measure the speed (in one second interval) if the thread will have to wait the byte array to be full? My answer should be: change the byte array size to a smaller value. But the file will get corrupted xD.
After trying to solve the problem by myself, I spent 2 days searching over the internet for a solution. And nothing.
Please, can you guys answer my two questions? Thanks =D
Edit
Code for canDownload():
synchronized private boolean canDownload()
{
return can_download;
}
My advice is to use a proven library such as Apache Commons IO instead of trying to roll your own code. For your particular problem, take a look at the copyURLToFile(URL, File) method.
I would:
Change the RandomAccessFile to a FileOutputStream.
Get rid of canDownload(), whatever it's for, and set a read timeout on the connection instead.
Simplify the copy loop to this:
while ((bytes_read = i.read(bytes)) > 0)
{
out.write(bytes, 0, bytes_read);
}
out.close();
i.close();
with all the exception handling outside this loop.
I think you will find the problem is that you closed the underlying InputStream while the RandomAccessFile still had data in its write buffers. This will be why you are occasionally missing the last few bytes of data.
The race condition is between the JVM flushing the final write, and your call to i.close().
Removing the i.close() should fix the problem; it isn't necessary as the raf.close() closes the underlying stream anyway, but this way you give the RAF a chance to flush any outstanding buffers before it does so.
Basically I have this code to decompress some string that stores in a file:
public static String decompressRawText(File inFile) {
InputStream in = null;
InputStreamReader isr = null;
StringBuilder sb = new StringBuilder(STRING_SIZE);
try {
in = new FileInputStream(inFile);
in = new BufferedInputStream(in, BUFFER_SIZE);
in = new GZIPInputStream(in, BUFFER_SIZE);
isr = new InputStreamReader(in);
int length = 0;
while ((length = isr.read(cbuf)) != -1) {
sb.append(cbuf, 0, length);
}
} catch (Exception e) {
e.printStackTrace();
} finally {
try {
in.close();
} catch (Exception e1) {
e1.printStackTrace();
}
}
return sb.toString();
}
Since physical IO is quite time consuming, and since my compressed version of files are all quite small(around 2K from 2M of text), is it possible for me to still do the above, but on a file that is already mapped to memory? possibly using java NIO? Thanks
It won't make any difference, at least not much. Mapped files are about 20% faster in I/O last time I looked. You still have to actually do the I/O: mapping just saves some data copying. I would look at increasing BUFFER_SIZE to at least 32k. Also the size of cbuf, which should be a local variable in this method, not a member variable, so it will be thread-safe. It might be worth not compressing the files under a certain size threshold, say 10k.
Also you should be closing isr here, not in.
It might be worth trying putting another BufferedInputStream on top of the GZIPInputStream, as well as the one underneath it. Get it to do more at once.
I have a Java Applet that I'm making some edits to and am running into performance issues. More specifically, the applet generates an image which I need to export to the client's machine.
This is really at the proof-of-concept stage so bear with me. For right now, the image is exported to the clients machine at a pre-defined location (This will be replaced with a save-dialog or something in the future). However, the process takes nearly 15 seconds for a 32kb file.
I've done some 'shoot-by-the-hip' profiling where I have printed messages to the console at logical intervals throughout the method in question. I've found, to my surprise, that the bottleneck appears to be with the actual data stream writing process, not the jpeg encoding.
KEEP IN MIND THAT I ONLY HAVE A BASIC KNOWLEDGE OF JAVA AND ITS METHODS
So go slow :p - I'm mainly looking for suggestions to solve the problem rather the solution itself.
Here is the block of code where the magic happens:
ByteArrayOutputStream jpegOutput = new ByteArrayOutputStream();
JPEGImageEncoder encoder = JPEGCodec.createJPEGEncoder(jpegOutput);
encoder.encode(biFullView);
byte[] imageData = jpegOutput.toByteArray();
String myFile="C:" + File.separator + "tmpfile.jpg";
File f = new File(myFile);
try {
dos = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(myFile),512));
dos.writeBytes(byteToString(imageData));
dos.flush();
dos.close();
}
catch (SecurityException ee) {
System.out.println("writeFile: caught security exception");
}
catch (IOException ioe) {
System.out.println("writeFile: caught i/o exception");
}
Like I mentioned, using system.out.println() I've narrowed the performance bottleneck to the DataOutputStream block. Using a variety of machines with varying hardware stats seems to have little effect on the overall performance.
Any pointers/suggestions/direction would be much appreciated.
EDIT:
As requested, byteToString():
public String byteToString(byte[] data){
String text = new String();
for ( int i = 0; i < data.length; i++ ){
text += (char) ( data[i] & 0x00FF );
}
return text;
}
You might want to take a look at ImageIO.
And I think the reason for the performance problem is the looping in byteToString. You never want to do a concatenation in a loop. You could use the String(byte[]) constructor instead, but you don't really need to be turning the bytes into a string anyway.
If you don't need the image data byte array you can encode directly to the file:
String myFile="C:" + File.separator + "tmpfile.jpg";
File f = new File(myFile);
FileOutputStream fos = null;
try {
fos = new FileOutputStream(f);
JPEGImageEncoder encoder = JPEGCodec.createJPEGEncoder(
new BufferedOutputStream(fos));
encoder.encode(biFullView);
}
catch (SecurityException ee) {
System.out.println("writeFile: caught security exception");
}
catch (IOException ioe) {
System.out.println("writeFile: caught i/o exception");
}finally{
if(fos != null) fos.close();
}
If you need the byte array to perform other operations it's better to write it directly to the FileOutputStream:
//...
fos = new FileOutputStream(myFile));
fos.write(imageData, 0, imageData.length);
//...
You could also use the standard ImageIO API (classes in the com.sun.image.codec.jpeg package are not part of the core Java APIs).
String myFile="C:" + File.separator + "tmpfile.jpg";
File f = new File(myFile);
ImageIO.write(biFullView, "jpeg", f);