I am working on an application that reads large amounts of data from a file. Basically, I have a huge file (around 1.5–2 GB) containing different objects (roughly 5 to 10 million of them per file). I need to read all of them and put them into different maps in the app. The problem is that the app runs out of memory at some point while reading the objects. Only when I set it to use -Xmx4096m can it handle the file, but if the file grows any larger it won't be able to cope anymore.
Here's the code snippet:
String sampleFileName = "sample.file";
FileInputStream fileInputStream = null;
ObjectInputStream objectInputStream = null;
try {
    fileInputStream = new FileInputStream(new File(sampleFileName));
    int bufferSize = 16 * 1024;
    objectInputStream = new ObjectInputStream(new BufferedInputStream(fileInputStream, bufferSize));
    while (true) {
        try {
            Object objectToRead = objectInputStream.readUnshared();
            if (objectToRead == null) {
                break;
            }
            // doing something with the object
        } catch (EOFException eofe) {
            eofe.printStackTrace();
            break;
        } catch (Exception e) {
            e.printStackTrace();
            continue;
        }
    }
} catch (Exception e) {
    e.printStackTrace();
} finally {
    if (objectInputStream != null) {
        try {
            objectInputStream.close();
        } catch (Exception e2) {
            e2.printStackTrace();
        }
    }
    if (fileInputStream != null) {
        try {
            fileInputStream.close();
        } catch (Exception e2) {
            e2.printStackTrace();
        }
    }
}
At first I was using objectInputStream.readObject() instead of objectInputStream.readUnshared(); switching to readUnshared() solved the issue partially. When I increased the heap from 2048 MB to 4096 MB, it managed to parse the file. A BufferedInputStream is already in use. On the web I have only found examples of how to read lines or bytes efficiently, but nothing about reading objects, performance-wise.
How can I read the file without increasing the JVM heap and without hitting an OutOfMemoryError? Is there any way to read objects from the file without keeping anything else in memory?
When reading big files, parsing objects, and keeping them in memory, there are several solutions with different trade-offs:
You can fit all parsed objects into memory for the app deployed on one server. That either requires storing the objects in a very compact way, for example packing two small numbers into a single byte or int with some bit shifting, or using other space-efficient data structures; in other words, fitting all objects into the minimum possible space (a small sketch of this packing idea appears after point b) below). Or you increase the memory on that server (scale vertically).
a) However, reading the files can itself take too much memory, so you have to read them in chunks. For example, this is what I was doing with JSON files:
JsonReader reader = new JsonReader(new InputStreamReader(in, "UTF-8"));
if (reader.hasNext()) {
    reader.beginObject();
    String name = reader.nextName();
    if ("content".equals(name)) {
        reader.beginArray();
        parseContentJsonArray(reader, name2ContentMap);
        reader.endArray();
    }
    name = reader.nextName();
    if ("ad".equals(name)) {
        reader.beginArray();
        parsePrerollJsonArray(reader, prerollMap);
        reader.endArray();
    }
}
The idea is to have a way to identify where a certain object starts and ends and to read only that part.
b) You can also split the files into smaller ones at the source if you can; then it will be easier to read them.
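To illustrate the packing idea from option 1, here is a minimal sketch of my own (not from any library) that stores two 16-bit numbers in a single int with bit shifting, so each pair costs 4 bytes instead of a full object:

// Hypothetical helper: packs two values in the 0..65535 range into one int.
// Assumes both inputs fit in 16 bits; adjust the widths for your own data.
public final class CompactPair {

    public static int pack(int high, int low) {
        return (high << 16) | (low & 0xFFFF);
    }

    public static int unpackHigh(int packed) {
        return packed >>> 16;
    }

    public static int unpackLow(int packed) {
        return packed & 0xFFFF;
    }

    public static void main(String[] args) {
        int packed = pack(12345, 54321);
        System.out.println(unpackHigh(packed)); // 12345
        System.out.println(unpackLow(packed));  // 54321
    }
}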
You can't fit all parsed objects for the app on one server. In this case you have to shard based on some object property, for example splitting the data across multiple servers by US state.
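And a rough sketch of the sharding idea, assuming each object exposes some key property such as the US state (the helper below is purely illustrative):

// Illustrative only: routes an object to one of N shards by hashing a key property.
public static int shardIndex(String shardKey, int shardCount) {
    // Math.floorMod keeps the index non-negative even for negative hash codes.
    return Math.floorMod(shardKey.hashCode(), shardCount);
}

// Example: shardIndex("California", 4) always picks the same shard for that state.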
Hopefully this helps with your solution.
I'm looking for a foolproof way to generate a temporary file that will always end up with a unique name on a per-JVM basis. Basically, I want to be sure in a multithreaded application that if two or more threads attempt to create a temporary file at the exact same moment, they will both end up with unique temporary files and no exceptions will be thrown.
This is the method I have currently:
public File createTempFile(InputStream inputStream) throws FileUtilsException {
    File tempFile = null;
    OutputStream outputStream = null;
    try {
        tempFile = File.createTempFile("app", ".tmp");
        tempFile.deleteOnExit();
        outputStream = new FileOutputStream(tempFile);
        IOUtils.copy(inputStream, outputStream);
    } catch (IOException e) {
        logger.debug("Unable to create temp file", e);
        throw new FileUtilsException(e);
    } finally {
        try { if (outputStream != null) outputStream.close(); } catch (Exception e) {}
        try { if (inputStream != null) inputStream.close(); } catch (Exception e) {}
    }
    return tempFile;
}
Is this perfectly safe for my goal? I reviewed the documentation referenced below, but I'm not sure.
See java.io.File#createTempFile
The answer posted at the URL below answers my question. The method I posted is safe in a multithreaded, single-JVM environment. To make it safe in a multithreaded, multi-JVM environment (e.g. a clustered web app) you can use Chris Cooper's idea of passing a unique value as the prefix argument to File.createTempFile within each JVM process.
Is createTempFile thread-safe?
Just use the thread name and current time in millis to name the file.
You can supply a different prefix or suffix to the temporary files for this exact reason.
Assign a unique ID to each process at startup and use that unique ID as the prefix or suffix. Multiple threads in the same VM will not clash, and now separate VMs will not clash either.
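A minimal sketch of that idea, assuming a random UUID per JVM is acceptable as the unique value (the class and constant names are my own):

import java.io.File;
import java.io.IOException;
import java.util.UUID;

public class TempFiles {
    // One unique value per JVM process, generated at class-load time.
    private static final String JVM_ID = UUID.randomUUID().toString();

    public static File createUniqueTempFile() throws IOException {
        // createTempFile already guarantees uniqueness within a single JVM;
        // the JVM_ID prefix keeps separate JVM processes from colliding as well.
        File tempFile = File.createTempFile("app-" + JVM_ID + "-", ".tmp");
        tempFile.deleteOnExit();
        return tempFile;
    }
}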
I have this piece of code that copies files from the IFS to a local drive, and I would like to ask for some suggestions on how to make it better.
public void CopyFile(AS400 system, String source, String destination) {
    File destFile = new File(destination);
    IFSFile sourceFile = new IFSFile(system, source);
    if (!destFile.exists()) {
        try {
            destFile.createNewFile();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    IFSFileInputStream in = null;
    OutputStream out = null;
    try {
        in = new IFSFileInputStream(sourceFile);
        out = new FileOutputStream(destFile);
        // Transfer bytes from in to out
        byte[] buf = new byte[1024];
        int len;
        while ((len = in.read(buf)) > 0) {
            out.write(buf, 0, len);
        }
    } catch (AS400SecurityException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            if (in != null) {
                in.close();
            }
            if (out != null) {
                out.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    } // end try catch finally
} // end method
Where
source = full IFS path + filename and
destination = full local path + filename
I would like to ask some things regarding the following:
a. Performance considerations
would this have a big impact in terms of CPU usage on the host AS400 system?
would this have a big impact on the JVM used (in terms of memory usage)?
would including this in a web app affect app server performance (would it be a heavy task or not)?
would using this to copy multiple files (running it repeatedly) be a big burden on all the resources involved?
b. Code Quality
Does my use of IFSFileInputStream suffice, or would a plain FileInputStream do the job just as well?
AFAIK, I only needed the AS400 object to make sure the source file referenced is a file on the IFS.
I am a noob at AS400 and IFS and would like an honest opinion from experienced folks.
All in all it looks fine (without trying). It should not have a noticeable impact.
in.read() may return 0. Test for -1 instead.
Instead of buffering manually, just wrap in and out in a BufferedInputStream/BufferedOutputStream respectively, then read one byte at a time and test it for -1 (see the sketch after these points).
try-catch is hard to get pretty. This will do, but you will later get more experience and learn how to do it somewhat better.
Do NOT swallow exceptions and print them. The code calling you will have no idea whether it went well or not.
When done with an AS400 object, use as400.disconnectAllServices().
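A minimal sketch of the copy with those points applied (buffered streams, the -1 end-of-stream test, exceptions propagated instead of swallowed, and disconnectAllServices once the AS400 object is no longer needed); the method shape here is my own, not taken from the IBM example:

import com.ibm.as400.access.AS400;
import com.ibm.as400.access.AS400SecurityException;
import com.ibm.as400.access.IFSFile;
import com.ibm.as400.access.IFSFileInputStream;
import java.io.*;

public void copyFile(AS400 system, String source, String destination)
        throws IOException, AS400SecurityException {
    InputStream in = null;
    OutputStream out = null;
    try {
        in = new BufferedInputStream(new IFSFileInputStream(new IFSFile(system, source)));
        out = new BufferedOutputStream(new FileOutputStream(destination));
        int b;
        while ((b = in.read()) != -1) { // read() returns -1 only at end of stream
            out.write(b);
        }
        out.flush();
    } finally {
        if (in != null) try { in.close(); } catch (IOException ignored) {}
        if (out != null) try { out.close(); } catch (IOException ignored) {}
        // Only do this if the caller is completely done with this AS400 object.
        system.disconnectAllServices();
    }
}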
See IBM Help example code:
http://publib.boulder.ibm.com/infocenter/iadthelp/v7r1/index.jsp?topic=/com.ibm.etools.iseries.toolbox.doc/ifscopyfileexample.htm
Regards
Here is how I compressed the string into a file:
public static void compressRawText(File outFile, String src) {
    FileOutputStream fo = null;
    GZIPOutputStream gz = null;
    try {
        fo = new FileOutputStream(outFile);
        gz = new GZIPOutputStream(fo);
        gz.write(src.getBytes());
        gz.flush();
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            gz.close();
            fo.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Here is how I decompressed it:
static int BUFFER_SIZE = 8 * 1024;
static int STRING_SIZE = 2 * 1024 * 1024;

public static String decompressRawText(File inFile) {
    InputStream in = null;
    InputStreamReader isr = null;
    StringBuilder sb = new StringBuilder(STRING_SIZE); // constant resizing is costly, so set STRING_SIZE up front
    try {
        in = new FileInputStream(inFile);
        in = new BufferedInputStream(in, BUFFER_SIZE);
        in = new GZIPInputStream(in, BUFFER_SIZE);
        isr = new InputStreamReader(in);
        char[] cbuf = new char[BUFFER_SIZE];
        int length = 0;
        while ((length = isr.read(cbuf)) != -1) {
            sb.append(cbuf, 0, length);
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            in.close();
        } catch (Exception e1) {
            e1.printStackTrace();
        }
    }
    return sb.toString();
}
The decompression seems to take forever. I have a feeling that I am doing too many redundant steps in the decompression bit. Any idea how I could speed it up?
EDIT: I have modified the code above based on the recommendations given:
1. I changed the pattern to simplify my code a bit, but if I can't use IOUtils, is it still OK to use this pattern?
2. I set the StringBuilder buffer to 2M, as suggested by entonio; should I set it a little higher? The memory is still OK, I still have around 10M available according to the heap monitor in Eclipse.
3. I dropped the BufferedReader and added a BufferedInputStream, but I am still not sure about BUFFER_SIZE; any suggestions?
The above modification has reduced the time it takes to loop over all thirty of my 2M files from almost 30 seconds to around 14, but I need to get it under 10. Is that even possible on Android? Basically, I need to process 60M of text in total, so I have divided it into thirty 2M files, and before I start processing each string, I measured the time it costs just to loop over all the files and load each file's String into memory. Since I don't have much experience, would it be better to use sixty 1M files instead? Or is there any other improvement I should adopt? Thanks.
ALSO: Since physical I/O is quite time-consuming, and since my compressed files are all quite small (around 2K from 2M of text), is it possible for me to still do the above, but on a file that is already mapped to memory, possibly using Java NIO? Thanks.
The BufferedReader's only purpose is the readLine() method, which you don't use, so why not just read from the InputStreamReader? Also, decreasing the buffer size might help. And you should probably specify the encoding both when reading and when writing, though that shouldn't have an impact on performance.
edit: more data
If you know the size of the string ahead of time, you should add a length parameter to decompressRawText and use it to initialise the StringBuilder. Otherwise it will be constantly resized in order to accommodate the result, and that's costly.
edit: clarification
2MB implies a lot of resizes. There is no harm if you specify a capacity higher than the length you end up with after reading (other than temporarily using more memory, of course).
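A small sketch combining both suggestions (the sizeHint parameter and the explicit UTF-8 charset are my additions for illustration, not part of the original code):

import java.io.*;
import java.util.zip.GZIPInputStream;

public static String decompressRawText(File inFile, int sizeHint) throws IOException {
    // sizeHint is the expected number of characters; over-estimating only costs a
    // little temporary memory, under-estimating costs extra resizes.
    StringBuilder sb = new StringBuilder(sizeHint);
    InputStream in = null;
    try {
        in = new GZIPInputStream(new BufferedInputStream(new FileInputStream(inFile)), 8 * 1024);
        Reader reader = new InputStreamReader(in, "UTF-8"); // match the charset used when compressing
        char[] cbuf = new char[8 * 1024];
        int length;
        while ((length = reader.read(cbuf)) != -1) {
            sb.append(cbuf, 0, length);
        }
    } finally {
        if (in != null) {
            try { in.close(); } catch (IOException ignored) {}
        }
    }
    return sb.toString();
}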
You should wrap the FileInputStream with a BufferedInputStream before wrapping with a GZipInputStream, rather than using a BufferedReader.
The reason is that, depending on implementation, any of the various input classes in your decoration hierarchy could decide to read on a byte-by-byte basis (and I'd say the InputStreamReader is most likely to do this). And that would translate into many read(2) calls once it gets to the FileInputStream.
Of course, this may just be superstition on my part. But, if you're running on Linux, you can always test with strace.
Edit: one nice pattern to follow when building up a chain of stream delegates is to use a single InputStream variable. Then you only have one thing to close in your finally block (and you can use Jakarta Commons IOUtils to avoid lots of nested try-catch-finally blocks).
InputStream in = null;
try
{
    in = new FileInputStream("foo");
    in = new BufferedInputStream(in);
    in = new GZIPInputStream(in);
    // do something with the stream
}
finally
{
    IOUtils.closeQuietly(in);
}
Add a BufferedInputStream between the FileInputStream and the GZIPInputStream.
Similarly when writing.
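For the writing side, the same layering would look roughly like this (a sketch only, reusing the compressRawText shape from the question; the buffer size and UTF-8 charset are my own choices):

import java.io.*;
import java.util.zip.GZIPOutputStream;

public static void compressRawText(File outFile, String src) throws IOException {
    OutputStream out = null;
    try {
        // FileOutputStream -> BufferedOutputStream -> GZIPOutputStream
        out = new FileOutputStream(outFile);
        out = new BufferedOutputStream(out, 8 * 1024);
        out = new GZIPOutputStream(out);
        out.write(src.getBytes("UTF-8")); // same charset as used when decompressing
    } finally {
        if (out != null) {
            out.close(); // closing the GZIPOutputStream finishes the gzip stream and flushes the buffers
        }
    }
}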
I have a Java Applet that I'm making some edits to and am running into performance issues. More specifically, the applet generates an image which I need to export to the client's machine.
This is really at the proof-of-concept stage, so bear with me. For right now, the image is exported to the client's machine at a pre-defined location (this will be replaced with a save dialog or something in the future). However, the process takes nearly 15 seconds for a 32 KB file.
I've done some 'shoot-by-the-hip' profiling where I have printed messages to the console at logical intervals throughout the method in question. I've found, to my surprise, that the bottleneck appears to be with the actual data stream writing process, not the jpeg encoding.
KEEP IN MIND THAT I ONLY HAVE A BASIC KNOWLEDGE OF JAVA AND ITS METHODS
So go slow :p - I'm mainly looking for suggestions to solve the problem rather than the solution itself.
Here is the block of code where the magic happens:
ByteArrayOutputStream jpegOutput = new ByteArrayOutputStream();
JPEGImageEncoder encoder = JPEGCodec.createJPEGEncoder(jpegOutput);
encoder.encode(biFullView);
byte[] imageData = jpegOutput.toByteArray();
String myFile = "C:" + File.separator + "tmpfile.jpg";
File f = new File(myFile);
try {
    dos = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(myFile), 512));
    dos.writeBytes(byteToString(imageData));
    dos.flush();
    dos.close();
}
catch (SecurityException ee) {
    System.out.println("writeFile: caught security exception");
}
catch (IOException ioe) {
    System.out.println("writeFile: caught i/o exception");
}
Like I mentioned, using System.out.println() I've narrowed the performance bottleneck down to the DataOutputStream block. Using a variety of machines with varying hardware seems to have little effect on overall performance.
Any pointers/suggestions/direction would be much appreciated.
EDIT:
As requested, byteToString():
public String byteToString(byte[] data) {
    String text = new String();
    for (int i = 0; i < data.length; i++) {
        text += (char) (data[i] & 0x00FF);
    }
    return text;
}
You might want to take a look at ImageIO.
And I think the reason for the performance problem is the looping in byteToString. You never want to do string concatenation in a loop. You could use the String(byte[]) constructor instead, but you don't really need to turn the bytes into a string at all (see the sketch below).
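A minimal sketch of skipping the string conversion entirely, assuming myFile and imageData from the original snippet:

// Instead of dos.writeBytes(byteToString(imageData)), write the bytes directly:
try {
    OutputStream os = new BufferedOutputStream(new FileOutputStream(myFile), 8 * 1024);
    os.write(imageData); // no per-byte char conversion, no string concatenation
    os.flush();
    os.close();
} catch (IOException ioe) {
    System.out.println("writeFile: caught i/o exception");
}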
If you don't need the image data byte array you can encode directly to the file:
String myFile = "C:" + File.separator + "tmpfile.jpg";
File f = new File(myFile);
OutputStream fos = null;
try {
    fos = new BufferedOutputStream(new FileOutputStream(f));
    JPEGImageEncoder encoder = JPEGCodec.createJPEGEncoder(fos);
    encoder.encode(biFullView);
}
catch (SecurityException ee) {
    System.out.println("writeFile: caught security exception");
}
catch (IOException ioe) {
    System.out.println("writeFile: caught i/o exception");
} finally {
    if (fos != null) {
        try { fos.close(); } catch (IOException ignored) {} // closing the buffered stream also flushes it
    }
}
If you need the byte array to perform other operations it's better to write it directly to the FileOutputStream:
//...
fos = new FileOutputStream(myFile);
fos.write(imageData, 0, imageData.length);
//...
You could also use the standard ImageIO API (classes in the com.sun.image.codec.jpeg package are not part of the core Java APIs).
String myFile="C:" + File.separator + "tmpfile.jpg";
File f = new File(myFile);
ImageIO.write(biFullView, "jpeg", f);