new FileOutputStream slow, is there a better way? - java

I'm writing a bunch of relatively small files (about 50k or so each).
The total processing time for writing all of these files is about 400 seconds.
I put in some checks to see what's taking the most time and of that 400 total seconds, 12 seconds is spent writing the data to the files and 380 seconds are spent just doing this code:
fos = new FileOutputStream(fileObj);
I would expect the writing and closing of the file to take most of the time but it looks like just creating the FileOutputStream is taking the most amount of time by far.
Is there a better way to create my files or is the file creation just generally a slow operation? This is the total time for thousands of files by the way, not just the time for a single file.

What you are seeing is pretty much normal behavior; it's not Java-specific.
When a file is created, the file system needs to add a file entry to its structures, and in the process modify existing structures (e.g. the directory the file is contained in) to take note of the new entry.
On a typical hard disk this requires some head movements, and a single seek takes on the order of milliseconds. On the other hand, once you start writing to the file, the file system will assign new blocks to the file in a linear fashion (as long as possible), so you can write sequential data at about the maximum speed the drive can handle.
The only way to make major improvements in speed is to use a faster device (e.g. an SSD drive).
You can pretty much observe this effect everywhere; Windows Explorer and similar tools all show the same behavior: large files are copied at speeds close to the device's limits, while tons of small files go painfully slowly.
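For reference, here is a minimal, self-contained sketch (hypothetical directory and file names, not the asker's code) that separates the time spent in new FileOutputStream(...) from the time spent writing and closing, which should reproduce the imbalance described above:
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class CreateVsWriteTiming {
    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[50 * 1024];          // ~50 KB per file, as in the question
        File dir = new File("out");                    // hypothetical output directory
        dir.mkdirs();

        long openNanos = 0, writeNanos = 0;
        for (int i = 0; i < 1000; i++) {
            File f = new File(dir, "file" + i + ".bin");

            long t0 = System.nanoTime();
            FileOutputStream fos = new FileOutputStream(f);   // file creation: metadata update + seek
            long t1 = System.nanoTime();
            fos.write(payload);                               // sequential write: near device speed
            fos.close();
            long t2 = System.nanoTime();

            openNanos += t1 - t0;
            writeNanos += t2 - t1;
        }
        System.out.printf("open: %d ms, write+close: %d ms%n",
                openNanos / 1_000_000, writeNanos / 1_000_000);
    }
}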

A way to work around that problem and keep the timing consistent across files is to strip the extension from the destination path before creating the file, and then, once the copy has finished, rename the file to add the extension back. Here is an example:
public static void copiarArchivo(String pathOrigen, String pathDestino)
{
    InputStream in = null;
    OutputStream out = null;
    // buffer size for the binary copy (the original snippet referenced an
    // undeclared variable called "buffer"; a fixed size is used here)
    final int buffer = 4096;
    // ultPunto is the index of the last dot in the destination path.
    // Before the last dot is the file name; after it is the extension.
    int ultPunto = pathDestino.lastIndexOf(".");
    // take the extension of the file
    String extension = pathDestino.substring(ultPunto, pathDestino.length());
    // take the file name without the extension
    String pathSinExtension = pathDestino.substring(0, ultPunto);
    try
    {
        in = new FileInputStream(pathOrigen);
        // create the new file without the extension, since that is faster
        // as explained above
        out = new FileOutputStream(pathSinExtension);
        byte[] buf = new byte[buffer];
        int len;
        // binary copy of the file content
        while ((len = in.read(buf)) > 0)
        {
            out.write(buf, 0, len);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    // whether the copy finished or an exception occurred, the streams must
    // be closed to release resources
    finally
    {
        try
        {
            if (in != null)
                in.close();
            if (out != null)
                out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    // the file was copied without its extension, so the extension must be
    // added back to the file name
    new File(pathSinExtension).renameTo(new File(pathSinExtension + extension));
}
Here pathOrigen is the path of the file you want to copy and pathDestino is the path it is going to be copied to.
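For comparison, here is a rough java.nio sketch of the same binary copy (not this answer's exact approach; Files.copy does the stream handling internally, and the strip-extension-then-rename step could still be applied around it in the same way):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class NioCopy {
    public static void copiar(String pathOrigen, String pathDestino) throws IOException {
        Path origen = Paths.get(pathOrigen);
        Path destino = Paths.get(pathDestino);
        // a single call performs the buffered binary copy
        Files.copy(origen, destino, StandardCopyOption.REPLACE_EXISTING);
    }
}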

Related

Creating mp4 file doesn't remove tmp files

I'm trying to write an InputStream that is an mp4 that I get from calling an external SOAP service. When I do so, it always generates tmp files in my chosen temporary directory (java.io.tmpdir) that aren't removable and stay there after the writing is done.
Writing images that I also get from the SOAP service works fine, without the permanent tmp files in the directory. I'm using Java 1.8 and Spring Boot.
[screenshot: leftover tmp files in the temporary directory]
This is what I'm doing:
File targetFile = new File("D:/archive/video.mp4");
targetFile.getParentFile().mkdirs();
targetFile.setWritable(true);
InputStream inputStream = filesToWrite.getInputStream();
OutputStream outputStream = new FileOutputStream(targetFile);
try {
    int byteRead;
    while ((byteRead = inputStream.read()) != -1) {
        outputStream.write(byteRead);
    }
} catch (IOException e) {
    logger.fatal("Error# SaveFilesThread for guid: " + guid, e);
} finally {
    try {
        inputStream.close();
        outputStream.flush();
        outputStream.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
also tried:
byte data[] = IOUtils.toByteArray(inputStream);
Path file = Paths.get("video.mp4");
Files.write(file, data);
And from apache commons IO:
FileUtils.copyInputStreamToFile(initialStream, targetFile);
When your code starts, the damage is already done. Your code is not the source of the temporary files (it's a ton of work for something that could be done much more simply, though; see below); it's the framework that ends up handing you that filesToWrite variable.
It is somewhat likely that you can hook in at an earlier point and get the raw input stream representing the socket or HTTP connection, and start saving the files straight from there. Alternatively, perhaps filesToWrite has a way to get at the files themselves, in which case you can just move them into place instead of copying them over.
That said, your code for this job is a mess: it has bad exception handling, leaks resources, is far too much code for a simple task, and is possibly 2000x to 10000x slower than needed depending on your hard disk (that is not an exaggeration; calling single-byte read() on unbuffered streams is thousands of times slower than bulk transfers).
// add `throws IOException` to your method signature.
// it saves files, it's supposed to throw IOException,
// 'doing I/O' is in the very definition of your method!
try (InputStream in = filesToWrite.getInputStream();
     OutputStream out = new FileOutputStream(targetFile)) {
    in.transferTo(out);
}
That's it. That solves all the problems: no leaks, no speed loss, a tiny amount of code, and it fixes the deplorable error handling (which, here, is 'log something to the log, then print something to standard out, then potentially leak a bunch of resources, then don't tell the calling code anything went wrong and return exactly as if the copy operation had succeeded').
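Note that InputStream.transferTo(OutputStream) was added in Java 9. Since the question mentions Java 1.8, a roughly equivalent one-liner there is java.nio.file.Files.copy; a sketch, reusing the question's filesToWrite and targetFile variables:
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

try (InputStream in = filesToWrite.getInputStream()) {
    Files.copy(in, targetFile.toPath(), StandardCopyOption.REPLACE_EXISTING);
}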

How to test if a file is "complete" (completely written) with Java

Let's say you had an external process writing files to some directory, and you had a separate process periodically trying to read files from this directory. The problem to avoid is reading a file that the other process is currently in the middle of writing out, so it would be incomplete. Currently, the process that reads uses a minimum file age timer check, so it ignores all files unless their last modified date is more than XX seconds old.
I'm wondering if there is a cleaner way to solve this problem. If the file type is unknown (it could be any of a number of different formats), is there some reliable way to check the file header for the number of bytes that should be in the file, versus the number of bytes currently in the file, to confirm they match?
Thanks for any thoughts or ideas!
The way I've done this in the past is that the process writing the file writes to a "temp" file, and then moves the file to the read location when it has finished writing the file.
So the writing process would write to info.txt.tmp. When it's finished, it renames the file to info.txt. The reading process then just has to check for the existence of info.txt - and it knows that if it exists, it has been written completely.
Alternatively you could have the write process write info.txt to a different directory, and then move it to the read directory if you don't like using weird file extensions.
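A sketch of that write-then-rename idea with java.nio (the paths are hypothetical; ATOMIC_MOVE makes the rename all-or-nothing on file systems that support it):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class WriteThenRename {
    public static void main(String[] args) throws IOException {
        Path tmp = Paths.get("incoming/info.txt.tmp");   // invisible to the reader while writing
        Path done = Paths.get("incoming/info.txt");      // the reader only looks for this name

        Files.createDirectories(tmp.getParent());
        Files.write(tmp, "payload".getBytes(StandardCharsets.UTF_8));

        // Only after the file is fully written does it appear under its final name.
        Files.move(tmp, done, StandardCopyOption.ATOMIC_MOVE);
    }
}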
You could use an external marker file. The writing process could create a file XYZ.lock before it starts creating file XYZ, and delete XYZ.lock after XYZ is completed. The reader would then easily know that it can consider a file complete only if the corresponding .lock file is not present.
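A rough sketch of that marker-file protocol, with hypothetical names (XYZ.lock guards XYZ.dat):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MarkerFileProtocol {
    // Writer side: create the marker, write the data, remove the marker.
    static void writeWithMarker(Path data, byte[] payload) throws IOException {
        Path lock = Paths.get(data.toString() + ".lock");
        Files.createFile(lock);
        try {
            Files.write(data, payload);
        } finally {
            Files.deleteIfExists(lock);
        }
    }

    // Reader side: treat the data file as complete only when no marker exists.
    static boolean isComplete(Path data) {
        Path lock = Paths.get(data.toString() + ".lock");
        return Files.exists(data) && !Files.exists(lock);
    }
}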
I had no option of using temp files, markers, etc., as the files are being uploaded by clients over key-pair SFTP, and they can be very large.
It's quite hacky, but I compare the file size before and after sleeping a few seconds.
It's obviously not ideal to block the thread, but in our case this merely runs as a background system process, so it seems to work fine.
private boolean isCompletelyWritten(File file) throws InterruptedException {
    Long fileSizeBefore = file.length();
    Thread.sleep(3000);
    Long fileSizeAfter = file.length();
    System.out.println("comparing file size " + fileSizeBefore + " with " + fileSizeAfter);
    if (fileSizeBefore.equals(fileSizeAfter)) {
        return true;
    }
    return false;
}
Note: as mentioned below, this might not work on Windows. This was used in a Linux environment.
One simple solution I've used in the past for this scenario with Windows is to use boolean File.renameTo(File) and attempt to move the original file to a separate staging folder:
boolean success = potentiallyIncompleteFile.renameTo(stagingAreaFile);
If success is false, then the potentiallyIncompleteFile is still being written to.
This is possible using the Apache Commons IO library's FileUtils.copyFile() method. If you try to copy the file and get an IOException, it means the file has not been completely saved yet.
Example:
public static void copyAndDeleteFile(File file, String destinationFile) {
    try {
        FileUtils.copyFile(file, new File(destinationFile));
    } catch (IOException e) {
        e.printStackTrace();
        // the source file is still being written; retry the copy
        copyAndDeleteFile(file, destinationFile);
    }
}
Or periodically check, with some delay, the size of the folder that contains this file:
FileUtils.sizeOfDirectory(folder);
Even if the number of bytes is equal, the content of the files may be different.
So I think you have to compare the old and the new file byte by byte.
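A compact sketch of that comparison, assuming both copies fit comfortably in memory (on Java 12+, Files.mismatch would be an alternative):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

static boolean sameContent(Path snapshotA, Path snapshotB) throws IOException {
    // loads both files fully into memory; fine for small files, not for huge uploads
    return Arrays.equals(Files.readAllBytes(snapshotA), Files.readAllBytes(snapshotB));
}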
Two options that seem to solve this issue:
The best option: the writer process notifies the reading process somehow that the writing has finished.
Write the file to {id}.tmp, then when finished rename it to {id}.java, and have the reading process act only on *.java files. Renaming takes much less time, and the chance of the two processes touching the same file at once decreases.
First, there's the question "Why doesn't OS X lock files like Windows does when copying to a Samba share?", but that's a variation of what you're already doing.
As far as reading arbitrary files and looking for sizes: some formats carry that information and some do not, and even those that do have no common way of representing it. You would need specific knowledge of each format and handle each one independently.
If you absolutely must act on the file the "instant" it's done, then your writing process would need to send some kind of notification. Otherwise you're pretty much stuck polling the files, and reading the directory is quite cheap in terms of I/O compared to reading random blocks from random files.
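If the writer does follow the rename-into-place convention from the accepted answer, the reading side can avoid most of the polling with a directory watch. A sketch with java.nio.file.WatchService (the directory name is hypothetical):
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

public class IncomingWatcher {
    public static void main(String[] args) throws IOException, InterruptedException {
        Path dir = Paths.get("incoming");   // hypothetical drop directory
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            while (true) {
                WatchKey key = watcher.take();               // blocks until something happens
                for (WatchEvent<?> event : key.pollEvents()) {
                    Path name = (Path) event.context();
                    // Files renamed into the directory show up as ENTRY_CREATE events,
                    // so a "write to .tmp, then rename" writer only surfaces complete files here.
                    if (!name.toString().endsWith(".tmp")) {
                        System.out.println("ready: " + dir.resolve(name));
                    }
                }
                key.reset();
            }
        }
    }
}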
One more method to test that a file is completely written:
private void waitUntilIsReadable(File file) throws InterruptedException {
    boolean isReadable = false;
    int loopsNumber = 1;
    while (!isReadable && loopsNumber <= MAX_NUM_OF_WAITING_60) {
        try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
            log.trace("InputStream readable. Available: {}. File: '{}'",
                    in.available(), file.getAbsolutePath());
            isReadable = true;
        } catch (Exception e) {
            log.trace("InputStream is not readable yet. File: '{}'", file.getAbsolutePath());
            loopsNumber++;
            TimeUnit.MILLISECONDS.sleep(1000);
        }
    }
}
Use this on Unix if you are transferring files using FTP or WinSCP:
public static void isFileReady(File entry) throws Exception {
    long realFileSize = entry.length();
    long currentFileSize = 0;
    do {
        try (FileInputStream fis = new FileInputStream(entry)) {
            currentFileSize = 0;
            while (fis.available() > 0) {
                byte[] b = new byte[1024];
                int nResult = fis.read(b);
                if (nResult == -1)
                    break;
                currentFileSize += nResult;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.out.println("currentFileSize=" + currentFileSize + ", realFileSize=" + realFileSize);
    } while (currentFileSize != realFileSize);
}

Fastest way to access given lines of text file with and without using GZip and the Jar File (GZip in memory?)

I have a given number (5-7) of large UTF-8 text files (about 7 MB each). Decoded into Java's in-memory string representation their size is about 15 MB each.
I need to load given parts of a given file. The files are known and do not change. I would like to access and load lines at a given place as fast as possible. I load these lines, add HTML tags, and display them in a JEditorPane. I know the bottleneck will be the rendering of the generated HTML by the JEditorPane, but for now I would like to concentrate on the file access performance.
Moreover, the user can search for a given word in all the files.
For now the code I use is:
private static void loadFile(String filename, int startLine, int stopLine) {
    try {
        FileInputStream fis = new FileInputStream(filename);
        InputStreamReader isr = new InputStreamReader(fis, "UTF8");
        BufferedReader reader = new BufferedReader(isr);
        for (int j = startLine; j <= stopLine; j++) {
            //here I add HTML tags
            //or do string comparison in case of a search by the user
            sb.append(reader.readLine());
        }
        reader.close();
    } catch (FileNotFoundException e) {
        System.out.println(e);
    } catch (IOException e) {
        System.out.println(e);
    }
}
Now my questions:
As the number of parts of each file is known, 67 in my case (for each file), I could create 67 smaller files. It would be "faster" to load a given part, but slower when I do a search, as I would have to open each of the 67 files.
I have not done any benchmarking, but my feeling is that opening 67 files in the case of a search takes much longer than performing the empty readLine() calls when loading a part of a single file.
So in my case it is better to have a single larger file. Do you agree with that?
If I put each large file in the resources, I mean in the Jar file, will the performance be worse, and if yes, is it significantly worse?
The related question is what happens if I zip each file to save space. As far as I understand, a Jar file is simply a zip file.
I think I don't know how unzipping works. If I zip a file, will the file be decompressed in memory, or will my program be able to access the given lines I need directly on disk?
The same question for the Jar file: will it be decompressed in memory?
If unzipping is not done in memory, can someone edit my code to use a zip file?
Final question, and the most important for me: I could increase all the performance if everything were performed in memory, but due to Unicode and the quite large files this could easily result in heap usage of more than 100 MB. Is there a possibility of having the zip file itself loaded in memory and working on that? This would be fast and use only a little memory.
Summary of the questions:
In my case, is 1 large file better than plenty of small ones?
If the files are zipped, is the unzip process (GZIPInputStream) performed in memory? Is the whole file unzipped in memory and then accessed, or is it possible to access it directly on disk?
If yes to question 2, can someone edit my code to be able to do it?
MOST IMPORTANT: is it possible to have the zip file loaded in memory, and how?
I hope my questions are clear enough. ;-)
UPDATE: Thanks to Mike for the getResourceAsStream hint, I got it working.
Note that benchmarking shows that loading the gzip file is efficient, but in my case it is still too slow:
~200 ms for the gzip file
~125 ms for the standard file, so 1.6 times faster.
Assuming that the resource folder is called resources
private static void loadFile(String filename, int startLine, int stopLine) {
    try {
        // note: the original used "this.class", which does not compile in a static
        // method; use the enclosing class literal (here called MyClass) instead
        GZIPInputStream zip =
                new GZIPInputStream(MyClass.class.getResourceAsStream("resources/" + filename));
        InputStreamReader isr = new InputStreamReader(zip, "UTF8");
        BufferedReader reader = new BufferedReader(isr);
        for (int j = startLine; j <= stopLine; j++) {
            //here I add HTML tags
            //or do string comparison in case of a search by the user
            sb.append(reader.readLine());
        }
        reader.close();
    } catch (FileNotFoundException e) {
        System.out.println(e);
    } catch (IOException e) {
        System.out.println(e);
    }
}
If the files really aren't changing very often, I would suggest using some other data structures. Creating a hash table of all the words and the locations where they show up would make searching much faster, and creating an index of all the line start positions would make loading a given part much faster.
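For example, here is a rough sketch of such a line-start index (illustrative names; byte offsets are recorded in one pass, then RandomAccessFile.seek jumps straight to the wanted line; note this works on plain files on disk, not on entries inside a jar):
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LineIndex {
    // One pass over the file, remembering the byte offset where each line starts.
    static List<Long> buildIndex(String filename) throws IOException {
        List<Long> offsets = new ArrayList<>();
        offsets.add(0L);
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(filename))) {
            long pos = 0;
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == '\n') {          // 0x0A never occurs inside a UTF-8 multi-byte sequence
                    offsets.add(pos);
                }
            }
        }
        return offsets;
    }

    // Jump to startLine (0-based) and read the requested number of lines.
    static String readLines(String filename, List<Long> index, int startLine, int count)
            throws IOException {
        byte[] chunk;
        try (RandomAccessFile raf = new RandomAccessFile(filename, "r")) {
            long from = index.get(startLine);
            long to = (startLine + count < index.size())
                    ? index.get(startLine + count) : raf.length();
            chunk = new byte[(int) (to - from)];
            raf.seek(from);
            raf.readFully(chunk);
        }
        return new String(chunk, StandardCharsets.UTF_8);
    }
}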
But, to answer your questions more directly:
Yes, one large file is probably still better than many small files; I doubt that reading a line and decoding from UTF-8 will be noticeable compared to opening many files, or decompressing many files.
Yes, the unzipping process is performed in memory, and on the fly. It happens as you request data, but acts as a buffered stream: it will decompress entire blocks at a time, so it is actually very efficient.
I can't fix your code directly, but I can suggest looking up getResourceAsStream:
http://docs.oracle.com/javase/6/docs/api/java/lang/Class.html#getResourceAsStream%28java.lang.String%29
This function will open a file that is in a zip / jar file and give you access to it as a stream, automatically decompressing it in memory as you use it.
If you treat it as a resource, Java will do it all for you. You will have to read up on some of the specifics of handling resources, but Java should handle it fairly intelligently.
I think it would be quicker for you to load the file(s) into memory. You can then zip around to whatever part of the file you need.
Take a look at RandomAccessFile for this.
The GZIPInputStream reads the file into memory as a buffered stream.
That's another question entirely :)
Again, the zip file will be decompressed in memory, depending on which class you use to open it.

reduce number of opened files in java code

Hi, I have some code that uses this block:
RandomAccessFile file = new RandomAccessFile("some file", "rw");
FileChannel channel = file.getChannel();
// some code
String line = "some data";
ByteBuffer buf = ByteBuffer.wrap(line.getBytes());
channel.write(buf);
channel.close();
file.close();
but a specific aspect of the application is that I have to generate a large number of temporary files, more than 4000 on average (used for Hive inserts into a partitioned table).
The problem is that sometimes I catch the exception
Failed with exception Too many open files
while the app is running.
I wonder if there is any way to tell the OS that the file is already closed and not used anymore, and why
channel.close();
file.close();
does not reduce the number of open files. Is there any way to do this in Java code?
I have already increased the maximum number of open files in
#/etc/sysctl.conf:
kern.maxfiles=204800
kern.maxfilesperproc=200000
kern.ipc.somaxconn=8096
Update:
I tried to isolate the problem, so I split the code to investigate each part of it (creating files, uploading to Hive, deleting files).
Using class 'File' or 'RandomAccessFile' fails with the exception "Too many open files".
Finally I used the code:
FileOutputStream s = null;
FileChannel c = null;
try {
    s = new FileOutputStream(filePath);
    c = s.getChannel();
    // do writes (note: FileChannel.write takes a ByteBuffer, not a String)
    c.write(ByteBuffer.wrap("some data".getBytes()));
    c.force(true);
    s.getFD().sync();
} catch (IOException e) {
    // handle exception
} finally {
    if (c != null)
        c.close();
    if (s != null)
        s.close();
}
And this works with large numbers of files (tested on 20K files of 5 KB each). The code itself does not throw the exception, unlike the previous two classes.
But the production code (with Hive) still had the exception, and it appears that the Hive connection through JDBC is the cause.
I will investigate further.
The number of open file handles that the OS allows is not the same thing as the number of file handles that can be opened by a single process. Most Unix systems restrict the number of file handles per process; most likely it is something like 1024 file handles for your JVM.
a) You need to set the ulimit in the shell that launches the JVM to some higher value. (Something like 'ulimit -n 4000')
b) You should verify that you don't have any resource leaks that are preventing your files from being 'finalized'.
Make sure to use a finally {} block. If an exception occurs for some reason, the close will never happen in the code as written.
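With Java 7+, the usual way to guarantee that close is try-with-resources. Here is a sketch of the question's RandomAccessFile snippet rewritten that way, which closes both resources even when an exception is thrown:
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

public class SafeWrite {
    static void writeLine(String path, String line) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "rw");
             FileChannel channel = file.getChannel()) {
            channel.write(ByteBuffer.wrap(line.getBytes(StandardCharsets.UTF_8)));
        }   // channel and underlying descriptor are closed here, even if write() throws
    }
}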
Is this the exact code? Because I can think of one scenario where you might be opening all the files in a loop and have written the code to close all of them at the end, which would cause this problem. Please post the full code.

Java - Reading multiple images from a single zip file and eventually turning them into BufferedImage objects. Good idea?

I'm working on a game, and I need to load multiple image files (png, gif, etc.) that I'll eventually want to convert into BufferedImage objects. In my setup, I'd like to load all of these images from a single zip file, "Resources.zip". That resource file will contain images, map files, and audio files - all contained in various neatly ordered sub-directories. I want to do this because it will (hopefully) make resource loading easy in both applet and application versions of my program. I'm also hoping that for the applet version, this method will make it easy for me to show the loading progress of the game resources zip file (which could eventually amount to 10MB depending on how elaborate this game gets, though I'm hoping to keep it under that size so that it's browser-friendly).
I've included my zip handling class below. The idea is, I have a separate resource handling class, and it creates a ZipFileHandler object that it uses to pull specific resources out of the Resources.zip file.
import java.io.BufferedInputStream;
import java.io.File;
import java.io.IOException;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZipFileHandler
{
    private ZipFile zipFile;

    public ZipFileHandler(String zipFileLocation)
    {
        try
        {
            zipFile = new ZipFile(zipFileLocation);
        }
        catch (IOException e) {System.err.println("Unable to load zip file at location: " + zipFileLocation);}
    }

    public byte[] getEntry(String filePath)
    {
        ZipEntry entry = zipFile.getEntry(filePath);
        int entrySize = (int)entry.getSize();
        try
        {
            BufferedInputStream bis = new BufferedInputStream(zipFile.getInputStream(entry));
            byte[] finalByteArray = new byte[entrySize];
            int bufferSize = 2048;
            byte[] buffer = new byte[2048];
            int chunkSize = 0;
            int bytesRead = 0;
            while(true)
            {
                //Read a chunk into the buffer
                chunkSize = bis.read(buffer, 0, bufferSize); //read() returns the number of bytes read
                if(chunkSize == -1)
                {
                    //read() returns -1 if the end of the stream has been reached
                    break;
                }
                //Write that chunk to the finalByteArray
                //System.arraycopy(src, srcPos, dest, destPos, length)
                System.arraycopy(buffer, 0, finalByteArray, bytesRead, chunkSize);
                bytesRead += chunkSize;
            }
            bis.close(); //close the BufferedInputStream
            System.err.println("Entry size: " + finalByteArray.length);
            return finalByteArray;
        }
        catch (IOException e)
        {
            System.err.println("No zip entry found at: " + filePath);
            return null;
        }
    }
}
And I use the ZipFileHandler class like this:
ZipFileHandler zfh = new ZipFileHandler(the_resourceRootPath + "Resources.zip");
InputStream in = new ByteArrayInputStream(zfh.getEntry("Resources/images/bg_tiles.png"));
try
{
    BufferedImage bgTileSprite = ImageIO.read(in);
}
catch (IOException e)
{
    System.err.println("Could not convert zipped image bytearray to a BufferedImage.");
}
And the good news is, it works!
But I feel like there might be a better way to do what I'm doing (and I'm fairly new to working with BufferedInputStreams).
In the end, my question is this:
Is this even a good idea?
Is there a better way to load a whole bunch of game resource files in a single download/stream, in an applet- AND application-friendly way?
I welcome all thoughts and suggestions!
Thanks!
Taking multiple resources and putting them in one compressed file is how several web frameworks work (e.g. GWT). It is less expensive to load one large file than many small ones. This assumes that you are going to use all of those resources in your app; if not, lazy loading is also a viable alternative.
That being said, it is usually best to get the app working and then profile to find where the bottlenecks are. Otherwise you will end up with a lot of complicated code and it will take you a lot longer to get your app working. 10-20% of the code takes 80-90% of the time to execute, and you just don't know which 10-20% that is until the project is mostly complete.
If your goal is to learn the technologies and tinker, then good going - looks good.
If you are shipping a Java program, it is usually considered good practice to bundle it as a jar file anyway. So why not simply put your resources inside that jar file (in directories, of course)? Then you can simply use
`InputStream stream = MyClass.class.getResourceAsStream(imagePath);`
to load the data for each image, instead of having to handle the zip yourself (and it also works for jars not actually on the file system, such as ones loaded from an HTTP URL in applets).
I also assume the jar will be cached, but you should measure and compare the performance against your solution with an external zip file.
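A short sketch of that approach (class name and resource path are illustrative): the stream comes straight out of the jar on the classpath and feeds ImageIO directly.
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;
import javax.imageio.ImageIO;

public class ResourceLoader {
    public static BufferedImage loadImage(String imagePath) throws IOException {
        // imagePath like "/images/bg_tiles.png", resolved inside the jar on the classpath
        try (InputStream stream = ResourceLoader.class.getResourceAsStream(imagePath)) {
            if (stream == null) {
                throw new IOException("Resource not found: " + imagePath);
            }
            return ImageIO.read(stream);
        }
    }
}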
