Java Recursive File Copy optimization

I have a little application which copies PDF files (with subfolders) to a destination folder. But it works very slowly, and I would like to optimize it. Can you help me?
The code:
public void pdfFolderCopy(File src, File dest) throws IOException {
    if (src.isDirectory()) {
        if (!dest.exists()) {
            dest.mkdir();
        }
        String[] files = src.list();
        for (String file : files) {
            File srcFile = new File(src, file);
            File destFile = new File(dest, file);
            pdfFolderCopy(srcFile, destFile);
        }
    } else {
        if (!dest.exists()) {
            System.out.println("Copying: " + src);
            // Use the Apache IO copyFile method:
            FileUtils.copyFile(src, dest);
        }
    }
}
It runs for about one and a half minutes if all the files already exist, and about 5 minutes if we need to copy about 500 files.

The only really time-consuming task in your code is FileUtils.copyFile(). The time needed grows with the number of files to copy and their size.
Regarding your code, I would suggest extracting the check for the dest directory, since it shouldn't change during the copy process: check and create the dest directory once, before you start pdfFolderCopy.
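Roughly, a sketch of what that restructuring could look like (it reuses FileUtils.copyFile from your code; creating the destination root once up front is left to the caller):
// Sketch only: the caller creates the destination root once, e.g. destRoot.mkdirs(),
// then the recursion creates subfolders without re-checking exists().
public void pdfFolderCopy(File src, File dest) throws IOException {
    if (src.isDirectory()) {
        dest.mkdir();                      // cheap no-op if the folder is already there
        for (String name : src.list()) {
            pdfFolderCopy(new File(src, name), new File(dest, name));
        }
    } else if (!dest.exists()) {
        FileUtils.copyFile(src, dest);     // Apache Commons IO, as in the original
    }
}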

I'd try to simply invoke a process doing
/bin/cp -R -n src dest
where -R means recursive and -n means don't overwrite. There's a good chance that the OS can do this faster than you. No idea what the corresponding command is for Windows or other OSes.
For this you need just
new ProcessBuilder()
        .command("/bin/cp", "-R", "-n", src.toString(), dest.toString())
        .start();
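If you go this route, you probably also want to wait for the process and check its exit code; a minimal sketch of that (same command as above):
// Forward cp's output to this process's console and block until it finishes
// (waitFor() throws InterruptedException).
Process p = new ProcessBuilder()
        .command("/bin/cp", "-R", "-n", src.toString(), dest.toString())
        .inheritIO()
        .start();
int exitCode = p.waitFor();    // 0 means cp reported success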
In case you want to do it in Java, I'd try some minor changes:
dest.mkdir() without any check works and might be a bit faster
listFiles might be faster than composing them manually (probably irrelevant)
once you've created a dest folder yourself, you don't need to check if there are any pre-existing files there
I guess multithreading may lead to a nice speedup: let the main thread create copying jobs and submit them to some executor (with some 4-8 threads); a sketch follows below.
Note that such multithreaded writes may lead to higher disk fragmentation, but I wouldn't care. If I had to, I'd create file reading jobs instead, let them return the file content (n * 100 KB is nothing), and use a single writer thread.
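A rough sketch of that executor idea (it reuses FileUtils.copyFile from the question, so Commons IO is assumed to be on the classpath; the pool size and timeout are arbitrary):
// The main thread walks the tree and submits one copy job per missing file.
public void copyTreeInParallel(File srcRoot, File destRoot) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    submitCopyJobs(srcRoot, destRoot, pool);
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
}

private void submitCopyJobs(File src, File dest, ExecutorService pool) {
    if (src.isDirectory()) {
        dest.mkdir();                                  // no exists() check needed
        for (File child : src.listFiles()) {
            submitCopyJobs(child, new File(dest, child.getName()), pool);
        }
    } else if (!dest.exists()) {
        pool.submit(() -> {
            try {
                FileUtils.copyFile(src, dest);
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
    }
}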

Related

handling about 450,000 files in a zip

My question is simple: can Java handle a .zip file with about 450,000 files in it? The code that I wrote would not load all of the files; just one specific file would be looked up in the zip and read line by line. The file size is about 500 KB.
Would this work, or will I get an OutOfMemoryError?
Oh sorry, uncompressed each file is about 0.5 MB; zipped, the whole archive is about 250 MB.
OK, the file names in that zip are ID + date (unique). If I have to check a log, I call Java with the ID + date and Java reads just that one file, never more.
Edit: It works, and it works very well. About 400,000 files in a zip work without any problem, as long as you have the memory to zip the files.
Edit 2: It works on Linux filesystems without a problem; on NTFS it sometimes crashed. NTFS has a problem with that many files in one zip.
Using the zip filesystem in Java 7, you can actually access one individual file pretty easily and open a BufferedReader on it.
First you have to create the FileSystem:
public static FileSystem getZipFileSystem(final String zipPath)
    throws IOException
{
    final Path path = Paths.get(zipPath).toAbsolutePath();
    final Map<String, Object> env = new HashMap<>();
    final URI uri = URI.create("jar:file:" + path.toString());
    return FileSystems.newFileSystem(uri, env, null);
}
Once you have done that, you can create a BufferedReader from an entry in the zip itself:
try (
    final FileSystem fs = getZipFileSystem("/path/to/the.zip");
    final BufferedReader reader = Files.newBufferedReader(fs.getPath("path/to/entry"),
        StandardCharsets.UTF_8);
) {
    // operate on the reader
}
You could also read all lines in the entry at once using Files.readAllLines().
If you wish to copy a zip entry to a file on the filesystem, you can also do that:
Files.copy(zipfs.getPath("path/to/entry"), Paths.get("file/on/local/fs"));
Or you can directly copy the result to an OutputStream, or directly create an entry from an OutputStream...
Or even walk the entire zip using Files.walkFileTree().
Or get all the entries in a "directory" in a zip using Files.newDirectoryStream(). Note that, as its name says, this is a stream; unlike File.listFiles() (which only works on files on disk anyway), this returns an iterator over the entries.
Or... Or... Or...
Note that a FileSystem needs to be .close()d.
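For example, a short sketch (entry and archive paths are made up) that lists the entries under one directory inside the archive and closes the FileSystem automatically:
// Inside a method that throws IOException; both resources are closed by try-with-resources.
try (FileSystem zipfs = getZipFileSystem("/path/to/the.zip");
     DirectoryStream<Path> dir = Files.newDirectoryStream(zipfs.getPath("some/dir"))) {
    for (Path entry : dir) {
        System.out.println(entry);   // one entry at a time, no full listing in memory
    }
}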
I'm not sure that I understand what you're trying to do.
If it's 0.5 MB/file and 450,000 files, you'll need 225GB. You won't have enough memory to do all this in a single zip in memory even if you get 90% compression.
I'd recommend breaking it into manageable chunks. You'll be able to parallelize that way too, so it's not a bad idea.

zip folder in windows using command line

I am writing a program that needs to zip a file.
This will run on both Linux and Windows machines. It works just fine on Linux, but I am not able to get anything done on Windows.
To send commands I am using the apache-net project. I've also tried using Runtime.exec(), but it isn't working.
Can somebody suggest something?
CommandLine cmdLine = new CommandLine("zip");
cmdLine.addArgument("-r");
cmdLine.addArgument("documents.zip");
cmdLine.addArgument("documents");

DefaultExecutor exec = new DefaultExecutor();
ExecuteWatchdog dog = new ExecuteWatchdog(60 * 1000);
exec.setWorkingDirectory(new File("."));
exec.setWatchdog(dog);

int check = -1;
try {
    check = exec.execute(cmdLine);
} catch (ExecuteException e) {
} catch (IOException e) {
}
Java provides its own compression library in java.util.zip.* that supports the .zip format. An example that zips a folder can be found here. Here's a quickie example that works on a single file. The benefit of going with native Java is that it will work on multiple operating systems and is not dependent on having specific binaries installed.
public static void zip(String origFileName) {
    try {
        String zipName = origFileName + ".zip";
        ZipOutputStream out = new ZipOutputStream(
                new BufferedOutputStream(new FileOutputStream(zipName)));
        byte[] data = new byte[1000];
        BufferedInputStream in = new BufferedInputStream(new FileInputStream(origFileName));
        int count;
        // One entry for the source file, then stream its bytes into the archive.
        out.putNextEntry(new ZipEntry(origFileName));
        while ((count = in.read(data, 0, 1000)) != -1) {
            out.write(data, 0, count);
        }
        in.close();
        out.flush();
        out.close();
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}
The same code won't work on Windows. Windows doesn't have a "zip" program the way that Linux does. You will need to see if Windows 7 has a command-line zip program (I don't think it does; see here: http://answers.microsoft.com/en-us/windows/forum/windows_vista-files/how-to-compress-a-folder-from-command-prompt/02f93b08-bebc-4c9d-b2bb-907a2184c8d5). You will likely need to do two things:
Make sure the user has a suitable third-party zip program.
Do OS detection to execute the proper command (see the sketch below).
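A sketch of that OS detection, reusing the CommandLine setup from the question ("zip.exe" stands in for whatever third-party tool you actually ship on Windows):
// Pick the archiver per OS before building the command line.
String os = System.getProperty("os.name").toLowerCase();
CommandLine cmdLine = os.contains("win")
        ? new CommandLine("zip.exe")    // hypothetical bundled Windows binary
        : new CommandLine("zip");
cmdLine.addArgument("-r");
cmdLine.addArgument("documents.zip");
cmdLine.addArgument("documents");
// ...then execute it with DefaultExecutor exactly as in the question's code.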
You can use the built-in compact.exe to compress/uncompress files from the command prompt.
It displays or alters the compression of files on NTFS partitions.
COMPACT [/C | /U] [/S[:dir]] [/A] [/I] [/F] [/Q] [filename [...]]
/C Compresses the specified files. Directories will be marked so that files added afterward will be compressed.
/U Uncompresses the specified files. Directories will be marked so that files added afterward will not be compressed.
/S Performs the specified operation on files in the given directory and all subdirectories. Default "dir" is the current directory.
/A Displays files with the hidden or system attributes. These files are omitted by default.
/I Continues performing the specified operation even after errors have occurred. By default, COMPACT stops when an error is encountered.
/F Forces the compress operation on all specified files, even those that are already compressed. Already-compressed files are skipped by default.
/Q Reports only the most essential information.
filename Specifies a pattern, file, or directory.
Used without parameters, COMPACT displays the compression state of the current directory and any files it contains. You may use multiple filenames and wildcards. You must put spaces between multiple parameters.
Examples
compact
Display all the files in the current directory and their compact status.
compact file.txt
Display the compact status of the file file.txt
compact file.txt /C
Compacts the file.txt file.
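If you want to drive compact.exe from Java, a minimal sketch (compact ships with Windows, so nothing extra is assumed; note this applies NTFS compression, it does not create a .zip archive):
// Compress everything under C:\data and its subdirectories.
Process p = new ProcessBuilder("compact", "/C", "/S:C:\\data")
        .inheritIO()     // show compact's output on our console
        .start();
int exit = p.waitFor();  // 0 indicates success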

How to test if a file is "complete" (completely written) with Java

Let's say you had an external process writing files to some directory, and you had a separate process periodically trying to read files from this directory. The problem to avoid is reading a file that the other process is currently in the middle of writing out, so it would be incomplete. Currently, the process that reads uses a minimum file age timer check, so it ignores all files unless their last modified date is more than XX seconds old.
I'm wondering if there is a cleaner way to solve this problem. If the filetype is unknown (it could be any of a number of different formats), is there some reliable way to check the file header for the number of bytes that should be in the file, vs. the number of bytes currently in the file, to confirm they match?
Thanks for any thoughts or ideas!
The way I've done this in the past is that the process writing the file writes to a "temp" file, and then moves the file to the read location when it has finished writing the file.
So the writing process would write to info.txt.tmp. When it's finished, it renames the file to info.txt. The reading process then just has to check for the existence of info.txt, and it knows that if it exists, it has been written completely.
Alternatively you could have the write process write info.txt to a different directory, and then move it to the read directory if you don't like using weird file extensions.
You could use an external marker file. The writing process could create a file XYZ.lock before it starts creating file XYZ, and delete XYZ.lock after XYZ is completed. The reader would then easily know that it can consider a file complete only if the corresponding .lock file is not present.
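A small sketch of that convention (the directory and file names, and the writeDataTo/process helpers, are only placeholders; exception handling is omitted):
File dir = new File("/incoming");                 // hypothetical exchange directory

// Writer side: create the marker, write the real file, then remove the marker.
File lock = new File(dir, "XYZ.lock");
lock.createNewFile();
writeDataTo(new File(dir, "XYZ"));                // placeholder for the actual writing logic
lock.delete();

// Reader side: only treat XYZ as complete when no XYZ.lock is present.
File data = new File(dir, "XYZ");
if (data.exists() && !new File(dir, "XYZ.lock").exists()) {
    process(data);                                // placeholder for the actual processing
}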
I had no option of using temp markers etc., as the files are being uploaded by clients over key-pair SFTP. They can be very large in size.
It's quite hacky, but I compare the file size before and after sleeping for a few seconds.
It's obviously not ideal to block the thread, but in our case it is merely running as a background system process, so it seems to work fine:
private boolean isCompletelyWritten(File file) throws InterruptedException {
    Long fileSizeBefore = file.length();
    Thread.sleep(3000);
    Long fileSizeAfter = file.length();
    System.out.println("comparing file size " + fileSizeBefore + " with " + fileSizeAfter);
    if (fileSizeBefore.equals(fileSizeAfter)) {
        return true;
    }
    return false;
}
Note: as mentioned below this might not work on windows. This was used in a Linux environment.
One simple solution I've used in the past for this scenario with Windows is to use boolean File.renameTo(File) and attempt to move the original file to a separate staging folder:
boolean success = potentiallyIncompleteFile.renameTo(stagingAreaFile);
If success is false, then the potentiallyIncompleteFile is still being written to.
This is possible to do using the Apache Commons IO library's FileUtils.copyFile() method. If you try to copy the file and get an IOException, it means that the file is not completely saved.
Example:
public static void copyAndDeleteFile(File file, String destinationFile) {
    try {
        FileUtils.copyFile(file, new File(destinationFile));
    } catch (IOException e) {
        e.printStackTrace();
        // the source is still being written; retry
        copyAndDeleteFile(file, destinationFile);
    }
}
Or periodically check, with some delay, the size of the folder that contains this file:
FileUtils.sizeOfDirectory(folder);
Even if the number of bytes is equal, the content of the files may be different.
So I think you have to compare the old and the new file byte by byte.
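If you do need such a comparison, a plain-Java sketch could look like this:
// Returns true only when both files have identical content.
static boolean sameContent(File a, File b) throws IOException {
    if (a.length() != b.length()) {
        return false;
    }
    try (InputStream inA = new BufferedInputStream(new FileInputStream(a));
         InputStream inB = new BufferedInputStream(new FileInputStream(b))) {
        int byteA;
        while ((byteA = inA.read()) != -1) {
            if (byteA != inB.read()) {
                return false;
            }
        }
        return true;
    }
}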
Two options that seem to solve this issue:
The best option: the writer process somehow notifies the reading process that the writing has finished.
Write the file to {id}.tmp, then when finished rename it to {id}.java, and have the reading process run only on *.java files. Renaming takes much less time, and the chance that the two processes touch the same file decreases.
First, there's "Why doesn't OS X lock files like Windows does when copying to a Samba share?", but that's a variation of what you're already doing.
As far as reading arbitrary files and looking for sizes: some formats carry that information and some do not, but even those that do have no common way of representing it. You would need specific knowledge of each format and handle each one independently.
If you absolutely must act on the file the "instant" it's done, then your writing process would need to send some kind of notification. Otherwise, you're pretty much stuck polling the files, and reading the directory is quite cheap in terms of I/O compared to reading random blocks from random files.
One more method to test that a file is completely written:
private void waitUntilIsReadable(File file) throws InterruptedException {
    boolean isReadable = false;
    int loopsNumber = 1;
    while (!isReadable && loopsNumber <= MAX_NUM_OF_WAITING_60) {
        try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
            log.trace("InputStream readable. Available: {}. File: '{}'",
                    in.available(), file.getAbsolutePath());
            isReadable = true;
        } catch (Exception e) {
            log.trace("InputStream is not readable yet. File: '{}'", file.getAbsolutePath());
            loopsNumber++;
            TimeUnit.MILLISECONDS.sleep(1000);
        }
    }
}
Use this for Unix if you are transferring files using FTP or WinSCP:
public static void isFileReady(File entry) throws Exception {
    long realFileSize = entry.length();
    long currentFileSize = 0;
    do {
        try (FileInputStream fis = new FileInputStream(entry)) {
            currentFileSize = 0;
            while (fis.available() > 0) {
                byte[] b = new byte[1024];
                int nResult = fis.read(b);
                if (nResult == -1)
                    break;
                currentFileSize += nResult;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.out.println("currentFileSize=" + currentFileSize + ", realFileSize=" + realFileSize);
    } while (currentFileSize != realFileSize);
}

Directory size mismatch after file copy

Hopefully someone has seen this before. I'm trying to copy all directory contents from the source to a different directory, and for this I started using the Commons IO FileUtils.copyDirectoryToDirectory(File src, File dest) method. The code is pretty simple:
public static void copyDirtoDir(String src, String dest) {
    File s = new File(src);
    File d = new File(dest);
    try {
        FileUtils.copyDirectoryToDirectory(s, d);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
To run this test on Linux, I'm running the app as a JAR and passing the src and dest strings from the command line. The problem is that when I check the resulting directory size after execution, there's a huge difference in size (the copied dir is around twice the size of the original, checked using 'du -sh').
I then simply tried with nio.FileChannels, as follows:
public static void copyFile(File in, File out) throws IOException {
    FileChannel source = new FileInputStream(in).getChannel();
    FileChannel destination = new FileOutputStream(out).getChannel();
    source.transferTo(0, source.size(), destination);
    source.close();
    destination.close();
}
I call this method for every file inside the directory. The resulting size from this variation is also around twice the size of the original. If I do a listing of the directories' contents, they are the same.
Is there any missing parameter or something that could be causing this size difference?
Not sure what's going on, but you can use diff to diff directories. I'm sure that will pin down the differences easily.
The javadoc says that copyDirectoryToDirectory copies the source directory and all its contents to a directory of the same name in the specified destination directory.
Without seeing your directory structure, I'm guessing this may cause the double data. Any reason why you're not using the simple FileUtils.copyDirectory()?
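To make the difference between the two concrete (paths invented for illustration):
File src  = new File("/data/reports");
File dest = new File("/backup");

// Recreates the source directory inside dest: /backup/reports/...
FileUtils.copyDirectoryToDirectory(src, dest);

// Copies only the *contents* of the source directly into dest: /backup/...
FileUtils.copyDirectory(src, dest);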

How to list a 2 million files directory in java without having an "out of memory" exception

I have to deal with a directory of about 2 million XML files to be processed.
I've already solved the processing by distributing the work between machines and threads using queues, and everything goes right.
But now the big problem is the bottleneck of reading the directory with the 2 million files in order to fill the queues incrementally.
I've tried using the File.listFiles() method, but it gives me a java.lang.OutOfMemoryError: Java heap space. Any ideas?
First of all, do you have any possibility of using Java 7? There you have a FileVisitor and Files.walkFileTree, which should probably work within your memory constraints.
Otherwise, the only way I can think of is to use File.listFiles(FileFilter filter) with a filter that always returns false (ensuring that the full array of files is never kept in memory), but that catches the files to be processed along the way, and perhaps puts them in a producer/consumer queue or writes the file-names to disk for later traversal.
Alternatively, if you control the names of the files, or if they are named in some nice way, you could process the files in chunks using a filter that accepts filenames of the form file0000000 - file0001000, then file0001000 - file0002000, and so on.
If the files are not named in a nice way like this, you could try filtering them based on the hash code of the file name, which should be fairly evenly distributed over the set of integers.
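A sketch of such a hash-bucket filter (the bucket count of 100 is arbitrary):
// Accept only the file names that hash into the bucket currently being processed.
static FilenameFilter bucketFilter(final int bucket, final int totalBuckets) {
    return new FilenameFilter() {
        @Override
        public boolean accept(File dir, String name) {
            return Math.abs(name.hashCode() % totalBuckets) == bucket;
        }
    };
}
// usage: for (int b = 0; b < 100; b++) { File[] chunk = dir.listFiles(bucketFilter(b, 100)); ... }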
Update: Sigh. Probably won't work. Just had a look at the listFiles implementation:
public File[] listFiles(FilenameFilter filter) {
    String ss[] = list();
    if (ss == null) return null;
    ArrayList v = new ArrayList();
    for (int i = 0; i < ss.length; i++) {
        if ((filter == null) || filter.accept(this, ss[i])) {
            v.add(new File(ss[i], this));
        }
    }
    return (File[])(v.toArray(new File[v.size()]));
}
so it will probably fail at the first line anyway... Sort of disappointing. I believe your best option is to put the files in different directories.
Btw, could you give an example of a file name? Are they "guessable"? Like
for (int i = 0; i < 100000; i++)
    tryToOpen(String.format("file%05d", i));
If Java 7 is not an option, this hack will work (for UNIX):
Process process = Runtime.getRuntime().exec(new String[]{"ls", "-f", "/path"});
BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()));
String line;
while (null != (line = reader.readLine())) {
    if (line.startsWith("."))
        continue;
    System.out.println(line);
}
The -f parameter will speed it up (from man ls):
-f do not sort, enable -aU, disable -lst
In case you can use Java 7, this can be done this way and you won't have those out-of-memory problems.
Path path = FileSystems.getDefault().getPath("C:\\path\\with\\lots\\of\\files");
Files.walkFileTree(path, new FileVisitor<Path>() {
    @Override
    public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException {
        return FileVisitResult.CONTINUE;
    }

    @Override
    public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
        // here you have the files to process
        System.out.println(file);
        return FileVisitResult.CONTINUE;
    }

    @Override
    public FileVisitResult visitFileFailed(Path file, IOException exc) throws IOException {
        return FileVisitResult.TERMINATE;
    }

    @Override
    public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
        return FileVisitResult.CONTINUE;
    }
});
Use File.list() instead of File.listFiles() - the String objects it returns consume less memory than the File objects, and (more importantly, depending on the location of the directory) they don't contain the full path name.
Then, construct File objects as needed when processing the result.
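In code, that idea looks roughly like this (process is a placeholder for whatever you do with each file):
File dir = new File("/path/with/many/files");     // hypothetical directory
String[] names = dir.list();                      // names only, no File objects or full paths
if (names != null) {
    for (String name : names) {
        File f = new File(dir, name);             // build the File lazily, process it, let it go
        process(f);                               // placeholder
    }
}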
However, this will not work for arbitrarily large directories either. It's an overall better idea to organize your files in a hierarchy of directories so that no single directory has more than a few thousand entries.
You can do that with the Apache Commons IO FileUtils library. No memory problem; I checked with VisualVM.
Iterator<File> it = FileUtils.iterateFiles(folder, null, true);
while (it.hasNext()) {
    File fileEntry = (File) it.next();
}
Hope that helps.
This also requires Java 7, but it's simpler than the Files.walkFileTree answer if you just want to list the contents of a directory and not walk the whole tree:
Path dir = Paths.get("/some/directory");
try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
    for (Path path : stream) {
        handleFile(path.toFile());
    }
} catch (IOException e) {
    handleException(e);
}
The implementation of DirectoryStream is platform-specific and never calls File.list or anything like it, instead using the Unix or Windows system calls that iterate over a directory one entry at a time.
Since you're on Windows, it seems like you should have simply used ProcessBuilder to start something like "cmd /k dir /b target_directory", capture the output of that, and route it into a file. You can then process that file a line at a time, reading the file names out and dealing with them.
Better late than never? ;)
Why do you store 2 million files in the same directory anyway? I can imagine it slows down access terribly on the OS level already.
I would definitely want to have them divided into subdirectories (e.g. by date/time of creation) already before processing. But if it is not possible for some reason, could it be done during processing? E.g. move 1000 files queued for Process1 into Directory1, another 1000 files for Process2 into Directory2 etc. Then each process/thread sees only the (limited number of) files portioned for it.
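With Java 7, moving a batch into a per-worker subdirectory is cheap on the same filesystem (it is just a rename); a sketch with invented paths:
// Move up to 1000 XML files into a worker-specific subdirectory.
Path source = Paths.get("/data/incoming");
Path workerDir = Files.createDirectories(source.resolve("worker-1"));
int moved = 0;
try (DirectoryStream<Path> stream = Files.newDirectoryStream(source, "*.xml")) {
    for (Path file : stream) {
        Files.move(file, workerDir.resolve(file.getFileName()));
        if (++moved >= 1000) {
            break;
        }
    }
}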
At first you could try to increase the memory of your JVM by passing, e.g., -Xmx1024m.
Please post the full stack trace of the OOM exception to identify where the bottleneck is, as well as a short, complete Java program showing the behaviour you see.
It is most likely because you collect all of the two million entries in memory, and they don't fit. Can you increase heap space?
If file names follow certain rules, you can use File.list(filter) instead of File.listFiles to get manageable portions of file listing.
I faced the same problem when I developed a malware scanning application. My solution was to execute a shell command to list all files. It's faster than recursive methods that browse folder by folder.
see more about shell command here: http://adbshell.com/commands/adb-shell-ls
Process process = Runtime.getRuntime().exec("ls -R /");
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(process.getInputStream()));
//TODO: Read the stream to get a list of file path.
You could use listFiles with a special FilenameFilter. The first time the FilenameFilter is passed to listFiles, it accepts the first 1000 files and then saves them as visited.
The next time the FilenameFilter is passed to listFiles, it ignores the first 1000 visited files and returns the next 1000, and so on until complete.
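A sketch of that stateful filter (keep in mind that listFiles() still reads all the names internally on each call, as noted in an earlier answer; the batch size of 1000 is arbitrary):
// Accepts at most 1000 not-yet-visited names per pass over the directory.
class BatchFilter implements FilenameFilter {
    private final Set<String> visited = new HashSet<String>();
    private int acceptedThisPass;

    void newPass() { acceptedThisPass = 0; }          // call before each listFiles() pass

    @Override
    public boolean accept(File dir, String name) {
        if (acceptedThisPass >= 1000 || !visited.add(name)) {
            return false;                             // already seen, or this batch is full
        }
        acceptedThisPass++;
        return true;
    }
}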
As a first approach you might try tweaking some JVM memory settings, e.g. increasing the heap size as suggested, or even using the AggressiveHeap option.
Taking into account the large number of files, this may not help; in that case I would suggest working around the problem: create several files with file names in each, say 500k file names per file, and read from them.
Try this; it works for me, but I didn't have so many documents...
File dir = new File("directory");
String[] children = dir.list();
if (children == null) {
    // Either dir does not exist or is not a directory
    System.out.print("Directory doesn't exist\n");
} else {
    for (int i = 0; i < children.length; i++) {
        // Get filename of file or directory
        String filename = children[i];
    }
}
