OutOfMemory when listing files in a directory - Java

When I list the files of a directory that has 300,000 files in Java, an OutOfMemoryError occurs.
String[] fileNames = file.list();
What I want is a way to list all the files of a directory incrementally, no matter how many files are in that specific directory, and without running into an "out of memory" problem with the default 64M heap limit.
I have Googled for a while and cannot find such a way in pure Java.
Please help me!!
Note: JNI is a possible solution, but I hate JNI.

I know you said "with the default 64M heap limit", but let's look at the facts - you want to hold a (potentially) large number of items in memory, using the mechanisms made available to you by Java. So, unless there is some dire reason that you can't, I would say increasing the heap is the way to go.
Here is a link to the same discussion at JavaRanch: http://www.coderanch.com/t/381939/Java-General/java/iterate-over-files-directory
Edit, in response to a comment: the reason I said he wants to hold a large number of items in memory is that this is the only mechanism Java provides for listing a directory without using the native interface or platform-specific mechanisms (and the OP said he wanted "pure Java").

The only possible solution for you is to move to Java 7 or later and iterate over the directory lazily; the Files.walk call and the lambda below require Java 8:
final Path p = FileSystems.getDefault().getPath("Yourpath");
try (Stream<Path> paths = Files.walk(p)) { // close the stream to release the directory handles
    paths.filter(Files::isRegularFile)
         .forEach(filePath -> {
             // Do something with filePath
         });
}

You are a bit out of luck here. At the very least, 300k strings will need to be created. With an average length of 8-10 characters and 2 bytes per character, that's 6 MB at a minimum. Add the object pointer overhead per string (8 bytes) and you run into your memory limit.
If you absolutely must have that many files in a single dir, which I would not recommend as your file system will have problems, your best bet is to run a native process (not JNI) via Runtime.exec. Keep in mind that this ties you to the OS (ls vs dir). You will get the list of files as one large string and will be responsible for post-processing it into what you want.
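A minimal sketch of that approach, assuming a Unix-like system with ls on the PATH (the directory path is a placeholder); the output is consumed line by line instead of being held in one big array:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class NativeListing {
    public static void main(String[] args) throws IOException, InterruptedException {
        // "ls -1" prints one file name per line; on Windows you would use "cmd /c dir /B" instead.
        Process process = Runtime.getRuntime().exec(new String[] { "ls", "-1", "/path/to/bigdir" });
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(process.getInputStream()))) {
            String name;
            while ((name = reader.readLine()) != null) {
                // Handle one file name at a time instead of keeping 300k of them in memory.
                System.out.println(name);
            }
        }
        process.waitFor();
    }
}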
Hope this helps.

Having 300,000 files in a dir is not a good idea - AFAIK file systems are not good at having that many sub-nodes in a single node. Interesting question, though.
EDIT: THE FOLLOWING DOES NOT HELP, see comments.
I think you could use a FileFilter, reject all files, and process them in the filter.
new File("c:/").listFiles(new FileFilter() {
    @Override
    public boolean accept(File pathname) {
        processFile();
        return false;
    }
});

If you can write your code for Java 7 or later, then the following is a good option:
Files.newDirectoryStream(Path dir)
See the Javadoc for that API for the details.
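A minimal usage sketch (the directory path is a placeholder); DirectoryStream iterates the entries lazily, so they can be processed one at a time:
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DirectoryStreamExample {
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get("/path/to/bigdir"); // placeholder path
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path entry : stream) {
                // Entries are delivered one by one; no full array is built up front.
                System.out.println(entry.getFileName());
            }
        }
    }
}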
Hope this helps.

Related

Recent ways to obtain file size in Java

I know this question has been widely discussed in different posts:
java get file size efficiently
How to get file size in Java
Get size of folder or file
https://www.mkyong.com/java/how-to-get-file-size-in-java/
My problem is that I need to obtain the sizes of a large number of files (regular files existing on a hard disk), and for this I need the solution with the best performance. My intuition is that it should be done through a method that reads the file system metadata directly, rather than obtaining the size by reading the whole file contents. It is difficult to tell from the documentation which specific method each call uses.
As stated in this page:
Files has the size() method to determine the size of the file. This is
the most recent API and it is recommended for new Java applications.
But this is apparently not the best advice in terms of performance. I have made measurements of the different methods:
file.length();
Files.size(path);
BasicFileAttributes attr = Files.readAttributes(path, BasicFileAttributes.class); attr.size();
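For completeness, here are the three calls in context - a minimal, self-contained sketch with a placeholder path (not my exact benchmark code):
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;

public class FileSizeVariants {
    public static void main(String[] args) throws IOException {
        Path path = Paths.get("/path/to/some/file"); // placeholder
        File file = path.toFile();

        long viaFile = file.length();               // legacy java.io call
        long viaFiles = Files.size(path);           // NIO.2 convenience method
        BasicFileAttributes attr =
                Files.readAttributes(path, BasicFileAttributes.class);
        long viaAttributes = attr.size();           // one attribute read, several fields available

        System.out.printf("length=%d size=%d attributes=%d%n", viaFile, viaFiles, viaAttributes);
    }
}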
And my surprise is that file.length() is the fastest, despite having to create a File object instead of using the newer Path. I do not know whether it reads only the file system metadata or also the contents. So my question is:
What is the fastest, recommended way to obtain file sizes in recent Java versions (9/10/11)?
EDIT
I do not think these details add anything to the question. Basically the benchmark reads like this:
Length: 49852 with previous instantiation: 84676
Files: 3451537 with previous instantiation: 5722015
Length: 48019 with previous instantiation: 79910
Length: 47653 with previous instantiation: 86875
Files: 83576 with previous instantiation: 125730
BasicFileAttr: 333571 with previous instantiation: 366928
.....
Length is quite consistent. Files is noticeably slow on the first call, but it must cache something since later calls are faster (though still slower than Length). This is what other people observed in some of the links I referenced above. BasicFileAttr was my hope, but it is still slow.
I am asking what is recommended in modern Java versions, and I consider 9/10/11 to be "modern". It is not a dependency, nor a limitation, but I suppose Java 11 should provide better means of getting file sizes than Java 5 did. If Java 8 introduced the fastest way, that is OK.
It is not a premature optimisation; at the moment I am optimising a CRC check with an initial size check, because the size check should be much faster and does not need, in theory, to read the file contents. So I can use the "old" length() method directly; all I am asking is what the new advances in this respect are in modern Java, given that the new methods are apparently not as fast as the old ones.

MappedByteBuffer clear cached Pages

I've got a problem with MappedByteBuffer, specifically with how it works internally. The way I understand it, the caching is done entirely by the operating system: if I read from the file (using a MappedByteBuffer), the OS will read whole pages from the hard drive and keep those pages in RAM for faster access when they are needed again. This also allows a shared cache to be used by multiple applications/processes which access the same file. Is this correct?
If so, how is it possible to invalidate this cache? Just reinitializing the mapped object shouldn't work. I have written an application which reads a lot from the hard drive. I need to do a few benchmarks, so I need to clear this cache when needed. I've tried to use "echo 3 > /proc/sys/vm/drop_caches" but this doesn't make a difference, so I think it is not working.
This also allows a shared cache to be used by multiple applications/processes which access the same file. Is this correct?
This is how it works on Linux, Windows and macOS. On other OSes it is probably the same.
If so, how is it possible to invalidate this cache?
Delete the file and it will no longer be valid.
I need to do a few benchmarks, so I need to clear this cache when needed.
That is what the OS is for. If you need to force the cache to be invalidated, that is tricky and entirely OS dependent.
I've tried to use "echo 3 > /proc/sys/vm/drop_caches" but this doesn't make a difference so I think it is not working.
It may have no impact on your benchmark. I suggest you look at /proc/meminfo, in particular at:
Cached: 588104 kB
SwapCached: 264 kB
BTW, if you want to unmap a MappedByteBuffer, I do the following:
// Requires the internal classes sun.nio.ch.DirectBuffer and sun.misc.Cleaner (pre-Java 9).
public static void clean(ByteBuffer bb) {
    if (bb instanceof DirectBuffer) {
        Cleaner cl = ((DirectBuffer) bb).cleaner();
        if (cl != null)
            cl.clean(); // unmaps/frees the underlying memory immediately
    }
}
This works for direct ByteBuffers as well, but probably won't work in Java 9 as this interface will be removed.
It's a known, sad issue (which is still unresolved in the JDK): http://bugs.java.com/view_bug.do?bug_id=4724038
But even though there's no public API for that there's a dangerous workaround (use at your own risk): https://community.oracle.com/message/9387222
An alternative is not to use huge MappedByteBuffers and to hope that they eventually get garbage-collected.
In case what is needed is for different programs that map this file to each get their very own copy of its contents, MapMode.PRIVATE could help.
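A minimal sketch of such a private (copy-on-write) mapping, with a placeholder path; it assumes the file is non-empty:
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class PrivateMappingExample {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("/path/to/file", "rw");
             FileChannel channel = raf.getChannel()) {
            // MapMode.PRIVATE gives this process a copy-on-write view: writes through
            // the buffer are not written back to the file or seen by other processes.
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.PRIVATE, 0, channel.size());
            buffer.put(0, (byte) 42); // modifies only this process's private copy
        }
    }
}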
Hope that helps.

Optimising Java's NIO for small files

We have a file I/O bottleneck. We have a directory which contains lots of JPEG files, and we want to read them in real time as a movie. Obviously this is not an ideal format, but this is a prototype object-tracking system and there is no possibility of changing the format, as the files are used elsewhere in the code.
From each file we build a frame object, which basically means having a BufferedImage and an explicit ByteBuffer containing all of the information from the image.
What is the best strategy for this? The data is on an SSD which in theory has read/write rates of around 400 MB/s, but in practice reads no more than 20 files per second (3-4 MB/s) using the naive implementation:
bufferedImg = ImageIO.read(imageFile);[1]
byte[] data = ((DataBufferByte)bufferedImg.getRaster().getDataBuffer()).getData();[2]
imgBuf = ByteBuffer.wrap(data);
However, Java offers lots of possibilities for improving this:
(1) Channels, especially FileChannel (see the sketch after this list).
(2) Gathering/scattering reads.
(3) Direct buffers.
(4) Memory-mapped buffers.
(5) Multithreading - use a bunch of Callables to access many files simultaneously.
(6) Wrapping the files in a single large file.
(7) Other things I haven't thought of yet.
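As an illustration of options (1) and (3), here is a minimal sketch of reading a single file into a direct ByteBuffer via a FileChannel (the path is a placeholder; decoding into a BufferedImage would still need ImageIO or another decoder):
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ChannelRead {
    // Reads one JPEG file fully into a direct ByteBuffer.
    static ByteBuffer readFile(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocateDirect((int) channel.size()); // fine for files < 2 GB
            while (buf.hasRemaining() && channel.read(buf) >= 0) {
                // keep reading until the buffer is full or EOF is reached
            }
            buf.flip();
            return buf;
        }
    }

    public static void main(String[] args) throws IOException {
        ByteBuffer data = readFile(Paths.get("/path/to/frame0001.jpg")); // placeholder
        System.out.println("Read " + data.remaining() + " bytes");
    }
}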
I would just like to know if anyone has extensively tested the different options and knows what is optimal. I assume that (3) is a must, but I would still like to optimise the reading of a single file as far as possible, and I am unsure of the best strategy.
Bonus question: in the code snippet above, when does the JVM actually 'hit the disk' and read in the contents of the file - is it at [1], or is that just a file handle which 'points' to the object? It kind of makes sense to evaluate lazily, but I don't know how the ImageIO implementation works.
ImageIO.read(imageFile)
As it returns a fully decoded BufferedImage, I assume it hits the disk at [1] rather than just returning a file handle.

Java: list all files (10-20,000+) from a single directory

I want to list a large number of files (10 or 20 thousand or so) contained in a single directory, quickly and efficiently.
I have read quite a few posts, especially over here, explaining Java's shortcomings in achieving this, basically due to the underlying filesystem (and that Java 7 probably has some answer to it).
Some of the posts here have proposed alternatives like native calls or piping etc., and I do understand that the best possible option under normal circumstances is the Java call
- String[] sList = file.list(); which is only slightly better than file.listFiles();
Also, there was a suggestion to use multithreading (also ExecutorService).
Well, here the issue is that I have very little practical know-how of coding the multithreaded way, so my logic is bound to be incorrect. Still, I tried it this way:
created a list of a few thread objects
ran a loop over this list, called .start() and immediately .sleep(500)
in the thread class, overrode the run method to include the .list()
Something like this, the caller class:
String[] strList = null;
for (int i = 0; i < 5; i++) {
    ThreadLister tL = new ThreadLister(fit);
    threadList.add(tL);
}
for (int j = 0; j < threadList.size(); j++) {
    thread = threadList.get(j);
    thread.start();
    thread.sleep(500);
}
strList = thread.fileList;
and the thread class as:
public String[] fileList;

public ThreadLister(File f) {
    this.f = f;
}

public void run() {
    fileList = f.list();
}
I might be way off here with multithreading, I guess.
I would very much appreciate a solution to my requirement using multithreading. An added benefit is that I would learn a bit more about practical multithreading.
Query Update
Well, obviously multithreading isn't going to help me (I now realise it's not actually a solution). Thank you for helping me rule out threading.
So I tried:
1. FileUtils.listFiles() from Apache Commons - not much difference.
2. A native call, viz. exec("cmd /c dir /B .\\Test") - this executes fast, but then reading the stream with a while loop takes ages.
What I actually require is the file names matching a certain filter amongst about 100k files in a single directory. So I am using something like File.list(new FilenameFilter()).
I believe the FilenameFilter gives no benefit, as it will try to match against all the files first and only then give the output.
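For reference, the kind of filtered call I mean looks roughly like this (a minimal sketch; the prefix is only an example):
import java.io.File;
import java.io.FilenameFilter;

public class FilteredListing {
    public static void main(String[] args) {
        File dir = new File("/path/to/bigdir"); // placeholder
        // The filter is applied per entry, but list() still has to scan the whole directory.
        String[] matches = dir.list(new FilenameFilter() {
            @Override
            public boolean accept(File parent, String name) {
                return name.startsWith("T"); // example criterion
            }
        });
        System.out.println(matches == null ? 0 : matches.length);
    }
}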
Yes, I understand, I need a different approach to storing these files. One option I can try is storing the files in multiple directories; I am yet to try this (I don't know if it will help enough) - as suggested by Boris earlier.
What else could be a better option? Will a native call on Unix, ls with a filename match, work effectively? I know on Windows it doesn't work, I mean unless we are searching in the same directory.
Kind Regards
Multi-threading is useful for listing multiple directories. However, you cannot split a single call to a single directory, and I doubt it would be much faster even if you could, as the OS returns the files in any order it pleases.
The first thing to learn about multi-threading is that not all solutions will be faster or simpler just by using multiple threads.
As a completely different suggestion: did you try using the Apache Commons FileUtils?
http://commons.apache.org/io/api-release/index.html Check out the method FileUtils.listFiles().
It will list all the files in a directory. Maybe it is fast enough and optimized enough for your needs. Maybe you really don't need to reinvent the wheel and the solution is already out there?
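A minimal usage sketch, assuming commons-io is on the classpath (the directory is a placeholder; null for the extensions means no extension filtering):
import java.io.File;
import java.util.Collection;

import org.apache.commons.io.FileUtils;

public class CommonsListing {
    public static void main(String[] args) {
        File dir = new File("/path/to/bigdir"); // placeholder
        // Non-recursive listing of the files in the directory; null = no extension filter.
        Collection<File> files = FileUtils.listFiles(dir, null, false);
        System.out.println("Found " + files.size() + " files");
    }
}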
What I have eventually done is:
1. As a quick fix, to get over the problem for the moment, I used a native call to write all the file names into a temporary text file and then used a BufferedReader to read each line.
2. Wrote a utility to archive the inactive files (most of them) to some other archive location, thereby reducing the total number of files in the active directory, so that the normal list() call returns much more quickly.
3. Going forward, as a long-term solution, I will modify the way all these files are stored and create a directory hierarchy in which each directory holds comparatively few files, so that list() can work very fast.
One thing that came to my mind, and that I noticed while testing, is that the first list() call takes a long time but subsequent requests are very, very fast. That makes me believe that the JVM intelligently returns a list which has remained on the heap. I tried a few things like adding files to the dir or changing the File variable name, but the response was still instant. So I believe that this array sits on the heap till it is gc'ed and Java intelligently responds to the same request. <*Am I right? Or is that not how it behaves? Some explanation please.*>
Because of this I thought that if I could write a small program to get this list once every day and keep a static reference to it, then this array won't be gc'ed and every request to retrieve the list will be fast. <*Again, some comments/suggestions appreciated.*>
Is there a way to configure Tomcat so that the GC may collect all other non-referenced objects but leave alone some that are specified? Somebody told me something like this is implemented in Linux, obviously at the OS level, but I don't know whether that's true or not.
Which file system are you using? Each file system has its own limitation on the number of files/folders a directory can have (including the directory depth). So I am not sure how you created that many files, and whether, once created through some program, you were able to read them all back.
As suggested above, the FilenameFilter is a post-filter on file names, so I am not sure if it would be of any help (although you are probably creating smaller lists of file names), as each listFiles() call would get the complete list.
For example:
1) Say thread 1 is capturing the list of file names starting with "T*": the listFiles() call would retrieve all the thousands of file names and then filter them as per the FilenameFilter criteria.
2) Thread 2, if capturing the list of file names starting with "S*", would repeat all the steps from 1.
So you end up reading the directory listing multiple times, putting more and more load on the heap, JVM native calls, the file system, etc.
If possible, the best suggestion would be to reorganize the directory structure.

Java: Where can I find advanced file manipulation source/libraries?

I'm writing arbitrary byte arrays (mock virus signatures of 32 bytes) into arbitrary files, and I need code to overwrite a specific file given an offset into the file. My specific question is: is there source code or a library that I can use to perform this particular task?
I've had this problem with Python file manipulation as well. I'm looking for a set of functions that can kill a line, cut/copy/paste, etc. My assumption is that these are extremely common tasks, yet I couldn't find them in the Java API or in my Google searches.
Sorry for not RTFM well; I haven't come across any information, and I've been looking for a while now.
Maybe you are looking for something like the RandomAccessFile class in the standard Java JDK. It supports reads and writes at some offset, as well as byte arrays.
Java's RandomAccessFile is exactly what you want.
It includes methods like seek(long) that allow you to move wherever you need in the file. It also allows for reading and writing at the same time.
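A minimal sketch of overwriting 32 bytes at a given offset with RandomAccessFile (the path and offset are placeholders):
import java.io.IOException;
import java.io.RandomAccessFile;

public class SignatureWriter {
    // Overwrites bytes.length bytes of the target file in place, starting at offset.
    static void overwrite(String path, long offset, byte[] bytes) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "rw")) {
            raf.seek(offset);   // move the file pointer to the target offset
            raf.write(bytes);   // overwrite in place; later bytes are not shifted
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] signature = new byte[32]; // mock 32-byte signature
        overwrite("/path/to/target.bin", 1024L, signature); // placeholder path and offset
    }
}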
As far as I know, Java primarily has lower-level functions for manipulating files directly. Here is the best I've come up with.
The actions you describe are standard in the Swing world, and for text they come down to manipulating a Document object. These act on data in memory. The class java.nio.channels.FileChannel has similar methods that act directly on a file. Neither finds the ends of lines automatically, but other classes in java.io and java.nio do.
Apache Commons has a sandbox library called Flatfile which looks like it does what you want. The problem is that no code has been released yet. You may, however, want to talk to people working on it to get some more ideas. I didn't do a general check on libraries.
Have you looked into File/FileReader/FileWriter/BufferedReader? You can get the contents of the files and manipulate them as you like; you can search the data in the files, overwrite files, create new ones, append to an existing one...
I am not sure this is exactly what you are asking for, but I use these APIs all the time for logging, RTF editors, text file creation for email, and many other things.
As far as cut/copy/paste goes, I have not come across the ability to do that directly; however, you can output the contents of the file, "copy" the part of it you want and "paste" it into a new file, or append it to an existing one.
While writing a byte array to a file is a common task, writing a 32-byte array to a given file at a given offset, just once, is just not something you are going to find in java.io :)
To get started, would the method and comments below look reasonable to you? I bet someone here, maybe even myself, could whip it out quickly.
public static void writeFauxVirusSignature(File file, byte[] bytes, long offset) {
    // open file
    // move to offset
    // write bytes
    // close file
}
Questions:
How big could the potential target files be?
Do you need performance?
I ask because clean, easy-to-read code would use the Apache Commons libraries, but large file writes in a performance-sensitive environment will necessitate using the java.nio libraries.
