Picking n random files from a directory - java

I have a folder containing over 100k folders in it. If I use listFiles() then it takes a lot of time because it returns all the entries present in the folder. What I want is, n random entries from the folder which I will process and will move to a different location.

I was curious to see what sort of performance you get with listFiles(), so I tested. With 100,000 children, I saw a delay of 0.051 seconds. You will likely see this rate hold relatively well (nothing I have found suggests any substantial slowdown on the Java side; any rapid degradation would come from the native layer). While this delay is relatively small, I looked into how listFiles() works to determine whether there were any potential improvements to be made.
Improvement 1
The first solution is to use File.list() as opposed to File.listFiles(). If you look at the code for the listFiles() method, you can see how Java finds the children of a folder.
public File[] listFiles() {
    String[] ss = list();
    if (ss == null) return null;
    int n = ss.length;
    File[] fs = new File[n];
    for (int i = 0; i < n; i++) {
        fs[i] = new File(ss[i], this);
    }
    return fs;
}
The listFiles() method takes the array of the children's names, which are Strings, and creates a File object for each child. That iteration and instantiation of File objects is unnecessary overhead for your task: you only want a few entries, so the work is cheaper if the conversion from String[] to File[] is skipped entirely. Fortunately, the list() method is public, so you can call it directly for a slight performance increase.
A rough test shows that this reduced the time by approximately 25% (when searching a folder with 100,000 children).
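As a rough sketch of how Improvement 1 could be applied to the original problem (the folder path and n below are placeholders, and this naive approach may pick the same entry twice):
import java.io.File;
import java.util.Random;

// Sketch: pick n random children using File.list() so that only the chosen
// entries are ever turned into File objects. Duplicates are possible here;
// remove picked names from the array (or track picked indices) if that matters.
File root = new File("C:/some/huge/folder"); // placeholder path
String[] names = root.list();                // names only, no File[] conversion
if (names != null && names.length > 0) {
    Random rnd = new Random();
    int n = 10;                              // placeholder: how many entries you want
    for (int i = 0; i < n; i++) {
        String name = names[rnd.nextInt(names.length)];
        File child = new File(root, name);   // create a File only for the picked entry
        // process 'child' and move it to the other location here
    }
}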
Improvement 2
The next logical step is to look at list() and see what it does. Here things get a little sticky:
public String[] list() {
    SecurityManager security = System.getSecurityManager();
    if (security != null) {
        security.checkRead(path);
    }
    if (isInvalid()) {
        return null;
    }
    return fs.list(this);
}
Under the assumption that you are okay with skipping the security and validation checks, you would want to follow fs.list(this). Doing so takes you down a bit of a rabbit hole:
fs.list(this)
DefaultFileSystem.getFileSystem().list(File f)
WinNTFileSystem.list(File f)
which is where you stop. The list(File f) method is declared native, meaning it is implemented in native code via JNI. Access is restricted all the way down the chain, so you cannot call these lower-level methods directly.
If you want to go as deep as you possibly can, you could use reflection to gain access to these methods. The lowest level I believe you can reach is the native method WinNTFileSystem.list(File f), though I would highly recommend against doing this.
import java.lang.reflect.Field;
import java.lang.reflect.Method;

/* Setup */
// Get the FileSystem instance from the File class ('fs' is a static field)
Field fieldFileSystem = File.class.getDeclaredField("fs");
fieldFileSystem.setAccessible(true);
Object fs = fieldFileSystem.get(null);
// Get the WinNTFileSystem class
Class<?> classWinNTFileSystem = Class.forName("java.io.WinNTFileSystem");
// Get the native `list` method from the WinNTFileSystem class
Method methodList = classWinNTFileSystem.getMethod("list", File.class);
methodList.setAccessible(true);

/* Each time you want to invoke the method */
String[] files = (String[]) methodList.invoke(fs, root);
The performance gain from this varied significantly. At times it was only slightly better than the previous method, while at other times I saw drastic improvements of over 50%, though I am skeptical of that figure. Using this method you should see at least a minor increase over File.list(). (This assumes you create the Method object once and reuse it throughout the code.)
Note
Short of using keys as file names, you won't see any significant performance increase beyond what I have shown. To index into a folder the way you want, you need the full list; there simply is no native implementation of "get child at index n". You could, however, use a key or index as the filename itself and simply create a new File object using new File(root, "12353");.

Actually, Java has the DirectoryStream interface, which can be used to iterate over a directory without preloading its contents into memory. Sample code is shown below.
Path logFolder = Paths.get(windowsClientParentFolder);
try (DirectoryStream<Path> stream = Files.newDirectoryStream(logFolder)) {
    for (Path entry : stream) {
        String folderName = entry.getFileName().toString();
        // process the folder
    }
} catch (IOException ex) {
    System.out.println("Exception occurred while reading folders.");
}
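If the goal is still n random entries, the DirectoryStream approach can be combined with reservoir sampling so that at most n entries are ever held in memory. A minimal sketch, where the folder path and n are placeholders:
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Reservoir sampling over a DirectoryStream: one pass, at most n entries kept.
Path folder = Paths.get("C:/some/huge/folder"); // placeholder path
int n = 10;                                     // placeholder sample size
List<Path> reservoir = new ArrayList<>(n);
Random rnd = new Random();
int seen = 0;
try (DirectoryStream<Path> stream = Files.newDirectoryStream(folder)) {
    for (Path entry : stream) {
        if (seen < n) {
            reservoir.add(entry);               // fill the reservoir first
        } else {
            int j = rnd.nextInt(seen + 1);
            if (j < n) {
                reservoir.set(j, entry);        // replace with decreasing probability
            }
        }
        seen++;
    }
} catch (IOException ex) {
    // handle or log as appropriate
}
// 'reservoir' now holds up to n uniformly random entries
This still walks the whole directory once, but it never materializes the full 100k-entry listing.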


ConcurrentMap on demand loading java

I'm working on an on-demand cache that needs to be thread-safe. I have data for about 30K+ items (in one file) that I want to obtain only when needed for my multi-threaded game. However, I'm not sure if my approach is how ConcurrentMap's computeIfAbsent is supposed to be used, and if it isn't, what alternative is there for me to lazily load contents from a single file without worrying about threading issues? I want to avoid locking if the object exists in my map, which I've read using CHM does on reads.
I've pre-cached the file names (which are IDs) that I want to load, to ensure they exist and to avoid constant existence checks via the headers hash map. The headers map is read-only and is loaded only once when my program starts.
This is what I've done:
private static final ConcurrentMap<Integer, ItemData> items = new ConcurrentHashMap<>();
private static final HashMap<Integer, Byte> headers = new HashMap<>(); // pre-loaded file names to avoid checking if a file exists
public static ItemData getItem(int itemID) {
    var item = items.get(itemID);
    if (item != null) {
        return item;
    }
    // if the item doesn't exist in the map, check if it exists in the file on disk
    if (!headers.containsKey(itemID)) {
        return null;
    }
    // if the item exists in the file, add it to the cache
    return items.computeIfAbsent(itemID, k -> {
        try (var dis = new DataInputStream(new FileInputStream("item.bin"))) {
            var data = new ItemData(itemID);
            data.load(dis); // obtains only the data for one item
            return data;
        } catch (IOException e) {
            // omitted for brevity. logging goes here.
            return null;
        }
    });
}
Update: Pre-loading isn't an option for me. I agree that doing so would solve the threading issues, since the map would then be read-only, but my game assets combined have a total size of over 2GB. I don't want to load everything during startup, as some items in the files may never be used. Thus I'm looking for an approach that loads them only when needed.
You wrote
I want to avoid locking if the object exists in my map, which I've read using CHM does on reads.
I don’t know where you read that but it’s definitely wrong. It’s not even an outdated statement as even the very first version specifies:
Retrieval operations (including get) generally do not block…
The general structure of your approach is fine. In case of concurrent first time accesses for a key, it’s possible that multiple threads pass the first check but only one will do the actual retrieval in computeIfAbsent and all of them will use the result. Subsequent accesses to an already loaded item may benefit from the first plain get access.
There’s still something to improve.
return items.computeIfAbsent(itemID, k -> {
    try (var dis = new DataInputStream(new FileInputStream("item.bin"))) {
        var data = new ItemData(k);
        data.load(dis); // obtains only the data for one item
        return data;
    } catch (IOException e) {
        // may still do logging here
        throw new UncheckedIOException(e);
    }
});
First, while it's a good approach to do logging (which you omitted for brevity), returning null and forcing the calling code to deal with null is not a good idea. You already have the headers.containsKey(…) check that tells us the resource is supposed to be there; the application likely has no way to deal with its absence, so we're talking about an exceptional situation.
Further, you can use the k parameter passed to the function rather than capturing itemID from the surrounding scope. Limiting access scopes is not only cleaner; in this case it turns the lambda expression into a non-capturing one, which means no new object has to be created on each call just to hold the captured value.
If you really read the same item.bin file for all ItemData, you may consider using memory-mapped I/O to share the data, instead of reading it with a DataInputStream. The ByteBuffer representation of a memory-mapped file offers almost the same methods for reading compound items, and it even supports little-endian processing, which DataInputStream does not.
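As an illustration of that suggestion, here is a minimal memory-mapping sketch. The record layout, the offsetForItem() helper, and a ByteBuffer-based ItemData.load overload are assumptions made for the example, not part of the original code:
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

class ItemStore {
    private final ByteBuffer mapped; // one shared, read-only mapping of item.bin

    ItemStore() throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get("item.bin"), StandardOpenOption.READ)) {
            mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        } // the mapping stays valid after the channel is closed
    }

    // Called from inside the computeIfAbsent lambda instead of opening a new stream.
    ItemData readItem(int itemID) {
        ByteBuffer view = mapped.duplicate();    // independent position, shared contents
        view.order(ByteOrder.LITTLE_ENDIAN);     // duplicate() does not inherit byte order
        view.position(offsetForItem(itemID));
        ItemData data = new ItemData(itemID);
        data.load(view);                         // assumes a ByteBuffer-based load overload
        return data;
    }

    private int offsetForItem(int itemID) {
        return itemID * 64;                      // placeholder: fixed-size records assumed
    }
}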

Using a PriorityBlockingQueue to feed in logged objects for processing

I have an application that reads in objects from multiple serialized object logs and hands them off to another class for processing. My question focuses on how to efficiently and cleanly read in the objects and send them off.
The code was pulled from an older version of the application, but we ended up keeping it as is. It hasn't really been used much until the past week, but I recently started looking at the code more closely to try and improve it.
It opens N ObjectInputStreams, and reads one object from each stream to store them in an array (assume inputStreams below is just an array of ObjectInputStream objects that corresponds to each log file):
for (int i = 0; i < logObjects.length; i++) {
    if (inputStreams[i] == null) {
        continue;
    }
    try {
        if (logObjects[i] == null) {
            logObjects[i] = (LogObject) inputStreams[i].readObject();
        }
    } catch (final InvalidClassException e) {
        LOGGER.warn("Invalid object read from " + logFileList.get(i).getAbsolutePath(), e);
    } catch (final EOFException e) {
        inputStreams[i] = null;
    }
}
The objects that were serialized to file are LogObject objects. Here is the LogObject class:
public class LogObject implements Serializable {
    private static final long serialVersionUID = -5686286252863178498L;

    private Object logObject;
    private long logTime;

    public LogObject(Object logObject) {
        this.logObject = logObject;
        this.logTime = System.currentTimeMillis();
    }

    public Object getLogObject() {
        return logObject;
    }

    public long getLogTime() {
        return logTime;
    }
}
Once the objects are in the array, it then compares the log time and sends off the object with the earliest time:
// handle the LogObject with the earliest log time
minTime = Long.MAX_VALUE;
for (int i = 0; i < logObjects.length; i++) {
    logObject = logObjects[i];
    if (logObject == null) {
        continue;
    }
    if (logObject.getLogTime() < minTime) {
        index = i;
        minTime = logObject.getLogTime();
    }
}
handler.handleOutput(logObjects[index].getLogObject());
My first thought was to create a thread for each file that reads in and puts the objects in a PriorityBlockingQueue (using a custom comparator that uses the LogObject log time to compare). Another thread could then be taking the values out and sending them off.
The only issue here is that one thread could put an object on the queue and have it taken off before another thread could put one on that may have an earlier time. This is why the objects were read in and stored in an array initially before checking for the log time.
Does this constraint prohibit me from implementing a multi-threaded design? Or is there a way I can tweak my solution to make it more efficient?
As far as I understand your problem, you need to process LogObjects strictly in order. In that case the initial part of your code is entirely correct. What this code does is a merge of several sorted input streams: you read one object from each stream (this is why the temporary array is needed), then take the appropriate (minimum/maximum) LogObject and hand it to the processor.
Depending on your context, you might be able to do the processing in several threads. The only thing you need to change is to put the LogObjects in an ArrayBlockingQueue, and the processors can then run on several independent threads. Another option is to submit the LogObjects for processing to a ThreadPoolExecutor. The latter option is simpler and more straightforward.
But be aware of several pitfalls on the way:
for this algorithm to work correctly, the individual streams must already be sorted; otherwise your program is broken;
when you do processing in parallel, the message processing order is, strictly speaking, not defined. The proposed algorithms therefore only guarantee the order in which processing starts (dispatch order). That might not be what you want.
So now you face several questions:
Is processing order really required?
If so, is global order required (over all messages), or only local order (within independent groups of messages)?
The answers to those questions will have a great impact on your ability to do parallel processing.
If the answer to the first question is yes then, sadly, parallel processing is not an option.
I agree with you. Throw this away and use a PriorityBlockingQueue.
The only issue here is that if Thread 1 has read an object from File 1 in and put it in the queue (and the object File 2 was going to read in has an earlier log time), the reading Thread could take it and send it off, resulting in a log object with a later time being sent first
This is exactly like the merge phase of a balanced merge (Knuth, The Art of Computer Programming, vol. 3). You must read the next input from the same file as you got the previous lowest element from.
Does this constraint prohibit me from implementing a multi-threaded design?
It isn't a constraint. It's imaginary.
Or is there a way I can tweak my solution to make it more efficient?
Priority queues are already pretty efficient. In any case you should certainly worry about correctness first. Then add buffering ;-) Wrap the ObjectInputStreams around BufferedInputStreams, and ensure there is a BufferedOutputStream in your output stack.
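To make the "read the next input from the same file" rule concrete, here is a rough sketch of a priority-queue merge keyed by log time. The StreamHead wrapper, the Handler type, and the readNext helper are illustrative names, not from the original code:
import java.io.EOFException;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.util.Comparator;
import java.util.PriorityQueue;

// Pairs a buffered LogObject with the stream it came from, so the refill
// always reads from the stream that produced the element just emitted.
class StreamHead {
    final LogObject head;
    final ObjectInputStream source;

    StreamHead(LogObject head, ObjectInputStream source) {
        this.head = head;
        this.source = source;
    }
}

class LogMerger {
    static void mergeAndDispatch(ObjectInputStream[] inputStreams, Handler handler) throws IOException {
        PriorityQueue<StreamHead> queue =
                new PriorityQueue<>(Comparator.comparingLong((StreamHead s) -> s.head.getLogTime()));
        // Seed the queue with one object per stream, skipping exhausted streams.
        for (ObjectInputStream in : inputStreams) {
            if (in == null) {
                continue;                        // mirrors the null check in the original code
            }
            LogObject first = readNext(in);
            if (first != null) {
                queue.add(new StreamHead(first, in));
            }
        }
        // Repeatedly emit the earliest element, then refill from the same stream.
        while (!queue.isEmpty()) {
            StreamHead min = queue.poll();
            handler.handleOutput(min.head.getLogObject());
            LogObject next = readNext(min.source);
            if (next != null) {
                queue.add(new StreamHead(next, min.source));
            }
        }
    }

    private static LogObject readNext(ObjectInputStream in) throws IOException {
        try {
            return (LogObject) in.readObject();
        } catch (EOFException eof) {
            return null;                         // this stream is exhausted
        } catch (ClassNotFoundException e) {
            throw new IOException(e);
        }
    }
}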

Efficiency of method call in for loop condition

I am writing a game engine, in which a set of objects held in an ArrayList is iterated over using a for loop. Obviously, efficiency is rather important, and so I was wondering about the efficiency of the loop.
for (String extension : assetLoader.getSupportedExtensions()) {
    // do stuff with the extension here
}
Where getSupportedExtensions() returns an ArrayList of Strings. What I'm wondering is whether the method is called every time the loop iterates over a new extension. If so, would it be more efficient to do something like:
ArrayList<String> supportedExtensions = ((IAssetLoader<?>) loader).getSupportedExtensions();
for (String extension : supportedExtensions) {
    // stuff
}
? Thanks in advance.
By specification, the idiom
for (String extension : assetLoader.getSupportedExtensions()) {
    ...
}
expands into
for (Iterator<String> it = assetLoader.getSupportedExtensions().iterator(); it.hasNext();) {
    String extension = it.next();
    ...
}
Therefore the call you ask about occurs only once, at loop init time. It is the iterator object whose methods are being called repeatedly.
However, if you are honestly interested in the performance of your application, then you should make sure you're focusing on the big wins and not small potatoes like this. It is almost impossible to make a getter call stand out as a bottleneck in any piece of code. This goes double for applications running on HotSpot, which will inline that getter call and turn it into a direct field access.
No, the method assetLoader.getSupportedExtensions() is called only once before the first iteration of the loop, and is used to create an Iterator<String> used by the enhanced for loop.
The two snippets will have the same performance.
Direct cost.
Since, as people said before, the following
for (String extension : assetLoader.getSupportedExtensions()) {
    // stuff
}
transforms into
for (Iterator<String> it = assetLoader.getSupportedExtensions().iterator(); it.hasNext();) {
    String extension = it.next();
    // stuff
}
getSupportedExtensions() is called once, so both of your code snippets have the same performance cost, but neither gives the best possible performance for traversing the List, because of the...
Indirect cost
Which is the cost of instantiating and using a new short-lived object, plus the cost of the next() method. The iterator() method prepares an instance of Iterator, so time is spent instantiating the object and then (when it becomes unreachable) garbage-collecting it. The total indirect cost isn't large (roughly 10 instructions to allocate memory for the new object, a few instructions for the constructor, about 5 lines of ArrayList.Itr.next(), and removal of the object from Eden on a minor GC), but I personally prefer indexing (or even plain arrays):
ArrayList<String> supportedExtensions = ((IAssetLoader<?>) loader).getSupportedExtensions();
for (int i = 0; i < supportedExtensions.size(); i++) {
    String extension = supportedExtensions.get(i);
    // stuff
}
over iterating when I have to traverse the list frequently in the main path of my application. Some other examples of standard Java code with hidden costs are some String methods (substring(), trim(), etc.), NIO Selectors, and boxing/unboxing of primitives when storing them in Collections.

Sharing array of bins between threads

I have an application that is multithreaded and working OK. However, it's hitting lock contention issues (checked by snapshotting the Java stack and seeing what's waiting).
Each thread consumes objects off a list and either rejects each or places it into a Bin.
The Bins are initially null, as each can be expensive to create (and there are potentially a lot of them).
The code that is causing the contention looks roughly like this:
public void addToBin(Bin[] bins, Item item) {
    Bin bin;
    int bin_index = item.bin_index;
    synchronized (bins) {
        bin = bins[bin_index];
        if (bin == null) {
            bin = new Bin();
            bins[bin_index] = bin;
        }
    }
    synchronized (bin) {
        bin.add(item);
    }
}
It is the synchronization on the bins array that is the bottleneck.
It was suggested to me by a colleague to use double checked locking to solve this, but we're unsure exactly what would be involved to make it safe. The suggested solution looks like this:
public void addToBin(Bin[] bins, Item item) {
    int bin_index = item.bin_index;
    Bin bin = bins[bin_index];
    if (bin == null) {
        synchronized (bins) {
            bin = bins[bin_index];
            if (bin == null) {
                bin = new Bin();
                bins[bin_index] = bin;
            }
        }
    }
    synchronized (bin) {
        bin.add(item);
    }
}
Is this safe and/or is there a better/safer/more idiomatic way to do this?
As already stated in the answer of Malt, Java already provides many lock-free data structures and concepts that can be used to solve this problem. I'd like to add a more detailed example using AtomicReferenceArray:
Assuming bins is an AtomicReferenceArray, the following code performs a lock-free update in case of a null entry:
Bin bin = bins.get(index);
while (bin == null) {
    bin = new Bin();
    if (!bins.compareAndSet(index, null, bin)) {
        // some other thread already set the bin in the meantime
        bin = bins.get(index);
    }
}
// use bin as usual
Since Java 8, there is a more elegant solution for that:
Bin bin = bins.updateAndGet(index, oldBin -> oldBin == null ? new Bin() : oldBin);
// use bin as usual
Note: The Java 8 version is, while still non-blocking, perceptibly slower than the Java 7 version above, because updateAndGet will always update the array even if the value does not change. This may or may not be negligible depending on the overall cost of the entire bin-update operation.
Another very elegant strategy is simply to pre-fill the entire bins array with newly created Bin instances before handing the array over to the worker threads. As the threads then don't have to modify the array, this reduces the need for synchronization to the Bin objects themselves. Filling the array can easily be done in parallel using Arrays.parallelSetAll (since Java 8):
Arrays.parallelSetAll(bins, i -> new Bin());
Update 2: Whether this is an option depends on the expected output of your algorithm: will the bins array end up filled completely, densely, or only sparsely? (In the first case pre-filling is advisable; in the second it depends, as so often; in the last it's probably a bad idea.)
Update 1: Don't use double-checked locking! It is not safe! The problem here is visibility, not atomicity. In your case, the reading thread might get a partly constructed (hence corrupt) Bin instance. For details see http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html.
Java has a variety of excellent lock-free concurrent data structures, so there's really no need to use arrays with synchronization for this type of thing.
ConcurrentSkipListMap is a concurrent, sorted, key-value map.
ConcurrentHashMap is a concurrent, unsorted key-value map.
You can simply use one of these instead of the array. Just set the map key to be the Integer index you already use and you're good to go.
There's also Google's ConcurrentLinkedHashMap and Google's Guava Cache, which are excellent for keeping ordered data, and for removing old entries.
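For illustration, a minimal sketch of the map-based variant using ConcurrentHashMap.computeIfAbsent (Java 8+), reusing the Bin and Item types from the question:
import java.util.concurrent.ConcurrentMap;

// bins becomes a ConcurrentMap<Integer, Bin> (e.g. a ConcurrentHashMap) instead of Bin[].
public void addToBin(ConcurrentMap<Integer, Bin> bins, Item item) {
    // computeIfAbsent creates each bin at most once, without a lock over the whole map.
    Bin bin = bins.computeIfAbsent(item.bin_index, idx -> new Bin());
    synchronized (bin) {                     // per-bin locking, as in the original code
        bin.add(item);
    }
}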
I would advise against the 2nd solution because it accesses the bins array outside of a synchronized block, so it is not guaranteed that changes made by another thread are visible to the code reading an element from it unsynchronized.
It is not guaranteed that a concurrently added new Bin will be seen, so the code might create a new Bin for the same index again and discard one that was concurrently created and stored - forgetting that items might already have been placed in the discarded one.
If none of the built-in Java classes help you, you could just create 8 bin locks, say binsALock to binsHLock.
Then divide bin_index by 8 and use the remainder to choose the lock to use.
If you choose a larger number of locks than the number of threads you have, and use locks that are very fast to acquire, you may do better than with 8. A rough sketch of this striping idea is shown below.
You may also get a better result by reducing the number of threads you use.
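A rough sketch of that lock-striping idea, keeping the original array but spreading the synchronization over a small fixed set of lock objects (the stripe count of 8 is just the example number from above):
// Lock striping: a fixed pool of lock objects; bins sharing a stripe share a lock.
private static final int STRIPES = 8;
private final Object[] stripeLocks = new Object[STRIPES];
{
    for (int i = 0; i < STRIPES; i++) {
        stripeLocks[i] = new Object();
    }
}

public void addToBin(Bin[] bins, Item item) {
    int bin_index = item.bin_index;
    Object lock = stripeLocks[bin_index % STRIPES];  // remainder picks the stripe
    Bin bin;
    synchronized (lock) {                            // contention is spread over 8 locks
        bin = bins[bin_index];
        if (bin == null) {
            bin = new Bin();
            bins[bin_index] = bin;
        }
    }
    synchronized (bin) {
        bin.add(item);
    }
}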

Fast sort by date of huge file ArrayList

I have an ArrayList in Java with a huge number of files (~40,000 files). I need to sort these files ascending/descending by their date. Currently, I use a simple
Collections.sort(fileList, new FileDateComparator());
where FileDateComparator is
public class FileDateComparator implements Comparator<File>
{
    @Override
    public int compare(File o1, File o2)
    {
        if (o1.lastModified() < o2.lastModified())
            return -1;
        if (o1.lastModified() == o2.lastModified())
            return 0;
        return 1;
    }
}
Sorting takes far too long for me, 20 seconds or more. Is there a more efficient way to do this? I already tried Apache Commons IO's LastModifiedFileComparator as the comparator, but it seems to be implemented the same way, since it takes the same time.
I think you need to cache the modification times to speed this up. You could e.g. try something like this:
class DatedFile {
    File f;
    long moddate;

    public DatedFile(File f, long moddate) {
        this.f = f;
        this.moddate = moddate;
    }
}

ArrayList<DatedFile> datedFiles = new ArrayList<DatedFile>();
for (File f : fileList) {
    datedFiles.add(new DatedFile(f, f.lastModified()));
}
Collections.sort(datedFiles, new DatedFileComparator());
ArrayList<File> sortedFiles = new ArrayList<File>();
for (DatedFile f : datedFiles) {
    sortedFiles.add(f.f);
}
(with an appropriate DatedFileComparator implementation that compares the cached moddate fields)
Sorting is O(n log n), so your list of 40,000 files needs about 600,000 comparisons. If that takes about 20 seconds, that is roughly 30,000 comparisons per second, so each comparison is taking about 100,000 clock cycles. That cannot be due to CPU-bound processing. The sorting is almost certainly I/O bound rather than CPU bound; disk seeks are particularly expensive.
You might be able to reduce the time by using multi-threading to reduce the impact of disk seeks. That is, by having several reads queued and waiting for the disk drive to provide their data. To do that, use a (concurrent) map that maps file names to modification times, and populate that map using multiple threads. Then have your sort method use that map rather than use File.lastModified() itself.
Even if you populated that map with only one thread, you would gain a little benefit because your sort method would be using locally cached modification times, rather than querying the O/S every time for the modification times. The benefit of that caching might not be large, because the O/S itself is likely to cache that information.
Java's object array sort() is (since Java 7) actually TimSort [ http://svn.python.org/projects/python/trunk/Objects/listsort.txt ], the fastest general-purpose sort out there (much better than qsort in many situations); you won't be able to sort anything noticeably faster without a heuristic.
"like 20 seconds or more" signifies to me that your problem is probably the famous ApplicationProfilingSkippedByDeveloperException - do a profiling and locate the exact bottleneck. I'd go with the OS file I/O as one; doing a native request of the file attributes in batch, caching the results and then processing them at once seems the only sensible solution here.
You need to cache the lastModified() value. One way you can do this is in the Comparator itself.
public class FileDateComparator implements Comparator<File> {
    Map<File, Long> lastModifiedMap = new HashMap<>();

    Long lastModified(File f) {
        Long ts = lastModifiedMap.get(f);
        if (ts == null)
            lastModifiedMap.put(f, ts = f.lastModified());
        return ts;
    }

    @Override
    public int compare(File f1, File f2) {
        return lastModified(f1).compareTo(lastModified(f2));
    }
}
This will improve performance by only looking up the modified date of each file once.
