Files.walkFileTree vs Files.walk performance on Windows NTFS - java

My application needs to periodically scan filesystems in order to process files. Initially I was using java.nio.file.Files.walk to perform the scan, but soon enough I ran into AccessDeniedException. After some googling I found out that Files.walk expects the entire directory tree to be readable by the user; otherwise the stream throws and the walk stops, which makes it unusable for my application (it is self-hosted by many people on all kinds of systems).
I changed the implementation to use java.nio.file.Files.walkFileTree instead, which seemed to work well and lets me handle AccessDeniedException in my own code.
However, someone recently reported that scanning time skyrocketed from a mere 12 seconds (using Files.walk) to 80 minutes (using Files.walkFileTree)! The user has around 10,000 folders and 120,000 files, running Windows with an NTFS disk. Other users with a similar number of folders/files but running Linux see scan times under 10 seconds with either method.
I am trying to understand what could cause the massive performance hit by using Files.walkFileTree on Windows with NTFS, but given I don't have access to a test system running Windows, I can't debug the code to understand where the time is spent.
Do you know of any known issues around walking a file tree under Windows/NTFS? And are there other methods I could use to perform this task, keeping in mind that I need to handle AccessDeniedException in my own code?
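For reference, here is a minimal sketch of the walkFileTree approach described above (the class and variable names are illustrative, not the application's actual code). Overriding visitFileFailed is what lets the walk continue past entries the user cannot read, instead of aborting the way the Files.walk stream does:

    import java.io.IOException;
    import java.nio.file.*;
    import java.nio.file.attribute.BasicFileAttributes;
    import java.util.ArrayList;
    import java.util.List;

    public class TreeScan {
        public static List<Path> scan(Path root) throws IOException {
            List<Path> found = new ArrayList<>();
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                    found.add(file);
                    return FileVisitResult.CONTINUE;
                }
                @Override
                public FileVisitResult visitFileFailed(Path file, IOException exc) {
                    // AccessDeniedException and other I/O errors land here,
                    // so the walk keeps going instead of crashing.
                    return FileVisitResult.CONTINUE;
                }
            });
            return found;
        }
    }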

Related

stop java execution if condition is met

The idea is a kind of virtual classroom (a website) where students upload uncompiled .java files. Our server compiles and executes them, driven from C# or PHP (the language doesn't matter), by creating a .bat file and capturing the console output to tell whether the program compiled correctly and whether the run passed some pre-made tests. So far our tests work, but we have no control over what's inside the .java file, so we want to stop execution if certain things happen, e.g. user input, an infinite loop, socket instances, etc. I've been digging around the internet for a way to configure the Java environment to prevent this, but so far I can't find anything, and we don't want our backend language to parse the file to check for these things because that would be a complete mess.
Thanks for the help
You could configure a security manager, but it doesn't have a very good track record of stopping a determined attacker, and it doesn't do resource limiting anyway.
You could load the untrusted code with a dedicated class loader that only sees whitelisted classes (see the sketch below).
Or you could use something like docker to isolate the process at the operating system level. This could also limit its cpu and memory consumption.
I'd probably combine these approaches, but some risk will remain in any case.
(Yes, I realize that is complex, but safely sandboxing arbitrary java code is a hard problem.)
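For what it's worth, a rough, hypothetical sketch of the whitelisted-class-loader idea is below. It is only a skeleton: the allowed set would have to include core classes such as java.lang.Object, the student bytecode would have to be defined through this loader, and on its own it does nothing about infinite loops or resource exhaustion.

    import java.util.Set;

    public class WhitelistClassLoader extends ClassLoader {
        private final Set<String> allowed;

        public WhitelistClassLoader(Set<String> allowed, ClassLoader parent) {
            super(parent);
            this.allowed = allowed;
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            // Refuse to resolve anything that is not explicitly whitelisted,
            // e.g. java.net.Socket or java.io.File.
            if (!allowed.contains(name)) {
                throw new ClassNotFoundException("Not whitelisted: " + name);
            }
            return super.loadClass(name, resolve);
        }
    }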

Handle File Access Exceptions

I have been playing with a program that copies a set of files into a directory, manipulates said files, and outputs them to another directory. On 99% of the environments it has been tested / deployed on it works fine, however on one particular machine I will occasionally (roughly once every 600 files) receive FileNotFoundException: (Access denied) exceptions, which are not repeatable using the same testing conditions and inputs. After thoroughly checking all my code (and the open source lib it uses) for any unclosed resources, I ran the program alongside Process Monitor to see what other processes were accessing the files: javaw and Explorer were to be expected, but there was also a windows audio service randomly querying and locking files (wtf).
Ultimately, it would make sense to retry when catching this exception, but there is no guarantee the file won't still be locked. Would it make sense to deny access rights to the SYSTEM user? Is there a file-locking mechanism I should be utilizing? Has anyone else dealt with a file access issue like this in the past?
Additional Info: I was able to run the program with no issues on my machine after removing SYSTEM privileges on the directories where files are stored, but I have never had the issue on this PC anyway. I also implemented a retry after a short wait whenever the exception is caught (again, it never fired because this machine has never had the issue). I will be redeploying this week to see how it works, i.e. whether there are still file access exceptions after changing privileges / allowing a small wait on failure...
Final Follow-Up: On a whim I reinstated this Stackoverflow account and found this question. If I recall, the final resolution was to:
+ Add any access denied files to a list
+ Process the list at the end of processing all other files
Certainly not foolproof, but this was also running on XP, so hopefully it is a defunct issue by now.
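In case it helps someone, a minimal sketch of that defer-and-retry idea follows (using NIO copy for brevity; the original code used streams and saw FileNotFoundException rather than AccessDeniedException, and the names here are just illustrative):

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.*;

    public class CopyWithRetry {
        // Copies every input into targetDir; files that fail are parked and
        // retried once after everything else, then reported back.
        public static List<Path> copyAll(List<Path> inputs, Path targetDir) {
            List<Path> denied = new ArrayList<>();
            for (Path source : inputs) {
                try {
                    Files.copy(source, targetDir.resolve(source.getFileName()));
                } catch (IOException e) {
                    denied.add(source); // park it and move on
                }
            }
            List<Path> stillFailing = new ArrayList<>();
            for (Path source : denied) {
                try {
                    Files.copy(source, targetDir.resolve(source.getFileName()));
                } catch (IOException e) {
                    stillFailing.add(source); // report instead of retrying forever
                }
            }
            return stillFailing;
        }
    }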

How to do CPU scavenging in Java?

My program is a piece of distributed software for a small laboratory. It is written in Java, but because the computers are used during the daytime, I have to manually restart it in the evening. I would solve the problem by starting it from a service every time the computer is started, but I need a mechanism to detect:
1) user input(mouse, keyboard etc.)
2) user logon
Detecting user input from Java does not seem possible. Is there any framework for something like this?
Detecting user input from Java is possible, but not with standard technologies. Considering the excellent example of aTunes, you can do it using, depending upon your platform:
JIntelliType on Windows
JXGrabKey on Linux
Considering your other question, I would use the native abilities of the client OSes to better handle your problem. The first thing is to give your Java process a lower priority; this way it can run the whole time without blocking the user. I would also force the program to stop when CPU load is over a given target. This can be achieved using JMX, as this previous question explains.
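As a sketch of the JMX idea: the cast below is to the com.sun.management extension of the OperatingSystemMXBean, so it assumes a HotSpot/OpenJDK JVM, and the threshold and sleep values are made-up examples.

    import com.sun.management.OperatingSystemMXBean;
    import java.lang.management.ManagementFactory;

    public class CpuGate {
        private static final OperatingSystemMXBean OS =
                (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();

        // Returns true while the machine looks busy (someone is likely using it).
        static boolean machineIsBusy(double threshold) {
            double load = OS.getSystemCpuLoad(); // 0.0-1.0, negative if unavailable
            return load > threshold;
        }

        public static void main(String[] args) throws InterruptedException {
            while (true) {
                if (!machineIsBusy(0.5)) {
                    // run one unit of the laboratory workload here
                }
                Thread.sleep(60_000); // check the load again in a minute
            }
        }
    }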

How to reliably detect disk is down on Linux via Java

Is there a good way to detect that particular disk went offline on server on Linux, via Java?
I have an application that, for performance reasons, writes to all disks directly (without any RAID in between).
I need to detect at run-time if Linux unmounts any disk due to a disk crash, so that I can stop using it. The problem is that each mount has a root directory, so without proper detection the application will just fill up the root partition.
I will appreciate any advice on this.
In Linux, a great deal of system state is accessible through text files. I don't really understand what exact information you require, but check /proc/diskstats, /proc/mounts, /proc/mdstat (for RAIDs), etc.
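For example, a small sketch of checking /proc/mounts for a given mount point (though note the caveat in the next answer: even this read can hang if the kernel is stuck on a dying device, so it may be safer to run it from a separate watchdog thread with a timeout; the method name and path are just illustrative):

    import java.io.IOException;
    import java.nio.file.*;

    public class MountCheck {
        // Returns true if mountPoint (e.g. "/mnt/disk3") still appears in /proc/mounts.
        static boolean isMounted(String mountPoint) throws IOException {
            for (String line : Files.readAllLines(Paths.get("/proc/mounts"))) {
                String[] fields = line.split("\\s+");
                if (fields.length > 1 && fields[1].equals(mountPoint)) {
                    return true;
                }
            }
            return false;
        }
    }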
As anyone with sysadmin experience could tell you, disks crashing or otherwise going away has a nasty habit of making any process that touches anything under the mountpoint wait in uninterruptible sleep. Additionally, in my experience, this can include things like trying to read /proc/mounts, or running the 'df' command.
My recommendation would be to use RAID, and if necessary, invest your way out of the problem. Say, if performance is limited by small random writes, a RAID card with a battery backed write cache can do wonders.

Running a standalone Hadoop application on multiple CPU cores

My team built a Java application using the Hadoop libraries to transform a bunch of input files into useful output.
Given the current load a single multicore server will do fine for the coming year or so. We do not (yet) have the need to go for a multiserver Hadoop cluster, yet we chose to start this project "being prepared".
When I run this app on the command line (or in Eclipse or NetBeans) I have not yet been able to convince it to use more than one map and/or reduce thread at a time.
Given that the tool is very CPU-intensive, this "single-threadedness" is my current bottleneck.
When running it in the netbeans profiler I do see that the app starts several threads for various purposes, but only a single map/reduce is running at the same moment.
The input data consists of several input files so Hadoop should at least be able to run 1 thread per input file at the same time for the map phase.
What do I do to at least have 2 or even 4 active threads running (which should be possible for most of the processing time of this application)?
I'm expecting this to be something very silly that I've overlooked.
I just found this: https://issues.apache.org/jira/browse/MAPREDUCE-1367
This implements the feature I was looking for in Hadoop 0.21
It introduces the flag mapreduce.local.map.tasks.maximum to control it.
For now I've also found the solution described here in this question.
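A sketch of setting that flag on the job configuration (assuming Hadoop 0.21+ and the new mapreduce API; the job name and the value 4 are just examples):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class LocalParallelism {
        public static Job newJob() throws IOException {
            Configuration conf = new Configuration();
            // Read by the LocalJobRunner in Hadoop 0.21+ (MAPREDUCE-1367);
            // 4 is just an example value (roughly the number of cores).
            conf.setInt("mapreduce.local.map.tasks.maximum", 4);
            return new Job(conf, "local-parallel-job");
        }
    }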
I'm not sure if I'm correct, but when you are running tasks in local mode, you can't have multiple mappers/reducers.
Anyway, to set the maximum number of running mappers and reducers, use the configuration options mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum; by default those options are set to 2, so I might be right.
Finally, if you want to be prepared for a multi-node cluster, go straight to running this in a fully distributed way, but have all the daemons (namenode, datanode, tasktracker, jobtracker, ...) run on a single machine.
Just for clarification...
If Hadoop runs in local mode you don't have parallel execution at the task level (unless you're running Hadoop >= 0.21 with MAPREDUCE-1367). You can, however, submit multiple jobs at once, and these are then executed in parallel.
All those
mapred.tasktracker.{map|reduce}.tasks.maximum
properties only apply to Hadoop running in distributed mode!
HTH
Joahnnes
According to this thread on the hadoop.core-user email list, you'll want to change the mapred.tasktracker.tasks.maximum setting to the max number of tasks you would like your machine to handle (which would be the number of cores).
This (and other properties you may want to configure) is also documented in the main documentation on how to setup your cluster/daemons.
What you want to do is run Hadoop in "pseudo-distributed" mode: one machine, but running task trackers and name nodes as if it were a real cluster. Then it will (potentially) run several workers.
Note that if your input is small, Hadoop will decide it's not worth parallelizing. You may have to coax it by changing its default split size (see the sketch below).
In my experience, "typical" Hadoop jobs are I/O bound, sometimes memory-bound, way before they are CPU-bound. You may find it impossible to fully utilize all the cores on one machine for this reason.
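Regarding the split-size suggestion above, with the new mapreduce API that can be done roughly like this (16 MB is an arbitrary example value):

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SmallSplits {
        public static void configure(Job job) {
            // Cap each split at 16 MB so that even a modest input produces
            // several map tasks to spread over the available cores.
            FileInputFormat.setMaxInputSplitSize(job, 16L * 1024 * 1024);
        }
    }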
