Currently I am tasked with making a tool in Java that checks whether a link is correct. The link is fed from the Jericho HTML Parser, and my job is only to check whether the file exists / the link is correct. That part is done; the hard part is optimizing it, since my code runs (I have to say) rather sluggishly at about 65 ms per call.
public static String checkRelativeURL(String originalFileLoc, String relativeLoc){
    StringBuilder sb = new StringBuilder();
    // built in function to replace the link from relative link to absolute path
    String absolute = Common.relativeToAbsolute(originalFileLoc, relativeLoc);
    sb.append(absolute);
    sb.append("\t");
    try {
        Path path = Paths.get(absolute);
        sb.append(Files.exists(path));
    } catch (InvalidPathException | NullPointerException ex) {
        sb.append(false);
    }
    sb.append("\t");
    return sb.toString();
}
and these are the lines that take the 65 ms:
Path path = Paths.get(absolute);
sb.append(Files.exists(path));
I have tried using
File file = new File(absolute);
sb.append(file.isFile());
It still runs in around 65~100 ms.
So is there any faster way to check whether a file exists than this?
I am processing more than 70k HTML files and every millisecond counts, thanks :(
EDIT:
I tried listing all the files into a List, and it doesn't really help, since it takes more than 20 minutes just to list all the files...
The code that I use to list all the files:
static public void listFiles2(String filepath){
    Path path = Paths.get(filepath);
    File file = null;
    String pathString = new String();
    try {
        if (path.toFile().isDirectory()) {
            // recurse into sub-directories and collect regular files
            DirectoryStream<Path> stream = Files.newDirectoryStream(path);
            for (Path entry : stream) {
                file = entry.toFile();
                pathString = entry.toString();
                if (file.isDirectory()) {
                    listFiles2(pathString);
                }
                if (file.isFile()) {
                    filesInProject.add(pathString);
                    System.out.println(pathString);
                }
            }
            stream.close();
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
If you know the target OS set in advance (which is usually the case), ultimately the fastest way to list this many files is through a shell, by invoking a process, e.g. using Runtime.exec.
On Windows you can do it with
dir /s /b
On Linux
ls -R -1
You can check which OS you are on and use the appropriate command (raise an error, or fall back to a directory stream, if neither is supported).
If you want simplicity and don't need to report progress, you can avoid dealing with the process IO and store the list in a temporary file, e.g. ls -R -1 > /tmp/filelist.txt. Alternatively, you can read from the process output directly; read with a buffered stream, a reader or the like, with a large enough buffer.
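For illustration only, here is a rough sketch of the direct-read variant with a simple OS check; only dir /s /b and ls -R -1 come from above, while the listAllFiles name, buffer size, and command wrapping are my own assumptions:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

public class FileLister {
    public static Set<String> listAllFiles(String root) throws Exception {
        String os = System.getProperty("os.name").toLowerCase();
        String[] cmd;
        if (os.contains("win")) {
            // recursive bare listing of the given root
            cmd = new String[] { "cmd", "/c", "dir", "/s", "/b", root };
        } else if (os.contains("nix") || os.contains("nux") || os.contains("mac")) {
            // note: ls -R prints per-directory headers and relative entries,
            // so these lines would still need to be joined back into full paths
            cmd = new String[] { "sh", "-c", "ls -R -1 '" + root + "'" };
        } else {
            throw new UnsupportedOperationException("unsupported OS: " + os);
        }

        Process p = Runtime.getRuntime().exec(cmd);
        Set<String> files = new HashSet<>(500_000);
        // read the listing straight from the process output, with a large buffer
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8),
                8 * 1024 * 1024)) {
            for (String line = reader.readLine(); line != null; line = reader.readLine()) {
                files.add(line);
            }
        }
        if (p.waitFor() != 0) {
            throw new Exception("listing command failed");
        }
        return files;
    }
}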
On an SSD it will complete in the blink of an eye, and on a modern HDD in seconds (half a million files is not a problem with this approach).
Once you have the list, you can approach it differently depending on the maximum file count and the memory requirements. If the requirements are loose, e.g. for a desktop program, you can use very simple code, e.g. pre-loading the complete file list into a HashSet and checking existence when needed. Shortening paths by removing the common root will require much less memory. You can also reduce memory by keeping only a hash of each name instead of the full name (common-root removal will probably reduce more).
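As a hedged example of the common-root trick, where the class, method, and sample paths are hypothetical:

import java.util.HashSet;
import java.util.Set;

public class RootStripper {
    // Store only the part after the common root; look names up the same way.
    public static Set<String> stripCommonRoot(Set<String> fullPaths, String rootPrefix) {
        Set<String> relative = new HashSet<>(fullPaths.size() * 2);
        for (String path : fullPaths) {
            relative.add(path.startsWith(rootPrefix)
                    ? path.substring(rootPrefix.length())
                    : path);
        }
        return relative;
    }

    public static void main(String[] args) {
        Set<String> full = new HashSet<>();
        full.add("T:\\site\\pages\\index.html");  // hypothetical entries
        full.add("T:\\site\\img\\logo.png");
        Set<String> relative = stripCommonRoot(full, "T:\\site\\");
        System.out.println(relative); // prints both relative names (order may vary)
    }
}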
Or you can optimize it further if you wish; the question now reduces to checking the existence of a string in a list of strings stored in memory or in a file, which has many well-known optimal solutions.
Below is a very loose, simplistic sample for Windows. It executes dir on an HDD (not SSD) drive root with ~400K files, reads the list, and benchmarks (well, kind of) time and memory for the string-set and MD5-set approaches:
public static void main(String args[]) throws Exception {
    final Runtime rt = Runtime.getRuntime();
    System.out.println("mem " + (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024) + " Mb");
    long time = System.currentTimeMillis();
    // windows command: cd to t:\ and run recursive dir
    Process p = rt.exec("cmd /c \"t: & dir /s /b > filelist.txt\"");
    if (p.waitFor() != 0)
        throw new Exception("command has failed");
    System.out.println("done executing shell, took "
            + (System.currentTimeMillis() - time) + "ms");
    System.out.println();

    File f = new File("T:/filelist.txt");
    // load into hash set
    time = System.currentTimeMillis();
    Set<String> fileNames = new HashSet<String>(500000);
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
            new FileInputStream(f), StandardCharsets.UTF_8), 50 * 1024 * 1024)) {
        for (String line = reader.readLine(); line != null; line = reader.readLine()) {
            fileNames.add(line);
        }
    }
    System.out.println(fileNames.size() + " file names loaded took "
            + (System.currentTimeMillis() - time) + "ms");
    System.gc();
    System.out.println("mem " + (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024) + " Mb");

    time = System.currentTimeMillis();
    // check files
    for (int i = 0; i < 70_000; i++) {
        StringBuilder fileToCheck = new StringBuilder();
        while (fileToCheck.length() < 256)
            fileToCheck.append(Double.toString(Math.random()));
        if (fileNames.contains(fileToCheck.toString()))
            System.out.println("to prevent optimization, never executes");
    }
    System.out.println();
    System.out.println("hash set 70K checks took "
            + (System.currentTimeMillis() - time) + "ms");
    System.gc();
    System.out.println("mem " + (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024) + " Mb");

    // Test memory/performance with MD5 hash set approach instead of full names
    time = System.currentTimeMillis();
    Set<String> nameHashes = new HashSet<String>(500000);
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    for (String name : fileNames) {
        // ISO-8859-1 maps bytes 1:1 to chars, so the digest round-trips losslessly
        // (decoding arbitrary bytes as UTF-8 would mangle them and cause collisions)
        String nameMd5 = new String(md5.digest(name.getBytes(StandardCharsets.UTF_8)),
                StandardCharsets.ISO_8859_1);
        nameHashes.add(nameMd5);
    }
    System.out.println();
    System.out.println(fileNames.size() + " md5 hashes created, took "
            + (System.currentTimeMillis() - time) + "ms");
    fileNames.clear();
    fileNames = null;
    System.gc();
    Thread.sleep(100);
    System.gc();
    System.out.println("mem " + (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024) + " Mb");

    time = System.currentTimeMillis();
    // check files
    for (int i = 0; i < 70_000; i++) {
        StringBuilder fileToCheck = new StringBuilder();
        while (fileToCheck.length() < 256)
            fileToCheck.append(Double.toString(Math.random()));
        String md5ToCheck = new String(md5.digest(fileToCheck.toString()
                .getBytes(StandardCharsets.UTF_8)), StandardCharsets.ISO_8859_1);
        if (nameHashes.contains(md5ToCheck))
            System.out.println("to prevent optimization, never executes");
    }
    System.out.println("md5 hash set 70K checks took "
            + (System.currentTimeMillis() - time) + "ms");
    System.gc();
    System.out.println("mem " + (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024) + " Mb");
}
Output:
mem 3 Mb
done executing shell, took 5686ms
403108 file names loaded took 382ms
mem 117 Mb
hash set 70K checks took 283ms
mem 117 Mb
403108 md5 hashes created, took 486ms
mem 52 Mb
md5 hash set 70K checks took 366ms
mem 48 Mb
Related
I have a file of 400+ GB like:
ID Data ...4000+columns
001 dsa
002 Data
… …
17201297 asdfghjkl
I wish to chunk the file by ID to get faster data retrieval, like:
mylocation/0/0/1/data.json
mylocation/0/0/2/data.json
.....
mylocation/1/7/2/0/1/2/9/7/data.json
My code is working fine, but whichever writer I use (with it being closed at the end of the loop), it takes at least 159,206 milliseconds for 0.001% of the file creation to complete.
In that case, could multithreading be an option to reduce the time (e.g. writing 100 or 1k files at a time)?
My current code is:
int percent = 0;
File file = new File(fileLocation + fileName);
FileReader fileReader = new FileReader(file); // to read input file
BufferedReader bufReader = new BufferedReader(fileReader);
BufferedWriter fw = null;
LinkedHashMap<String, BufferedWriter> fileMap = new LinkedHashMap<>();
int dataCounter = 0;
while ((theline = bufReader.readLine()) != null) {
    String generatedFilename = generatedFile + chrNo + "//" + directory + "gnomeV3.json";
    Path generatedJsonFilePath = Paths.get(generatedFilename);
    if (!Files.exists(generatedJsonFilePath)) { // create directory and file
        Files.createDirectories(generatedJsonFilePath.getParent());
        Files.createFile(generatedJsonFilePath);
    }
    String jsonData = DBFileMaker(chrNo, theline, pos);
    if (fileMap.containsKey(generatedFilename)) {
        fw = fileMap.get(generatedFilename);
        fw.write(jsonData);
    } else {
        fw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(generatedFilename)));
        fw.write(jsonData);
        fileMap.put(generatedFilename, fw);
    }
    if (dataCounter == 172 * percent) { // As I know my number of rows
        long millisec = stopwatch.elapsed(TimeUnit.MILLISECONDS);
        System.out.println("Upto: " + pos + " as " + (Double) (0.001 * percent)
                + "% completion successful." + " took: " + millisec + " milliseconds");
        percent++;
    }
    dataCounter++;
}
for (BufferedWriter generatedFiles : fileMap.values()) {
    generatedFiles.close();
}
That really depends on your storage. Magnetic disks really like sequential writes, so multithreading would probably have a bad effect on their performance. However, SSDs may benefit from multithreaded writing.
What you should do is either separate your code into two threads, where one thread creates the buffers of data to be written to disk and the second thread only writes the data; this way your disk is always kept busy and does not wait for more data to be generated.
Or have a single thread that generates the buffers to be written, but use java.nio to write the data asynchronously while it goes on to generate the next buffer.
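For illustration, a minimal sketch of the two-thread variant, assuming a hypothetical WriteTask holder, a bounded queue, and a poison-pill shutdown (none of which are prescribed by the answer above):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class TwoThreadWriter {
    // A unit of work: target file plus the bytes to append to it (hypothetical structure).
    static final class WriteTask {
        final Path target;
        final byte[] data;
        WriteTask(Path target, byte[] data) { this.target = target; this.data = data; }
    }

    // Poison pill that tells the writer thread to stop.
    static final WriteTask STOP = new WriteTask(null, null);

    public static void main(String[] args) throws Exception {
        BlockingQueue<WriteTask> queue = new ArrayBlockingQueue<>(1024);

        // Writer thread: drains the queue and appends to files, keeping the disk busy.
        Thread writer = new Thread(() -> {
            try {
                for (WriteTask task = queue.take(); task != STOP; task = queue.take()) {
                    Files.createDirectories(task.target.getParent());
                    Files.write(task.target, task.data,
                            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        writer.start();

        // Producer: generates records and hands them off without blocking on disk I/O
        // (dummy data here; the real code would read the 400 GB source instead).
        for (int i = 0; i < 1000; i++) {
            Path target = Paths.get("mylocation", String.valueOf(i % 10), "data.json");
            queue.put(new WriteTask(target,
                    ("{\"row\":" + i + "}\n").getBytes(StandardCharsets.UTF_8)));
        }
        queue.put(STOP); // signal completion
        writer.join();
    }
}

The bounded queue keeps the producer from racing too far ahead of the disk; its capacity is a tuning decision.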
Currently we have a nightly automation run that does a comparison between a resulting test file produced by our software and a baseline file. This comparison is done several times and the files are large. The file comparison is the bottleneck in our test automation.
File comparison is currently done via a buffered line-by-line comparison.
I was thinking of doing a checksum comparison of the two files (then doing the line-by-line check if the checksums do not match). Is this the best approach? Is there a public library someone would like to suggest?
Thanks
Is 10 ms good enough to compare two 260K files? (on a Windows laptop)
If so, you can use java.security.DigestInputStream to calculate and compare hashes.
Of course, check the file lengths first.
If the issue is that you have many file pairs to compare, consider using parallel threads to compare each pair.
Sample code:
public static void main(String[] args) {
    try {
        File file1 = new File("D:\\tmp\\tests\\logs\\test.log");
        File file2 = new File("D:\\tmp\\tests\\logs\\test-cp.log");
        if (!file1.exists() || !file2.exists()) {
            System.out.println("One of the file not found.");
            return;
        }
        if (file1.length() != file2.length()) {
            System.out.println("Files are not identical - not equal length.");
            return;
        }
        long f1Length = file1.length();
        long f2Length = file2.length();
        System.out.println("Check Digest method:");
        FileInputStream fis1 = new FileInputStream(file1);
        DigestInputStream dgStream1 = new DigestInputStream(fis1,
                MessageDigest.getInstance("MD5"));
        FileInputStream fis2 = new FileInputStream(file2);
        DigestInputStream dgStream2 = new DigestInputStream(fis2,
                MessageDigest.getInstance("MD5"));
        // most expensive is dgStream1.getMessageDigest() so do it only at last read
        dgStream1.on(false);
        dgStream2.on(false);
        long f1ReadTotal = 0;
        long f2ReadTotal = 0;
        long start = System.nanoTime();
        int read = 0;
        byte[] buff = new byte[1024 * 128];
        do {
            if ((f1Length - f1ReadTotal) < (1024 * 128)) {
                // last read
                dgStream1.on(true);
            }
            read = dgStream1.read(buff);
            f1ReadTotal += read > 0 ? read : 0;
        } while (read > 0);
        read = 0;
        do {
            if ((f2Length - f2ReadTotal) < (1024 * 128)) {
                // last read
                dgStream2.on(true);
            }
            read = dgStream2.read(buff);
            f2ReadTotal += read > 0 ? read : 0;
        } while (read > 0);
        long runTime = System.nanoTime() - start;
        if (Arrays.equals(dgStream1.getMessageDigest().digest(),
                dgStream2.getMessageDigest().digest())) {
            System.out.println("Files are identical. completed in "
                    + (runTime / 1000000) + " ms. [" + runTime + " ns.]");
        } else {
            System.out.println("Files are not identical. completed in "
                    + (runTime / 1000000) + " ms. [" + runTime + " ns.]");
        }
        fis1.close();
        fis2.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
The main point there is that getMessageDigest() is the most time-expensive operation, so do it only once, at the last read.
BTW: the code is just an idea. Real code must be more careful, especially about the "last read", and can definitely be more optimal.
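For the many-pairs case mentioned above, a rough sketch of fanning the comparisons out over a thread pool; the pair list and the filesAreIdentical helper (a plain length check plus full-content MD5) are placeholders of mine, not part of the answer:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCompare {

    // Hypothetical helper: length check first, then full-content MD5 comparison.
    static boolean filesAreIdentical(Path a, Path b) throws Exception {
        if (Files.size(a) != Files.size(b))
            return false;
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] hashA = md.digest(Files.readAllBytes(a));
        md.reset();
        byte[] hashB = md.digest(Files.readAllBytes(b));
        return Arrays.equals(hashA, hashB);
    }

    public static void main(String[] args) throws Exception {
        // Placeholder pairs; the real automation run would supply test/baseline paths.
        List<Path[]> pairs = Arrays.asList(
                new Path[] { Paths.get("test1.out"), Paths.get("baseline1.out") },
                new Path[] { Paths.get("test2.out"), Paths.get("baseline2.out") });

        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<Callable<Boolean>> tasks = new ArrayList<>();
        for (Path[] pair : pairs) {
            tasks.add(() -> filesAreIdentical(pair[0], pair[1]));
        }
        List<Future<Boolean>> results = pool.invokeAll(tasks); // one comparison per worker
        for (int i = 0; i < results.size(); i++) {
            System.out.println("pair " + i + " identical: " + results.get(i).get());
        }
        pool.shutdown();
    }
}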
I have several folders of size >2.5 GB on the C drive, which is an SSD. Through Java, I'm moving these folders to another shared drive, which also happens to be an SSD, using FileUtils.copyDirectoryToDirectory(sourceDir, destiDir);
It works fine but is slow (taking ~30 minutes) compared to the Windows default move option, which takes 5 minutes. I googled around to see if there is a better way to improve the performance of moving directories through my Java program, but no luck. Can someone suggest the best way to move these directories?
OK, this is what I did:
I used a robocopy command within Java to copy directories between the two locations. I tested with a ~9 GB file and was able to copy it in ~9 minutes. Below is the code snippet:
String sourceFolder = new File("C:\\test\\robocopytest\\source\\20170925T213857460").toString();
String destFolder = new File("C:\\test\\robocopytest\\destination\\20170925T213857460").toString();
StringBuffer rbCmd = new StringBuffer();
if ((sourceFolder != null) && (destFolder != null)) {
    if (sourceFolder.contains(" ")) {
        if (sourceFolder.startsWith("\\")) {
            sourceFolder = "/\"" + sourceFolder.substring(1) + "/\"";
        } else {
            sourceFolder = "\"" + sourceFolder + "\"";
        }
    }
    if (destFolder.contains(" ")) {
        if (destFolder.startsWith("\\")) {
            destFolder = "/\"" + destFolder.substring(1) + "/\"";
        } else {
            destFolder = "\"" + destFolder + "\"";
        }
    }
    rbCmd.append("robocopy " + sourceFolder + " " + destFolder);
    Process p = Runtime.getRuntime().exec(rbCmd.toString());
}
I'm using proc_open in PHP to call a Java application, pass it text to be processed, and read the output text. The Java execution time is quite long, and I found that the reason is that reading the input takes most of the time. I'm not sure whether it's PHP's or Java's fault.
My PHP code:
$process_cmd = "java -Dfile.encoding=UTF-8 -jar test.jar";
$env = NULL;
$options = ["bypass_shell" => true];
$cwd = NULL;
$descriptorspec = [
    0 => ["pipe", "r"], // stdin is a pipe that the child will read from
    1 => ["pipe", "w"], // stdout is a pipe that the child will write to
    2 => ["file", "java.error", "a"]
];
$process = proc_open($process_cmd, $descriptorspec, $pipes, $cwd, $env, $options);
if (is_resource($process)) {
    // feeding text to java
    fwrite($pipes[0], $input);
    fclose($pipes[0]);
    // reading output text from java
    $output = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    $return_value = proc_close($process);
}
My java code:
public static void main(String[] args) throws Exception {
    long start;
    long end;

    start = System.currentTimeMillis();
    BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
    String in;
    String input = "";
    br = new BufferedReader(new InputStreamReader(System.in));
    while ((in = br.readLine()) != null) {
        input += in + "\n";
    }
    end = System.currentTimeMillis();
    log("Input: " + Long.toString(end - start) + " ms");

    start = System.currentTimeMillis();
    org.jsoup.nodes.Document doc = Jsoup.parse(input);
    end = System.currentTimeMillis();
    log("Parser: " + Long.toString(end - start) + " ms");

    start = System.currentTimeMillis();
    System.out.print(doc);
    end = System.currentTimeMillis();
    log("Output: " + Long.toString(end - start) + " ms");
}
I'm passing to Java an HTML file of 3800 lines (~200 KB as a standalone file). These are the broken-down execution times from the log file:
Input: 1169 ms
Parser: 98 ms
Output: 12 ms
My question is this: why does input take 100 times longer than output? Is there a way to make it faster?
Inspect your read block in the Java program: try to use a StringBuilder to concatenate the data (instead of using += on a String):
String in;
StringBuilder input = new StringBuilder();
br = new BufferedReader(new InputStreamReader(System.in));
while ((in = br.readLine()) != null) {
    input.append(in + "\n");
}
Details are covered here: Why using StringBuilder explicitly
Generally speaking, to make it faster, consider using an application server (or a simple socket-based server) so that you have a permanently running JVM. There is always some overhead when you start a JVM, and on top of that the JIT needs some time to optimize your code. This effort is lost after the JVM exits.
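As a loose illustration of the socket-server idea (the port number and the one-document-per-connection protocol are assumptions of mine, not part of the answer):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

import org.jsoup.Jsoup;

public class JsoupServer {
    public static void main(String[] args) throws Exception {
        // The JVM (and the JIT-compiled Jsoup code) stays warm between requests.
        try (ServerSocket server = new ServerSocket(9000)) { // assumed port
            while (true) {
                try (Socket client = server.accept();
                     BufferedReader reader = new BufferedReader(new InputStreamReader(
                             client.getInputStream(), StandardCharsets.UTF_8));
                     PrintWriter writer = new PrintWriter(new OutputStreamWriter(
                             client.getOutputStream(), StandardCharsets.UTF_8), true)) {
                    // Protocol assumption: the client sends the whole HTML document,
                    // then closes its output side; we reply with the re-serialized document.
                    StringBuilder html = new StringBuilder();
                    String line;
                    while ((line = reader.readLine()) != null) {
                        html.append(line).append('\n');
                    }
                    writer.print(Jsoup.parse(html.toString()).outerHtml());
                }
            }
        }
    }
}

The PHP side would then connect with something like fsockopen instead of spawning a new JVM per request.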
As for the PHP program: try to feed the Java program from the shell; just use cat to pipe the data (on a UNIX system like Linux). As an alternative, rewrite your Java program to accept a command-line parameter for the file as well. Then you can judge whether your PHP code pipes the data fast enough.
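A small sketch of the command-line-parameter variant (the class name and the page.html example are hypothetical):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadInput {
    public static void main(String[] args) throws Exception {
        String input;
        if (args.length > 0) {
            // file named on the command line, e.g.:
            //   java -Dfile.encoding=UTF-8 -jar test.jar page.html
            input = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        } else {
            // fall back to reading stdin, e.g.:
            //   cat page.html | java -Dfile.encoding=UTF-8 -jar test.jar
            StringBuilder sb = new StringBuilder();
            BufferedReader br = new BufferedReader(
                    new InputStreamReader(System.in, StandardCharsets.UTF_8));
            for (String line = br.readLine(); line != null; line = br.readLine()) {
                sb.append(line).append('\n');
            }
            input = sb.toString();
        }
        System.out.println(input.length() + " characters read");
    }
}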
As for the Java program: If you do performance analysis, consider the recommendations in How do I write a correct micro-benchmark in Java
I'm using process = Runtime.getRuntime().exec(cmd, null, new File(path));
to execute some SQL in a file (abz.sql).
The command is:
"sqlplus "+ context.getDatabaseUser() + "/"
+ context.getDatabasePassword() + "#"
+ context.getDatabaseHost() + ":"
+ context.getDatabasePort() + "/"
+ context.getSid() + " #"
+ "\""
+ script + "\"";
String path=context.getReleasePath()+ "/Server/DB Scripts";
It executes the file but does not exit. Hence I tried using:
Writer out = new OutputStreamWriter(process.getOutputStream());
out.append("commit;\r\n");
out.append("exit \r\n");
System.out.println("---------"+out);
out.close();
This is the complete block that I am using:
if (context.getConnectionField() == "ORACLE") {
    String cmd =
            "sqlplus " + context.getDatabaseUser() + "/"
            + context.getDatabasePassword() + "#"
            + context.getDatabaseHost() + ":"
            + context.getDatabasePort() + "/"
            + context.getSid() + " #"
            + "\""
            + script + "\"";
    String path = context.getReleasePath() + "/Server/DB Scripts";
    process = Runtime.getRuntime().exec(cmd, null, new File(path));
    out = new OutputStreamWriter(process.getOutputStream());
    out.append("commit;\r\n");
    out.append("exit \r\n");
    System.out.println("---------" + out);
    out.close();
    Integer result1 = null;
    while (result1 == null) {
        try {
            result1 = process.waitFor();
        } catch (InterruptedException e) {}
    }
    if (process.exitValue() != 0)
        return false;
    return true;
}
The code shown fails to read the error stream of the Process. That might be blocking progress. ProcessBuilder was introduced in Java 1.5 and has a handy method, redirectErrorStream(), so that it is only necessary to consume a single stream.
For more general tips, read & implement all the recommendations of When Runtime.exec() won't.
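A minimal sketch of that approach; the sqlplus arguments and the working directory here are placeholders, not taken from the question:

import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class RunSqlPlus {
    public static void main(String[] args) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(
                "sqlplus", "user/password@host:1521/sid", "@script.sql"); // placeholder arguments
        pb.directory(new File("/path/to/Server/DB Scripts"));             // placeholder path
        pb.redirectErrorStream(true); // merge stderr into stdout: one stream to consume

        Process process = pb.start();

        // Send the trailing commands, as in the question.
        try (Writer out = new OutputStreamWriter(process.getOutputStream())) {
            out.append("commit;\r\n");
            out.append("exit\r\n");
        }

        // Consume the merged output so the child process can never block on a full pipe.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            for (String line = reader.readLine(); line != null; line = reader.readLine()) {
                System.out.println(line);
            }
        }
        System.out.println("exit code: " + process.waitFor());
    }
}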
I can see a few issues here. The version of exec that you are using will tokenize the command string using StringTokenizer, so unusual characters in the password (like spaces) or in the other parameters being substituted are accidents waiting to happen. I recommend switching to this version:
Process exec(String[] cmdarray,
String[] envp,
File dir)
throws IOException
It is a bit more work to use but much more robust.
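For illustration, a hedged sketch of what the array form might look like with the question's own context values; the argument layout (including the '#' prefixes) is copied verbatim from the question's command string, not verified against sqlplus:

// Each argument is its own element, so spaces in the password or in paths
// no longer break the tokenization and no manual quoting is needed.
String[] cmdArray = {
        "sqlplus",
        context.getDatabaseUser() + "/" + context.getDatabasePassword()
                + "#" + context.getDatabaseHost() + ":" + context.getDatabasePort()
                + "/" + context.getSid(),
        "#" + script
};
String path = context.getReleasePath() + "/Server/DB Scripts";
process = Runtime.getRuntime().exec(cmdArray, null, new File(path));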
The second issue is that there are all kinds of caveats about whether or not exec will run concurrently with the Java process (see http://download.oracle.com/javase/1.4.2/docs/api/java/lang/Process.html). So you need to say which operating system you're on. If it does not run concurrently, then your strategy of writing to the output stream cannot work!
The last bit of the program is written rather obscurely. I suggest ...
for (;;) {
    try {
        process.waitFor();
        return process.exitValue() == 0;
    } catch (InterruptedException e) {
        System.out.println("INTERRUPTED!"); // Debug only.
    }
}
This eliminates the superfluous variable result1, eliminates the superfluous boxing and highlights a possible cause of endless looping.
Hope this helps & good luck!