I know this question has been widely discussed in different posts:
java get file size efficiently
How to get file size in Java
Get size of folder or file
https://www.mkyong.com/java/how-to-get-file-size-in-java/
My problem is that I need to obtain the sizes of a large number of files (regular files on a hard drive), and for this I need the solution with the best performance. My intuition is that it should be a method that reads the file system metadata directly, rather than obtaining the size by reading the whole file contents. It is hard to tell from the documentation which approach a given method actually uses.
As stated on this page:
Files has the size() method to determine the size of the file. This is
the most recent API and it is recommended for new Java applications.
But this is apparently not the best advice in terms of performance. I have made measurements of different methods:
file.length();
Files.size(path);
BasicFileAttributes attr = Files.readAttributes(path, BasicFileAttributes.class);
attr.size();
To my surprise, file.length(); is the fastest, even though it requires creating a File object instead of using the newer Path. I do not know whether it reads the file system metadata or the file contents. So my question is:
What is the fastest, recommended way to obtain file sizes in recent Java versions (9/10/11)?
EDIT
I do not think these details add anything to the question. Basically the benchmark output reads like this:
Length: 49852 with previous instantiation: 84676
Files: 3451537 with previous instantiation: 5722015
Length: 48019 with previous instantiation: 79910
Length: 47653 with previous instantiation: 86875
Files: 83576 with previous instantiation: 125730
BasicFileAttr: 333571 with previous instantiation: 366928
.....
Length is quite consistent. Files is noticeably slow on the first call, but it must cache something, since later calls are faster (though still slower than Length). This is what other people observed in some of the links I referenced above. BasicFileAttr was my hope, but it is still slow.
I am asking what is recommended in modern Java versions, and I considered 9/10/11 to be "modern". It is not a dependency, nor a limitation, but I suppose Java 11 is supposed to provide better means of getting file sizes than Java 5. If Java 8 already introduced the fastest way, that is fine.
It is not a premature optimisation: at the moment I am optimising a CRC check with an initial size check, because the size check should be much faster and, in theory, does not need to read the file contents. So I can directly use the "old" length() method; all I am asking is what the new advances in this respect are in modern Java, since the new methods are apparently not as fast as the old ones.
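For illustration, here is a stripped-down sketch of the kind of timing comparison I mean (not the exact benchmark; the class name is made up, and a proper measurement should use JMH with warm-up rather than raw System.nanoTime()):
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;

public class FileSizeTiming {
    public static void main(String[] args) throws IOException {
        Path path = Paths.get(args[0]);   // file to measure
        File file = path.toFile();

        long t0 = System.nanoTime();
        long a = file.length();                                   // old java.io API
        long t1 = System.nanoTime();
        long b = Files.size(path);                                // java.nio.file API
        long t2 = System.nanoTime();
        long c = Files.readAttributes(path, BasicFileAttributes.class).size();
        long t3 = System.nanoTime();

        System.out.printf("length=%d (%d ns), Files.size=%d (%d ns), readAttributes=%d (%d ns)%n",
                a, t1 - t0, b, t2 - t1, c, t3 - t2);
    }
}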
(This is a hypothetical question since it's very broad, and workarounds exist for specific cases.)
Is it possible to atomically write a byte[] to a file (as with FileOutputStream or FileWriter)?
If writing fails, then it's unacceptable that part of the array is written. For example, if the array is 1,000,000 bytes and the disk is full after 500,000 bytes, then no bytes should be written to the file, or the changes should somehow be rolled back. This should even be the case if a medium is physically disconnected mid-write.
Assume that the maximum size of the array is known.
Atomic writes to files are not possible. Operating systems don't support it, and since they don't, programming language libraries can't do it either.
The best you are going to get with files in a conventional file system is atomic file renaming; i.e.
write the new file into the same file system as the old one
use FileDescriptor.sync() to ensure that the new file is written
rename the new file over the old one; e.g. using
java.nio.file.Files.move(Path source, Path target, CopyOption... options)
with the copy option StandardCopyOption.ATOMIC_MOVE (sketched below). According to the javadocs, this may not be supported, but if it isn't supported you should get an exception.
But note that the atomicity is implemented in the OS, and if the OS cannot give strong enough guarantees, you are out of luck.
(One issue is what might happen in the event of a hard disk error. If the disk dies completely, then atomicity is moot. But if the OS is still able to read data from the disk after the failure, then the outcome may depend on the OS's ability to repair a possibly inconsistent file system.)
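A minimal sketch of that write-then-rename sequence (method and file names are placeholders, and error handling is deliberately thin):
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public static void atomicWrite(Path target, byte[] data) throws IOException {
    // 1. write the new content to a temporary file on the same file system
    Path tmp = Files.createTempFile(target.toAbsolutePath().getParent(), "tmp-", ".part");
    try (FileOutputStream out = new FileOutputStream(tmp.toFile())) {
        out.write(data);
        out.getFD().sync();   // FileDescriptor.sync(): force the data onto the device
    }
    // 2. atomically rename the temporary file over the old one
    Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
}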
I am using FileUtils to compare two identical pdfs. This is the code:
boolean comparison = FileUtils.contentEquals(pdfFile1, pdfFile2);
Despite the fact that both pdf files are identical, I keep getting false. I also noticed that when I execute:
byte[] byteArray = FileUtils.readFileToByteArray(pdfFile1);
byte[] byteArrayTwo = FileUtils.readFileToByteArray(pdfFile2);
System.out.println(byteArray);
System.out.println(byteArrayTwo);
I get the following bytecode for the two pdf files:
[B@3a56f631
[B@233d28e3
So even though both pdf files are absolutely identical visually, their "byte code" is different and hence the boolean test fails. Is there any way to test whether two such pdf files are identical?
Unfortunately for PDF there is a big difference between having "identical files" and having files that are "visually identical". So the first question is what you are looking for.
One very simple example: information in a PDF file can be compressed or not, and can be compressed with different compression filters. Taking a file where some of the content is not compressed, and compressing that content with a ZIP compression filter, for example, would give you two files that are very different on a byte level, yet very much the same visually.
So you can do a number of different things to compare PDF files:
1) If you want to check whether you have "the same file", read them in and calculate some sort of checksum as answered before by Peter Petrov.
2) If you want to know whether or not files are visually identical, the most common method is some kind of rendering. Render all pages to images and compare the images. In practice this is not as simple as it sounds, and there are both simple (for example callas pdfToolbox) and complex (for example Global Vision DigitalPage) applications that implement some kind of "sameness" algorithm (caution, I'm related to both of those vendors). A rough sketch of the rendering idea follows below.
So define very well what exactly you need first, then choose carefully which approach would work best.
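As a rough illustration of the rendering approach in code, here is a sketch using Apache PDFBox (an assumption on my part, simply one freely available renderer; a real "sameness" check would also want a tolerance instead of exact pixel equality):
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public static boolean visuallyIdentical(File pdf1, File pdf2) throws IOException {
    try (PDDocument d1 = PDDocument.load(pdf1);
         PDDocument d2 = PDDocument.load(pdf2)) {
        if (d1.getNumberOfPages() != d2.getNumberOfPages()) {
            return false;
        }
        PDFRenderer r1 = new PDFRenderer(d1);
        PDFRenderer r2 = new PDFRenderer(d2);
        for (int page = 0; page < d1.getNumberOfPages(); page++) {
            BufferedImage img1 = r1.renderImageWithDPI(page, 72);
            BufferedImage img2 = r2.renderImageWithDPI(page, 72);
            if (img1.getWidth() != img2.getWidth() || img1.getHeight() != img2.getHeight()) {
                return false;
            }
            for (int y = 0; y < img1.getHeight(); y++) {
                for (int x = 0; x < img1.getWidth(); x++) {
                    if (img1.getRGB(x, y) != img2.getRGB(x, y)) {
                        return false;   // strict pixel-by-pixel comparison
                    }
                }
            }
        }
    }
    return true;
}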
Yes, generate an md5 sum from both files and see if these sums are identical. If they are, then your files are identical too, with a certainty that is practically 100%. If the sums are not identical, then your files are different for sure.
To generate the md5 sums: on Linux there's an md5sum command, and for Windows there's a small tool called fciv.
http://www.microsoft.com/en-us/download/details.aspx?id=11533
Just to note, the two identifiers you wrote
[B@3a56f631
[B@233d28e3
are different because they belong to two different objects. They are object identifiers (the default toString output of a byte array), not bytecode. Two objects can be logically equal even if they are not the exact same object (i.e. they have different object identities).
Otherwise, calculating an MD5 checksum as peter.petrov wrote is a good idea.
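If you'd rather do the checksum comparison inside Java than shell out to md5sum/fciv, here is a minimal sketch with java.security.MessageDigest (it reads whole files into memory, so for very large files a streaming DigestInputStream would be preferable):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public static boolean sameMd5(String file1, String file2)
        throws IOException, NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("MD5");
    byte[] hash1 = md.digest(Files.readAllBytes(Paths.get(file1)));
    md.reset();
    byte[] hash2 = md.digest(Files.readAllBytes(Paths.get(file2)));
    return Arrays.equals(hash1, hash2);   // equal digests => files are (almost certainly) identical
}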
We have a file I/O bottleneck. We have a directory which contains lots of JPEG files, and we want to read them in in real time as a movie. Obviously this is not an ideal format, but this is a prototype object tracking system and there is no possibility to change the format as they are used elsewhere in the code.
From each file we build a frame object, which basically means having a BufferedImage and an explicit ByteBuffer containing all of the information from the image.
What is the best strategy for this? The data is on an SSD, which in theory has read/write rates around 400 MB/s, but in practice is reading no more than 20 files per second (3-4 MB/s) using the naive implementation:
bufferedImg = ImageIO.read(imageFile);                                               // [1]
byte[] data = ((DataBufferByte) bufferedImg.getRaster().getDataBuffer()).getData();  // [2]
imgBuf = ByteBuffer.wrap(data);
However, Java offers lots of possibilities for improving this:
(1) Channels, especially FileChannels.
(2) Gathering/Scattering.
(3) Direct Buffering
(4) Memory Mapped Buffers
(5) Multithreading - use a bunch of Callables to access many files simultaneously (a sketch of this appears further down).
(6) Wrapping the files in a single large file.
(7) Other things I haven't thought of yet.
I would just like to know if anyone has extensively tested the different options, and knows what is optimal? I assume that (3) is a must, but I would still like to optimise the reading of a single file as far as possible, and am unsure of the best strategy.
Bonus question: in the code snippet above, when does the JVM actually 'hit the disk' and read in the contents of the file? Is it at [1], or is that just a file handle which 'points' to the file? It would make sense to evaluate lazily, but I don't know how the ImageIO class is implemented.
ImageIO.read(imageFile)
Since it returns a BufferedImage, I assume it hits the disk and reads the file contents, rather than just returning a file handle.
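On option (5) from the question's list, here is a rough sketch of decoding many JPEGs in parallel with an ExecutorService (the thread count and method shape are assumptions; whether this helps depends on whether the bottleneck is JPEG decoding or raw disk I/O):
import java.awt.image.BufferedImage;
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.imageio.ImageIO;

public static List<BufferedImage> loadFrames(List<File> imageFiles) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors());
    try {
        List<Callable<BufferedImage>> tasks = new ArrayList<>();
        for (File f : imageFiles) {
            tasks.add(() -> ImageIO.read(f));          // decode one JPEG per task
        }
        List<BufferedImage> frames = new ArrayList<>();
        for (Future<BufferedImage> done : pool.invokeAll(tasks)) {
            frames.add(done.get());                    // results come back in submission order
        }
        return frames;
    } finally {
        pool.shutdown();
    }
}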
When I list the files of a directory that has 300,000 files with Java, an out-of-memory error occurs.
String[] fileNames = file.list();
What I want is a way to list all files of a directory incrementally, no matter how many files are in that specific directory, without hitting the "out of memory" problem under the default 64M heap limit.
I have Googled for a while and cannot find such a way in pure Java.
Please help me!!
Note, JNI is a possible solution, but I hate JNI.
I know you said "with the default 64M heap limit", but let's look at the facts - you want to hold a (potentially) large number of items in memory, using the mechanisms made available to you by Java. So, unless there is some dire reason that you can't, I would say increasing the heap is the way to go.
Here is a link to the same discussion at JavaRanch: http://www.coderanch.com/t/381939/Java-General/java/iterate-over-files-directory
Edit, in response to comment: the reason I said he wants to hold a large number of items in memory is because this is the only mechanism Java provides for listing a directory without using the native interface or platform-specific mechanisms (and the OP said he wanted "pure Java").
The only possible solution for you is Java 7, and then you should use an iterator.
final Path p = FileSystems.getDefault().getPath("Yourpath");
try (Stream<Path> entries = Files.walk(p)) {   // close the stream to release directory handles
    entries.forEach(filePath -> {
        if (Files.isRegularFile(filePath)) {
            // Do something with filePath
        }
    });
} catch (IOException e) {
    e.printStackTrace();
}
You are a bit out of luck here. At the very least, 300k strings will need to be created. With an average length of 8-10 chars and 2 bytes per char, that's 6 MB at a minimum. Add object pointer overhead per string (8 bytes) and you run into your memory limit.
If you absolutely must have that many files in a single dir, which I would not recommend as your file system will have problems, your best bet is to run a native process (not JNI) via Runtime.exec. Keep in mind that you will tie yourself to the OS (ls vs dir). You will get the list of files as one large string and will be responsible for post-processing it into what you want.
Hope this helps.
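A rough sketch of that native-process idea (Unix-only, since it shells out to ls; the method name is just illustrative). Consuming the output line by line means the full listing never has to sit in memory at once:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public static void listHugeDirectory(String dir) throws IOException, InterruptedException {
    Process ls = new ProcessBuilder("ls", "-1", dir).start();   // one file name per line
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(ls.getInputStream()))) {
        String name;
        while ((name = reader.readLine()) != null) {
            // process one file name at a time
            System.out.println(name);
        }
    }
    ls.waitFor();
}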
Having 300,000 files in a dir is not a good idea - AFAIK file systems are not good at handling that many entries in a single directory. Interesting question, though.
EDIT: THE FOLLOWING DOES NOT HELP, see comments.
I think you could use a FileFilter, reject all files, and process them in the filter.
new File("c:/").listFiles( new FileFilter() {
#Override public boolean accept(File pathname) {
processFile();
return false;
}
});
If you can write your code in Java 7 or later, then the following is a good option.
Files.newDirectoryStream(Path dir)
Here is the java doc for the API.
Hope this helps.
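A minimal sketch of using it (the helper method is hypothetical; the DirectoryStream is lazy, so the 300,000 names never have to be held in memory at once):
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public static void listIncrementally(String dir) throws IOException {
    Path directory = Paths.get(dir);
    // entries are fetched as you iterate, not all up front
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(directory)) {
        for (Path entry : stream) {
            System.out.println(entry.getFileName());
        }
    }
}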
I'm writing arbitrary byte arrays (mock virus signatures of 32 bytes) into arbitrary files, and I need code to overwrite a specific file given an offset into the file. My specific question is: is there source code/libraries that I can use to perform this particular task?
I've had this problem with Python file manipulation as well. I'm looking for a set of functions that can kill a line, cut/copy/paste, etc. My assumptions are that these are extremely common tasks, and I couldn't find it in the Java API nor my google searches.
Sorry for not RTFM well; I haven't come across any information, and I've been looking for a while now.
Maybe you are looking for something like the RandomAccessFile class in the standard Java JDK. It supports reads and writes at some offset, as well as byte arrays.
Java's RandomAccessFile is exactly what you want.
It includes methods like seek(long) that allow you to move wherever you need in the file. It also allows for reading and writing at the same time.
As far as I know, Java primarily has lower-level functions for manipulating files directly. Here is the best I've come up with:
The actions you describe are standard in the Swing world, and for text they come down to manipulating a Document object. These act on data in memory. The class java.nio.channels.FileChannel has similar methods that act directly on a file. Neither finds the ends of lines automatically, but other classes in java.io and java.nio do.
Apache Commons has a sandbox library called Flatfile which looks like it does what you want. The problem is that no code has been released yet. You may, however, want to talk to people working on it to get some more ideas. I didn't do a general check on libraries.
Have you looked into File/FileReader/FileWriter/BufferedReader? You can get the contents of the files and manipulate them as you like; you can search the data in the files, overwrite files, create new ones, append to existing ones....
I am not sure this is exactly what you are asking for but I use these APIs all the time for logging, RTF editors, text file creation for email, and many other things.
As far as cut/copy/paste goes, I have not come across the ability to do that directly; however, you can output the contents of the file, "copy" the part of it you want, and "paste" it into a new file or append it to an existing one.
While writing a byte array to a file is a common task, writing a given 32-byte array to a given file at a given offset, just once, is not something you are going to find ready-made in java.io :)
To get started, would the method skeleton and comments below look reasonable to you? I bet someone here, maybe even myself, could whip it up quickly.
public static void writeFauxVirusSignature(File file, byte[] bytes, long offset) {
    // open file
    // move to offset
    // write bytes
    // close file
}
Questions:
How big could the potential target files be?
Do you need performance?
I ask because clean, easy-to-read code would use the Apache Commons libraries, but large file writes in a performance-sensitive environment will necessitate using the java.nio libraries.
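For what it's worth, a minimal filled-in version of that skeleton using RandomAccessFile, as suggested in the other answers (assuming the target file already exists and the offset lies within it):
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public static void writeFauxVirusSignature(File file, byte[] bytes, long offset) throws IOException {
    // open file for reading and writing
    try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
        // move to offset
        raf.seek(offset);
        // write bytes, overwriting whatever is at that position
        raf.write(bytes);
    }   // close file happens automatically via try-with-resources
}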