I am trying to read lines from a file which may be large.
To improve performance, I tried using a memory-mapped file. But when I compare the two approaches, I find that the mapped-file version is actually a little slower than reading with a BufferedReader.
public long chunkMappedFile(String filePath, int trunkSize) throws IOException {
    long begin = System.currentTimeMillis();
    logger.info("Processing imei file, mapped file [{}], trunk size = {} ", filePath, trunkSize);
    // Create file object
    File file = new File(filePath);
    // Get file channel in read-only mode
    FileChannel fileChannel = new RandomAccessFile(file, "r").getChannel();
    long positionStart = 0;
    StringBuilder line = new StringBuilder();
    long lineCnt = 0;
    while (positionStart < fileChannel.size()) {
        long mapSize = positionStart + trunkSize < fileChannel.size() ? trunkSize : fileChannel.size() - positionStart;
        MappedByteBuffer buffer = fileChannel.map(FileChannel.MapMode.READ_ONLY, positionStart, mapSize); // mapped read
        for (int i = 0; i < buffer.limit(); i++) {
            char c = (char) buffer.get();
            //System.out.print(c); // Print the content of the file
            if ('\n' != c) {
                line.append(c);
            } else { // line ends
                processor.processLine(line.toString());
                if (++lineCnt % 100000 == 0) {
                    try {
                        logger.info("mappedfile processed {} lines already, sleep 1ms", lineCnt);
                        Thread.sleep(1);
                    } catch (InterruptedException e) {}
                }
                line = new StringBuilder();
            }
        }
        closeDirectBuffer(buffer);
        positionStart = positionStart + buffer.limit();
    }
    long end = System.currentTimeMillis();
    logger.info("chunkMappedFile {} , trunkSize: {}, cost : {} ", filePath, trunkSize, end - begin);
    return lineCnt;
}
public long normalFileRead(String filePath) throws IOException {
    long begin = System.currentTimeMillis();
    logger.info("Processing imei file, Normal read file [{}] ", filePath);
    long lineCnt = 0;
    try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
        String line;
        while ((line = br.readLine()) != null) {
            processor.processLine(line);
            if (++lineCnt % 100000 == 0) {
                try {
                    logger.info("file processed {} lines already, sleep 1ms", lineCnt);
                    Thread.sleep(1);
                } catch (InterruptedException e) {}
            }
        }
    }
    long end = System.currentTimeMillis();
    logger.info("normalFileRead {} , cost : {} ", filePath, end - begin);
    return lineCnt;
}
Test result on Linux, reading a file whose size is 537 MB:
MappedBuffer way:
2017-09-28 14:33:19.277 [main] INFO com.oppo.push.ts.dispatcher.imei2device.ImeiTransformerOfflineImpl - process imei file ends:/push/file/imei2device-local/20170928/imei2device-13 , lines :12758858 , cost :14804 , lines per seconds: 861852.0670089165
BufferedReader way:
2017-09-28 14:27:03.374 [main] INFO com.oppo.push.ts.dispatcher.imei2device.ImeiTransformerOfflineImpl - process imei file ends:/push/file/imei2device-local/20170928/imei2device-13 , lines :12758858 , cost :13001 , lines per seconds: 981375.1249903854
That is the thing: file I/O isn't straightforward and easy.
You have to keep in mind that your operating system has a huge impact on what exactly is going to happen. In that sense: there are no solid rules that would work for all JVM implementations on all platforms.
When you really have to worry about the last bit of performance, doing in-depth profiling on your target platform is the primary solution.
Beyond that, you are getting the "performance" aspect wrong. Memory-mapped I/O doesn't magically increase the performance of reading a single file once within an application. Its major advantages lie along a different path:
mmap is great if you have multiple processes accessing data in a read only fashion from the same file, which is common in the kind of server systems I write. mmap allows all those processes to share the same physical memory pages, saving a lot of memory.
(quoted from this answer on using the C mmap() system call)
In other words: your example is about reading a file's contents. In the end, the OS still has to turn to the drive to read all the bytes from there: it reads disk content and puts it in memory. The first time you do that, it really doesn't matter that you do something "special" on top. On the contrary - as you do "special" things, the memory-mapped approach might even be slower, because of the overhead compared to an "ordinary" read.
And coming back to my first point: even if you had 5 processes reading the same file, the memory-mapped approach isn't necessarily faster. Linux might figure: I already read that file into memory, and it didn't change - so even without explicit "memory mapping" the Linux kernel might cache the information.
The memory mapping doesn't really give any advantage, since even though you're bulk loading a file into memory, you're still processing it one byte at a time. You might see a performance increase if you processed the buffer in suitably sized byte[] chunks. Even then the BufferedReader version may perform better or at least almost the same.
The nature of your task is to process a file sequentially. BufferedReader already does this very well and the code is simple, so if I had to choose I'd go with the simplest option.
Also note that your buffer code doesn't work except for single-byte encodings. As soon as you get multiple bytes per character, it will fail magnificently.
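If you want to try the chunked processing suggested above, here is a minimal sketch (not taken from the question) that drains a MappedByteBuffer through a reusable byte[] block instead of calling get() once per byte. The 8 KB block size is arbitrary, and the processChunk callback is hypothetical:
import java.nio.MappedByteBuffer;

// Sketch: drain a MappedByteBuffer in byte[] blocks instead of one get() per byte.
// The 8 KB block size and the processChunk(byte[], int) callback are hypothetical.
static void drainInChunks(MappedByteBuffer buffer) {
    byte[] block = new byte[8 * 1024];   // reusable transfer block
    while (buffer.hasRemaining()) {
        int n = Math.min(block.length, buffer.remaining());
        buffer.get(block, 0, n);         // one bulk copy instead of n single gets
        processChunk(block, n);          // e.g. scan the block for '\n'
    }
}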
GhostCat is correct. And in addition to your OS choice, there are other things that can affect performance:
Mapping a file will place greater demand on physical memory. If physical memory is "tight" that could cause paging activity, and a performance hit.
The OS could use a different read-ahead strategy if you read a file using read syscalls versus mapping it into memory. Read-ahead (into the buffer cache) can make file reading a lot faster.
The default buffer size for BufferedReader and the OS memory page size are likely to be different. This may result in the size of disk read requests being different. (Larger reads often give greater I/O throughput, at least up to a point.) See the sketch below.
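To illustrate that last point: BufferedReader accepts an explicit buffer size in its constructor. This is a minimal sketch; the 1 MB figure and the file path are placeholders, not recommendations:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Sketch: widen BufferedReader's buffer (the default is 8192 chars) so that
// larger read requests reach the OS. The 1 MB size and the path are placeholders.
static long countLines(String path) throws IOException {
    long lines = 0;
    try (BufferedReader br = new BufferedReader(new FileReader(path), 1024 * 1024)) {
        while (br.readLine() != null) {
            lines++;
        }
    }
    return lines;
}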
There could also be "artefacts" caused by the way that you benchmark. For example:
The first time you read a file, a copy of some or all of the file will land in the buffer cache (in memory)
The second time you read the same file, parts of it may still be in memory, and the apparent read time will be shorter.
Related
I need help with running parallel operations. The goal of the code is to extract a large number of small files from the same tar into different folders in a very short time.
This is the code:
public void decompress(File archive, File destination) throws RuntimeException {
    try (InputStream in = new FileInputStream(archive);
         BufferedInputStream buff = new BufferedInputStream(in);
         TarArchiveInputStream is = (TarArchiveInputStream) new ArchiveStreamFactory().createArchiveInputStream("tar", buff)
    ) {
        TarArchiveEntry entry;
        while ((entry = is.getNextTarEntry()) != null) {
            File file = new File(destination, entry.getName());
            file.getParentFile().mkdirs();
            Files.write(file.toPath(), is.readAllBytes());
        }
    } catch (IOException | ArchiveException e) {
        e.printStackTrace();
    }
}
When I execute this operation once, it takes ~900 ms.
But when I do something like the following to execute the same operation multiple times in parallel, it takes 20000 ms:
ExecutorService EXECUTOR_SERVICE = Executors.newFixedThreadPool(20);
File archive = ...;
for (int i = 0; i < 5; i++) {
    File directory = new File("Dir_" + i);
    EXECUTOR_SERVICE.submit(() -> decompress(archive, directory));
}
or
File archive = ...;
for (int i = 0; i < 5; i++) {
    File directory = new File("Dir_" + i);
    new Thread(() -> decompress(archive, directory)).start();
}
One suspicion is that the directories contain many files, hence File.mkdirs does needlessly many checks.
The constructor of BufferedInputStream can take a custom buffer size. That has never helped much in my experience, but it might with your disk. It could also help with parallelism, by reducing disk head movement.
You probably already tried Files.copy, but still, it might have better memory behavior than readAllBytes.
So the version becomes (eschewing File in favor of Path):
public void decompress(File archive, File destination) throws RuntimeException {
    final int bufferSize = 1024 * 128;
    Path destinationPath = destination.toPath();
    try (InputStream in = new FileInputStream(archive);
         BufferedInputStream buff = new BufferedInputStream(in, bufferSize);
         TarArchiveInputStream is = (TarArchiveInputStream)
                 new ArchiveStreamFactory().createArchiveInputStream("tar", buff)
    ) {
        Path oldFileParent = destinationPath;
        Files.createDirectories(oldFileParent);
        TarArchiveEntry entry;
        while ((entry = is.getNextTarEntry()) != null) {
            Path file = destinationPath.resolve(entry.getName());
            Path fileParent = file.getParent();
            if (!fileParent.equals(oldFileParent)) {
                oldFileParent = fileParent;
                Files.createDirectories(oldFileParent); // only hit the file system when the parent changes
            }
            Files.copy(is, file);
            //Files.write(file, is.readAllBytes());
        }
    } catch (IOException | ArchiveException e) {
        e.printStackTrace();
    }
}
Throwing a RuntimeException, and catching the IOException/ArchiveException without rethrowing it (as new IllegalStateException(e)), is a matter of taste.
Now to adding parallelism: disk output is probably the bottleneck. Writing two files to the same disk in parallel means seeking back and forth on the disk. With small files it might just work.
A better approach seems to be to read the next file in one thread and write it in another (see the sketch below).
Two threads might theoretically perform better than many threads with heightened disk traffic. readAllBytes might then be appropriate, so the writing thread does not use is.
As the tar entry may also carry the file size, that would allow checking whether readAllBytes is efficient enough for large files.
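A minimal sketch of that reader/writer split, assuming Java 16+ (for the record) and entry payloads that fit in memory; the queue capacity, the Pending record, and the poison-pill object are illustrative choices, not part of the original code:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

// Sketch: one thread reads tar entries into memory, another writes them to disk.
record Pending(Path target, byte[] data) {}   // illustrative carrier type

void decompressPipelined(TarArchiveInputStream is, Path destination) throws Exception {
    BlockingQueue<Pending> queue = new ArrayBlockingQueue<>(16); // capacity is arbitrary
    Pending poison = new Pending(null, null);                    // stops the writer

    Thread writer = new Thread(() -> {
        try {
            for (Pending p; (p = queue.take()) != poison; ) {
                Files.createDirectories(p.target().getParent());
                Files.write(p.target(), p.data());
            }
        } catch (InterruptedException | IOException e) {
            e.printStackTrace();
        }
    });
    writer.start();

    TarArchiveEntry entry;
    while ((entry = is.getNextTarEntry()) != null) {
        if (!entry.isDirectory()) {
            queue.put(new Pending(destination.resolve(entry.getName()), is.readAllBytes()));
        }
    }
    queue.put(poison);
    writer.join();
}
Only the reading thread touches the tar stream, so the (non-thread-safe) TarArchiveInputStream needs no synchronization.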
Logging was mentioned in this question. It is known that logging can consume a lot of time, and with parallelism it becomes even more critical. But you seem to be aware of that: you wrote that you built your own logger. For a library, System.Logger is actually best. It is a façade that uses whatever logger the application provides. This would have prevented the logger vulnerability hidden in library dependencies of the past year.
Ignoring the fact that you are not decompressing the file in parallel here (you are running multiple threads decompressing the same file concurrently, essentially overwriting the result), there may be several reasons for this performance hit. I/O is one, so it depends on the underlying implementation. Also, what Logger are you using there? While other parts of your code don't seem to be shared among multiple threads, the static call to the Logger is something that is shared.
Also note: java.nio uses FileChannels which provide synchronous I/O, so depending on how you create the channels, you may get into similar situations (though I don't believe this applies here).
I'm trying to make a downloader so I can automatically update my program. The following is my code so far:
public void applyUpdate(final CharSequence ver)
{
    java.io.InputStream is;
    java.io.BufferedWriter bw;
    try
    {
        alertOf("Updating to version " + ver); // alert the user that an update is starting
        updateDialogProgressBar.setIndeterminate(true); // This is a javax.swing.JProgressBar which is configured beforehand, and is displayed to the user in an update dialog
        is = latestUpdURL.openStream(); // This is a java.net.URL which is configured beforehand, and contains the path to a replacement JAR file
        bw = new java.io.BufferedWriter(new java.io.FileWriter(new java.io.File(System.getProperty("user.dir") + java.io.File.separatorChar + TITLE + ver + ".jar"))); // Writes to a file adjacent to the JAR being run, named after the application title plus the update version
        updateDialogProgressBar.setValue(0);
        //updateDialogProgressBar.setMaximum(totalSize); // This is where I would input the total number of bytes in the target file
        updateDialogProgressBar.setIndeterminate(false);
        for (int i, prog = 0; (i = is.read()) != -1; prog++)
        {
            bw.write(i);
            updateDialogProgressBar.setValue(prog);
        }
        bw.close();
        is.close();
    }
    catch (Throwable t)
    {
        // Alert the user of a problem
    }
}
As you can see, I'm just trying to make a downloader with a progress bar, but I don't know how to tell the total size of the target file. How can I tell how many bytes are going to be downloaded before the file is done downloading?
A stream is a flow of bytes; you can't ask it how many bytes remain, you just read from it until it says "I'm done". Now, depending on how the connection that provides the stream is established, the underlying protocol (HTTP, for example) may know the total length to be sent in advance... or it may not. For this, see URLConnection.getContentLength(). But it might well return -1 (= "I don't know").
BTW, your code is not the proper way to read a stream of bytes and write it to a file. For one thing, you are using a Writer when you should use an OutputStream (you are converting from bytes to characters, and then back to bytes - this hinders performance and might corrupt everything if the received content is binary, or if the encodings don't match). Secondly, it's inefficient to read and write one byte at a time. See the sketch below.
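A minimal sketch of what that advice looks like in code. The latestUpdURL and updateDialogProgressBar names stand in for the asker's own fields, targetJar is a hypothetical destination File, and the 8 KB buffer size is arbitrary:
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URLConnection;

// Sketch: byte-oriented download with a content-length-driven progress bar.
URLConnection conn = latestUpdURL.openConnection();
int totalSize = conn.getContentLength();            // may be -1 if the server doesn't say
updateDialogProgressBar.setIndeterminate(totalSize < 0);
if (totalSize >= 0) updateDialogProgressBar.setMaximum(totalSize);

try (InputStream in = conn.getInputStream();
     OutputStream out = new FileOutputStream(targetJar)) { // targetJar: hypothetical destination File
    byte[] buffer = new byte[8192];                  // arbitrary buffer size
    int read, downloaded = 0;
    while ((read = in.read(buffer)) != -1) {
        out.write(buffer, 0, read);                  // raw bytes, no char conversion
        downloaded += read;
        if (totalSize >= 0) updateDialogProgressBar.setValue(downloaded);
    }
}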
To get the length of a local file, you do this:
new File(System.getProperty("user.dir") + java.io.File.separatorChar + TITLE + ver + ".jar").length()
I have a wiki.txt file and its size is 50 MB.
I need to do several things with the file, and so I thought the best approach in terms of performance would be to load the file into memory. Is that correct?
This is the code that I wrote:
File file = new File("wiki.txt");
FileInputStream fileInputStream = new FileInputStream(file);
FileChannel fileChannel = fileInputStream.getChannel();
MappedByteBuffer mapByteBuffer = fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, file.length());
System.out.println((char)mapByteBuffer.get());
I get an error on this code at mapByteBuffer.get().
I tried the get() function a few different ways, but I get an error every time, and I didn't even get a message from e.getMessage() - I just got null.
Another important thing to note: my text file contains English words, and the action I need to perform is a search, i.e. checking whether an expression exists in this text file.
Thank you.
I would suggest using a memory-mapped file, to read the file directly from the disk instead of loading it into memory.
RandomAccessFile file = new RandomAccessFile("wiki.txt", "r");
FileChannel channel = file.getChannel();
MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size()); // READ_ONLY to match the "r" mode; map the whole file
And then you can read the buffer as usual.
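For the search use case in the question, one hedged way to "read the buffer as usual" is to decode it into a String and search that. This assumes the file is single-byte ASCII text, as the question suggests, and "some expression" is a placeholder search term:
import java.nio.charset.StandardCharsets;

// Sketch: decode the mapped buffer (ASCII text assumed) and search it.
// Note that decode() consumes the buffer's remaining bytes.
String text = StandardCharsets.US_ASCII.decode(buf).toString();
boolean found = text.contains("some expression"); // placeholder term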
My answer for point (1):
It depends on what you want to do with the file. If your processing doesn't involve rewinding (looking at what was read before), it's best to just read it as a stream and process it in one go (instead of loading it all into memory).
Even if you need random access across the file, you may also be interested in doing block file operations, because your solution may not scale well when the file grows.
Use RandomAccessFile if you are on Java 1.4 or above.
For random access, the operating system usually handles file buffer caching quite well, so you don't have to handle it yourself.
It is important to read the whole error, not just the message. Often the real information is in the exception's name not the text associated with it.
You will get an error if the file is empty as there is no first byte.
Note: the approach you are using assumes ASCII 7-bit characters. If you want to assume ISO-8859-1 characters you can use (char) (byteBuffer.get() & 0xFF)
However, if you have plain text you may find that using Strings is simpler and not much slower. E.g. you can read a 50 MB file as text in less than a second. I would only use a memory-mapped file if that were far too long.
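A minimal sketch of that plain-String approach. The charset matches the ISO-8859-1 assumption above, and "some expression" is a placeholder for whatever the search is looking for:
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Sketch: slurp the whole 50 MB file as text, then search it line by line.
static boolean containsExpression() throws IOException {
    List<String> lines = Files.readAllLines(Paths.get("wiki.txt"), StandardCharsets.ISO_8859_1);
    return lines.stream().anyMatch(l -> l.contains("some expression")); // placeholder term
}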
I would suggest using a BufferedReader. It is much faster and requires relatively fewer resources.
First, count the number of lines:
InputStream is = new BufferedInputStream(new FileInputStream(filename));
byte[] chars = new byte[1024];
int numberOfChars = 0;
int count = 0;
while ((numberOfChars = is.read(chars)) != -1)
{
    for (int i = 0; i < numberOfChars; ++i)
    {
        if (chars[i] == '\n' && numberOfChars - i != 1)
        {
            ++count;
        }
    }
}
count++;
return count; // number of lines
Then read the lines:
BufferedReader in = new BufferedReader(new FileReader(fileName));
for (int i = 0; i < endLine; i++)
{
    String oneLine = in.readLine();
}
In these Strings you can then search for whatever you need.
I want to find out which of two methods I have come up with for concatenating my text files in Java is better. If someone has some insight they can share about what goes on at the kernel level that explains the difference between these ways of writing to a FileChannel, I would greatly appreciate it.
From what I understand from the documentation and other Stack Overflow conversations, allocateDirect allocates space right on the drive and mostly avoids using RAM. I have a concern that the ByteBuffer created with allocateDirect might overflow or fail to be allocated if the input File is large, say 1 GB. I am guaranteed at this point in the development of our software that the File will be no larger than 2 GB; but there is potential in the future that it might be as big as 10 or 20 GB.
I have observed that the transferFrom loop never goes through more than once... so it seems to succeed in writing the entire infile at once; but I haven't tested it with files bigger than 60 MB. I looped anyway, because the documentation specifies that there is no guarantee of how much will be transferred at once. With transferFrom only able to accept, on my system, an int32 as its count parameter, I won't be able to request more than 2 GB at a time... Again, kernel expertise would help me understand.
Thanks in advance for your help!!
Using a ByteBuffer:
boolean concatFiles(StringBuffer sb, File infile, File outfile) {
    FileChannel inChan = null, outChan = null;
    try {
        ByteBuffer buff = ByteBuffer.allocateDirect((int) (infile.length() + sb.length()));
        // write the StringBuffer so it goes in the output file first:
        buff.put(sb.toString().getBytes());
        // create the FileChannels:
        inChan = new RandomAccessFile(infile, "r").getChannel();
        outChan = new RandomAccessFile(outfile, "rw").getChannel();
        // read the infile into the buffer:
        inChan.read(buff);
        // prep the buffer:
        buff.flip();
        // write the buffer out to the file via the FileChannel:
        outChan.write(buff);
        inChan.close();
        outChan.close();
    } catch...etc
}
Using transferTo (or transferFrom):
boolean concatFiles(StringBuffer sb, File infile, File outfile) {
    FileChannel inChan = null, outChan = null;
    try {
        // write the StringBuffer so it goes in the output file first:
        PrintWriter fw = new PrintWriter(outfile);
        fw.write(sb.toString());
        fw.flush();
        fw.close();
        // create the channels appropriate for appending:
        outChan = new FileOutputStream(outfile, true).getChannel();
        inChan = new RandomAccessFile(infile, "r").getChannel();
        long startSize = outfile.length();
        long inFileSize = infile.length();
        long bytesWritten = 0;
        // set the position where we should start appending the data:
        outChan.position(startSize);
        long startByte = outChan.position();
        while (bytesWritten < inFileSize) {
            bytesWritten += outChan.transferFrom(inChan, startByte, inFileSize - bytesWritten);
            startByte = startSize + bytesWritten; // next position to write at in the output file
        }
        inChan.close();
        outChan.close();
    } catch ... etc
transferTo() can be far more efficient, as there is less data copying - or none, if it can all be done in the kernel. And if that isn't the case on your platform, it will still use highly tuned code.
You do need the loop; one day it will iterate, and your code will keep working.
I have a large file on Windows XP - it's 38 GB (a VM image).
I cannot seem to copy it.
Dragging on the desktop - gives error of "Insufficient system resources exist to complete the requested service"
Using Java - FileChannel.transferTo(0, fileSize, dest) fails for all files > 2GB
Using Java - FileChannel.transferTo() in chunks of 100Mb fails after ~18Gb
java.io.IOException: Insufficient system resources exist to complete the requested service
at sun.nio.ch.FileDispatcher.write0(Native Method)
at sun.nio.ch.FileDispatcher.write(FileDispatcher.java:44)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:72)
at sun.nio.ch.IOUtil.write(IOUtil.java:28)
at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:198)
at sun.nio.ch.FileChannelImpl.transferToTrustedChannel(FileChannelImpl.java:439)
at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:510)
I mean - the computer has 3 GB of RAM. A 100 MB buffer should be enough!?!?
Apparently the DOS commands "copy" and "xcopy" also fail.
(edit) I've tried COPY & XCOPY - these fail with the same error. XCOPY seems to take a really really long time about it too.
I've heard of Robocopy, but it doesn't copy single files?
I'm really feeling that Windows is for the lose right now. Surely Microsoft has heard of files larger than a few GB?
Thanks!
In Java, don't try to copy the whole file in a single operation. The transferTo() method works on chunks of a file; it wasn't intended as a high-level file-copy method. Invoke transferTo() in a loop, and assume that count bytes of data will be in RAM (i.e., lower that parameter until it fits comfortably in RAM).
FileChannel src = ...
FileChannel dst = ...
final long CHUNK = 16 * 1024 * 1024; /* 16 MB */
for (long pos = 0; pos < fileSize; ) {
    pos += src.transferTo(pos, CHUNK, dst);
}
The comment in the transferTo() JavaDoc about it being "more efficient than a simple loop" refers to the fact that channel-to-channel communication can be optimized more than channel-to-user-space-to-channel. It doesn't mean that all looping can be avoided.
I am a VMware ESX user; I have 30 production VMs, with the largest being 232 GB. I back up my VM instances onto an internal SATA drive and then copy these off once a week to an external eSATA. I use TeraCopy (free); it runs on average at 45 MB/s on an XP machine with 3 GB.
Hope that helps
Sailen
Well - I've not managed to find a way that works.
None of the packaged tools in Windows will copy the file. Drag and drop, COPY, XCOPY, Java - all fail to copy the file.
The reason I wanted to copy the file was for a backup before doing an OS upgrade.
In the end I booted into Knoppix and copied it.
Take a look at this hotfix; it's worth a try, as everything I have seen points to it as a cure for your issue.
EDIT: You can also try XCOPY /Z as pointed out here.
There may be a hardware issue as well..
I suspect you don't have much time, but you may try a dumber stream solution; don't set overly large buffers (8-16 MB should be enough):
public static void copy(InputStream input, OutputStream output) throws IOException {
    byte[] buffer = new byte[1024 * 1024 * 8]; // 8 MB
    int n = 0;
    while (-1 != (n = input.read(buffer))) {
        output.write(buffer, 0, n);
    }
}

public static void main(String args[]) {
    if (args.length != 2) {
        System.err.println("wrong argument count");
        System.exit(1);
    }
    FileInputStream in = null;
    FileOutputStream out = null;
    try {
        in = new FileInputStream(new File(args[0]));
        out = new FileOutputStream(new File(args[1]));
        copy(in, out);
    } catch (Exception e) {
        e.printStackTrace();
    }
    if (in != null) { try { in.close(); } catch (Exception e) {} }
    if (out != null) { try { out.close(); } catch (Exception e) {} }
}
Are you sure the filesystem is actually able to cope with such big files (FAT32 cannot, for example)? Take a look at this link for details: http://www.ntfs.com/ntfs_vs_fat.htm
Is the system 32-bit or 64-bit? On 32-bit you may have problems copying files larger than 2-4 GB.
Also, you said that rsync sucks for you. I've had a very nice experience with it, copying between 2 hard drives at near-native speed. I had lots of small files, though; you seem to have one big blob instead.
You may also try splitting the big blob into smaller blobs :) (a sketch follows)
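A hedged sketch of that splitting idea, reusing the chunked transferTo pattern from the answer above; the 1 GB part size and the ".partN" naming are arbitrary choices:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;

// Sketch: split a big file into 1 GB parts (part size and ".partN" naming are arbitrary).
static void split(String source) throws IOException {
    final long PART = 1024L * 1024 * 1024;
    try (FileChannel src = new FileInputStream(source).getChannel()) {
        long size = src.size();
        int part = 0;
        for (long pos = 0; pos < size; part++) {
            try (FileChannel dst = new FileOutputStream(source + ".part" + part).getChannel()) {
                long end = Math.min(pos + PART, size);
                while (pos < end) {
                    pos += src.transferTo(pos, end - pos, dst); // loop: no guarantee how much moves at once
                }
            }
        }
    }
}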
final long CHUNK = 16 * 1024 * 1024; /* 16 MB */
for (long pos = 0; pos < fileSize; ) {
    pos += src.transferTo(pos, CHUNK, dst);
}
This does work! Just make sure your src and dst are FileChannel objects (input and output, respectively).
Another possible answer is Files.copy (Java NIO.2), e.g.:
Path sourcePath = Paths.get("big-file.dat");
Path destinationPath = Paths.get("big-file-copy.dat");
try {
    Files.copy(sourcePath, destinationPath,
            StandardCopyOption.REPLACE_EXISTING);
} catch (IOException e) {
    // something else went wrong
    e.printStackTrace();
}