We have a Clojure application that takes a dataset (~3000 rows) and writes it to a local file using spit. It works great on the machine where it was written, but on every other machine that pulls down the git code, the write step is agonizingly slow. The process takes seconds on the original machine, but takes upwards of ten minutes on every other machine.
The two primary machines in question (the developer’s machine and mine) are both Manjaro Arch Linux systems with comparable specs and configurations. We are both pulling from the same Git source, and both pulling the same data.
We have confirmed that the code still runs on my machine, since it completes if I try to write only the first ten lines of the dataset (even that still takes almost a minute).
CPU and RAM are barely touched during the process on both machines, and the output file size is less than 1 MB.
We get the same problem if we use the java.io library with clojure.data.csv or dk.ative.docjure.spreadsheet instead of spit.
The abstracted datashape is:
[["Name" "Price"]
["Foo Widget" 100]
["Bar Widget" 200]]
(but of course with more than 3000 rows)
Any help is appreciated!
Alright, so while we were working on the code example to share, we received some suggestions from another source that have resolved the issue.
We changed the reader to use an input stream instead of reading in the whole file.
We wrapped the writer in a doall, as was also suggested by #Reut Sharabani.
The underlying issue seems to be how each machine was handling laziness.
Thanks to everyone that responded!
Related
I am working on an application which has to read and process ~29K files (~500 GB) every day. The files will be in zipped format and available on an FTP server.
What I have done: I download the files from the FTP server, unzip them, and process them using multi-threading, which has reduced the processing time significantly (when the number of active threads is fixed to a small number). I've written some code and tested it on ~3.5K files (~32 GB). Details here: https://stackoverflow.com/a/32247100/3737258
However, the estimated processing time, for ~29K files, still seems to be very high.
What I am looking for: Any suggestion/solution which could help me bring the processing time of ~29K files, ~500GB, to 3-4 hours.
Please note that each file has to be read line by line, and each line has to be written to a new file with some modifications (some information removed and some new information added).
You should profile your application and see where the current bottleneck is, and fix that. Proceed until you are at your desired speed or cannot optimize further.
For example:
Maybe you unzip to disk. This is slow; do it in memory instead (a sketch follows this answer).
Maybe there is a load of garbage collection. See if you can re-use buffers and objects.
Maybe the network is the bottleneck, etc.
You can, for example, use visualvm.
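For the in-memory unzip idea above, here is a minimal sketch using ZipInputStream, reading each zipped entry straight from the downloaded stream without ever writing the uncompressed data to disk; processLine is a hypothetical stand-in for your per-line transformation:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class InMemoryUnzip {

    // Reads every entry of the zip directly from the download stream.
    static void processZip(InputStream downloadStream) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(downloadStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(zip, StandardCharsets.UTF_8));
                String line;
                while ((line = reader.readLine()) != null) {   // ends at the end of this entry
                    processLine(line);                          // hypothetical per-line step
                }
                // do not close the reader here; that would close the whole zip stream
            }
        }
    }

    private static void processLine(String line) {
        // placeholder for the real transformation
    }
}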
It's hard to provide you one solution for your issue, since it might be that you simply reached the hardware limit.
Some Ideas:
You can parallelize the processing of the information you read: hand batches of lines to worker threads (from a pool), each of which processes its batch sequentially (a sketch follows this list).
Use java.nio instead of java.io, see: Java NIO FileChannel versus FileOutputstream performance / usefulness
Use a profiler
Instead of a profiler, simply write log messages and measure the duration of the individual parts of your application
Optimize the hardware (use SSD drives, experiment with block size, filesystem, etc.)
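A rough sketch of the first idea: one thread reads a file line by line and hands fixed-size batches to a pool of workers. The batch size, pool size, and transform step are made-up placeholders to be tuned or replaced:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BatchedLineProcessor {

    private static final int BATCH_SIZE = 10_000;  // tune for your data
    private static final int POOL_SIZE  = 4;       // tune for your hardware

    public static void main(String[] args) throws IOException, InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(POOL_SIZE);
        Path input = Paths.get(args[0]);

        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> batch = new ArrayList<>(BATCH_SIZE);
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == BATCH_SIZE) {
                    submitBatch(pool, batch);
                    batch = new ArrayList<>(BATCH_SIZE);
                }
            }
            if (!batch.isEmpty()) {
                submitBatch(pool, batch);
            }
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void submitBatch(ExecutorService pool, List<String> batch) {
        pool.submit(() -> {
            for (String line : batch) {
                transform(line);   // hypothetical: strip/add fields and write the result elsewhere
            }
        });
    }

    private static void transform(String line) {
        // placeholder for the real per-line modification
    }
}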
If you are interested in parallel computing, then please try Apache Spark; it is meant to do exactly what you are looking for.
There is a file - stored on an external server which is updated very frequently by a vendor. My application polls this file every minute getting the values out. All I am doing is reading the file.
I am worried that by doing this I could inadvertently lock the file so it can't be written to by the vendor. Is this a possibility?
Kind regards
Further to Eric's answer - you could check the Last Modified property of the temp file and only merge it with your 'working' file when it changes - that should protect you from read/write conflicts and only merge files just after the vendor has written to the temp file. Though this is messy, and mrab's comment is valid: a better solution should be found.
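A rough sketch of that last-modified check; the per-minute scheduling is assumed to exist already, and mergeIntoWorkingCopy is a hypothetical placeholder for your own merge logic:

import java.io.File;

public class LastModifiedPoller {

    private long lastSeen = -1L;

    // Call this once a minute from your existing polling loop.
    public void pollOnce(File tempFile) {
        long modified = tempFile.lastModified();   // 0L if the file does not exist
        if (modified != 0L && modified != lastSeen) {
            lastSeen = modified;
            mergeIntoWorkingCopy(tempFile);        // hypothetical merge step
        }
    }

    private void mergeIntoWorkingCopy(File tempFile) {
        // placeholder: copy/merge the freshly written data into the working file
    }
}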
I have faced this problem several times, and as Peter Lawrey says there isn't any portable way to do this. If your environment is Unix this should not be an issue at all, as these concurrent access conditions are properly managed by the operating system. However, Windows does not handle this at all (yes, that's the consequence of using an amateur OS for production work, lol).
That said, there is a way to solve this if your vendor is flexible enough: they could write to a temp file and, when finished, move the temp file to the final destination. By doing this you avoid any concurrent access to the file between you and the vendor.
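Assuming the vendor's side were also Java (purely an assumption), the write-then-move pattern could look roughly like this with java.nio.file; the paths are made up, and ATOMIC_MOVE requires both paths to be on the same file system:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.List;

public class WriteThenMove {

    static void publish(List<String> lines) throws IOException {
        Path temp   = Paths.get("/data/feed/prices.csv.tmp");   // hypothetical temp path
        Path target = Paths.get("/data/feed/prices.csv");       // hypothetical final path

        // Write everything to the temp file first...
        Files.write(temp, lines, StandardCharsets.UTF_8);

        // ...then swap it into place in a single step, so a reader never sees
        // a half-written file (on POSIX this also atomically replaces an existing target).
        Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE);
    }
}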
Another way is to know precisely (which may be difficult) the timing of your vendor's updates and avoid reading the file during those time frames. For instance, if your vendor updates the file every hour, avoid reading it from five-to-the-hour until five-past-the-hour.
Hope it helps.
There is the Windows Volume Shadow Copy Service for volumes. This would allow you to read the shadow (backup) copy.
If the third-party software is in Java too, and uses a logger, that should be tweakable: for example, rotate across 10 or so files, writing to the next one every minute.
I would try to relentlessly read the file (when modified since the last read) until something goes wrong. Maybe you can make a test run with hundreds of reads over the weekend or at midnight, when no harm is done.
My answer:
Maybe you need a local watch program: a watch service for a directory that waits until the file is modified and then makes a fast copy, after which the copy can be transmitted.
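A bare-bones version of such a watcher using java.nio.file.WatchService (Java 7+); the directory, file name, and copy destination are made-up examples:

import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

public class VendorFileWatcher {

    public static void main(String[] args) throws IOException, InterruptedException {
        Path dir      = Paths.get("/vendor/feed");              // hypothetical watched directory
        Path fileName = Paths.get("prices.csv");                // hypothetical vendor file
        Path copyTo   = Paths.get("/local/prices-copy.csv");    // hypothetical local copy

        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher, StandardWatchEventKinds.ENTRY_MODIFY);

        while (true) {
            WatchKey key = watcher.take();                      // blocks until something changes
            for (WatchEvent<?> event : key.pollEvents()) {
                Path changed = (Path) event.context();          // file name relative to dir
                if (changed.equals(fileName)) {
                    // make a fast local copy; downstream code reads only the copy
                    Files.copy(dir.resolve(changed), copyTo,
                            StandardCopyOption.REPLACE_EXISTING);
                }
            }
            key.reset();                                        // required to keep receiving events
        }
    }
}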
I am getting killed on performance with file / folder deletes in Java.
The code is quite old and I am wondering if Java 7 (which I upgraded to) actually offers performance improvements, or just another syntax. (I don't want to retool everything unless there is a benefit). I regularly need to extract large ZIPs and then delete the contents and the recursion time is brutal.
I am also stuck on Windows.
Thanks
I would suggest using a library jar already provided by the community.
For example, commons-io-x.x.jar or spring-core.jar.
E.g., with org.apache.commons.io.FileUtils:
FileUtils.copyDirectory(from, to);
FileUtils.deleteDirectory(childDir);
FileUtils.forceDelete(springConfigDir);
FileUtils.writeByteArrayToFile(file, data);
Or with org.springframework.util.FileSystemUtils:
FileSystemUtils.copyRecursively(from, to);
FileSystemUtils.deleteRecursively(dir);
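Since the question specifically asks about Java 7: the NIO.2 API added there can do the recursive delete without an extra library. A sketch (the behaviour matches FileUtils.deleteDirectory; the speed is still bound by the file system, as the next answer explains):

import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

public class RecursiveDelete {

    // Depth-first walk: delete each file, then delete the directory on the way back up.
    static void deleteRecursively(Path root) throws IOException {
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
                if (exc != null) {
                    throw exc;
                }
                Files.delete(dir);
                return FileVisitResult.CONTINUE;
            }
        });
    }
}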
File I/O is very dependent on the performance of your hardware. Many HDDs can perform 80-120 IOPS, so if you are opening files you can read up to about 120 files per second. Deleting a file can require two updates, so you may only manage around 60 file deletions per second. With these constraints there is almost nothing you can do in software which will make any difference.
If you have an SSD, however, it can do 80,000 to 230,000 IOPS (roughly a thousand-fold increase). At that point what you do in software might make a difference, but as you are dealing with compressed files, it is most likely that the CPU will be your bottleneck.
I have a multithreaded file converter that I'm working on. On Windows, it puts each file that's being converted in its own thread and uses 100% CPU (on all cores) all the time. It's awesome! On Ubuntu, I get 100% on the first core and ~10% on all the rest. The performance is poor and disappointing.
I'm using Threads, all within a SwingWorker so I don't freeze the GUI. I use thread.join on all threads so I perform a certain task when all threads are complete. I have not changed the code between OS's. Is there a feasible way to fix this?
It was very dumb and I don't quite understand why, but shortly after I posted this, I transferred all of my files to my Ubuntu partition, and it's just as fast (if not faster) than the Windows one. Not sure why moving files would make it go faster. Perhaps my real issue was that, since they were on different file systems, my bottleneck was I/O. Converting just one file from the NTFS partition took 3x longer than if I moved it to the ext4 partition. (And yes, these are all on the same SSD.)
I am trying to create 100 files using FileOutputStream/BufferedOutputStream.
I can see that CPU utilization is 100% for 5 to 10 seconds. The directory I am writing to is empty. I am creating the PDF files through iText, and each file is around 1 MB. I am running on Linux.
How can I rewrite the code so that I can minimize the CPU utilization?
Don't guess: profile your application.
If the numbers show that a lot of time is spent in / within write calls, then look at ways to do faster I/O. But if most time is spent in formatting stuff for output (e.g. iText rendering), then that's where you need to focus your efforts.
Is this in a directory which already contains a lot of files? If so, you may well just be seeing the penalty for having a lot of files in a directory - this varies significantly by operating system and file system.
Otherwise, what are you actually doing while you're creating the files? Where does the data come from? Are they big files? One thing you might want to do is try writing to a ByteArrayOutputStream instead - that way you can see how much of the activity is due to the file system and how much is just how you're obtaining/writing the data.
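One way to run that experiment, sketched with a hypothetical renderPdf() standing in for wherever iText hands you the bytes; compare the two timings to see how much the file system contributes:

import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class WriteCostExperiment {

    public static void main(String[] args) throws IOException {
        long inMemory = timeRun(new ByteArrayOutputStream());       // no file system involved
        long onDisk   = timeRun(new BufferedOutputStream(
                new FileOutputStream("out-test.pdf")));             // hypothetical file name

        System.out.println("generation only: " + inMemory
                + " ms, generation + disk: " + onDisk + " ms");
    }

    private static long timeRun(OutputStream out) throws IOException {
        long start = System.currentTimeMillis();
        try (OutputStream o = out) {
            o.write(renderPdf());   // hypothetical: produce one ~1 MB document's bytes
        }
        return System.currentTimeMillis() - start;
    }

    private static byte[] renderPdf() {
        return new byte[1_000_000]; // placeholder for the real iText output
    }
}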
It's a long shot guess, but even if you're using buffered streams make sure you're not writing out a single byte at a time.
The single-byte read() and write(int) methods are CPU killers. You should be using read(byte[], int, int) and write(byte[], int, int) for certain.
A 1 MB file is large enough to use a java.nio FileChannel and see large performance improvements over java.io. Rewrite your code and measure it against the old version. I predict a 2x improvement, at a minimum.
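A small sketch of that rewrite, assuming the whole ~1 MB document is already available as a byte array:

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class ChannelWriter {

    // Writes the whole document in one (or a few) large channel writes
    // instead of many small stream writes.
    static void write(String fileName, byte[] document) throws IOException {
        ByteBuffer buffer = ByteBuffer.wrap(document);
        try (FileOutputStream fos = new FileOutputStream(fileName);
             FileChannel channel = fos.getChannel()) {
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
        }
    }
}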
You're unlikely to be able to reduce the CPU load for your task, especially on a Windows system. Java on Linux does support asynchronous file I/O; however, this can seriously complicate your code. I suspect you are running on Windows, as file I/O generally takes much more time on Windows than it does on Linux. I've even heard of improvements by running Java in a Linux VM on Windows.
Take a look at your Task Manager when the process is running, and turn on Show Kernel Times. The CPU time spent in user space can generally be optimized, but the CPU time spent in kernel space can usually only be reduced by making more efficient calls.
Update -
JSR 203 specifically addresses the need for asynchronous, multiplexed, scatter/gather file IO:
The multiplexed, non-blocking facility introduced by JSR-51 solved much of that problem for network sockets, but it did not do so for filesystem operations.
Until JSR-203 becomes part of Java, you can get true asynchronous IO with the Apache MINA project on Linux.
Java NIO (1) allows you to do channel-based I/O. This is an improvement in performance, but you're only doing a buffer of data at a time, not true async & multiplexed I/O.
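For readers on Java 7 or later: JSR 203 did eventually ship as NIO.2, and AsynchronousFileChannel provides the asynchronous file I/O described above. A minimal sketch, with a made-up output file and placeholder bytes:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class AsyncWrite {

    public static void main(String[] args) throws IOException, InterruptedException, ExecutionException {
        Path path = Paths.get("async-out.pdf");      // hypothetical output file
        byte[] document = new byte[1_000_000];        // placeholder for the real PDF bytes

        try (AsynchronousFileChannel channel = AsynchronousFileChannel.open(
                path, StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {

            // The write returns immediately; the Future completes when the
            // bytes have been handed off to the OS.
            Future<Integer> result = channel.write(ByteBuffer.wrap(document), 0);

            // ...other work could happen here...

            int written = result.get();               // block only when we need the result
            System.out.println("wrote " + written + " bytes");
        }
    }
}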