I've written a recursive program that lists all the files on my C drive into a .txt file, but it is very slow.
I've read online that recursion is slow, but I can't think of any other way to do this. Is there any way I can optimize it?
EDIT: I changed the deepInspect method to use a Stack instead of recursion, which slightly improved performance.
Here is the code:
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Stack;

public class FileCount {
    static long fCount = 0;

    public static void main(String[] args) {
        System.out.println("Start....");
        long start = System.currentTimeMillis();
        File cDir = new File("C:\\");
        inspect(cDir);
        System.out.println("Operation took : " + (System.currentTimeMillis() - start) + " ms");
    }

    private static void inspect(File cDir) {
        for (File f : cDir.listFiles()) {
            deepInspect(f);
        }
    }

    private static void deepInspect(File f) {
        Stack<File> stack = new Stack<File>();
        stack.push(f);
        while (!stack.isEmpty()) {
            File current = stack.pop();
            if (current.listFiles() != null) {
                for (File file : current.listFiles()) {
                    stack.push(file);
                }
            }
            writeData(current.getAbsolutePath());
        }
    }

    static FileWriter writer = null;

    private static void writeData(String absolutePath) {
        if (writer == null)
            try {
                writer = new FileWriter("C:\\Collected\\data.txt");
            } catch (IOException e) {}
        try {
            writer.write(absolutePath);
            writer.write("\r\n"); // newline
            writer.write("Files : " + fCount);
            writer.write("\r\n");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Java 8 provides a stream to process all files.
Files.walk(Paths.get("/"))
.filter(Files::isRegularFile)
.forEach(System.out::println);
You could add "parallel" processing for improved performance
Files.walk(Paths.get("/"))
.parallel()
.filter(Files::isRegularFile)
.forEach(System.out::println);
I tried this under Linux, so you would need to replace "/" with "C:\\" and try it. Also, in my case it stops when it hits a file I don't have access to read, so you would need to handle that too if you are not running as admin.
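If you need to keep walking past entries you cannot read, Files.walkFileTree lets you override visitFileFailed instead of aborting. A minimal sketch, assuming the same root path as above:

import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

Files.walkFileTree(Paths.get("C:\\"), new SimpleFileVisitor<Path>() {
    @Override
    public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
        System.out.println(file); // process the file here
        return FileVisitResult.CONTINUE;
    }

    @Override
    public FileVisitResult visitFileFailed(Path file, IOException exc) {
        return FileVisitResult.CONTINUE; // skip entries we are not allowed to read
    }
});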
I don't think the recursion is the issue here. The main problem in your code is the file I/O you are doing for every single entry. Disk access is extremely costly compared to memory access; if you profile your code you should see a huge spike in disk I/O.
So, essentially, you want to reduce the disk I/O. To do so you could keep an in-memory buffer of finite size, write the output into it, and flush the data to the file only when the buffer is full.
This is, however, considerably more work.
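In practice, wrapping the FileWriter from the question in a BufferedWriter gives you exactly such a buffer. A minimal sketch (the 64 KB buffer size is just an example):

import java.io.*;

class BufferedDataWriter {
    private final BufferedWriter writer;

    BufferedDataWriter(String path) throws IOException {
        // one writer for the whole traversal; writes are collected in a 64 KB buffer
        writer = new BufferedWriter(new FileWriter(path), 64 * 1024);
    }

    void writeData(String absolutePath) throws IOException {
        writer.write(absolutePath);
        writer.write("\r\n");
    }

    void close() throws IOException {
        writer.close(); // flushes whatever is still buffered
    }
}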
Related
I have a file A.txt of 100,000,000 records numbered 1 to 100000000, one record per line. I have to read file A and write to files B and C, such that even-numbered lines go to file B and odd-numbered lines go to file C.
The required read-and-write time must be less than 40 seconds.
Below is the code that I already have, but its runtime takes more than 50 seconds.
Does anyone have any other solution to reduce the runtime?
Threading.java
import java.io.*;
import java.util.concurrent.LinkedBlockingQueue;

public class Threading implements Runnable {
    LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();
    String file;
    Boolean stop = false;

    public Threading(String file) {
        this.file = file;
    }

    public void addQueue(String row) {
        queue.add(row);
    }

    public void Stop() {
        stop = true;
    }

    public void run() {
        try {
            BufferedWriter bw = new BufferedWriter(new FileWriter(file));
            while (!stop) {
                try {
                    String row = queue.take();
                    bw.write(row + "\n");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
            bw.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
ThreadCreate.java
// I used 2 threads to write to 2 files B and C
import java.io.*;

public class ThreadCreate {
    public void startThread(File file) {
        Threading t1 = new Threading("B.txt");
        Threading t2 = new Threading("C.txt");
        Thread td1 = new Thread(t1);
        Thread td2 = new Thread(t2);
        td1.start();
        td2.start();
        try {
            BufferedReader br = new BufferedReader(new FileReader(file));
            String line;
            long start = System.currentTimeMillis();
            while ((line = br.readLine()) != null) {
                if (Integer.parseInt(line) % 2 == 0) {
                    t1.addQueue(line);
                } else {
                    t2.addQueue(line);
                }
            }
            t1.Stop();
            t2.Stop();
            br.close();
            long end = System.currentTimeMillis();
            System.out.println("Time to read file A and write file B, C: " + ((end - start) / 1000) + "s");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Main.java
import java.io.*;

public class Main {
    public static void main(String[] args) throws IOException {
        File file = new File("A.txt");
        // Write files B and C
        ThreadCreate t = new ThreadCreate();
        t.startThread(file);
    }
}
Why are you making threads? That just slows things down. Threads are useful if the bottleneck is either the calculation itself or the blocking nature of the operation; here it is neither. The CPU is just idling (the bottleneck will be the disk), and the nature of what it blocks on means multithreading does not help either: telling a single SSD to write two boatloads of bytes in parallel is probably no faster (it may even be slower, as the drive has to bounce back and forth). If the target disk is a spinning disk, it is way slower: the write head cannot make clones of itself, and by making the job multithreaded you waste a ton of time asking the write head to bounce back and forth between the different write locations.
There's nothing that immediately strikes me as ripe for significant speedups.
Sometimes, writing a ton of data to a disk just takes 50 seconds. If that's not acceptable, buy a faster disk.
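For reference, a plain single-threaded version with buffered streams is close to the best you can do for this task. A sketch, using the file names from the question:

import java.io.*;

public class SequentialSplit {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("A.txt"));
             BufferedWriter b = new BufferedWriter(new FileWriter("B.txt"));
             BufferedWriter c = new BufferedWriter(new FileWriter("C.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                // even values go to B, odd values to C, as in the question
                BufferedWriter out = (Integer.parseInt(line) % 2 == 0) ? b : c;
                out.write(line);
                out.newLine();
            }
        }
    }
}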
Try memory-mapped files:
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

byte[] buffer = "foo bar foo bar text\n".getBytes();
int number_of_lines = 100000000;

FileChannel file = new RandomAccessFile("writeFIle.txt", "rw").getChannel();
// cast to long so the size computation cannot overflow int
ByteBuffer wrBuf = file.map(FileChannel.MapMode.READ_WRITE, 0, (long) buffer.length * number_of_lines);
for (int i = 0; i < number_of_lines; i++) {
    wrBuf.put(buffer);
}
file.close();
It took my computer (Dell, i7 processor, SSD, 32GB RAM) a little over half a minute to run this code. Note that a single mapping is limited to Integer.MAX_VALUE bytes, so the roughly 2.1 GB mapped here only just fits.
I'm searching a directory of files with Java 8 and extracting music files. When I run my code on Linux (Debian Wheezy) it completes in around 20 seconds. However, when I run the identical code in Windows 8.1 (same machine!) it takes an inordinately long time, so long that it's really unusable. I've ascertained that the process is occurring as it should, just very slowly. In the time that the Linux variant finds all 2500 files, the Windows variant has found around 100.
Here is the code:
public int List(String path) throws InterruptedException, IOException {
    //Linux variant
    if (HomeScreen.os.equals("Linux")) {
        File root = new File(path);
        File[] list = root.listFiles();
        if (list == null) { // check before sorting, or Arrays.sort throws an NPE
            return 0;
        }
        Arrays.sort(list);
        for (File f : list) {
            if (f.isDirectory()) {
                List(f.getAbsolutePath());
            } else if (f.isFile()) {
                String outPath = f.getAbsolutePath();
                try {
                    String ext = outPath.substring(outPath.lastIndexOf(".") + 1);
                    if (ext.equals("wma") || ext.equals("m4a") || ext.equals("mp3")) {
                        String fulltrack = outPath.substring(outPath.lastIndexOf("Music/") + 6);
                        lm.addElement(fulltrack);
                        numbers++;
                    }
                } catch (Exception e) {
                    System.out.println(outPath + " is not a valid file!!!!!");
                }
                HomeScreen.Library.setModel(lm);
            }
        }
    //Windows variant
    } else if (HomeScreen.os.equals("Windows 8.1")) {
        System.out.println("Using " + HomeScreen.os + " methods...");
        File root = new File(path);
        File[] list = root.listFiles();
        if (list == null) {
            return 0;
        }
        Arrays.sort(list);
        for (File f : list) {
            if (f.isDirectory()) {
                List(f.getAbsolutePath());
            } else if (f.isFile()) {
                String outPath = f.getAbsolutePath();
                try {
                    String ext = outPath.substring(outPath.lastIndexOf(".") + 1);
                    if (ext.equals("wma") || ext.equals("m4a") || ext.equals("mp3")) {
                        String fulltrack = outPath.substring(outPath.lastIndexOf("Music/") + 9);
                        lm.addElement(fulltrack);
                        numbers++;
                    }
                } catch (Exception e) {
                    System.out.println(outPath + " is not a valid file!!!!!");
                }
                HomeScreen.Library.setModel(lm);
            }
        }
    }
    return numbers;
}
I'm still pretty new to Java, so I'm not sure how to go about optimising the code for Windows. Is there any way this can be sped up, or are Windows users doomed to go for a coffee and wait for the load up?
Incidentally, I've put this method in a thread when using Windows so that other things can be done whilst waiting, but this is most definitely not an ideal solution. The drive being searched is a 7200 rpm HDD and there is 8GB RAM installed.
Try the new Java 8 stream API: it allows you to do all of the actions (sort, filter, forEach) in one loop, and in parallel.
Here is your changed code (you might need to fix some parts I didn't have, like HomeScreen):
Arrays.asList(root.listFiles())
    .parallelStream()
    .filter(file -> file != null)
    .forEach(file -> {
        if (file.isDirectory())
        {
            List(file.getAbsolutePath());
        }
        else if (file.isFile())
        {
            String outPath = file.getAbsolutePath();
            try
            {
                String ext = outPath.substring(outPath.lastIndexOf(".") + 1);
                if (ext.equals("wma") || ext.equals("m4a") || ext.equals("mp3"))
                {
                    String fulltrack = outPath.substring(outPath.lastIndexOf("Music/") + 9);
                    lm.addElement(fulltrack);
                    numbers++;
                }
            } catch (Exception e)
            {
                System.out.println(outPath + " is not a valid file!!!!!");
            }
            HomeScreen.Library.setModel(lm);
        }
    });
As recommended in a comment to the question linked by Lorenzo Boccaccia, I'd go for newDirectoryStream. It returns the files one by one, which should be faster.
I'd also consider using multithreading. With a single thread, you wait for the disk nearly all the time. Modern disks are capable of handling multiple outstanding requests, so using 2-4 threads should help.
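A minimal sketch of the directory-stream approach (the path and glob pattern are just examples):

import java.io.IOException;
import java.nio.file.*;

// entries are produced lazily, one at a time, instead of all at once like listFiles()
try (DirectoryStream<Path> stream =
         Files.newDirectoryStream(Paths.get("C:\\Music"), "*.{mp3,m4a,wma}")) {
    for (Path entry : stream) {
        System.out.println(entry.getFileName());
    }
} catch (IOException e) {
    e.printStackTrace();
}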
A side note: There's no reason to write the code differently for Linux and Windows. There may be minor changes needed, but they should be handled by some small helper method.
Whatever you do, never write things like
if (HomeScreen.os.equals("Linux")) {
...
} else if (HomeScreen.os.equals("Windows 8.1")) {
...
}
What if it's "Windows 8.2"?
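For example, the prefix handling can be written once against File.separator instead of being duplicated per OS. A sketch based on the question's code (I'm assuming the +6/+9 offsets were both meant to skip past the "Music" folder prefix):

// one code path for all platforms
String marker = "Music" + File.separator;
String fulltrack = outPath.substring(outPath.lastIndexOf(marker) + marker.length());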
Only one instance of my Java application can run at a time. It runs on Linux. I need to ensure that one thread doesn't modify the file while the other thread is using it.
I don't know which file locking or synchronization method to use. I have never done file locking in Java and I don't have much Java or programming experience.
I looked into java NIO and I read that "File locks are held on behalf of the entire Java virtual machine. They are not suitable for controlling access to a file by multiple threads within the same virtual machine." Right away I knew that I needed expert help because this is production code and I have almost no idea what I'm doing (and I have to get it done today).
Here's a brief outline of my code to upload some stuff (archive files) to a server. It gets the list of files to upload from a file (call it "listFile") -- and listFile can be modified while this method is reading from it. I minimize the chances of that by copying listFile to a temp file and using that temp file thereafter. But I think I need to lock the file during this copy process (or something like that).
package myPackage;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import com.example.my.FileHelper;
import com.example.my.Logger;

public class BatchUploader implements Runnable {
    private int processUploads() {
        File myFileToUpload;
        File copyOfListFile = null;
        try {
            copyOfListFile = new File("/path/to/temp/workfile");
            File origFile = new File("/path/to/listFile"); // "listFile" - the file that contains a list of files to upload
            DataWriter.copyFile(origFile, copyOfListFile); // see code below
        } catch (IOException ex) {
            Logger.log(ex);
        }
        try {
            BufferedReader input = new BufferedReader(new FileReader(copyOfListFile));
            try {
                while (!stopRunning && (fileToUploadName = input.readLine()) != null) {
                    upload(new File(fileToUploadName));
                }
            } finally {
                input.close();
                isUploading = false;
            }
        } catch (IOException ex) {
            Logger.log(ex);
        }
        return filesUploadedCount;
    }
}
Here is the code that modifies the list of files to upload that the code above uses:
import java.io.*;
import java.nio.channels.FileChannel;

public class DataWriter {
    public void modifyListOfFilesToUpload(String uploadedFilename) {
        StringBuilder content = new StringBuilder();
        try {
            File listOfFiles = new File("/path/to/listFile"); // file that contains a list of files to upload
            if (!listOfFiles.exists()) {
                //some code
            }
            BufferedReader input = new BufferedReader(new FileReader(listOfFiles));
            try {
                String line = "";
                while ((line = input.readLine()) != null) {
                    if (!line.isEmpty() && line.endsWith(FILE_EXTENSION)) {
                        if (!line.contains(uploadedFilename)) {
                            content.append(String.format("%1$s%n", line));
                        } else {
                            //some code
                        }
                    } else {
                        //some code
                    }
                }
            } finally {
                input.close();
            }
            this.write("/path/to/", "listFile", content.toString(), false, false, false);
        } catch (IOException ex) {
            Logger.debug("Error reading/writing uploads logfile: " + ex.getMessage());
        }
    }

    public static void copyFile(File in, File out) throws IOException {
        FileChannel inChannel = new FileInputStream(in).getChannel();
        FileChannel outChannel = new FileOutputStream(out).getChannel();
        try {
            inChannel.transferTo(0, inChannel.size(), outChannel);
        } catch (IOException e) {
            throw e;
        } finally {
            if (inChannel != null) {
                inChannel.close();
            }
            if (outChannel != null) {
                outChannel.close();
            }
        }
    }

    private void write(String path, String fileName, String data, boolean append, boolean addNewLine, boolean doLog) {
        try {
            File file = FileHelper.getFile(fileName, path);
            BufferedWriter bw = new BufferedWriter(new FileWriter(file, append));
            bw.write(data);
            if (addNewLine) {
                bw.newLine();
            }
            bw.flush();
            bw.close();
            if (doLog) {
                Logger.debug(String.format("Wrote %1$s%2$s", path, fileName));
            }
        } catch (java.lang.Exception ex) {
            Logger.log(ex);
        }
    }
}
May I suggest a slightly different approach. As far as I remember, on Linux the file rename (mv) operation is atomic on local disks, so there is no chance for one process to see a half-written file.
Let XXX be a sequence number with three (or more) digits. You could let your DataWriter append to a file called listFile-XXX.prepare and write a fixed number N of filenames into it. When N names are written, close the file and rename it (atomically, see above) to listFile-XXX. With the next filename, start writing to listFile-YYY, where YYY = XXX + 1.
Your BatchUploader may at any time check whether it finds files matching the pattern listFile-XXX, open them, read them, upload the named files, then close and delete them. There is no chance for the threads to mess up each other's files.
Implementation hints:
Make sure the polling mechanism in BatchUploader sleeps for a second or more when it does not find a file ready for upload (to avoid busy waiting).
You may want to sort the listFile-XXX files by XXX, to make sure the uploading is kept in sequence.
Of course you could vary the protocol for when listFile-XXX.prepare is closed: if DataWriter has nothing to do for a longer time, you don't want filenames that are ready for upload to sit around just because there are not yet N of them.
Benefits: no locking (which would be a pain to get right), no copying, and an easy overview of the work queue and its state in the file system.
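A sketch of the publish step with java.nio (the paths and the 001 sequence number just follow the scheme above):

import java.io.IOException;
import java.nio.file.*;

// writer side: fill listFile-001.prepare, then make it visible in one atomic step
Path prepare = Paths.get("/path/to/listFile-001.prepare");
Path ready = Paths.get("/path/to/listFile-001");
Files.move(prepare, ready, StandardCopyOption.ATOMIC_MOVE);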
Here is a slightly different suggestion. Assuming your file names don't contain '\n' characters (a big assumption on Linux, I know, but you can have your writer check for that), why not read only complete lines and ignore the incomplete ones? By incomplete lines, I mean lines that end at EOF without a trailing '\n'.
Edit: see more suggestions in comments below.
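A sketch of that idea: read the raw bytes and only process data up to the last '\n'; anything after it is an incomplete line and is left for the next pass. (Reading the whole file into memory is an assumption that only works for reasonably small list files.)

import java.nio.charset.StandardCharsets;
import java.nio.file.*;

byte[] bytes = Files.readAllBytes(Paths.get("/path/to/listFile"));
int lastNewline = -1;
for (int i = bytes.length - 1; i >= 0; i--) {
    if (bytes[i] == '\n') { lastNewline = i; break; }
}
if (lastNewline >= 0) {
    // everything before the last '\n' is a set of complete lines
    String complete = new String(bytes, 0, lastNewline, StandardCharsets.UTF_8);
    for (String name : complete.split("\n")) {
        System.out.println(name); // e.g. hand each name to the uploader
    }
}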
I am running a thread to traverse my local directory (no subdirectories), and as soon as I find a text file, I start a new thread that searches for a word in that file.
What is wrong with the code below?
Searching and traversing work fine separately, but when I put them together something goes wrong and files get skipped (more precisely, with multithreading the object synchronization is not happening properly).
Please help me out.
Traverse.java
public void executeTraversing() {
    Path dir = null;
    if (dirPath.startsWith("file://")) {
        dir = Paths.get(URI.create(dirPath));
    } else {
        dir = Paths.get(dirPath);
    }
    listFiles(dir);
}

private synchronized void listFiles(Path dir) {
    ExecutorService executor = Executors.newFixedThreadPool(1);
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
        for (Path file : stream) {
            if (Files.isDirectory(file)) {
                listFiles(file);
            } else {
                search.setFileNameToSearch(file);
                executor.submit(search);
            }
        }
    } catch (IOException | DirectoryIteratorException x) {
        // IOException can never be thrown by the iteration.
        // In this snippet, it can only be thrown by newDirectoryStream.
        System.err.println(x);
    }
}
Search.java
/**
 * @param wordToSearch
 */
public Search(String wordToSearch) {
    super();
    this.wordToSearch = wordToSearch;
}

public void run() {
    this.search();
}

private synchronized void search() {
    counter = 0;
    Charset charset = Charset.defaultCharset();
    try (BufferedReader reader = Files.newBufferedReader(fileNameToSearch.toAbsolutePath(), charset)) {
        // do you have permission to read this directory?
        if (Files.isReadable(fileNameToSearch)) {
            String line = null;
            while ((line = reader.readLine()) != null) {
                counter++;
                //System.out.println(wordToSearch + " " + fileNameToSearch);
                if (line.contains(wordToSearch)) {
                    System.out.println("Word '" + wordToSearch
                            + "' found at "
                            + counter
                            + " in "
                            + fileNameToSearch);
                }
            }
        } else {
            System.out.println(fileNameToSearch + " is not readable.");
        }
    } catch (IOException x) {
        System.err.format("IOException: %s%n", x);
    }
}
The problem is this Search instance that you keep reusing here:

search.setFileNameToSearch(file);
executor.submit(search);

While its actual search() method is synchronized, by the time a task actually gets around to searching, setFileNameToSearch() may have been called several more times, which would explain the skipping.
Create a new instance of Search each time; then you wouldn't need to synchronize the actual search() function.
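A sketch of that change, assuming Search is given a (hypothetical) two-argument constructor that also takes the file:

for (Path file : stream) {
    if (Files.isDirectory(file)) {
        listFiles(file);
    } else {
        // fresh Search per file, so no shared mutable state between tasks
        executor.submit(new Search(wordToSearch, file));
    }
}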
You are creating the ExecutorService inside your listFiles method. This is probably not a good idea: because of that, you are creating a new executor (and new threads) for every directory.
On top of that, you are not monitoring the state of all these ExecutorServices; some of their tasks might not have run yet when your application stops.
Instead, you should create the ExecutorService only once, before starting the recursion. When the recursion is over, call shutdown() on your ExecutorService and wait for all tasks to complete.
Furthermore, you are reusing a single Search object and passing it to multiple tasks while modifying it; you should create a new Search for each file you process.
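A sketch of that structure, reusing the method names from the question (note that listFiles now takes the executor as a parameter; pool size and timeout are just examples):

public void executeTraversing() throws InterruptedException {
    ExecutorService executor = Executors.newFixedThreadPool(4);
    Path dir = dirPath.startsWith("file://")
            ? Paths.get(URI.create(dirPath))
            : Paths.get(dirPath);
    listFiles(dir, executor);    // the recursion reuses this one executor
    executor.shutdown();         // stop accepting new tasks
    executor.awaitTermination(1, TimeUnit.HOURS); // wait for the submitted searches
}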
I am looking to read the contents of files in Java. I have about 8000 files whose contents I want to read into a HashMap of (path, contents). I think using threads would be an option to speed up the process.
From what I know, reading all 8000 files in separate threads is not feasible (we may want to limit the number of threads). Any comments on that? Also, I am new to threading in Java; can anyone help me get started on this one?
So far I have this pseudocode:
public class ThreadingTest extends Thread {
    public HashMap<String, String> contents = new HashMap<String, String>();

    public ThreadingTest(ArrayList<String> paths) throws IOException {
        for (String s : paths) {
            // paths is paths to files.
            // Have threading here for each path going to get contents from a file
            // Not sure how to limit and start threads here
            readFile(s);
            Thread t = new Thread();
            t.start();
        }
    }

    public String readFile(String path) throws IOException {
        FileReader reader = new FileReader(path);
        StringBuilder sb = new StringBuilder();
        BufferedReader br = new BufferedReader(reader);
        String line;
        while ((line = br.readLine()) != null) {
            sb.append(line);
        }
        br.close();
        return sb.toString();
    }
}
Any help completing the threading part would be appreciated. Thanks.
Short answer: Read the files sequentially. Disk I/O doesn't parallelize well.
Long Answer: Threading might improve the read performance if the disks are good at random access (SSD disks are) or if the files are placed on several different disks, but if they're not you're just likely to end up with a lot of cache misses and waiting for the disks to seek the right read position. (You may still end up there even if your disks are good at random access.)
If you want to measure instead of guess, use Executors.newFixedThreadPool to create an ExecutorService which can read your files in parallel. Experiment with different thread counts, but don't be surprised if one reader thread per physical disk gives you the best performance.
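A sketch of such an experiment, assuming a readFile(String) helper like the one in the question:

import java.util.*;
import java.util.concurrent.*;

ExecutorService pool = Executors.newFixedThreadPool(4); // vary this count and measure
Map<String, String> contents = new ConcurrentHashMap<>();
List<Callable<Void>> tasks = new ArrayList<>();
for (String path : paths) {
    tasks.add(() -> {
        contents.put(path, readFile(path)); // readFile as in the question
        return null;
    });
}
long start = System.currentTimeMillis();
pool.invokeAll(tasks); // blocks until every file has been read
System.out.println((System.currentTimeMillis() - start) + " ms");
pool.shutdown();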
This is a typical task for a thread pool. See the tutorial here: http://download.oracle.com/javase/tutorial/essential/concurrency/pools.html
import java.io.*;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.*;

public class PooledFileProcessing {
    private Map<String, String> contents = Collections.synchronizedMap(new HashMap<String, String>());

    // Integer.MAX_VALUE items max
    private LinkedBlockingQueue<Runnable> workQueue = new LinkedBlockingQueue<Runnable>();

    private ExecutorService executor = new ThreadPoolExecutor(
            5, // five workers by default
            20, // up to twenty workers
            1, TimeUnit.MINUTES, // idle thread dies in one minute
            workQueue
    );

    public void process(final String basePath) {
        visit(new File(basePath));
        System.out.println(workQueue.size() + " jobs still in queue");
        executor.shutdown();
        try {
            executor.awaitTermination(5, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            System.out.println("interrupted while awaiting termination");
        }
        System.out.println(contents.size() + " files indexed");
    }

    public void visit(final File file) {
        if (!file.exists()) {
            return;
        }
        if (file.isFile()) { // skip the dirs
            executor.submit(new RunnablePullFile(file));
        }
        // traverse children
        if (file.isDirectory()) {
            final File[] children = file.listFiles();
            if (children != null && children.length > 0) {
                for (File child : children) {
                    visit(child);
                }
            }
        }
    }

    public static void main(String[] args) {
        new PooledFileProcessing().process(args.length == 1 ? args[0] : System.getProperty("user.home"));
    }

    protected class RunnablePullFile implements Runnable {
        private final File file;

        public RunnablePullFile(File file) {
            this.file = file;
        }

        public void run() {
            BufferedReader reader = null;
            try {
                reader = new BufferedReader(new FileReader(file));
                StringBuilder sb = new StringBuilder();
                String line;
                while (
                        (line = reader.readLine()) != null &&
                        sb.length() < 8192 /* remove this check for a nice OOME or swap thrashing */
                ) {
                    sb.append(line);
                }
                contents.put(file.getPath(), sb.toString());
            } catch (IOException e) {
                System.err.println("failed on file: '" + file.getPath() + "': " + e.getMessage());
            } finally {
                // close in finally so the reader is released on success as well as on error
                if (reader != null) {
                    try {
                        reader.close();
                    } catch (IOException e1) {
                        // ignore that one
                    }
                }
            }
        }
    }
}
From my experience, threading helps: use a thread pool and play with values around 1-2 threads per core.
Just take care with the hash map: consider putting data into the map via a synchronized method only. I remember I once had some ugly issues in a similar project, and they were related to concurrent modifications of a central hash map.
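An alternative to a synchronized method is a concurrent map from java.util.concurrent:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// safe for concurrent put() calls from multiple reader threads
Map<String, String> contents = new ConcurrentHashMap<String, String>();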
Just some quick tips.
First of all, to get started on threads you should look at the Runnable interface or the Thread class. To make a thread, you either implement this interface with a class or extend this class with another class. You can also make anonymous threads, but I dislike the readability of those unless it's something SUPER simple.
Next, some notes on processing text with multiple threads, because it just so happens I have some experience in exactly this! Keep in mind that if the files are large and a single file takes a noticeably long time to process, you will want to monitor your CPU. In my experience, I was doing lots of calculations and lookups while processing, which added hugely to my load, so in the end I found that I could only make as many threads as I had processors, because each thread was so labor-intensive. So keep that in mind: you want to monitor the effect each thread has on the processor.
I'm not sure having threads for this would really speed up the process if all the files are on the same physical disk. It could even slow things down because the disk would have to constantly switch from one location to the other.