I am looking to read the contents of about 8,000 files in Java and store them in a HashMap as (path, contents) pairs. I think using threads would be an option to speed up the process.
From what I know, reading all 8,000 files in separate threads at once is not feasible (we may want to limit the number of threads). Any comments on that? Also, I am new to threading in Java; can anyone help me get started on this one?
So far I have this pseudocode:
public class ThreadingTest extends Thread {
    public HashMap<String, String> contents = new HashMap<String, String>();

    public ThreadingTest(ArrayList<String> paths) throws IOException {
        for (String s : paths) {
            // paths are paths to files.
            // Want a thread here for each path to fetch that file's contents.
            // Not sure how to limit and start threads here.
            readFile(s);
            Thread t = new Thread();
            t.start();
        }
    }

    public String readFile(String path) throws IOException {
        BufferedReader br = new BufferedReader(new FileReader(path));
        try {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line);
            }
            return sb.toString();
        } finally {
            br.close();
        }
    }
}
Any help completing the threading part is appreciated. Thanks.
Short answer: Read the files sequentially. Disk I/O doesn't parallelize well.
Long Answer: Threading might improve the read performance if the disks are good at random access (SSD disks are) or if the files are placed on several different disks, but if they're not you're just likely to end up with a lot of cache misses and waiting for the disks to seek the right read position. (You may still end up there even if your disks are good at random access.)
If you want to measure instead of guess, use Executors.newFixedThreadPool to create an ExecutorService which can read your files in parallel. Experiment with different thread counts, but don't be surprised if one reader thread per physical disk gives you the best performance.
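For example, a minimal sketch of that experiment (the class and method names are illustrative, not from the question):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.concurrent.*;

public class ParallelFileReader {
    // Reads every file into a (path -> contents) map using a fixed-size pool.
    static Map<String, String> readAll(List<Path> paths, int threads) throws InterruptedException {
        final Map<String, String> contents = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (final Path p : paths) {
            pool.submit(() -> {
                try {
                    contents.put(p.toString(),
                            new String(Files.readAllBytes(p), StandardCharsets.UTF_8));
                } catch (IOException e) {
                    System.err.println("failed on " + p + ": " + e.getMessage());
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
        return contents;
    }
}

Time readAll with 1, 2, 4, ... threads and keep whichever count wins on your hardware.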
This is a typical task for thread pool. See the tutorial here: http://download.oracle.com/javase/tutorial/essential/concurrency/pools.html
import java.io.*;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.*;
public class PooledFileProcessing {
private Map<String, String> contents = Collections.synchronizedMap(new HashMap<String, String>());
// Integer.MAX_VALUE items max
private LinkedBlockingQueue<Runnable> workQueue = new LinkedBlockingQueue<Runnable>();
private ExecutorService executor = new ThreadPoolExecutor(
5, // five workers by default
20, // up to twenty workers (note: with an unbounded queue the pool never actually grows past the core size)
1, TimeUnit.MINUTES, // idle thread dies in one minute
workQueue
);
public void process(final String basePath) {
visit(new File(basePath));
System.out.println(workQueue.size() + " jobs still in queue");
executor.shutdown();
try {
executor.awaitTermination(5, TimeUnit.MINUTES);
} catch (InterruptedException e) {
System.out.println("interrupted while awaiting termination");
}
System.out.println(contents.size() + " files indexed");
}
public void visit(final File file) {
if (!file.exists()) {
return;
}
if (file.isFile()) { // skip the dirs
executor.submit(new RunnablePullFile(file));
}
// traverse children
if (file.isDirectory()) {
final File[] children = file.listFiles();
if (children != null && children.length > 0) {
for (File child : children) {
visit(child);
}
}
}
}
public static void main(String[] args) {
new PooledFileProcessing().process(args.length == 1 ? args[0] : System.getProperty("user.home"));
}
protected class RunnablePullFile implements Runnable {
private final File file;
public RunnablePullFile(File file) {
this.file = file;
}
public void run() {
BufferedReader reader = null;
try {
reader = new BufferedReader(new FileReader(file));
StringBuilder sb = new StringBuilder();
String line;
while (
(line=reader.readLine()) != null &&
sb.length() < 8192 /* remove this check for a nice OOME or swap thrashing */
) {
sb.append(line);
}
contents.put(file.getPath(), sb.toString());
} catch (IOException e) {
System.err.println("failed on file: '" + file.getPath() + "': " + e.getMessage());
} finally {
// close on success as well, not only on failure
if (reader != null) {
try {
reader.close();
} catch (IOException e1) {
// ignore that one
}
}
}
}
}
}
From my experience, threading helps: use a thread pool and experiment with values around 1..2 threads per core.
Just take care with the hash map: consider putting data into the map via a synchronized method only. I remember I once had some ugly issues in a similar project, and they were related to concurrent modifications of a central hash map.
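For example (a small sketch; either option avoids the concurrent-modification problem, and ConcurrentHashMap usually scales better than the synchronized wrapper):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SafeMaps {
    // Every call on this wrapper is synchronized on the map itself.
    static final Map<String, String> SYNCED =
            Collections.synchronizedMap(new HashMap<String, String>());

    // Lock-striped map built for many concurrent writers.
    static final Map<String, String> CONCURRENT = new ConcurrentHashMap<>();

    public static void main(String[] args) {
        SYNCED.put("/some/path", "contents");
        CONCURRENT.put("/some/path", "contents");
    }
}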
Just some quick tips.
First of all, to get started on threads, look at the Runnable interface or the Thread class. To make a thread you either implement that interface with a class or extend Thread with another class. You can also make anonymous threads, but I dislike the readability of those unless it's something SUPER simple.
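For example, the two styles side by side (a trivial sketch):

public class HelloThreads {
    // Style 1: implement Runnable and hand it to a Thread.
    static class Greeter implements Runnable {
        public void run() {
            System.out.println("hello from " + Thread.currentThread().getName());
        }
    }

    // Style 2: extend Thread and override run().
    static class GreeterThread extends Thread {
        public void run() {
            System.out.println("hello from " + getName());
        }
    }

    public static void main(String[] args) {
        new Thread(new Greeter()).start();
        new GreeterThread().start();
    }
}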
Next, some notes on processing text with multiple threads, because it just so happens I have some experience in exactly this! Keep in mind that if the files are large and take a noticeably long time to process, you will want to monitor your CPU. In my case the processing did lots of calculations and lookups, which added hugely to the load, so in the end I found I could only make as many threads as I had processors, because each thread was so labor-intensive. So keep that in mind: you want to monitor the effect each thread has on the processor.
I'm not sure having threads for this would really speed up the process if all the files are on the same physical disk. It could even slow things down because the disk would have to constantly switch from one location to the other.
Related
I have a file A.txt of 100,000,000 records, numbered 1 to 100000000, one record per line. I have to read file A and write to files B and C: even-numbered lines go to file B and odd-numbered lines go to file C.
The required read-and-write time must be less than 40 seconds.
Below is the code that I already have, but the runtime takes more than 50 seconds.
Does anyone have another solution to reduce the runtime?
Threading.java
import java.io.*;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class Threading implements Runnable {
    LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();
    String file;
    volatile boolean stop = false;

    public Threading(String file) {
        this.file = file;
    }

    public void addQueue(String row) {
        queue.add(row);
    }

    public void Stop() {
        stop = true;
    }

    public void run() {
        try {
            BufferedWriter bw = new BufferedWriter(new FileWriter(file));
            // keep draining until asked to stop AND the queue is empty,
            // so no queued rows are lost
            while (!stop || !queue.isEmpty()) {
                String row = queue.poll(100, TimeUnit.MILLISECONDS);
                if (row != null) {
                    bw.write(row + "\n");
                }
            }
            bw.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
ThreadCreate.java
// I used 2 threads to write to 2 files B and C
import java.io.*;
import java.util.List;
public class ThreadCreate {
public void startThread(File file) {
Threading t1 = new Threading("B.txt");
Threading t2 = new Threading("C.txt");
Thread td1 = new Thread(t1);
Thread td2 = new Thread(t2);
td1.start();
td2.start();
try {
BufferedReader br = new BufferedReader(new FileReader(file));
String line;
long start = System.currentTimeMillis();
while ((line = br.readLine()) != null) {
if (Integer.parseInt(line) % 2 == 0) {
t1.addQueue(line);
} else {
t2.addQueue(line);
}
}
t1.Stop();
t2.Stop();
br.close();
long end = System.currentTimeMillis();
System.out.println("Time to read file A and write file B, C: " + ((end - start)/1000) + "s");
} catch (Exception e) {
e.printStackTrace();
}
}
}
Main.java
import java.io.*;
public class Main {
public static void main(String[] args) throws IOException {
File file = new File("A.txt");
//Write file B, C
ThreadCreate t = new ThreadCreate();
t.startThread(file);
}
}
Why are you making threads? That just slows things down. Threads are useful if the bottleneck is either the calculation itself or the blocking nature of the operation; they only hurt if it is not. Here, it isn't: the CPU is just idling (the bottleneck will be the disk), and the nature of what it is blocking on means that multithreading does not help either. Telling a single SSD to write two boatloads of bytes in parallel is probably no faster (only slower, as it needs to bounce back and forth). If the target disk is a spinning disk, it is way slower: the write head cannot make clones of itself to go any faster, and by making the job multithreaded you waste a ton of time asking the write head to bounce back and forth between the different write locations.
There's nothing that immediately strikes me as ripe for significant speedups.
Sometimes, writing a ton of data to a disk just takes 50 seconds. If that's not acceptable, buy a faster disk.
Try memory-mapped files:
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class MappedWrite {
    public static void main(String[] args) throws Exception {
        byte[] buffer = "foo bar foo bar text\n".getBytes();
        int numberOfLines = 100000000;
        FileChannel file = new RandomAccessFile("writeFile.txt", "rw").getChannel();
        // note: a single mapping is limited to Integer.MAX_VALUE bytes
        ByteBuffer wrBuf = file.map(FileChannel.MapMode.READ_WRITE,
                0, (long) buffer.length * numberOfLines);
        for (int i = 0; i < numberOfLines; i++) {
            wrBuf.put(buffer);
        }
        file.close();
    }
}
On my computer (Dell, i7 processor, SSD, 32 GB RAM) it took a little over half a minute to run this code.
I am trying to read a log file and parse it, which consumes only CPU. I have a server that can read a huge text file at 230 MB/second, just reading the text file, not parsing. When I try to parse the text file with a single thread, I can parse around 50-70 MB/second.
I want to increase my throughput by doing the job concurrently. In this code, I reached 130 MB/second; at the peak, I saw 190 MB/second. I tried BlockingQueue, Semaphore, ExecutorService, etc. Is there any advice you can give me to reach 200 MB/second throughput?
public static void fileReaderTestUsingSemaphore(String[] args) throws Exception {
CustomFileReader reader = new CustomFileReader(args[0]);
final int concurrency = Integer.parseInt(args[1]);
ExecutorService executorService = Executors.newFixedThreadPool(concurrency);
Semaphore semaphore = new Semaphore(concurrency,true);
System.out.println("Conccurrency in Semaphore: " + concurrency);
String line;
while ((line = reader.getLine()) != null)
{
semaphore.acquire();
try
{
final String p = line;
executorService.execute(new Runnable() {
@Override
public void run() {
reader.splitNginxLinewithIntern(p); // that is the method which parser string and convert to class.
semaphore.release();
}
});
} catch (Exception ex) {
// if the task was rejected and never ran, release the permit here;
// otherwise the worker thread releases it, so don't release twice
semaphore.release();
ex.printStackTrace();
}
}
executorService.shutdown();
executorService.awaitTermination(Long.MAX_VALUE, TimeUnit.MINUTES);
System.out.println("ReadByteCount: " + reader.getReadByteCount());
}
You might benefit from the Files.lines() method and the Stream paradigm introduced in Java 8. It will use the system's common fork/join pool. Try this pattern:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
public class LineCounter
{
public static void main(String[] args) throws IOException
{
Files.lines(Paths.get("/your/file/here"))
.parallel()
.forEach(LineCounter::processLine);
}
private static void processLine(String line) {
// do the processing
}
}
Assuming that you don't care about order of lines:
final String MARKER = new String(""); // deliberately a distinct instance, compared by identity below
BlockingQueue<String> q = new LinkedBlockingDeque<>(1024);
for (int i = 0; i < concurrency; i++)
executorService.execute(() -> {
for (;;) {
try {
String s = q.take();
if(s == MARKER) {
q.put(s);
return;
}
reader.splitNginxLinewithIntern(s);
} catch (InterruptedException e) {
return;
}
}
});
String line;
while ((line = reader.getLine()) != null) {
q.put(line);
}
q.put(MARKER);
executorService.shutdown(); // without this, awaitTermination would just wait out the timeout
executorService.awaitTermination(10, TimeUnit.MINUTES);
This starts a number of threads that each runs a specific task; that task is to read from the queue and run the split method. The reader just feeds the queue, notifies when it's complete and waits for termination.
If you were to use RxJava2 and rxjava2-extras that would simply be
Strings.from(reader)
.flatMap(str -> Flowable
.just(str)
.observeOn(Schedulers.computation())
.doOnNext(reader::splitNginxLinewithIntern)
)
.blockingSubscribe();
You need to go multi-thread, and you need to have the reader thread delegate the parsing to worker threads, that's clear. The point is how to do this delegating with as little overhead as possible.
@Tassos provided code that looks like a solid improvement.
One more thing you can try is to change the delegation granularity: instead of delegating every single line individually, build chunks of e.g. 100 lines, reducing the delegation/synchronization overhead by a factor of 100 (you then need a String[] array or similar per chunk, which shouldn't hurt too much). A sketch follows.
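A rough sketch of that chunking idea, reusing the CustomFileReader and thread pool from the question (the chunk size of 100 is only a starting point to tune):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;

public class ChunkedParsing {
    static void parseInChunks(final CustomFileReader reader,
                              ExecutorService pool) throws Exception {
        final int CHUNK = 100;
        List<String> chunk = new ArrayList<>(CHUNK);
        String line;
        while ((line = reader.getLine()) != null) {
            chunk.add(line);
            if (chunk.size() == CHUNK) {
                final List<String> batch = chunk; // hand off one batch per task
                pool.execute(() -> {
                    for (String s : batch) {
                        reader.splitNginxLinewithIntern(s);
                    }
                });
                chunk = new ArrayList<>(CHUNK);
            }
        }
        // flush the incomplete tail batch
        final List<String> tail = chunk;
        pool.execute(() -> {
            for (String s : tail) {
                reader.splitNginxLinewithIntern(s);
            }
        });
    }
}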
I am trying to read a file and add each line to a list.
Main class -
public class SimpleTreadPoolMain {
public static void main(String[] args) {
ReadFile reader = new ReadFile();
File file = new File("C:\\myFile.csv");
try {
reader.readFile(file);
} catch (IOException e) {
e.printStackTrace();
}
}
}
Reader class -
public class ReadFile {
ExecutorService executor = Executors.newFixedThreadPool(5);//creating a pool of 5 threads
List<String> list = new ArrayList<>();
void readFile(File file) throws IOException {
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
String line;
while ((line = br.readLine()) != null) {
Runnable saver = new SaveToList(line,list);
executor.execute(saver);//calling execute method of ExecutorService
}
}
executor.shutdown();
while (!executor.isTerminated()) { }
}
}
Saver class -
public class SaveToList<E> implements Runnable{
List<E> myList;
E line;
public SaveToList(E line, List<E> list) {
this.line = line;
this.myList = list;
}
public void run() {
//modify the line
myList.add(line);
}
}
I tried to have many saver threads adding to the same list, instead of one saver adding to the list one by one. I want to use threads because I need to modify the data before adding it to the list, and I assume that modification takes some time, so parallelizing it should reduce the overall time, right?
But this doesn't work. I am unable to get back a single global list that includes all the values from the file. I want only one global list of values from the file, so the code clearly has to change. If anyone can guide me, it would be greatly appreciated.
Even though adding one by one in a single thread would work, wouldn't using a thread pool make it faster?
Using multiple threads won't speed anything up here.
You are:
Reading a line from a file, serially.
Creating a runnable and submitting it into a thread pool
The runnable then adds things into a list
Given that you're using an ArrayList, you need to synchronize access to it, because you're mutating it from multiple threads. So, you are adding things into the list serially.
But even without the synchronization, the time taken for the IO will far exceed the time taken to add the string into the list. And adding in multithreading is just going to slow it down more, because it's doing work to construct the runnable, submit it to the thread pool, schedule it, etc.
It's simpler just to miss out the whole middle step:
Read a line from a file, serially.
Add the line to the list, serially.
So:
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
String line;
while ((line = br.readLine()) != null) {
list.add(line);
}
}
You should in fact test whether it's worth using multithreading in your application: compare the time it takes to read the whole file without doing any processing on the rows with the time it takes to process the whole file serially.
If your processing is not too complex, my guess is that multithreading is not worth it.
If you find that processing takes much longer, you can think about using one or more threads for the computations.
If so, you could use Futures to process batches of input strings, or you could use a thread-safe Queue to send strings to another thread:
import java.io.*;
import java.util.*;
import java.util.concurrent.*;

public class BatchProcessor { // illustrative wrapper class so the fragments compile

private static final int BATCH_SIZE = 1000;
public static void main(String[] args) throws IOException {
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream("big_file.csv"), "utf-8"));
ExecutorService pool = Executors.newFixedThreadPool(8);
String line;
List<String> batch = new ArrayList<>(BATCH_SIZE);
List<Future> results = new LinkedList<>();
while((line=reader.readLine())!=null){
batch.add(line);
if(batch.size()>=BATCH_SIZE){
Future<Object> f = noWaitExec(batch, pool);
results.add(f);
batch = new ArrayList<>(BATCH_SIZE);
}
}
Future<List> f = noWaitExec(batch,pool);
results.add(f);
for (Future future : results) {
try {
Object object = future.get();
// Use your results here
} catch (Exception e) {
// Manage this....
}
}
}
private static Future<List> noWaitExec(final List<String> batch, ExecutorService pool) {
return pool.submit(new Callable<List>() {
public List call() throws Exception {
List result = new ArrayList<>(batch.size());
for (String string : batch) {
result.add(process(string));
}
return result;
}
});
}
private static Object process(String string) {
// your per-line processing goes here...
return null;
}
}
There are many other possible solutions (Observables, parallel Streams, Pipes, CompletableFutures... you name it), but I still think that most of the time is spent reading the file; just using a buffered reader with a big enough buffer could cut your times more than parallel computing would.
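For instance, a sketch of the big-buffer idea for line-oriented input (the 1 MiB size is a guess to benchmark, not a recommendation):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class BigBufferRead {
    public static void main(String[] args) throws IOException {
        // 1 MiB buffer instead of the 8 KiB default
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream("big_file.csv"),
                        StandardCharsets.UTF_8),
                1 << 20)) {
            String line;
            while ((line = br.readLine()) != null) {
                // process(line) goes here
            }
        }
    }
}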
I have the following code in my application which does two things:
Parse the file, which has 'n' data records.
For each record in the file, make two web service calls.
public static List<String> parseFile(String fileName) {
List<String> idList = new ArrayList<String>();
try {
BufferedReader cfgFile = new BufferedReader(new FileReader(new File(fileName)));
String line = null;
cfgFile.readLine(); // skip the header line
while ((line = cfgFile.readLine()) != null) {
if (!line.trim().equals("")) {
String [] fields = line.split("\\|");
idList.add(fields[0]);
}
}
cfgFile.close();
} catch (IOException e) {
System.out.println(e+" Unexpected File IO Error.");
}
return idList;
}
When I try to parse a file having 1 million lines of records, the Java process fails after processing a certain amount of data with java.lang.OutOfMemoryError: Java heap space. I can partly figure out that the Java process stops because of the sheer amount of data. Kindly suggest how to proceed with this huge data.
EDIT: Will this part of the code, new BufferedReader(new FileReader(new File(fileName))), read the whole file into memory, so that its footprint grows with the size of the file?
The problem you have is that you are accumulating all the data in the list. The best way to approach this is to do it in a streaming fashion: do not accumulate all the ids in the list, but call your web service for each row, or accumulate a smaller buffer and then make the call.
Opening the file and creating the BufferedReader will have no impact on memory consumption, as the bytes from the file will be read (more or less) line by line. The problem is this point in the code: idList.add(fields[0]). The list will grow as large as the file, because you keep accumulating all of the file's data in it.
Your code should do something like this:
while ((line = cfgFile.readLine()) != null) {
if (!line.trim().equals("")) {
String [] fields = line.split("\\|");
callToRemoteWebService(fields[0]);
}
}
Increase your Java heap size using the -Xms and -Xmx options. If not set explicitly, the JVM sets the heap size to the ergonomic defaults, which in your case is not enough. Read this paper to find out more about tuning memory in the JVM: http://www.oracle.com/technetwork/java/javase/tech/memorymanagement-whitepaper-1-150020.pdf
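For example (the sizes and the class name are placeholders; pick heap values based on your data volume):

java -Xms512m -Xmx4g com.example.FileParser

This starts the JVM with an initial heap of 512 MB that may grow to 4 GB.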
EDIT: An alternative is to do this in a producer-consumer style to exploit parallel processing. The general idea is to create a producer thread that reads the file and queues tasks for processing, and n consumer threads that consume them. A very general idea (for illustrative purposes) is the following:
// blocking queue holding the tasks to be executed
final SynchronousQueue<Callable<Void>> queue = new SynchronousQueue<Callable<Void>>();

// reads the file and submits tasks for processing
final Runnable producer = new Runnable() {
    public void run() {
        BufferedReader in = null;
        try {
            in = new BufferedReader(new FileReader(new File(fileName)));
            String line = null;
            while ((line = in.readLine()) != null) {
                if (!line.trim().equals("")) {
                    final String[] fields = line.split("\\|");
                    // this will block if no consumer threads are available to process it...
                    queue.put(new Callable<Void>() {
                        public Void call() {
                            process(fields);
                            return null;
                        }
                    });
                }
            }
        } catch (IOException e) {
            // report the read failure here...
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            // close the buffered reader here...
        }
    }
};

// Consumes the tasks submitted by the producer. Consumers can be pooled
// for parallel processing.
final Runnable consumer = new Runnable() {
    public void run() {
        try {
            while (true) {
                // this method blocks if there are no items left in the queue...
                Callable<Void> task = queue.take();
                task.call();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (Exception e) {
            // a task failed; report it here...
        }
    }
};
Of course you have to write code that manages the lifecycle of the consumer and producer threads. The right way to do this would be by implementing it using an Executor.
When you want to work with big data, you have two choices:
Use a big enough heap to fit all the data. This will "work" for a while, but if your data size is unbounded, it will eventually fail.
Work with the data incrementally, keeping only a bounded part of it in memory at any one time. This is the ideal solution, as it scales to any amount of data; a small sketch follows.
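For example, a minimal sketch of the incremental style (the file name and the handle method are placeholders for your per-record work, e.g. the web service calls):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class IncrementalProcessing {
    public static void main(String[] args) throws IOException {
        // Only one line is held in memory at a time, so heap use stays
        // bounded no matter how large the input grows.
        try (BufferedReader in = new BufferedReader(new FileReader("huge.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                handle(line);
            }
        }
    }

    private static void handle(String line) {
        // per-record processing goes here
    }
}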
Someone else's process is creating a CSV file by appending a line at a time to it, as events occur. I have no control over the file format or the other process, but I know it will only append.
In a Java program, I would like to monitor this file, and when a line is appended read the new line and react according to the contents. Ignore the CSV parsing issue for now. What is the best way to monitor the file for changes and read a line at a time?
Ideally this will use the standard library classes. The file may well be on a network drive, so I'd like something robust to failure. I'd rather not use polling if possible - I'd prefer some sort of blocking solution instead.
Edit -- given that a blocking solution is not possible with standard classes (thanks for that answer), what is the most robust polling solution? I'd rather not re-read the whole file each time as it could grow quite large.
Since Java 7 there has been the newWatchService() method on the FileSystem class.
However, there are some caveats:
It is only Java 7
It is an optional method
It only watches directories, so you have to do the file handling yourself, and worry about the file moving, etc.
Before Java 7 it is not possible with standard APIs.
I tried the following (polling on a 1-second interval) and it works (the print stands in for real processing):
private static void monitorFile(File file) throws IOException {
    final int POLL_INTERVAL = 1000;
    BufferedReader buffered = new BufferedReader(new FileReader(file));
    try {
        while (true) {
            String line = buffered.readLine();
            if (line == null) {
                // end of file, start polling
                Thread.sleep(POLL_INTERVAL);
            } else {
                System.out.println(line);
            }
        }
    } catch (InterruptedException ex) {
        ex.printStackTrace();
    } finally {
        buffered.close(); // also closes the underlying FileReader
    }
}
As no one else has suggested a solution that uses a current production Java, I thought I'd add it. If there are flaws, please point them out in comments.
You can register to get notified by the file system when any change happens to the file, using the WatchService class. This requires Java 7; here is the link to the documentation: http://docs.oracle.com/javase/tutorial/essential/io/notification.html
Here is the snippet of code to do that:
private final WatchService watcher;
private final Path dir; // kept so processEvents can resolve child paths

public FileWatcher(Path dir) throws IOException {
    this.watcher = FileSystems.getDefault().newWatchService();
    this.dir = dir;
    dir.register(watcher, ENTRY_MODIFY);
}
void processEvents() {
for (;;) {
// wait for key to be signalled
WatchKey key;
try {
key = watcher.take();
} catch (InterruptedException x) {
return;
}
for (WatchEvent<?> event : key.pollEvents()) {
WatchEvent.Kind<?> kind = event.kind();
if (kind == OVERFLOW) {
continue;
}
// Context for directory entry event is the file name of entry
WatchEvent<Path> ev = cast(event);
Path name = ev.context();
Path child = dir.resolve(name);
// print out event
System.out.format("%s: %s file \n", event.kind().name(), child);
}
// reset key; stop watching if the directory is no longer accessible
boolean valid = key.reset();
if (!valid) {
break;
}
}
}
This is not possible with standard library classes. See this question for details.
For efficient polling, it is better to use random access via RandomAccessFile. It helps to remember the position of the last end of file and start reading from there; a sketch follows.
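For example, a minimal sketch of that approach (the 1-second interval is arbitrary; note that RandomAccessFile.readLine decodes bytes as Latin-1, and a line the writer has only half-appended will come back truncated):

import java.io.IOException;
import java.io.RandomAccessFile;

public class TailPoller {
    // Remembers the last end-of-file offset so each poll reads only
    // the newly appended bytes.
    public static void tail(String path) throws IOException, InterruptedException {
        long lastPos = 0;
        while (true) {
            RandomAccessFile raf = new RandomAccessFile(path, "r");
            try {
                if (raf.length() > lastPos) {
                    raf.seek(lastPos);
                    String line;
                    while ((line = raf.readLine()) != null) {
                        System.out.println(line);
                    }
                    lastPos = raf.getFilePointer();
                }
            } finally {
                raf.close();
            }
            Thread.sleep(1000);
        }
    }
}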
Use Java 7's WatchService, part of NIO.2
The WatchService API is designed for applications that need to be notified about file change events.
Just to expand on Nick Fortescue's last entry, below are two classes that you can run concurrently (e.g. in two different shell windows), which show that a given file can simultaneously be written by one process and read by another.
Here, the two processes execute these Java classes, but I presume the writing process could be any other application. (Assuming that it does not hold an exclusive lock on the file; are there such file-system locks on certain operating systems?)
I have successfully tested these two classes on both Windows and Linux. I would very much like to know if there is some condition (e.g. operating system) under which they fail.
Class #1:
import java.io.File;
import java.io.FileWriter;
import java.io.PrintWriter;
public class FileAppender {
public static void main(String[] args) throws Exception {
if ((args != null) && (args.length != 0)) throw
new IllegalArgumentException("args is not null and is not empty");
File file = new File("./file.txt");
int numLines = 1000;
writeLines(file, numLines);
}
private static void writeLines(File file, int numLines) throws Exception {
PrintWriter pw = null;
try {
pw = new PrintWriter( new FileWriter(file), true );
for (int i = 0; i < numLines; i++) {
System.out.println("writing line number " + i);
pw.println("line number " + i);
Thread.sleep(100);
}
}
finally {
if (pw != null) pw.close();
}
}
}
Class #2:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
public class FileMonitor {
public static void main(String[] args) throws Exception {
if ((args != null) && (args.length != 0)) throw
new IllegalArgumentException("args is not null and is not empty");
File file = new File("./file.txt");
readLines(file);
}
private static void readLines(File file) throws Exception {
BufferedReader br = null;
try {
br = new BufferedReader( new FileReader(file) );
while (true) {
String line = br.readLine();
if (line == null) { // end of file, start polling
System.out.println("no file data available; sleeping..");
Thread.sleep(2 * 1000);
}
else {
System.out.println(line);
}
}
}
finally {
if (br != null) br.close();
}
}
}
Unfortunately, the TailInputStream class, which can be used to monitor the end of a file, is not one of the standard Java platform classes, but there are a few implementations on the web. You can find an implementation of TailInputStream together with a usage example at http://www.greentelligent.com/java/tailinputstream.
Poll, either on a consistent cycle or on a random cycle; 200-2000ms should be a good random poll interval span.
Check two things...
If you have to watch for file growth, check the EOF / byte count, and compare that and the file-access or file-write times with the last poll. If they have increased, the file has been written to.
Then combine that with checking for an exclusive lock / read access: if the file can be read-locked and it has grown, then whatever was writing to it has finished.
Checking either property alone won't necessarily give you a guaranteed "written and actually done, available for use" state; a rough sketch follows.
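A rough sketch of that combined check (a heuristic only: file-lock behavior varies by operating system and file system, especially on network drives):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;

public class WriteCompleteCheck {
    // True when the file has grown since the last poll AND we could
    // briefly take an exclusive lock on it, i.e. the writer is (probably) done.
    static boolean grownAndUnlocked(String path, long lastSize) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(path, "rw");
        try {
            if (raf.length() <= lastSize) {
                return false; // no growth since the last poll
            }
            FileLock lock = raf.getChannel().tryLock();
            if (lock == null) {
                return false; // another process still holds a lock
            }
            lock.release();
            return true;
        } finally {
            raf.close();
        }
    }
}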