I need to process a large file (formatted as lines with the same set of columns). Since I need to handle the case where the program crashes during processing, the processing must be retryable: after a crash, restarting the program should let it continue from the line where it failed.
Is there any pattern I can follow or library I can use? Thank you!
Update:
About the crash cases: it is not just OOM or some internal issue. A crash could also be caused by a timeout against other components, or by the machine itself going down. So try/catch can't handle this.
Another update:
About chunking the file: it is feasible in my case, but not as simple as it sounds. As I said, the file is formatted with several columns, and I could split it up into hundreds of files based on one of the columns and then process the files one by one. But instead of doing this, I would like to learn more about common solutions for processing a big file/dataset with support for retrying.
How I would do it (though I am not a pro):
Create a LineProcessor (Guava) that is called on every line in the file:
class Processor implements LineProcessor<List<String>> {
    private List<String> lines = Lists.newLinkedList();
    private int startFrom = 0;
    private int lineNumber = 0;

    public Processor(int startFrom) {
        this.startFrom = startFrom;
    }

    @Override
    public List<String> getResult() {
        return lines;
    }

    @Override
    public boolean processLine(String arg0) throws IOException {
        lineNumber++;
        if (lineNumber < startFrom) {
            // already processed in a previous attempt; skip
        } else {
            if (new Random().nextInt() % 50000 == 0) {
                throw new IOException("Randomly thrown Exception " + lineNumber);
            }
            // Do the hard work here
            lines.add(arg0);
            startFrom++;
        }
        return true;
    }
}
Create a Callable for reading the file that makes use of my LineProcessor:
class Reader implements Callable<List<String>> {
    private int startFrom;

    public Reader(int startFrom) {
        this.startFrom = startFrom;
    }

    @Override
    public List<String> call() throws Exception {
        return Files.readLines(new File("/etc/dictionaries-common/words"),
                Charsets.UTF_8, new Processor(startFrom));
    }
}
Wrap the Callable in a Retryer and call it using an Executor
public static void main(String[] args) throws InterruptedException, ExecutionException {
    BasicConfigurator.configure();
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<List<String>> lines = executor.submit(RetryerBuilder
            .<List<String>> newBuilder()
            .retryIfExceptionOfType(IOException.class)
            .withStopStrategy(StopStrategies.stopAfterAttempt(100)).build()
            .wrap(new Reader(100)));
    logger.debug(lines.get().size());
    executor.shutdown();
    logger.debug("Happily Ever After");
}
You could maintain checkpoint/commit style logic in your code, so that when the program runs again it starts from the last checkpoint.
You can use RandomAccessFile to read the file and use getFilePointer() as your checkpoint, which you persist. When you execute the program again, you start from this checkpoint by calling seek(offset).
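For illustration, a minimal sketch of that idea (the file names and the processLine method are placeholders; in practice you would probably commit every N lines rather than after each one):

import java.io.*;

public class ResumableProcessor {

    public static void main(String[] args) throws IOException {
        File checkpoint = new File("progress.offset"); // hypothetical checkpoint file
        long offset = 0L;
        if (checkpoint.exists()) {
            try (BufferedReader r = new BufferedReader(new FileReader(checkpoint))) {
                offset = Long.parseLong(r.readLine().trim());
            }
        }
        try (RandomAccessFile raf = new RandomAccessFile("big-input.txt", "r")) {
            raf.seek(offset); // resume where the last run left off
            String line;
            while ((line = raf.readLine()) != null) {
                processLine(line); // the per-line work goes here
                try (Writer w = new FileWriter(checkpoint)) {
                    w.write(Long.toString(raf.getFilePointer())); // commit progress
                }
            }
        }
        checkpoint.delete(); // clean up once the whole file is processed
    }

    private static void processLine(String line) { /* ... */ }
}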
Try/catch won't save you from an OOM error. You should process the file in chunks and, after every successful chunk, store your position in the filesystem/database/whatever place remains persistent even if your program crashes. Then, when you restart your software, you read the previous position from the place where you stored it. You must also clean up this information once the whole file is processed.
I am reading Network Programming in Java by Elliotte, and in the chapter on threads he gave this piece of code as an example of a computation that can be run in a different thread:
import java.io.*;
import java.security.*;

public class ReturnDigest extends Thread {
    private String filename;
    private byte[] digest;

    public ReturnDigest(String filename) {
        this.filename = filename;
    }

    @Override
    public void run() {
        try {
            FileInputStream in = new FileInputStream(filename);
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            DigestInputStream din = new DigestInputStream(in, sha);
            while (din.read() != -1) ; // read entire file
            din.close();
            digest = sha.digest();
        } catch (IOException ex) {
            System.err.println(ex);
        } catch (NoSuchAlgorithmException ex) {
            System.err.println(ex);
        }
    }

    public byte[] getDigest() {
        return digest;
    }
}
To use this thread, he gave an approach which he referred to as the solution novices might use.
The solution most novices adopt is to make the getter method return a flag value (or perhaps throw an exception) until the result field is set.
And the solution he is referring to is:
public static void main(String[] args) {
    ReturnDigest[] digests = new ReturnDigest[args.length];
    for (int i = 0; i < args.length; i++) {
        // Calculate the digest
        digests[i] = new ReturnDigest(args[i]);
        digests[i].start();
    }
    for (int i = 0; i < args.length; i++) {
        while (true) {
            // Now print the result
            byte[] digest = digests[i].getDigest();
            if (digest != null) {
                StringBuilder result = new StringBuilder(args[i]);
                result.append(": ");
                result.append(DatatypeConverter.printHexBinary(digest));
                System.out.println(result);
                break;
            }
        }
    }
}
He then went on to propose a better approach using callbacks, which he described as:
In fact, there’s a much simpler, more efficient way to handle the problem. The infinite loop that repeatedly polls each ReturnDigest object to see whether it’s finished can be eliminated. The trick is that rather than having the main program repeatedly ask each ReturnDigest thread whether it’s finished (like a five-year-old repeatedly asking, “Are we there yet?” on a long car trip, and almost as annoying), you let the thread tell the main program when it’s finished. It does this by invoking a method in the main class that started it. This is called a callback because the thread calls its creator back when it’s done.
And the code for the callback approach he gave is below:
import java.io.*;
import java.security.*;

public class CallbackDigest implements Runnable {
    private String filename;

    public CallbackDigest(String filename) {
        this.filename = filename;
    }

    @Override
    public void run() {
        try {
            FileInputStream in = new FileInputStream(filename);
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            DigestInputStream din = new DigestInputStream(in, sha);
            while (din.read() != -1) ; // read entire file
            din.close();
            byte[] digest = sha.digest();
            CallbackDigestUserInterface.receiveDigest(digest, filename); // this is the callback
        } catch (IOException ex) {
            System.err.println(ex);
        } catch (NoSuchAlgorithmException ex) {
            System.err.println(ex);
        }
    }
}
And the implementation of CallbackDigestUserInterface and its usage was given as:
public class CallbackDigestUserInterface {
    public static void receiveDigest(byte[] digest, String name) {
        StringBuilder result = new StringBuilder(name);
        result.append(": ");
        result.append(DatatypeConverter.printHexBinary(digest));
        System.out.println(result);
    }

    public static void main(String[] args) {
        for (String filename : args) {
            // Calculate the digest
            CallbackDigest cb = new CallbackDigest(filename);
            Thread t = new Thread(cb);
            t.start();
        }
    }
}
But my question (or clarification) is regarding what he said about this method. He mentioned:
The trick is that rather than having the main program repeatedly ask each ReturnDigest thread whether it’s finished, you let the thread tell the main program when it’s finished.
Looking at the code, the thread that was created to run a separate computation is actually the one that continues executing the original program. It is not as if it passed the result back to the main thread. It seems it becomes the MAIN thread!
So it is not as if the main thread gets notified when the task is done (instead of the main thread polling). It is that the main thread does not care about the result: it runs to its end and finishes, and the new thread just runs another computation when it is done.
Do I understand this correctly?
How does this play with debugging? Does the thread now become the main thread, and would the debugger now treat it as such?
Is there another means to actually pass the result back to the main thread?
I would appreciate any help, that helps in understanding this better :)
It is a common misunderstanding to think that the "main" thread, the one that public static void main is run on, should be considered the main thread for the application. If you write a GUI app, for instance, the starting thread will likely finish and die well before the program ends.
Also, callbacks are normally called by the thread that they are handed off to. This is true in Swing, and in many other places (including DataFetcher, for example).
None of the other threads becomes the "main thread". Your main thread is the thread that starts with the main() method. Its job is to start the other threads... then it dies.
At this point, you never return to the main thread, but the child threads have callbacks... and that means that when they are done, they know where to redirect the flow of the program.
That is your receiveDigest() method. Its job is to display the results of the child threads once they complete. Is this method being called from the main thread, or the child threads? What do you think?
It is possible to pass the result back to the main thread. To do this, you need to keep the main thread from terminating, so it will need a loop to keep it going indefinitely, and to keep that loop from eating up processor time, it should be put to sleep while the other threads work.
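As an illustration (my own sketch, not from the book), Thread.join() is one way to actually hand the result back: the main thread sleeps inside join() until each worker finishes, with no busy loop. This reuses the ReturnDigest class from the question:

import javax.xml.bind.DatatypeConverter;

public class JoinDigestExample {
    public static void main(String[] args) throws InterruptedException {
        ReturnDigest[] digests = new ReturnDigest[args.length];
        for (int i = 0; i < args.length; i++) {
            digests[i] = new ReturnDigest(args[i]);
            digests[i].start();
        }
        for (int i = 0; i < args.length; i++) {
            digests[i].join(); // main thread sleeps here until this worker is done
            System.out.println(args[i] + ": "
                    + DatatypeConverter.printHexBinary(digests[i].getDigest()));
        }
    }
}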
You can read an example of fork and join architecture here:
https://www.tutorialspoint.com/java_concurrency/concurrency_fork_join.htm
The book is misleading you.
First of all, there is no callback in the example; there is only one function calling another function by name. A true callback is a means of communication between different software modules: a pointer or reference to a function (or to an object with methods) that module A provides to module B, so that module B can call it when something interesting happens. It has nothing at all to do with threads.
Second of all, the alleged callback communicates nothing between threads. The function call happens entirely in the new thread, after the main() thread has already died.
I have two functions which must run in a critical section:
public synchronized void f1() { ... }
public synchronized void f2() { ... }
Assume that the behavior is as following:
f1 is almost never called. Actually, under normal conditions, this method is never called. If f1 is called anyway, it should return quickly.
f2 is called at a very high rate. It returns very quickly.
These methods never call each other and there is no reentrancy as well.
In other words, there is very low contention. So when f2 is called, we have some overhead to acquire the lock, which is granted immediately in 99.9% of cases. I am wondering if there are approaches to avoid this overhead.
I came up with the following alternative:
private final AtomicInteger lock = new AtomicInteger(0);

public void f1() {
    while (!lock.compareAndSet(0, 1)) {}
    try {
        ...
    } finally {
        lock.set(0);
    }
}

public void f2() {
    while (!lock.compareAndSet(0, 2)) {}
    try {
        ...
    } finally {
        lock.set(0);
    }
}
Are there other approaches? Does the java.util.concurrent package offer something natively?
update
Although my intention is to have a generic question, some information regarding my situation:
f1: This method creates a new remote stream, if for some reason the current one becomes corrupt, for example due to a timeout. A remote stream could be considered as a socket connection which consumes a remote queue starting from a given location:
private Stream stream;

public synchronized Stream f1() {
    final Stream stream = new Stream(...);
    if (this.stream != null) {
        stream.setPosition(this.stream.getPosition());
    }
    this.stream = stream;
    return stream;
}
f2: This method advances the stream position. It is a plain setter:
public synchronized void f2(Long p) {
    stream.setPosition(p);
}
Here, stream.setPosition(Long) is implemented as a plain setter as well:
public class Stream {
    private volatile Long position = 0L;

    public void setPosition(Long position) {
        this.position = position;
    }
}
In Stream, the current position is periodically sent to the server, asynchronously. Note that Stream is not implemented by me.
My idea was to introduce compare-and-swap as illustrated above, and mark stream as volatile.
Your example isn't doing what you want it to. You are actually executing your code while the lock is being used. Try something like this:
public void f1() {
    while (!lock.compareAndSet(0, 1)) {
    }
    try {
        ...
    } finally {
        lock.set(0);
    }
}
To answer your question, I don't believe that this will be any faster than using synchronized methods, and this method is harder to read and comprehend.
From the description and your example code, I've inferred the following:
Stream has its own internal position, and you're also tracking the most recent position externally. You use this as a sort of 'resume point': when you need to reinitialize the stream, you advance it to this point.
The last known position may be stale; I'm assuming this based on your assertion that the stream periodically and asynchronously notifies the server of its current position.
At the time f1 is called, the stream is known to be in a bad state.
The functions f1 and f2 access the same data, and may run concurrently. However, neither f1 nor f2 will ever run concurrently against itself. In other words, you almost have a single-threaded program, except for the rare cases when both f1 and f2 are executing.
[Side note: My solution doesn't actually care if f1 gets called concurrently with itself; it only cares that f2 is not called concurrently with itself]
If any of this is wrong, then the solution below is wrong. Heck, it might be wrong anyway, either because of some detail left out, or because I made a mistake. Writing low-lock code is hard, which is exactly why you should avoid it unless you've observed an actual performance issue.
static class Stream {
    private long position = 0L;

    void setPosition(long position) {
        this.position = position;
    }
}

final static class StreamInfo {
    final Stream stream = new Stream();
    volatile long resumePosition = -1;

    final void setPosition(final long position) {
        stream.setPosition(position);
        resumePosition = position;
    }
}

private final Object updateLock = new Object();
private final AtomicReference<StreamInfo> currentInfo = new AtomicReference<>(new StreamInfo());

void f1() {
    synchronized (updateLock) {
        final StreamInfo oldInfo = currentInfo.getAndSet(null);
        final StreamInfo newInfo = new StreamInfo();
        if (oldInfo != null && oldInfo.resumePosition > 0L) {
            newInfo.setPosition(oldInfo.resumePosition);
        }
        // Only `f2` can modify `currentInfo`, so update it last.
        currentInfo.set(newInfo);
        // The `f2` thread might be waiting for us, so wake them up.
        updateLock.notifyAll();
    }
}

void f2(final long newPosition) {
    while (true) {
        final StreamInfo s = acquireStream();
        s.setPosition(newPosition);
        s.resumePosition = newPosition;
        // Make sure the stream wasn't replaced while we worked.
        // If it was, run again with the new stream.
        if (acquireStream() == s) {
            break;
        }
    }
}

private StreamInfo acquireStream() {
    // Optimistic concurrency: hope we get a stream that's ready to go.
    // If we fail, branch off into a slower code path that waits for it.
    final StreamInfo s = currentInfo.get();
    return s != null ? s : acquireStreamSlow();
}

private StreamInfo acquireStreamSlow() {
    synchronized (updateLock) {
        while (true) {
            final StreamInfo s = currentInfo.get();
            if (s != null) {
                return s;
            }
            try {
                updateLock.wait();
            } catch (final InterruptedException ignored) {
            }
        }
    }
}
If the stream has faulted and is being replaced by f1, it is possible that an earlier call to f2 is still performing some operations on the (now defunct) stream. I'm assuming this is okay, and that it won't introduce undesirable side effects (beyond those already present in your lock-based version). I make this assumption because we've already established in the list above that your resume point may be stale, and we also established that f1 is only called once the stream is known to be in a bad state.
Based on my JMH benchmarks, this approach is around 3x faster than the CAS or synchronized versions (which are pretty close themselves).
Another approach is to use a timestamp lock which works like a modification count. This works well if you have a high read to write ratio.
Another approach is to have an immutable object which stores state via an AtomicReference. This works well if you have a very high read to write ratio.
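For reference, the "timestamp lock" idea is essentially what java.util.concurrent.locks.StampedLock (Java 8) provides with its optimistic reads; below is a sketch of the pattern with illustrative names, not code from the question:

import java.util.concurrent.locks.StampedLock;

class PositionHolder {
    private final StampedLock sl = new StampedLock();
    private long position;

    long readPosition() {
        long stamp = sl.tryOptimisticRead(); // no lock acquired in the common case
        long p = position;
        if (!sl.validate(stamp)) {           // a writer got in between; retry pessimistically
            stamp = sl.readLock();
            try {
                p = position;
            } finally {
                sl.unlockRead(stamp);
            }
        }
        return p;
    }

    void writePosition(long p) {
        long stamp = sl.writeLock();
        try {
            position = p;
        } finally {
            sl.unlockWrite(stamp);
        }
    }
}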
What I'm trying to do is just 2 programs. One is a simple Child program that writes out integers every 2 seconds, line by line.
The other is a Parent program that monitors the log file (just a very basic text file). If the log file doesn't get modified within 5 seconds, the Parent should restart the Child program (via a batch file), then continue normally.
My code for the child class is here:
package fileiotestapplication;

import java.io.*;
import java.util.*;

public class WriterClass {

    @SuppressWarnings("oracle.jdeveloper.java.insufficient-catch-block")
    public WriterClass() {
        super();
        int[] content = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,};
        String[] friends = {"bob",};

        File file = new File("/C:/Java_Scratch/someFile.txt");
        // if the file does not exist, then create it
        try {
            if (!file.exists()) {
                file.createNewFile();
            }
            for (int i = 0; i < content.length; i++) {
                PrintStream bw = new PrintStream(new FileOutputStream(file, true));
                System.out.println("testing " + i);
                bw.println(String.valueOf(content[i]));
                bw.close();
                Thread.sleep(2500);
            }
            System.out.println("Done");
        } catch (IOException ioe) {
            // TODO: Add catch code
            ioe.printStackTrace();
        } catch (InterruptedException ioe) {
            // TODO: Add catch code
            ioe.printStackTrace();
        }
        //someIS.println(i);
        System.out.println("This is OK");
    }

    public static void main(String[] args) {
        WriterClass writerClass = new WriterClass();
    }
}
And I linked here my current code for the Parent class.
What I'm now trying to do is add some logic that catches when the child class stops writing output. What I'd like to do is count all the lines in the log file and then compare the counts every 5 seconds. Is this a good way? (The alternative would be to keep checking whether the file got modified at all.)
EDIT: The suggestion below to use waitFor() indeed helps, though I'm still working out the details. It is generally like:
try {
    /* StackOverflow code */
    for ( ; ; ) {
        ProcessBuilder pb = new ProcessBuilder("TheBatchFile.bat");
        pb.directory(new File("C://Java_Scratch_//Autonomic_Using_Batch//"));
        Process p = pb.start();
        p.waitFor();
    }
    /* end - StackOverflow code */
} catch (IOException i) {
    i.printStackTrace();
} catch (InterruptedException i) {
    i.printStackTrace();
}
This will get very slow as the file keeps growing in size. A simpler way would be to simply check the last modification time of the file. Assuming that the reason the child program might stop writing to the file is that the program terminates (rather than e.g. hanging in an infinite loop), it is probably better to directly monitor the child process itself rather than relying on observing the effects of the process. This is particularly convenient if the parent process can be responsible for starting the program in the first place.
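A tiny sketch of the modification-time check (the path and timings are illustrative):

import java.io.File;

public class ModificationWatchdog {
    public static void main(String[] args) throws InterruptedException {
        File log = new File("C:/Java_Scratch/someFile.txt"); // illustrative path
        long previous = log.lastModified();
        while (true) {
            Thread.sleep(5000);                  // poll every five seconds
            long current = log.lastModified();
            if (current == previous) {
                System.out.println("No write in 5s; restart the child here");
            }
            previous = current;
        }
    }
}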
This can be done with the ProcessBuilder and Process classes in Java 8. Copying from the documentation, you can start the process like this (if you only want to monitor whether it's running or not):
ProcessBuilder pb = new ProcessBuilder("TheBatchFile.bat", "Argument1", "Argument2");
pb.directory(new File("/path/to/working/dir"));
Process p = pb.start();
Then, you can simply call p.waitFor(); to wait for the process to terminate. Do this in a loop, and you have your automatic-restarting-of-child behavior.
You can use the directory watch service:
https://docs.oracle.com/javase/tutorial/essential/io/notification.html
You can configure a path or a file and register a watcher.
The watcher gets a notification every time a file is changed. You can store the timestamp of the notification for later use.
For details see my link above.
You may then use a Timer or a Thread to check last modification.
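A minimal sketch of the watch-service idea (directory and file names are illustrative): block on the watcher, and record when the log file was last touched so a timer can compare it against the 5-second limit.

import java.nio.file.*;

public class LogWatcher {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("C:/Java_Scratch");               // directory containing the log
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher, StandardWatchEventKinds.ENTRY_MODIFY);

        while (true) {
            WatchKey key = watcher.take();                     // blocks until an event arrives
            for (WatchEvent<?> event : key.pollEvents()) {
                if ("someFile.txt".equals(event.context().toString())) {
                    long lastModified = System.currentTimeMillis(); // checkpoint for the timer
                    System.out.println("Log touched at " + lastModified);
                }
            }
            if (!key.reset()) break;                           // directory no longer accessible
        }
    }
}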
While your method of creating a text file and using a batch script is feasible, there is a better way to approach it. This is a standard problem to approach with multitasking, and by creating a couple of threads, it is not too difficult at all.
Using threads has several advantages over going externally "around" the system with batch files and multiple programs. For starters, these may include:
Keeping everything together makes the project much tidier, cleaner, and marginally easier to distribute.
It is easier to implement. Sure, threads may seem confusing if you have never used them, but they are the lesser evil, in my opinion, than all the steps involved in going around them. As I hope to show below, implementing this problem with threads is not hard.
Improved performance, as the very expensive operations of file IO and spawning the batch file are avoided. Threads also have improved performance over processes in most cases, because they are easier to spawn, and multithreading sees performance improvements on a wider range of processors than multiprocessing, being less reliant on having several cores.
No sketchy overlap between one program reading the file while the other is writing to it simultaneously. These kinds of situations are best avoided when possible.
Maintains Java's impressive cross-platform abilities, because batch files are not cross-platform. This might not be important to you for this project, but you may come across something in the future with a similar problem where it matters more, and then you will have practice implementing it.
You learn better by using threads the "right way" instead of developing bad habits with a more hacky approach. If this is a learning project, you might as well learn it right.
I went ahead and coded up the approach that I would most likely use to solve the problem. My code has a child thread that counts every two seconds, and a parent thread that monitors the child and restarts it if the child goes five seconds without counting. Let's examine my program to give you a good idea of how it works.
First, here is the class for the parent:
public class Parent {
    private Child child;

    public Parent() {
        child = new Child(this);
        child.start();
    }

    public void report(int count) { // Starts a new watchdog timer
        Watchdog restartTimer = new Watchdog(this, count);
        restartTimer.start();
    }

    public void restartChild(int currentCount) {
        if (currentCount == child.getCount()) { // Check if the count has not changed
            // If it hasn't
            child.kill();
            child.start();
        }
    }

    public static void main(String[] args) {
        // Start up the parent; it spawns the child
        new Parent();
    }
}
The main function in there can be put somewhere else if you want, but to start everything up, just instantiate a Parent. The parent class has an instance of the child class, and it starts up the child thread. The child reports its counting to the parent with the report method, which spawns a watchdog timer (more on that in a second) that will call restartChild after five seconds with the current count. restartChild restarts the child thread if the count is still the same as the one provided.
Here is the class for the watchdog timer:
class Watchdog implements Runnable { // A timer that will run after five seconds
    private Thread t;
    private Parent parent;
    private int initialCount;

    public Watchdog(Parent parent, int count) { // make a timer with a count, and access to the parent
        initialCount = count;
        this.parent = parent;
    }

    public void run() { // Timer logic
        try {
            Thread.sleep(5000); // If you want to change the time requirement, modify it here
            parent.restartChild(initialCount);
        } catch (InterruptedException e) {
            System.out.println("Error in watchdog thread");
        }
    }

    public void start() { // start the timer
        if (t == null) {
            t = new Thread(this);
            t.start();
        }
    }
}
This watchdog timer is a thread that the parent runs with the start method. The parent passes itself as a parameter so that we can call the restartChild function of the parent. It stores the count because, when it runs after five seconds, restartChild will check whether the count has changed.
And finally, here is the child class:
public class Child implements Runnable {
    private Thread t;
    public int counter = 0;
    private boolean running;
    private Parent parent; // Record the parent

    public Child(Parent parent) {
        this.parent = parent;
    }

    private void initializeAll() {
        counter = 0;
        running = true;
    }

    public int getCount() {
        return counter;
    }

    @Override
    public void run() {
        while ((counter <= 100) && (running)) {
            // The main logic for the child
            counter += 1;
            System.out.println(counter);
            parent.report(counter); // Report a new count every two seconds
            try {
                Thread.sleep(2000); // Wait two seconds
            } catch (InterruptedException e) {
                System.out.println("Thread Failed");
            }
        }
    }

    public void start() { // Start the thread
        initializeAll();
        t = new Thread(this);
        t.start();
    }

    public void kill() { // Kill the thread
        running = false;
    }
}
This is also a thread, thus it implements Runnable, and in that regard it acts a lot like the watchdog. run() is the main method of the child thread; this is where the logic goes that gets called when you start it. Starting the child with start() sets all the variables to their defaults and then begins the run() logic. The loop in run() checks the running flag on each iteration, which lets us kill the thread internally by setting running to false.
Currently, all the child does is increment its counter, output it to the console, and report the activity to the parent, 100 times, every two seconds. You will likely want to remove the condition stopping it after the count passes 100, but I included it so that the parent would eventually have cause to restart the child. To change the behavior, look at the child's run method; that is where all the main action is.
I am using Netty to perform a large file upload. It works fine, but the RAM used by the client seems to increase with the size of the file. This is not the expected behaviour, since everything is piped from reading the source file to writing the target file.
At first, I thought of a kind of adaptive buffer growing until Xmx is reached, but setting Xmx to a reasonable value (50M) would lead to an OutOfMemoryError soon after starting the upload.
After some research using Eclipse Memory Analyzer, it appears that the object retaining the heap memory is:
org.jboss.netty.channel.socket.nio.NioSocketChannel$WriteRequestQueue
Is there any option for setting a limit to this queue or do I have to code my own queue using ChannelFutures to control the number of bytes and block the pipe when the limit is reached?
Thanks for your help,
Regards,
Renaud
Answer from @normanmaurer on the Netty GitHub:
You should use Channel.isWritable() to check if the "queue" is full. If so, you will need to check if there is enough space to write more. The effect you see can happen if you write data too quickly for it to get sent out to the clients.
You can get around this kind of problem by writing a file via DefaultFileRegion or ChunkedFile.
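To make that concrete, here is a rough sketch of how Channel.isWritable() could gate a writer with the Netty 3 API; the handler and method below are my own illustration, not from the answer:

import org.jboss.netty.buffer.ChannelBuffer;
import org.jboss.netty.channel.*;

// Sketch only: block the producer while the channel is saturated, and wake it
// from channelInterestChanged, which fires when writability changes.
class ThrottledWriter extends SimpleChannelHandler {
    private final Object lock = new Object();

    // Called by the producer thread; blocks while the write queue is full.
    void write(Channel channel, ChannelBuffer buf) throws InterruptedException {
        synchronized (lock) {
            while (!channel.isWritable()) {
                lock.wait(); // sleep until writability changes
            }
        }
        channel.write(buf);
    }

    @Override
    public void channelInterestChanged(ChannelHandlerContext ctx, ChannelStateEvent e)
            throws Exception {
        synchronized (lock) {
            lock.notifyAll(); // writability may have flipped; let writers re-check
        }
        super.channelInterestChanged(ctx, e);
    }
}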
@normanmaurer thank you, I missed this method of the Channel! I guess I need to read what's happening inside:
org.jboss.netty.handler.stream.ChunkedWriteHandler
UPDATED: 2012/08/30
This is the code I made to solve my problem:
public class LimitedChannelSpeaker {
    Channel channel;
    final Object lock = new Object();
    long maxMemorySizeB;
    long size = 0;
    Map<ChannelBufferRef, Integer> buffer2readablebytes = new HashMap<ChannelBufferRef, Integer>();

    public LimitedChannelSpeaker(Channel channel, long maxMemorySizeB) {
        this.channel = channel;
        this.maxMemorySizeB = maxMemorySizeB;
    }

    public ChannelFuture speak(ChannelBuffer buff) {
        if (buff.readableBytes() > maxMemorySizeB) {
            throw new IndexOutOfBoundsException("The buffer is larger than the maximum allowed size of " + maxMemorySizeB + "B.");
        }
        synchronized (lock) {
            while (size + buff.readableBytes() > maxMemorySizeB) {
                try {
                    lock.wait();
                } catch (InterruptedException ex) {
                    throw new RuntimeException(ex);
                }
            }
            ChannelBufferRef ref = new ChannelBufferRef(buff);
            ref.register();
            ChannelFuture future = channel.write(buff);
            future.addListener(ref); // listen with the same ref that was registered, so unregister() matches
            return future;
        }
    }

    private void spoken(ChannelBufferRef ref) {
        synchronized (lock) {
            ref.unregister();
            lock.notifyAll();
        }
    }

    private class ChannelBufferRef implements ChannelFutureListener {
        int readableBytes;

        public ChannelBufferRef(ChannelBuffer buff) {
            readableBytes = buff.readableBytes();
        }

        public void unregister() {
            buffer2readablebytes.remove(this);
            size -= readableBytes;
        }

        public void register() {
            buffer2readablebytes.put(this, readableBytes);
            size += readableBytes;
        }

        @Override
        public void operationComplete(ChannelFuture future) throws Exception {
            spoken(this);
        }
    }
}
for a Desktop background application
Netty is designed for highly scalable servers e.g. around 10,000 connections. For a desktop application with less than a few hundred connections, I would use plain IO. You may find the code is much simpler and it should use less than 1 MB.
I have an algorithm that will go through a large data set, read some text files, and search for specific terms in the lines of those files. I have it implemented in Java, but I didn't want to post code so that it doesn't look like I am searching for someone to implement it for me; but it is true, I really need a lot of help!!! This was not planned for my project, but the data set turned out to be huge, so the teacher told me I have to do it like this.
EDIT (I did not clarify this in the previous version): The data set I have is on a Hadoop cluster, and I should make a MapReduce implementation of it.
I was reading about MapReduce and thought that I would first do the standard implementation, and then it would be more or less easy to do it with MapReduce. But that didn't happen, since the algorithm is quite stupid and nothing special, and MapReduce... I can't wrap my mind around it.
So here, shortly, is the pseudocode of my algorithm:
LIST termList (there is a method that creates this list from a Lucene index)
FOLDER topFolder

INPUT topFolder
IF it is folder and not empty
    list files (there are 30 sub folders inside)
    FOR EACH sub folder
        GET file "CheckedFile.txt"
        analyze(CheckedFile)
    ENDFOR
END IF

Method ANALYZE(CheckedFile)
    read CheckedFile
    WHILE CheckedFile has next line
        GET line
        FOR (loops through termList)
            GET third word from line
            IF third word = term from list
                append whole line to string buffer
            ENDIF
        ENDFOR
    END WHILE
    OUTPUT string buffer to file
Also, as you can see, each time "analyze" is called, a new file has to be created. I understood that writing to many outputs is difficult in MapReduce???
I understand the MapReduce intuition, and my example seems perfectly suited for MapReduce, but when it comes to actually doing it, obviously I do not know enough and I am STUCK!
Please please help.
You can just use an empty reducer, and partition your job to run a single mapper per file. Each mapper will create its own output file in your output folder.
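A sketch of what that can look like with Hadoop's mapreduce API (class and term names are illustrative): the mapper emits every line whose third token is in the term list, and calling job.setNumReduceTasks(0) in the driver makes the job map-only, so each mapper writes its own part file to the output folder.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TermFilterMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private final Set<String> terms = new HashSet<String>();

    @Override
    protected void setup(Context context) {
        // Load your real term list here (e.g. from a file shipped with the job);
        // these two entries are placeholders.
        terms.add("terma");
        terms.add("termb");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] tokens = value.toString().split(" ");
        if (tokens.length >= 3 && terms.contains(tokens[2])) {
            context.write(NullWritable.get(), value); // keep the whole matching line
        }
    }
}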
Map Reduce is easily implemented using some nice Java 6 concurrency features, especially Future, Callable and ExecutorService.
I created a Callable that will analyse a file in the way you specified:
public class FileAnalyser implements Callable<String> {
    private Scanner scanner;
    private List<String> termList;

    public FileAnalyser(String filename, List<String> termList) throws FileNotFoundException {
        this.termList = termList;
        scanner = new Scanner(new File(filename));
    }

    @Override
    public String call() throws Exception {
        StringBuilder buffer = new StringBuilder();
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            String[] tokens = line.split(" ");
            if ((tokens.length >= 3) && (inTermList(tokens[2])))
                buffer.append(line).append('\n'); // newline keeps the matched lines separated
        }
        return buffer.toString();
    }

    private boolean inTermList(String term) {
        return termList.contains(term);
    }
}
We need to create a new callable for each file found and submit this to the executor service. The result of the submission is a Future which we can use later to obtain the result of the file parse.
public class Analyser {
    private static final int THREAD_COUNT = 10;

    public static void main(String[] args) throws Exception {
        // All callables will be submitted to this executor service
        // Play around with THREAD_COUNT for optimum performance
        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
        // Store all futures in this list so we can refer to them easily
        List<Future<String>> futureList = new ArrayList<Future<String>>();
        // Some random term list, I don't know what you're using.
        List<String> termList = new ArrayList<String>();
        termList.add("terma");
        termList.add("termb");
        // For each file you find, create a new FileAnalyser callable and submit
        // this to the executor service. Add the future to the list
        // so we can check back on the result later
        for (String filename : filesToAnalyse) { // pseudocode: enumerate your files here
            try {
                Callable<String> worker = new FileAnalyser(filename, termList);
                Future<String> future = executor.submit(worker);
                futureList.add(future);
            } catch (FileNotFoundException fnfe) {
                // If the file doesn't exist at this point we can probably ignore,
                // but I'll leave that for you to decide.
                System.err.println("Unable to create future for " + filename);
                fnfe.printStackTrace(System.err);
            }
        }
        // You may want to wait at this point, until all threads have finished
        // You could maybe loop through each future until isDone() holds true
        // for each of them.

        // Loop over all finished futures and do something with the result
        // from each
        for (Future<String> current : futureList) {
            String result = current.get();
            // Do something with the result from this future
        }
    }
}
My example here is far from complete, and far from efficient. I haven't considered the sample size; if it's really huge, you could keep looping over the futureList, removing elements that have finished, something like this:
while (futureList.size() > 0) {
    for (Future<String> current : futureList) {
        if (current.isDone()) {
            String result = current.get();
            // Do something with result
            futureList.remove(current);
            break; // We have modified the list during iteration, best break out of the for-loop
        }
    }
}
Alternatively, you could implement a producer-consumer type setup, where the producer submits callables to the executor service and produces a future, and the consumer takes the result of the future and then discards the future.
This would maybe require the producer and consumer to be threads themselves, and a synchronized list for adding/removing futures.
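For what it's worth, java.util.concurrent already ships that pattern as ExecutorCompletionService, which hands futures back in completion order; a sketch reusing the FileAnalyser above (the file list is illustrative):

import java.util.List;
import java.util.concurrent.*;

public class CompletionAnalyser {
    // Reuses the FileAnalyser Callable defined above.
    public static void analyseAll(List<String> filenames, List<String> termList) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(10);
        CompletionService<String> completion = new ExecutorCompletionService<String>(executor);
        int submitted = 0;
        for (String filename : filenames) {
            completion.submit(new FileAnalyser(filename, termList));
            submitted++;
        }
        for (int i = 0; i < submitted; i++) {
            String result = completion.take().get(); // blocks until the next result is ready
            // Do something with result
        }
        executor.shutdown();
    }
}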
Any questions please ask.