I have an algorithm that goes through a large data set, reads some text files, and searches for specific terms in those lines. I have it implemented in Java, but I didn't want to post code so that it doesn't look like I'm asking someone to implement it for me; but it is true, I really need a lot of help! This was not planned for my project, but the data set turned out to be huge, so my teacher told me I have to do it this way.
EDIT (I did not clarify this in the previous version): The data set I have is on a Hadoop cluster, and I should write a MapReduce implementation of it.
I was reading about MapReduce and thought that I would first do the standard implementation, and then it would be more or less easy to do it with MapReduce. But that didn't happen, since the algorithm is quite stupid and nothing special, and MapReduce... I can't wrap my mind around it.
So here, in short, is the pseudocode of my algorithm:
LIST termList (there is a method that creates this list from a Lucene index)
FOLDER topFolder

INPUT topFolder
IF it is a folder and not empty
    list files (there are 30 sub folders inside)
    FOR EACH sub folder
        GET file "CheckedFile.txt"
        analyze(CheckedFile)
    ENDFOR
END IF
Method ANALYZE(CheckedFile)
    read CheckedFile
    WHILE CheckedFile has next line
        GET line
        FOR (loop through termList)
            GET third word from line
            IF third word = term from list
                append whole line to string buffer
            ENDIF
        ENDFOR
    END WHILE
    OUTPUT string buffer to file
Also, as you can see, each time "analyze" is called a new output file has to be created, and I understood that MapReduce has difficulty writing to many outputs?
I understand the MapReduce intuition, and my example seems perfectly suited for MapReduce, but when it comes to actually doing it, obviously I do not know enough and I am STUCK!
Please, please help.
You can just use an empty reducer, and partition your job to run a single mapper per file. Each mapper will create its own output file in your output folder.
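Very roughly, a sketch of what that could look like with the Hadoop 2.x org.apache.hadoop.mapreduce API; the class name, the hard-coded terms and the argument handling are placeholders, not your actual project:

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TermSearch {

    public static class TermMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        // Placeholder terms; in your case this would come from the Lucene index
        private final Set<String> terms = new HashSet<String>(Arrays.asList("terma", "termb"));

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] tokens = value.toString().split(" ");
            // Emit the whole line when its third word is in the term list
            if (tokens.length >= 3 && terms.contains(tokens[2])) {
                context.write(NullWritable.get(), value);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "term search");
        job.setJarByClass(TermSearch.class);
        job.setMapperClass(TermMapper.class);
        job.setNumReduceTasks(0);                 // no reducer: mapper output goes straight to files
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        // With the default TextInputFormat, small files each get their own mapper/output file
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}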
MapReduce is easily implemented using some nice Java 6 concurrency features, especially Future, Callable and ExecutorService.
I created a Callable that will analyse a file in the way you specified:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.List;
import java.util.Scanner;
import java.util.concurrent.Callable;

public class FileAnalyser implements Callable<String> {

    private Scanner scanner;
    private List<String> termList;

    public FileAnalyser(String filename, List<String> termList) throws FileNotFoundException {
        this.termList = termList;
        scanner = new Scanner(new File(filename));
    }

    @Override
    public String call() throws Exception {
        StringBuilder buffer = new StringBuilder();
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            String[] tokens = line.split(" ");
            if ((tokens.length >= 3) && (inTermList(tokens[2])))
                buffer.append(line).append("\n"); //keep matching lines separated
        }
        return buffer.toString();
    }

    private boolean inTermList(String term) {
        return termList.contains(term);
    }
}
We need to create a new callable for each file found and submit this to the executor service. The result of the submission is a Future which we can use later to obtain the result of the file parse.
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class Analyser {

    private static final int THREAD_COUNT = 10;

    public static void main(String[] args) {
        //All callables will be submitted to this executor service
        //Play around with THREAD_COUNT for optimum performance
        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);

        //Store all futures in this list so we can refer to them easily
        List<Future<String>> futureList = new ArrayList<Future<String>>();

        //Some random term list, I don't know what you're using.
        List<String> termList = new ArrayList<String>();
        termList.add("terma");
        termList.add("termb");

        //For each file you find, create a new FileAnalyser callable and submit
        //this to the executor service. Add the future to the list
        //so we can check back on the result later.
        //(replace "path/to/files" with however you actually discover your files)
        for (File file : new File("path/to/files").listFiles()) {
            String filename = file.getAbsolutePath();
            try {
                Callable<String> worker = new FileAnalyser(filename, termList);
                Future<String> future = executor.submit(worker);
                futureList.add(future);
            }
            catch (FileNotFoundException fnfe) {
                //If the file doesn't exist at this point we can probably ignore,
                //but I'll leave that for you to decide.
                System.err.println("Unable to create future for " + filename);
                fnfe.printStackTrace(System.err);
            }
        }

        //You may want to wait at this point, until all threads have finished
        //You could maybe loop through each future until isDone() holds true
        //for each of them.

        //Loop over all finished futures and do something with the result
        //from each
        for (Future<String> current : futureList) {
            try {
                String result = current.get();
                //Do something with the result from this future
            } catch (Exception e) {
                e.printStackTrace(System.err);
            }
        }

        //Shut the executor down so the JVM can exit
        executor.shutdown();
    }
}
My example here is far from complete, and far from efficient. I haven't considered the sample size; if it's really huge you could keep looping over the futureList, removing elements that have finished, something similar to:
while (futureList.size() > 0) {
    for (Future<String> current : futureList) {
        if (current.isDone()) {
            String result = current.get(); //get() throws checked exceptions you'll need to handle
            //Do something with result
            futureList.remove(current);
            break; //We have modified the list during iteration, best break out of for-loop
        }
    }
}
Alternatively you could implement a producer-consumer type setup, where the producer submits callables to the executor service and produces a future, and the consumer takes the result of the future and then discards the future.
This would maybe require the producer and consumer to be threads themselves, and a synchronized list for adding/removing futures.
Any questions please ask.
Related
What's a good way of allowing searches from multiple threads on a list (or other data structure), but preventing searches on the list and edits to the list on different threads from interleaving? I tried using synchronized blocks in the searching and editing methods, but that can cause unnecessary blocking when trying to run searches in multiple threads.
EDIT: The ReadWriteLock is exactly what I was looking for! Thanks.
Usually, yes, a ReadWriteLock is good enough.
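For completeness, a minimal sketch of what the ReadWriteLock version could look like (doEdit and doSearch are the same placeholder methods as in the StampedLock example below):

private final ReadWriteLock rwLock = new ReentrantReadWriteLock();

public void edit() { // write method
    rwLock.writeLock().lock();
    try {
        doEdit();
    } finally {
        rwLock.writeLock().unlock();
    }
}

public Object search() { // read method; many readers may hold this lock at once
    rwLock.readLock().lock();
    try {
        return doSearch();
    } finally {
        rwLock.readLock().unlock();
    }
}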
But if you're using Java 8 you can get a performance boost with the new StampedLock, which lets you avoid the read lock. This applies when you have much more frequent reads (searches) compared with writes (edits).
private final StampedLock sl = new StampedLock();

public void edit() { // write method
    long stamp = sl.writeLock();
    try {
        doEdit();
    } finally {
        sl.unlockWrite(stamp);
    }
}

public Object search() { // read method
    long stamp = sl.tryOptimisticRead();
    Object result = doSearch(); //first try without a lock; the search ideally should be fast
    if (!sl.validate(stamp)) { //if something has been modified in the meantime
        stamp = sl.readLock(); //acquire read lock and search again
        try {
            result = doSearch();
        } finally {
            sl.unlockRead(stamp);
        }
    }
    return result;
}
I just want to clarify my understanding of using for loops inside a SwingWorker's doInBackground method.
For example, I have a list of files stored in Files (File[] Files = ...).
scanFiles = new SwingWorker<Object, Object>() {
    @Override
    public Object doInBackground() {
        for (File f : Files) {
            // process file f
        }
        return null;
    }
};
....
scanFiles.execute();
In the above, is it alright to use a for loop inside the doInBackground() method to go through a list of files, or is it better to bring the for loop outside the doInBackground() method, as in something like this:
for (File f : Files) {
    processFile(f);
}

private void processFile(final File f) {
    scanFiles = new SwingWorker<Object, Object>() {
        @Override
        public Object doInBackground() {
            // do something with f
            return null;
        }
    };
}
The above is skeleton code, not actual working code, just to illustrate what I want to do. That is, I don't want my program to scan files one by one; I want to do something like parallel processing of the files...
Thanks.
As mentioned in some of the comments: The appropriate solution heavily depends on how many files you want to process, and what processFile actually does.
The main difference between your approaches is (as MadProgrammer already said):
The first one creates one background thread that processes all the files
The second one creates many background threads, each processing one file
The borderline cases where either approach is not appropriate are, correspondingly:
The first one may be better when there are many files and processFile is a simple operation
The second one may be better when there are few files and processFile is a complex operation
But this is only a rough classification, and which one is the "best" approach still depends on other factors.
However, I'd like to propose another solution, that allows you to rather flexibly "shift" between the two extremes: You could create a List containing the File objects, and split this list into a specified number of "chunks" to let them be processed by the SwingWorker.
Sketched here, to show the basic idea: You create a method that processes a list of files with a SwingWorker:
private void processFiles(final List<File> files) {
    SwingWorker<Object, Object> scanFiles = new SwingWorker<Object, Object>() {
        @Override
        public Object doInBackground() {
            // do something with files
            return null;
        }
    };
    scanFiles.execute();
}
Then, at the call site, you can do the following:
// Obtain the list of files to process
File files[] = ...
List<File> fileList = Arrays.asList(files);
// Define the number of workers that should be used
int numWorkers = 10;
// Compute how many files each worker will process
int chunkSize = (int)Math.ceil((double)fileList.size() / numWorkers);
for (int i = 0; i < numWorkers; i++) {
    // Compute the part of the "fileList" that the worker will process
    int minIndex = i * chunkSize;
    int maxIndex = i * chunkSize + chunkSize;
    maxIndex = Math.min(maxIndex, fileList.size());
    List<File> chunk = fileList.subList(minIndex, maxIndex);

    // Start the worker
    processFiles(chunk);
}
(This is only a sketch. There may be some index-hassle involved. If desired, I can post a more elaborate version of this. Until now, it only shows the basic idea)
Then, you can define how many worker threads you would like to use (maybe even depending on Runtime.getRuntime().availableProcessors()).
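For example, one possible heuristic (an assumption, tune it for your workload):

// One worker per available core
int numWorkers = Runtime.getRuntime().availableProcessors();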
If you want to process files in parallel, you must spawn some worker threads, so the second sample should be your choice. You can inform the UI, or other components of your program, about the progress of processing the files using the following methods: protected void process(List<V> chunks), protected final void publish(V... chunks)
private void processFile(final File f) {
    scanFiles = new SwingWorker<Object, Object>() {
        @Override
        public Object doInBackground() {
            // call publish(...) with intermediate results while working on f
            publish("processing " + f.getName());
            return null;
        }

        @Override
        protected void process(List<Object> chunks) {
            //do something with intermediate data, for example show progress in the ui
        }
    };
    scanFiles.execute();
}
I want to do a task that I've already completed except this time using multithreading. I have to read a lot of data from a file (line by line), grab some information from each line, and then add it to a Map. The file is over a million lines long so I thought it may benefit from multithreading.
I'm not sure about my approach here since I have never used multithreading in Java before.
I want to have the main method do the reading, and then give each line that has been read to another thread that will format a String, and then give it to another thread to put into a map.
public static void main(String[] args)
{
    //Some information read from file
    BufferedReader br = null;
    String line = "";
    try {
        br = new BufferedReader(new FileReader("somefile.txt"));
        while ((line = br.readLine()) != null) {
            // Pass line to another task
        }
        // Here I want to get a total from B, but I'm not sure how to go about doing that
    } catch (IOException e) {
        e.printStackTrace();
    }
}
public class Parser extends Thread
{
    private Mapper m1;
    // Some reference to B

    public Parser(Mapper m) {
        m1 = m;
    }

    public void parse(String s, int i) {
        // Do some work on s
        String key = DoSomethingWithString(s);
        m1.add(key, i);
    }
}
public class Mapper extends Thread
{
    private SortedMap<String, Integer> sm;
    private String key;
    private int value;
    boolean hasNewItem;

    public Mapper() {
        sm = new TreeMap<String, Integer>();
        hasNewItem = false;
    }

    public void add(String s, int i) {
        hasNewItem = true;
        key = s;
        value = i;
    }

    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                if (hasNewItem) {
                    // Find if street name exists in map
                    sm.put(key, value);
                    hasNewItem = false;
                }
                Thread.sleep(1); // back off briefly; also, something in the try must be able to throw InterruptedException
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        // I'm not sure how to give the Map back to main.
    }
}
I'm not sure if I am taking the right approach. I also do not know how to terminate the Mapper thread and retrieve the map in the main method. I will have multiple Mapper threads, but I have only instantiated one in the code above.
I also just realized that my Parser class is not really a thread, but only another class, if it does not override the run() method, so I am thinking that the Parser class should be some sort of queue.
Any ideas? Thanks.
EDIT:
Thanks for all of the replies. It seems that since I/O will be the major bottleneck, there would be little efficiency benefit from parallelizing this. However, for demonstration purposes, am I on the right track? I'm still a bit bothered by not knowing how to use multithreading.
Why do you need multiple threads? You only have one disk and it can only go so fast. Multithreading almost certainly won't help in this case, and if it does, the gain will be very minimal from a user's perspective. Multithreading isn't your problem; reading from a huge file is your bottleneck.
Frequently I/O will take much longer than the in-memory tasks. We refer to such work as I/O-bound. Parallelism may have a marginal improvement at best, and can actually make things worse.
You certainly don't need a different thread to put something into a map. Unless your parsing is unusually expensive, you don't need a different thread for it either.
If you had other threads for these tasks, they might spend most of their time sitting around waiting for the next line to be read.
Even parallelizing the I/O won't necessarily help, and may hurt. Even if your CPUs support parallel threads, your hard drive might not support parallel reads.
EDIT:
All of us who commented on this assumed the task was probably I/O-bound -- because that's frequently true. However, from the comments below, this case turned out to be an exception. A better answer would have included the fourth comment below:
Measure the time it takes to read all the lines in the file without processing them. Compare that to the time it takes to both read and process them. That will give you a loose upper bound on how much time you could save, which may be reduced further by the added cost of thread synchronization.
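For example, a rough sketch of the read-only timing pass (assuming the same "somefile.txt" as in the question; this is not a rigorous benchmark):

long start = System.nanoTime();
try (BufferedReader br = new BufferedReader(new FileReader("somefile.txt"))) {
    while (br.readLine() != null) {
        // read only, no processing
    }
} catch (IOException e) {
    e.printStackTrace();
}
System.out.println("read-only pass took " + (System.nanoTime() - start) / 1_000_000 + " ms");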
You may wish to read Amdahl's Law. Since the majority of your work is strictly serial (the IO) you will get negligible improvements by multi-threading the remainder. Certainly not worth the cost of creating watertight multi-threaded code.
Perhaps you should look for a new toy-example to parallelise.
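As a quick illustration with made-up numbers:

// Amdahl's Law: speedup = 1 / ((1 - p) + p / n), where p is the parallelisable
// fraction and n the number of threads. If 90% of the time is serial I/O:
double p = 0.1;          // assumed parallelisable fraction
int n = 8;               // threads
double speedup = 1.0 / ((1.0 - p) + p / n);   // about 1.10; the cap as n grows is 1 / 0.9, about 1.11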
I read lines from a file, in one thread of course. The lines are sorted by key.
Then I collect lines with the same key (15-20 lines), do the parsing, a big calculation, etc., and push the resulting object to a statistics class.
I want to parallelize my program: read in one thread, do the parsing and calculation in many threads, and join the results in one thread to write to the stats class.
Is there any ready-made pattern or solution in the Java 7 framework for this problem?
I implemented it with an executor for multithreading, pushing to a BlockingQueue, and reading the queue in another thread, but I think my code sucks and will produce bugs.
Many thanks
UPD:
I can't map the whole file into memory - it's too big.
You already have the main classes of approaches in mind: CountDownLatch, Thread.join, Executors, Fork/Join. Another option is the Akka framework, which has message-passing overheads measured in 1-2 microseconds and is open source. However, let me share another approach that often outperforms the above and is simpler; it comes from working on batch file loads in Java for a number of companies.
Assuming that your goal in splitting the work up is performance rather than learning (performance as measured by how long it takes from start to finish), it is often difficult to beat memory-mapping the file and processing it in a single thread that has been pinned to a single core. It also gives much simpler code. A double win.
This may be counter-intuitive; however, the speed of processing files is nearly always limited by how efficient the file loading is, not by how parallel the processing is. Hence memory-mapping the file is a huge win. Once it is memory-mapped, we want the algorithm to have low contention with the hardware as it performs the file load. Modern hardware tends to have the IO controller and the memory controller on the same socket as the CPU, which, combined with the prefetchers within the CPU itself, leads to a hell of a lot of efficiency when processing the file in an orderly fashion from a single thread. This can be so extreme that going parallel may actually be a lot slower. Pinning a thread to a core usually speeds up memory-bound algorithms by a factor of 5, which is why the memory-mapping part is so important.
If you have not already, give it a try.
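A minimal sketch of the memory-mapping part (this assumes the file fits into a single mapping, i.e. under 2 GB; bigger files need several mapped regions, and the processing here is just a line-count placeholder):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedScan {
    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get("path/to/my/file"),
                StandardOpenOption.READ)) {
            // Map the whole file read-only into memory
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            long lineCount = 0;
            while (buffer.hasRemaining()) {
                if (buffer.get() == '\n') {   // replace with your real per-byte/per-line processing
                    lineCount++;
                }
            }
            System.out.println("Lines: " + lineCount);
        }
    }
}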
Without facts and numbers it is hard to give you advice, so let's start from the beginning:
You must identify the bottleneck. Do you really need to perform the computation in parallel, or is your job IO-bound? Avoid concurrency if possible; it could be faster.
If computations must be done in parallel you must decide how fine- or coarse-grained your tasks must be. You need to measure your computations and tasks to be able to size them. Avoid creating too many tasks.
You should have an IO thread, several workers, and a "data gatherer" thread. No mutable shared data.
Be sure not to slow down the IO thread because of task submission. Otherwise you should use more coarse-grained tasks or a better task dispatcher (who said Disruptor?).
The "data gatherer" thread should be the only one to mutate the final state.
Avoid unnecessary data copies and object creation. Quite often, when iterating over large files, the bottleneck is the GC. Last week I achieved a 6x speedup by replacing a standard Scala object with a flyweight pattern. You should also try to pre-allocate everything and use large buffers (page-sized).
Avoid disk seeks.
Having said that, you should be on the right track. You can start with an Executor using properly sized tasks. Tasks write into a data structure, like your blocking queue, shared between the workers and the "data gatherer" thread. This threading model is really simple, efficient, and hard to get wrong. It is usually efficient enough. If you still require better performance, then you must profile your application and understand the bottleneck. Then you can decide which way to go: refine your task size, use faster tools like the Disruptor/Akka, improve IO, create fewer objects, tune your code, buy a bigger machine or faster disks, move to Hadoop, etc. Pinning each thread to a core (requires platform-specific code) could also provide a significant boost.
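To make that concrete, here is a compact sketch of that threading model; the pool size, the chunk shape and the processChunk helper are assumptions, not your actual code, and a poison-pill value ends the gatherer loop:

static void runPipeline(List<List<String>> chunks) throws InterruptedException {
    final BlockingQueue<String> results = new LinkedBlockingQueue<String>();
    final String POISON = new String("POISON");    // sentinel, compared by reference below
    ExecutorService workers = Executors.newFixedThreadPool(4);

    // The single "gatherer" thread is the only one that mutates the final state.
    Thread gatherer = new Thread(new Runnable() {
        @Override
        public void run() {
            try {
                String result;
                while ((result = results.take()) != POISON) {
                    // fold "result" into the final statistics here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    });
    gatherer.start();

    // Coarse-grained tasks: each worker processes a whole chunk and queues its result.
    for (final List<String> chunk : chunks) {
        workers.submit(new Runnable() {
            @Override
            public void run() {
                results.add(processChunk(chunk));   // hypothetical helper doing the parsing/calculation
            }
        });
    }

    workers.shutdown();
    workers.awaitTermination(1, TimeUnit.DAYS);
    results.add(POISON);                            // all workers done: stop the gatherer
    gatherer.join();
}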
You can use CountDownLatch
http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/CountDownLatch.html
to synchronize the starting and joining of threads. This is better than looping over the set of threads and calling join() on each thread reference.
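For illustration, a minimal, self-contained sketch of that start/finish synchronization (the worker body is just a placeholder println):

import java.util.concurrent.CountDownLatch;

public class LatchExample {

    public static void main(String[] args) throws InterruptedException {
        final int workerCount = 4;                         // made-up number of workers
        final CountDownLatch startSignal = new CountDownLatch(1);
        final CountDownLatch doneSignal = new CountDownLatch(workerCount);

        for (int i = 0; i < workerCount; i++) {
            final int id = i;
            new Thread(new Runnable() {
                @Override
                public void run() {
                    try {
                        startSignal.await();               // block until the main thread says go
                        System.out.println("worker " + id + " processing its chunk");
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        doneSignal.countDown();            // tell the main thread this worker is done
                    }
                }
            }).start();
        }

        startSignal.countDown();                           // release all workers at once
        doneSignal.await();                                // wait for every worker to finish
        System.out.println("all workers finished");
    }
}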
Here is what I would do if asked to split work as you are trying to:
import java.io.File;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class App {

    public static class Statistics {
    }

    public static class StatisticsCalculator implements Callable<Statistics> {

        private final List<String> lines;

        public StatisticsCalculator(List<String> lines) {
            this.lines = lines;
        }

        @Override
        public Statistics call() throws Exception {
            //do stuff with lines
            return new Statistics();
        }
    }

    public static void main(String[] args) {
        final File file = new File("path/to/my/file");
        final List<List<String>> partitionedWork = partitionWork(readLines(file), 10);
        final List<Callable<Statistics>> callables = new LinkedList<>();
        for (final List<String> work : partitionedWork) {
            callables.add(new StatisticsCalculator(work));
        }
        final ExecutorService executorService = Executors.newFixedThreadPool(Math.min(partitionedWork.size(), 10));
        final List<Future<Statistics>> futures;
        try {
            futures = executorService.invokeAll(callables);
        } catch (InterruptedException ex) {
            throw new RuntimeException(ex);
        }
        try {
            for (final Future<Statistics> future : futures) {
                final Statistics statistics = future.get();
                //do whatever to aggregate the individual Statistics
            }
        } catch (InterruptedException | ExecutionException ex) {
            throw new RuntimeException(ex);
        }
        executorService.shutdown();
        try {
            executorService.awaitTermination(1, TimeUnit.DAYS);
        } catch (InterruptedException ex) {
            throw new RuntimeException(ex);
        }
    }

    static List<String> readLines(final File file) {
        //read lines
        return new ArrayList<>();
    }

    static List<List<String>> partitionWork(final List<String> lines, final int blockSize) {
        //divide up the incoming list into a number of chunks
        final List<List<String>> partitionedWork = new LinkedList<>();
        for (int i = lines.size(); i > 0; i -= blockSize) {
            int start = i > blockSize ? i - blockSize : 0;
            partitionedWork.add(lines.subList(start, i));
        }
        return partitionedWork;
    }
}
I have created a Statistics object; this holds the result of the work done.
There is a StatisticsCalculator object, which is a Callable<Statistics> - this does the calculation. It is given a List<String>, and it processes the lines and creates the Statistics.
The readLines method I leave to you to implement.
The most important method in many ways is the partitionWork method; this divides the incoming List<String>, which is all the lines in the file, into a List<List<String>> using the blockSize. This essentially decides how much work each thread should have, and tuning the blockSize parameter is very important. If each piece of work is only one line, then the overheads would probably outweigh the advantages, whereas if each piece of work is ten thousand lines, then you only have one working Thread.
Finally, the meat of the operation is the main method. This calls the read and then partition methods. It spawns an ExecutorService with a number of threads equal to the number of pieces of work, but up to a maximum of 10. You may want to make this equal to the number of cores you have.
The main method then submits a List of all the Callables, one for each chunk, to the executorService. The invokeAll method blocks until the work is done.
The method now loops over each returned Future and gets the generated Statistics object from each, ready for aggregation.
Afterwards, don't forget to shut down the executorService, as otherwise it will prevent your application from exiting.
EDIT
The OP wants to read line by line, so here is a revised main:
public static void main(String[] args) throws IOException {
    final File file = new File("path/to/my/file");
    final ExecutorService executorService = Executors.newFixedThreadPool(10);
    final List<Future<Statistics>> futures = new LinkedList<>();
    try (final BufferedReader reader = new BufferedReader(new FileReader(file))) {
        List<String> tmp = new LinkedList<>();
        String line = null;
        while ((line = reader.readLine()) != null) {
            tmp.add(line);
            if (tmp.size() == 100) {
                futures.add(executorService.submit(new StatisticsCalculator(tmp)));
                tmp = new LinkedList<>();
            }
        }
        if (!tmp.isEmpty()) {
            futures.add(executorService.submit(new StatisticsCalculator(tmp)));
        }
    }
    try {
        for (final Future<Statistics> future : futures) {
            final Statistics statistics = future.get();
            //do whatever to aggregate the individual Statistics
        }
    } catch (InterruptedException | ExecutionException ex) {
        throw new RuntimeException(ex);
    }
    executorService.shutdown();
    try {
        executorService.awaitTermination(1, TimeUnit.DAYS);
    } catch (InterruptedException ex) {
        throw new RuntimeException(ex);
    }
}
This streams the file line by line and, after a given number of lines, fires off a new task to process those lines on the executor.
You would need to call clear on the List<String> in the Callable when you are done with it, as the Callable instances are referenced by the Futures they return. If you clear the Lists when you're done with them, that should reduce the memory footprint considerably.
A further enhancement may well be to use the suggestion here for an ExecutorService that blocks until there is a spare thread - this will guarantee that there are never more than threads*blockSize lines in memory at a time, if you clear the Lists when the Callables are done with them.
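I can't reproduce the linked suggestion here, but one common way to get that kind of back-pressure (an assumption on my part, not necessarily what the link describes) is a ThreadPoolExecutor with a bounded queue and CallerRunsPolicy, so that submit() falls back to running the task in the submitting thread once the queue is full:

int threads = 10;
ExecutorService executorService = new ThreadPoolExecutor(
        threads, threads,                                   // fixed-size pool
        0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<Runnable>(threads),          // bounded work queue
        new ThreadPoolExecutor.CallerRunsPolicy());         // queue full: the caller runs the task itself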
I want to run two XPath expressions concurrently on two revisions of a database, both of which return results from an Iterator/Iterable, and match the resulting nodes with nodes in a List.
I think the best approach is to run both queries in two threads from an ExecutorService and save the results from both threads in a BlockingQueue, while another thread sorts the results from the BlockingQueue, or actually saves the incoming nodes or nodeKeys in the right position.
Then it's trivial to get the intersection of the resulting sorted List and another sorted List.
Any other suggestions? I'm also free to use whatever technology I like (preferably Java). Guava is in the classpath, but I already thought about using Actors from Akka.
Edit: An additional related question would be whether it's faster to use insertion sort in a pipelined manner (to process the generated XPath results right when they are received) or to wait until the whole result has been generated and use quicksort or merge sort. I think insertion sort should be preferable regardless of the resulting number of elements.
In general I hope that sorting and then computing the intersection of the two lists is faster than the O(n^2) search for each item in the XPath result list, even if that list is divided by the number of CPU processors available.
Edit:
I've currently implemented the first part:
final ExecutorService executor = Executors.newFixedThreadPool(2);

final AbsTemporalAxis axis =
    new NextRevisionAxis.Builder(mSession).setRevision(mRevision)
        .setIncludeSelf(EIncludeSelf.YES).build();

for (final IReadTransaction rtx : axis) {
    final ListenableFuture<Void> future =
        Futures.makeListenable(executor.submit(new XPathEvaluation(rtx, mQuery)));
    future.addListener(new Runnable() {
        @Override
        public void run() {
            try {
                mSemaphore.acquire();
            } catch (final InterruptedException e) {
                LOGWRAPPER.error(e.getMessage(), e);
            }
        }
    }, executor);
}
executor.shutdown();

final ExecutorService sameThreadExecutor = MoreExecutors.sameThreadExecutor();
sameThreadExecutor.submit(new XPathResult());
sameThreadExecutor.shutdown();
return null;
The semaphore is initialized to 2, and in XPathEvaluation the resulting nodeKeys are added to a LinkedBlockingQueue.
Then I'm going to sort the XPath results at the place denoted by the comment, which isn't implemented yet:
private final class XPathResult implements Callable<Void> {

    @Override
    public Void call() throws AbsTTException, InterruptedException {
        while (true) {
            final long key = mQueue.take();
            if (key == -1L) {
                break;
            }
            if (mSemaphore.availablePermits() == 0) {
                mQueue.put(-1L);
            }
            // Do InsertionSort.
        }
        return null;
    }
}
There's no JavaDoc yet, but I think at least it should work; what do you think? Do you have any preferable solutions, or have I made some mistakes so far?
kind regards,
Johannes
Are you sure you need to do this concurrently? Can't you just build the two lists consecutively and after that perform your sorting/intersecting? That would take a lot of complexity out of the problem.
I assume that intersecting cannot be done until both lists are filled completely, am I correct? Then no queue or synchronization would be needed: just fill two lists/sets and, once done, process both full lists.
But maybe I'm not quite getting your point...
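For example, a rough sketch of the sequential variant (evaluateXPath is a hypothetical helper standing in for your XPathEvaluation; the node keys are collected into sorted sets and then intersected):

Set<Long> first = new TreeSet<Long>();
for (final long nodeKey : evaluateXPath(firstRevision, query)) {    // hypothetical helper
    first.add(nodeKey);
}
Set<Long> second = new TreeSet<Long>();
for (final long nodeKey : evaluateXPath(secondRevision, query)) {
    second.add(nodeKey);
}
first.retainAll(second);   // "first" now holds the sorted intersection of both results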