I have a long-running set of discrete tasks: parsing tens of thousands of lines from a text file, hydrating them into objects, manipulating them, and persisting the results.
If I were implementing this in Java, I suppose I might add a new task to an Executor for each line in the file, or one task per X lines (i.e., chunks).
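Roughly what I have in mind on the Java side, as a minimal sketch (the chunk size, file name, and per-line work are placeholders):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Minimal sketch of the chunked Executor approach described above.
public class ChunkedImport {
    static final int CHUNK = 500; // arbitrary chunk size

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<String> lines = Files.readAllLines(Paths.get("input.txt"));
        for (int i = 0; i < lines.size(); i += CHUNK) {
            List<String> chunk = lines.subList(i, Math.min(i + CHUNK, lines.size()));
            pool.submit(() -> {
                // hydrate each line into an object, manipulate, persist
                chunk.forEach(line -> { /* ... */ });
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for all chunks to finish
    }
}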
For .NET, which is what I am using, I'm not so sure. I suspect the CCR might be appropriate here, but I'm not familiar enough with it, which is why I pose this question.
Can CCR function in an equivalent fashion to Java Executors, or is there something else available?
Thanks
You may want to look at the Task Parallel Library.
As of C# 5, task-based asynchrony is built into the language via the async and await keywords.
If you're going to ask a bunch of .NET people what's closest to being equivalent to Java Executors, it might not hurt to describe the distinguishing features of Java Executors. The person who knows your answer may not be any more familiar with Java than you are with .NET.
That said, if the already-mentioned Task Parallel Library is overkill for your needs, or you don't want to wait for .NET 4.0, perhaps ThreadPool.QueueUserWorkItem() would be what you're looking for.
Maybe this is related: Design: Task Parallel Library explored.
See 10-4 Episode 6: Parallel Extensions as a quick intro.
For an older, thread-based approach, there's ThreadPool for pooling.
The BackgroundWorker class is probably what you're looking for. As the name implies, it allows you to run background tasks, with automatically managed pooling and status-update events.
For anyone looking for a more contemporary solution (as I was), check out the EventLoopScheduler class.
As far as I know, the Stream API is intended to be applied to collections. But I like the idea of streams so much that I try to apply them whenever I can, and sometimes when I shouldn't.
Originally my app had two threads communicating through a BlockingQueue. The first would populate new elements; the second would transform them and save them to disk. It looked like a perfect stream opportunity to me at the time.
The code I ended up with:
Stream.generate().flatMap().filter().forEach()
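Fleshed out, the pipeline looks roughly like this (a minimal sketch; fetchBatch and saveToDisk stand in for my real producer and consumer):

import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

public class PipelineSketch {
    public static void main(String[] args) {
        Stream.generate(PipelineSketch::fetchBatch) // Stream<List<String>>, infinite
              .flatMap(List::stream)                // flatten batches into elements
              .filter(s -> !s.isEmpty())            // transformations go here
              .forEach(PipelineSketch::saveToDisk); // terminal op: never completes
    }

    // stand-in for the real producer (e.g. draining a BlockingQueue)
    private static List<String> fetchBatch() {
        return Arrays.asList("a", "b", "c");
    }

    // stand-in for the real persistence step
    private static void saveToDisk(String s) {
        System.out.println("persisted: " + s);
    }
}

This also shows the caveat mentioned below: since the stream is infinite, the terminal forEach never returns on its own.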
I'd like to put a few maps in there, but it turns out I have to drag one additional field all the way to forEach. So I either create a meaningless class with two fields and an obscure name, or use AbstractMap.SimpleEntry to carry both fields through; neither looks like a great deal to me.
Anyway, I rewrote my app and it even seems to work. However, there are some caveats. Since the stream is infinite, "the thing" can't be stopped. For now I'm starting it on a daemon thread, but this is not a real solution. The business logic (such as handling connection loss and reconnection, though this is probably not business logic) looks alienated from it. Maybe I just need a proxy for this.
On the other hand, I get lazy queue population for free, and one thread instead of two (I'm not sure how good that is). Hopefully it's a pattern familiar to other developers.
So my question is: how viable is the Stream API for organizing application flow? Are there more underwater rocks here? If it's not recommended, what are the alternatives?
I don't understand why we shouldn't use TypedActors in Akka. Using reflection (well, instanceof) to compensate for the lack of pattern matching in Java is quite ugly.
As far as I understand, TypedActors should act as a gate between the "Akka world" and the "non-Akka world" of your software. But why would we throw away all OO principles and just use reflection?
Why wouldn't you want to use an actor and know exactly what it should respond to? Or, for the sake of keeping the actor model, why not create a message hierarchy that uses double dispatch to activate the right method in the actor (and I know you shouldn't pass actors as parameters, but use ActorRefs instead)?
DISCLAIMER: I'm new to Akka and this model, and I haven't written a single line of code using Akka, but just reading the documentation is giving me a headache.
Before we get started: the question is about the deprecated "typed actors" module, which will soon be replaced with akka-typed, a far superior take on the problem that avoids the shortcomings explained below. Please do have a look at akka-typed if you're interested in typed actors!
I'll enumerate a number of downsides of using the typed actors implementation you refer to. Please note, however, that we have just merged a new akka-typed module, which brings type safety back to the world of Akka actors. For the sake of this post I will not go in depth into why developing the typed version was such a tough challenge; for now, let's answer the question of "why not use the (old) typed actors".
Firstly, they were never designed to be the core of the toolkit; they are built on top of the messaging infrastructure Akka provides. Note that thanks to that messaging infrastructure we're able to achieve location transparency and Akka's well-known performance. Typed actors heavily use reflection and JDK proxies to translate between method calls and message sends. This is very expensive time-wise and degrades performance around 10-fold compared to plain Akka actors; see below for a "ping pong" benchmark (implemented in both styles: the sender tells to the actor, the actor replies, 100,000 times):
Unit = ops/ms
Benchmark Mode Samples Mean Mean error Units
TellPingPongBenchmark.tell_100000_msgs thrpt 20 119973619.810 79577253.299 ops/ms
JdkProxyTypedActorTellPingPongBenchmark.tell_100000_msgs thrpt 20 16697718.988 406179.847 ops/ms
Unit = us/op
Benchmark Mode Samples Mean Mean error Units
TellPingPongBenchmark.tell_100000_msgs sample 133647 1.223 0.916 us/op
JdkProxyTypedActorTellPingPongBenchmark.tell_100000_msgs sample 222869 12.416 0.045 us/op
(Benchmarks are kept in akka/akka-bench-jmh and run using the OpenJDK JMH tool, via the sbt-jmh plugin.)
Secondly, using methods to abstract over distributed systems is just not a good way of going about it (oh, how I remember RMI... let's not go there again). Making a remote interaction look like a method call makes you stop thinking about message loss, reordering, and all the other things that can and do happen in distributed systems. It also encourages (makes it "too easy to do the wrong thing") signatures like def getThing(id: Int): Thing, which generate blocking code, and blocking is horrible for performance! You really do want to stay asynchronous and responsive, which is why you end up with loads of futures when trying to work properly with these (proxy-based) typed actors.
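To illustrate the difference in Java terms (my sketch; Thing and both interfaces are hypothetical, not Akka API):

import java.util.concurrent.CompletionStage;

class Thing {}

// Typed-actor / RMI style: looks like a local call, silently blocks on a remote hop.
interface ThingRepoProxy {
    Thing getThing(int id);
}

// Message-passing style: latency and failure are explicit in the return type.
interface ThingRepoAsync {
    CompletionStage<Thing> getThing(int id);
}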
Lastly, you basically lose one of the main actor capabilities. The three canonical operations an actor can perform are: 1) send messages, 2) start child actors, and 3) change its own behaviour based on received messages (see Carl Hewitt's original paper on the actor model). The third capability is used to beautifully model state machines. For example, in plain Akka actors you can say become(active) and then become(allowOnlyPrivileged) to switch between receive implementations, which makes finite state machine implementations (we also have a DSL for FSMs) a joy to work with. You cannot express this nicely in JDK-proxied typed actors, because you cannot change the set of exposed methods. This is a major downside once you start thinking and modeling with state machines.
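Here is a minimal sketch of that third capability using the classic Akka Java API (my example; the Door protocol is made up):

import akka.actor.AbstractActor;

// An actor that swaps its receive behaviour at runtime; JDK-proxied typed
// actors cannot express this, since their method set is fixed.
public class Door extends AbstractActor {
    private Receive closed() {
        return receiveBuilder()
                .matchEquals("open", m -> getContext().become(open()))
                .build();
    }

    private Receive open() {
        return receiveBuilder()
                .matchEquals("close", m -> getContext().become(closed()))
                .build();
    }

    @Override
    public Receive createReceive() {
        return closed(); // initial behaviour
    }
}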
A New Hope (Episode 1): Please do have a look at the upcoming akka-typed module authored by Roland Kuhn (a preview will be included in the 2.4 release soon); I'm pretty sure you'll like what you find there type-safety-wise. That implementation will eventually be even faster than the current untyped actors as well (I'm omitting implementation details here, as the answer is already pretty long; the short version is that the new implementation removes a load of allocations).
I hope you'll enjoy this thorough answer. Feel free to ask follow-up questions in the comments here or on akka-user, our official mailing list. Happy Hakking!
Typed Actors provide you with a static contract defined in terms of your domain: you can name their messages (which will be delegated to an underlying implementation and executed asynchronously) after actions which make sense in your domain, avoiding the use of reflection on your part. (TypedActors use JDK proxies under the hood, so there is still reflection going on; you just don't have to worry about it.) You also gain type checking of the arguments passed to the active object/typed actor and of its return types. The documentation is pretty clear on this, but I know that for those new to actor-based concurrency additional examples always help, so feel free to ask further questions in the comments if you are still having trouble grokking the difference.
But do you realize that a huge number of companies don't have expert developers, yet do have a big infrastructure that can scale horizontally as much as needed? Raw performance is not always the deciding factor; being responsive, message-driven, elastic, and resilient often matters more, and thanks to typed actors we get that even from developers who don't know anything about Akka or Reactive Programming.

Don't get me wrong, I use pure Akka Typed in my day-to-day work, but for delivery teams we have this framework that uses typed actors, and its consumers use it as POJOs without knowing that they are coding in a reactive system. And that's an awesome feature.
As I read here http://mechanitis.blogspot.fr/2011/06/dissecting-disruptor-how-do-i-read-from.html
"for every individual item, the Consumer simply says "Let me know when you've got more than this number", and is told in return how many more entries it can grab."
Doesn't this relate to the Rx Framework concept as presented by Erik Meijer (http://www.youtube.com/watch?v=8Mttjyf-8P4)?

If so, could the Rx Framework be helpful for implementing a similar piece of software?
Nice question; I've been wondering about this myself for one of my current projects.
I don't feel greatly qualified to give a definitive answer, however:
They are designed to scratch different itches.
Disruptor is clearly designed for performance first, staying as close to the metal as possible. It does one thing and doesn't try to be fancy beyond that.
Rx is higher level: it is "LINQ to events". It allows you to do nice things with events that you couldn't with normal framework events (you can't filter a standard event and then continue propagating it as an event).
Semantic differences
As the originator of Disruptor.Net pointed out here:
The interface matches, but I think the semantics behind Rx do not:

an exception (OnError) terminates the stream; this is not the case with the Disruptor

you cannot subscribe to the Disruptor while it's hot: observers have to be set up before "starting" the Disruptor, which does not work very well with operators like retry, which re-subscribe in case of error

lots of operators do not make sense with the Disruptor, or would just not work
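To make the first of those bullets concrete, a small RxJava sketch (RxJava 2, the Java port of Rx; this example is mine, not the quoted author's):

import io.reactivex.Observable;

public class OnErrorDemo {
    public static void main(String[] args) {
        Observable.just(1, 2, 3, 4)
                .map(i -> {
                    if (i == 3) throw new IllegalStateException("boom");
                    return i;
                })
                .subscribe(
                        i -> System.out.println("next: " + i),    // prints 1, 2
                        e -> System.out.println("error: " + e));  // then the stream is dead
    }
}

Per the quote above, the Disruptor is not terminated by an error in this way, so the two models disagree on this point.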
Having said that, he was (at least at one time) thinking about integration between Disruptor.Net, TPL Dataflow and Rx.
Here is another page where someone asks the same question, the page concludes with:
Disruptor is in fact more like TPL DataFlow in my opinion.
Without knowing the Rx framework well I can't be sure, but you could be right. However, Disruptor.Net is designed to be a port of the Java version, so it will stay as similar to it as possible. Given that the original doesn't use Rx, switching to a different library would add a lot of rework and possibly performance issues.
Can anyone point me at a simple, open-source Map/Reduce framework/API for Java? There doesn't seem to be much evidence of such a thing existing, but someone else might know different.
The best I can find is, of course, Hadoop MapReduce, but that fails the "simple" criterion. I don't need the ability to run distributed jobs, just something that lets me run map/reduce-style jobs on a multi-core machine, in a single JVM, using standard Java 5-style concurrency.
It's not a hard thing to write oneself, but I'd rather not have to.
Have you checked out Akka? While Akka is really a distributed actor-model-based concurrency framework, you can implement a lot of things simply with little code. It's just so easy to divide work into pieces with it, and it automatically takes full advantage of a multi-core machine, as well as being able to use multiple machines to process work. Unlike using threads, it feels more natural to me.
I have a Java map/reduce example using Akka. It's not the easiest map/reduce example, since it makes use of futures, but it should give you a rough idea of what's involved. There are several major things that my map/reduce example demonstrates:
How to divide the work.
How to assign the work: Akka has a really simple messaging system, as well as a work partitioner whose schedule you can configure. Once I learned how to use it, I couldn't stop. It's just so simple and flexible. I was using all four of my CPU cores in no time. This is really great for implementing services.
How to know when the work is done and the result is ready to process: this is actually the part that may be the most difficult and confusing to understand unless you're already familiar with futures. You don't need to use futures, since there are other options; I just used them because I wanted something shorter for people to grok. A rough actor-based sketch (without futures) follows this list.
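For a sense of the shape, here is a minimal classic-Akka (Java API) sketch of divide/assign/collect; the message types and word-count logic are hypothetical stand-ins, not the linked example:

import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;

public class MapReduceSketch {
    public static class Line  { public final String text; public Line(String t) { text = t; } }
    public static class Count { public final int n;       public Count(int n)   { this.n = n; } }

    // Mapper: turns one line into a partial result and sends it to the sender.
    public static class Mapper extends AbstractActor {
        @Override public Receive createReceive() {
            return receiveBuilder()
                    .match(Line.class, l ->
                        getSender().tell(new Count(l.text.split("\\s+").length), getSelf()))
                    .build();
        }
    }

    // Reducer: folds partial results; it knows how many to expect.
    public static class Reducer extends AbstractActor {
        private final int expected;
        private int seen, total;
        public Reducer(int expected) { this.expected = expected; }
        @Override public Receive createReceive() {
            return receiveBuilder()
                    .match(Count.class, c -> {
                        total += c.n;
                        if (++seen == expected) {               // all partials arrived
                            System.out.println("total words: " + total);
                            getContext().getSystem().terminate();
                        }
                    })
                    .build();
        }
    }

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("mapreduce");
        String[] lines = { "a b c", "d e", "f" };
        ActorRef reducer = system.actorOf(Props.create(Reducer.class, lines.length));
        ActorRef mapper  = system.actorOf(Props.create(Mapper.class));
        for (String s : lines) mapper.tell(new Line(s), reducer); // replies go to the reducer
    }
}

Note how the reducer, not the caller, decides when all the work is done; futures are just one alternative way of expressing that.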
If you have any questions, Stack Overflow actually has an awesome Akka Q&A section.
I think it is worth mentioning that these problems are history as of Java 8. An example:
int heaviestBlueBlockWeight =
    blocks.stream()                          // or parallelStream() to fan out across cores
          .filter(b -> b.getColor() == BLUE)
          .map(Block::getWeight)
          .reduce(0, Integer::max);
In other words: single-node MapReduce is available in Java 8.
For more details, see Brian Goetz's presentation about Project Lambda.
I use the following structure:

int procs = Runtime.getRuntime().availableProcessors();
ExecutorService es = Executors.newFixedThreadPool(procs);

// Task implements Callable<TaskResult>; the mapping runs in the pool
List<Future<TaskResult>> results = new ArrayList<>();
for (int i = 0; i < tasks; i++)
    results.add(es.submit(new Task(i)));

// reduce() obtains each result via future.get(), blocking until it is ready
for (Future<TaskResult> future : results)
    reduce(future);
es.shutdown();
I realise this might be a little after the fact, but you might want to have a look at the JSR 166y ForkJoin classes from JDK 7.
There is a backported library that works under JDK 6 without any issues, so you don't have to wait until the next millennium to have a go with it. It sits somewhere between a raw executor and Hadoop, giving a framework for working on map/reduce jobs within the current JVM.
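A minimal sketch of the fork/join style (JDK 7 API; the array sum is a stand-in for real map and reduce steps):

import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1000; // below this, compute inline
    private final long[] data;
    private final int from, to;

    public SumTask(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {            // "map": small chunk, compute directly
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) >>> 1;
        SumTask left = new SumTask(data, from, mid);
        left.fork();                             // run the left half asynchronously
        long rightSum = new SumTask(data, mid, to).compute();
        return rightSum + left.join();           // "reduce": combine the halves
    }

    public static void main(String[] args) {
        long[] data = new long[1000000];
        Arrays.fill(data, 1L);
        long sum = new ForkJoinPool().invoke(new SumTask(data, 0, data.length));
        System.out.println(sum); // 1000000
    }
}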
I created a one-off for myself a couple of years ago when I got an 8-core machine, but I wasn't terribly happy with it. I never got it to be as simple to use as I had hoped, and memory-intensive tasks didn't scale well.
If you don't get any real answers I can share more, but the core of it is:
public class LocalMapReduce<TMapInput, TMapOutput, TOutput> {
    private int m_threads;
    private Mapper<TMapInput, TMapOutput> m_mapper;
    private Reducer<TMapOutput, TOutput> m_reducer;
    ...
    public TOutput mapReduce(Iterator<TMapInput> inputIterator) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(m_threads);
        Set<Future<TMapOutput>> futureSet = new HashSet<Future<TMapOutput>>();
        // map phase: one pooled worker per input element
        while (inputIterator.hasNext()) {
            TMapInput m = inputIterator.next();
            Future<TMapOutput> f = pool.submit(m_mapper.makeWorker(m));
            futureSet.add(f);
            Thread.sleep(10); // crude throttle on submission
        }
        // reduce phase: poll for finished mappers and fold in their results
        while (!futureSet.isEmpty()) {
            Thread.sleep(5);
            for (Iterator<Future<TMapOutput>> fit = futureSet.iterator(); fit.hasNext();) {
                Future<TMapOutput> f = fit.next();
                if (f.isDone()) {
                    fit.remove();
                    TMapOutput x = f.get();
                    m_reducer.reduce(x);
                }
            }
        }
        pool.shutdown();
        return m_reducer.getResult();
    }
}
EDIT: Based on a comment, below is a version without sleep. The trick is to use CompletionService which essentially provides a blocking queue of completed Futures.
public class LocalMapReduce<TMapInput, TMapOutput, TOutput> {
    private int m_threads;
    private Mapper<TMapInput, TMapOutput> m_mapper;
    private Reducer<TMapOutput, TOutput> m_reducer;
    ...
    public TOutput mapReduce(Collection<TMapInput> input) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(m_threads);
        CompletionService<TMapOutput> futurePool =
            new ExecutorCompletionService<TMapOutput>(pool);
        for (TMapInput m : input) {
            futurePool.submit(m_mapper.makeWorker(m));
        }
        pool.shutdown();
        // take() blocks until the next mapper completes, so no polling is needed
        for (int i = 0; i < input.size(); i++) {
            m_reducer.reduce(futurePool.take().get());
        }
        return m_reducer.getResult();
    }
}
I'll also note that this is a very distilled map/reduce algorithm, including a single reduce worker which does both the reduce and merge operations.
I like to use Skandium for parallelism in Java. The framework implements certain patterns of parallelism (namely Master-Slave, Map/Reduce, Pipe, Fork and Divide & Conquer) for multi-core machines with shared memory. This technique is called "algorithmic skeletons". The patterns can be nested.
In detail, there are skeletons and muscles. Muscles do the actual work (split, merge, execute, and condition). Skeletons represent the patterns of parallelism, except for "While", "For" and "If", which can be useful when nesting patterns.
Examples can be found inside the framework. It took me a while to understand how to use the muscles and skeletons, but after getting over this hurdle I really like the framework. :)
Have you had a look at GridGain?
You might want to take a look at the project website of Functionals 4 Java: http://f4j.rethab.ch/ It introduces filter, map and reduce to Java versions before 8.
A MapReduce API was introduced in v3.2 of Hazelcast (see the MapReduce API section in the docs). While Hazelcast is intended for distributed systems, it works perfectly well in a single-node setup, and it's fairly lightweight.
You can try LeoTask, a parallel task-running and result-aggregation framework.

It is free and open-source: https://github.com/mleoking/leotask

Here is a brief introduction showing its API: https://github.com/mleoking/leotask/blob/master/leotask/introduction.pdf?raw=true

It is a lightweight framework working on a single computer using all its available CPU cores.
It has the following features:
Automatic & parallel parameter space exploration
Flexible & configuration-based result aggregation
Programming model focusing only on the key logic
Reliable & automatic interruption recovery
and the following utilities:
Dynamic & cloneable network structures
Integration with Gnuplot
Network generation according to common network models
DelimitedReader: a sophisticated reader that explores CSV (Comma-separated values) files like a database
Fast random number generator based on the Mersenne Twister algorithm
An integrated CurveFitter from the ImageJ project
I'm working on a Java application which should allow users to optimize their daily schedule. For that, I need a framework that helps calculate optimal times for "tasks", taking into account:
Required resources and resource usage limits
Dependencies between tasks (can do with only F->S relations though)
Earliest and latest start-finish times, slack times
Baseline vs. actual times - allowing to report actual start and finish times, updating the rest of the tasks accordingly
Some clarifications: I am not looking for a framework to draw these Gantt charts, nor a framework that deals with one specific problem domain (such as classrooms), and definitely not a framework that deals with thread scheduling.
Thanks!
I don't think there is a framework that will suit your needs out of the box. I know you said you're not looking for a job/thread scheduler, but I think your best bet is probably to roll your own optimization/prioritization code around a "dumb" job/thread scheduling framework like Quartz (or whatever you have in place). If you go with Quartz, the API can probably provide you with information useful for items 3 and 4 of your optimization criteria. Additionally, Quartz has a job "priority" concept, so once you've computed the optimized priority, scheduling the execution should be easy.
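A minimal sketch of that division of labour with the Quartz 2.x API (MyTask and the priority computation are hypothetical):

import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class PrioritizedScheduling {
    // the "dumb" job; your own code decides when and how urgently it runs
    public static class MyTask implements Job {
        @Override
        public void execute(JobExecutionContext ctx) {
            System.out.println("running task");
        }
    }

    public static void main(String[] args) throws Exception {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        scheduler.start();

        JobDetail job = JobBuilder.newJob(MyTask.class)
                .withIdentity("task-42")
                .build();

        int priority = 7; // output of your own optimization pass
        Trigger trigger = TriggerBuilder.newTrigger()
                .withPriority(priority) // Quartz runs higher-priority triggers first
                .startNow()
                .build();

        scheduler.scheduleJob(job, trigger);
        // call scheduler.shutdown(true) once all work is done
    }
}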
If you do find a framework that does what you ask, please post back here -- I'm sure there are others who could use something similar.
You could look at project management software. It seems you need one written in Java with the ability to modify the code. That really narrows down the list, but I did a quick scan and see at least two that could help (Endeavour and Project.net).
Perhaps what you need is something like an evolutionary/genetic algorithm to generate an optimized schedule?
If yes, you may have a look at this Watchmaker Framework:
http://watchmaker.uncommons.org/
With an evolutionary/genetic algorithm, the framework randomly generates a pool of schedules. Your main focus will be defining the scoring criteria used to evaluate each generated schedule. Then you let the pool of schedules evolve from generation to generation until one is good enough for you.