Simple Java Map/Reduce framework [closed] - java

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 8 years ago.
Can anyone point me at a simple, open-source Map/Reduce framework/API for Java? There doesn't seem to be much evidence of such a thing existing, but someone else might know different.
The best I can find is, of course, Hadoop MapReduce, but that fails the "simple" criterion. I don't need the ability to run distributed jobs, just something that lets me run map/reduce-style jobs on a multi-core machine, in a single JVM, using standard Java 5-style concurrency.
It's not a hard thing to write oneself, but I'd rather not have to.

Have you checked out Akka? While Akka is really a distributed Actor-model-based concurrency framework, you can implement a lot of things simply with little code. It's just so easy to divide work into pieces with it, and it automatically takes full advantage of a multi-core machine, as well as being able to use multiple machines to process work. Unlike using threads, it feels more natural to me.
I have a Java map/reduce example using Akka. It's not the easiest map/reduce example, since it makes use of futures, but it should give you a rough idea of what's involved. There are several major things that my map/reduce example demonstrates (a flavour of the approach is sketched after the list):
How to divide the work.
How to assign the work: Akka has a really simple messaging system as well as a work partitioner, whose schedule you can configure. Once I learned how to use it, I couldn't stop. It's just so simple and flexible. I was using all four of my CPU cores in no time. This is really great for implementing services.
How to know when the work is done and the result is ready to process: This is actually the portion that may be the most difficult and confusing to understand unless you're already familiar with Futures. You don't need to use Futures, since there are other options. I just used them because I wanted something shorter for people to grok.
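To give a flavour of the divide-and-distribute part, here is a minimal sketch assuming the classic Akka Java API; the actor, message, and pool names are mine, not taken from the linked example:

import akka.actor.AbstractActor;
import akka.routing.RoundRobinPool;

// Worker that "maps" one chunk and sends the partial result back to the
// sender, where the "reduce" side collects and combines the partials.
class MapWorker extends AbstractActor {
    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(int[].class, chunk -> {
                    long partial = 0;
                    for (int v : chunk) partial += v;
                    getSender().tell(partial, getSelf());
                })
                .build();
    }
}

// Usage sketch: a round-robin pool spreads chunks over all cores.
// ActorSystem system = ActorSystem.create("mapreduce");
// ActorRef mappers = system.actorOf(
//         new RoundRobinPool(Runtime.getRuntime().availableProcessors())
//                 .props(Props.create(MapWorker.class)), "mappers");
// mappers.tell(chunk, reducerRef); // reducerRef receives the Long partials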
If you have any questions, Stack Overflow actually has an awesome Akka Q&A section.

I think it is worth mentioning that these problems are history as of Java 8. An example:
int heaviestBlueBlock =
    blocks.stream()
          .filter(b -> b.getColor() == BLUE)
          .map(Block::getWeight)
          .reduce(0, Integer::max);
In other words: single-node MapReduce is available in Java 8.
For more details, see Brian Goetz's presentation about Project Lambda.
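Worth noting: the same pipeline spreads across all cores with a one-word change (same hypothetical Block type as above):

int heaviestBlueBlock =
    blocks.parallelStream()               // fans work out over the common ForkJoinPool
          .filter(b -> b.getColor() == BLUE)
          .map(Block::getWeight)
          .reduce(0, Integer::max);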

I use the following structure:

int procs = Runtime.getRuntime().availableProcessors();
ExecutorService es = Executors.newFixedThreadPool(procs);

List<Future<TaskResult>> results = new ArrayList<>();
for (int i = 0; i < tasks; i++)            // 'tasks' is the number of input chunks
    results.add(es.submit(new Task(i)));   // map phase: one Task per input
for (Future<TaskResult> future : results)
    reduce(future);                        // reduce phase: reduce(...) presumably calls future.get()
es.shutdown();

I realise this might be a little after the fact, but you might want to have a look at the JSR166y ForkJoin classes from JDK7.
There is a backported library that works under JDK6 without any issues, so you don't have to wait until the next millennium to have a go with it. It sits somewhere between a raw executor and Hadoop, giving a framework for working on map/reduce jobs within the current JVM.
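To give an idea of the flavour, here is a minimal fork/join sketch (my own illustration, not from the backport's docs): it sums an array by splitting until chunks are small enough to process inline, then combines the halves, which is map/reduce in miniature.

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000;
    private final long[] data;
    private final int lo, hi;

    SumTask(long[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {            // "map" phase: small chunk, do it inline
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) >>> 1;
        SumTask left = new SumTask(data, lo, mid);
        SumTask right = new SumTask(data, mid, hi);
        left.fork();                           // run the left half asynchronously
        return right.compute() + left.join();  // "reduce" phase: combine halves
    }
}

// Usage: long total = new ForkJoinPool().invoke(new SumTask(array, 0, array.length));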

I created a one-off for myself a couple of years ago when I got an 8-core machine, but I wasn't terribly happy with it. I never got it to be as simple to use as I had hoped, and memory-intensive tasks didn't scale well.
If you don't get any real answers I can share more, but the core of it is:
public class LocalMapReduce<TMapInput, TMapOutput, TOutput> {
    private int m_threads;
    private Mapper<TMapInput, TMapOutput> m_mapper;
    private Reducer<TMapOutput, TOutput> m_reducer;
    ...
    public TOutput mapReduce(Iterator<TMapInput> inputIterator)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(m_threads);
        Set<Future<TMapOutput>> futureSet = new HashSet<Future<TMapOutput>>();
        while (inputIterator.hasNext()) {
            TMapInput m = inputIterator.next();
            Future<TMapOutput> f = pool.submit(m_mapper.makeWorker(m));
            futureSet.add(f);
            Thread.sleep(10); // crude throttle on submission
        }
        while (!futureSet.isEmpty()) {
            Thread.sleep(5); // poll until some futures complete
            for (Iterator<Future<TMapOutput>> fit = futureSet.iterator(); fit.hasNext();) {
                Future<TMapOutput> f = fit.next();
                if (f.isDone()) {
                    fit.remove();
                    TMapOutput x = f.get();
                    m_reducer.reduce(x);
                }
            }
        }
        pool.shutdown();
        return m_reducer.getResult();
    }
}
EDIT: Based on a comment, below is a version without sleep. The trick is to use CompletionService which essentially provides a blocking queue of completed Futures.
public class LocalMapReduce<TMapInput, TMapOutput, TOutput> {
    private int m_threads;
    private Mapper<TMapInput, TMapOutput> m_mapper;
    private Reducer<TMapOutput, TOutput> m_reducer;
    ...
    public TOutput mapReduce(Collection<TMapInput> input)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(m_threads);
        CompletionService<TMapOutput> futurePool =
                new ExecutorCompletionService<TMapOutput>(pool);
        Set<Future<TMapOutput>> futureSet = new HashSet<Future<TMapOutput>>();
        for (TMapInput m : input) {
            futureSet.add(futurePool.submit(m_mapper.makeWorker(m)));
        }
        pool.shutdown();
        int n = futureSet.size();
        for (int i = 0; i < n; i++) {
            m_reducer.reduce(futurePool.take().get()); // take() blocks until a result is ready
        }
        return m_reducer.getResult();
    }
}
I'll also note this is a very distilled map/reduce algorithm, including a single reduce worker which does both the reduce and merge operations.

I like to use Skandium for parallelism in Java. The framework implements certain patterns of parallelism (namely Master-Slave, Map/Reduce, Pipe, Fork and Divide & Conquer) for multi-core machines with shared memory. This technique is called "algorithmic skeletons". The patterns can be nested.
In detail there are skeletons and muscles. Muscles do the actual work (split, merge, execute and condition). Skeletons represent the patterns of parallelism, except for "While", "For" and "If", which can be useful when nesting patterns.
Examples can be found inside the framework. It took me a while to understand how to use the muscles and skeletons, but after getting over this hurdle I really like this framework. :)

Have you had a look at GridGain ?

You might want to take a look at the project website of Functionals 4 Java: http://f4j.rethab.ch/ It introduces filter, map and reduce to Java versions before 8.

A MapReduce API was introduced in v3.2 of Hazelcast (see the MapReduce API section in the docs). While Hazelcast is intended to be used in a distributed system, it works perfectly well in a single-node setup, and it's fairly lightweight.

You can try LeoTask: a parallel task running and results aggregation framework.
It is free and open-source: https://github.com/mleoking/leotask
Here is a brief introduction showing its API: https://github.com/mleoking/leotask/blob/master/leotask/introduction.pdf?raw=true
It is a lightweight framework working on a single computer using all its available CPU cores.
It has the following features:
Automatic & parallel parameter space exploration
Flexible & configuration-based result aggregation
Programming model focusing only on the key logic
Reliable & automatic interruption recovery
and utilities:
Dynamic & cloneable network structures.
Integration with Gnuplot
Network generation according to common network models
DelimitedReader: a sophisticated reader that explores CSV (Comma-separated values) files like a database
Fast random number generator based on the Mersenne Twister algorithm
An integrated CurveFitter from the ImageJ project

Related

Non-Toy Software Transactional Memory for C or Java

I'm thinking about the possibility of teaching the use of Software Transactional Memory through one or two guided laboratories for a university course. I only know about Haskell's STM, but the students of the course have probably never heard a word about it.
I already found some lists of such libraries online or in other questions (e.g., http://en.wikipedia.org/wiki/Software_transactional_memory#C.2FC.2B.2B). I'm checking them out as you read this, but many of them do not seem to have very nice documentation (most are research prototypes only vaguely described in papers, and I would rather teach about something more used and well documented).
Furthermore, many of the links provided by Wikipedia are dangling.
To sum it up: are there STM implementations aimed at industrial projects (or at least non-toy ones, to ensure a certain level of quality) and well documented (to give some good pointers to the students)?
EDIT: I'm not the teacher of the course, I just help him with the laboratories. Of course the students will be taught basics of concurrency and distributed algorithms before. This was just an idea to propose something different towards the end of the course.
Production-quality STM libraries are not intended as teaching tools, not even as "best practice". What is worth learning for any college/university course is maybe 1% of the code; the remaining 99% is nitty-gritty, platform-dependent, intrinsic corner cases. The 1% that is interesting is not highlighted in any way, so you have no way of finding it.
What I recommend for a college/university course (no matter if introductory or advanced) is to implement STM building blocks yourself (and only for one platform).
Start by introducing the problems: concurrency, cache...
Then introduce the atomic helpers we have: cas/cmpxchg, fence.
Then build examples together with your students, first easy, then harder and more complex; a starting-point sketch follows.
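For the very first building block, a minimal sketch using java.util.concurrent.atomic: a lock-free counter written as an explicit compare-and-set retry loop, the same primitive an STM commit ultimately rests on. (AtomicInteger already has incrementAndGet(); spelling out the loop is the point of the exercise.)

import java.util.concurrent.atomic.AtomicInteger;

class CasCounter {
    private final AtomicInteger value = new AtomicInteger();

    int increment() {
        while (true) {
            int current = value.get();
            int next = current + 1;
            // compareAndSet publishes 'next' only if nobody raced us;
            // otherwise loop and retry against the fresh value.
            if (value.compareAndSet(current, next)) {
                return next;
            }
        }
    }
}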
Leading on from eznme's answer ("start by introducing the problems"), here are some good problems that I covered at university for concurrency.
Dining philosophers problem
In computer science, the dining philosophers problem is an example problem often used in concurrent algorithm design to illustrate synchronization issues and techniques for resolving them.
Using the same implementation from here, by Jeff Magee and Jeff Kramer, and solving the problem using monitors.
Most shared-memory applications are more efficient with Integers than Strings (thanks to Java's AtomicInteger class). So the best way to demonstrate shared memory, in my opinion, is to get the students to write an application that uses a thread pool to calculate prime numbers, or to calculate some integral; a sketch of the prime-number exercise follows.
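A minimal sketch of that exercise (pool size and range are arbitrary choices):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Thread-pool exercise: check a range of numbers for primality in parallel.
class PrimeCounter {
    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int i = 2; (long) i * i <= n; i++)
            if (n % i == 0) return false;
        return true;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Boolean>> checks = IntStream.range(2, 100_000)
                .mapToObj(n -> pool.submit(() -> isPrime(n)))
                .collect(Collectors.toList());
        int primes = 0;
        for (Future<Boolean> f : checks)
            if (f.get()) primes++;        // get() blocks until that check is done
        pool.shutdown();
        System.out.println(primes + " primes found");
    }
}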
Or a good example of threads and shared memory is the Producer-consumer problem.
The producer-consumer problem (also known as the bounded-buffer problem) is a classic example of a multi-process synchronization problem.
An implementation can be found here; there is also an implementation from Massey University by software engineering professor Jens Dietrich. For comparison with those monitor-based versions, a java.util.concurrent sketch follows.
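The bounded buffer in a few lines, letting BlockingQueue do the wait/notify bookkeeping:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class ProducerConsumer {
    public static void main(String[] args) {
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(10);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 100; i++) buffer.put(i);  // blocks when full
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < 100; i++)
                    System.out.println(buffer.take());        // blocks when empty
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        producer.start();
        consumer.start();
    }
}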
For distributed algorithms, MapReduce and Hadoop are highly documented distributed frameworks. And for distributed programming libraries, look into MPI (Message Passing Interface) and OpenMP (or pragmas for C++). There are also parallel implementations of Dijkstra's shortest-path algorithm.
There are three good ways to do STM today.
The first way is to use gcc and do TM in C or C++. As of gcc 4.7, transactional memory is supported via the -fgnu-tm flag. The gcc maintainers have done a lot of work, and as of the 4.9 (trunk) branch, you can even use hardware TM (e.g., Intel Haswell TSX). There is a draft specification for the interface to the TM at http://justingottschlich.com/tm-specification-for-c-v-1-1/, which is not too painful. You can also find use cases of gcc's TM from the TM community (see, for example, the application track papers from transact 2014: http://transact2014.cse.lehigh.edu).
The implementation itself is a bit complex, but that's what it takes to be correct. There's a lot of literature on the things that can go wrong, especially in a type-unsafe language like C or C++. GCC gets all of these things right. Really.
The second way is to use Java. You can either use DeuceSTM, which is very easy to extend (type safety makes TM implementation much easier!), or use Scala's Akka library for STM. I prefer Deuce, because it's easier to extend and easier to use (you just annotate a method as @Atomic, and Deuce's java agents do the rest).
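To show how little ceremony Deuce needs, a minimal sketch (the class and fields are my own illustration): a transaction is just an annotated method, instrumented at load time by Deuce's java agent.

import org.deuce.Atomic;

// Run with Deuce's agent, e.g. java -javaagent:deuceAgent.jar ...
// The @Atomic body executes as one transaction: both writes commit or neither does.
class Account {
    private int balance;

    @Atomic
    void transferTo(Account other, int amount) {
        this.balance -= amount;
        other.balance += amount;
    }
}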
The third way is to use Scala. I've not done much in this space, but researchers seem to love Akka. If you're affiliated with a parallel/distributed class, you might even be using Scala already.

Profiling a Java EE applications - What to look for and what changes to make?

I am a bit new to profiling applications for improving performance. I have selected YourKit as my profiler. There is no doubt that YourKit provides very interesting statistics. Where I am getting stuck is what to do with these statistics.
For instance, consider a method that operates on a JAXB POJO. The method iterates through the POJO to access a tag/element that is deeply nested inside the XML. This requires four layers of for loops to get to the element/tag, as shown below:
List<Bundle> bundles = null;
List<Item> items = null;
for (Info info : data) {
    bundles = info.getBundles();
    for (Bundle bundle : bundles) {
        items = bundle.getItems();
        // .. more loops like this till we get to the required element
    }
}
YourKit tells me that the above code is a 'hot spot' and that 80 objects are getting garbage collected for each call to the method containing this code. The above code is just an example and not the only part where I am getting stuck. Most of the time I have no clue what to do with the information given by the profiler. What can I possibly do to reduce the number of temporary objects in the above code? Are there any well-defined principles for improving the performance of an application? What statistics should I look for when profiling an application, and what implications does each kind of statistic have?
Edit :
The main objective for profiling the application is to increase the throughput and response time. The current throughput is only 10 percent of the required throughput!
Focus on the statistics relevant to your performance goal. You are interested in minimal response time, so look at how much each method contributes to response time, and focus on those that contribute most (for single-threaded processing, that's simply the elapsed time during the method call, summed over all invocations of that method). I am not sure what YourKit defines as hot spots (check the docs), but it's probably the methods with the highest cumulative elapsed time, so hot spots are a good thing to look at. In contrast, object allocation has no direct impact on response time and is irrelevant in your case (unless you have identified that the garbage collector contributes a significant proportion of CPU time, which it usually doesn't).
I absolutely agree with the given answers.
I would like to add that, considering your concrete example, you can actually make an improvement by using the XPath API to access the specific location in the XML.
In situations where you don't need to actually iterate the entire DOM, this should be your first choice since it is declarative and hence more expressive and less error prone.
It would often give you superior performance as well (for very complex queries it may not be the case, but you seem to have a simple scenario).
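A sketch of what that looks like with the JDK's javax.xml.xpath API; the element names are invented, so adjust the path to the real schema:

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class ItemSelector {
    // Selects the deeply nested items in one declarative step
    // instead of four nested loops.
    static NodeList selectItems(Document doc) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (NodeList) xpath.evaluate(
                "/data/info/bundles/bundle/items/item", doc, XPathConstants.NODESET);
    }
}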
A way to improve the loop would be to change your schema and essentially flatten the model; of course, this depends on whether you can change the schema. That way the generated Java will not require four layers of looping. At the end of the day, though, you need to ask yourself whether the code is really a problem. 80 objects are getting GCed - so what? Is your application running slow? Are you experiencing memory issues? Remember, premature optimization is the root of all evil!
Profiling and optimization is a complex beast and depends on many things (Java version, 32- vs 64-bit OS, etc.). Furthermore, the optimization might not always require code changes; for example, you could resolve problems by changing the GC policy on the JVM - there are GC policies that are more effective when your code creates many small objects that need to be GCed frequently. If you gave specifics, maybe it would be easier to help you, but your question seems too broad. In fact, there are many books written on the topic which might be worth a read.

Lightweight microbenchmark library with graph output (Java)

Is there a good Java library for taking the legwork out of writing good micro-benchmarks? I'm thinking of something which can provide (with a minimum of hassle) text (CSV or HTML, take your pick) output of results and maybe graphs summarizing results. Ideally, it should be something that plays nicely with JUnit or equivalent, and should make it simple to configure benchmarks with variable parameters.
I've looked at Japex, but found it too heavyweight (25 MB of libraries to include?!) and frankly it was just a pain to work with: virtually nonexistent documentation, mucking about with Ant, XML, and paths... etc.
A few of us from the Google Collections team are in the early days of building something that satisfies your needs. Here's the code to measure how long foo() takes:
public class Benchmark1 extends SimpleBenchmark {
    public void timeFoo(int reps) {
        for (int i = 0; i < reps; i++) {
            foo();
        }
    }
}
Neither the API nor the tool itself is particularly stable. We aren't even ready to receive bug reports or feature requests! If I haven't scared you off yet, I invite you to take Caliper for a spin.
Oracle now has JMH. Not only is it written by members of the JIT team (which takes much of the legwork out of writing good micro-benchmarks), but it also has other neat features like pluggable profilers (including ones that will print the assembly of your hotspots with per-line CPU time).
It prints out tables. Not sure about graphs. The benchmarks can be configured with variable parameters. The documentation is fairly good.
It is easy to set up and get going. I've got it integrated with JUnit, but the developers provide a Maven archetype to get started.
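For a taste, here is roughly what a JMH benchmark with a variable parameter looks like (a minimal sketch; the benchmark and parameter names are mine, and you run it through JMH's generated runner):

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class BuilderBenchmark {
    @Param({"10", "100", "1000"})   // JMH runs the benchmark once per value
    int size;

    @Benchmark
    public String buildString() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < size; i++) {
            sb.append('x');
        }
        return sb.toString();       // returning the result defeats dead-code elimination
    }
}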

Building a Java based stock trading application, need pointers for technologies to use

I am building an application in Java (with a jQuery frontend) that needs to talk to a third-party application. It needs to update the interface every two seconds at most.
Would it be a good idea to use Comet? If so, how does it fit into the picture?
What other means/technologies can I use to make the application better?
The application will poll stock prices from a third-party app, write them to a database, and then push them to the front end every second. For the polling, I have a timer that runs every second to call the third-party app for data; I then have to display it on the front end using JSP or something.
At this point I'm not sure if I should use a servlet to write this out to the front end. What would you recommend? How should I go about it?
Is there any new technology that I can use instead of servlets?
I am also using Berkeley DB to store the data. Do you think it's a good option? What would be the drawbacks, if any, of using Berkeley DB?
I'm absolutely clueless, so any advice will be much appreciated.
Thanks!
Edit: I am planning to do this so that a desktop app constantly polls from the third party and writes to the database, while a web app only reads from the database and displays the data. This will reduce the load on the web app: all it has to do is read from the DB.
Take a look at using a web application framework instead of servlets - unless it's a really basic project with one screen. There are lots in the Java world, unfortunately, and it can be a bit of a minefield. Stick with maybe Spring MVC or Struts 2; the worst part is setting these up, but take a look at a sample application plus a tutorial or two and work from there.
http://www.springsource.org/about
http://struts.apache.org/2.x/index.html
Another option to look at is using a template framework such as Appfuse to get yourself up and running without having to integrate a lot of the framework together, see:
http://appfuse.org/display/APF/AppFuse+QuickStart
It provides you with a template to set up Spring MVC with MySQL as a database, plus Spring as a POJO framework. It may be a quick way to get started and up and building a prototype.
Judging by your latency requirement of 2 seconds it would be wise to look at some sort of AJAX framework - JQuery or Prototype/Scriptaculous are both good places to start.
http://jquery.com/
http://www.prototypejs.org/
In terms of other technologies to make things better, you will want to consider a build system; Ant/Maven are fine, with Maven the slightly more complex of the two.
http://ant.apache.org/
http://maven.apache.org/download.html
Also, consider JUnit for testing the application. You might want to consider Selenium for functional testing of the front end.
http://www.junit.org
http://seleniumhq.org/
Is this really a stock trading application? Or just a stock price display application? I am asking because from your description it sounds like the latter.
How critical is it that data is polled every second? Specifically would it matter if some polls are a second or two late?
If you are building a stock trading application (where the timing is absolutely critical), or if you cannot afford to be delayed on your polling, I'd recommend you have a look at one of the Java Real Time solutions:
Sun Java Real-Time System (http://java.sun.com/javase/technologies/realtime/index.jsp)
WebSphere Real Time (http://www-01.ibm.com/software/webservers/realtime/)
Oracle JRockit Real Time (http://download.oracle.com/docs/cd/E13150_01/jrockit_jvm/jrockit/docs30/index.html)
Other than that, my only advice is that you stick to good OO design practices. For instance, use a DAO to write to your database; this way, if you find that Berkeley DB isn't quite for you, you can switch to a relational database system with relative ease. It also makes it easy for you to move on to database partitioning solutions (e.g., Hibernate Shards) if you decide you need them.
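For example, the DAO boundary can be as thin as the sketch below (Quote and the DAO names are hypothetical). The rest of the app only ever sees the interface, so the Berkeley DB implementation can later be swapped for a relational one without touching calling code.

import java.math.BigDecimal;

class Quote {
    String symbol;
    BigDecimal price;
    long timestamp;
}

interface QuoteDao {
    void save(Quote quote);
    Quote findLatest(String symbol);
}

// Today's implementation; replace with a JDBC/Hibernate DAO later.
class BerkeleyDbQuoteDao implements QuoteDao {
    @Override
    public void save(Quote quote) {
        // write to Berkeley DB
    }

    @Override
    public Quote findLatest(String symbol) {
        // read from Berkeley DB
        return null;
    }
}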
While I may have my own technology preferences (for instance, I'd choose Spring MVC for the front end as others have mentioned, and I'd try to use Hibernate for persistence), I really cannot claim that these would be better than other technologies out there. Go with something you are familiar with, if it fits the bill.
I think you should focus on your architectural design before picking technologies with a focus on scalability and extendability. Once an architectural design is in place you can look to see what's available and what you need to build, all of which should be pretty obvious.
While not directly comparable look at how Google, eBay and YouTube deal with the scalability problems they face. While a trading system won't have the issues these guys have with sheer numbers of users, you'll get similar problems with data volumes and being able to process price ticks in a timely fashion.
The LSE has getting on for 3000 names, multiply this by the 10 or so popular exchanges round the world and you've got a lot of data being updated continuously over the period each market is open. To give you an idea of what involved in capturing data from a single exchange take a look at http://kx.com/.
From a database perspective, you're going to need something industrial-strength that allows clustering and has reliable replication - for me, this means Oracle. You also want to look at a time-series database design, which in my experience is the best way to build this sort of system.
The same scaling and reliability requirements will apply to your app servers, with JBoss being the logical choice there, although I'd also consider the OSGi Spring Server (http://www.springsource.com/products/dmserver), as its lightweight nature could make it faster.
You'll also want Apache servers for load balancing and to serve static content - a quick Google will turn up stacks of information on that so I won't repeat it here.
Also, forget polling; it doesn't scale. Look at using messaging and consumer processes for cross-process communication, and events and worker threads for in-process communication. Both techniques achieve a natural load-balancing effect that can be tuned by increasing the number of consumer processes or worker threads as needed.
Also, a static front end isn't going to cut the mustard, IMHO. Take a look at what's already out in the market - CMC Markets, IG Index, etc. all have pretty impressive real-time trading apps.
As an aside, assuming this is a commercial project and not meaning to put a downer on the whole thing: companies like CMC Markets, IG Index, etc. make their money from trading fees; the software is a means to an end, which you get access to for free simply by having an account. The other target for trading software is commercial institutions such as banks, investment managers, etc. I'd want a pretty watertight plan for how I was going to break into either of these markets before expending too much time and effort.
PostgreSQL is probably the right database. It's a little more enterprisey than MySQL. As for the front end, there's lots of stuff that can go "on top" of servlets: Spring MVC, Tapestry, and so on and so forth. The actual servlet implementation will be hidden from you.
Many will suggest using Spring to configure the application and to do any dependency injection, and it's probably not a bad suggestion.
If you're looking for something a little more lightweight, you might consider Grails. It's quick to develop with and becoming mature.
Really, though, it's kind of hard to recommend things without knowing what kind of "production" environment this would be. Are we talking lots of transactions? (Sure, it's a stock trading program, but is it a simulation with a small number of users, etc.?) It's fun to suggest things, but if you're serious, I'm not sure I would start a major project like this. There are lots of ways to do this, and lots of ways to do it wrong.
Your intention is to build a web UI which shows real-time data, e.g., time, market data, etc.
One of the technologies I have personally used is Web Firm Framework, an open-source framework under Apache License 2.0. It is a Java server-side framework to build web UIs. For each and every tag & attribute there is a corresponding Java class. We are just building the UI with Java code instead of pure HTML and JavaScript. The advantage is that whatever changes we make in the server-side tag & attribute objects will be reflected in the browser page without any explicit trigger from the client. In your case, we can simply use a ScheduledExecutorService to make data changes in the UI.
Eg:
AtomicReference<BigDecimal> oneUSDToOneGBPRef =
        new AtomicReference<>(new BigDecimal("0.77"));
SharedTagContent<BigDecimal> amountInBaseCurrencyUSD =
        new SharedTagContent<>(BigDecimal.ZERO);

Div usdToGBPDataDiv = new Div(null).give(dv -> {
    // the second argument is a formatter
    new Span(dv).subscribeTo(amountInBaseCurrencyUSD, content -> {
        BigDecimal amountInUSD = content.getContent();
        if (amountInUSD != null) {
            return new SharedTagContent.Content<>(amountInUSD.toPlainString(), false);
        }
        return new SharedTagContent.Content<>("-", false);
    });
    new Span(dv).give(spn -> {
        new NoTag(spn, " USD to GBP: ");
    });
    new Span(dv).subscribeTo(amountInBaseCurrencyUSD, content -> {
        BigDecimal amountInUSD = content.getContent();
        if (amountInUSD != null) {
            BigDecimal oneUSDToOneGBP = oneUSDToOneGBPRef.get();
            BigDecimal usdToGBP = amountInUSD.multiply(oneUSDToOneGBP);
            return new SharedTagContent.Content<>(usdToGBP.toPlainString(), false);
        }
        return new SharedTagContent.Content<>("-", false);
    });
});

amountInBaseCurrencyUSD.setContent(BigDecimal.ONE);

// just to test
// will print <div><span>1</span><span> USD to GBP: </span><span>0.77</span></div>
System.out.println(usdToGBPDataDiv.toHtmlString());

ScheduledExecutorService scheduledExecutorService =
        Executors.newScheduledThreadPool(1);
Runnable task = () -> {
    // dynamically get the USD-to-GBP exchange rate
    oneUSDToOneGBPRef.set(new BigDecimal("0.77"));
    // to push the latest converted value to the browser
    amountInBaseCurrencyUSD.setContent(amountInBaseCurrencyUSD.getContent());
};
// runs once after 1 second; use scheduleAtFixedRate for periodic updates
ScheduledFuture<?> scheduledFuture =
        scheduledExecutorService.schedule(task, 1, TimeUnit.SECONDS);
// to cancel the realtime update
// scheduledFuture.cancel(false);
For displaying time in real time, you can use SharedTagContent<Date> and ContentFormatter<Date> to show time in a specific timezone. You can watch this video for better understanding. You can also download sample projects from this GitHub repository.

Is there a .Net equivalent to java.util.concurrent.Executor?

I have a long-running set of discrete tasks: parsing tens of thousands of lines from a text file, hydrating them into objects, manipulating them, and persisting them.
If I were implementing this in Java, I suppose I might add a new task to an Executor for each line in the file, or a task per X lines (i.e., chunks) - a sketch of what I mean is below.
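For reference, the Java version of that idea might look like this sketch (chunk size and the parse/persist step are placeholders):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// One task per chunk of lines: parse, hydrate, persist.
class ChunkProcessor {
    void processAll(List<String> lines) throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        int chunkSize = 1000;                       // "task per X lines"
        for (int i = 0; i < lines.size(); i += chunkSize) {
            List<String> chunk =
                    lines.subList(i, Math.min(i + chunkSize, lines.size()));
            executor.submit(() -> chunk.forEach(this::parseAndPersist));
        }
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.HOURS);
    }

    private void parseAndPersist(String line) {
        // hydrate into an object, manipulate, persist
    }
}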
For .Net, which is what I am using, I'm not so sure. I have a suspicion maybe CCR might be appropriate here, but I'm not familiar enough with it, which is why I pose this question.
Can CCR function in an equivalent fashion to Java Executors, or is there something else available?
Thanks
You may want to look at the Task Parallel Library.
As of C# 5 this is built into the language using the async and await keywords.
If you're going to ask a bunch of .NET people what's closest to being equivalent to Java Executors, it might not hurt to describe the distinguishing features of Java Executors. The person who knows your answer may not be any more familiar with Java than you are with .NET.
That said, if the already-mentioned Task Parallel Library is overkill for your needs, or you don't want to wait for .NET 4.0, perhaps ThreadPool.QueueUserWorkItem() would be what you're looking for.
Maybe this is related: Design: Task Parallel Library explored.
See 10-4 Episode 6: Parallel Extensions as a quick intro.
For older thread-based approach, there's ThreadPool for pooling.
The BackgroundWorker class is probably what you're looking for. As the name implies, it allows you to run background tasks, with automatically managed pooling, and status update events.
For anyone looking for a more contemporary solution (as I was), check out the EventLoopScheduler class.
