Parallelize search in a Java set - java

I have a List<String> called lines and a huge (~3G) Set<String> called voc. I need to find all lines from lines that are in voc. Can I do this in a multithreaded way?
Currently I have this straightforward code:
for(String line: lines) {
if (voc.contains(line)) {
// Great!!
}
}
Is there a way to search for several lines at the same time? Maybe there are existing solutions?
PS: I am using javolution.util.FastMap, because it behaves better during filling up.

Here is a possible implementation. Please note that error/interruption handling has been omitted but this might give you a starting point. I included a main method so you could copy and paste this into your IDE for a quick demo.
Edit: Cleaned things up a bit to improve readability and List partitioning
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
public class ParallelizeListSearch {
public static void main(String[] args) throws InterruptedException, ExecutionException {
List<String> searchList = new ArrayList<String>(7);
searchList.add("hello");
searchList.add("world");
searchList.add("java");
searchList.add("debian");
searchList.add("linux");
searchList.add("jsr-166");
searchList.add("stack");
Set<String> targetSet = new HashSet<String>(searchList);
Set<String> matchSet = findMatches(searchList, targetSet);
System.out.println("Found " + matchSet.size() + " matches");
for(String match : matchSet){
System.out.println("match: " + match);
}
}
public static Set<String> findMatches(List<String> searchList, Set<String> targetSet) throws InterruptedException, ExecutionException {
Set<String> locatedMatchSet = new HashSet<String>();
int threadCount = Runtime.getRuntime().availableProcessors();
List<List<String>> partitionList = getChunkList(searchList, threadCount);
if(partitionList.size() == 1){
//if we only have one "chunk" then don't bother with a thread-pool
locatedMatchSet = new ListSearcher(searchList, targetSet).call();
}else{
ExecutorService executor = Executors.newFixedThreadPool(threadCount);
CompletionService<Set<String>> completionService = new ExecutorCompletionService<Set<String>>(executor);
for(List<String> chunkList : partitionList)
completionService.submit(new ListSearcher(chunkList, targetSet));
for(int x = 0; x < partitionList.size(); x++){
Set<String> threadMatchSet = completionService.take().get();
locatedMatchSet.addAll(threadMatchSet);
}
executor.shutdown();
}
return locatedMatchSet;
}
private static class ListSearcher implements Callable<Set<String>> {
private final List<String> searchList;
private final Set<String> targetSet;
private final Set<String> matchSet = new HashSet<String>();
public ListSearcher(List<String> searchList, Set<String> targetSet) {
this.searchList = searchList;
this.targetSet = targetSet;
}
@Override
public Set<String> call() {
for(String searchValue : searchList){
if(targetSet.contains(searchValue))
matchSet.add(searchValue);
}
return matchSet;
}
}
private static <T> List<List<T>> getChunkList(List<T> unpartitionedList, int splitCount) {
int totalProblemSize = unpartitionedList.size();
int chunkSize = (int) Math.ceil((double) totalProblemSize / splitCount);
List<List<T>> chunkList = new ArrayList<List<T>>(splitCount);
int offset = 0;
int limit = 0;
for(int x = 0; x < splitCount; x++){
limit = offset + chunkSize;
if(limit > totalProblemSize)
limit = totalProblemSize;
List<T> subList = unpartitionedList.subList(offset, limit);
chunkList.add(subList);
offset = limit;
}
return chunkList;
}
}
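For comparison, here is a minimal sketch of the same search using a parallel stream, assuming Java 8 or later is available. The class name StreamSearchDemo and the sample data are made up for illustration; the only requirement carried over from the question is that voc is not modified while the search runs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class StreamSearchDemo {
    // Returns every element of lines that is also contained in voc.
    // The stream framework splits the list across the common fork/join pool.
    static Set<String> findMatches(List<String> lines, Set<String> voc) {
        return lines.parallelStream()
                .filter(voc::contains)          // read-only lookup, safe if voc is not modified
                .collect(Collectors.toSet());   // thread-safe collection of the results
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello", "world", "java", "linux", "stack");
        Set<String> voc = new HashSet<>(Arrays.asList("java", "linux", "debian"));
        System.out.println("Found " + findMatches(lines, voc).size() + " matches");
    }
}
```

Whether this is faster than the plain loop depends on the size of lines; for short lists the cost of splitting the work can outweigh the gain.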

Simply splitting lines among different threads would (on the Oracle JVM at least) spread the work across all CPUs, if that is what you are looking for.
I like using CyclicBarrier; it makes coordinating those threads easier.
http://javarevisited.blogspot.cz/2012/07/cyclicbarrier-example-java-5-concurrency-tutorial.html

It's absolutely possible to parallelize this using multiple threads. You could do the following:
Break up the list into different "blocks," one per thread that will do the search.
Have each thread look over its block, checking whether each string is in the set, and if so adding the string to the resulting set.
For example, you might have the following thread routine:
public void scanAndAdd(List<String> allStrings, Set<String> toCheck,
ConcurrentSet<String> matches, int start, int end) {
for (int i = start; i < end; i++) {
if (toCheck.contains(allStrings.get(i))) {
matches.add(allStrings.get(i));
}
}
}
You could then spawn off as many threads as you needed to run the above method and wait for all of them to finish. The resulting matches would then be stored in matches.
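For simplicity, I've had the output set be a ConcurrentSet, which automatically eliminates race conditions due to writes. (The JDK has no class by that name; a concurrent set such as the one returned by ConcurrentHashMap.newKeySet() fills the role.) Since you are only doing reads on the list of strings and set of strings to check for, no synchronization is required when reading from allStrings or performing lookups in toCheck, provided neither is modified while the threads run.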
Hope this helps!
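The steps above can be sketched as a complete program. This is only an illustration under a few assumptions: the class name BlockScanDemo, the findMatches helper, and the sample data are invented here, and ConcurrentHashMap.newKeySet() (Java 8+) stands in for the concurrent output set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class BlockScanDemo {
    // Scans allStrings[start, end) and records every string that is also in toCheck.
    static void scanAndAdd(List<String> allStrings, Set<String> toCheck,
                           Set<String> matches, int start, int end) {
        for (int i = start; i < end; i++) {
            String s = allStrings.get(i);
            if (toCheck.contains(s)) {
                matches.add(s);
            }
        }
    }

    // Splits the list into threadCount blocks (threadCount must be positive),
    // scans each block on its own thread, and waits for all threads to finish.
    static Set<String> findMatches(List<String> allStrings, Set<String> toCheck, int threadCount) {
        Set<String> matches = ConcurrentHashMap.newKeySet(); // thread-safe output set
        int blockSize = (allStrings.size() + threadCount - 1) / threadCount; // ceiling division
        List<Thread> threads = new ArrayList<>();
        for (int t = 0; t < threadCount; t++) {
            final int start = Math.min(t * blockSize, allStrings.size());
            final int end = Math.min(start + blockSize, allStrings.size());
            Thread thread = new Thread(() -> scanAndAdd(allStrings, toCheck, matches, start, end));
            threads.add(thread);
            thread.start();
        }
        try {
            for (Thread thread : threads) {
                thread.join(); // join makes each worker's writes visible to the caller
            }
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for workers", ex);
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello", "world", "java", "linux", "stack");
        Set<String> voc = new HashSet<>(Arrays.asList("java", "linux", "debian"));
        System.out.println(findMatches(lines, voc, 3));
    }
}
```

Raw threads keep the sketch close to the description; in production code an ExecutorService would usually manage the threads instead.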

Another option would be to use Akka, it does these kinds of things quite simply.
Having done some search work with Akka, I can tell you that it supports two ways of parallelizing such things: Composable Futures or Agents. For what you want, Composable Futures would be completely sufficient. Akka is actually not adding that much here: Netty provides the massively parallel I/O infrastructure, and Futures are part of the JDK, but Akka does make it very simple to put the two together and extend them when needed.

Related

Functions instead of static utility methods

Despite Functions being around in Java since Java 8, I started playing with them only recently. Hence this question may sound a little archaic, kindly excuse.
At the outset, I am talking about a pure function written in full conformance with the functional programming definition: deterministic and immutable.
Say, I have a frequent necessity to prepend a string with another static value. Like the following for example:
private static Function<String, String> fnPrependString = (s) -> {
return "prefix_" + s;
};
In the good old approach, the Helper class and its static methods would have been doing this job for me.
The question now is, whether I can create these functions once and reuse them just like helper methods.
One concern is thread safety. I checked it with this simple JUnit test:
package com.me.expt.lt.test;
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Consumer;
import java.util.function.Function;
import org.junit.jupiter.api.Test;
import com.vmlens.api.AllInterleavings;
public class TestFunctionThreadSafety {
private static Function<String, String> fnPrepend = (s) -> {
System.out.println(s);
return new StringBuffer("prefix_").append(s).toString();
};
@Test
public void testThreadSafety() throws InterruptedException {
try (AllInterleavings allInterleavings = new AllInterleavings(
TestFunctionThreadSafety.class.getCanonicalName());) {
ConcurrentMap<String, Integer> resultMap = new ConcurrentHashMap<String, Integer>();
while (allInterleavings.hasNext()) {
int runSize = 5;
Thread[] threads = new Thread[runSize];
ThreadToRun[] ttrArray = new ThreadToRun[runSize];
StringBuffer sb = new StringBuffer("0");
for (int i = 0; i < runSize; i++) {
if (i > 0)
sb.append(i);
ttrArray[i] = new ThreadToRun();
ttrArray[i].setS(sb.toString());
threads[i] = new Thread(ttrArray[i]);
}
for (int j = 0; j < runSize; j++) {
threads[j].start();
}
for (int j = 0; j < runSize; j++) {
threads[j].join();
}
System.out.println(resultMap);
StringBuffer newBuffer = new StringBuffer("0");
for (int j = 0; j < runSize; j++) {
if(j>0)
newBuffer.append(j);
assertEquals("prefix_" + newBuffer, ttrArray[j].getResult(), j + " fails");
}
}
}
}
private static class ThreadToRun implements Runnable {
private String s;
private String result;
public String getS() {
return s;
}
public void setS(String s) {
this.s = s;
}
public String getResult() {
return result;
}
@Override
public void run() {
this.result = fnPrepend.apply(s);
}
}
}
I am using vmlens. I can tune the test by setting the runSize variable to any number I choose, so that more interleavings are exercised. The objective is to see whether multiple threads using the same function mix up their inputs because of concurrent access. The test did not report any failures. Please also comment on whether the test meets its goal.
I also tried to understand the internal VM end of how lambdas are executed from here. Even as I look for somewhat simpler articles that I can understand these details faster, I did not find anything that says "Lambdas will have thread safety issues".
Assuming the test case meets my goal, the consequential questions are:
Can we replace static helper classes with variables holding immutable, deterministic functions like fnPrepend? The objective is simply to provide more readable code and, of course, to move away from the "not so object-oriented" criticism of static methods.
Is there a source with a simpler explanation of how lambdas work inside the VM?
Can the results above with a Function<InputType, ResultType> be applied to a Supplier<SuppliedType> and a Consumer<ConsumedType> also?
Some more familiarity with functions and the bytecode will possibly help me answer these questions. But a knowledge exchange forum like this may get me an answer faster and the questions may trigger more ideas for the readers.
Thanks in advance.
Rahul
I really don't think you, as a user, need to go to such lengths to prove the JVM's guarantees about lambdas. Basically, they are just like any other method to the JVM with no special memory or visibility effects :)
Here's a shorter function definition:
private static Function<String, String> fnPrepend = s -> "prefix_" + s;
this.result = fnPrepend.apply(s);
... but don't use a lambda just for the sake of it like this - it's just extra overhead for the same behaviour. Assuming the real usecase has a requirement for a Function, we can use Method References to call the static method. This gets you the best of both worlds:
// Available as normal static method
public static String fnPrepend(String s) {
return "prefix_" + s;
}
// Takes a generic Function
public static void someMethod(UnaryOperator<String> prefixer) {
...
}
// Coerce the static method to a function
someMethod(Util::fnPrepend);
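A runnable version of that idea might look like the sketch below. The names PrefixUtil, prepend, and applyToAll are hypothetical, chosen only for this illustration. A parallel stream is used so that the stateless function is invoked from several threads at once, which is the scenario the question's test was probing.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

class PrefixUtil {
    // Ordinary static helper: no shared state, so safe to call from any thread.
    static String prepend(String s) {
        return "prefix_" + s;
    }

    // Accepts any String -> String function and applies it to every input in parallel.
    // Collecting to a list keeps the results in the order of the inputs.
    static List<String> applyToAll(List<String> inputs, UnaryOperator<String> op) {
        return inputs.parallelStream().map(op).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The method reference adapts the static method to the functional interface.
        System.out.println(applyToAll(Arrays.asList("0", "01", "012"), PrefixUtil::prepend));
    }
}
```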

Parallel counting - Java

I do not have a background in CS. I am really new to parallel programming and I do not know how exactly the hardware works when running a program. However, I have noticed the following. Say I have:
public class Counter {
private static int parallelCount = 0;
private static int sequentialCount = 0;
public static void main(String[] args) {
int n = 1000;
// I count in parallel:
IntStream.range(0, n).parallel().forEach(i -> {
parallelCount++;
});
// I count sequentially:
for (int i = 0; i < n; i++) {
sequentialCount++;
}
System.out.println("parallelCount = " + parallelCount);
System.out.println("sequentialCount = " + sequentialCount);
}
}
Why might I get:
parallelCount = 984
sequentialCount = 1000
I guess this has to do with the hardware and the way the compiler accesses memory. I am really interested to know why this happens. And what is one possible solution?
Whenever more than one thread can access a mutable value without coordination, updates can be lost, which is the problem you are facing. No one can be sure what the result will be, and it will often be wrong. You cannot guarantee which thread will write the value last.
Therefore, you need to synchronize access to the shared resource (the integer you are incrementing) so that every thread sees the latest value and the answer is always correct.
In your program, you can make the parallelCount variable an AtomicInteger, as in AtomicInteger parallelCount = new AtomicInteger(); An AtomicInteger is thread-safe, meaning it can be updated concurrently without losing updates.
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;
public class Counter {
private static AtomicInteger parallelCount = new AtomicInteger();
private static int sequentialCount = 0;
public static void main(String[] args) {
int n = 1000;
// I count in parallel:
IntStream.range(0, n).parallel().forEach(i -> {
parallelCount.getAndIncrement();
});
// I count sequentially:
for (int i = 0; i < n; i++) {
sequentialCount++;
}
System.out.println("parallelCount = " + parallelCount);
System.out.println("sequentialCount = " + sequentialCount);
}
}
As you would expect, the standard for loop increments sequentialCount 1000 times.
With a parallel stream, the runtime uses multiple threads to execute your function in parallel. In this situation, several threads can increment the value at the same time and store the result in the int.
For example, suppose two threads are working in parallel and both want to increment parallelCount, which currently holds 50. Both threads read 50, compute the new value 51, and store it to memory, so one increment is lost.
Unsynchronized shared state like this can produce other concurrency problems as well. To solve it, you can use synchronization, locking, atomic classes, or another approach.
Multiple threads perform an operation that is not atomic (incrementing a value).
The code you wrote translates to bytecode that reads the value, adds one, and writes the result back as separate steps, so two threads can interleave: both read the same value, and both write back the same incremented result.
To avoid this, you need to synchronize the access to that critical code.
But note, that if all of your code is critical code, then it's redundant to use multiple threads.
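As a rough illustration of both points, the sketch below first guards the increment with synchronized, then counts with a reduction that needs no shared variable at all. The class and method names are made up for this example.

```java
import java.util.stream.IntStream;

class SynchronizedCounter {
    private int count = 0;

    // Only one thread at a time can run a synchronized method on this object,
    // so the read-add-write sequence can no longer interleave.
    synchronized void increment() {
        count++;
    }

    synchronized int get() {
        return count;
    }

    // Counts with a shared, lock-protected counter. Correct, but every
    // increment is serialized, so the threads gain nothing here.
    static int countWithLock(int n) {
        SynchronizedCounter counter = new SynchronizedCounter();
        IntStream.range(0, n).parallel().forEach(i -> counter.increment());
        return counter.get();
    }

    // Counts without shared mutable state: each element contributes 1 and the
    // stream combines the partial sums itself.
    static int countWithReduction(int n) {
        return IntStream.range(0, n).parallel().map(i -> 1).sum();
    }

    public static void main(String[] args) {
        System.out.println("with lock      = " + countWithLock(1000));
        System.out.println("with reduction = " + countWithReduction(1000));
    }
}
```

The reduction variant is usually the better fit for streams, because the work can actually proceed in parallel instead of queuing on one lock.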
AtomicInteger
We can use the AtomicInteger class from the java.util.concurrent.atomic package when working with parallel streams, because the result is unpredictable when a plain primitive is updated from several threads.
import java.util.stream.IntStream;
import java.util.concurrent.atomic.AtomicInteger;
public class Main
{
private static AtomicInteger parallelCount = new AtomicInteger();
private static int sequentialCount = 0;
public static void main(String[] args) {
System.out.println("Hello World");
int n = 100000;
// I count in parallel:
IntStream.range(0, n).parallel().forEach(i -> {
parallelCount.incrementAndGet();
});
// I count sequentially:
for (int i = 0; i < n; i++) {
sequentialCount++;
}
System.out.println("parallelCount = " + parallelCount);
System.out.println("sequentialCount = " + sequentialCount);
}
}

Java Multithreading return value

I have code that runs a process in a loop, like this:
for (int z = 0; z < m_ID.length; z++) {
expretdata = expret.Get_Expected_Return(sStartDate, sEndDate, m_ID[z], sBookName, nHistReturn,nMarketReturn, nCustomReturn);
m_Alpha[z] = expretdata;
}
Get_Expected_Return() is an expensive method that takes a long time. If m_ID.length is more than 200, the task takes an hour to complete.
I want to optimize it with multithreading. I tried saving the return values to a static global Map and reordering them by key, because I need the data ordered by the index into m_ID.
But when I run it with multiple threads, some of the threads return NULL; it looks like the thread doesn't run the method.
Is multithreading the right way to do it? Or can you give me any other advice to optimize it?
Multithreading can be very useful if your expensive methods are independent and don't contend for a single shared resource such as one hard drive.
Your use case of ordered results can be solved using Callables and Futures:
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
class Test
{
static final int CALLS = 10;
static int slowMethod(int n) throws InterruptedException
{
Thread.sleep(1000);
return n+1;
}
public static void main(String[] args) throws InterruptedException, ExecutionException
{
ExecutorService executor = Executors.newCachedThreadPool();
List<Future<Integer>> futures = new ArrayList<>();
for (int i = 0; i < CALLS ; i++)
{
final int finali = i;
futures.add(executor.submit(()->slowMethod(finali)));
}
for(Future<Integer> f: futures) {System.out.print(f.get());}
executor.shutdown();
}
}
If you are using Java 8 or above, you can use a parallel stream over the indices (IntStream is in java.util.stream):
m_Alpha = IntStream.range(0, m_ID.length).parallel()
.mapToObj(z -> expret.Get_Expected_Return(sStartDate, sEndDate, m_ID[z],
sBookName, nHistReturn, nMarketReturn, nCustomReturn))
.toArray(Integer[]::new);
The array constructor passed to toArray must match the type of m_Alpha; Integer[] here is only a placeholder. The resulting array keeps the order of the indices.
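To make the ordering guarantee concrete, here is a small self-contained sketch. The expensive method is a hypothetical stand-in for Get_Expected_Return, and int is an assumed result type; neither comes from the question.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

class OrderedParallelMap {
    // Hypothetical stand-in for the slow calculation.
    static int expensive(int id) {
        return id * 2;
    }

    // Maps every index in parallel; toArray places each result at the
    // position of its index, regardless of which thread finished first.
    static int[] computeAll(int[] ids) {
        return IntStream.range(0, ids.length)
                .parallel()
                .map(z -> expensive(ids[z]))
                .toArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(computeAll(new int[] {5, 1, 3})));
    }
}
```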
You can use CompletableFuture to execute this task in parallel. Here is an example.
// Define a wrapper method that stores the calculated value in the array:
private void longExecution(int index, DataType[] m_Alpha, ... sStartDate, sEndDate, m_ID_index_z, sBookName, nHistReturn,nMarketReturn, nCustomReturn){
m_Alpha[index] = expret.Get_Expected_Return(sStartDate, sEndDate, m_ID_index_z, sBookName, nHistReturn,nMarketReturn, nCustomReturn);
}
// Now from your code:
...
CompletableFuture<?>[] futures = new CompletableFuture<?>[m_ID.length];
for (int z = 0; z < m_ID.length; z++) {
final int index = z; // a lambda can only capture effectively final variables
// runAsync fits here because longExecution returns nothing
futures[z] = CompletableFuture.runAsync(() ->
longExecution(index, m_Alpha, sStartDate, sEndDate, m_ID[index], sBookName, nHistReturn,nMarketReturn, nCustomReturn));
}
// Wait for all of the futures to complete.
CompletableFuture.allOf(futures).join();
// After this line, the m_Alpha array holds the results.
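A variant that returns values from the futures instead of writing into a shared array could look like the sketch below. Again, expensive is a hypothetical placeholder for Get_Expected_Return and the int types are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

class FutureOrderDemo {
    // Hypothetical stand-in for the slow calculation.
    static int expensive(int id) {
        return id + 1;
    }

    static int[] computeAll(int[] ids) {
        // Submit one asynchronous task per id; the list keeps submission order.
        List<CompletableFuture<Integer>> futures = new ArrayList<>();
        for (int id : ids) {
            futures.add(CompletableFuture.supplyAsync(() -> expensive(id)));
        }
        // Collect in submission order; join blocks until that task is done.
        int[] results = new int[ids.length];
        for (int z = 0; z < results.length; z++) {
            results[z] = futures.get(z).join();
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(computeAll(new int[] {7, 2, 9})));
    }
}
```

Because each result travels through its own future, no task touches shared state, and the position in the list ties every result back to its input.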

Updating a variable & adding numbers to two different lists

I have just started doing threading this week and I'm kind of stuck on one of the exercises.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.logging.Level;
import java.util.logging.Logger;
public class RandomNumberConsumer implements Runnable {
ArrayBlockingQueue<Integer> numbersProduced;
public RandomNumberConsumer(ArrayBlockingQueue<Integer> numbersProduced) {
this.numbersProduced = numbersProduced;
}
//Should eventually hold the sum of all random number consumed
int sumTotal = 0;
List<Integer> below50 = new ArrayList();
List<Integer> aboveOr50 = new ArrayList();
@Override
public void run() {
//In this exercise, we start four threads, each producing 100 numbers, so we know how much to consume
for (int i = 0; i < 400; i++) {
try {
System.out.println("first" + numbersProduced.take());
System.out.println("second" + numbersProduced.take());
System.out.println("third" + numbersProduced.take());
System.out.println("fourth" + numbersProduced.take());
} catch (InterruptedException ex) {
Logger.getLogger(RandomNumberConsumer.class.getName()).log(Level.SEVERE, null, ex);
}
}
}
public int getSumTotal() {
return sumTotal;
}
public List<Integer> getBelow50() {
return below50;
}
public List<Integer> getAboveOr50() {
return aboveOr50;
}
}
Basically, what I don't understand is how to update the sumTotal variable so that it holds the sum of all consumed random numbers, since they're stored in an ArrayBlockingQueue<>, and also how to insert them into either the below50 or the aboveOr50 list.
Instead of calling numbersProduced.take() directly in the run method, how about defining a new method in the RandomNumberConsumer class?
int foo() throws InterruptedException { // because take() can throw it
int a = numbersProduced.take();
if (a < 50)
below50.add(a);
else
aboveOr50.add(a);
sumTotal = sumTotal + a;
return a;
}
Now call foo in run instead of numbersProduced.take().
One more thing, which I hope you already know: if several threads modify the same variable, you must protect it with a Semaphore or synchronized methods.
Ps: sorry for bad English :)
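Putting the pieces together, a sketch of the whole exercise might look like this. The class name ProducerConsumerSketch, the runDemo helper, and the random range 0 to 99 are assumptions for illustration. Only one consumer thread touches the sum and the two lists, so they need no locking; the main thread reads them only after join.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ThreadLocalRandom;

class ProducerConsumerSketch {
    static final int PRODUCERS = 4;
    static final int NUMBERS_PER_PRODUCER = 100;

    int sumTotal = 0;
    final List<Integer> below50 = new ArrayList<>();
    final List<Integer> aboveOr50 = new ArrayList<>();

    // Takes exactly count numbers from the queue and classifies each one.
    void consume(BlockingQueue<Integer> queue, int count) {
        try {
            for (int i = 0; i < count; i++) {
                int n = queue.take(); // blocks until a producer has put a number
                sumTotal += n;
                if (n < 50) {
                    below50.add(n);
                } else {
                    aboveOr50.add(n);
                }
            }
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
    }

    // Starts four producers and one consumer, waits for all of them, and
    // returns the consumer so its totals can be inspected.
    static ProducerConsumerSketch runDemo() {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(10);
        ProducerConsumerSketch consumer = new ProducerConsumerSketch();
        List<Thread> threads = new ArrayList<>();
        for (int p = 0; p < PRODUCERS; p++) {
            threads.add(new Thread(() -> {
                try {
                    for (int i = 0; i < NUMBERS_PER_PRODUCER; i++) {
                        queue.put(ThreadLocalRandom.current().nextInt(100));
                    }
                } catch (InterruptedException ex) {
                    Thread.currentThread().interrupt();
                }
            }));
        }
        threads.add(new Thread(() -> consumer.consume(queue, PRODUCERS * NUMBERS_PER_PRODUCER)));
        for (Thread t : threads) {
            t.start();
        }
        try {
            for (Thread t : threads) {
                t.join(); // after join, the consumer's fields are safely visible here
            }
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
        return consumer;
    }

    public static void main(String[] args) {
        ProducerConsumerSketch c = runDemo();
        System.out.println("sum = " + c.sumTotal + ", below 50: " + c.below50.size()
                + ", 50 or above: " + c.aboveOr50.size());
    }
}
```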

google-guava MapMaker .softValues() - values don't get GC-ed, OOME: HeapSpace follows

I am having trouble using the MapMaker from google-guava. Here is the code:
package test;
import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.Random;
import com.google.common.collect.MapEvictionListener;
import com.google.common.collect.MapMaker;
public class MapMakerTest {
private static Random RANDOM = new Random();
private static char[] CHARS =
("abcdefghijklmnopqrstuvwxyz" +
"ABCDEFGHIJKLMNOPQRSTUVWXYZ" +
"1234567890-=!@#$%^&*()_+").toCharArray();
public static void main(String[] args) throws Exception {
MapEvictionListener<String, String> listener = new MapEvictionListener<String, String>() {
@Override
public void onEviction(String key, String value) {
System.out.println(">>>>> evicted");
}
};
Map<String, String> map = new MapMaker().
concurrencyLevel(1).softValues().
evictionListener(listener).makeMap();
while (true) {
System.out.println(map.size());
String s = getRandomString();
map.put(s, s);
Thread.sleep(50);
}
}
private static String getRandomString() {
int total = 50000;
StringBuilder sb = new StringBuilder();
for (int i = 0; i < total; ++i) {
sb.append(CHARS[RANDOM.nextInt(CHARS.length)]);
}
return sb.toString();
}
}
When java is called like: java -Xms2m -Xmx2m -cp guava-r09.jar:. test.MapMakerTest (the heap settings are so small intentionally to easier see what happens) around the 60th iteration it explodes with OutOfMemoryError: HeapSpace.
However, when the map is Map<String, SoftReference<String>> (with corresponding changes in the rest of the code: the listener and the put), I can see the evictions taking place, the code simply works, and the values get garbage collected.
In all of the documentation, including this one: http://guava-libraries.googlecode.com/svn/tags/release09/javadoc/index.html, there is no mention of SoftReferences explicitly. Isn't the Map implementation supposed to wrap the values in SoftReference when put is called? I am really confused about the supposed usage.
I am using guava r09.
Could anyone maybe explain what I am doing wrong, and why my assumptions are wrong?
Best regards,
wujek
You use the same object for key and value, therefore it is strongly reachable as a key and is not eligible for garbage collection despite the fact that value is softly reachable:
map.put(s, s);
Try to use different instances:
map.put(s, new String(s));
