Lucene: Multithread document duplication - java

I have multiple threads which perform searches in a Lucene index. Before each search, there is a check whether the content is already indexed; if not, it is added to the index. If two parallel searches on unindexed content occur at the same time, duplicate documents will be created and I expect the search results will be messed up.
I have found the method IndexWriter.updateDocument,
but I don't think it solves the multithreading problem I am facing.
Any suggestions how to resolve this are appreciated.

First, make sure there is only one IndexWriter#updateDocument() call at a time. You can achieve that with a lock object shared by your threads, like this:
class Search implements Runnable {
    // the lock and the flag must be shared by every thread, hence static
    private static final Object lock = new Object();
    private static volatile boolean found = false;

    public void run() {
        // ...search logic...
        boolean foundSomething = /* result of this thread's search */;
        if (foundSomething && !found) {
            synchronized (lock) {
                if (!found) { // re-check inside the lock
                    // call the related method (IndexWriter#updateDocument()) here
                    found = true;
                }
            }
        }
        // ...rest of the work...
    }
}
Second, you need to track every key found during the search to avoid duplication; checking the key in a shared set, or a simple boolean flag, may be enough.
Also, consider avoiding useless work by signalling the other threads to abort their searches if you only need the very first key found; whether that applies depends on your business logic.
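As a sketch of that key-tracking idea (the class and method names here are illustrative, not from any Lucene API), a concurrent set lets the first thread claim a key atomically:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// A sketch of de-duplicating index additions across threads: the first
// thread to claim a key wins, later threads skip the add.
class IndexedKeys {
    private static final Set<String> indexed = ConcurrentHashMap.newKeySet();

    // Returns true if the caller claimed the key and should index the
    // document; false if another thread already did (or is about to).
    static boolean tryClaim(String key) {
        return indexed.add(key); // add() is atomic for this set
    }
}
```

Only the thread that gets true from tryClaim should add the document, so two parallel searches on the same unindexed content can no longer both index it.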

If you're not able to modify the source of your updates/additions to be smarter about avoiding duplicates, then you'll have to create a choke point somewhere. The goal is simply to do it with the least amount of contention possible.
One way to do it would be to have a request queue, a work queue and a ConcurrentHashMap for lookups. All new requests are added to the request queue which is processed by a single "gatekeeper" thread. The gatekeeper can take one request at a time or drain the queue and process all pending requests in a loop to reduce contention on that end.
In order to process a request, the gatekeeper does putIfAbsent on the ConcurrentHashMap. If the return value is null, the update/insert request can be added to the actual work queue. If the value was already in the map, then.... see #2 below. Realistically you could use more than 1 gatekeeper since putIfAbsent is atomic, but it'd just increase contention on the HashMap. The gatekeeper's actual processing time is so low that you don't really gain anything by throwing more of them at the request queue.
The work queue threads will be able to process multiple updates/insertions concurrently as long as they don't modify the same record. When the work queue threads finish processing a request, they remove the value from the ConcurrentHashMap so that the gatekeeper knows it's safe to modify that record again.
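A minimal sketch of that gatekeeper step, with illustrative names and String document keys standing in for real requests:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// The gatekeeper's core decision: putIfAbsent determines whether a
// request may enter the work queue.
class Gatekeeper {
    private final ConcurrentHashMap<String, Boolean> inFlight = new ConcurrentHashMap<>();
    private final BlockingQueue<String> workQueue = new LinkedBlockingQueue<>();

    // Called by the single gatekeeper thread for each drained request.
    // Returns true if the request was forwarded to the work queue.
    boolean accept(String docKey) {
        if (inFlight.putIfAbsent(docKey, Boolean.TRUE) == null) {
            workQueue.offer(docKey); // unbounded queue: offer always succeeds
            return true;
        }
        return false; // duplicate in flight: drop it, requeue it, etc.
    }

    // Called by a worker thread once it has finished processing the doc,
    // so the gatekeeper may admit the same key again.
    void done(String docKey) {
        inFlight.remove(docKey);
    }

    int pendingWork() { return workQueue.size(); }
}
```

The single gatekeeper thread calls accept() for each drained request; worker threads call done() when they finish, which is what makes the same key admissible again.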
--
Some things to think about:
1) How do you want to define what can be done simultaneously? It probably shouldn't be hashing the full request because you wouldn't want two different requests to modify the same document at the same time, would you?
2) What do you do with requests that cannot currently be processed because they have duplicates in the queue already (or requests that modify the same doc, as in point #1)? Throw them out? Put them in a secondary updating queue that tries again periodically? How do you respond to the original requester if its request is in an indefinite holding pattern?
3) Does the order in which requests are processed matter?

Related

Will the value of a variable updated by one thread eventually be seen by another thread if not synchronized in Java?

I know that if the value of a variable is updated by one thread A and then read by another thread B, the new value may not be seen by thread B, and B might get a stale value. My question is: as computers are very fast, may I assume this possible latency is on the order of milliseconds, and that B will eventually see the new value if we are talking about time scales of, say, minutes or even hours?
The reason I'm asking is that in my code I have a map keeping some records, and it works as follows:
user adds a record to the map;
user goes and does some work;
user goes back and removes the record;
A lot of users do this concurrently. Steps 1 and 3 are very fast and done in the same thread (the Akka actor thread, to be specific); step 2 takes time and is done in separate worker threads. As the work in step 2 sometimes fails, step 3 may never get executed and forgotten records may accumulate in the map. I set up a scheduler thread to check the map for forgotten records and remove them to avoid a memory leak; the check period is several hours, as the failure happens very rarely. Under this scenario, is it practically OK to use a non-concurrent map?
If a single akka actor thread is responsible for the map access and the map is memory local to that actor, then it should not be necessary to use a concurrent map. This is because there is no concurrent access, and no chance of any race condition.
The issue is if the map is shared memory between the update/remove actor and the cleanup actor/thread. In this case it WOULD be necessary to synchronize or else there could be data races.
This could be avoided by the update/remove actor also supporting some sort of cleanup message (which will have performance/throughput implications for your actor but would be a simple and safe first approach), where it would block in order to iterate the map and cleanup orphaned records.
To answer your question it is NOT OK to use a non-concurrent (unsynchronized) map if the map is shared between multiple actors/threads.
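If you do keep the map shared between the actor and the cleanup thread, a ConcurrentHashMap with a timestamp per record is a simple safe choice. A sketch, with illustrative record and field names:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Shared-map variant: safe for the actor thread (add/remove) and a
// separate scheduler thread (cleanup) to use concurrently.
class RecordTracker {
    static final class Entry {
        final long createdAtMillis;
        Entry(long t) { createdAtMillis = t; }
    }

    private final Map<String, Entry> records = new ConcurrentHashMap<>();

    void add(String id) { records.put(id, new Entry(System.currentTimeMillis())); }

    void remove(String id) { records.remove(id); }

    // Run periodically from the scheduler thread: drop records older
    // than maxAgeMillis that step 3 never removed. Returns how many
    // forgotten records were dropped.
    int cleanup(long maxAgeMillis) {
        long now = System.currentTimeMillis();
        int removed = 0;
        for (Map.Entry<String, Entry> e : records.entrySet()) {
            if (now - e.getValue().createdAtMillis > maxAgeMillis) {
                // two-arg remove: only deletes if still mapped to this entry,
                // so a record re-added concurrently is not lost
                if (records.remove(e.getKey(), e.getValue())) removed++;
            }
        }
        return removed;
    }

    int size() { return records.size(); }
}
```

With your real several-hour period you would call cleanup(maxAge) from the scheduler thread; the two-argument remove avoids deleting a record that was re-added between the staleness check and the removal.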

Java Parallel Network Requests and List Updates

I am limited to a 1-core machine on AWS, but after measuring the time to complete all of my HTTP requests and check their results, two of them combined take as long to fetch their data as the remaining fifty requests (roughly 2 minutes).
I don't want to bloat my code more than I have to, but I know parallelism and asynchrony can seriously cut down the execution time for this task. I want to launch the two big requests on their own threads so they can go out while the others are running, but I store the results of these http requests in a list currently.
Can you access different (guaranteed) elements of a list at the same time as long as the data is initialized beforehand? I've seen the concurrent list and parallel list, but the one isn't parallel, and the other reallocates the entire list on every modification, so neither is a particularly sane option.
What can I do in this situation?
There is no such thing as a concurrent list in Java. I'm assuming that you are referring to a concurrent hash set (using newSetFromMap) and your "parallel list" refers to a CopyOnWriteArrayList.
You most definitely can use the former option to store update data.
A better way to solve your problem of updating data asynchronously is to simply use a non-thread-safe collection in each worker thread and then push the results all at once, when the worker is done, into a thread-safe collection used to aggregate all your requests.
So something like:
Set<Response> aggregate = Collections.newSetFromMap(new ConcurrentHashMap<>());
executor.execute(...);
...
// inside each worker
Set<Response> local = new HashSet<>();
populate(local); // fill with this worker's responses
aggregate.addAll(local);
You might want to use various synchronizers if you want your response data to be ordered in a specific way, such as keeping all the responses from Request 1 together. If you only need to move one result from each worker, use a thread-safe transfer or a singleton collection.
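Putting the pieces together, a runnable sketch of the pattern; Response is modeled as a plain String here, and the thread count is chosen arbitrarily:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class Aggregator {
    static Set<String> fetchAll(List<List<String>> perWorkerResponses) {
        // thread-safe aggregate, as in the snippet above
        Set<String> aggregate = Collections.newSetFromMap(new ConcurrentHashMap<>());
        ExecutorService executor = Executors.newFixedThreadPool(4);
        for (List<String> responses : perWorkerResponses) {
            executor.execute(() -> {
                // each worker fills a private, non-thread-safe set...
                Set<String> local = new HashSet<>(responses);
                // ...and touches the shared set exactly once
                aggregate.addAll(local);
            });
        }
        executor.shutdown();
        try {
            executor.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return aggregate;
    }
}
```

Each worker pays the synchronization cost only once, in the final addAll, instead of on every insertion.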

Background a task then end connection before task completion in Java (8)

I've spent a lot of time looking at this, and there are a tonne of ways to run tasks in the background in Java (I'm specifically looking at Java 8 solutions, it should be noted).
Ok, so here is my (generic) situation - please note this is an example, so don't spend time over the way it works/what it's doing:
Someone requests something via an API call
The API retrieves some data from a datastore
However, I want to cache this aggregated response in some caching system
I need to call a cache API (via REST) to cache this response
I do not want to wait until this call is done before returning the response to the original API call
Some vague code structure:
@GET
// ...other API annotations...
public Response myAPIMethod() {
    // get data from datastore
    Object data = getData();
    // submit request to cache data, without blocking
    saveDataToCache(data);
    // return the response to the client
    return Response.ok(data).build();
}
What is the "best" (optimal, safest, standard) way to run saveDataToCache in the background without having to wait before returning data? Note that this caching should not occur too often (maybe a couple of times a second).
I attempted this a couple of ways, specifically with CompletableFutures but when I put in some logging it seemed that it always waited before returning the response (I did not call get).
Basically the connection from the client might close, before that caching call has finished - but I want it to have finished :) I'm not sure if the rules are the same as this is during the lifetime of a client connection.
Thanks in advance for any advice, let me know if anything is unclear... I tried to define it in a way understandable to those without the domain knowledge of what I'm trying to do (which I cannot disclose).
You could consider adding the objects to cache into a BlockingQueue and have a separate thread taking from the queue and storing into cache.
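A minimal sketch of that idea, with the actual cache call abstracted as a Consumer since the real cache API isn't shown; all names here are illustrative:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

class AsyncCacheWriter {
    private final BlockingQueue<Object> toCache = new LinkedBlockingQueue<>();

    // Called from the API method: cheap, never blocks the response.
    void enqueue(Object data) { toCache.offer(data); }

    // Runs on its own thread, started once at application startup;
    // cacheClient stands in for the real REST call to the cache API.
    void writerLoop(Consumer<Object> cacheClient) {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                cacheClient.accept(toCache.take()); // blocks until data arrives
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    int queued() { return toCache.size(); }
}
```

The API method only pays the cost of an in-memory offer, while the writer thread drains the queue at its own pace.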
As per the comments, the cache API is already asynchronous (it actually returns a Future). I suppose it creates and manages an internal ExecutorService or receives one at startup.
My point is that there's no need to take care of the objects to cache, but of the returned Futures. Asynchronous behavior is actually provided by the cache client.
One option would be to just ignore the Future returned by this client. The problem with this approach is that you lose the chance to take corrective action in case an error occurs when attempting to store the object in the cache. In fact, you would never know that something went wrong.
Another option would be to take care of the returned Future. One way is with a queue, as suggested in another answer, though I'd use a ConcurrentLinkedQueue instead, since it's unbounded and you have mentioned that adding objects to the cache would happen only a couple of times a second. You could offer() the Future to the queue as soon as the cache client returns it. Then, in another thread running an infinite loop, you could poll() the queue for a Future and, if a non-null value is returned, invoke isDone() on it. (If the queue returns null it means the queue is empty, so you might want to sleep for a few milliseconds.)
If isDone() returns true, you can safely invoke get() on the future, surrounded by a try/catch block that catches any ExecutionException and handles it as you wish. (You could retry the operation on the cache, log what happened, etc).
If isDone() returns false, you could simply offer() the Future to the queue again.
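A sketch of the polling loop described above; Future&lt;Void&gt; is an assumption about the cache client's return type, and the names are illustrative:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

class CacheResultWatcher {
    private final Queue<Future<Void>> pending = new ConcurrentLinkedQueue<>();

    // Called right after the cache client hands back its Future.
    void watch(Future<Void> f) { pending.offer(f); }

    // One pass over the queue: finished futures are checked for errors,
    // unfinished ones are requeued. Returns how many had completed.
    int drainOnce() {
        int completed = 0;
        for (int i = pending.size(); i > 0; i--) {
            Future<Void> f = pending.poll();
            if (f == null) break; // queue drained early
            if (f.isDone()) {
                try {
                    f.get(); // will not block: isDone() was true
                } catch (ExecutionException e) {
                    // cache write failed: log it, retry the operation, ...
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                completed++;
            } else {
                pending.offer(f); // not done yet: try again next pass
            }
        }
        return completed;
    }

    // The infinite loop from the answer: poll, handle, sleep when idle.
    void runLoop() throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            if (drainOnce() == 0) Thread.sleep(10);
        }
    }
}
```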
Now, here we're talking about handling errors from asynchronous operations of a cache. I wouldn't do anything and let the future returned by the cache client go in peace. If something goes wrong, the worst thing that may happen is that you'd have to go to the datastore again to retrieve the object.

How can I obtain some ordering in a multi-threaded reading queue

In my app I receive some user data, put it into an ArrayBlockingQueue, and then put it into a database. Several threads are used for getting the data from the queue and putting it into the database. Then an issue came up.
The database is used to store each user's current status, so the data's time sequence is very important. But when using multiple threads to 'get and put', the order cannot be ensured.
So I came up with an idea, a bit like 'field grouping': for different users' data, multiple threads are fine and the ordering between users can be ignored; but each user's own data must be handled by the same thread.
Now the question is, how can I do that?
Is the number of users limited? Then you can simply cache a worker per user. Note that a Thread object cannot be started a second time once it has finished, so it is safer to cache a single-threaded ExecutorService per user than a raw Thread:
// worker cache, one single-threaded executor per user key
Map<String, ExecutorService> workerCache = new HashMap<>();
workerCache.put("primary_key", Executors.newSingleThreadExecutor());
// when processing the data
ExecutorService worker = workerCache.get(queue.peek());
worker.submit(task); // the task for this user's data
Otherwise: a Java thread has a name (Thread.setName()/Thread.getName()). You can use that to identify a thread, but reuse is still something you have to handle according to your business logic.
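A different way to get the "same user, same thread" guarantee, which also avoids keeping one worker per user alive, is to hash each user key onto a fixed pool of single-threaded executors. A sketch with illustrative names:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class UserPartitionedExecutor {
    private final ExecutorService[] lanes;

    UserPartitionedExecutor(int nLanes) {
        lanes = new ExecutorService[nLanes];
        for (int i = 0; i < nLanes; i++) {
            lanes[i] = Executors.newSingleThreadExecutor();
        }
    }

    // All tasks for the same userKey land on the same single-threaded
    // executor, so they run in submission order.
    Future<?> submit(String userKey, Runnable task) {
        int lane = Math.floorMod(userKey.hashCode(), lanes.length);
        return lanes[lane].submit(task);
    }

    void shutdown() {
        for (ExecutorService e : lanes) e.shutdown();
    }
}
```

Tasks for one user all land on one lane and therefore run in submission order, while different users can still proceed in parallel on other lanes.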
Try using a PriorityBlockingQueue<E>, where <E> is Comparable. Implement the comparison logic such that each user's data is individually sorted by the required attributes. Also, use thread pools instead of managing threads discretely.

producer-consumer: how to inform consumers that production has completed

I have the following situation:
Read data from database
do work "calculation"
write result to database
I have a thread that reads from the database and puts the generated objects into a BlockingQueue. These objects are extremely heavyweight, hence the queue to limit the number of objects in memory.
Multiple threads take objects from the queue, perform the work, and put the results in a second queue.
The final thread takes results from second queue and saves result to database.
The problem is how to prevent deadlocks, e.g. the "calculation threads" need to know when no more objects will be put into the queue.
Currently I achieve this by passing references to the threads (Callables) around and checking thread.isDone() before a poll or offer, and then checking whether the element is null. I also check the size of the queue; as long as there are elements in it, they must be consumed. Using take or put leads to deadlocks.
Is there a simpler way to achieve this?
One way to accomplish this is to put a "dummy" or "poison" message on the queue as the last message, once you are sure that no more tasks are going to arrive, for example after putting the message for the last row of the DB query. So the producer puts a poison message on the queue, and a consumer that receives it knows that no more meaningful work is expected in this batch.
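A sketch of the poison-pill idea, with Strings standing in for the heavy objects; with multiple consumers the producer puts one pill per consumer, since each take() removes the pill it sees. Names are illustrative:

```java
import java.util.concurrent.BlockingQueue;

class PoisonPillDemo {
    // Any value the producer can never emit as real data works as the pill.
    static final String POISON = "__END_OF_WORK__";

    // Consumer loop: blocks on take(), so no isDone()/size() checks and
    // no busy polling are needed; the pill is the shutdown signal.
    static int consume(BlockingQueue<String> queue) throws InterruptedException {
        int processed = 0;
        while (true) {
            String item = queue.take();
            if (POISON.equals(item)) break; // producer signalled completion
            processed++;                    // the real calculation goes here
        }
        return processed;
    }
}
```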
Maybe you should take a look at CompletionService.
It is designed to combine executor and queue functionality in one.
Tasks which have completed execution will be available from the completion service via
completionServiceInstance.take()
You can then use another executor for step 3, i.e. filling the DB with the results, feeding it with the results taken from the completionServiceInstance.
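A sketch of stages 2 and 3 with an ExecutorCompletionService; squaring integers stands in for the real calculation, and summing stands in for the database write:

```java
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class CalcPipeline {
    static int runBatch(List<Integer> inputs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        CompletionService<Integer> cs = new ExecutorCompletionService<>(pool);
        for (Integer in : inputs) {
            cs.submit(() -> in * in); // the "calculation"
        }
        int sum = 0;
        // exactly inputs.size() results will arrive, in completion order,
        // so no poison pill or isDone() bookkeeping is needed here
        for (int i = 0; i < inputs.size(); i++) {
            sum += cs.take().get();
        }
        pool.shutdown();
        return sum; // stand-in for "save result to database"
    }
}
```

Because the submitter knows how many tasks it handed in, the take() loop has a natural termination condition, which sidesteps the deadlock problem from the question.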
