I have an application that reads in objects from multiple serialized object logs and hands them off to another class for processing. My question focuses on how to efficiently and cleanly read in the objects and send them off.
The code was pulled from an older version of the application, but we ended up keeping it as is. It hasn't really been used much until the past week, but I recently started looking at the code more closely to try and improve it.
It opens N ObjectInputStreams, and reads one object from each stream to store them in an array (assume inputStreams below is just an array of ObjectInputStream objects that corresponds to each log file):
for (int i = 0; i < logObjects.length; i++) {
    if (inputStreams[i] == null) {
        continue;
    }
    try {
        if (logObjects[i] == null) {
            logObjects[i] = (LogObject) inputStreams[i].readObject();
        }
    } catch (final InvalidClassException e) {
        LOGGER.warn("Invalid object read from " + logFileList.get(i).getAbsolutePath(), e);
    } catch (final EOFException e) {
        inputStreams[i] = null;
    }
}
The objects that were serialized to file are LogObject objects. Here is the LogObject class:
public class LogObject implements Serializable {
    private static final long serialVersionUID = -5686286252863178498L;

    private Object logObject;
    private long logTime;

    public LogObject(Object logObject) {
        this.logObject = logObject;
        this.logTime = System.currentTimeMillis();
    }

    public Object getLogObject() {
        return logObject;
    }

    public long getLogTime() {
        return logTime;
    }
}
Once the objects are in the array, it then compares the log time and sends off the object with the earliest time:
// handle the LogObject with the earliest log time
minTime = Long.MAX_VALUE;
for (int i = 0; i < logObjects.length; i++) {
    logObject = logObjects[i];
    if (logObject == null) {
        continue;
    }
    if (logObject.getLogTime() < minTime) {
        index = i;
        minTime = logObject.getLogTime();
    }
}
handler.handleOutput(logObjects[index].getLogObject());
My first thought was to create a thread for each file that reads in and puts the objects in a PriorityBlockingQueue (using a custom comparator that uses the LogObject log time to compare). Another thread could then be taking the values out and sending them off.
The only issue here is that one thread could put an object on the queue and have it taken off before another thread could put one on that may have an earlier time. This is why the objects were read in and stored in an array initially before checking for the log time.
Does this constraint prohibit me from implementing a multi-threaded design? Or is there a way I can tweak my solution to make it more efficient?
As far as I understand your problem, you need to process LogObjects strictly in order. In that case the initial part of your code is correct: what it does is a merge sort over several input streams. You need to read one object from each stream (this is why the temporary array is needed), then take the appropriate (minimum) LogObject and hand it to the processor.
Depending on your context, you might be able to do the processing itself in several threads. The only thing you need to change is to put the LogObjects in an ArrayBlockingQueue, and the processors can then run on several independent threads. Another option is to submit the LogObjects for processing to a ThreadPoolExecutor. The latter option is simpler and more straightforward.
But be aware of several pitfalls along the way:
- For this algorithm to work correctly, the individual streams must already be sorted. Otherwise your program is broken.
- When you process messages in parallel, the processing order is, strictly speaking, undefined. The proposed algorithms only guarantee the order in which processing starts (dispatch order). That might not be what you want.
So now you face several questions:
- Is processing order really required?
- If so, is global order required (over all messages), or only local order (within independent groups of messages)?
The answers to those questions will have a great impact on your ability to do parallel processing.
If the answer to the first question is yes, then sadly, parallel processing is not an option.
I agree with you. Throw this away and use a PriorityBlockingQueue.
The only issue here is that if Thread 1 has read an object from File 1 in and put it in the queue (and the object File 2 was going to read in has an earlier log time), the reading Thread could take it and send it off, resulting in a log object with a later time being sent first
This is exactly like the merge phase of a balanced merge (Knuth, The Art of Computer Programming, Vol. 3). You must read the next input from the same file as you got the previous lowest element from.
Does this constraint prohibit me from implementing a multi-threaded design?
It isn't a constraint. It's imaginary.
Or is there a way I can tweak my solution to make it more efficient?
Priority queues are already pretty efficient. In any case you should certainly worry about correctness first. Then add buffering ;-) Wrap the ObjectInputStreams around BufferedInputStreams, and ensure there is a BufferedOutputStream in your output stack.
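To make the single-reader merge concrete, it can be driven by a PriorityQueue keyed on log time: after emitting the minimum, refill from the same stream it came from. This is only a sketch; the LogObject stand-in below takes an explicit timestamp for clarity (the real class uses System.currentTimeMillis()), and the Handler interface is an assumption matching the handler.handleOutput(...) call in the question:

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.Serializable;
import java.util.Comparator;
import java.util.PriorityQueue;

// Minimal stand-in for the question's LogObject (explicit timestamp).
class LogObject implements Serializable {
    private static final long serialVersionUID = 1L;
    private final Object payload;
    private final long logTime;
    LogObject(Object payload, long logTime) { this.payload = payload; this.logTime = logTime; }
    Object getLogObject() { return payload; }
    long getLogTime() { return logTime; }
}

// Assumed shape of the downstream processor.
interface Handler { void handleOutput(Object o); }

class LogMerger {
    // One queue entry per stream: the stream's current head object.
    private static final class Head {
        final LogObject obj;
        final ObjectInputStream in;
        Head(LogObject obj, ObjectInputStream in) { this.obj = obj; this.in = in; }
    }

    static void merge(ObjectInputStream[] streams, Handler handler)
            throws IOException, ClassNotFoundException {
        PriorityQueue<Head> heads =
                new PriorityQueue<>(Comparator.comparingLong((Head h) -> h.obj.getLogTime()));
        for (ObjectInputStream in : streams) {
            LogObject first = readOrNull(in);
            if (first != null) heads.add(new Head(first, in));
        }
        while (!heads.isEmpty()) {
            Head min = heads.poll();
            handler.handleOutput(min.obj.getLogObject());
            LogObject next = readOrNull(min.in); // refill from the SAME stream
            if (next != null) heads.add(new Head(next, min.in));
        }
    }

    private static LogObject readOrNull(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        try {
            return (LogObject) in.readObject();
        } catch (EOFException e) {
            return null; // stream exhausted
        }
    }
}
```

The priority queue replaces the linear minimum scan and keeps the refill-from-the-same-stream invariant explicit.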
I'm working on an on-demand cache that needs to be thread-safe. I have data for about 30K+ items (in one file) that I want to load only when needed in my multi-threaded game. However, I'm not sure if my approach is how ConcurrentMap's computeIfAbsent is supposed to be used, and if it isn't, what alternative is there for me to lazily load contents from a single file without worrying about threading issues? I want to avoid locking when the object already exists in my map, which I've read a ConcurrentHashMap does on reads.
I've pre-cached file names (which are IDs) that I want to load to ensure they exist to avoid constant checking via the headers hash map. The headers map is read-only and will only be loaded once upon starting of my program.
this is what I've done:
private static final ConcurrentMap<Integer, ItemData> items = new ConcurrentHashMap<>();
private static final HashMap<Integer, Byte> headers = new HashMap<>(); // pre-loaded file names to avoid checking if file exists

public static ItemData getItem(int itemID) {
    var item = items.get(itemID);
    if (item != null) {
        return item;
    }
    // if item doesn't exist in map, check if it exists in file on disk
    if (!headers.containsKey(itemID)) {
        return null;
    }
    // if item exists in file add it to cache
    return items.computeIfAbsent(itemID, k -> {
        try (var dis = new DataInputStream(new FileInputStream("item.bin"))) {
            var data = new ItemData(itemID);
            data.load(dis); // obtains only data for one item
            return data;
        } catch (IOException e) {
            // omitted for brevity. logging goes here.
            return null;
        }
    });
}
Update: Pre-loading isn't an option for me. I agree doing that would solve the threading issues, as the map would then be read-only. But my game assets have a combined size of over 2GB, and I don't want to load everything during startup, as some items in the files may never be used. That's why I'm looking for an approach that loads them only when needed.
You wrote
I want to avoid locking if the object exists in my map, which I've read using CHM does on reads.
I don’t know where you read that but it’s definitely wrong. It’s not even an outdated statement as even the very first version specifies:
Retrieval operations (including get) generally do not block…
The general structure of your approach is fine. In case of concurrent first time accesses for a key, it’s possible that multiple threads pass the first check but only one will do the actual retrieval in computeIfAbsent and all of them will use the result. Subsequent accesses to an already loaded item may benefit from the first plain get access.
There’s still something to improve.
return items.computeIfAbsent(itemID, k -> {
    try (var dis = new DataInputStream(new FileInputStream("item.bin"))) {
        var data = new ItemData(k);
        data.load(dis); // obtains only data for one item
        return data;
    } catch (IOException e) {
        // may still do logging here
        throw new UncheckedIOException(e);
    }
});
First, while it’s a good approach to do logging (which you omitted for brevity), returning null and forcing the calling code to deal with null is not a good idea. You already have the headers.containsKey(…) check that tells us the resource is supposed to be there, so the application likely has no way to deal with its absence; we’re talking about an exceptional situation.
Further, you can use the k parameter passed to the function rather than accessing itemID from the surrounding scope. Limiting access scopes is not only cleaner; in this case, it turns the lambda expression into a non-capturing one, which means it doesn’t require creating a new object on each invocation that would otherwise be needed to hold the captured value.
If you really read the same item.bin file for all ItemData, you may consider using memory mapped I/O to share the data, instead of reading it with a DataInputStream. The ByteBuffer representation of a memory mapped file offers almost the same methods to get compound items, it even supports little endian processing that DataInputStream doesn’t support.
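A minimal sketch of that memory-mapped variant: map item.bin once at startup, and hand each load its own duplicate() view so concurrent readers don't share position/limit state. Where a given item starts in the file is assumed to come from your headers data; ItemFile and viewAt are illustrative names, not part of your code:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Maps the item file once; safe to share across threads because every
// caller gets an independent view of the mapped region.
class ItemFile {
    private final ByteBuffer base;

    ItemFile(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            base = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    // duplicate() shares the mapped bytes but gives the caller its own
    // position/limit, so concurrent loads don't interfere.
    ByteBuffer viewAt(int offset) {
        ByteBuffer view = base.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        view.position(offset);
        return view;
    }
}
```

An ItemData.load could then read its fields with view.getInt(), view.getLong(), etc., exactly as it would from a DataInputStream, but without opening the file per item.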
I am trying to sort objects into five separate groups depending on a weight given to them at instantiation.
Now, I want to sort these objects into the five groups by their weights. In order to do this, each one must be compared against the others.
Now, the problem I'm having is that these objects are added to the groups on separate worker threads. After an object has completed downloading a picture, each one is sent to the synchronized sorting function, which compares it against all members currently in the groups.
The groups have been set up as maps, and I've tried two different kinds. The first was a Hashtable, which crashes the program throwing an unknown ConcurrencyIssue. When I use a ConcurrentHashMap, the data is wrong because it doesn't remove the entry in time before the next object is compared against the ConcurrentHashMap. So this causes a logic error and yields groups that are sorted correctly only half of the time.
I need the hashmap to immediately remove the entry from the map before the next sort occurs... I thought synchronizing the function would do this but it still doesn't seem to work.
Is there a better way to sort objects against each other that are being added to a datastructure by worker threads? Thanks! I'm a little lost on this one.
private synchronized void sortingHat(Moment moment) {
    try {
        ConcurrentHashMap[] helperList = {postedOverlays, chanl_2, chanl_3, chanl_4, chanl_5};
        Moment moment1 = moment;
        // Iterate over all channels going from highest channel to lowest
        for (int i = channelCount - 1; i > 0; i--) {
            ConcurrentHashMap<String, Moment> table = helperList[i];
            Set<String> keys = table.keySet();
            boolean mOverlap = false;
            double width = getWidthbyChannel(i);
            // If there are no objects in the table, don't bother trying to compare...
            if (!table.isEmpty()) {
                // Iterate over all objects currently in the hashmap
                for (String objId : keys) {
                    Moment moment2 = table.get(objId);
                    // x-overlap
                    if ((moment2.x + width >= moment1.x - width) ||
                            (moment2.x - width <= moment1.x + width)) {
                        // y-overlap
                        if ((moment2.y + width >= moment1.y - width) ||
                                (moment2.y - width <= moment1.y + width)) {
                            // If there is overlap, only replace the moment with the greater weight.
                            if (moment1.weight >= moment2.weight) {
                                mOverlap = true;
                                table.remove(objId);
                                table.put(moment1.id, moment1);
                            }
                        }
                    }
                }
            }
            // If there is no overlap, add to the channel anyway
            if (!mOverlap) {
                table.put(moment1.id, moment1);
            }
        }
    } catch (Exception e) {
        Log.d("SortingHat", e.toString());
    }
}
The table.remove(objId) is where the problems occur. Moment A gets sent to sorting function, and has no problems. Moment B is added, it overlaps, it compares against Moment A. If Moment B is less weight than Moment A, everything is fine. If Moment B is weighted more and A has to be removed, then when moment C gets sorted moment A will still be in the hashmap along with moment B. And so that seems to be where the logic error is.
You are having an issue with your synchronization.
The synchronized you use will synchronize on the "this" lock. You can imagine it like this:
public synchronized void foo() { ... }
is the same as
public void foo() {
    synchronized (this) {
        // ...
    }
}
This means that before entering the method, the current thread will try to acquire "this object" as a lock. Now, if you have a worker thread that also has a synchronized method (for adding stuff to the table), the two won't exclude each other unless they synchronize on the same object. What you want is that one thread has to finish its work before the next one can start.
The first being a Hashtable, which crashes the program throwing an unknown ConcurrencyIssue.
This problem occurs because two threads may call something at the same time. To illustrate, imagine one thread calling put(key, value) on the table and another thread calling remove(key). If those calls get executed at the same time (e.g. by different cores), what will the resulting Hashtable be? Because no one can say for sure, a ConcurrentModificationException gets thrown. Note: this is a very simplified explanation!
When I use a ConcurrentHashMap, the data is wrong because it doesn't remove the entry in time before the next object is compared against the ConcurrentHashmap
The ConcurrentHashMap is a utility for avoiding said concurrency issues; it is not a magical, multi-functional, unicorn-hunting butter knife. It synchronizes the individual method calls, which means that only one thread at a time can add to, remove from, or otherwise work on the map. It does not have the same functionality as a Lock of some sort, which would dedicate access to the map to one thread for a whole sequence of operations.
There could be one thread that wants to call add and one that wants to call remove. The ConcurrentHashMap only prevents those calls from happening at the same time. Which comes first? You have no power over that (in this scenario). What you want is that one thread finishes its whole unit of work before the next one starts its own.
What you really need is up to you. The java.util.concurrent package brings a whole arsenal of classes you could use. For example:
You could use a lock for each map. With that, each thread (whether sorting, removing, adding, or whatever) would first fetch the lock for said map and then work on that map, like this:
public class Worker implements Runnable {
    private int idOfMap = ...;

    @Override
    public void run() {
        Lock lock = getLock(idOfMap);
        lock.lock();
        try {
            // The work goes here
            // ...
        } finally {
            lock.unlock();
        }
    }
}
The call to lock.lock() ensures that no other thread is currently working on and modifying the map; once the call returns, this thread has exclusive access to the map. No one sorts before you are finished removing the right element.
Of course, you would somehow have to hold said locks, for example in a data object. That said, you could also use a Semaphore, use synchronized(map) in each thread, or formulate your work on the map as Runnables and pass those to a single thread that executes all the Runnables it receives one by one. The possibilities are nearly endless. I personally would recommend starting with the lock.
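One way to hold said locks in a data object is a small registry keyed by map id, where ConcurrentHashMap.computeIfAbsent creates each lock lazily on first use. LockRegistry is an illustrative name, not something from your code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// One lock per map id, created lazily and shared by all workers.
// computeIfAbsent guarantees that concurrent first requests for the
// same id still end up with a single shared lock instance.
class LockRegistry {
    private final Map<Integer, Lock> locks = new ConcurrentHashMap<>();

    Lock getLock(int idOfMap) {
        return locks.computeIfAbsent(idOfMap, id -> new ReentrantLock());
    }
}
```

Each worker would then call registry.getLock(idOfMap) and use the lock()/try/finally pattern shown above.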
I'm trying to use parallel streams to call an API endpoint to get some data back. I am using an ArrayList<String> and sending each String to a method that uses it in making a call to my API. I have setup parallel streams to call a method that will call the endpoint and marshall the data that comes back. The problem for me is that when viewing this in htop I see ALL the cores on the db server light up the second I hit this method ... then as the first group finish I see 1 or 2 cores light up. My issue here is that I think I am truly getting the result I want ... for the first set of calls only and then from monitoring it looks like the rest of the calls get made one at a time.
I think it may have something to do with the recursion but I'm not 100% sure.
private void generateObjectMap(Integer count) {
    ArrayList<String> myList = getMyList();
    myList.parallelStream().forEach(f -> performApiRequest(f, count));
}

private void performApiRequest(String myString, Integer count) {
    if (count < 10) {
        TreeMap<Integer, TreeMap<Date, MyObj>> tempMap = new TreeMap<>();
        try {
            tempMap = myJson.getTempMap(myRestClient.executeGet(myString));
        } catch (SocketTimeoutException e) {
            count += 1;
            performApiRequest(myString, count);
        }
        // ...
    } else {
        System.exit(1);
    }
}
This seems an unusual use for parallel streams. In general, the idea is that you are informing the JVM that the operations on the stream are truly independent and can run in any order, in one thread or several. The results will subsequently be reduced or collected as part of the stream. The important point to remember here is that side effects are undefined (which is why variables used inside stream operations need to be final or effectively final), and you shouldn't rely on how the JVM organises the execution of the operations.
I can imagine the following being a reasonable usage:
list.parallelStream().map(item -> getDataUsingApi(item))
.collect(Collectors.toList());
Where the api returns data which is then handed to downstream operations with no side effects.
So, in conclusion, if you want tight control over how the API calls are executed, I would recommend not using parallel streams for this. Traditional Thread instances, possibly with a ThreadPoolExecutor, will serve you much better.
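A hedged sketch of that alternative: a small fixed pool caps how many API calls run at once, and each task retries with a plain loop instead of recursing through the stream. doApiCall stands in for the real myRestClient.executeGet call, and the pool size and retry limit are assumptions:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Submits one task per request; the pool size bounds concurrency
// explicitly instead of leaving it to the common fork-join pool.
class ApiRunner {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    void runAll(List<String> requests) throws InterruptedException {
        for (String request : requests) {
            pool.submit(() -> callWithRetry(request, 10));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }

    private void callWithRetry(String request, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                doApiCall(request); // stand-in for the real endpoint call
                return;
            } catch (RuntimeException timeout) {
                // retry: fall through to the next loop iteration
            }
        }
        throw new IllegalStateException("gave up on " + request);
    }

    protected void doApiCall(String request) {
        // real call goes here
    }
}
```

Because every request is its own task, slow or retried calls no longer serialize the remaining work the way a recursive call inside a parallel-stream worker can.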
I have an ArrayList that's used to buffer data so that other threads can read it.
This array constantly has data added to it, since it's reading from a UDP source, and the other threads constantly read from it; the data is then removed from the array.
This is not the actual code, but a simplified example:
public class PacketReader implements Runnable {
    public static ArrayList<Packet> buffer = new ArrayList<>();

    @Override
    public void run() {
        while (bActive) {
            // read from udp source and add data to the array
        }
    }
}

public class Player implements Runnable {
    @Override
    public void run() {
        // read packet from buffer
        // decode packets
        // now for the problem:
        PacketReader.buffer.remove(the packet that's been read);
    }
}
The remove() method removes a packet from the array and then shifts all the packets to its right one position to the left to fill the gap.
My concern is: since the buffer is constantly being added to and read from by multiple threads, would the remove() method cause issues, given that it has to shift packets to the left?
I mean, if the add() or get() methods get called on that ArrayList at the same time that the shift is being done, would that be a problem?
I do sometimes get an IndexOutOfBoundsException along the lines of: index: 100, size: 300, which is strange because the index is within the size. So I want to know whether this could be the cause, or whether I should look for other problems.
Thank you.
It sounds like what you really want is a BlockingQueue. ArrayBlockingQueue is probably a good choice. If you need an unbounded queue and don't care about extra memory utilization (relative to ArrayBlockingQueue), LinkedBlockingQueue also works.
It lets you push items in and pop them out, in a thread-safe and efficient way. The behavior of those pushes and pops can differ (what happens when you try to push to a full queue, or pop from an empty one?), and the JavaDocs for the BlockingQueue interface have a table that shows all of these behaviors nicely.
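A minimal sketch of the queue in this setting, using String as a stand-in for the Packet type: the reader thread put()s and each player thread take()s, with the blocking hand-off handled entirely by the queue:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Bounded, thread-safe buffer between the UDP reader and the players.
class PacketBuffer {
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);

    // Called by the reader thread; blocks if the buffer is full,
    // which also gives natural back-pressure on the producer.
    void produce(String packet) throws InterruptedException {
        queue.put(packet);
    }

    // Called by player threads; blocks until a packet is available,
    // and each packet is handed to exactly one consumer.
    String consume() throws InterruptedException {
        return queue.take();
    }
}
```

There is no shifting and no index arithmetic to race on: removal from the head is part of the queue's contract.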
A thread-safe List (regardless of whether it comes from synchronizedList or CopyOnWriteArrayList) isn't actually enough, because your use case uses a classic check-then-act pattern, and that's inherently racy. Consider this snippet:
if (!list.isEmpty()) {
    Packet p = list.remove(0); // remove the first item
    process(p);
}
Even if list is thread-safe, this usage is not! What if list has one element during the "if" check, but then another thread removes it before you get to remove(0)?
You can get around this by synchronizing around both actions:
Packet p;
synchronized (list) {
    if (list.isEmpty()) {
        p = null;
    } else {
        p = list.remove(0);
    }
}
if (p != null) {
    process(p); // we don't want to call process(..) while still synchronized!
}
This is less efficient and takes more code than a BlockingQueue, though, so there's no reason to do it.
Yes, there would be problems, because ArrayList is not thread-safe: the internal state of the ArrayList object would become corrupted, and eventually you would see incorrect output or runtime exceptions. You can try using Collections.synchronizedList(List list), or, if it's a good fit, a CopyOnWriteArrayList.
This issue is the producer–consumer problem. You can see how people commonly fix it by using a lock of some kind, taking turns extracting an object out of a buffer (a List in your case). There are thread-safe buffer implementations you could look at as well, if you don't necessarily need a List.
I have a List object being accessed by multiple threads. There is mostly one thread, and in some conditions two threads, that updates the list. There are one to five threads that can read from this list, depending on the number of user requests being processed.
The list is not a queue of tasks to perform, it is a list of domain objects that are being retrieved and updated concurrently.
Now there are several ways to make the access to this list thread-safe:
- use a synchronized block
- use a normal Lock (i.e. read and write ops share the same lock)
- use a ReadWriteLock
- use one of the java.util.concurrent collection classes
My question:
What is the optimal approach to use, given that the critical sections typically do not contain many operations (mostly just adding/removing/inserting or getting elements from the list)?
Can you recommend another approach, not listed above?
Some constraints:
- optimal performance is critical, memory usage not so much
- it must be an ordered list (currently synchronizing on an ArrayList), although not a sorted list (i.e. not sorted using a Comparable or Comparator, but according to insertion order)
- the list is big, containing up to 100,000 domain objects, so something like CopyOnWriteArrayList is not feasible
- the write/update critical sections are typically very quick, doing a simple add/remove/insert or replace (set)
- the read operations will primarily do an elementAt(index) call most of the time, although some read operations might do a binary search or indexOf(element)
- no direct iteration over the list is done, though operations like indexOf(..) will traverse the list
Do you have to use a sequential list? If a map-type structure is more appropriate, you can use a ConcurrentHashMap. With a list, a ReadWriteLock is probably the most effective way.
Edit to reflect OP's edit: Binary search on insertion order? Do you store a timestamp and use that for comparison, in your binary search? If so, you may be able to use the timestamp as the key, and ConcurrentSkipListMap as the container (which maintains key order).
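To make the skip-list suggestion concrete, here is a minimal sketch that keys entries by timestamp; floorEntry then replaces a hand-rolled binary search over insertion order. The class and method names are illustrative, and it assumes timestamps are unique keys:

```java
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Keeps entries sorted by timestamp with lock-free concurrent access;
// range and floor/ceiling lookups come for free from the map.
class TimestampedStore<V> {
    private final ConcurrentNavigableMap<Long, V> byTime = new ConcurrentSkipListMap<>();

    void put(long timestamp, V value) {
        byTime.put(timestamp, value);
    }

    // Nearest entry at or before the given time, or null if none.
    V at(long timestamp) {
        var e = byTime.floorEntry(timestamp);
        return e == null ? null : e.getValue();
    }
}
```

Reads and writes never block each other the way a fully locked list would, at the cost of losing positional (elementAt-style) access.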
What are the reading threads doing? If they're iterating over the list, then you really need to make sure no-one touches the list during the whole of the iteration process, otherwise you could get very odd results.
If you can define precisely what semantics you need, it should be possible to solve the issue - but you may well find that you need to write your own collection type to do it properly and efficiently. Alternatively, CopyOnWriteArrayList may well be good enough - if potentially expensive. Basically, the more you can tie down your requirements, the more efficient it can be.
I don't know if this is a possible solution for the problem, but it makes sense to me to use a database manager to hold that huge amount of data and let it manage the transactions.
I second Telcontar's suggestion of a database, since they are actually designed for managing this scale of data and negotiating between threads, while in-memory collections are not.
You say that the data is on a database on the server, and the local list on the clients is for the sake of user interface. You shouldn't need to keep all 100000 items on the client at once, or perform such complicated edits on it. It seems to me that what you want on the client is a lightweight cache onto the database.
Write a cache that stores only the current subset of data on the client at once. This client cache does not perform complex multithreaded edits on its own data; instead, it feeds all edits through to the server and listens for updates. When data changes on the server, the client simply forgets the old data and loads it again. Only one designated thread is allowed to read or write the collection itself. This way the client simply mirrors the edits happening on the server, rather than needing complicated edits itself.
Yes, this is quite a complicated solution. The components of it are:
- A protocol for loading a range of the data, say items 478712 to 478901, rather than the whole thing
- A protocol for receiving updates about changed data
- A cache class that stores items by their known index on the server
- A thread belonging to that cache which communicates with the server. This is the only thread that writes to the collection itself
- A thread belonging to that cache which processes callbacks when data is retrieved
- An interface that UI components implement to allow them to receive data when it has been loaded
At first stab, the bones of this cache might look something like this:
class ServerCacheViewThingy {
    private static final int ACCEPTABLE_SIZE = 500;

    private int viewStart, viewLength;
    final Map<Integer, Record> items
            = new HashMap<Integer, Record>(1000);
    final ConcurrentLinkedQueue<Callback> callbackQueue
            = new ConcurrentLinkedQueue<Callback>();

    public void getRecords(int start, int length, ViewReceiver receiver) {
        // remember the current view, to prevent records within
        // this view from being accidentally pruned.
        viewStart = start;
        viewLength = length;

        // if the selected area is not already loaded, send a request
        // to load that area
        if (!rangeLoaded(start, length))
            addLoadRequest(start, length);

        // add the receiver to the queue, so it will be processed
        // when the data has arrived
        if (receiver != null)
            callbackQueue.add(new Callback(start, length, receiver));
    }

    class Callback {
        int start;
        int length;
        ViewReceiver receiver;
        ...
    }

    class EditorThread extends Thread {
        private void prune() {
            if (items.size() <= ACCEPTABLE_SIZE)
                return;
            for (Map.Entry<Integer, Record> entry : items.entrySet()) {
                int position = entry.getKey();
                // if the position is outside the current view,
                // remove that item from the cache
                ...
            }
        }

        private void markDirty(int from) { ... }
        ...
    }

    class CallbackThread extends Thread {
        public void notifyCallback(Callback callback);

        private void processCallback(Callback callback) {
            readRecords
        }
    }
}

interface ViewReceiver {
    void receiveData(int viewStart, Record[] records);
    void receiveTimeout();
}
There's a lot of detail you'll have to fill in for yourself, obviously.
You can use a wrapper that implements synchronization:
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

List<String> list = new ArrayList<>();
List<String> syncList = Collections.synchronizedList(list);
// make sure you only use syncList for your future calls...
This is an easy solution. I'd try this before resorting to more complicated solutions.
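One caveat worth showing: the wrapper only makes individual calls atomic, so iteration (or any check-then-act sequence) still needs a manual synchronized block on the list itself, as the Collections.synchronizedList documentation requires. A small sketch:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SyncListDemo {
    static int sumLengths(List<String> syncList) {
        // Individual calls like add() are already synchronized, but
        // iteration is a sequence of calls and must hold the list's
        // own lock for the whole traversal.
        synchronized (syncList) {
            int total = 0;
            for (String s : syncList) {
                total += s.length();
            }
            return total;
        }
    }

    public static void main(String[] args) {
        List<String> syncList = Collections.synchronizedList(new ArrayList<>());
        syncList.add("ab");
        syncList.add("cde");
        System.out.println(sumLengths(syncList)); // prints 5
    }
}
```

Without the synchronized block, another thread mutating the list mid-iteration can trigger a ConcurrentModificationException or yield inconsistent results.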