ConcurrentMap on-demand loading in Java

I'm working on an on-demand cache that needs to be thread-safe. I have data for about 30K+ items (in one file) that I want to load only when needed for my multi-threaded game. However, I'm not sure whether my approach is how ConcurrentMap's computeIfAbsent is supposed to be used, and if it isn't, what alternative is there for lazily loading contents from a single file without worrying about threading issues? I want to avoid locking when the object already exists in my map, which I've read a ConcurrentHashMap does on reads.
I've pre-cached the file names (which are IDs) that I want to load, so I can check existence via the headers hash map instead of constantly hitting the file system. The headers map is read-only and is loaded only once when my program starts.
This is what I've done:
private static final ConcurrentMap<Integer, ItemData> items = new ConcurrentHashMap<>();
private static final HashMap<Integer, Byte> headers = new HashMap<>(); // pre-loaded file names to avoid checking whether a file exists

public static ItemData getItem(int itemID) {
    var item = items.get(itemID);
    if (item != null) {
        return item;
    }
    // if the item doesn't exist in the map, check whether it exists in the file on disk
    if (!headers.containsKey(itemID)) {
        return null;
    }
    // if the item exists in the file, add it to the cache
    return items.computeIfAbsent(itemID, k -> {
        try (var dis = new DataInputStream(new FileInputStream("item.bin"))) {
            var data = new ItemData(itemID);
            data.load(dis); // obtains only the data for one item
            return data;
        } catch (IOException e) {
            // omitted for brevity. logging goes here.
            return null;
        }
    });
}
Update: Pre-loading isn't an option for me. I agree doing that would solve the threading issues, since the map would then be read-only. But my game assets combined have a total size of over 2GB, and I don't want to load everything during startup, as some items in the files may never be used. Thus I'm looking for an approach to load them only when needed.

You wrote
I want to avoid locking if the object exists in my map, which I've read using CHM does on reads.
I don’t know where you read that, but it’s definitely wrong. It isn’t even an outdated statement, as even the documentation of the very first version specifies:
Retrieval operations (including get) generally do not block…
The general structure of your approach is fine. In case of concurrent first-time accesses for a key, it's possible that multiple threads pass the first check, but only one will do the actual retrieval in computeIfAbsent, and all of them will use its result. Subsequent accesses to an already loaded item may benefit from the cheap plain get access at the beginning.
There’s still something to improve.
return items.computeIfAbsent(itemID, k -> {
    try (var dis = new DataInputStream(new FileInputStream("item.bin"))) {
        var data = new ItemData(k);
        data.load(dis); // obtains only the data for one item
        return data;
    } catch (IOException e) {
        // may still do logging here
        throw new UncheckedIOException(e);
    }
});
First, while it's a good approach to do logging (which you omitted for brevity), returning null and forcing the calling code to deal with null is not a good idea. You already have the headers.containsKey(…) check telling us that the resource is supposed to be there, so the application likely has no way to deal with its absence; in other words, we're talking about an exceptional situation.
Further, you can use the k parameter passed to the function rather than capturing itemID from the surrounding scope. Limiting access scopes is not only cleaner; in this case it turns the lambda expression into a non-capturing one, which means it doesn't require a new object to be created on each call just to hold the captured value.
If you really read the same item.bin file for all ItemData, you may consider using memory-mapped I/O to share the data, instead of reading it with a DataInputStream. The ByteBuffer view of a memory-mapped file offers almost the same methods for reading compound items, and it even supports little-endian processing, which DataInputStream doesn't.
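For illustration, a minimal sketch of that memory-mapped variant. The per-item offset/length lookup and the ByteBuffer-based ItemData.load overload are assumptions (the original load takes a DataInputStream); you'd need an index saying where each item's record lives inside item.bin:
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

class ItemStore {
    private final MappedByteBuffer buffer; // one shared, read-only view of item.bin

    ItemStore(String file) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get(file), StandardOpenOption.READ)) {
            buffer = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    // offset/length per item would come from your headers index; passing them
    // in as parameters is an assumption for this sketch
    ItemData readItem(int itemID, int offset, int length) {
        ByteBuffer view = buffer.duplicate(); // independent position/limit per call,
                                              // so concurrent readers don't interfere
        view.position(offset).limit(offset + length);
        ItemData data = new ItemData(itemID);
        data.load(view.slice()); // assumes a hypothetical load(ByteBuffer) overload
        return data;
    }
}
Since duplicate() only copies the position/limit bookkeeping, every thread can read the shared mapping without synchronization.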

Related

Picking n random files from a directory

I have a folder containing over 100k folders in it. If I use listFiles() it takes a lot of time, because it returns all the entries present in the folder. What I want is n random entries from the folder, which I will process and move to a different location.
I was curious to see what sort of performance you get with listFiles(), so I tested. With 100,000 children, I saw a delay of 0.051 seconds. You will likely see this rate hold relatively well (nothing I found suggests any substantial increase within Java; any rapid degradation would come from the native layer). While this delay is relatively small, I looked into how listFiles() works to determine whether there were any potential improvements to be made.
Improvement 1
The first solution is to use File.list() as opposed to File.listFiles(). If you look at the code of the listFiles() method, you can see how Java finds the children of a folder:
public File[] listFiles() {
    String[] ss = list();
    if (ss == null) return null;
    int n = ss.length;
    File[] fs = new File[n];
    for (int i = 0; i < n; i++) {
        fs[i] = new File(ss[i], this);
    }
    return fs;
}
The listFiles() method takes the array of the children's names, which are Strings, and creates a File object for each child. The iteration and the instantiation of File objects create unnecessary overhead for your task: you only want a single File, which would be cheaper to obtain if the conversion from String[] to File[] were skipped. Fortunately, the list() method is public, so you can use it instead for a slight performance increase.
A rough test shows that this reduced the time by approximately 25% (when searching a folder with 100,000 children).
Improvement 2
The next logical step is to look at list() and see what it does. Here things get a little bit sticky:
public String[] list() {
    SecurityManager security = System.getSecurityManager();
    if (security != null) {
        security.checkRead(path);
    }
    if (isInvalid()) {
        return null;
    }
    return fs.list(this);
}
Assuming you are okay with skipping the security and validation checks, you would want to follow fs.list(this) to wherever it takes you. Following it takes you down a bit of a rabbit hole:
fs.list(this)
DefaultFileSystem.getFileSystem().list(File f)
WinNTFileSystem.list(File f)
which is where you stop. The list(File f) method is declared native, meaning it is implemented in native code via JNI. Access is restricted all the way down the line, meaning you cannot call these methods directly.
If you want to go as deep as you possibly can, you could use reflection to gain access to these methods. The lowest level I believe you can reach is the native method WinNTFileSystem.list(File f), though I would highly recommend against doing this.
/* Setup */
// Get the FileSystem instance from the File class (a static field, hence get(null))
Field fieldFileSystem = File.class.getDeclaredField("fs");
fieldFileSystem.setAccessible(true);
Object fs = fieldFileSystem.get(null);
// Get the WinNTFileSystem class
Class<?> classWinNTFileSystem = Class.forName("java.io.WinNTFileSystem");
// Get the native `list` method from the WinNTFileSystem class
Method methodList = classWinNTFileSystem.getMethod("list", File.class);
methodList.setAccessible(true);
/* Each time you want to invoke the method */
String[] files = (String[]) methodList.invoke(fs, root);
The performance gain from this varied significantly. At times I saw results only slightly better than the previous method, at others I saw drastic improvements of over 50%, though I am skeptical of that figure. Using this method you should see at least a minor increase over File.list(). (This assumes that you create the Method object only once and reuse it throughout the code.)
Note
Short of using keys as file names, you won't see any significant performance increases beyond what I have shown. In order to index into the children, as you want, you would need the full list, as there simply is no native implementation of "get child at index n". You could, however, use a key or index as the file name itself, and simply create a new File object using new File(root, "12353");.
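As a rough sketch of the random-selection step on top of list() (root and n here are stand-ins for your folder and count, not part of the question's code):
String[] names = root.list();
if (names != null) {
    List<String> pool = Arrays.asList(names);
    Collections.shuffle(pool); // shuffles the backing array in place
    for (String name : pool.subList(0, Math.min(n, pool.size()))) {
        File chosen = new File(root, name); // create a File only for the chosen entries
        // process and move 'chosen'
    }
}
This way only the n chosen entries are ever turned into File objects.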
Actually, Java has the DirectoryStream interface, which can be used to iterate over a directory without preloading its contents into memory. Sample code is below.
Path logFolder = Paths.get(windowsClientParentFolder);
try (DirectoryStream<Path> stream = Files.newDirectoryStream(logFolder)) {
    for (Path entry : stream) {
        String folderName = entry.getFileName().toString();
        // process the folder
    }
} catch (IOException ex) {
    System.out.println("Exception occurred while reading folders.");
}
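Building on that, a sketch of picking n random entries in a single pass with reservoir sampling, so only n paths are ever held in memory (n is a stand-in for your count; logFolder is reused from the snippet above, and exception handling is omitted):
List<Path> sample = new ArrayList<>(n);
Random rnd = new Random();
int seen = 0;
try (DirectoryStream<Path> stream = Files.newDirectoryStream(logFolder)) {
    for (Path entry : stream) {
        seen++;
        if (sample.size() < n) {
            sample.add(entry); // fill the reservoir first
        } else {
            int j = rnd.nextInt(seen); // uniform over everything seen so far
            if (j < n) {
                sample.set(j, entry); // keep this entry with probability n/seen
            }
        }
    }
}
// 'sample' now holds n uniformly random entries (or fewer, if the folder is small)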

Using a PriorityBlockingQueue to feed in logged objects for processing

I have an application that reads in objects from multiple serialized object logs and hands them off to another class for processing. My question focuses on how to efficiently and cleanly read in the objects and send them off.
The code was pulled from an older version of the application, but we ended up keeping it as is. It hasn't really been used much until the past week, but I recently started looking at the code more closely to try and improve it.
It opens N ObjectInputStreams and reads one object from each stream, storing them in an array (assume inputStreams below is just an array of ObjectInputStream objects corresponding to the log files):
for (int i = 0; i < logObjects.length; i++) {
    if (inputStreams[i] == null) {
        continue;
    }
    try {
        if (logObjects[i] == null) {
            logObjects[i] = (LogObject) inputStreams[i].readObject();
        }
    } catch (final InvalidClassException e) {
        LOGGER.warn("Invalid object read from " + logFileList.get(i).getAbsolutePath(), e);
    } catch (final EOFException e) {
        inputStreams[i] = null;
    }
}
The objects that were serialized to file are LogObject objects. Here is the LogObject class:
public class LogObject implements Serializable {
    private static final long serialVersionUID = -5686286252863178498L;

    private Object logObject;
    private long logTime;

    public LogObject(Object logObject) {
        this.logObject = logObject;
        this.logTime = System.currentTimeMillis();
    }

    public Object getLogObject() {
        return logObject;
    }

    public long getLogTime() {
        return logTime;
    }
}
Once the objects are in the array, it then compares the log time and sends off the object with the earliest time:
// handle the LogObject with the earliest log time
minTime = Long.MAX_VALUE;
for (int i = 0; i < logObjects.length; i++) {
    logObject = logObjects[i];
    if (logObject == null) {
        continue;
    }
    if (logObject.getLogTime() < minTime) {
        index = i;
        minTime = logObject.getLogTime();
    }
}
handler.handleOutput(logObjects[index].getLogObject());
My first thought was to create a thread for each file that reads in and puts the objects in a PriorityBlockingQueue (using a custom comparator that uses the LogObject log time to compare). Another thread could then be taking the values out and sending them off.
The only issue here is that one thread could put an object on the queue and have it taken off before another thread could put one on that may have an earlier time. This is why the objects were read in and stored in an array initially before checking for the log time.
Does this constraint prohibit me from implementing a multi-threaded design? Or is there a way I can tweak my solution to make it more efficient?
As far as I understand your problem, you need to process LogObjects strictly in order. In that case the initial part of your code is correct: what it does is a merge sort of several input streams. You read one object from each stream (this is why the temporary array is needed), then take the appropriate (minimum/maximum) LogObject and hand it to the processor.
Depending on your context you might be able to do the processing in several threads. The only thing you need to change is to put the LogObjects into an ArrayBlockingQueue, with processors running on several independent threads. Another option is to submit the LogObjects for processing to a ThreadPoolExecutor. The last option is simpler and more straightforward.
But be aware of several pitfalls on the way:
for this algorithm to work correctly, the individual streams must already be sorted. Otherwise your program is broken;
when you process in parallel, the completion order is, strictly speaking, undefined. The proposed algorithm therefore only guarantees the order in which processing starts (dispatch order). That might not be what you want.
So now you face several questions:
Is processing order really required?
If so, is global order required (over all messages) or local order (over independent groups of messages)?
The answers to those questions will have a great impact on your ability to do parallel processing.
If the answer to the first question is yes, then sadly parallel processing is not an option.
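If dispatch order is enough, here is a minimal sketch of the ThreadPoolExecutor variant. The pool size and the nextEarliestLogObject() merge helper are assumptions standing in for the question's existing array-based merge loop:
ExecutorService pool = Executors.newFixedThreadPool(4); // pool size is arbitrary here
LogObject next;
while ((next = nextEarliestLogObject()) != null) { // the merge step you already have
    final Object payload = next.getLogObject();
    pool.submit(new Runnable() {
        public void run() {
            handler.handleOutput(payload); // note: may *finish* out of order
        }
    });
}
pool.shutdown();
The single-threaded merge loop preserves the dispatch order; only the processing itself is parallelized.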
I agree with you. Throw this away and use a PriorityBlockingQueue.
The only issue here is that if Thread 1 has read an object from File 1 in and put it in the queue (and the object File 2 was going to read in has an earlier log time), the reading Thread could take it and send it off, resulting in a log object with a later time being sent first
This is exactly like the merge phase of a balanced merge (Knuth, The Art of Computer Programming, vol. 3). You must read the next input from the same file that supplied the previous lowest element.
Does this constraint prohibit me from implementing a multi-threaded design?
It isn't a constraint. It's imaginary.
Or is there a way I can tweak my solution to make it more efficient?
Priority queues are already pretty efficient. In any case you should certainly worry about correctness first. Then add buffering ;-) Wrap the ObjectInputStreams around BufferedInputStreams, and ensure there is a BufferedOutputStream in your output stack.
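For instance, when opening each log (logFile stands in for one of your log files):
ObjectInputStream in = new ObjectInputStream(
        new BufferedInputStream(new FileInputStream(logFile)));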

How to use ReadWriteLock?

I'm in the following situation.
At web application startup I need to load a Map which is thereafter used by multiple incoming threads. That is, requests come in, and the Map is used to find out whether it contains a particular key; if so, the value (the object) is retrieved and associated with another object.
Now, at times the content of the Map changes. I don't want to restart my application to reload the new situation. Instead I want to do this dynamically.
However, while the Map is reloading (removing all items and replacing them with the new ones), concurrent read requests on that Map still arrive.
What should I do to prevent all read threads from accessing that Map while it's being reloaded? And how can I do this in the most performant way, given that I only need it while the Map is reloading, which will occur only sporadically (every x weeks)?
If the above is not an option (blocking), how can I make sure that, while reloading, my read requests won't suffer from unexpected exceptions (because a key is no longer there, or a value is no longer present or is being reloaded)?
I was given the advice that a ReadWriteLock might help me out. Can someone provide me with an example of how I should use this ReadWriteLock with my readers and my writer?
Thanks,
E
I suggest handling this as follows:
Have your map accessible at a central place (could be a Spring singleton, a static field, ...).
When starting to reload, leave the current instance as is and work in a different Map instance.
When that new map is filled, replace the old map with the new one (that's an atomic reference assignment).
Sample code:
static volatile Map<U, V> map = ....;
// **************************
Map<U, V> tempMap = new ...;
load(tempMap);
map = tempMap;
Concurrency effects:
volatile ensures that the new value of the variable is visible to other threads.
While the map is reloading, all other threads see the old value undisturbed, so they suffer no penalty whatsoever.
Any thread that retrieves the map the instant before it is changed will work with the old values.
It can issue several gets against the same old map instance, which is great for data consistency (not loading the first value from the older map and the others from the newer one).
It will finish processing its request with the old map, but the next request will read the map reference again and receive the newer values.
If the client threads do not modify the map, i.e. the contents of the map is solely dependent on the source from where it is loaded, you can simply load a new map and replace the reference to the map your client threads are using once the new map is loaded.
Other than using twice the memory for a short time, no performance penalty is incurred.
In case the map uses too much memory to have two of them, you can use the same tactic per object in the map: iterate over the map, construct a new mapped-to object, and replace the original mapping once the object is loaded, as in the sketch below.
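A minimal sketch of that per-entry tactic, assuming the shared map is a ConcurrentHashMap (so a single put is safely visible to readers) and a hypothetical loadFresh(...) loader:
for (Map.Entry<Key, Value> e : map.entrySet()) {
    Value fresh = loadFresh(e.getKey()); // build the new object completely first
    map.put(e.getKey(), fresh);          // readers see either the old or the new value
}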
Note that changing the reference as suggested by others could cause problems if you rely on the map being unchanged for a while (e.g. if (map.containsKey(key)) { V value = map.get(key); ... }). If you need that, you should keep a local reference to the map:
static Map<U, V> map = ...;

void handleRequest() {
    Map<U, V> local = map;
    if (local.containsKey(key)) {
        V value = local.get(key);
        ...
    }
}
EDIT:
The assumption is that you don't want costly synchronization for your client threads. As a trade-off, you allow client threads to finish the work they had already begun before your map changed, ignoring any changes to the map that happened while the request was running. This way, you can safely make some assumptions about your map, e.g. that a key is present and always mapped to the same value for the duration of a single request. In the example above, if your reloading thread changed the map just after a client called map.containsKey(key), the client might get null from map.get(key), and you'd almost certainly end this request with a NullPointerException. So if you're doing multiple reads of the map and need assumptions like the one mentioned before, it's easiest to keep a local reference to the (maybe obsolete) map.
The volatile keyword matters more than it may look. It makes sure that the new map is seen by other threads as soon as you change the reference (map = newMap), and that they never see a partially filled map. Without volatile, a subsequent read (local = map) could keep returning the old reference for an indeterminate time, especially on multicore systems; the Java memory model makes no promise about when, or even whether, a plain write becomes visible to other threads. So this is not just an extra bit of multi-threading beauty; keep it.
I like the volatile Map solution from KLE a lot and would go with that. Another idea that someone might find interesting is to use the map equivalent of a CopyOnWriteArrayList, basically a CopyOnWriteMap. We built one of these internally and it is non-trivial but you might be able to find a COWMap out in the wild:
http://old.nabble.com/CopyOnWriteMap-implementation-td13018855.html
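If you can't find one, the core idea is small enough to sketch. This is an assumption of how such a class could look, not the implementation behind the link: reads are plain volatile reads, while writes copy the whole backing map under a lock.
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public final class CopyOnWriteMap<K, V> {
    private volatile Map<K, V> snapshot = Collections.emptyMap();

    public V get(K key) {
        return snapshot.get(key); // lock-free read of the current snapshot
    }

    public synchronized V put(K key, V value) {
        Map<K, V> copy = new HashMap<K, V>(snapshot); // copy on every write
        V old = copy.put(key, value);
        snapshot = Collections.unmodifiableMap(copy); // atomic publish via volatile write
        return old;
    }
}
Like CopyOnWriteArrayList, this trades expensive writes for completely unsynchronized reads, which fits a map that changes only every few weeks.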
This is the example from the JDK javadocs for ReentrantReadWriteLock, the standard implementation of ReadWriteLock. A few years late, but it's still valid, especially if you don't want to rely only on volatile:
class RWDictionary {
    private final Map<String, Data> m = new TreeMap<String, Data>();
    private final ReentrantReadWriteLock rwl = new ReentrantReadWriteLock();
    private final Lock r = rwl.readLock();
    private final Lock w = rwl.writeLock();

    public Data get(String key) {
        r.lock();
        try { return m.get(key); }
        finally { r.unlock(); }
    }
    public String[] allKeys() {
        r.lock();
        try { return m.keySet().toArray(new String[0]); }
        finally { r.unlock(); }
    }
    public Data put(String key, Data value) {
        w.lock();
        try { return m.put(key, value); }
        finally { w.unlock(); }
    }
    public void clear() {
        w.lock();
        try { m.clear(); }
        finally { w.unlock(); }
    }
}

How to handle cache misses: NotFoundException, contains(), or if (null == result)?

Maybe this is slightly academic, but if I implement a cache for speeding up an application, how should I best handle cache misses? (In my case, the language would be Java, but maybe the answer can be more general.)
Throw an exception:
ResultType res;
try {
    res = Cache.resLookup(someKey);
} catch (NotFoundException e) {
    res = Cache.resInsert(someKey, SlowDataSource.resLookup(someKey));
}
Ask before fetch:
ResultType res;
if (Cache.contains(someKey)) {
    res = Cache.resLookup(someKey);
} else {
    res = Cache.resInsert(someKey, SlowDataSource.resLookup(someKey));
}
Return null:
ResultType res;
res = Cache.resLookup(someKey);
if (null == res) {
    res = Cache.resInsert(someKey, SlowDataSource.resLookup(someKey));
}
Throwing an exception seems wrong; after all, this isn't an error. Letting the cache do a lookup for contains() and then again to retrieve the data seems wasteful, especially as this would occur on every access. And checking for null of course requires that null can never be a valid result...
The first is excessive, I think, and not a good use of exceptions. Do you expect a cache hit most of the time? A cache miss is a fairly normal occurrence, I would think, and thus an exception becomes simple flow control. Not good, imho.
The second is a race condition. There is a time delay between checking for the existence of the cache entry and querying it. That could lead to all sorts of trouble.
Returning null is probably appropriate in the general sense but that comes with some qualifications.
Firstly, what type of cache is it? Ideally you'd be talking to a read-through cache, in which case, if something is not in the cache, the cache simply gets it from the source itself, which is not the style of code you've written there.
Secondly, get-then-insert is another race condition. Look at the interface of ConcurrentHashMap for a good general way of dealing with this kind of thing. Most notably, the putIfAbsent() call is an atomic operation that does the equivalent of your third option.
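For illustration, the get-then-putIfAbsent idiom on a ConcurrentMap, reusing the question's names (treating 'cache' as a ConcurrentMap is an assumption):
ResultType res = cache.get(someKey);
if (res == null) {
    ResultType loaded = SlowDataSource.resLookup(someKey);
    ResultType prev = cache.putIfAbsent(someKey, loaded);
    res = (prev != null) ? prev : loaded; // another thread may have inserted first
}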
The last option (if null == result) is best, in my opinion.
A cache miss is not an exceptional condition, and should be handled in the normal code flow.
And if checking whether something exists in the cache can be a somewhat expensive operation (e.g. the network overhead of a memcached call), it shouldn't be a separate call. Also, the value of contains() may change before you actually retrieve the item if the cache is shared among threads.
What about a fourth option? You could use a holder for the return value and have the lookup return a boolean for success:
ResultHolder result = new ResultHolder();
if (!cache.resLookup(someKey, result)) {
    // get from slower source and insert to cache
}
if (result.value == null) {
    // special case if you wanted null as a valid value
}
This is basically your third option, keeping a single call, but if you wanted to have null as a value you could.
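For completeness, a minimal ResultHolder to go with this sketch (the class itself is an assumption, not part of the question's API):
public class ResultHolder {
    public ResultType value; // set by resLookup on a hit; may legitimately be null
}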
I would tend towards a checked exception, since you can't inadvertently ignore it and accidentally return a null. I'm assuming (unlike most people here) that a cache miss is an unusual scenario.
I also assume you're talking about the cache's internal implementation. From the client's perspective this should be invisible.
Since you are speeding up an existing API via caching, the invisible one (the second one) would apply. I say this assuming the API already exists and you're not prematurely optimising.
(3) looks like the easiest to read and the most performant (even though the actual difference in performance between these alternatives is probably negligible).
With (2) you have to do two lookups in the cache (first contains, then resLookup).
With (1) you create an additional object, the code gets more complicated and harder to read, and a cache miss isn't an exceptional case.
From a 'clean code' perspective:
A cache miss is not an exception but a normal case, so I'd not use an exception here.
null values should be avoided whenever possible; choose something meaningful instead.
This leaves the 'ask before fetch' option, or a variation of your third option where you don't return null but a special ResultType object that signals a cache miss.
Example:
public class ResultType {
    public final static ResultType CACHE_MISS = new ResultType();
    // ... rest of the class implementation
}
and later on
ResultType res;
res = Cache.resLookup(someKey);
if (ResultType.CACHE_MISS == res) {
    res = Cache.resInsert(someKey, SlowDataSource.resLookup(someKey));
}
Advantage over the null solution: the reader now immediately knows that this if handles a cache miss.

Best approach to use in Java 6 for a List being accessed concurrently

I have a List object being accessed by multiple threads. There is mostly one thread, and in some conditions two threads, that updates the list. There are one to five threads that can read from this list, depending on the number of user requests being processed.
The list is not a queue of tasks to perform, it is a list of domain objects that are being retrieved and updated concurrently.
Now there are several ways to make the access to this list thread-safe:
-use synchronized block
-use normal Lock (i.e. read and write ops share same lock)
-use ReadWriteLock
-use one of the new ConcurrentBLABLBA collection classes
My question:
What is the optimal approach to use, given that the critical sections typically do not contain many operations (mostly just adding, removing, inserting, or getting elements from the list)?
Can you recommend another approach, not listed above?
Some constraints
-optimal performance is critical, memory usage not so much
-it must be an ordered list (currently synchronizing on an ArrayList), although not a sorted list (i.e. not sorted using a Comparable or Comparator, but according to insertion order)
-the list is big, containing up to 100,000 domain objects, so using something like CopyOnWriteArrayList is not feasible
-the write/update critical sections are typically very quick, doing a simple add/remove/insert or replace (set)
-the read operations will primarily do an elementAt(index)-style call most of the time, although some read operations might do a binary search or indexOf(element)
-no direct iteration over the list is done, though operations like indexOf(..) will traverse the list
Do you have to use a sequential list? If a map-type structure is more appropriate, you can use a ConcurrentHashMap. With a list, a ReadWriteLock is probably the most effective way.
Edit to reflect OP's edit: Binary search on insertion order? Do you store a timestamp and use it for comparison in your binary search? If so, you may be able to use the timestamp as the key and a ConcurrentSkipListMap as the container (which maintains key order), as sketched below.
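A short sketch of that idea, assuming each domain object exposes its insertion timestamp (DomainObject and getTimestamp() are stand-ins, not names from the question):
ConcurrentNavigableMap<Long, DomainObject> byTime =
        new ConcurrentSkipListMap<Long, DomainObject>();
byTime.put(obj.getTimestamp(), obj);       // stays sorted by key, thread-safe
DomainObject exact = byTime.get(someTime); // O(log n) lookup
Map.Entry<Long, DomainObject> nearest =
        byTime.floorEntry(someTime);       // takes the place of the binary search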
What are the reading threads doing? If they're iterating over the list, then you really need to make sure no-one touches the list during the whole of the iteration process, otherwise you could get very odd results.
If you can define precisely what semantics you need, it should be possible to solve the issue - but you may well find that you need to write your own collection type to do it properly and efficiently. Alternatively, CopyOnWriteArrayList may well be good enough - if potentially expensive. Basically, the more you can tie down your requirements, the more efficient it can be.
I don't know if this is a possible solution for the problem, but... it makes sense to me to use a database manager to hold that huge amount of data and let it manage the transactions.
I second Telcontar's suggestion of a database, since they are actually designed for managing this scale of data and negotiating between threads, while in-memory collections are not.
You say that the data is in a database on the server, and the local list on the clients is for the sake of the user interface. You shouldn't need to keep all 100,000 items on the client at once, or perform such complicated edits on it. It seems to me that what you want on the client is a lightweight cache onto the database.
Write a cache that stores only the current subset of data on the client at once. This client cache does not perform complex multithreaded edits on its own data; instead it feeds all edits through to the server and listens for updates. When data changes on the server, the client simply forgets the old data and loads it again. Only one designated thread is allowed to read or write the collection itself. This way the client simply mirrors the edits happening on the server, rather than needing complicated edits itself.
Yes, this is quite a complicated solution. The components of it are:
A protocol for loading a range of the data, say items 478712 to 478901, rather than the whole thing
A protocol for receiving updates about changed data
A cache class that stores items by their known index on the server
A thread belonging to that cache which communicates with the server. This is the only thread that writes to the collection itself
A thread belonging to that cache which processes callbacks when data is retrieved
An interface that UI components implement to allow them to receive data when it has been loaded
At first stab, the bones of this cache might look something like this:
class ServerCacheViewThingy {
    private static final int ACCEPTABLE_SIZE = 500;
    private int viewStart, viewLength;
    final Map<Integer, Record> items
            = new HashMap<Integer, Record>(1000);
    final ConcurrentLinkedQueue<Callback> callbackQueue
            = new ConcurrentLinkedQueue<Callback>();

    public void getRecords(int start, int length, ViewReceiver receiver) {
        // remember the current view, to prevent records within
        // this view from being accidentally pruned.
        viewStart = start;
        viewLength = length;
        // if the selected area is not already loaded, send a request
        // to load that area
        if (!rangeLoaded(start, length))
            addLoadRequest(start, length);
        // add the receiver to the queue, so it will be processed
        // when the data has arrived
        if (receiver != null)
            callbackQueue.add(new Callback(start, length, receiver));
    }

    class Callback {
        int start;
        int length;
        ViewReceiver receiver;
        ...
    }

    class EditorThread extends Thread {
        private void prune() {
            if (items.size() <= ACCEPTABLE_SIZE)
                return;
            for (Map.Entry<Integer, Record> entry : items.entrySet()) {
                int position = entry.getKey();
                // if the position is outside the current view,
                // remove that item from the cache
                ...
            }
        }
        private void markDirty(int from) { ... }
        ....
    }

    class CallbackThread extends Thread {
        public void notifyCallback(Callback callback);
        private void processCallback(Callback callback) {
            readRecords
        }
    }
}
interface ViewReceiver {
    void receiveData(int viewStart, Record[] records);
    void receiveTimeout();
}
There's a lot of detail you'll have to fill in for yourself, obviously.
You can use a wrapper that implements synchronization:
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

ArrayList list = new ArrayList();
List syncList = Collections.synchronizedList(list);
// make sure you only use syncList for your future calls...
This is an easy solution. I'd try this before resorting to more complicated solutions.
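One caveat worth adding, straight from the Collections.synchronizedList documentation: iteration (including any hand-rolled traversal such as an indexOf-style scan) still needs manual synchronization on the wrapper:
synchronized (syncList) {
    for (Object o : syncList) {
        // process o
    }
}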
