How should I maintain a cache of values read from a file?

Setup
A program is running that performs arbitrary computations and writes a status (an integer value representing progress) to a file. The integer values can only be incremented.
I am now developing another application that can (among other things) perform arithmetic operations, e.g., comparisons, on those integer values. The files are constantly deleted and rewritten by a different program, so there is no guarantee that a file exists at any given time.
Basically, the application needs to execute something arbitrary, but is constrained by the other program's progress, i.e., it may only execute something once the other program has done enough work.
Problem
When performing the arithmetic operations, the application should not care where the integer values come from. In particular, accessing those integer values must not throw an exception. How should I isolate all the bad things that can happen during I/O access?
Note that I do not want the execution thread to block until a value can be read from the file. E.g., if the file system dies somehow, the integer values will no longer be updated, but the main thread should still continue to work. This requirement follows from the definition of the arithmetic comparison as a predicate, which has exactly two outcomes, true and false, and no third "error" outcome. That is why I think the values read from the file need to be cached somehow.
Limitation
Java 1.7, Scala 2.11
Current Approach
I have a solution that looks as if it would work, but I am not sure whether something could go wrong.
The solution is to maintain a cache of those integer values for each file. The core functionality is provided by the getters of the cache, while a separate "updater" thread constantly reads the files and updates the caches.
If an error occurs, the producer should take notice (i.e., log the error) but continue to run, because an incomplete computation should not affect subsequent computations.
A minimal example of what I am currently doing looks something like this:
import java.io.InputStream
import java.nio.file.{Files, Path, Paths}

object Application {
  def main(args: Array[String]) {
    val caches = args.map(filename => new Cache(Paths.get(filename)))
    val producer = new Thread(new Updater(caches))
    producer.start()
    execute(caches)
    producer.interrupt()
  }
  // Seq instead of Array: arrays are invariant, so Array[Cache] is not an Array[AccessValue]
  def execute(values: Seq[AccessValue]) {
    while (values.head.getValue < 5) {/* This should never throw an exception */}
  }
}
class Updater(caches: Array[Cache]) extends Runnable {
  def run() {
    var interrupted = false
    while (!interrupted) {
      caches.foreach { cache =>
        try {
          val input = Files.newInputStream(cache.file)
          // close the stream even if parsing fails
          try cache.updateValue(parse(input)) finally input.close()
        } catch {
          case _: InterruptedException =>
            interrupted = true
          case t: Throwable =>
            log.error(t) // log: some logger, not shown here
            /* continue as if nothing happened */
        }
      }
    }
  }
  def parse(input: InputStream): Int = input.read() /* In reality, some XML parsing */
}
trait AccessValue {
  def getValue: Int // should not throw an exception
}
class Cache(val file: Path) extends AccessValue {
  private var value = 0 // must be a var, since updateValue reassigns it
  def getValue = value
  def updateValue(newValue: Int) { value = newValue }
}
Doing it like this works on a synthetic test setup, but I am wondering whether something bad can happen. Also, if anyone would approach the problem differently, I would be glad to hear how.
Could there be a throwable that causes other threads to go wild? I am thinking of something like OutOfMemoryError or StackOverflowError. Would I need to handle them differently, or does it not matter because, e.g., the whole application would die anyway?
What would happen if the InterruptedException is thrown outside the try block, or even in the catch block? Is there a better way to terminate a thread?
Must the member value of class Cache be declared volatile? I do not care much about the ordering of reads and writes, but the compiler must not "optimize" reading the value away just because it deduces that the value is constant.
There are a lot of different concurrency-related libraries. Would you suggest using something other than new Thread(...).start()? If yes, which facility? I know of Scala's ExecutionContext and Futures, and Java's Executors class, which provides various static factory methods for thread pools. However, I have never used any of these before and I do not know their advantages and disadvantages. I also stumbled upon the name "Akka", but my guess is that using Akka is overkill for what I want to achieve.
Thank you

I would recommend reading through Oracle's documentation on concurrency.
When one thread writes a value and a different thread reads it, you should always use a synchronized block or declare that value as volatile. Otherwise there is no guarantee that the value written by one thread is visible to the other thread (see Oracle's documentation on establishing a happens-before relationship).
An OutOfMemoryError can affect the other threads, because the heap space it refers to is shared among all threads. A StackOverflowError would kill only the thread in which it occurs, because each thread has its own stack.
If you do not need some sort of synchronization between the two threads then you probably do not need any Futures or Executors.
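The advice above can be sketched in Java 7 roughly as follows. This is only an illustration under stated assumptions: the class names ProgressCache and CacheUpdater are hypothetical, and the file is assumed to hold a plain decimal integer as text (the question mentions XML in reality). The getter never blocks and never throws; a failed read simply leaves the previous value in place.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical cache: holds the last value successfully read from one file. */
class ProgressCache {
    private final Path file;
    // volatile: a write by the updater thread is visible to later reads in other threads
    private volatile int value = 0;

    ProgressCache(Path file) {
        this.file = file;
    }

    /** Never throws and never blocks: returns the last value read successfully. */
    int getValue() {
        return value;
    }

    /** Called by the updater thread only; keeps the old value if reading fails. */
    void refresh() {
        try {
            // Assumption: the file holds a decimal integer as text
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
            value = Integer.parseInt(text);
        } catch (IOException | RuntimeException e) {
            // Missing file, partial write, bad number: report and keep the old value
            System.err.println("Could not refresh " + file + ": " + e);
        }
    }
}

/** Hypothetical updater: polls every cache until its thread is interrupted. */
class CacheUpdater implements Runnable {
    private final ProgressCache[] caches;
    private final long pauseMillis;

    CacheUpdater(ProgressCache[] caches, long pauseMillis) {
        this.caches = caches;
        this.pauseMillis = pauseMillis;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            for (ProgressCache cache : caches) {
                cache.refresh();
            }
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                return; // interrupt() is used as the stop signal
            }
        }
    }
}
```

A caller would start the updater with new Thread(new CacheUpdater(caches, 100)).start() and stop it with interrupt(). Errors such as OutOfMemoryError are deliberately not caught in this sketch.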

Related

What is the worst that can happen in a Java race condition?

I know this is clearly a race condition. But what are the possible things that can happen?
class Blah {
    List<String> stuff;
    public List<String> getStuff() {
        return stuff;
    }
    public void setStuff(List<String> newValue) {
        this.stuff = newValue;
    }
}
Blah b = new Blah();
// Thread one
b.setStuff(getListFromSomeNetworkResource());
for (String c : b.getStuff()) {
    // Work with c
}
// Thread two
b.setStuff(getListFromSomeNetworkResource());
for (String c : b.getStuff()) {
    // Work with c
}
Can this throw a RuntimeException?
Can this segfault the JVM?
Can this segfault one of the threads?
Does it depend on the processor? What if it is an Intel Xeon processor?
Can this throw a NullPointerException?
Thread 2 can read the contents set by Thread 1 and vice versa if the function actually returned different values
I understand this is a race condition and will not write such code. But how do I convince others not to?
Update:
Assumptions:
getListFromSomeNetworkResource() always returns a new ArrayList. Size may be 0 or more.
getListFromSomeNetworkResource() is thread safe.

Can this throw a RuntimeException?
No, if getListFromSomeNetworkResource() is thread-safe and doesn't return null.
Can this segfault the JVM?
Can this segfault one of the threads?
Does it depend on the processor? What if it is an Intel Xeon processor?
No.
Can this throw a NullPointerException?
Only if getListFromSomeNetworkResource() can return null.
Thread 2 can read the contents set by Thread 1 and vice versa if the function actually returned different values
Yes, this is likely to happen.

The danger would be an ordering such as:
Thread one: b.setStuff(getListFromSomeNetworkResource());
Thread two: b.setStuff(getListFromSomeNetworkResource());
Thread one: b.stuff.iterator() (via b.getStuff(), at the start of the for loop)
In this case, thread one may be iterating over the list that thread two set. That publication, from thread two to thread one, was done without any synchronization -- that's a data race. Assuming that list is not itself thread-safe, lots of things can happen. The main issue would be that some of the list's state is visible to thread one, but not all of it, due to that data race.
It may throw some RuntimeException. For instance, maybe one field thinks that the list has n elements due to a resize, but the new array that came from that resize didn't make it over, so you end up with an ArrayIndexOutOfBoundsException.
It may throw a NullPointerException for any number of reasons; maybe it's a linked list, and one of the reference writes didn't make it to thread one.
It should not cause any segfaults: those can only come about from bugs in the JVM, but never bugs in your code.
It may depend on the processor, in that processors may have different handling of things like flushing memory from one CPU cache to another -- that's one of the reasons that unsafe publication can cause you to see only some of the data that one thread wrote, from another thread. There are ways to force those caches to get flushed; the way to specify them in Java is through the various data synchronization mechanisms (acquiring locks, using volatile fields, etc).
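One way to avoid the unsafe publication described above is sketched below. It assumes a hypothetical class SafeBlah (not from the question) that stores the reference in a volatile field and publishes only an unmodifiable copy. A thread may still end up iterating over the list another thread set, but it will see that list fully constructed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical variant of Blah that publishes its list safely. */
class SafeBlah {
    // volatile: everything written to the list before setStuff() stores the reference
    // is visible to a thread that later reads the reference through getStuff()
    private volatile List<String> stuff = Collections.emptyList();

    public List<String> getStuff() {
        return stuff;
    }

    public void setStuff(List<String> newValue) {
        // Defensive copy wrapped as unmodifiable, so a published list never changes
        this.stuff = Collections.unmodifiableList(new ArrayList<String>(newValue));
    }
}
```

The for-each loop in the question calls getStuff() once, so each loop works on a single, fully visible snapshot.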

Why does an empty while in Java not break when condition is set by other thread?

While trying to unit test a threaded class, I decided to use active waiting to control the behavior of the tested class. Using empty while statements for this failed to do what I intended. So my question is:
Why does the first code not complete, but the second does?
There is a similar question, but it doesn't have a real answer nor an MCVE and is far more specific.
Doesn't complete:
public class ThreadWhileTesting {
    private static boolean wait = true;
    private static final Runnable runnable = () -> {
        try {Thread.sleep(50);} catch (InterruptedException ignored) {}
        wait = false;
    };
    public static void main(String[] args) {
        wait = true;
        new Thread(runnable).start();
        while (wait); // THIS LINE IS IMPORTANT
    }
}
Does complete:
public class ThreadWhileTesting {
    private static boolean wait = true;
    private static final Runnable runnable = () -> {
        try {Thread.sleep(50);} catch (InterruptedException ignored) {}
        wait = false;
    };
    public static void main(String[] args) {
        wait = true;
        new Thread(runnable).start();
        while (wait) {
            System.out.println(wait); // THIS LINE IS IMPORTANT
        }
    }
}
I suspect that the empty while gets optimized by the Java compiler, but I am not sure. If this behavior is intended, how can I achieve what I want? (Yes, active waiting is intended since I cannot use locks for this test.)

wait isn't volatile and the loop body is empty, so the thread has no reason to believe it will change. It is JIT'd to
if (wait) while (true);
which never completes if wait is initially true.
The simple solution is just to make wait volatile, which prevents JIT making this optimization.
As to why the second version works: System.out.println is internally synchronized; as described in the JSR133 FAQ:
Before we can enter a synchronized block, we acquire the monitor, which has the effect of invalidating the local processor cache so that variables will be reloaded from main memory.
so the wait variable will be re-read from main memory next time around the loop.
However, you don't actually guarantee that the write of the wait variable in the other thread is committed to main memory; so, as @assylias notes above, it might not work in all conditions. (Making the variable volatile fixes this also.)

The short answer is that both of those examples are incorrect, but the second works because of an implementation artifact of the System.out stream.
A deeper explanation is that according to the JLS Memory Model, those two examples have a number of legal execution traces which give unexpected (to you) behavior. The JLS explains it like this (JLS 17.4):
A memory model describes, given a program and an execution trace of that program, whether the execution trace is a legal execution of the program. The Java programming language memory model works by examining each read in an execution trace and checking that the write observed by that read is valid according to certain rules.
The memory model describes possible behaviors of a program. An implementation is free to produce any code it likes, as long as all resulting executions of a program produce a result that can be predicted by the memory model.
This provides a great deal of freedom for the implementor to perform a myriad of code transformations, including the reordering of actions and removal of unnecessary synchronization.
In your first example, you have one thread updating a variable and a second thread reading it, with no form of synchronization between the two threads. To cut a (very) long story short, this means that the JLS does not guarantee that the memory update made by the writing thread will ever be visible to the reading thread. Indeed, the JLS text I quoted above means that the compiler is entitled to assume that the variable is never changed. If you perform an analysis using the rules set out in JLS 17.4, an execution trace where the reading thread never sees the change is legal.
In the second example, the println() call is (probably) causing some serendipitous flushing of memory caches. The result is that you are getting a different (but equally legal) execution trace, and the code "works".
The simple fix to make both of your examples work is to declare the wait flag as volatile. This means that there is a happens-before relationship between a write of the variable in one thread and a subsequent read in another thread. That in turn means that in all legal execution traces, the result of the write will be visible to the reading thread.
This is a drastically simplified version of what the JLS actually says. If you really want to understand the technical details, they are all in the spec. But be prepared for some hard work understanding the details.
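The volatile fix can be sketched as follows. The class and method names (VolatileWaitDemo, spinUntilReleased) are made up for this illustration, and the flag is renamed to waiting. The structure otherwise mirrors the first example from the question, including the empty loop body.

```java
/** Hypothetical corrected version of the busy-wait test from the question. */
class VolatileWaitDemo {
    // volatile forces every loop iteration to perform a real read of the flag
    private static volatile boolean waiting = true;

    /** Starts the writer thread, spins until the flag flips, then returns the flag. */
    static boolean spinUntilReleased() {
        waiting = true;
        Thread writer = new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    Thread.sleep(50);
                } catch (InterruptedException ignored) {
                    // fall through and release the waiter anyway
                }
                waiting = false;
            }
        });
        writer.start();
        while (waiting) {
            // empty on purpose: active waiting, as in the question
        }
        return waiting;
    }

    public static void main(String[] args) {
        System.out.println("flag after loop: " + spinUntilReleased());
    }
}
```

Because the read in the loop condition is a volatile read, the JIT may not hoist it out of the loop, so the loop ends shortly after the writer thread clears the flag.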

Do I need to add some locks or synchronization if there is only one thread writing and several threads reading?

Say I have a global object:
class Global {
public static int remoteNumber = 0;
}
There is a thread that runs periodically to get a new number from a remote source and update it (write only):
new Thread() {
    @Override
    public void run() {
        while (true) {
            int newNumber = getFromRemote();
            Global.remoteNumber = newNumber;
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                return; // stop when interrupted
            }
        }
    }
}.start();
And there are one or more threads that use this global remoteNumber at arbitrary times (read only):
int n = Global.remoteNumber;
doSomethingWith(n);
As you can see, I don't use any locks or synchronization to protect it. Is this correct? Is there any potential issue that might cause problems?
Update:
In my case, it is not really important that the reading threads get the latest value in real time. I mean, if some issue (caused by the lack of locks/synchronization) makes a reading thread miss a value, it doesn't matter, because that thread will get a chance to run the same code again soon (maybe in a loop).
But reading an undetermined value is not allowed (I mean, if the old value is 20 and the new updated value is 30, but a reading thread reads a non-existent value, say 33; I'm not sure whether that is possible).

You need synchronization here (with one caveat, which I'll discuss later).
The main problem is that the reader threads may never see any of the updates the writer thread makes. Usually any given write will be seen eventually. But here your update loop is so simple that a write could easily be held in cache and never make it out to main memory. So you really must synchronize here.
EDIT 11/2017: I'm going to update this and say that it's probably not realistic that a value could be held in cache for so long. I think it is an issue, though, that a variable access like this could be optimized by the compiler and held in a register. So synchronization (or volatile) is still needed to tell the optimizer to actually fetch a new value on each loop iteration.
So you either need to use volatile, or you need to use (static) getter and setter methods with the synchronized keyword on both. For an occasional write like this, the volatile keyword is much lighter weight.
The caveat is that if you truly don't need to see timely updates from the write thread, you don't have to synchronize. If an indefinite delay won't affect your program's functionality, you could skip the synchronization. But something like this on a timer doesn't look like a good use case for omitting synchronization.
EDIT: Per Brian Goetz in Java Concurrency in Practice, it is not allowed for Java/a JVM to show you "indeterminate" values -- values that were never written. Those are more technically called "out of thin air" values and they are disallowed by the Java spec. You are guaranteed to see some write that was previously made to your global variable, either the zero it was initialized with, or some subsequent write, but no other values are permitted.

Reader threads can read an old value for an undetermined time, but in practice there is no problem. That is because each thread may keep its own copy of this variable, and the copies are only synchronized occasionally. You can use the volatile keyword to remove this optimisation:
public static volatile int remoteNumber = 0;

Is the expression "a==1 ? 1 : 0" with comparison plus ternary operator expression atomic?

Quick question: is this line atomic in C++ and Java?
class foo {
bool test() {
// Is this line atomic?
return a==1 ? 1 : 0;
}
int a;
}
If there are multiple threads accessing that line, we could end up doing the check a==1 first, then a is updated, then we return, right?
Added: I didn't complete the class and of course, there are other parts which update a...

No, for both C++ and Java.
In Java, you need to make your method synchronized and protect other uses of a in the same way. Make sure you're synchronizing on the same object in all cases.
In C++, you need to use std::mutex to protect a, probably using std::lock_guard to make sure you properly unlock the mutex at the end of your function.
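For the Java side, a minimal sketch of that advice might look like this. The class GuardedFoo and the method resetIfOne are hypothetical additions for illustration. Every access to a goes through a method synchronized on the same object, which also makes a check-then-update sequence indivisible.

```java
/** Hypothetical Java version of foo in which every access to a holds the same lock. */
class GuardedFoo {
    private int a; // guarded by the lock on "this"

    /** The comparison runs while holding the lock, so it sees one consistent value. */
    synchronized boolean test() {
        return a == 1;
    }

    synchronized void setA(int newValue) {
        a = newValue;
    }

    /** Check and update as one indivisible step: no writer can slip in between. */
    synchronized boolean resetIfOne() {
        if (a == 1) {
            a = 0;
            return true;
        }
        return false;
    }
}
```

Note that the result of test() can already be stale by the time the caller uses it. If the caller must act on the result atomically, that action has to happen under the same lock, as resetIfOne does.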

return a==1 ? 1 : 0;
is a simple way of writing
if(a == 1)
return 1;
else
return 0;
I don't see any code for updating a. But I think you could figure it out.

Regardless of whether there is a write, reading the value of a non-atomic type in C++ is not an atomic operation. If there are no writes then you might not care whether it's atomic; if some other thread might be modifying the value then you certainly do care.

The correct way of putting it is simply: No! (both for Java and C++)
A less correct, but more practical answer is: Technically this is not atomic, but on most mainstream architectures, it is at least for C++.
Nothing is being modified in the code you posted; the variable is only tested. The code will thus usually result in a single TEST (or similar) instruction accessing that memory location, and that is, incidentally, atomic. The instruction will read a cache line, and there will be one well-defined value in the respective location, whatever it may be.
However, this is incidental/accidental, not something you can rely on.
It will usually even work -- again, incidentally/accidentally -- when a single other thread writes to the value. For this, the CPU fetches a cache line, overwrites the location for the respective address within the cache line, and writes back the entire cache line to RAM. When you test the variable, you fetch a cache line which contains either the old or the new value (nothing in between). There are no happens-before guarantees of any kind, but you can still consider this "atomic".
It is much more complicated when several threads modify that variable concurrently (not part of the question). For this to work properly, you need to use something from C++11 <atomic>, or use an atomic intrinsic, or something similar. Otherwise it is very much unclear what happens, and what the result of an operation may be -- one thread might read the value, increment it and write it back, but another one might read the original value before the modified value is written back.
This is more or less guaranteed to end badly, on all current platforms.

No, it is not atomic (in general), although it can be on some architectures (in C++, for example, on Intel if the integer is aligned, which it will be unless you force it not to be).
Consider these three threads:
// thread one:       // thread two:       // thread three:
while (true)         while (true)         while (a) ;
    a = 0xFFFF0000;      a = 0x0000FFFF;
Suppose the write to a is not atomic (for example, on Intel if a is unaligned and, for the sake of discussion, has 16 bits in each of two consecutive cache lines). It seems that the third thread can never come out of the loop, since the two possible values of a are both non-zero. But because the assignments are not atomic, thread two could update the higher 16 bits to 0, and thread three could read the lower 16 bits as 0 before thread two gets the time to complete the update, and so come out of the loop.
The whole conditional is irrelevant to the question, since the returned value is local to the thread.

No, it is still a test followed by a set and then a return.
Yes, multithreadedness will be a problem.
It's just syntactic sugar.

Your question can be rephrased as: is the statement
a == 1
atomic or not? No, it is not atomic; you should use std::atomic for a, or check that condition under a lock of some sort. Whether the whole ternary operator is atomic does not matter in this context, as it does not change anything. If your question is whether, in this code:
bool flag = somefoo.test();
flag is guaranteed to stay consistent with a == 1, it definitely is not, and it is irrelevant whether the whole ternary operator in your question is atomic.

There are a lot of good answers here, but none of them mention the need in Java to mark a as volatile.
This is especially important if no other synchronization method is employed but other threads could be updating a. Otherwise, you could be reading an old value of a.

Consider the following code:
bool done = false;
void Thread1() {
while (!done) {
do_something_useful_in_a_loop_1();
}
do_thread1_cleanup();
}
void Thread2() {
do_something_useful_2();
done = true;
do_thread2_cleanup();
}
The synchronization between these two threads is done using a boolean variable done. This is a wrong way to synchronize two threads.
On x86, the biggest issue is the compile-time optimizations.
Part of the code of do_something_useful_2() can be moved below "done = true" by the compiler.
Part of the code of do_thread2_cleanup() can be moved above "done = true" by the compiler.
If do_something_useful_in_a_loop_1() doesn't modify "done", the compiler may re-write Thread1 in the following way:
if (!done) {
while(true) {
do_something_useful_in_a_loop_1();
}
}
do_thread1_cleanup();
so Thread1 will never exit.
On architectures other than x86, the cache effects or out-of-order instruction execution may lead to other subtle problems.
Most race detectors will detect such a race.
Also, most dynamic race detectors will report data races on the memory accesses that were intended to be synchronized with this bool
(i.e. between do_something_useful_2() and do_thread1_cleanup()).
To fix such a race you need to use compiler and/or memory barriers (if you are not an expert, simply use locks).

Specific question on Java threading + synchronization

I know this question sounds crazy, but consider the following Java snippets:
Part - I:
class Consumer implements Runnable {
    private boolean shouldTerminate = false;
    public void run() {
        while (!shouldTerminate) {
            // consume and perform some operation.
        }
    }
    public void terminate() {
        this.shouldTerminate = true;
    }
}
So, the first question is: do I ever need to synchronize on the shouldTerminate boolean? If so, why? I don't mind missing the flag being set to true for one or two cycles (cycle = 1 loop execution). And second, can a boolean variable ever be in an inconsistent state (anything other than true or false)?
Part - II of the question:
class Cache<K,V> {
    private Map<K, V> cache = new HashMap<K, V>();
    public V getValue(K key) {
        if (!cache.containsKey(key)) {
            synchronized (this.cache) {
                V value = loadValue(key);
                cache.put(key, value);
            }
        }
        return cache.get(key);
    }
}
Should access to the whole map be synchronized? Is it possible for two threads to run this method with one "writer thread" halfway through storing a value into the map while, simultaneously, a "reader thread" invokes the "contains" method? Will this cause the JVM to blow up? (I don't mind overwriting values in the map if two writer threads try to load at the same time.)

Both of the code examples have broken concurrency.
The first one requires at least that the field be marked volatile, or else the other thread might never see the variable being changed (it may keep the value in a CPU cache or a register and not check whether the value in memory has changed).
The second one is even more broken, because the internals of HashMap are not thread-safe, and it is not just a single value but a complex data structure; using it from many threads produces completely unpredictable results. The general rule is that both reading and writing of shared state must be synchronized. You may also use ConcurrentHashMap for better performance.
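The ConcurrentHashMap suggestion could be sketched like this, assuming Java 8 or later. The class name LoadingCache and the loader function passed to the constructor are hypothetical stand-ins for the question's Cache and loadValue(key).

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical thread-safe variant of the Cache class from the question. */
class LoadingCache<K, V> {
    private final ConcurrentHashMap<K, V> cache = new ConcurrentHashMap<K, V>();
    private final Function<K, V> loader; // stands in for loadValue(key)

    LoadingCache(Function<K, V> loader) {
        this.loader = loader;
    }

    public V getValue(K key) {
        // ConcurrentHashMap performs this atomically per key:
        // the loader is applied at most once for a missing key
        return cache.computeIfAbsent(key, loader);
    }
}
```

The loader should be short and must not touch the same map, because other updates to the map may be blocked while it runs. On Java versions before 8, a plain synchronized getValue method is the simplest correct alternative.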

Unless you either synchronize on the variable or mark the variable as volatile, there is no guarantee that separate threads' views of the object ever get reconciled. To quote the Wikipedia article on the Java Memory Model:
The major caveat of this is that as-if-serial semantics do not prevent different threads from having different views of the data.
Realistically, so long as the two threads synchronize on some lock at some time, the update to the variable will be seen.
I am wondering why you wouldn't want to mark the variable volatile?

It's not that the JVM will "blow up" as such. But both cases are incorrectly synchronised, and so the results will be unpredictable. The bottom line is that JVMs are designed to behave in a particular way if you synchronise in a particular way; if you don't synchronise correctly, you lose that guarantee.
It's not uncommon for people to think they've found a reason why certain synchronisation can be omitted, or to unknowingly omit necessary synchronisation but with no immediately obvious problem. But with inadequate synchronisation, there is a danger that your program could appear to work fine in one environment, only for an issue to appear later when a particular factor is changed (e.g. moving to a machine with more CPUs, or an update to the JVM that adds a particular optimisation).

Synchronizing shouldTerminate: see Dilum's answer.
Your bool value will never be in an inconsistent state.
If one thread is calling cache.containsKey(key) while another thread is calling cache.put(key, value), the JVM will not blow up by throwing ConcurrentModificationException, but something bad might happen if that put call caused the map to grow. It will usually mostly work (which is worse than failing outright).
