Setup
I have a multithreaded Java application which will receive 200-300 requests per second to perform a task 'A'(which take approximately 30 milliseconds) on an input received in a request.
The application has a cache(max size = 1MB) which is read by each thread to perform task 'A' on input received:
public class DataProvider() {
private HashMap<KeyObject, ValueObject> cache;
private Database database;
// Scheduled to run in interval of 15 seconds by a background thread
public synchronized void updateData() {
this.cache = database.getData();
}
public HashMap<KeyObject, ValueObject> getCache() {
return this.cache;
}
}
KeyObject and ValueObject are POJO. ValueObject contains List of another POJO.
For every request received task is done in following way:
public class TaskExecutor() {
private DataProvider dataProvider;
public boolean doTask(final InputObject input) {
final HashMap<KeyObject, ValueObject> data = dataProvider.getCache(); // shallow copy I think
// Do Task 'A' using data
}
}
Problem
One of the thread starts executing task 'A' at timestamp 't' using data 'd1' from cache. At time 't + t1' cache data gets updated to 'd2'. Thread now starts using data 'd2' to finish rest of the task. Task gets completed at 't+t1+t2'. Half of the task was completed with different data. This will lead to invalid outcome of task.
Current Approach
Each thread will create a deep copy of the cache and then use the deep copy to perform the task using one of the following approach(best in performance) to perform deep copy:
How do you make a deep copy of an object in Java?
Deep clone utility recommendation
Limitation
Cloning using deep copy will create thousand of objects which may crash JVM.
All the cloning approaches don't look good in terms of performance.
For Your use case, returning a new cache from database.getData(); is much better choice. Because If You choose this way, You would only have to create new cache object once in 15 second. If You choose to clone cache in each task, You would have to create 4501 cache object in 15 second. Obviously returning new cache object is the right choice.
If the code You provided is the same code as in Your project, I believe database.getData(); method changing the content of a single cache object instead of returning a new one. If You return a new cache object from this method Your problem will be solved.
Related
We need to read data from our checkpoints manually for different reasons (let's say we need to change our state object/class structure, so we want to read restore and copy data to a new type of object)
But, while we are reading everything is good, when we want to keep/store it in memory and deploying to flink cluster we get empty list/map. in log we see that we are reading and adding all our data properly to list/map but as soon as our method completes it's work we lost data, list/map is empty :(
val env = ExecutionEnvironment.getExecutionEnvironment();
val savepoint = Savepoint.load(env, checkpointSavepointLocation, new HashMapStateBackend());
private List<KeyedAssetTagWithConfig> keyedAssetsTagWithConfigs = new ArrayList<>();
val keyedStateReaderFunction = new KeyedStateReaderFunctionImpl();
savepoint.readKeyedState("my-uuid", keyedStateReaderFunction)
.setParallelism(1)
.output(new MyLocalCollectionOutputFormat<>(keyedAssetsTagWithConfigs));
env.execute("MyJobName");
private static class KeyedStateReaderFunctionImpl extends KeyedStateReaderFunction<String, KeyedAssetTagWithConfig> {
private MapState<String, KeyedAssetTagWithConfig> liveTagsValues;
private Map<String, KeyedAssetTagWithConfig> keyToValues = new ConcurrentHashMap<>();
#Override
public void open(final Configuration parameters) throws Exception {
liveTagsValues = getRuntimeContext().getMapState(ExpressionsProcessor.liveTagsValuesStateDescriptor);
}
#Override
public void readKey(final String key, final Context ctx, final Collector<KeyedAssetTagWithConfig> out) throws Exception {
liveTagsValues.iterator().forEachRemaining(entry -> {
keyToValues.put(entry.getKey(), entry.getValue());
log.info("key {} -> {} val", entry.getKey(), entry.getValue());
out.collect(entry.getValue());
});
}
public Map<String, KeyedAssetTagWithConfig> getKeyToValues() {
return keyToValues;
}
}
as soon as this code executes I expect having all values inside map which we get from keyedStateReaderFunction.getKeyToValues(). But it returns empty map. However, I see in logs we are reading all of them properly. Even data empty inside keyedAssetsTagWithConfigs list where we are reading output in it.
If anyone has any idea will be very helpful because I get lost, I never had such experience that I put data to map and then I lose it :) When I serialize and write my map or list to text file and then deserialize it from there (using jackson) I see my data exists, but this is not a solution, kind of "workaround"
Thanks in advance
The code you show creates and submits a Flink job to be executed in its own environment orchestrated by the Flink framework: https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/flink-architecture/#flink-application-execution
The job runs independently than the code that builds and submits the Flink job so when you call keyedStateReaderFunction.getKeyToValues(), you are calling the method of the object that was used to build the job, not the actual object that was run in the Flink execution environment.
Your workaround seems like a valid option to me. You can then submit the file with your savepoint contents to your new job to recreate its state as you'd like.
You have an instance of KeyedStateReaderFunctionImpl in the Flink client which gets serialized and sent to each task manager. Each task manager then deserializes a copy of that KeyedStateReaderFunctionImpl and calls its open and readKey methods, and gradually builds up a private Map containing its share of the data extracted from the savepoint/checkpoint.
Meanwhile the original KeyedStateReaderFunctionImpl back in the Flink client has never had its open or readKey methods called, and doesn't hold any data.
In your case the parallelism is one, so there is only one task manager, but in general you will need collect the output from each task manager and assemble together the complete results from these pieces. These results are not available in the flink client process because the work hasn't been done there.
I found a solution, started job in attached mode and collecting results in main thread
val env = ExecutionEnvironment.getExecutionEnvironment();
val configuration = env.getConfiguration();
configuration
.setBoolean(DeploymentOptions.ATTACHED, true);
...
val myresults = dataSource.collect();
Hope will help somebody else because I wasted couple of days while trying to find a soltion.
I have built a plain Kafka streams API using Low-level Kafka API. The topology is linear.
p1 -> p2 -> p3
While doing context.forward(), I am getting NPE, snippet here:
NAjava.lang.NullPointerException: null
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:178)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:133)
...
I am using Kafka Stream 2.3.0.
I see a similar SO question [here][1], and the question is based on the very old version. So, not sure if this is the same error?
Edit
I am putting some more info, keeping the Gist of what I am doing
public class SP1Processor implements StreamProcessor {
private StreamProcessingContext ctxt;
// In init(), create a single thread pool
// which does some processing and sends the
// data to next processor
#Override
void init(StreamProcessingContext ctxt) {
this.ctxt = ctxt;
// Create a thread pool, do some work
// and then do this.ctxt.forward(K,V)
// Not showing code of Thread pool
// Strangely, inside this thread pool,
// this.ctxt isn't same what I see in process()
// shouldn't it be same? ctxt is member variable
// and shouldn't it be same
// this.ctxt.forward(K,V) here in this thread pool is causing NPE
// why does it happen?
this.ctxt.forward(K,V);
}
#Override
void process(K,V) {
// Here do some processing and go to the next processor chain
// This works fine
this.ctxt.forward(K,V);
}
}
[1]: https://stackoverflow.com/questions/39067846/periodic-npe-in-kafka-streams-processor-context
It looks like it could be the same issue as the linked question, although we are talking about a much more contemporary version in your case.
Make sure that ProcessorSupplier.get() returns a new instance each time it is called.
You shouldn't create any thread pool inside Processor or DSL calls.
Parallelism is managed in KafkaStreams by num.stream.threads, number of partitions and number of instances.
ctxt is the same but its fields/members might be different (ex. currentNode) - might be change by different threads.
I write a thread safe class to get input from multiple threads and upload the result to S3 once it runs up to a fixed size.
S3Exporter class
// this class is thread safe.
public class S3Exporter {
private static final int BUFFER_PADDING = 1000;
private final int targetSize;
private final ByteArrayOutputStream buf;
private volatile boolean started;
public S3Exporter(final int targetSize) {
buf = new ByteArrayOutputStream(targetSize + BUFFER_PADDING);
this.targetSize = targetSize;
started = false;
}
public synchronized void start() {
started = true;
}
public synchronized void end() {
started = false;
flush();
}
public synchronized void export(byte[] data) throws IOException {
Preconditions.checkState(started, "Not started!");
buf.write(b, buf.size(), b.length);
flushIfNeeded();
}
private void flushIfNeeded() {
if (buf.size() >= targetSize) {
flush();
}
}
public synchronized void flush() {
if (buf.size() > 0) {
// upload buf to s3, it's a time-consuming operation
buf.reset();
}
}
}
The client calls export method to pass data and if exception is thrown the client will pass that data later.
To avoid losing data when restarting the application, I add a shutdown hook when creating S3Exporter object:
S3Exporter exporter = new S3Exporter(10000);
Runtime.getRuntime().addShutdownHook(new Thread(() -> exporter.end()));
My concern is the class is not scalable, I mean it could become bottleneck of the system when data are getting more. I could figure out 2 ways to improve the situation:
do the time-consuming upload operation asynchronously: use an executor to upload and call ThreadPoolExecutor.awaitTermination() in the shutdown hook.
just put data to a LinkedBlockingQueue in export method and use multiple threads to handle it.( This way is more scalable than the first per my understanding)
Then I need to do more work in the shutdown hook thread to make sure not losing the accepted data and it's not a good idea as I know. I'll take the risk of losing data when restarting the application, which is the last thing I wanna see.
My question
Is my concern about the scalability a really problem?( To make the question less stupid, let's say the data size is a few bytes and TPS to call export method is 500)
If the answer to the 1st question is yes, what about my improvements, are they right? How to do the cleanup work to avoid losing data?
Scalability depends on requirements, constraints, desired service level, personal preferences, expected users growth rate, and especially money: given infinite resources, every piece of software can be scaled. You didn't mention any, so I guess you don't have any actual figure. In this phase, as a programmer, your job is to make a correct program that uses a predictable amount of resources.
Your program seems correct, and most of your assumptions are correct, too. However I suggest to immediately store chunks to some local persistent database (or the raw filesystem) and have a periodic job, run in a separate thread, that upload group of chunks to S3, and remove any shutdown hooks (you can use Camel for the boring parts). This is because such hooks are unreliable and should only be used as last resources for quick and optional cleanup (optional in the sense that you must be prepared that the cleanup could not have been run properly until the end).
Using a file instead of memory, your data can survive fatal errors and the working memory required by your application is almost independent on the load: there's an irrelevant amount of extra CPU and some disk I/O that is way cheaper then memory.
I have made a Java program that connects to a SQLite database using SQLite4Java.
I read from the serial port and write values to the database. This worked fine in the beginning, but now my program has grown and I have several threads. I have tried to handle that with a SQLiteQueue-variable that execute database operations with something like this:
public void insertTempValue(final SQLiteStatement stmt, final long logTime, final double tempValue)
{
if(checkQueue("insertTempValue(SQLiteStatement, long, double)", "Queue is not running!", false))
{
queue.execute(new SQLiteJob<Object>()
{
protected Object job(SQLiteConnection connection) throws SQLiteException
{
stmt.bind(1, logTime);
stmt.bind(2, tempValue);
stmt.step();
stmt.reset(true);
return null;
}
});
}
} // end insertTempValue(SQLiteStatement, long, double)
But now my SQLite-class can't execute the statements reporting :
DB[1][U]: disposing [INSERT INTO Temperatures VALUES (?,?)]DB[1][U] from alien thread
SQLiteDB$6#8afbefd: job exception com.almworks.sqlite4java.SQLiteException: [-92] statement is disposed
So the execution does not happen.
I have tried to figure out what's wrong and I think I need a Java wrapper that makes all the database operations calls from a single thread that the other threads go through.
Here is my problem I don't know how to implement this in a good way.
How can I make a method-call and ensure that it always runs from the same thread?
Put all your database access code into a package and make all the classes package private. Write one Runnable or Thread subclass with a run() method that runs a loop. The loop checks for queued information requests, and runs the appropriate database access code to find the information, putting the information into the request and marking the request complete before going back to the queue.
Client code queues data requests and waits for answers, perhaps by blocking until the request is marked complete.
Data requests would look something like this:
public class InsertTempValueRequest {
// This method is called from client threads before queueing
// Client thread queues this object after construction
public InsertTempValueRequest(
final long logTime,
final double tempValue
) {
this.logTime = logTime
this.tempValue = tempValue
}
// This method is called from client threads after queueing to check for completion
public isComplete() {
return isComplete;
}
// This method is called from the database thread after dequeuing this object
execute(
SQLiteConnection connection,
SQLiteStatement statement
) {
// execute the statement using logTime and tempValue member data, and commit
isComplete = true;
}
private volatile long logTime;
private volatile double tempValue;
private volatile boolean isComplete = false;
}
This will work, but I suspect there will be a lot of hassle in the implementation. I think you could also get by by using a lock that only permits one thread at a time to access the database, and also - this is the difference from your existing situation - beginning the access by creating the database resources - including statements - from scratch, and disposing of those resources before releasing the lock.
I found a solution to my problem. I have now implemented a wrapper-class that makes all operations with my older SQLite-class using an ExecutorService, inspired from Thread Executor Example and got the correct usage from Java Doc ExecutorService.
What is the best way to share a file between two "writer" services in the same application?
Edit:
Sorry I should have given more details I guess.
I have a Service that saves entries into a buffer. When the buffer gets full it writes all the entries to the file (and so on). Another Service running will come at some point and read the file (essentially copy/compress it) and then empty it.
Here is a general idea of what you can do:
public class FileManager
{
private final FileWriter writer = new FileWriter("SomeFile.txt");
private final object sync = new object();
public void writeBuffer(string buffer)
{
synchronized(sync)
{
writer.write(buffer.getBytes());
}
}
public void copyAndCompress()
{
synchronized(sync)
{
// copy and/or compress
}
}
}
You will have to do some extra work to get it all to work safe, but this is just a basic example to give you an idea of how it looks.
A common method for locking is to create a second file in the same location as the main file. The second file may contain locking data or be blank. The benefit to having locking data (such as a process ID) is that you can easily detect a stale lockfile, which is an inevitability you must plan for. Although PID might not be the best locking data in your case.
example:
Service1:
creates myfile.lock
creates/opens myfile
Service2:
Notices that myfile.lock is present and pauses/blocks/waits
When myfile.lock goes away, it creates it and then opens myfile.
It would also be advantageous for you to double-check that the file contains your locking information (identification specific to your service) right after creating it - just in case two or more services are waiting and create a lock at the exact same time. The last one succeeds and so all other services should notice that their locking data is no longer in the file. Also - pause a few milliseconds before checking its contents.