I have a program which accesses a single RocksDB using multiple threads.
Our workflow for a given document is to read the cache, do some work, then update the cache.
My code uses chained CompletableFutures to process multiple documents in order (each document is fully processed before the next one starts). So my RocksDB workload consists of (read, write) repeated several times for the same key.
Most of the time we get the correct value from the cache for each run through the workflow, but occasionally we will get stale data. Each operation could run on one of many threads in the Executor, but they will never run in parallel for the same key.
Is there a way to ensure that we get strong consistency? I wrote a unit test to confirm that this happens: we see stale data between 1% and 3% of the time. I even added a read-after-write, and that reduced the inconsistency, but did not eliminate it.
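For reference, the chain looks roughly like this (readCache, doWork, and writeCache are simplified stand-ins for our actual read, processing, and write steps):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

// Simplified sketch of our chain; readCache/doWork/writeCache are
// stand-ins for the real RocksDB read, processing, and write steps.
CompletableFuture<Void> processInOrder(List<String> keys, Executor executor) {
    CompletableFuture<Void> chain = CompletableFuture.completedFuture(null);
    for (String key : keys) {
        chain = chain
            .thenApplyAsync(ignored -> readCache(key), executor)           // read
            .thenApplyAsync(cached -> doWork(key, cached), executor)       // work
            .thenAcceptAsync(result -> writeCache(key, result), executor); // update
    }
    return chain; // each document finishes before the next one starts
}
```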
Not sure what you are referring to as strong consistency; RocksDB is strongly consistent - there is no across-the-network replication going on where you would see eventual consistency.
If you want a snapshotted read, use a snapshot sequence identifier when doing your reads.
It sounds more like a threading issue, where your reads and writes are happening in non-deterministic order.
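With the RocksJava API, a snapshotted read looks roughly like this (assuming `db` is your open RocksDB instance and `key` is a byte[]):

```java
import org.rocksdb.ReadOptions;
import org.rocksdb.Snapshot;

// Pin a snapshot so the read sees a fixed sequence number,
// regardless of writes that land in between.
Snapshot snapshot = db.getSnapshot();
try (ReadOptions readOptions = new ReadOptions().setSnapshot(snapshot)) {
    byte[] value = db.get(readOptions, key); // state as of the snapshot
} finally {
    db.releaseSnapshot(snapshot);
}
```

Note that a snapshot gives you a repeatable view, not fresher data - if you are seeing stale values, chase down the ordering of your reads and writes first.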
I have a datastream in which the order of the events is important. The time characteristic is set to EventTime as the incoming records have a timestamp within them.
In order to guarantee the ordering, I set the parallelism for the program to 1. Could that become a problem, performance-wise, when my program gets more complex?
If I understand correctly, I need to assign watermarks to my events, if I want to keep the stream ordered by timestamp. This is quite simple. But I'm reading that even that doesn't guarantee order? Later on, I want to do stateful computations over that stream. So, for that I use a FlatMap function, which needs the stream to be keyed. But if I key the stream, the order is lost again. AFAIK this is because of different stream partitions, which are "caused" by parallelism.
I have two questions:
Do I need parallelism? What factors do I need to consider here?
How would I achieve "ordered parallelism" with what I described above?
Several points to consider:
Setting the parallelism to 1 for the entire job will prevent scaling your application, which will affect performance. Whether this actually matters depends on your application requirements, but it would certainly be a limitation, and could be a problem.
If the aggregates you've mentioned are meant to be computed globally across all the event records then operating in parallel will require doing some pre-aggregation in parallel. But in this case you will then have to reduce the parallelism to 1 in the later stages of your job graph in order to produce the ultimate (global) results.
If on the other hand these aggregates are to be computed independently for each value of some key, then it makes sense to consider keying the stream and to use that partitioning as the basis for operating in parallel.
All of the operations you mention require some state, whether computing max, min, averages, or uptime and downtime. For example, you can't compute the maximum without remembering the maximum encountered so far.
If I understand correctly how Flink's NiFi source connector works, then if the source is operating in parallel, keying the stream will result in out-of-order events.
However, none of the operations you've mentioned require that the data be delivered in-order. Computing uptime (and downtime) on an out-of-order stream will require some buffering -- these operations will need to wait for out-of-order data to arrive before they can produce results -- but that's certainly doable. That's exactly what watermarks are for; they define how long to wait for out-of-order data. You can use an event-time timer in a ProcessFunction to arrange for an onTimer callback to be called when all earlier events have been processed.
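For illustration, a rough sketch of that timer pattern (Event is a placeholder type with a `timestamp` field):

```java
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Rough sketch: register an event-time timer per event; onTimer fires once
// the watermark passes, i.e. all earlier events should have arrived.
// Event is a placeholder type with a `timestamp` field.
public class WaitForWatermark extends KeyedProcessFunction<String, Event, Event> {

    @Override
    public void processElement(Event event, Context ctx, Collector<Event> out) throws Exception {
        // buffer `event` in keyed state here, then ask for a callback
        ctx.timerService().registerEventTimeTimer(event.timestamp);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Event> out) throws Exception {
        // all events with timestamps <= `timestamp` have been seen;
        // emit the buffered events up to `timestamp`, in order
    }
}
```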
You could always sort the keyed stream. Here's an example.
The uptime/downtime calculation should be easy to do with Flink's CEP library (which sorts its input, btw).
UPDATE:
It is true that after applying a ProcessFunction to a keyed stream the stream is no longer keyed. But in this case you could safely use reinterpretAsKeyedStream to inform Flink that the stream is still keyed.
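For example, with the (experimental) DataStreamUtils helper - `processed` stands for the ProcessFunction's output, and the key selector here is made up; it must match the one used to key the stream originally:

```java
import org.apache.flink.streaming.api.datastream.DataStreamUtils;
import org.apache.flink.streaming.api.datastream.KeyedStream;

// `processed` is the ProcessFunction's output, still partitioned by the
// original key; the key selector must match the one used before.
KeyedStream<Event, String> stillKeyed =
    DataStreamUtils.reinterpretAsKeyedStream(processed, event -> event.deviceId);
```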
As for CEP, this library uses state on your behalf, making it easier to develop applications that need to react to patterns.
In our current Java project, we need to batch process a huge set of records. Once this processing is done, it must start over and process all records again. This processing must be parallelized as well as distributed among multiple nodes.
The records themselves are stored in a database. Using some id range (e.g. 1-10000) to identify a batch would be sufficient.
From a high level perspective, I see the following steps:
A sub task processes one batch of records.
A master task checks if any sub task is still running. If not, create one sub task for each batch of records.
We use MongoDB quite heavily and thought of persisting the sub tasks in it. Then each node can pick up sub tasks that are not done yet, do the processing, and mark the record as done. Once there are no undone sub tasks, the master task creates all the sub tasks again. This would probably work, but we are looking for a solution where we don't need to do the heavy synchronization work ourselves.
Could this be a possible use-case for akka?
Can akka-persistence be used to synchronize the processing among different nodes?
Are there any other Java/JVM frameworks suited for this job?
Your question is way too broad for SO's format. Please read this guide in the future before asking, and don't ask your group members to vote your question up just to inflate what is obviously an ill-posed question ( ͡° ͜ʖ ͡°).
Anyways:
1) Yes, you can implement your requirements in Akka. In particular, since you mentioned multiple nodes, you are looking at the akka-cluster module (for inter-node communication), and you might also need akka-cluster-sharding (in case you want to keep all the data in memory, not just during processing).
2) No, I would strongly recommend against that. While you could technically force your problem into using akka-persistence for synchronizing the tasks, the goal of akka-persistence is simply to make an actor's state persistent. Akka itself, in its basic form, is enough for handling all your synchronization issues. Simply have a master actor create a worker for every subtask and monitor its completion (a minimal sketch follows below).
3) Yes. Note that the answer to this question is always yes no matter which job.
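To illustrate 2), here is a minimal sketch of that master/worker idea in classic Akka (Worker and BatchDone are hypothetical classes you would define):

```java
import java.util.List;
import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.Props;

// Hypothetical sketch: the master spawns one worker per batch and counts
// BatchDone replies; when all are in, it can start the next full run.
class Master extends AbstractActor {
    private int outstanding;

    @Override
    public Receive createReceive() {
        return receiveBuilder()
            .match(List.class, batches -> {
                outstanding = batches.size();
                for (Object batch : batches) {
                    ActorRef worker = getContext().actorOf(Props.create(Worker.class));
                    worker.tell(batch, getSelf());
                }
            })
            .match(BatchDone.class, done -> {
                if (--outstanding == 0) {
                    // all subtasks finished; schedule the next run here
                }
            })
            .build();
    }
}
```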
I am wondering how I would go about doing this. Say I load a list of 1,000 words, and for each word a thread is created that does a Google search on that word. The problem here is obvious: I can't have 1,000 threads, can I? Keep in mind I am extremely new to threads and synchronization. So basically I am wondering how I would go about using fewer threads. I assume I have to set the thread count to a fixed number and synchronize the threads. I was wondering how to do this with Apache HttpClient, using a GetThread and then running it. In run() I get the data from the webpage, turn it into a String, and then check whether it contains a certain word.
Surely you can have as many threads as you want. But in general it is not recommended to use more threads than there are processing cores on your computer.
And don't forget that creating 1,000 internet sessions at once affects your networking. A single Google page is nearly 0.3 megabytes. Are you really going to download 300 megabytes of data at once?
By the way,
There is a funny thing about concurrency.
Some people say: "synchronization is like concurrency". It is not true.
Synchronization is the opposite of concurrency.
Concurrency is when lots of things happen in parallel.
Synchronization is when I am blocking you.
(Joshua Bloch)
Maybe you can look at this problem this way.
You have 1,000 words, and for each word you are going to carry out a search. In other words, there are 1,000 tasks to be executed, and they are not related to each other, so there is no need for synchronization in the case of this problem, as per the following definition from Wikipedia.
"In computer science, synchronization refers to one of two distinct but related concepts: synchronization of processes, and synchronization of data. Process synchronization refers to the idea that multiple processes are to join up or handshake at a certain point, in order to reach an agreement or commit to a certain sequence of action. Data Synchronization refers to the idea of keeping multiple copies of a dataset in coherence with one another, or to maintain data integrity"
So in this problem you do not have to synchronize the 1,000 processes which execute the word searches, since they can run independently and don't need to join forces. So it is not process synchronization.
It is not data synchronization either, since the data of each search is independent of the other 999 searches.
Hence, when Joshua says "Synchronization is when I am blocking you", there is no need for blocking in this case.
Yes, all the tasks can be executed concurrently in different threads.
Of course, your system may not have the resources to run 1,000 threads concurrently (read: at the same time). So you need concepts like pools, where a pool has a certain number of threads... say it has 10 threads... then those 10 will start 10 independent searches on 10 words from your list. If any of them is done with its task, it will take up the next available word-search task, and the process goes on...
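In Java, that pool idea looks roughly like this (searchFor is a made-up method that fetches one results page and scans it for the word):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// 10 threads work through 1,000 independent search tasks.
ExecutorService pool = Executors.newFixedThreadPool(10);
for (String word : words) {
    pool.submit(() -> searchFor(word)); // a free thread picks up the next word
}
pool.shutdown(); // accept no new tasks; let the queued ones drain
```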
While a Hadoop job is running or in progress, if I write something to HDFS or HBase, will that data be visible to all nodes in the cluster:
1.) Immediately?
2.) If not immediately, then after how much time?
3.) Or can the time really not be determined?
HDFS is strongly consistent, so once a write has completed successfully, the new data should be visible across all nodes immediately. Clearly the actual writing takes some time - see replication pipelining for some details on this.
This is in contrast to eventually consistent systems, where it may take an indefinite time (though often only a few milliseconds) before all nodes see a consistent view of the data.
Systems such as Cassandra have tunable consistency - each read and write can be performed at a different level of consistency to suit the operation being performed.
To the best of my understanding, the data is visible immediately after the write operation is finished.
Let's look at some aspects of the process:
When a client writes to HDFS, the data is written to all replicas, and after the write operation finishes it should be fully available.
There is also only one place with metadata - the NameNode - which has no notion of isolation that would allow hiding data until some larger piece of work is done.
HBase is a different case - it only writes its log to HDFS immediately, and its HFiles are updated with new data only after compaction. At the same time, once HBase itself writes something into HDFS, that data is visible immediately.
In HDFS, data is visible once it is flushed or synced using the hflush() or hsync() method - these methods were introduced in version 0.21, I believe. hflush() gives you a guarantee that the data is visible to all readers. hsync() gives you a guarantee that the data was saved to disk (although it may still be in your disk cache). The write method alone does not give you any guarantees. To answer your question: in HDFS, data is visible to everyone immediately after doing hflush() or hsync().
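For illustration, a sketch with the HDFS client API (the path here is made up):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// After hflush(), new readers see the data; hsync() additionally
// forces it out of the datanodes' buffers to disk.
FileSystem fs = FileSystem.get(new Configuration());
try (FSDataOutputStream out = fs.create(new Path("/tmp/visibility-demo"))) {
    out.write("some record\n".getBytes("UTF-8"));
    out.hflush(); // a reader opening the file now will see the record
}
```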
Our company has a batch application which runs every day. It mostly does database-related jobs, for example importing data into database tables from files.
There are 20+ tasks defined in that application; each one may depend on others or not.
The application executes the tasks one by one, and the whole application runs in a single thread.
It takes 3~7 hours to finish all the tasks. I think that's too long, so I think maybe I can improve performance with multi-threading.
I think that since there are dependencies between tasks, it is not good (or not easy) to make the tasks themselves run in parallel, but maybe I can use multi-threading to improve performance inside a task.
For example: we have a task called "ImportBizData", which copies data into a database table from a data file (usually containing 1,000,000+ rows). I wonder whether that is worth multi-threading?
As I know only a little about multi-threading, I hope someone can provide some tutorial links on this topic.
Multi-threading will improve your performance but there are a couple of things you need to know:
Each thread needs its own JDBC connection. Connections can't be shared between threads because each connection is also a transaction.
Upload the data in chunks and commit once in a while to avoid accumulating huge rollback/undo tables.
Cut tasks into several work units where each unit does one job.
To elaborate the last point: Currently, you have a task that reads a file, parses it, opens a JDBC connection, does some calculations, sends the data to the database, etc.
What you should do:
One (!) thread to read the file and create "jobs" out of it. Each job should contain a small, but not too small, "unit of work". Push those into a queue.
The next thread(s) wait(s) for jobs in the queue and do the calculations. This can happen while the threads in step #1 wait for the slow hard disk to return the new lines of data. The result of this conversion step goes into the next queue
One or more threads to upload the data via JDBC (a rough sketch of this pipeline follows).
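A rough sketch of that hand-off using a BlockingQueue (readAllLines and parse are stand-ins for your real steps):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Bounded queue between the reader and the workers; POISON marks end of input.
BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);
final String POISON = new String("EOF");

Thread reader = new Thread(() -> {
    try {
        for (String line : readAllLines()) { // stand-in for the file reader
            queue.put(line);                 // blocks when the queue is full
        }
        queue.put(POISON);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
});

Thread worker = new Thread(() -> {
    try {
        String line;
        while ((line = queue.take()) != POISON) {
            parse(line); // CPU-bound work overlaps with the reader's I/O
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
});
reader.start();
worker.start();
```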
The first and the last threads are pretty slow because they are I/O bound (hard disks are slow, and network connections are even worse). Plus, inserting data into a database is a very complex task (allocating space, updating indexes, checking foreign keys).
Using different worker threads gives you lots of advantages:
It's easy to test each thread separately. Since they don't share data, you need no synchronization. The queues will do that for you
You can quickly change the number of threads for each step to tweak performance
Multi-threading may help if the lines are uncorrelated: you could start two processes, one reading the even lines and another the odd lines, get your database connections from a connection pool (DBCP), and measure the performance. But first I would investigate whether JDBC is the best approach; databases normally have optimized solutions for imports like this. These solutions can also temporarily switch off constraint checking on your table and turn it back on later, which is also great for performance. As always, it depends on your requirements.
Also, you may want to check out Spring Batch, which is designed for batch processing.
As far as I know, the JDBC-ODBC bridge uses synchronized methods to serialize all calls to ODBC, so using multiple threads won't give you any performance boost unless it boosts your application itself.
I am not all that familiar with JDBC, but regarding the multi-threading bit of your question, what you should keep in mind is that parallel processing relies on effectively dividing your problem into bits that are independent of one another and then, in some way, putting them (that is, their output) back together. If you don't know the underlying dependencies between tasks, you might end up having really odd errors/exceptions in your code. Even worse, it might all execute without any problems, but the results might be off from the true values. Multi-threading is tricky business: in a way fun to learn (at least I think so), but a pain in the neck when things go south.
Here are a couple of links that might provide useful:
Oracle's java trail: best place to start
A good tutorial for java concurrency
an interesting article on concurrency
If you are serious about putting in the effort to get into multi-threading, I can recommend Brian Goetz: Java Concurrency in Practice - an amazing book, really.
Good luck
I had a similar task. But in my case, all the tables were unrelated to each other.
STEP1:
Using SQL*Loader (Oracle) to upload the data into the database (very fast), or any similar bulk-load tool for your database.
STEP2:
Running each upload process in a different thread (for unrelated tasks), and in a single thread for related tasks.
P.S. You could identify the different inter-related jobs in your application, categorize them into groups, and run each group in a different thread.
Links to get you started:
JAVA Threading
follow the last example in the above link(Example: Partitioning a large task with multiple threads)
SQL Loader can dramatically improve performance
The fastest way I've found to insert large numbers of records into Oracle is with array operations. See the "setExecuteBatch" method, which is specific to OraclePreparedStatement. It's described in one of the examples here:
http://betteratoracle.com/posts/25-array-batch-inserts-with-jdbc
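A rough sketch of what that looks like (table and column names are made up; `conn` is an open Oracle connection and `Row` is a placeholder for your record type):

```java
import java.sql.PreparedStatement;
import oracle.jdbc.OraclePreparedStatement;

// Oracle-specific batching: rows queue up client-side and are sent to the
// server in groups of the batch value, cutting round trips dramatically.
PreparedStatement ps =
    conn.prepareStatement("INSERT INTO biz_data (id, val) VALUES (?, ?)");
((OraclePreparedStatement) ps).setExecuteBatch(100); // send every 100 rows

for (Row row : rows) { // Row is a placeholder for your record type
    ps.setLong(1, row.id);
    ps.setString(2, row.val);
    ps.executeUpdate(); // queued client-side until the batch value is hit
}
((OraclePreparedStatement) ps).sendBatch(); // flush the remaining rows
ps.close();
```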
If multi-threading would complicate your work, you could go with async messaging. I'm not fully aware of what your needs are, so the following is based on what I'm seeing currently.
Create a file-reader Java program whose purpose is to read the biz file and put messages into a JMS queue on the server. This could be plain Java with a static void main().
Consume the JMS messages in message-driven beans (you can set a limit on the number of beans to be created in the pool, 50 or 100, depending on the need). If you have multiple servers, well and good: your job is now split across multiple servers.
Each row of data is asynchronously split between 2 servers and 50 beans on each server.
You do not have to deal with threads anywhere in the process. JMS is ideal because your data is within a transaction: if something fails before you send an ack to the server, the message will be redelivered to a consumer, and the load is split between the servers without you doing anything special like multi-threading. A sketch of the consumer side follows.
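The consumer side could look roughly like this (the destination name is made up; the container manages the bean pool, the transaction, and redelivery):

```java
import javax.ejb.MessageDriven;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.TextMessage;

// Sketch of the consumer: the container pools these beans and redelivers
// a message automatically if the transaction rolls back.
// "jms/BizDataQueue" is a made-up destination name.
@MessageDriven(mappedName = "jms/BizDataQueue")
public class BizDataProcessor implements MessageListener {
    @Override
    public void onMessage(Message message) {
        try {
            String row = ((TextMessage) message).getText();
            // ... parse the row and write it to the database ...
        } catch (Exception e) {
            throw new RuntimeException(e); // rollback => the message is redelivered
        }
    }
}
```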
Also, Spring provides Spring Batch, which can help you: http://docs.spring.io/spring-batch/reference/html/spring-batch-intro.html#springBatchUsageScenarios