I have a web application that fetches IDs from Elasticsearch. After fetching the IDs I divide them into batches of 1000 and query the database for more information about them. I have to query 4 different tables to fetch all the information I want for a particular ID. To optimize this I used a singleton fixed thread pool of size 10 to process the batches of 1000. Then, from each of these threads, I made a fixed thread pool of size 4 to query the 4 tables in parallel. Since it is a web application, it receives requests from different users in parallel, which creates a bottleneck because there is a single pool (of size 10) for all requests. I just want to know what the ideal way of parallelizing the tasks here would be.
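For reference, the per-batch fan-out described above can be sketched with a single shared pool and CompletableFuture instead of nested fixed pools. This is only a sketch; queryTable is a placeholder for the real per-table DB call, and the pool size is something to tune:

```java
import java.util.List;
import java.util.concurrent.*;
import java.util.stream.Collectors;

public class BatchFetcher {
    // One shared pool for everything; per-request nested fixed pools are the bottleneck.
    private final ExecutorService pool = Executors.newFixedThreadPool(
        Runtime.getRuntime().availableProcessors());

    // Placeholder for the real per-table database call.
    String queryTable(int table, List<Long> ids) {
        return "table" + table + ":" + ids.size();
    }

    // Fan out the 4 table queries for one batch of IDs and join the results.
    List<String> fetchBatch(List<Long> ids) {
        List<CompletableFuture<String>> futures = List.of(1, 2, 3, 4).stream()
            .map(t -> CompletableFuture.supplyAsync(() -> queryTable(t, ids), pool))
            .collect(Collectors.toList());
        return futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
    }

    void shutdown() { pool.shutdown(); }
}
```

With one shared pool, concurrent user requests queue up behind the same workers instead of spawning extra pools per request.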
I have data stored at HDFS and N Postgres shards (separate instances).
During data retrieval, we pick the exact shard for a SQL request based on a Java COLUMN_STRING_DATA.hashCode() % N calculation.
We have a Spark setup with 5 workers, each with 32 CPUs, and 2 CPUs per task.
I'm struggling with uploading data to the shards because the database connections are limited by the database engineering team. Each worker tries to upload data to every shard, so I'm limited by the allowed connections to the PG shards. That's why I can't fully utilize the workers and the PG instances.
So I would like to somehow bind each worker to a unique, limited group of PG shards.
Dataset<Row> df = spark.read()
    .option("header", "true")
    .orc(pathToHdfs)
    .withColumn("hash_group", functions.hash(new Column("search")).mod(PG_SHARDS_SIZE));
functions.hash uses the MurmurHash algorithm, while during data retrieval from PG I choose the shard based on Java String.hashCode() % N. I googled, and I'm able to write a custom function, so that shouldn't be a problem.
The question is: after creating the new column with integer values in the range [1..PG_SHARDS_SIZE], how can I assign Worker #1 to work only with values [1,2] of the hash_group column, Worker #2 with values [3,4], and so on?
Maybe there are other approaches to achieve binding a worker to an exact group of PG shards?
I have a 1-to-1 mapping of Worker to Executor, so binding an Executor to a group of PG shards works for me as well.
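One detail worth noting: Spark's functions.hash (Murmur3) will not produce the same values as Java's String.hashCode(), so to keep upload and retrieval consistent, the custom function mentioned above would need to replicate the retrieval-side computation. A minimal sketch of that shard function in plain Java (floorMod is used here as an assumption, to keep the result non-negative even when hashCode() is negative):

```java
public class ShardRouter {
    // Replicates the retrieval-side shard choice: String.hashCode() % N.
    // floorMod keeps the result in [0, n) even for negative hash codes,
    // unlike the % operator, which can return negative values.
    public static int shardFor(String key, int n) {
        return Math.floorMod(key.hashCode(), n);
    }
}
```

Wrapping this in a Spark UDF (instead of using functions.hash) would make the hash_group column agree with the shard chosen at read time.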
I have a queue with jobs that go to different executor pools depending on the type of job. The queue is in a DB table and contains jobs from different clients with priorities, etc. I'm omitting some details irrelevant to the question.
At some point, different clients put many jobs in the queue at the same time with the same priority, for example about 15,000-20,000 jobs.
With the current implementation, jobs are fetched using Hibernate with these criteria; again, I'm omitting some restrictions for simplicity.
Calendar cal = Calendar.getInstance();
cal.add(Calendar.MINUTE, -minutes);
Criteria c = getSession().createCriteria(QueueEntry.class)
.add(Restrictions.eq("processing", false))
.add(Restrictions.or(Restrictions.ge("serverTimestamp", cal.getTime()), Restrictions.ge("sentTimestamp", cal.getTime())))
.add(Restrictions.lt("attemps", attemps))
.addOrder(Order.asc("priority"))
.addOrder(Order.asc("serverTimestamp"))
.setMaxResults(limit);
In the current situation, if client A inserts 15k tasks at 10:00:00 and client B inserts 3k tasks at 10:00:05 (5 seconds later) with the same priority, B's tasks will be fetched and executed only after A's.
I need to balance the fetched jobs between the clients (there's a "client" column in the queue table). For example, if the throughput is 10 tasks/sec, I want to get 5 of A's tasks and 5 of B's. When there are no more tasks for client B, I want to get 10 of A's tasks.
Is there some easy way or trick to do this with the query? The DB is Postgres.
I don't think you will be able to do it by modifying your existing Criteria or with just a single query. To prevent client starvation you would have to create separate resource pools for each client, which is the approach taken by the Fair Scheduler for Hadoop:
The fair scheduler organizes jobs into pools, and divides resources fairly between these pools. By default, there is a separate pool for each user, so that each user gets an equal share of the cluster. It is also possible to set a job's pool based on the user's Unix group or any jobconf property. Within each pool, jobs can be scheduled using either fair sharing or first-in-first-out (FIFO) scheduling.
You can run a query to get the list of distinct clients with their total number of awaiting jobs. Based on the distinct client count, divide the global job limit and fetch the awaiting jobs for each client in a separate query.
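The limit division described above could be sketched like this. The names are illustrative, and clientCounts would come from a GROUP BY client count query against the queue table:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FairFetchPlanner {
    // Split a global fetch limit evenly across clients that have waiting jobs.
    // clientCounts: client -> number of awaiting jobs (from a GROUP BY query).
    public static Map<String, Integer> perClientLimits(Map<String, Long> clientCounts,
                                                       int globalLimit) {
        Map<String, Integer> limits = new LinkedHashMap<>();
        if (clientCounts.isEmpty()) return limits;
        // Equal share per client, at least 1 so nobody starves entirely.
        int share = Math.max(1, globalLimit / clientCounts.size());
        for (Map.Entry<String, Long> e : clientCounts.entrySet()) {
            // A client never gets a bigger limit than it has jobs waiting.
            limits.put(e.getKey(), (int) Math.min(share, e.getValue()));
        }
        return limits;
    }
}
```

Each client's limit is then used as setMaxResults in a per-client version of the existing Criteria query. Any share unused by a small client (like B above) could be redistributed in a second pass.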
Multiple instances of my multi-threaded (approx. 10 threads) application are running on different machines (approx. 10 machines). So overall, about 100 threads of this application are active simultaneously.
Each of these threads produces 4 output sets, each containing 1k-5k rows. Each set is pushed to a single MySQL machine, same DB (insert or update operations). There are 4 tables, each consuming one of the 4 sets produced by a thread.
I am using MyBatis as the ORM. These threads may spend more time writing output to the DB than processing the requests.
How can I optimize the database writes in this case?
1. Use the batch processing support of MyBatis?
2. Write data to files that are picked up by a single consumer thread and written to the DB?
3. Write each data set to a different file and use 4 consumer threads, each picking up data destined for the same table, so locking is minimized?
Please suggest other, better ways if possible.
Databases are made to handle concurrency. I'm not sure what exactly MyBatis brings into the picture (not a huge fan of ORMs in general), but if using it makes you start thinking about hacks like intermediate files and single-threaded updates, you are probably much better off ripping it out and writing to the DB with plain JDBC, which should have no problem handling your use case, provided you batch your updates adequately.
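A plain-JDBC version of batched writes might look like the sketch below. The table and column names are made up, and rowsPerBatch is something to tune for your setup:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class JdbcBatchWriter {
    // How many executeBatch() calls a given row count needs (ceiling division).
    public static int batchCount(int totalRows, int rowsPerBatch) {
        return (totalRows + rowsPerBatch - 1) / rowsPerBatch;
    }

    // Hypothetical schema: result_set_1(id, payload). Rows are {id, payload} pairs.
    public static void insertAll(Connection con, List<long[]> rows, int rowsPerBatch)
            throws SQLException {
        String sql = "INSERT INTO result_set_1 (id, payload) VALUES (?, ?)";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            int inBatch = 0;
            for (long[] row : rows) {
                ps.setLong(1, row[0]);
                ps.setLong(2, row[1]);
                ps.addBatch();
                if (++inBatch == rowsPerBatch) {   // flush a full batch
                    ps.executeBatch();
                    inBatch = 0;
                }
            }
            if (inBatch > 0) ps.executeBatch();    // flush the remainder
        }
    }
}
```

With 1k-5k rows per set, a batch size somewhere in the hundreds to low thousands usually keeps round trips low without holding huge statements in memory; measure for your schema.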
I am using a single standalone MongoDB server with no special topology like replication or sharding. Currently the issue is that MongoDB does not support more than 500 parallel requests. Note that I am using only one instance of MongoClient, and the remaining threads are used for inserts. I am using the Java executor framework to create the threads, and these threads insert data into a collection (all inserts go to the same collection).
You should queue the requests before issuing them to the database. There is no use in requesting 500 things from your database in parallel. Remember, a single request comes with some cost, memory-wise, locking-wise, and so on. You are actually wasting resources by asking your database for too much at once (I mean this request-wise, not data-wise).
So use a queue (or more than one) and pool up the requests. From that pool you feed your worker threads (let's say 5 or 10 are enough), and that's it.
Take a look at the Future interface in the java.util.concurrent package. Using asynchronous processing here looks like the approach with the highest throughput and the lowest resource impact.
But check the MongoDB driver first. I would not be surprised if it already implements things this way. If that is the case, you just have to limit yourself by using a queue so that only, let's say, 10 or 100 requests at a time are being handled by the database driver. Do some performance checks, tweaking the number of actual requests sent to the database.
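One way to cap in-flight requests, as suggested above, is a ThreadPoolExecutor with a bounded queue. The pool size and queue capacity here are illustrative and worth benchmarking:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedInsertPool {
    // 10 workers, at most 100 queued inserts. When the queue is full, the
    // submitting thread runs the task itself, which naturally throttles producers
    // instead of letting unbounded work pile up in front of the database.
    private final ThreadPoolExecutor pool = new ThreadPoolExecutor(
        10, 10, 0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(100),
        new ThreadPoolExecutor.CallerRunsPolicy());

    // insertTask would wrap the actual collection.insert(...) call.
    public Future<?> submit(Runnable insertTask) {
        return pool.submit(insertTask);
    }

    public void shutdownAndWait() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```

This way, no matter how many caller threads exist, the database only ever sees 10 concurrent requests plus a bounded backlog.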
I have a MySQL database with a large number of rows.
I want to initialize multiple Threads (each with its own database connection) in Java and read/print the data simultaneously.
How do I partition the data between multiple threads so that no two threads read the same record? What strategies can be used?
It depends on what kind of work your threads are going to do. For example, I usually execute a single SELECT for some kind of large dataset, add tasks to a thread-safe task queue, and submit workers which pick up tasks from the queue to process. I usually write to the DB without synchronization, but that depends on the size of the unit of work and on DB constraints (like unique keys, etc.). Works like a charm.
Another method would be to simply run multiple threads and let them work on their own. However, I strongly advise against fancy LIMIT/OFFSET usage: it still requires the DB to fetch more rows than the query will actually return.
EDIT:
As you have added a comment that you have the same data, then yes, my solution is what you are looking for:
Get the dataset with a single query.
Add the data to a queue.
Launch your threads (via executors or new threads).
Pick data from the queue and process it.
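The four steps above can be sketched with a BlockingQueue and an executor. This is only a sketch; the processing step is a stand-in for real per-row work:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class QueueProcessor {
    // Process every item of one dataset with nThreads workers; returns the
    // number of items processed. The queue is filled before workers start,
    // so poll() returning null cleanly ends each worker.
    public static int process(Iterable<String> dataset, int nThreads)
            throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        for (String row : dataset) queue.add(row);   // steps 1-2: single query -> queue
        AtomicInteger processed = new AtomicInteger();
        ExecutorService workers = Executors.newFixedThreadPool(nThreads);  // step 3
        for (int i = 0; i < nThreads; i++) {
            workers.submit(() -> {
                String row;
                while ((row = queue.poll()) != null) {   // step 4: pick and process
                    processed.incrementAndGet();          // stand-in for real work
                }
            });
        }
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.MINUTES);
        return processed.get();
    }
}
```

Because only the queue hands out rows, no two threads can ever receive the same record.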
If the large dataset has an integer primary key, then one approach would be as follows:
Get the count of rows using the same SELECT query.
Divide the entire dataset into an equal number of partitions.
Assign each partition to a thread. Each thread will have its own SELECT query with a primary key value range as a constraint.
Note the following issues with this approach:
You fire (number of threads + 1) queries to the database, so performance might be a problem.
All the partitions may not be equal (as there will be some IDs which were deleted).
Still, this approach is simple and ensures that a row is processed by strictly one thread.
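The partitioning step above might be computed like this, assuming the extra query fetches MIN(id) and MAX(id) so the key space can be cut into contiguous ranges (a variant of the row-count approach; with deleted IDs the ranges will be uneven in practice, as noted):

```java
import java.util.ArrayList;
import java.util.List;

public class KeyRangePartitioner {
    // Split [minId, maxId] into nThreads contiguous {lo, hi} ranges (inclusive).
    // Each thread then runs: SELECT ... WHERE id BETWEEN lo AND hi.
    public static List<long[]> partition(long minId, long maxId, int nThreads) {
        List<long[]> ranges = new ArrayList<>();
        long span = maxId - minId + 1;
        long chunk = (span + nThreads - 1) / nThreads;   // ceiling division
        for (long lo = minId; lo <= maxId; lo += chunk) {
            ranges.add(new long[]{lo, Math.min(lo + chunk - 1, maxId)});
        }
        return ranges;
    }
}
```

Since the ranges never overlap, no two threads can read the same record.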
You can use a singleton class to keep track of already-read rows, so every thread can get the next row number from that singleton.
Alternatively, you can use a static AtomicInteger variable in a common class. Each thread calls its getAndIncrement method to claim the next row, and this way you partition the data between the threads.
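The AtomicInteger variant above can be sketched as follows: each thread claims a distinct row index via getAndIncrement, so no two threads ever read the same record. The demo method is just to show the claiming pattern:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class RowClaimer {
    private static final AtomicInteger NEXT = new AtomicInteger(0);

    // Each call hands out a unique, monotonically increasing row index;
    // getAndIncrement is atomic, so concurrent callers never collide.
    public static int claimNextRow() {
        return NEXT.getAndIncrement();
    }

    // Demo: nThreads claim `total` rows between them, with no duplicates.
    public static Set<Integer> claimAll(int total, int nThreads)
            throws InterruptedException {
        Set<Integer> claimed = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        for (int i = 0; i < total; i++) {
            pool.submit(() -> claimed.add(claimNextRow()));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return claimed;
    }
}
```

Each thread would use its claimed index to fetch one row (e.g. via a keyed lookup), trading one shared counter for the up-front partitioning of the range approach.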