Best way to implement batch reads in Aerospike - Java

I have a scenario where, for each request, I have to do a batch get of at least 1000 keys.
Currently I am getting 2000 requests per minute, and this is expected to rise.
I have also read that Aerospike's batch get internally makes individual requests to the servers concurrently/sequentially.
I am using Aerospike as a cluster (running on SSD). Would it be more efficient to write a UDF (user defined function) in Lua that makes the batch request and aggregates the results on the server, instead of making multiple hits from the client?
Kindly suggest whether the default batch get of Aerospike will be efficient, or whether I have to do something else.

Batch read is the right way to do it. Results are returned in the order of the keys specified in the list; records that are not found return null. The client parallelizes the keys by node, waits (there is no callback in the client, unlike secondary index queries or scans), collects the returns from all nodes, and presents them back to the client in the original order. Make sure you have adequate memory in the client to hold all the returned batch results.
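For reference, here is a minimal sketch of such a batch read with the Aerospike Java client; the seed host, namespace, set and key values are placeholders.

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.policy.BatchPolicy;

public class BatchReadExample {
    public static void main(String[] args) {
        // Placeholder seed node; any node of the cluster works as a seed.
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        try {
            // Build the list of keys for this request (namespace/set/key are illustrative).
            Key[] keys = new Key[1000];
            for (int i = 0; i < keys.length; i++) {
                keys[i] = new Key("test", "demo", "user-" + i);
            }

            BatchPolicy policy = new BatchPolicy();

            // Records come back in the same order as the keys; a missing record is null.
            Record[] records = client.get(policy, keys);
            for (int i = 0; i < records.length; i++) {
                if (records[i] != null) {
                    System.out.println(keys[i].userKey + " -> " + records[i].bins);
                }
            }
        } finally {
            client.close();
        }
    }
}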

To UDF or Not to UDF?
First thing, you cannot do batch reads as a UDF, at least not in any way that's remotely efficient.
There are two kinds of UDFs. The first is a record UDF, which is limited to operating on a single record. The record is locked while your UDF executes, so it can read or modify that record's data, but it is sandboxed from accessing other records. The second is a stream UDF, which is read-only and runs against either a query or a full scan of a namespace or set; its purpose is to let you implement aggregations. Even if you are retrieving 1000 keys at a time, using a stream UDF just to pick a batch of keys out of a much larger set or namespace is very inefficient. That aside, UDFs will always be slower than the native operations provided by Aerospike, and this is true for any database.
Batch Reads
Read the documentation for batch operations, and specifically the section on the batch-index protocol. There is a great pair of FAQs in the community forum you should read:
FAQ - Differences between getting single record versus batch
FAQ - batch-index tuning parameters
Capacity Planning
Finally, if you are getting 2000 requests per minute at your application, and each of those turns into a batch read of 1000 keys, your cluster needs to be sized to handle 2000 * 1000 = 2 million key reads per minute (roughly 33,000 reads per second), and more as the load grows. Tuning the batch-index parameters will help, but if you don't have enough aggregate SSD capacity to support that read rate, your problem is one of capacity planning.

Related

How can I implement size-based batching instead of time-based for BoundedSource in Apache Beam / Dataflow?

Intention: I have a bounded data set (exposed via a REST API, 10k records) to be written to BigQuery with some additional steps. As my data set is bounded, I've implemented the BoundedSource interface to read records in my Apache Beam pipeline.
Problem: all 10k records are read in one shot (and written to BigQuery in one shot as well). But I want to query a small part (for example 200 rows), process it, save it to BigQuery, and then query the next 200 rows.
I've found that I can use windowing with bounded PCollections, but windows are created on a time basis (every 10 seconds, for example) and I want them to be based on a record count (every 200 records).
Question: how can I implement the described splitting into batches/windows of 200 records? Am I missing something?
The question is similar to this one, but it wasn't answered.
Given a PCollection of rows, you can use GroupIntoBatches to batch them up into a PCollection of sets of rows of a given size.
As for reading your input in an incremental way, you can use the split method of BoundedSource to shard your read into several pieces, which will then be read independently (possibly on separate workers). For a bounded pipeline, the read will still happen in its entirety (all 10k records) before anything is written, but it need not happen on a single worker. You could also insert a Reshuffle to decouple the parallelism between your read and your write.
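To illustrate, here is a rough sketch of the GroupIntoBatches approach with the Beam Java SDK; the Create.of stand-in for your BoundedSource read, the String element type and the constant key are all placeholders.

import java.util.stream.Collectors;
import java.util.stream.IntStream;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.GroupIntoBatches;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BatchingExample {
    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Stand-in for the BoundedSource read: 10k dummy records.
        PCollection<String> rows = pipeline.apply(Create.of(
                IntStream.range(0, 10_000).mapToObj(i -> "record-" + i).collect(Collectors.toList())));

        PCollection<KV<String, Iterable<String>>> batches = rows
                // GroupIntoBatches requires a keyed PCollection; a single constant key
                // is the simplest choice, at the cost of parallelism in this step.
                .apply(MapElements
                        .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
                        .via((String row) -> KV.of("all", row)))
                // Emit groups of (up to) 200 elements each.
                .apply(GroupIntoBatches.<String, String>ofSize(200));

        // Each element of 'batches' holds up to 200 rows and can be written to
        // BigQuery as one unit in a downstream transform.
        pipeline.run().waitUntilFinish();
    }
}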

Spark: sparkSession read from the result of an HTTP response

A small question regarding Spark: how can I read from the result of an HTTP response?
It is well known that Spark can take a database, a CSV file, etc. as a data source:
sparkSession.read().format("csv").load("path/to/people.csv");
sparkSession.read().format("org.apache.spark.sql.cassandra").options(properties).load();
May I ask how to read from the result of an HTTP call directly, without having to dump the data into an intermediate CSV file or an intermediate database table?
For instance, the CSV file or the database would contain millions of rows, and once read, the job needs to perform some kind of map-reduce operation.
Now the exact same data comes from the result of an HTTP call. The response is small enough for the network layer, but the information contained in the payload is big, so I would like to apply the same map-reduce.
How can I read from the response of an HTTP call?
Thank you
You have two options for reading data into Spark:
Read directly on the driver and distribute to the executors (not scalable, as everything passes through the driver)
Read directly from the executors
The built-in data sources like CSV, Parquet, etc. all implement reading from the executors so the job can scale with the data. They define how each partition of the data should be read - e.g. if we have 10 executors, how do you cut the data source into 10 sections so that each executor can directly read one section?
If you want to load from an HTTP request, you will either have to read through the driver and distribute, which may be OK if you know the data is going to be less than ~10 MB, or you will need to implement a custom data source so that each executor can read its own partition; you can read more here: https://aamargajbhiye.medium.com/speed-up-apache-spark-job-execution-using-a-custom-data-source-fd791a0fa4b0
I will finish by saying that this second option (the custom HTTP data source) is almost certainly an anti-pattern. You will likely be much better off providing an intermediate staging store (e.g. S3/GCS), calling the server to load the data into that store, and then reading it into Spark on completion. With the custom-source approach you will likely end up putting too much load on the server, among other issues.
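If the payload really is small, the driver-side route can look roughly like the sketch below; it assumes the endpoint returns newline-delimited JSON, and the URL is a placeholder.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HttpToSpark {
    public static void main(String[] args) throws Exception {
        // Fetch the payload on the driver (Java 11+ HttpClient).
        HttpClient http = HttpClient.newHttpClient();
        HttpResponse<String> response = http.send(
                HttpRequest.newBuilder(URI.create("https://example.com/api/people")).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        SparkSession spark = SparkSession.builder().appName("http-read").getOrCreate();

        // Assuming newline-delimited JSON: wrap the body in a Dataset<String> and
        // let Spark parse it; the parsed rows are then distributed to the executors
        // for the map-reduce work.
        Dataset<String> jsonLines = spark.createDataset(
                Arrays.asList(response.body().split("\n")), Encoders.STRING());
        Dataset<Row> df = spark.read().json(jsonLines);

        df.show();
        spark.stop();
    }
}

For anything larger, the staging-store approach described above remains the safer path.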
In previous lifetimes, I created a custom datasource. It is not the most trivial thing to do, but this GitHub repo explains it: https://github.com/jgperrin/net.jgp.books.spark.ch09.
When it comes to reading from a network stream, make sure that only one executor does it.

IO With Callback to set Database status

Given this method:
public void process(RequestObject request, Callback<RequestObject> callback);
where Callback is called when the request has been processed. One client has to set the status of the request in the database: the client fetches some items, passes them to the above method, and the callback sets the status.
This was working OK for a small number of items and slower IO. But now the IO has sped up and the status is written to the database very frequently. This causes my database (MySQL) to make a lot of disk read/write calls, and my disk usage goes through the roof.
I was thinking of aggregating the status updates, but power is not reliable, so that is not a plausible solution. How should I redesign this?
EDIT
When the process is started I insert a value, and when there is an update I fetch the item and update it. #user2612030 Your question led me to believe that using Hibernate might be what is causing more reads than necessary.
I can upgrade my disk drive to an SSD, but that would only do so much. I want a solution that scales.
An SSD is a good starting point, more RAM to MySQL should also help. It can't get rid of the writes, but with enough RAM (and MySQL configured to use it!) there should be few physical reads. If you are using the default configuration, tune it. See for example https://www.percona.com/blog/2016/10/12/mysql-5-7-performance-tuning-immediately-after-installation/ or just search for MySQL memory configuration.
You could also add disks and spread the writes to multiple disks with multiple controllers. That should also help a bit.
It is hard to give good advice without knowing how you record status values. Inserts or updates? How many records are there? Data model? However, to really scale you need to shard the data somehow. That way one server can handle data in one range and another server data in another range and so on.
For write-heavy applications that is non-trivial to set up with MySQL unless you do the sharding in the application code. Most solutions with replication work best for read-mostly applications. You may want to look into a NoSQL database, for example MongoDB, that has been designed for distributing writes from the outset. MongoDB has other challenges (eventual consistency), but it can deliver scalable writes.

Hash based partitioning

I want to use Spring Batch to process CSV files. Each CSV file contains one record per line. For a given file, some records may be inter-related, i.e. processing of such records MUST follow the order in which they appear in the file. Using the regular sequential approach (i.e. a single thread for the entire file) gives me bad performance, so I want to use the partitioning feature. Due to my processing requirement, inter-related records MUST be in the same partition (and in the order in which they appear in the file). I thought about using a hash-based partitioning algorithm with a carefully chosen hash function (so that near equally sized partitions are created).
Any idea if this is possible with Spring Batch?
How should the Partitioner be implemented for such a case? According to one of the Spring Batch authors/developers, the master does not send the actual data, only the information required for the slave to obtain the data it is supposed to process. In my case, I guess this information would be the hash value. Therefore, does the FlatFileItemReader of each slave need to read the entire file line by line, skipping the lines with a different hash?
Thanks,
Mickael
What you're describing is something normally seen in batch processing. You have a couple of options here:
Split the file by sequence and partition based on the created files - In this case, you'd iterate through the file once to divide it up into the lists of records that need to be processed in sequence. From there, you can use the MultiResourcePartitioner to process each file in parallel (see the sketch after this list).
Load the file into a staging table - This is the easier method, IMHO. Load the file into a staging table; from there, you can partition the processing based on any number of factors.
In either case, the result allows you to scale out the process as wide as you need to go to obtain the performance you need.
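To illustrate the first option, here is a rough sketch of the partitioner wiring, assuming the original CSV has already been split into per-group files such as /tmp/staging/split-*.csv (the paths and the key name are illustrative).

import java.io.IOException;

import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.Resource;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;

@Configuration
public class PartitionConfig {

    @Bean
    public MultiResourcePartitioner partitioner() throws IOException {
        // One partition per split file; each slave step reads its own file
        // sequentially, so the record order within a group is preserved.
        Resource[] resources = new PathMatchingResourcePatternResolver()
                .getResources("file:/tmp/staging/split-*.csv");

        MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
        partitioner.setResources(resources);
        // Each partition's ExecutionContext gets the resource URL under this key,
        // so the slave FlatFileItemReader can bind it with #{stepExecutionContext['file']}.
        partitioner.setKeyName("file");
        return partitioner;
    }
}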
FlatFileItemReader is not thread-safe, so you cannot simply use it in parallel processing.
There is more info in the docs:
Spring Batch provides some implementations of ItemWriter and ItemReader. Usually they say in the Javadocs if they are thread safe or not, or what you have to do to avoid problems in a concurrent environment. If there is no information in Javadocs, you can check the implementation to see if there is any state. If a reader is not thread safe, it may still be efficient to use it in your own synchronizing delegator. You can synchronize the call to read() and as long as the processing and writing is the most expensive part of the chunk your step may still complete much faster than in a single threaded configuration.
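A minimal sketch of such a synchronizing delegator is below; note that Spring Batch also ships SynchronizedItemStreamReader, which covers delegates that implement ItemStream.

import org.springframework.batch.item.ItemReader;

// Serializes calls to read() so a non-thread-safe delegate (e.g. FlatFileItemReader)
// can be used from a multi-threaded step; processing and writing still run in parallel.
public class SynchronizedReader<T> implements ItemReader<T> {

    private final ItemReader<T> delegate;

    public SynchronizedReader(ItemReader<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized T read() throws Exception {
        return delegate.read();
    }
}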
I think your question is more or less a duplicate of this one: multithreaded item reader

Best way to have a fast access key-value storage for huge dataset (5 GB)

There is a dataset of ~5 GB in size. This big dataset has just one key-value pair per line.
Now this needs to be read for the values of keys some billion times.
I have already tried the disk-based approach of MapDB, but it throws a ConcurrentModificationException and isn't mature enough to be used in a production environment yet.
I also don't want to have it in a DB and make the call a billion times (though a certain level of in-memory caching can be done here).
Basically, I need to access this key-value dataset in the mapper/reducer of a Hadoop job step.
So after trying out a bunch of things, we are now using SQLite.
Following is what we did:
We load all the key-value pair data into a pre-defined database file (indexed on the key column; it increased the file size but was worth it).
Store this file (key-value.db) in S3.
This file is then passed to the Hadoop jobs via the distributed cache.
In the configure/setup method of the Mapper/Reducer, a connection is opened to the db file (it takes around 50 ms), as sketched after these steps.
In the map/reduce method, we query this db with the key (it took negligible time; we didn't even need to profile it, it was so negligible!).
The connection is closed in the cleanup method of the Mapper/Reducer.
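Here is a rough sketch of the Mapper side under a few assumptions: the key-value.db file from the distributed cache is available in the task's working directory, the table is named kv with columns k and v, each input line is a key to look up, and the sqlite-jdbc driver is on the job's classpath (all of these names are illustrative).

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<Object, Text, Text, Text> {

    private Connection connection;
    private PreparedStatement lookup;

    @Override
    protected void setup(Context context) throws IOException {
        try {
            // Open the SQLite file shipped through the distributed cache (~50 ms).
            connection = DriverManager.getConnection("jdbc:sqlite:key-value.db");
            lookup = connection.prepareStatement("SELECT v FROM kv WHERE k = ?");
        } catch (SQLException e) {
            throw new IOException("Could not open key-value.db", e);
        }
    }

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            // Treat the input line as the lookup key and emit key -> value.
            lookup.setString(1, value.toString());
            try (ResultSet rs = lookup.executeQuery()) {
                if (rs.next()) {
                    context.write(value, new Text(rs.getString(1)));
                }
            }
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        try {
            if (lookup != null) lookup.close();
            if (connection != null) connection.close();
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }
}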
Try Redis. It seems this is exactly what you need.
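If you do go the Redis route, here is a minimal sketch with the Jedis client; the host, port and keys are placeholders, and the idea is to bulk-load the pairs once and then do point lookups from the mapper/reducer.

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;

public class RedisKeyValueExample {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("redis-host.example.com", 6379)) {
            // Bulk load: pipeline the writes so the key-value pairs are not
            // inserted one round-trip at a time.
            Pipeline pipeline = jedis.pipelined();
            pipeline.set("some-key", "some-value");
            pipeline.sync();

            // Lookup path used from the mapper/reducer.
            String value = jedis.get("some-key");
            System.out.println(value);
        }
    }
}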
I would try Oracle Berkeley DB Java Edition. It supports Maps and is both mature and scalable.
I noticed you tagged this with elastic-map-reduce... if you're running on AWS, maybe DynamoDB would be appropriate.
Also, I'd like to clarify: is this dataset going to be the input to your MapReduce job, or is it a supplementary dataset that will be accessed randomly during the MapReduce job?
