Cassandra data aggregation by Spark - java

I would like to use server-side data selection and filtering with the Cassandra Spark connector. We have many sensors that send values every second, and we want to aggregate these data by month, day, hour, etc.
I have proposed the following data model:
CREATE TABLE project1 (
year int,
month int,
load_balancer int,
day int,
hour int,
estimation_time timestamp,
sensor_id int,
value double,
...
PRIMARY KEY ((year, month, load_balancer), day, hour, estimation_time, sensor_id)
);
Then we wanted to aggregate the data for December 2014 with load_balancer IN (0,1,2,3), i.e. 4 different partitions.
We are using the Cassandra Spark connector version 1.1.1, and we used a combineByKey query to get the mean of all values aggregated by hour.
For 4,341,390 tuples, Spark takes 11 minutes to return the result.
The issue is that we have 5 nodes, but Spark uses only one worker to execute the task.
Could you please suggest an update to the query or data model in order to enhance the performance?

Spark Cassandra Connector has this feature; it is SPARKC-25. You can just create an arbitrary RDD with key values and then use it as a source of keys to fetch data from the Cassandra table. In other words, you join an arbitrary RDD with a Cassandra RDD. In your case, that arbitrary RDD would include 4 tuples with the different load balancer values. Look at the documentation for more info. SCC 1.2 has been released recently and it is probably compatible with Spark 1.1 (it is designed for Spark 1.2, though).
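For illustration, here is a rough Java sketch of the joinWithCassandraTable approach. The keyspace name, the PartitionKey/Reading beans, and the selected columns are placeholders, and the exact Java API overloads may differ between connector versions (the feature itself requires connector 1.2+):

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapRowTo;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.someColumns;

import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PartitionJoinSketch {

    // Hypothetical bean matching the partition key of project1 (year, month, load_balancer).
    public static class PartitionKey implements Serializable {
        private Integer year, month, loadBalancer;
        public PartitionKey() { }
        public PartitionKey(Integer year, Integer month, Integer loadBalancer) {
            this.year = year; this.month = month; this.loadBalancer = loadBalancer;
        }
        public Integer getYear() { return year; }
        public void setYear(Integer year) { this.year = year; }
        public Integer getMonth() { return month; }
        public void setMonth(Integer month) { this.month = month; }
        public Integer getLoadBalancer() { return loadBalancer; }
        public void setLoadBalancer(Integer loadBalancer) { this.loadBalancer = loadBalancer; }
    }

    // Hypothetical bean for the columns read back from the table.
    public static class Reading implements Serializable {
        private Integer hour;
        private Double value;
        public Reading() { }
        public Integer getHour() { return hour; }
        public void setHour(Integer hour) { this.hour = hour; }
        public Double getValue() { return value; }
        public void setValue(Double value) { this.value = value; }
    }

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("partition-join"));

        // One element per Cassandra partition to read: December 2014, load balancers 0-3.
        // Spark can then fetch the 4 partitions in parallel instead of through a single worker.
        JavaRDD<PartitionKey> keys = sc.parallelize(Arrays.asList(
                new PartitionKey(2014, 12, 0),
                new PartitionKey(2014, 12, 1),
                new PartitionKey(2014, 12, 2),
                new PartitionKey(2014, 12, 3)));

        // Join the key RDD against the Cassandra table (the SPARKC-25 feature).
        javaFunctions(keys)
                .joinWithCassandraTable("my_keyspace", "project1",
                        someColumns("hour", "value"),                   // columns to read back
                        someColumns("year", "month", "load_balancer"),  // join (partition key) columns
                        mapRowTo(Reading.class),
                        mapToRow(PartitionKey.class))
                .values()
                // ...then aggregate, e.g. mapToPair(r -> new Tuple2<>(r.getHour(), r.getValue()))
                //    followed by a combineByKey/reduceByKey that computes the per-hour mean.
                .count();

        sc.stop();
    }
}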

Related

Spark Java - repartitionByCassandraReplica - recommended number of partition keys and partitionsPerHost

So, I have a 16 node cluster where every node has Spark and Cassandra installed with a replication factor of 3 and spark.sql.shuffle.partitions of 96. I am also using the Spark-Cassandra Connector 3.0.0.
I have a Spark Dataset with 4 partition keys and I want to do a DirectJoin with a cassandra table.
Should I use repartitionByCassandraReplica? Is there a recommended number of partition keys for which it would make sense to use repartitionByCassandraReplica before a DirectJoin?
Is there also a recommended number for the partitionsPerHost parameter? How could I get just 4 Spark partitions in total if I have 4 partition keys, so that rows with the same partition key end up in one Spark partition?
If I do not use repartitionByCassandraReplica, I can see from the Spark UI that a DirectJoin is performed. However, if I use repartitionByCassandraReplica on the same partition keys, I do not see any DirectJoin in the DAG, just a CassandraPartitionedRDD and later a HashAggregate. It also takes about 5 times longer than without repartitionByCassandraReplica. Any idea why, and what is happening?
Does converting an RDD after repartitionByCassandraReplica to Spark Dataset, change the number or location of partitions?
How can I see if repartitionByCassandraReplica is working properly? I am using nodetool getendpoints to see where the data are stored, but other than that?
Please let me know if you need any more info. I just tried to summarize my questions from Spark-Cassandra: repartitionByCassandraReplica or converting dataset to JavaRDD and back do not maintain number of partitions?
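For reference, a rough sketch of how repartitionByCassandraReplica is typically invoked through the connector's RDD Java API. The keyspace/table names, the KeyBean class, and partitionsPerHost = 10 are placeholders, and the exact Java overloads may vary between connector versions:

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.someColumns;

import java.io.Serializable;
import org.apache.spark.api.java.JavaRDD;

public class RepartitionSketch {

    // Hypothetical bean holding only the partition key column(s) of the Cassandra table.
    public static class KeyBean implements Serializable {
        private Integer id;
        public KeyBean() { }
        public KeyBean(Integer id) { this.id = id; }
        public Integer getId() { return id; }
        public void setId(Integer id) { this.id = id; }
    }

    public static JavaRDD<KeyBean> localize(JavaRDD<KeyBean> keys) {
        // Re-groups the key RDD so that each Spark partition holds only keys whose data is
        // replicated on the host that partition will be scheduled on; a following
        // joinWithCassandraTable on the result can then read locally.
        return javaFunctions(keys).repartitionByCassandraReplica(
                "my_keyspace",
                "my_table",
                10,                        // partitionsPerHost: partitions created per Cassandra host
                someColumns("id"),         // the partition key column(s)
                mapToRow(KeyBean.class));
    }
}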

KStreams Determining which input record timestamp metadata is persisted on joins

Hopefully someone knows this or can point me in the right direction...
I have a data topic that is created through API REST Requests. One of the fields received in the REST Requests is a timestamp for the record EventTime. These records are produced to Kafka and the EventTime is set as the Record's metadata timestamp.
I have another rules topic that provides information that augments the data topics records by adding new fields to the received value.
Both of these topics have matching keys for joining.
My goal is to preserve the EventTime from the data topic throughout all processing stages using the processor API. Note there will be multiple different KStreams applications that process/augment this data in multiple ways/steps.
The good news is that I have seen many things indicating that input record timestamps are preserved when using Kafka Streams.
Such as:
https://kafka.apache.org/documentation/streams/core-concepts#streams_time
input record timestamp and output record timestamp is same across both source and sink topics?
And have been reading on Timestamp extractors as well:
https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-HowtowriteacustomTimestampExtractor
And more on joining:
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Join+Semantics
https://kafka.apache.org/documentation/streams/developer-guide/dsl-api.html#kstream-globalktable-join
Throughout much of the Streams documentation I see it mention that "the timestamp for the input record will persist to the output record" but I am unclear how this works exactly when it comes to joins.
My confusion seems to be that when we join we have 2 different input records and are producing a single output record.
How is it determined which timestamp is persisted between the multiple input records used in the join?
I have been discussing it with coworkers and there have been several views, such as the following:
The earliest non-negative timestamp of the joined input records is persisted.
The left input record's timestamp is persisted, e.g. leftStream.join(rightStream, ...);
The timestamp of the input record which triggered the join (left or right) is persisted.
It's non-deterministic, so wall-clock time is used unless a timestamp extractor is specified for the producer.
Some of these have better arguments than others, but I need to know what is actually going on...
Any help or suggestions on where to look are appreciated.
Currently (i.e., as of the Kafka 2.0 release) there is no public contract for which timestamp will be used, and the implementation is allowed to use any strategy. The current implementation uses the timestamp of the record that triggers the join computation.
As a workaround, you can manipulate the timestamp by adding a .transform() after the join. Compare https://cwiki.apache.org/confluence/display/KAFKA/KIP-251%3A+Allow+timestamp+manipulation+in+Processor+API
I.e., you need to embed the original timestamp into the value payload before the join, then extract it after the join and set it as the metadata timestamp.
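To make the workaround concrete, here is a rough sketch. It assumes a stream-table join with String values on Kafka 2.0+; the topic names, the Stamped wrapper class, and the omitted serde/properties configuration are placeholders, not part of the question:

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.kstream.TransformerSupplier;
import org.apache.kafka.streams.kstream.ValueTransformerWithKey;
import org.apache.kafka.streams.kstream.ValueTransformerWithKeySupplier;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.To;

public class PreserveEventTimeSketch {

    // Hypothetical wrapper used to carry the original event time through the join.
    static class Stamped {
        final String value;
        final long eventTime;
        Stamped(String value, long eventTime) { this.value = value; this.eventTime = eventTime; }
    }

    public static void buildTopology(StreamsBuilder builder) {
        KStream<String, String> data = builder.stream("data-topic");
        KTable<String, String> rules = builder.table("rules-topic");

        // 1) Before the join: stash the input record's (event-time) timestamp inside the value.
        ValueTransformerWithKeySupplier<String, String, Stamped> stamp = () ->
                new ValueTransformerWithKey<String, String, Stamped>() {
                    private ProcessorContext context;
                    @Override public void init(ProcessorContext context) { this.context = context; }
                    @Override public Stamped transform(String key, String value) {
                        return new Stamped(value, context.timestamp());
                    }
                    @Override public void close() { }
                };
        KStream<String, Stamped> stamped = data.transformValues(stamp);

        // 2) Join as usual; the output record's timestamp is not part of the public contract.
        KStream<String, Stamped> joined = stamped.join(rules,
                (Stamped s, String rule) -> new Stamped(s.value + "|" + rule, s.eventTime));

        // 3) After the join: forward with the stashed timestamp (KIP-251, Kafka 2.0+).
        TransformerSupplier<String, Stamped, KeyValue<String, String>> restore = () ->
                new Transformer<String, Stamped, KeyValue<String, String>>() {
                    private ProcessorContext context;
                    @Override public void init(ProcessorContext context) { this.context = context; }
                    @Override public KeyValue<String, String> transform(String key, Stamped value) {
                        context.forward(key, value.value, To.all().withTimestamp(value.eventTime));
                        return null; // already forwarded with the desired timestamp
                    }
                    @Override public void close() { }
                };
        joined.transform(restore).to("output-topic");
    }
}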

How to compare Hive and Cassandra data in Java when there are around 1 million records

I am using Hive and Cassandra; the table structure and data are the same in both. There will be almost 1 million records. My requirement is to check that each and every row has the same data in both Cassandra and Hive.
Can I compare two ResultSet objects directly? (one ResultSet with the Cassandra data and another from Hive)
If we are iterating over a ResultSet object, can the ResultSet hold 1 million records at a time? Will there be any performance issues?
What do we need to take care of when dealing with such a large amount of data?
Well, some of the initial conditions seem strange to me.
First, 1M records is not a big deal for a modern RDBMS, especially when real-time query responses are not required.
Second, there is the fact that the Hive and Cassandra table structures are the same. Cassandra's paradigm is query-first modeling, and it suits different scenarios than Hive.
However, for your question:
1. Yes. You can write a Java program (I saw Java in the tag list) that connects to both Hive and Cassandra via JDBC and compares ResultSet items one by one; see the sketch after this list. But you need to be sure that the order of items is the same for Hive and Cassandra. That ordering is best enforced via the Hive query, as there are not many ways to control ordering in Cassandra.
2. A ResultSet is just a cursor. It doesn't gather the whole data set in memory, just a batch of records (the batch size is configurable).
3. 1M records is not huge data; billions of records would be. But I cannot give you a silver bullet for every question about dealing with huge data, as each case is specific.
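For illustration, a rough sketch of option 1: a cursor-style, row-by-row comparison over JDBC. The driver URLs, the table t, and the columns are placeholders, and it assumes both queries return rows in the same key order (which, as noted above, is the tricky part on the Cassandra side):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Objects;

public class RowByRowCompare {

    public static void main(String[] args) throws Exception {
        // Hypothetical JDBC URLs; both queries must return rows in the same key order.
        try (Connection hive = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default");
             Connection cass = DriverManager.getConnection("jdbc:cassandra://cassandra-host:9042/my_ks")) {

            Statement hiveStmt = hive.createStatement();
            Statement cassStmt = cass.createStatement();
            hiveStmt.setFetchSize(1000);   // stream in batches instead of holding 1M rows in memory
            cassStmt.setFetchSize(1000);

            try (ResultSet h = hiveStmt.executeQuery("SELECT id, col1, col2 FROM t ORDER BY id");
                 ResultSet c = cassStmt.executeQuery("SELECT id, col1, col2 FROM t")) {

                long mismatches = 0;
                boolean hasHive = h.next(), hasCass = c.next();
                while (hasHive && hasCass) {
                    if (!Objects.equals(h.getString("id"), c.getString("id"))
                            || !Objects.equals(h.getString("col1"), c.getString("col1"))
                            || !Objects.equals(h.getString("col2"), c.getString("col2"))) {
                        mismatches++;
                        System.out.println("Mismatch at id " + h.getString("id"));
                    }
                    hasHive = h.next();
                    hasCass = c.next();
                }
                if (hasHive || hasCass) {
                    System.out.println("Row counts differ between Hive and Cassandra");
                }
                System.out.println("Done, mismatches: " + mismatches);
            }
        }
    }
}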
Anyway, for your case, I have some concerns:
I don't know the details of the latest Cassandra JDBC drivers' features and limitations.
You have not provided details of the table structure or of future data growth and complexity. You may have 1M rows with 10 columns in a single database now, but later you could have 100M rows in a cluster of 10 Cassandra nodes.
If that's not a problem, then you can try your solution. Otherwise, for simplicity of comparison, I'd suggest doing the following:
1. Export Cassandra's data to Hive.
2. Compare data in two Hive tables.
I believe that would be straightforward and more robust.
But all of the above doesn't address the choice of tools (Hive and Cassandra) for your task. You can find more about typical Cassandra use cases here to be sure you've made the right choice.

Java Recon Job - Fastest and generic solution

Currently, I am running my application on RDS (Oracle) and am in the process of moving to MongoDB. For now, we have a sync job which syncs data from Oracle to Mongo whenever a row gets added, modified, or deleted.
Writes happen only on Oracle.
I am planning to build a recon job which compares the source and target data. I am aiming for a full recon, which fetches all the data from Oracle and then compares it with MongoDB to find the discrepancies.
I am planning the approach below; a rough sketch of these steps follows after the list.
Note: the Oracle DB size could be in terabytes.
1) Get the first thousand rows from Oracle table A (a simple JDBC ResultSet approach).
2) For each entry, create a map of key/values (Map).
3) Get the corresponding data from MongoDB and convert it to the Oracle format.
4) For each entry, create a map of key/values.
5) Compare the two maps to check whether they are the same (Oracle map equals MongoDB map).
6) Repeat the same for the next rows...
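For illustration, a rough sketch of steps 1-5 in plain Java. The table_a name and its id/col1/col2 columns, the connection strings, and the MongoDB database/collection layout are all placeholders, not from my actual schema:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchReconSketch {

    public static void main(String[] args) throws Exception {
        try (Connection oracle = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//oracle-host:1521/ORCL", "user", "password");
             MongoClient mongo = MongoClients.create("mongodb://mongo-host:27017")) {

            MongoCollection<Document> coll = mongo.getDatabase("recon").getCollection("table_a");

            PreparedStatement ps = oracle.prepareStatement("SELECT id, col1, col2 FROM table_a");
            ps.setFetchSize(1000);  // stream the Oracle rows in batches
            ResultSet rs = ps.executeQuery();

            Map<String, Map<String, Object>> oracleBatch = new HashMap<>();
            while (rs.next()) {
                Map<String, Object> row = new HashMap<>();
                row.put("col1", rs.getString("col1"));
                row.put("col2", rs.getString("col2"));
                oracleBatch.put(rs.getString("id"), row);

                if (oracleBatch.size() == 1000) {       // steps 1-2: a batch of key -> value maps
                    compareBatch(coll, oracleBatch);    // steps 3-5
                    oracleBatch.clear();                // step 6: continue with the next rows
                }
            }
            if (!oracleBatch.isEmpty()) {
                compareBatch(coll, oracleBatch);
            }
        }
    }

    // Fetch all Mongo documents for the batch with a single $in query and compare map to map.
    static void compareBatch(MongoCollection<Document> coll, Map<String, Map<String, Object>> oracleBatch) {
        List<String> keys = new ArrayList<>(oracleBatch.keySet());
        for (Document doc : coll.find(Filters.in("_id", keys))) {
            Map<String, Object> mongoRow = new HashMap<>();
            mongoRow.put("col1", doc.getString("col1"));
            mongoRow.put("col2", doc.getString("col2"));

            Map<String, Object> oracleRow = oracleBatch.remove(doc.getString("_id"));
            if (oracleRow == null || !oracleRow.equals(mongoRow)) {
                System.out.println("Mismatch for key " + doc.getString("_id"));
            }
        }
        // Anything left over in this batch has no counterpart in MongoDB.
        oracleBatch.keySet().forEach(k -> System.out.println("Missing in MongoDB: " + k));
    }
}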
But this approach is taking a lot of time, even when I use multithreading. I do not have much knowledge of Big Data, but I am open to new ideas.
Is there any other way or technology which can be used here for parallel processing?
Note: some tables map straightforwardly between Oracle and Mongo, while a few tables are in denormalized form in Mongo.
Thanks,

Alternative Approach for Counter In Cassandra

Is there any other way to implement counters in Cassandra ?
I have a following table structure
CREATE TABLE userlog (
term text,
ts timestamp,
year int,
month int,
day int,
hour int,
weekofyear int,
dayofyear int,
count counter,
PRIMARY KEY (term, ts, year,month,day,hour,weekofyear,dayofyear)
);
But because of the counter I need to put all the other columns in the primary key, which is creating problems for my application.
So, is there any other way I can avoid doing this (preferably using Java)?
You can avoid counters in Cassandra altogether by using an analytics engine such as Spark. The idea is to only store events in Cassandra and either periodically trigger Spark or continuously run Spark as a background job that would read the events and create aggregates such as counts. Those aggregate results can be written back into Cassandra again into a separate table (e.g. userlog_by_month, userlog_by_week,..).
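A minimal sketch of that idea with the Spark Cassandra Connector's Java API. The keyspace, the userlog_by_hour aggregate table (term text, hour int, count bigint, with a plain bigint instead of a counter), and the HourlyCount bean are assumptions for illustration:

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

import java.io.Serializable;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import com.datastax.spark.connector.japi.CassandraRow;

import scala.Tuple2;

public class HourlyCountJob {

    // Hypothetical bean for the aggregate table userlog_by_hour.
    public static class HourlyCount implements Serializable {
        private String term;
        private Integer hour;
        private Long count;
        public HourlyCount() { }
        public HourlyCount(String term, Integer hour, Long count) {
            this.term = term; this.hour = hour; this.count = count;
        }
        public String getTerm() { return term; }
        public void setTerm(String term) { this.term = term; }
        public Integer getHour() { return hour; }
        public void setHour(Integer hour) { this.hour = hour; }
        public Long getCount() { return count; }
        public void setCount(Long count) { this.count = count; }
    }

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("userlog-aggregates"));

        // Read the raw events and count them per (term, hour).
        JavaRDD<CassandraRow> events =
                javaFunctions(sc).cassandraTable("my_keyspace", "userlog");

        JavaPairRDD<Tuple2<String, Integer>, Long> counts = events
                .mapToPair(row -> new Tuple2<>(
                        new Tuple2<>(row.getString("term"), row.getInt("hour")), 1L))
                .reduceByKey(Long::sum);

        // Write the aggregates back into a separate, plain (non-counter) table.
        JavaRDD<HourlyCount> aggregates = counts.map(
                t -> new HourlyCount(t._1()._1(), t._1()._2(), t._2()));

        javaFunctions(aggregates)
                .writerBuilder("my_keyspace", "userlog_by_hour", mapToRow(HourlyCount.class))
                .saveToCassandra();

        sc.stop();
    }
}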
Usually you would put the counter column in a separate table from the data table. In that way you can use whatever key you find convenient to access the counters.
The downside is you need to update two tables rather than just one, but this is unavoidable due to the way counters are implemented.
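And a small sketch of the two-table pattern with the DataStax Java driver (3.x API). The keyspace, the table names, and the bucketing columns of the counter table are made up for illustration:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

import java.util.Date;

public class CounterWriteSketch {

    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {

            // Data table keeps only the key you actually query by, for example:
            //   CREATE TABLE userlog (term text, ts timestamp, year int, month int, day int,
            //                         hour int, PRIMARY KEY (term, ts));
            session.execute(
                "INSERT INTO userlog (term, ts, year, month, day, hour) VALUES (?, ?, ?, ?, ?, ?)",
                "foo", new Date(), 2015, 6, 1, 12);

            // Separate counter table, keyed however is convenient for the counts you need:
            //   CREATE TABLE userlog_counts_by_hour (term text, year int, month int, day int,
            //                         hour int, count counter,
            //                         PRIMARY KEY (term, year, month, day, hour));
            session.execute(
                "UPDATE userlog_counts_by_hour SET count = count + 1 "
                    + "WHERE term = ? AND year = ? AND month = ? AND day = ? AND hour = ?",
                "foo", 2015, 6, 1, 12);
        }
    }
}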
