Is there a way to live stream data using spring-data-cassandra? Basically, I want to send data to client whenever there is a new addition to the database.
This is what I'm trying to do:
@GetMapping(path = "mapping", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public Flux<Mapping> getMapping() {
    Flux<Mapping> flux = reactiveMappingByExternalRepository.findAll();
    Flux<Long> durationFlux = Flux.interval(Duration.ofSeconds(1));
    // zip with the interval so at most one row is emitted per second
    return Flux.zip(flux, durationFlux).map(Tuple2::getT1);
}
But it only returns the existing rows; once that stream completes, no new additions are pushed to the client.
The short answer is no, there's no live streaming of real-time changes through the Cassandra driver. Although Cassandra has CDC (Change Data Capture), it's quite low-level and you need to consume the commit logs on the server. See Listen to a cassandra database with datastax for further details.
I've implemented a Kafka Connect JDBC Source connector that is connected to an Oracle database and writes data to a Kafka topic. Currently I've set value.converter=org.apache.kafka.connect.json.JsonConverter together with value.converter.schemas.enable=false. This makes it possible to write JSON data to the Kafka topic (which works fine, by the way), but offers no way to modify the data before it is sent to the Kafka broker.
My question now is: is there a way to modify the data that is being sent to the Kafka topic? In my case, the source connector runs a custom query and writes the result directly to the Kafka topic. However, I want to extend this JSON with some custom columns and nesting. Is there a way to do so?
Please don't use JsonConverter & schemas.enable=false :-) Your data in Oracle has such a wonderful schema, it is a shame to throw it away! In all seriousness, using something like Avro, Protobuf, or JSON Schema keeps your message sizes small in the Kafka topic whilst retaining the schema.
See articles like this one for more details on this important concept.
Single Message Transforms (SMT) are probably what you're looking for to transform the data en route to Kafka. For example, you can insert fields, flatten payloads, and lots more. If there isn't an existing SMT to do what you want, you can write your own using the Java API.
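If you do go the custom route, it boils down to implementing Kafka Connect's Transformation interface. Here is a rough sketch, not a finished implementation: the class name AddTypeField is illustrative, it assumes the record value is a Struct with a schema (as JDBC source records are), and it simply adds the "type"/"MyCustomType" column from the example below.

import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.data.Field;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.transforms.Transformation;

// Illustrative custom SMT: copies the original value and adds a "type" column.
public class AddTypeField<R extends ConnectRecord<R>> implements Transformation<R> {

    @Override
    public R apply(R record) {
        Struct value = (Struct) record.value(); // JDBC source records carry a Struct plus its schema

        // Rebuild the value schema with one extra field (could be cached per input schema).
        SchemaBuilder builder = SchemaBuilder.struct().name(value.schema().name());
        for (Field field : value.schema().fields()) {
            builder.field(field.name(), field.schema());
        }
        builder.field("type", Schema.STRING_SCHEMA);
        Schema newSchema = builder.build();

        // Copy the original fields and set the new one.
        Struct newValue = new Struct(newSchema);
        for (Field field : value.schema().fields()) {
            newValue.put(field.name(), value.get(field));
        }
        newValue.put("type", "MyCustomType");

        return record.newRecord(record.topic(), record.kafkaPartition(),
                record.keySchema(), record.key(), newSchema, newValue, record.timestamp());
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef(); // no configuration options in this sketch
    }

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public void close() { }
}

You would package this into a jar on the Connect worker's plugin.path and reference it with transforms=AddType and transforms.AddType.type=<your.package>.AddTypeField.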
You can also use Kafka Streams or ksqlDB to do stream processing on the data once it's in Kafka if you want to do more complex work like joining, aggregating, etc.
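To give a flavour of the Kafka Streams route, here is a minimal sketch that reads the connector's output topic, transforms each value and writes to a new topic. The application id, broker address and topic names are placeholders, and the toUpperCase call just stands in for whatever enrichment or nesting you actually need.

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class EnrichmentStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "oracle-topic-enricher"); // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");        // placeholder

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("jdbc-source-topic", Consumed.with(Serdes.String(), Serdes.String()))
               // replace this with the real transformation of the JSON value
               .mapValues(value -> value.toUpperCase())
               .to("enriched-topic", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }
}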
An example configuration that chains a few SMTs to rename fields, drop a field, and add a static field:
Process:
DB table -(connector infers the schema of the fields)-> Connect fields (internal Connect data structure, ConnectRecord(s)) -> SMT1 -> SMT2 -> ... -> last SMT -> JsonConverter -> output JSON message.
DB Table:
current_name1 | current_name2 | FieldToDrop
bla1          | bla2          | bla3
Input Connect fields inferred (together they form a single ConnectRecord):
"current_name1" = "bla1"
"current_name2" = "bla2"
"FieldToDrop"   = "bla3"
Output JSON for the value:
{
"new_name1": "bla1",
"new_name2": "bla2",
"type": "MyCustomType"
}
Connector configuration:
name=example-connector
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
value.converter=org.apache.kafka.connect.json.JsonConverter
...
transforms=RenameFields,InsertFieldType,DropFields
transforms.RenameFields.type=org.apache.kafka.connect.transforms.ReplaceField$Value
transforms.RenameFields.renames=current_name1:new_name1,current_name2:new_name2
transforms.InsertFieldType.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.InsertFieldType.static.field=type
transforms.InsertFieldType.static.value=MyCustomType
transforms.DropFields.type=org.apache.kafka.connect.transforms.ReplaceField$Value
transforms.DropFields.blacklist=FieldToDrop
I have a situation where I am trying to stream from Kafka using Spark Streaming. The stream is a direct stream. I am able to create a stream and start streaming, and I also get any updates (if any) on Kafka via the stream.
The issue comes in when I have a new request to stream a new topic. Since there can be only one StreamingContext per JVM, I cannot create a new stream for every new request.
The approaches I figured out are:
First, once a DStream is created and Spark streaming is already in progress, just attach a new stream to it. This does not seem to work; the createDirectStream call (for a new topic2) does not return a stream and further processing stops. The streaming keeps running on the first request (say topic1).
Second, stop the stream, create the DStream and then start streaming again. I cannot use the same streaming context (it throws an exception that jobs cannot be added after streaming has been stopped), and if I create a new stream for the new topic (topic2), the old stream's topic (topic1) is lost and it streams only the new one.
Here is the code, have a look
JavaStreamingContext javaStreamingContext; // in practice this is a field that survives across requests
if (null == javaStreamingContext) {
    javaStreamingContext = new JavaStreamingContext(sparkContext, Durations.seconds(duration));
} else {
    StreamingContextState streamingContextState = javaStreamingContext.getState();
    if (streamingContextState == StreamingContextState.STOPPED) {
        javaStreamingContext = new JavaStreamingContext(sparkContext, Durations.seconds(duration));
    }
}
Collection<String> topics = Arrays.asList(getTopicName(schemaName));
SparkVoidFunctionImpl impl = new SparkVoidFunctionImpl(getSparkSession());
KafkaUtils.createDirectStream(javaStreamingContext,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, String>Subscribe(topics, getKafkaParamMap()))
.map((stringStringConsumerRecord) -> stringStringConsumerRecord.value())
.foreachRDD(impl);
if (javaStreamingContext.getState() == StreamingContextState.ACTIVE) {
javaStreamingContext.start();
javaStreamingContext.awaitTermination();
}
Don't worry about SparkVoidFunctionImpl, it is a custom class which is an implementation of VoidFunction.
The above is approach 1, where I do not stop the existing streaming. When a new request comes into this method, it does not get a new streaming object; it tries to create a DStream. The issue is that the DStream object is never returned.
KafkaUtils.createDirectStream(javaStreamingContext,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, String>Subscribe(topics, getKafkaParamMap()))
This does not return a DStream; control just terminates without an error, and the further steps are not executed.
I have tried many things and read multiple articles, but I believe this is a very common production-level issue. Any streaming done is to be done on multiple different topics, and each of them is handled differently.
Please help
The thing is, the Spark master sends out code to the workers, and although the data is streaming, the underlying code and variable values remain static unless the job is restarted.
A few options I can think of:
Spark Job Server: Every time you want to subscribe/stream from a different topic, start a new job instead of touching the already running one. From your API body you can supply the parameters or the topic name. If you want to stop streaming from a specific topic, just stop the respective job. It will give you a lot of flexibility and control over resources.
[Theoretical] Topic filter: Subscribe to all the topics you think you will want; when records are pulled for a batch duration, filter them based on a LIST of allowed topics. Manipulate this list of topics through your API to increase or decrease your scope of topics; it could be a broadcast variable as well. This is just an idea, I have not tried this option at all (a rough sketch of it follows after these options).
Another workaround is to relay your Topic-2 data to Topic-1 using a microservice whenever you need it, and stop it when you don't.
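A rough sketch of the topic-filter idea from option 2, reusing the helper names from the question's code. Here allTopics is assumed to be the superset of topics you subscribe to, allowedTopics the current scope, and javaSparkContext the JavaSparkContext behind the streaming context; note a broadcast variable is read-only once shipped, so changing the scope means re-broadcasting.

// Subscribe to a superset of topics once, then filter each record by its topic name.
Broadcast<Set<String>> allowedTopics =
        javaSparkContext.broadcast(new HashSet<>(Arrays.asList("topic1", "topic2")));

KafkaUtils.createDirectStream(javaStreamingContext,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(allTopics, getKafkaParamMap()))
    .filter(record -> allowedTopics.value().contains(record.topic()))
    .map(record -> record.value())
    .foreachRDD(impl);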
I have a Spark Streaming app written in Java, using Spark 2.1. I am using KafkaUtils.createDirectStream to read messages from Kafka. I am using a kryo encoder/decoder for the Kafka messages, specified in the Kafka properties key.deserializer, value.deserializer, key.serializer and value.serializer.
When Spark pulls the messages in a micro-batch, the messages are successfully decoded using the kryo decoder. However, I noticed that the Spark executor creates a new instance of the kryo decoder for decoding each message read from Kafka. I checked this by putting logs inside the decoder constructor.
This seems weird to me. Shouldn't the same instance of the decoder be used for each message and each batch?
Code where I am reading from kafka:
JavaInputDStream<ConsumerRecord<String, Class1>> consumerRecords = KafkaUtils.createDirectStream(
jssc,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, Class1>Subscribe(topics, kafkaParams));
JavaPairDStream<String, Class1> converted = consumerRecords.mapToPair(consRecord -> {
return new Tuple2<String, Class1>(consRecord.key(), consRecord.value());
});
If we want to see how Spark fetches data from Kafka internally, we'll need to look at KafkaRDD.compute, a method implemented for every RDD that tells the framework how to, well, compute that RDD:
override def compute(thePart: Partition, context: TaskContext): Iterator[R] = {
val part = thePart.asInstanceOf[KafkaRDDPartition]
assert(part.fromOffset <= part.untilOffset, errBeginAfterEnd(part))
if (part.fromOffset == part.untilOffset) {
logInfo(s"Beginning offset ${part.fromOffset} is the same as ending offset " +
s"skipping ${part.topic} ${part.partition}")
Iterator.empty
} else {
new KafkaRDDIterator(part, context)
}
}
What's important here is the else clause, which creates a KafkaRDDIterator. This internally has:
val keyDecoder = classTag[U].runtimeClass.getConstructor(classOf[VerifiableProperties])
.newInstance(kc.config.props)
.asInstanceOf[Decoder[K]]
val valueDecoder = classTag[T].runtimeClass.getConstructor(classOf[VerifiableProperties])
.newInstance(kc.config.props)
.asInstanceOf[Decoder[V]]
As you can see, this creates an instance of both the key decoder and the value decoder via reflection, for each underlying partition. This means a decoder isn't being created per message but per Kafka partition.
Why is it implemented this way? I don't know. I'm assuming a key and value decoder should have a negligible performance hit compared to all the other allocations happening inside Spark.
If you've profiled your app and found this to be an allocation hot-path, you could open an issue. Otherwise, I wouldn't worry about it.
I post data like this:
Settings settings = Settings.settingsBuilder()
.put("cluster.name", "cluster-name")
.build();
client = TransportClient.builder()
.settings(settings)
.build();
client.addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("my.elastic.server"), 9300));
IndexResponse response = client
.prepareIndex("myindex", "info")
.setSource(data) //here data is stored in a Map
.get();
But the data could be about 2 MB or more, and I care about how quickly it is posted to Elasticsearch. What is the best way to limit that time? Is there an Elasticsearch Java API feature for this, should I run the posting in a separate thread, or maybe something else? Thanks
You could use Spring Data Elasticsearch and Spring Batch to create an indexing batch job. This way you can break the data up into smaller chunks, for more frequent but smaller writes to your index.
If your job is big enough (millions of records), you can use a multi-threaded batch job and significantly reduce the time it takes to build your index. This may be overkill for a smaller index, though.
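A minimal sketch of that chunking idea with Spring Data Elasticsearch. The Info entity (your @Document-annotated class, not shown), the repository name and the chunk size are all illustrative, and older Spring Data versions use save(Iterable) instead of saveAll.

import java.util.List;

import org.springframework.data.elasticsearch.repository.ElasticsearchRepository;

// Repository for the mapped document type; Spring Data generates the implementation.
interface InfoRepository extends ElasticsearchRepository<Info, String> { }

class ChunkedIndexer {
    private final InfoRepository repository;

    ChunkedIndexer(InfoRepository repository) {
        this.repository = repository;
    }

    void indexInChunks(List<Info> documents) {
        int chunkSize = 500; // tune to your document size and cluster capacity
        for (int i = 0; i < documents.size(); i += chunkSize) {
            List<Info> chunk = documents.subList(i, Math.min(i + chunkSize, documents.size()));
            repository.saveAll(chunk); // each call becomes one smaller batched write
        }
    }
}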
I am trying to build a simple application that reads data from AWS Kinesis. I have managed to read data using a single shard, but I want to get data from 4 different shards.
The problem is that I have a while loop which iterates as long as the shard is active, which prevents me from reading data from the other shards. So far I couldn't find an alternative algorithm, nor have I been able to implement a KCL-based solution.
Many thanks in advance
public static void DoSomething() {
AmazonKinesisClient client = new AmazonKinesisClient();
//noinspection deprecation
client.setEndpoint(endpoint, serviceName, regionId);
/** get shards from the stream using describe stream method*/
DescribeStreamRequest describeStreamRequest = new DescribeStreamRequest();
describeStreamRequest.setStreamName(streamName);
List<Shard> shards = new ArrayList<>();
String exclusiveStartShardId = null;
do {
describeStreamRequest.setExclusiveStartShardId(exclusiveStartShardId);
DescribeStreamResult describeStreamResult = client.describeStream(describeStreamRequest);
shards.addAll(describeStreamResult.getStreamDescription().getShards());
if (describeStreamResult.getStreamDescription().getHasMoreShards() && shards.size() > 0) {
exclusiveStartShardId = shards.get(shards.size() - 1).getShardId();
} else {
exclusiveStartShardId = null;
}
} while (exclusiveStartShardId != null);
/** shards obtained */
String shardIterator;
GetShardIteratorRequest getShardIteratorRequest = new GetShardIteratorRequest();
getShardIteratorRequest.setStreamName(streamName);
getShardIteratorRequest.setShardId(shards.get(0).getShardId());
getShardIteratorRequest.setShardIteratorType("LATEST");
GetShardIteratorResult getShardIteratorResult = client.getShardIterator(getShardIteratorRequest);
shardIterator = getShardIteratorResult.getShardIterator();
GetRecordsRequest getRecordsRequest = new GetRecordsRequest();
while (shardIterator != null) {
getRecordsRequest.setShardIterator(shardIterator);
getRecordsRequest.setLimit(250);
GetRecordsResult getRecordsResult = client.getRecords(getRecordsRequest);
List<Record> records = getRecordsResult.getRecords();
shardIterator = getRecordsResult.getNextShardIterator();
if(records.size()!=0) {
for(Record r : records) {
System.out.println(r.getPartitionKey());
}
}
try {
Thread.sleep(1000);
} catch (InterruptedException e) {
    Thread.currentThread().interrupt(); // restore the interrupt flag rather than swallowing it
}
}
}
It is recommended that you do not read from multiple shards in a single process/worker. First, as you can see, it adds to the complexity of your code, but more importantly, you will have problems scaling up.
The "secret" of scalability is to have small and independent workers or other such units. You can see such a design in Hadoop, DynamoDB or Kinesis in AWS. It allows you to build small systems (micro-services) that can easily scale up and down as needed. You can easily add more units of work/data as your service becomes more successful, or absorb other fluctuations in its usage.
As you can see in these AWS services, you sometimes get this scalability automatically, as in DynamoDB, and sometimes you need to add shards to your Kinesis streams. Either way, your application needs some way to control its scalability.
In the case of Kinesis, you can scale up and down using AWS Lambda or the Kinesis Client Library (KCL). Both of them listen to the status of your streams (number of shards and events) and use it to add or remove workers and deliver events to them for processing. With both of these solutions you should build a worker that works against a single shard.
If you need to align events from multiple shards, you can do that using some state service such as Redis or DynamoDB.
For a simpler and neater solution where you only have to worry about providing your own message processing code, I would recommend using the KCL Library.
Quoting from the documentation
The KCL acts as an intermediary between your record processing logic and Kinesis Data Streams. The KCL performs the following tasks:
Connects to the data stream
Enumerates the shards within the data stream
Uses leases to coordinate shard associations with its workers
Instantiates a record processor for every shard it manages
Pulls data records from the data stream
Pushes the records to the corresponding record processor
Checkpoints processed records
Balances shard-worker associations (leases) when the worker instance count changes or when the data stream is resharded (shards are split or merged)
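To make that concrete, with the KCL (1.x) you mostly just implement a record processor; the library instantiates one per shard and handles leases, rebalancing and delivery. A rough sketch, with an illustrative class name and the checkpointing/error handling trimmed:

import com.amazonaws.services.kinesis.clientlibrary.interfaces.v2.IRecordProcessor;
import com.amazonaws.services.kinesis.clientlibrary.types.InitializationInput;
import com.amazonaws.services.kinesis.clientlibrary.types.ProcessRecordsInput;
import com.amazonaws.services.kinesis.clientlibrary.types.ShutdownInput;
import com.amazonaws.services.kinesis.model.Record;

// The KCL creates one instance of this processor for every shard it manages.
public class SimpleRecordProcessor implements IRecordProcessor {

    @Override
    public void initialize(InitializationInput initializationInput) {
        System.out.println("Starting to process shard " + initializationInput.getShardId());
    }

    @Override
    public void processRecords(ProcessRecordsInput processRecordsInput) {
        for (Record record : processRecordsInput.getRecords()) {
            System.out.println(record.getPartitionKey());
        }
        try {
            processRecordsInput.getCheckpointer().checkpoint(); // in real code, checkpoint less aggressively
        } catch (Exception e) {
            // handle InvalidStateException / ThrottlingException / ShutdownException properly in real code
        }
    }

    @Override
    public void shutdown(ShutdownInput shutdownInput) {
        // checkpoint here when the shutdown reason is TERMINATE (the shard has ended)
    }
}

You then hand a factory for this processor, together with a KinesisClientLibConfiguration naming your stream and application, to a Worker and run it.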