I'm attempting to send data via an actor to a runnable graph that contains a fan-out.
I define the source as:
final Source<Integer, ActorRef> integerSource =
Source.actorRef(
elem -> {
if (elem == Done.done()) return Optional.of(CompletionStrategy.immediately());
else return Optional.empty();
},
elem -> Optional.empty(),
10,
OverflowStrategy.dropHead());
But I'm unsure how to get a handle on an ActorRef so I can send data via an actor to the source, so that the runnable graph processes messages asynchronously as they are received:
RunnableGraph<CompletionStage<Done>> graph = RunnableGraph.fromGraph(
GraphDSL.create(sink, (builder, out) -> {
SourceShape<Integer> sourceShape = builder.add(integerSource);
FlowShape<Integer, Integer> flow1Shape = builder.add(flow1);
FlowShape<Integer, Integer> flow2Shape = builder.add(flow1);
UniformFanOutShape<Integer, Integer> broadcast =
builder.add(Broadcast.create(2));
UniformFanInShape<Integer, Integer> merge =
builder.add(Merge.create(2));
builder.from(sourceShape)
.viaFanOut(broadcast)
.via(flow1Shape);
builder.from(broadcast).via(flow2Shape);
builder.from(flow1Shape)
.viaFanIn(merge)
.to(out);
builder.from(flow2Shape).viaFanIn(merge);
return ClosedShape.getInstance();
} )
);
Entire src:
import akka.Done;
import akka.NotUsed;
import akka.actor.ActorRef;
import akka.actor.typed.ActorSystem;
import akka.actor.typed.javadsl.Behaviors;
import akka.stream.*;
import akka.stream.javadsl.*;
import lombok.extern.slf4j.Slf4j;
import java.util.Optional;
import java.util.concurrent.CompletionStage;
@Slf4j
public class GraphActorSource {
private final static ActorSystem actorSystem = ActorSystem.create(Behaviors.empty(), "flowActorSystem");
public void runFlow() {
final Source<Integer, ActorRef> integerSource =
Source.actorRef(
elem -> {
if (elem == Done.done()) return Optional.of(CompletionStrategy.immediately());
else return Optional.empty();
},
elem -> Optional.empty(),
10,
OverflowStrategy.dropHead());
Flow<Integer, Integer, NotUsed> flow1 = Flow.of(Integer.class)
.map (x -> {
System.out.println("Flow 1 is processing " + x);
return (x * 2);
});
Sink<Integer, CompletionStage<Done>> sink = Sink.foreach(x -> {
System.out.println(x);
});
RunnableGraph<CompletionStage<Done>> graph = RunnableGraph.fromGraph(
GraphDSL.create(sink, (builder, out) -> {
SourceShape<Integer> sourceShape = builder.add(integerSource);
FlowShape<Integer, Integer> flow1Shape = builder.add(flow1);
FlowShape<Integer, Integer> flow2Shape = builder.add(flow1);
UniformFanOutShape<Integer, Integer> broadcast =
builder.add(Broadcast.create(2));
UniformFanInShape<Integer, Integer> merge =
builder.add(Merge.create(2));
builder.from(sourceShape)
.viaFanOut(broadcast)
.via(flow1Shape);
builder.from(broadcast).via(flow2Shape);
builder.from(flow1Shape)
.viaFanIn(merge)
.to(out);
builder.from(flow2Shape).viaFanIn(merge);
return ClosedShape.getInstance();
} )
);
graph.run(actorSystem);
}
public static void main(String args[]){
new GraphActorSource().runFlow();
}
}
How do I send data to the runnable graph via an actor?
Something like this?
integerSource.tell(1)
integerSource.tell(2)
integerSource.tell(3)
ActorRef.tell works. Construct the graph blueprint so that the source's ActorRef is returned when the blueprint is materialized and run.
For just one materialized object, use that object's type as the materialized type parameter of the Graph.
Here the materialized type parameter for integerSource is ActorRef.
The materialized type parameter for Graph is also ActorRef.
Only integerSource is passed to GraphDSL.create.
Source<Integer, ActorRef> integerSource = ...
Graph<ClosedShape, ActorRef> graph =
GraphDSL.create(integerSource, (builder, src) -> {
...
});
RunnableGraph<ActorRef> runnableGraph = RunnableGraph.fromGraph(graph);
ActorRef actorRef = runnableGraph.run(actorSystem);
actorRef.tell(1, ActorRef.noSender());
To access more than one materialized object, a tuple must be constructed to capture them. If two objects from the materialized graph are desired, say src and snk, then Pair<A, B> can capture both types.
Here both integerSource and sink are passed to GraphDSL.create.
The materialized ActorRef and CompletionStage are paired for the result of run with Pair::new.
The type Pair<ActorRef,CompletionStage<Done>> is the materialized type parameter of the Graph.
Source<Integer, ActorRef> integerSource = ...
Sink<Integer, CompletionStage<Done>> sink = ...
Graph<ClosedShape, Pair<ActorRef, CompletionStage<Done>>> graph =
GraphDSL.create(integerSource, sink, Pair::new, (builder, src, snk) -> {
....
});
RunnableGraph<Pair<ActorRef, CompletionStage<Done>>> runnableGraph =
RunnableGraph.fromGraph(graph);
Pair<ActorRef, CompletionStage<Done>> pair =
runnableGraph.run(actorSystem);
ActorRef actorRef = pair.first();
CompletionStage<Done> completionStage = pair.second();
actorRef.tell(1, ActorRef.noSender());
Full example:
(build.gradle)
apply plugin: "java"
apply plugin: "application"
mainClassName = "GraphActorSource"
repositories {
mavenCentral()
}
dependencies {
implementation "com.typesafe.akka:akka-actor-typed_2.13:2.6.19"
implementation "com.typesafe.akka:akka-stream-typed_2.13:2.6.19"
implementation 'org.slf4j:slf4j-jdk14:1.7.36'
}
compileJava {
options.compilerArgs << "-Xlint:unchecked"
}
(src/main/java/GraphActorSource.java)
import akka.Done;
import akka.NotUsed;
import akka.actor.ActorRef;
import akka.actor.Status.Success;
import akka.actor.typed.ActorSystem;
import akka.actor.typed.javadsl.Behaviors;
import akka.japi.Pair;
import akka.stream.*;
import akka.stream.javadsl.*;
import akka.util.Timeout;
import java.util.Optional;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.TimeUnit;
public class GraphActorSource {
private final static ActorSystem actorSystem =
ActorSystem.create(Behaviors.empty(), "flowActorSystem");
public void runFlow() {
// 1. Create graph (blueprint)
// 1a. Define source, flows, and sink
final Source<Integer, ActorRef> integerSource =
Source.actorRef
(
elem -> {
if (elem == Done.done()) return Optional.of(CompletionStrategy.immediately());
else return Optional.empty();
},
elem -> Optional.empty(),
10,
OverflowStrategy.dropHead()
);
Flow<Integer, Integer, NotUsed> flow1 = Flow.of(Integer.class)
.map (x -> {
System.out.println("Flow 1 is processing " + x);
return (100 + x);
});
Flow<Integer, Integer, NotUsed> flow2 = Flow.of(Integer.class)
.map (x -> {
System.out.println("Flow 2 is processing " + x);
return (200 + x);
});
Sink<Integer, CompletionStage<Done>> sink = Sink.foreach(x -> {
System.out.println("Sink received "+x);
});
// 1b. Connect nodes and flows into a graph.
// Inputs and output nodes (source, sink) will be produced at run start.
Graph<ClosedShape, Pair<ActorRef, CompletionStage<Done>>> graph =
GraphDSL.create(integerSource, sink, Pair::new, (builder, src, snk) -> {
UniformFanOutShape<Integer, Integer> broadcast =
builder.add(Broadcast.create(2));
FlowShape<Integer, Integer> flow1Shape = builder.add(flow1);
FlowShape<Integer, Integer> flow2Shape = builder.add(flow2);
UniformFanInShape<Integer, Integer> merge =
builder.add(Merge.create(2));
builder.from(src)
.viaFanOut(broadcast);
builder.from(broadcast.out(0))
.via(flow1Shape)
.toInlet(merge.in(0));
builder.from(broadcast.out(1))
.via(flow2Shape)
.toInlet(merge.in(1));
builder.from(merge)
.to(snk);
return ClosedShape.getInstance();
} );
RunnableGraph<Pair<ActorRef, CompletionStage<Done>>> runnableGraph =
RunnableGraph.fromGraph(graph);
// 2. Start run,
// which produces materialized source ActorRef and sink CompletionStage.
Pair<ActorRef, CompletionStage<Done>> pair =
runnableGraph.run(actorSystem);
ActorRef actorRef = pair.first();
CompletionStage<Done> completionStage = pair.second();
// On completion, terminates actor system (optional).
completionStage.thenRun(() -> {
System.out.println("Done, terminating.");
actorSystem.terminate();
});
// 3. Send messages to source actor
actorRef.tell(1, ActorRef.noSender());
actorRef.tell(2, ActorRef.noSender());
// The stream completes successfully with the following message
actorRef.tell(Done.done(), ActorRef.noSender());
}
public static void main(String args[]){
new GraphActorSource().runFlow();
}
}
References (Akka documentation, version 2.6.19):
Streams / Operators / Source.actorRef
Streams / Streams Cookbook / Working with operators
Related
I have a spring-cloud-stream project that uses the Kafka binder.
The application consumes messages in batch mode. I need to filter the consumed records by a specific header, so I use a BatchInterceptor:
@Bean
public ListenerContainerCustomizer<AbstractMessageListenerContainer<String, String>> customizer(
BatchInterceptor<String, String> customInterceptor
) {
return (((container, destinationName, group) -> {
container.setBatchInterceptor(customInterceptor);
log.info("Container customized");
}));
}
@Bean
public BatchInterceptor<String, String> customInterceptor() {
return (consumerRecords, consumer) -> {
log.info("Origin records count: {}", consumerRecords.count());
final Set<TopicPartition> partitions = consumerRecords.partitions();
final Map<TopicPartition, List<ConsumerRecord<String, String>>> filteredByHeader
= Stream.of(partitions).flatMap(Collection::stream)
.collect(Collectors.toMap(
Function.identity(),
p -> Stream.ofNullable(consumerRecords.records(p))
.flatMap(Collection::stream)
.filter(r -> Objects.nonNull(r.headers().lastHeader("TEST")))
.collect(Collectors.toList())
));
var filteredRecords = new ConsumerRecords<>(filteredByHeader);
log.info("Filtered count: {}", filteredRecords.count());
return filteredRecords;
};
}
Example code is here: batch interceptor example.
In the logs I see that the records are filtered successfully, but the filtered-out ones still get into the consumer.
Why does the BatchInterceptor not filter the records?
How can I filter ConsumerRecords by a specific header in spring-cloud-stream with batch mode enabled? You can run the tests from the example to reproduce the behavior.
You are using very old code (Boot 2.5.0), which is out of OSS support: https://spring.io/projects/spring-boot#support (the Cloud version is, too).
I tested your interceptor with current versions and it works fine.
Boot 2.7.5, Cloud 2021.0.4:
@SpringBootApplication
public class So74203611Application {
private static final Logger log = LoggerFactory.getLogger(So74203611Application.class);
public static void main(String[] args) {
SpringApplication.run(So74203611Application.class, args);
}
@Bean
public ListenerContainerCustomizer<AbstractMessageListenerContainer<String, String>> customizer(
BatchInterceptor<String, String> customInterceptor) {
return (((container, destinationName, group) -> {
container.setBatchInterceptor(customInterceptor);
log.info("Container customized {}", destinationName);
}));
}
@Bean
public BatchInterceptor<String, String> customInterceptor() {
return (consumerRecords, consumer) -> {
log.info("Origin records count: {}", consumerRecords.count());
final Set<TopicPartition> partitions = consumerRecords.partitions();
final Map<TopicPartition, List<ConsumerRecord<String, String>>> filteredByHeader = Stream.of(partitions)
.flatMap(Collection::stream)
.collect(Collectors.toMap(Function.identity(),
p -> Stream.ofNullable(consumerRecords.records(p)).flatMap(Collection::stream)
.filter(r -> Objects.nonNull(r.headers().lastHeader("TEST")))
.collect(Collectors.toList())));
var filteredRecords = new ConsumerRecords<>(filteredByHeader);
log.info("Filtered count: {}", filteredRecords.count());
return filteredRecords;
};
}
@Bean
Consumer<List<String>> input() {
return str -> {
System.out.println(str);
};
}
@Bean
ApplicationRunner runner(KafkaTemplate<byte[], byte[]> template) {
return args -> {
Headers headers = new RecordHeaders();
headers.add("TEST", "foo".getBytes());
ProducerRecord<byte[], byte[]> rec = new ProducerRecord<>("input-in-0", 0, 0L, null, "bar".getBytes(),
headers);
template.send(rec);
headers = new RecordHeaders();
rec = new ProducerRecord<>("input-in-0", 0, 0L, null, "baz".getBytes(), headers);
template.send(rec);
template.send(rec);
};
}
}
spring.cloud.stream.bindings.input-in-0.group=foo
spring.cloud.stream.bindings.input-in-0.consumer.batch-mode=true
The consumer receives only the record that carried the TEST header:
[bar]
I'm facing a performance problem while processing a big stream of objects (from one source) that is filtered and mapped with values from another big stream/collection (from a different source). I'm trying to do a kind of join (as in SQL).
My machine takes 11+ minutes to execute it.
I tried adding a filter before the map, but that made the situation worse.
What could I do to get better results?
I'll provide an example of what I'm trying to achieve.
Note that the filter uses just the ID, but more properties common to both streams could be used.
import static java.util.stream.Collectors.toSet;
import java.io.IOException;
import java.time.Duration;
import java.time.Instant;
import java.util.Collection;
import java.util.Objects;
import java.util.Random;
import java.util.stream.IntStream;
import java.util.stream.Stream;
public class ProcessorQuestion {
static record Element(int id, String content) {
public Element(int id, String content) {
this.id = id;
this.content = content == null ? "Data " + id : content;
}
}
static record Row(int id, String content) {
public Row(int id, String content) {
this.id = id;
this.content = content == null ? "Row " + id : content;
}
}
static record RowVsElement(Row row, Element element) {
}
private static Random r = new Random();
protected static Stream<Element> loadElementsData() {
return IntStream.range(1, 1_000_000)
.mapToObj(value -> new Element(r.nextInt(235_000), null));
}
protected static Stream<Row> loadRowsData() {
return IntStream.range(1, 235_000)
.mapToObj(value -> new Row(r.nextInt(235_000), null));
}
public static void main(String[] args) throws IOException {
var init = Instant.now();
final ProcessorQuestion processor = new ProcessorQuestion(loadElementsData());
processor.process();
System.err.println("Runned in " + Duration.between(init, Instant.now()).toMinutes() + " min");
}
private final Collection<Element> entries;
public ProcessorQuestion(Stream<Element> entries) {
this.entries = entries.collect(toSet());
}
void process() {
// System.out.println("rows size = " + rows.size());
System.out.println("elements size = " + entries.size());
loadRowsData().parallel()
// .filter(r0 -> entries.stream()
// .anyMatch(
// entry -> entry.getId() == r0.getId()))
.map(r1 -> entries.parallelStream()
.filter(entry -> entry.id() == r1.id())
.findFirst()
.map(elem -> new RowVsElement(r1, elem))
.orElse(null))
.filter(Objects::nonNull)
.forEachOrdered(pair -> saveOnMedia(pair.row, pair.element));
}
void saveOnMedia(Row row, Element element) {
StringBuilder rsb = new StringBuilder(Integer.toString(row.id()));
rsb.append(Integer.toString(row.id()));
rsb.append(";");
rsb.append(Integer.toString(element.id()));
rsb.append(";");
rsb.append(row.content());
rsb.append(";");
rsb.append(element.content());
rsb.append(System.lineSeparator());
System.out.println(rsb.toString());
}
}
I've taken some screenshots of the execution in VisualVM:
Try using a Map, as people suggested:
Map<Integer, Element> elementMap = entries.stream().collect(Collectors.toMap(it -> it.id(), it -> it, (a, b) -> a));
loadRowsData()
// .filter(r0 -> entries.stream()
// .anyMatch(
// entry -> entry.getId() == r0.getId()))
.map(r1 -> elementMap.containsKey(r1.id()) ? new RowVsElement(r1, elementMap.get(r1.id())) : null)
.filter(Objects::nonNull)
.forEachOrdered(pair -> saveOnMedia(pair.row, pair.element));
It should complete in about one second if you just save the results to a list instead of printing them all (see the sketch below); printing the output can itself take a couple of seconds.
As Knittl said in the comments: "Your algorithm has O(m*n) complexity – with m=235000 and n=1000000, that's ~235000000000". That's the reason it takes so long.
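To illustrate the point about saving to a list instead of printing inside the stream, here is a rough sketch; it reuses the question's Row/Element/RowVsElement records, loadRowsData(), and the elementMap built above, and assumes java.util.List and java.util.stream.Collectors are imported:
// Sketch only: do one hash-map lookup per row, collect the matches, and defer all I/O.
List<RowVsElement> joined = loadRowsData()
        .map(r1 -> {
            Element e = elementMap.get(r1.id()); // single O(1) lookup per row
            return e == null ? null : new RowVsElement(r1, e);
        })
        .filter(Objects::nonNull)
        .collect(Collectors.toList());
// Persist 'joined' in one pass afterwards (e.g. build one StringBuilder or write
// to a file) rather than calling System.out.println once per pair.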
I've been tinkering with wrapping an old-style listener interface using RxJava. What I've come up with seems to work, but the usage of Observable.using feels a bit awkward.
The requirements are:
Only one subscription per id to the underlying service.
The latest value for a given id should be cached and served to new subscribers.
We must unsubscribe from the underlying service if nothing is listening to an id.
Is there a better way? The following is what I've got.
static class MockServiceRXAdapterImpl1 implements MockServiceRXAdapter {
PublishSubject<MockResponse> mockResponseObservable = PublishSubject.create();
MockService mockService = new MockService(mockResponse -> mockResponseObservable.onNext(mockResponse));
final ConcurrentMap<String, Observable<String>> subscriptionMap = new ConcurrentHashMap<>();
public Observable<String> getObservable(String id) {
return Observable.using(() -> subscriptionMap.computeIfAbsent(
id,
key -> mockResponseObservable.filter(mockResponse -> mockResponse.id.equals(id))
.doOnSubscribe(disposable -> mockService.subscribe(id))
.doOnDispose(() -> {
mockService.unsubscribe(id);
subscriptionMap.remove(id);
})
.map(mockResponse -> mockResponse.value)
.replay(1)
.refCount()),
observable -> observable,
observable -> {
}
);
}
}
You may use Observable.create, so the code may look like this:
final Map<String, Observable<String>> subscriptionMap = new HashMap<>();
MockService mockService = new MockService();
public Observable<String> getObservable(String id) {
log.info("looking for root observable");
if (subscriptionMap.containsKey(id)) {
log.info("found root observable");
return subscriptionMap.get(id);
} else {
synchronized (subscriptionMap) {
if (!subscriptionMap.containsKey(id)) {
log.info("creating new root observable");
final Observable<String> responseObservable = Observable.<MockResponse>create(emitter -> {
MockServiceListener listener = emitter::onNext;
mockService.addListener(listener);
emitter.setCancellable(() -> {
mockService.removeListener(listener);
mockService.unsubscribe(id);
synchronized (subscriptionMap) {
subscriptionMap.remove(id);
}
});
mockService.subscribe(id);
})
.filter(mockResponse -> mockResponse.id.equals(id))
.map(mockResponse -> mockResponse.value)
.replay(1)
.refCount();
subscriptionMap.put(id, responseObservable);
} else {
log.info("Another thread created the observable for us");
}
return subscriptionMap.get(id);
}
}
}
I think I've gotten it to work using .groupBy(...).
In my case Response.getValue() returns an int, but you get the idea:
class Adapter
{
Subject<Response> msgSubject;
ThirdPartyService service;
Map<String, Observable<Integer>> observables;
Observable<GroupedObservable<String, Response>> groupedObservables;
public Adapter()
{
msgSubject = PublishSubject.<Response>create().toSerialized();
service = new MockThirdPartyService( msgSubject::onNext );
groupedObservables = msgSubject.groupBy( Response::getId );
observables = Collections.synchronizedMap( new HashMap<>() );
}
public Observable<Integer> getObservable( String id )
{
return observables.computeIfAbsent( id, this::doCreateObservable );
}
private Observable<Integer> doCreateObservable( String id )
{
service.subscribe( id );
return groupedObservables
.filter( group -> group.getKey().equals( id ))
.doOnDispose( () -> {
synchronized ( observables )
{
service.unsubscribe( id );
observables.remove( id );
}
} )
.concatMap( Functions.identity() )
.map( Response::getValue )
.replay( 1 )
.refCount();
}
}
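For completeness, a hypothetical usage sketch of the Adapter above (RxJava 2 assumed; "price-1" is a made-up id, and values only flow once the third-party service emits Responses for it):
Adapter adapter = new Adapter();
// The first getObservable("price-1") call builds the shared observable via
// computeIfAbsent, which also calls service.subscribe("price-1").
io.reactivex.disposables.Disposable first = adapter.getObservable("price-1")
        .subscribe(v -> System.out.println("first: " + v));
// A later subscriber reuses the same shared observable; if a value has already
// been emitted, replay(1).refCount() delivers the cached value immediately.
io.reactivex.disposables.Disposable second = adapter.getObservable("price-1")
        .subscribe(v -> System.out.println("second: " + v));
// Disposing both subscribers drops the refCount to zero: doOnDispose runs,
// the underlying service is unsubscribed and the map entry is removed.
first.dispose();
second.dispose();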
I have a Java 8 application working with Apache Kafka 2.11-0.10.1.0. I need to use the seek feature to poll old messages from partitions. However, I get a No current assignment for partition exception every time I try to seek by offset. Here's my class, which is responsible for seeking topics to the specified timestamp:
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.util.CollectionUtils;
import java.time.Instant;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
/**
* The main purpose of this class is to move the fetching point for each partition of the {@link KafkaConsumer}
* to some offset which is determined either by timestamp or by offset number.
*/
public class KafkaSeeker {
public static final long APP_STARTUP_TIME = Instant.now().toEpochMilli();
private final Logger LOGGER = LoggerFactory.getLogger(this.getClass());
private final KafkaConsumer<String, String> kafkaConsumer;
private ConsumerRecords<String, String> polledRecords;
public KafkaSeeker(KafkaConsumer<String, String> kafkaConsumer) {
this.kafkaConsumer = kafkaConsumer;
this.polledRecords = new ConsumerRecords<>(Collections.emptyMap());
}
/**
* For each assigned or subscribed topic, {@link org.apache.kafka.clients.consumer.KafkaConsumer#seek(TopicPartition, long)}
* moves the fetching pointer to the specified {@code timestamp}.
* If no messages were found in a partition for a topic,
* then {@link org.apache.kafka.clients.consumer.KafkaConsumer#seekToEnd(Collection)} will be called.
*
* Due to the laziness of {@link KafkaConsumer#subscribe(Pattern)} and {@link KafkaConsumer#assign(Collection)},
* the method needs to execute a dummy {@link KafkaConsumer#poll(long)} call. All {@link ConsumerRecords} which were
* polled from the buffer are swallowed and produce warning logs.
*
* @param timestamp is used to find the proper offset to seek to
* @param topics are used to seek only specific topics. If not specified or empty, all subscribed topics are used.
*/
public Map<TopicPartition, OffsetAndTimestamp> seek(long timestamp, Collection<String> topics) {
this.polledRecords = kafkaConsumer.poll(0);
Collection<TopicPartition> topicPartitions;
if (CollectionUtils.isEmpty(topics)) {
topicPartitions = kafkaConsumer.assignment();
} else {
topicPartitions = topics.stream()
.map(it -> {
List<Integer> partitions = kafkaConsumer.partitionsFor(it).stream()
.map(PartitionInfo::partition).collect(Collectors.toList());
return partitions.stream().map(partition -> new TopicPartition(it, partition));
})
.flatMap(it -> it)
.collect(Collectors.toList());
}
if (topicPartitions.isEmpty()) {
throw new IllegalStateException("Kafka consumer doesn't have any subscribed topics.");
}
Map<TopicPartition, Long> timestampsByTopicPartitions = topicPartitions.stream()
.collect(Collectors.toMap(Function.identity(), topicPartition -> timestamp));
Map<TopicPartition, Long> beginningOffsets = kafkaConsumer.beginningOffsets(topicPartitions);
Map<TopicPartition, OffsetAndTimestamp> offsets = kafkaConsumer.offsetsForTimes(timestampsByTopicPartitions);
for (Map.Entry<TopicPartition, OffsetAndTimestamp> entry : offsets.entrySet()) {
TopicPartition topicPartition = entry.getKey();
if (entry.getValue() != null) {
LOGGER.info("Kafka seek topic:partition [{}:{}] from [{} offset] to [{} offset].",
topicPartition.topic(),
topicPartition.partition(),
beginningOffsets.get(topicPartition),
entry.getValue());
kafkaConsumer.seek(topicPartition, entry.getValue().offset());
} else {
LOGGER.info("Kafka seek topic:partition [{}:{}] from [{} offset] to the end of partition.",
topicPartition.topic(),
topicPartition.partition());
kafkaConsumer.seekToEnd(Collections.singleton(topicPartition));
}
}
return offsets;
}
public ConsumerRecords<String, String> getPolledRecords() {
return polledRecords;
}
}
Before calling the method, I have the consumer subscribed to a single topic like this: consumer.subscribe(singletonList(kafkaTopic));. When I call kafkaConsumer.assignment() it returns zero assigned TopicPartitions. But if I specify the topic and get its partitions, I do have valid TopicPartitions, although they fail on the seek call with the error in the title. What am I forgetting?
The correct way to reliably seek and check the current assignment is to wait for the onPartitionsAssigned() callback after subscribing. On a newly created (still not connected) consumer, calling poll() once does not guarantee it will immediately be connected and assigned partitions.
As a basic example, see the code below, which subscribes to a topic and, in the assignment callback, seeks to the desired position. You'll notice that the poll loop correctly sees only records from the seek location, not from the previously committed or reset offset.
public static final Map<TopicPartition, Long> offsets = Map.of(new TopicPartition("testtopic", 0), 5L);
public static void main(String args[]) {
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "test");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
try (Consumer<String, String> consumer = new KafkaConsumer<>(props)) {
consumer.subscribe(Collections.singletonList("testtopic"), new ConsumerRebalanceListener() {
@Override
public void onPartitionsRevoked(Collection<TopicPartition> partitions) {}
@Override
public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
System.out.println("Assigned " + partitions);
for (TopicPartition tp : partitions) {
OffsetAndMetadata oam = consumer.committed(tp);
if (oam != null) {
System.out.println("Current offset is " + oam.offset());
} else {
System.out.println("No committed offsets");
}
Long offset = offsets.get(tp);
if (offset != null) {
System.out.println("Seeking to " + offset);
consumer.seek(tp, offset);
}
}
}
});
for (int i = 0; i < 10; i++) {
System.out.println("Calling poll");
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100L));
for (ConsumerRecord<String, String> r : records) {
System.out.println("record from " + r.topic() + "-" + r.partition() + " at offset " + r.offset());
}
}
}
}
Alternatively, you can skip subscribe() and assign the partitions to the consumer manually:
KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props);
// Get topic partitions
List<TopicPartition> partitions = consumer
.partitionsFor(topic)
.stream()
.map(partitionInfo ->
new TopicPartition(topic, partitionInfo.partition()))
.collect(Collectors.toList());
// Explicitly assign the partitions to our consumer
consumer.assign(partitions);
//seek, query offsets, or poll
Please note that this disables consumer group management and rebalancing operations. When possible, use @Mickael Maison's approach.
How can I identify the topic name from a message in Kafka?
String[] topics = { "test", "test1", "test2" };
for (String t : topics) {
topicMap.put(t, new Integer(3));
}
SparkConf conf = new SparkConf().setAppName("KafkaReceiver")
.set("spark.streaming.receiver.writeAheadLog.enable", "false")
.setMaster("local[4]")
.set("spark.cassandra.connection.host", "localhost");
;
final JavaSparkContext sc = new JavaSparkContext(conf);
JavaStreamingContext jssc = new JavaStreamingContext(sc, new Duration(
1000));
/* Receive Kafka streaming inputs */
JavaPairReceiverInputDStream<String, String> messages = KafkaUtils
.createStream(jssc, "localhost:2181", "test-group",
topicMap);
JavaDStream<MessageAndMetadata> data =
messages.map(new Function<Tuple2<String, String>, MessageAndMetadata>()
{
public MessageAndMetadata call(Tuple2<String, String> message)
{
System.out.println("message ="+message._2);
return null;
}
}
);
I can fetch messages from the Kafka producer. But since the consumer is now consuming from three topics, I need to identify the topic name for each message.
As of Spark 1.5.0, the official documentation encourages using the no-receiver/direct approach, which graduated from experimental in 1.5.0.
This new Direct API allows you to easily obtain a message and its metadata, apart from other good things.
Unfortunately, with the receiver-based approach this is not straightforward, as KafkaReceiver and ReliableKafkaReceiver in Spark's source code only store MessageAndMetadata.key and message.
There are two open tickets related to this issue in Spark's JIRA:
https://issues.apache.org/jira/browse/SPARK-3146
https://issues.apache.org/jira/browse/SPARK-4960
which have been open for a while.
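If you can move to the Direct API instead, a minimal sketch of my own (an illustration, not tested code from this answer) for the Kafka 0.8 integration (spark-streaming-kafka artifact, Spark 1.5+) could look like the following; it reuses the jssc context from your code, the broker address is illustrative, and the topic of each record comes from the OffsetRange of its RDD partition:
// Sketch only. Needs KafkaUtils, HasOffsetRanges and OffsetRange from
// org.apache.spark.streaming.kafka, org.apache.spark.TaskContext,
// org.apache.spark.streaming.api.java.JavaPairInputDStream and kafka.serializer.StringDecoder.
Map<String, String> kafkaParams = new HashMap<>();
kafkaParams.put("metadata.broker.list", "localhost:9092"); // illustrative broker address
Set<String> topicSet = new HashSet<>(Arrays.asList("test", "test1", "test2"));
JavaPairInputDStream<String, String> directStream = KafkaUtils.createDirectStream(
        jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
        kafkaParams, topicSet);
directStream.foreachRDD(rdd -> {
    // Each Spark partition of a direct stream maps to exactly one Kafka topic-partition.
    OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
    rdd.foreachPartition(records -> {
        String topic = offsetRanges[TaskContext.get().partitionId()].topic();
        records.forEachRemaining(record ->
                System.out.println("topic = " + topic + ", message = " + record._2()));
    });
});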
If you need to stay on the receiver-based approach, here is a dirty copy/paste/modify of Spark's source code to solve your issue:
package org.apache.spark.streaming.kafka
import java.lang.{Integer => JInt}
import java.util.{Map => JMap, Properties}
import kafka.consumer.{KafkaStream, Consumer, ConsumerConfig, ConsumerConnector}
import kafka.serializer.{Decoder, StringDecoder}
import kafka.utils.VerifiableProperties
import org.apache.spark.Logging
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.api.java.{JavaReceiverInputDStream, JavaStreamingContext}
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.receiver.Receiver
import org.apache.spark.streaming.util.WriteAheadLogUtils
import org.apache.spark.util.ThreadUtils
import scala.collection.JavaConverters._
import scala.collection.Map
import scala.reflect._
object MoreKafkaUtils {
def createStream(
jssc: JavaStreamingContext,
zkQuorum: String,
groupId: String,
topics: JMap[String, JInt],
storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
): JavaReceiverInputDStream[(String, String, String)] = {
val kafkaParams = Map[String, String](
"zookeeper.connect" -> zkQuorum, "group.id" -> groupId,
"zookeeper.connection.timeout.ms" -> "10000")
val walEnabled = WriteAheadLogUtils.enableReceiverLog(jssc.ssc.conf)
new KafkaInputDStreamWithTopic[String, String, StringDecoder, StringDecoder](jssc.ssc, kafkaParams, topics.asScala.mapValues(_.intValue()), walEnabled, storageLevel)
}
}
private[streaming]
class KafkaInputDStreamWithTopic[
K: ClassTag,
V: ClassTag,
U <: Decoder[_] : ClassTag,
T <: Decoder[_] : ClassTag](
@transient ssc_ : StreamingContext,
kafkaParams: Map[String, String],
topics: Map[String, Int],
useReliableReceiver: Boolean,
storageLevel: StorageLevel
) extends ReceiverInputDStream[(K, V, String)](ssc_) with Logging {
def getReceiver(): Receiver[(K, V, String)] = {
if (!useReliableReceiver) {
new KafkaReceiverWithTopic[K, V, U, T](kafkaParams, topics, storageLevel)
} else {
new ReliableKafkaReceiverWithTopic[K, V, U, T](kafkaParams, topics, storageLevel)
}
}
}
private[streaming]
class KafkaReceiverWithTopic[
K: ClassTag,
V: ClassTag,
U <: Decoder[_] : ClassTag,
T <: Decoder[_] : ClassTag](
kafkaParams: Map[String, String],
topics: Map[String, Int],
storageLevel: StorageLevel
) extends Receiver[(K, V, String)](storageLevel) with Logging {
// Connection to Kafka
var consumerConnector: ConsumerConnector = null
def onStop() {
if (consumerConnector != null) {
consumerConnector.shutdown()
consumerConnector = null
}
}
def onStart() {
logInfo("Starting Kafka Consumer Stream with group: " + kafkaParams("group.id"))
// Kafka connection properties
val props = new Properties()
kafkaParams.foreach(param => props.put(param._1, param._2))
val zkConnect = kafkaParams("zookeeper.connect")
// Create the connection to the cluster
logInfo("Connecting to Zookeeper: " + zkConnect)
val consumerConfig = new ConsumerConfig(props)
consumerConnector = Consumer.create(consumerConfig)
logInfo("Connected to " + zkConnect)
val keyDecoder = classTag[U].runtimeClass.getConstructor(classOf[VerifiableProperties])
.newInstance(consumerConfig.props)
.asInstanceOf[Decoder[K]]
val valueDecoder = classTag[T].runtimeClass.getConstructor(classOf[VerifiableProperties])
.newInstance(consumerConfig.props)
.asInstanceOf[Decoder[V]]
// Create threads for each topic/message Stream we are listening
val topicMessageStreams = consumerConnector.createMessageStreams(
topics, keyDecoder, valueDecoder)
val executorPool =
ThreadUtils.newDaemonFixedThreadPool(topics.values.sum, "KafkaMessageHandler")
try {
// Start the messages handler for each partition
topicMessageStreams.values.foreach { streams =>
streams.foreach { stream => executorPool.submit(new MessageHandler(stream)) }
}
} finally {
executorPool.shutdown() // Just causes threads to terminate after work is done
}
}
// Handles Kafka messages
private class MessageHandler(stream: KafkaStream[K, V])
extends Runnable {
def run() {
logInfo("Starting MessageHandler.")
try {
val streamIterator = stream.iterator()
while (streamIterator.hasNext()) {
val msgAndMetadata = streamIterator.next()
store((msgAndMetadata.key, msgAndMetadata.message, msgAndMetadata.topic))
}
} catch {
case e: Throwable => reportError("Error handling message; exiting", e)
}
}
}
}