I'm writing some kind of middleware HTTP proxy with cache. The workflow is:
Client requests this proxy for resource
If resurce exists in cache, proxy returns it
If resource wasn't found, proxy fetching remote resource and returns to the user. Proxy saves this resource to the cache on data loading.
My interfaces have Publisher<ByteBuffer> stream for remote resource, cache which accepts Publisher<ByteBuffer> to save, and clients' connection which accepts Publisher<ByteBuffer> as a response:
// remote resource
interface Resource {
Publisher<ByteBuffer> fetch();
}
// cache
interface Cache {
Completable save(Publisher<ByteBuffer> data);
}
// clien response connection
interface Connection {
Completable send(Publisher<ByteBuffer> data);
}
My problem is that I need to lazy save this stream of byte buffers to cache when sending the response to the client, so the client should be responsible for requesting ByteByffer chunks from remote resource, not cache.
I tried to use Publisher::cache method, but it's not a good choice for me, because it keeps all received data in memory, it's not acceptable, since cached data may be few GB of size.
As a workaround, I created Subject filled by next items received from Resource:
private final Cache cache;
private final Connection out;
Completable proxy(Resource res) {
Subject<ByteBuffer> mirror = PublishSUbject.create();
return Completable.mergeArray(
out.send(res.fetch().doOnNext(mirror::onNext),
cache.save(mirror.toFlowable(BackpressureStrategy.BUFFER))
);
}
Is it possible to reuse same Publisher without caching items in memory, and where only one subscriber will be responsible for requesting items from publisher?
I might be missing something (added comment about my version of the Publisher interface being different).
But.. here's how I would do something like this conceptually.
I'm going to simplify the interfaces to deal with Integers:
// remote resource
interface Resource {
ConnectableObservable<Integer> fetch();
}
// cache
interface Cache {
Completable save(Integer data);
}
// client response connection
interface Connection {
Completable send(Integer data);
}
I'd use Observable::publish to create a ConnectableObservable and establish two subscriptions:
#Test
public void testProxy()
{
// Override schedulers:
TestScheduler s = new TestScheduler();
RxJavaPlugins.setIoSchedulerHandler(
scheduler -> s );
RxJavaPlugins.setComputationSchedulerHandler(
scheduler -> s );
// Mock interfaces:
Resource resource = () -> Observable.range( 1, 100 )
.publish();
Cache cache = data -> Completable.fromObservable( Observable.just( data )
.delay( 100, TimeUnit.MILLISECONDS )
.doOnNext( __ -> System.out.println( String.format( "Caching %d", data ))));
Connection connection = data -> Completable.fromObservable( Observable.just( data )
.delay( 500, TimeUnit.MILLISECONDS )
.doOnNext( __ -> System.out.println( String.format( "Sending %d", data ))));
// Subscribe to resource:
ConnectableObservable<Integer> observable = resource.fetch();
observable
.observeOn( Schedulers.io() )
.concatMapCompletable( data -> connection.send( data ))
.subscribe();
observable
.observeOn( Schedulers.computation() )
.concatMapCompletable( data -> cache.save( data ))
.subscribe();
observable.connect();
// Simulate passage of time:
s.advanceTimeBy( 10, TimeUnit.SECONDS );
}
Output:
Caching 1
Caching 2
Caching 3
Caching 4
Sending 1
Caching 5
Caching 6
Caching 7
Caching 8
Caching 9
Sending 2
Caching 10
. . .
Update
Based on your comments, it sounds like respecting backpressure is important in your case.
Let's say you have a Publisher somewhere that honors backpressure, you can transform it into a Flowable as follows:
Flowable<T> flowable = Flowable.fromPublisher( publisher );
Once you have a Flowable you can allow for multiple subscribers without worrying about each subscriber having to request values from the Publisher (or either subscriber from missing any events while establishing the subscriptions). You do that by calling flowable.publish() to create a ConnectableFlowable.
ConnectableFlowable<T> flowable = Flowable.fromPublisher( publisher ).publish();
out.send(flowable); // calls flowable.subscribe()
cache.save(flowable); // calls flowable.subscribe()
flowable.connect(); // begins emitting values
Related
I'd like to join data coming in from two Kafka topics ("left" and "right").
Matching records are to be joined using an ID, but if a "left" or a "right" record is missing, the other one should be passed downstream after a certain timeout. Therefore I have chosen to use the coGroup function.
This works, but there is one problem: If there is no message at all, there is always at least one record which stays in an internal buffer for good. It gets pushed out when new messages arrive. Otherwise it is stuck.
The expected behaviour is that all records should be pushed out after the configured idle timeout has been reached.
Some information which might be relevant
Flink 1.14.4
The Flink parallelism is set to 8, so is the number of partitions in both Kafka topics.
Flink checkpointing is enabled
Event-time processing is to be used
Lombok is used: So val is like final var
Some code snippets:
Relevant join settings
public static final int AUTO_WATERMARK_INTERVAL_MS = 500;
public static final Duration SOURCE_MAX_OUT_OF_ORDERNESS = Duration.ofMillis(4000);
public static final Duration SOURCE_IDLE_TIMEOUT = Duration.ofMillis(1000);
public static final Duration TRANSFORMATION_MAX_OUT_OF_ORDERNESS = Duration.ofMillis(5000);
public static final Duration TRANSFORMATION_IDLE_TIMEOUT = Duration.ofMillis(1000);
public static final Time JOIN_WINDOW_SIZE = Time.milliseconds(1500);
Create KafkaSource
private static KafkaSource<JoinRecord> createKafkaSource(Config config, String topic) {
val properties = KafkaConfigUtils.createConsumerConfig(config);
val deserializationSchema = new KafkaRecordDeserializationSchema<JoinRecord>() {
#Override
public void deserialize(ConsumerRecord<byte[], byte[]> record, Collector<JoinRecord> out) {
val m = JsonUtils.deserialize(record.value(), JoinRecord.class);
val copy = m.toBuilder()
.partition(record.partition())
.build();
out.collect(copy);
}
#Override
public TypeInformation<JoinRecord> getProducedType() {
return TypeInformation.of(JoinRecord.class);
}
};
return KafkaSource.<JoinRecord>builder()
.setProperties(properties)
.setBootstrapServers(config.kafkaBootstrapServers)
.setTopics(topic)
.setGroupId(config.kafkaInputGroupIdPrefix + "-" + String.join("_", topic))
.setDeserializer(deserializationSchema)
.setStartingOffsets(OffsetsInitializer.latest())
.build();
}
Create DataStreamSource
Then the DataStreamSource is built on top of the KafkaSource:
Configure "max out of orderness"
Configure "idleness"
Extract timestamp from record, to be used for event time processing
private static DataStreamSource<JoinRecord> createLeftSource(Config config,
StreamExecutionEnvironment env) {
val leftKafkaSource = createLeftKafkaSource(config);
val leftWms = WatermarkStrategy
.<JoinRecord>forBoundedOutOfOrderness(SOURCE_MAX_OUT_OF_ORDERNESS)
.withIdleness(SOURCE_IDLE_TIMEOUT)
.withTimestampAssigner((joinRecord, __) -> joinRecord.timestamp.toEpochSecond() * 1000L);
return env.fromSource(leftKafkaSource, leftWms, "left-kafka-source");
}
Use keyBy
The keyed sources are created on top of the DataSource instances like this:
Again configure "out of orderness" and "idleness"
Again extract timestamp
val leftWms = WatermarkStrategy
.<JoinRecord>forBoundedOutOfOrderness(TRANSFORMATION_MAX_OUT_OF_ORDERNESS)
.withIdleness(TRANSFORMATION_IDLE_TIMEOUT)
.withTimestampAssigner((joinRecord, __) -> {
if (VERBOSE_JOIN)
log.info("Left : " + joinRecord);
return joinRecord.timestamp.toEpochSecond() * 1000L;
});
val leftKeyedSource = leftSource
.keyBy(jr -> jr.id)
.assignTimestampsAndWatermarks(leftWms)
.name("left-keyed-source");
Join using coGroup
The join then combines the left and the right keyed sources
val joinedStream = leftKeyedSource
.coGroup(rightKeyedSource)
.where(left -> left.id)
.equalTo(right -> right.id)
.window(TumblingEventTimeWindows.of(JOIN_WINDOW_SIZE))
.apply(new CoGroupFunction<JoinRecord, JoinRecord, JoinRecord>() {
#Override
public void coGroup(Iterable<JoinRecord> leftRecords,
Iterable<JoinRecord> rightRecords,
Collector<JoinRecord> out) {
// Transform
val result = ...;
out.collect(result);
}
Write stream to console
The resulting joinedStream is written to the console:
val consoleSink = new PrintSinkFunction<JoinRecord>();
joinedStream.addSink(consoleSink);
How can I configure this join operation, so that all records are pushed downstream after the configured idle timeout?
If it can't be done this way: Is there another option?
This is the expected behavior. withIdleness doesn't try to handle the case where all streams are idle. It only helps in cases where there are still events flowing from at least one source partition/shard/split.
To get the behavior you desire (in the context of a continuous streaming job), you'll have to implement a custom watermark strategy that advances the watermark based on a processing time timer. Here's an implementation that uses the legacy watermark API.
On the other hand, if the job is complete and you just want to drain the final results before shutting it down, you can use the --drain option when you stop the job. Or if you use bounded sources this will happen automatically.
I have a Spring Boot app, where I receive one single message from a Azure Service Bus queue session.
The code is:
#Autowired
ServiceBusSessionReceiverAsyncClient apiMessageQueueIntegrator;
.
.
.
Mono<ServiceBusReceiverAsyncClient> receiverMono = apiMessageQueueIntegrator.acceptSession(sessionid);
Disposable subscription = Flux.usingWhen(receiverMono,
receiver -> receiver.receiveMessages(),
receiver -> Mono.fromRunnable(() -> receiver.close()))
.subscribe(message -> {
// Process message.
logger.info(String.format("Message received from quque. Session id: %s. Contents: %s%n", message.getSessionId(),
message.getBody()));
receivedMessage.setReceivedMessage(message);
timeoutCheck.countDown();
}, error -> {
logger.info("Queue error occurred: " + error);
});
As I am receiving only one message from the session, I use a CountDownLatch(1) to dispose of the subscription when I have received the message.
The documentation of the library says that it is possible to use Mono.usingWhen instead of Flux.usingWhen if I only expect one message, but I cannot find any examples of this anywhere, and I have not been able to figure out how to rewrite this code to do this.
How would the pasted code look if I were to use Mono.usingWhen instead?
Thank you conniey. Posting your suggestion as an answer to help other community members.
By default receiveMessages() is a Flux because we imagine the messages from a session to be "infinitely long". In your case, you only want the first message in the stream, so we use the next() operator.
The usage of the countdown latch is probably not the best approach. In the sample, we had one there so that the program didn't end before the messages were received. .subscribe is not a blocking call, it sets up the handlers and moves onto the next line of code.
Mono<ServiceBusReceiverAsyncClient> receiverMono = sessionReceiver.acceptSession("greetings-id");
Mono<ServiceBusReceivedMessage> singleMessageMono = Mono.usingWhen(receiverMono,
receiver -> {
// Anything you wish to do with the receiver.
// In this case we only want to take the first message, so we use the "next" operator. This returns a
// Mono.
return receiver.receiveMessages().next();
},
receiver -> Mono.fromRunnable(() -> receiver.close()));
try {
// Turns this into a blocking call. .block() waits indefinitely, so we have a timeout.
ServiceBusReceivedMessage message = singleMessageMono.block(Duration.ofSeconds(30));
if (message != null) {
// Process message.
}
} catch (Exception error) {
System.err.println("Error occurred: " + error);
}
You can refer to GitHub issue:ServiceBusSessionReceiverAsyncClient - Mono instead of Flux
I am creating a app in Flink to
Read Messages from a topic
Do some simple process on it
Write Result to a different topic
My code does work, however it does not run in parallel
How do I do that?
It seems my code runs only on one thread/block?
On the Flink Web Dashboard:
App goes to running status
But, there is only one block shown in the overview subtasks
And Bytes Received / Sent, Records Received / Sent is always zero ( no Update )
Here is my code, please assist me in learning how to split my app to be able to run in parallel, and am I writing the app correctly?
public class SimpleApp {
public static void main(String[] args) throws Exception {
// create execution environment INPUT
StreamExecutionEnvironment env_in =
StreamExecutionEnvironment.getExecutionEnvironment();
// event time characteristic
env_in.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
// production Ready (Does NOT Work if greater than 1)
env_in.setParallelism(Integer.parseInt(args[0].toString()));
// configure kafka consumer
Properties properties = new Properties();
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("auto.offset.reset", "earliest");
// create a kafka consumer
final DataStream<String> consumer = env_in
.addSource(new FlinkKafkaConsumer09<>("test", new
SimpleStringSchema(), properties));
// filter data
SingleOutputStreamOperator<String> result = consumer.filter(new
FilterFunction<String>(){
#Override
public boolean filter(String s) throws Exception {
return s.substring(0, 2).contentEquals("PS");
}
});
// Process Data
// Transform String Records to JSON Objects
SingleOutputStreamOperator<JSONObject> data = result.map(new
MapFunction<String, JSONObject>()
{
#Override
public JSONObject map(String value) throws Exception
{
JSONObject jsnobj = new JSONObject();
if(value.substring(0, 2).contentEquals("PS"))
{
// 1. Raw Data
jsnobj.put("Raw_Data", value.substring(0, value.length()-6));
// 2. Comment
int first_index_comment = value.indexOf("$");
int last_index_comment = value.lastIndexOf("$") + 1;
// - set comment
String comment =
value.substring(first_index_comment, last_index_comment);
comment = comment.substring(0, comment.length()-6);
jsnobj.put("Comment", comment);
}
else {
jsnobj.put("INVALID", value);
}
return jsnobj;
}
});
// Write JSON to Kafka Topic
data.addSink(new FlinkKafkaProducer09<JSONObject>("localhost:9092",
"FilteredData",
new SimpleJsonSchema()));
env_in.execute();
}
}
My code does work, but it seems to run only on a single thread
( One block shown ) in web interface ( No passing of data, hence the bytes sent / received are not updated ).
How do I make it run in parallel ?
To run your job in parallel you can do 2 things:
Increase the parallelism of your job at the env level - i.e. do something like
StreamExecutionEnvironment env_in =
StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(4);
But this would only increase parallelism at flink end after it reads the data, so if the source is producing data faster it might not be fully utilized.
To fully parallelize your job, setup multiple partitions for your kafka topic, ideally the amount of parallelism you would want with your flink job. So, you might want to do something like below when you are creating your kafka topic:
bin/kafka-topics.sh --create --zookeeper localhost:2181
--replication-factor 3 --partitions 4 --topic test
I'm wondering if any one experienced the same problem.
We have a Vert.x application and in the end it's purpose is to insert 600 million rows into a Cassandra cluster. We are testing the speed of Vert.x in combination with Cassandra by doing tests in smaller amounts.
If we run the fat jar (build with Shade plugin) without the -cluster option, we are able to insert 10 million records in about a minute. When we add the -cluster option (eventually we will run the Vert.x application in cluster) it takes about 5 minutes for 10 million records to insert.
Does anyone know why?
We know that the Hazelcast config will create some overhead, but never thought it would be 5 times slower. This implies we will need 5 EC2 instances in cluster to get the same result when using 1 EC2 without the cluster option.
As mentioned, everything runs on EC2 instances:
2 Cassandra servers on t2.small
1 Vert.x server on t2.2xlarge
You are actually running into corner cases of the Vert.x Hazelcast Cluster manager.
First of all you are using a worker Verticle to send your messages (30000001). Under the hood Hazelcast is blocking and thus when you send a message from a worker the version 3.3.3 does not take that in account. Recently we added this fix https://github.com/vert-x3/issues/issues/75 (not present in 3.4.0.Beta1 but present in 3.4.0-SNAPSHOTS) that will improve this case.
Second when you send all your messages at the same time, it runs into another corner case that prevents the Hazelcast cluster manager to use a cache of the cluster topology. This topology cache is usually updated after the first message has been sent and sending all the messages in one shot prevents the usage of the ache (short explanation HazelcastAsyncMultiMap#getInProgressCount will be > 0 and prevents the cache to be used), hence paying the penalty of an expensive lookup (hence the cache).
If I use Bertjan's reproducer with 3.4.0-SNAPSHOT + Hazelcast and the following change: send message to destination, wait for reply. Upon reply send all messages then I get a lot of improvements.
Without clustering : 5852 ms
With clustering with HZ 3.3.3 :16745 ms
With clustering with HZ 3.4.0-SNAPSHOT + initial message : 8609 ms
I believe also you should not use a worker verticle to send that many messages and instead send them using an event loop verticle via batches. Perhaps you should explain your use case and we can think about the best way to solve it.
When you're you enable clustering (of any kind) to an application you are making your application more resilient to failures but you're also adding a performance penalty.
For example your current flow (without clustering) is something like:
client ->
vert.x app ->
in memory same process eventbus (negletible) ->
handler -> cassandra
<- vert.x app
<- client
Once you enable clustering:
client ->
vert.x app ->
serialize request ->
network request cluster member ->
deserialize request ->
handler -> cassandra
<- serialize response
<- network reply
<- deserialize response
<- vert.x app
<- client
As you can see there are many encode decode operations required plus several network calls and this all gets added to your total request time.
In order to achive best performance you need to take advantage of locality the closer you are of your data store usually the fastest.
Just to add the code of the project. I guess that would help.
Sender verticle:
public class ProviderVerticle extends AbstractVerticle {
#Override
public void start() throws Exception {
IntStream.range(1, 30000001).parallel().forEach(i -> {
vertx.eventBus().send("clustertest1", Json.encode(new TestCluster1(i, "abc", LocalDateTime.now())));
});
}
#Override
public void stop() throws Exception {
super.stop();
}
}
And the inserter verticle
public class ReceiverVerticle extends AbstractVerticle {
private int messagesReceived = 1;
private Session cassandraSession;
#Override
public void start() throws Exception {
PoolingOptions poolingOptions = new PoolingOptions()
.setCoreConnectionsPerHost(HostDistance.LOCAL, 2)
.setMaxConnectionsPerHost(HostDistance.LOCAL, 3)
.setCoreConnectionsPerHost(HostDistance.REMOTE, 1)
.setMaxConnectionsPerHost(HostDistance.REMOTE, 3)
.setMaxRequestsPerConnection(HostDistance.LOCAL, 20)
.setMaxQueueSize(32768)
.setMaxRequestsPerConnection(HostDistance.REMOTE, 20);
Cluster cluster = Cluster.builder()
.withPoolingOptions(poolingOptions)
.addContactPoints(ClusterSetup.SEEDS)
.build();
System.out.println("Connecting session");
cassandraSession = cluster.connect("kiespees");
System.out.println("Session connected:\n\tcluster [" + cassandraSession.getCluster().getClusterName() + "]");
System.out.println("Connected hosts: ");
cassandraSession.getState().getConnectedHosts().forEach(host -> System.out.println(host.getAddress()));
PreparedStatement prepared = cassandraSession.prepare(
"insert into clustertest1 (id, value, created) " +
"values (:id, :value, :created)");
PreparedStatement preparedTimer = cassandraSession.prepare(
"insert into timer (name, created_on, amount) " +
"values (:name, :createdOn, :amount)");
BoundStatement timerStart = preparedTimer.bind()
.setString("name", "clusterteststart")
.setInt("amount", 0)
.setTimestamp("createdOn", new Timestamp(new Date().getTime()));
cassandraSession.executeAsync(timerStart);
EventBus bus = vertx.eventBus();
System.out.println("Bus info: " + bus.toString());
MessageConsumer<String> cons = bus.consumer("clustertest1");
System.out.println("Consumer info: " + cons.address());
System.out.println("Waiting for messages");
cons.handler(message -> {
TestCluster1 tc = Json.decodeValue(message.body(), TestCluster1.class);
if (messagesReceived % 100000 == 0)
System.out.println("Message received: " + messagesReceived);
BoundStatement boundRecord = prepared.bind()
.setInt("id", tc.getId())
.setString("value", tc.getValue())
.setTimestamp("created", new Timestamp(new Date().getTime()));
cassandraSession.executeAsync(boundRecord);
if (messagesReceived % 100000 == 0) {
BoundStatement timerStop = preparedTimer.bind()
.setString("name", "clusterteststop")
.setInt("amount", messagesReceived)
.setTimestamp("createdOn", new Timestamp(new Date().getTime()));
cassandraSession.executeAsync(timerStop);
}
messagesReceived++;
//message.reply("OK");
});
}
#Override
public void stop() throws Exception {
super.stop();
cassandraSession.close();
}
}
public boolean connect(IConnection conn, IScope scope, Object[] params)
{
IClient client = conn.getClient();
log.info( "app connect " + conn.getClient().getId() );
client.setAttribute( "stamp", new Long( 0 ) );
return true;
}
This is the method which is being called every time Client is connected at my Custom Application in Red5 Server ,so is there a way to identify if a Client is Subscriber (Consumer ,Viewer) or Publisher (User which streams at my server).
Bests
To disallow or allow publish or subscribe a user, you can use those methods inside appStart callback method:
registerStreamPlaybackSecurity
registerStreamPublishSecurity
For more, look to the:
http://dl.fancycode.com/red5/api/org/red5/server/adapter/MultiThreadedApplicationAdapter.html
I'm using jRuby, and it's very easy to do so:
registerStreamPlaybackSecurity do |scope, name, start, len, flush|
false # no playback allowed
end
registerStreamPublishSecurity do |scope, name, mode|
rand(1) % 1 == 0 # publishing (recording) sometimes allowed, sometimes no
end