I'm quite new to Kafka, and as one of my first projects I'm trying to create a Kafka producer in Java which will read events from Wikipedia/Wikimedia and post them to the relevant topics.
I'm looking at https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams and https://stream.wikimedia.org/v2/ui/#/ as references for the Wikimedia API.
I followed the basic guides for creating Kafka producers in Java, but they mainly rely on events created locally on my machine.
When looking at solutions which read events from a remote server, I see they use libraries that are not Kafka-native (e.g. Spring).
Is there a way to set up my producer with the native Kafka libraries that come as part of the Kafka installation package?
Spring just wraps the native Kafka libraries to ease development and configuration. It is not required, so yes, you can do essentially the same thing it does, but with less overhead.
"mainly rely on events created locally on my machine"
That is because it is easier to demo, and it is an implementation detail: if you pull from a remote server, that data becomes a "local" in-memory data structure at some point anyway.
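As an illustration, here is a minimal sketch that reads the Wikimedia recent-change SSE stream with plain JDK I/O and forwards each event using only the plain Kafka client. The topic name and bootstrap server are placeholders, and a real implementation would use a proper SSE/HTTP client and handle reconnects and errors.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class WikimediaProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             BufferedReader reader = new BufferedReader(new InputStreamReader(
                     new URL("https://stream.wikimedia.org/v2/stream/recentchange").openStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // SSE frames contain "event:"/"id:" lines; the JSON payload is on "data:" lines
                if (line.startsWith("data:")) {
                    String json = line.substring("data:".length()).trim();
                    producer.send(new ProducerRecord<>("wikimedia.recentchange", json));
                }
            }
        }
    }
}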
Related
I have a use case where I want to move messages from SQS to a Kafka topic, using Spring Boot. Whenever I run my code, it should start moving the messages. I searched for articles but found very few, so I am looking for some boilerplate code that follows best practices to start with, and for guidance on how to proceed.
Thanks in advance.
You need to make yourself familiar with Enterprise Integration Patterns and its Spring Integration implementation.
To take messages from AWS SQS you would need an SqsMessageDrivenChannelAdapter from the Spring Integration AWS extension. To post records to an Apache Kafka topic you need a KafkaProducerMessageHandler from the spring-integration-kafka module.
Then you wire everything together via an IntegrationFlow bean in your Spring Boot configuration.
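A minimal sketch of that wiring could look like the following. The queue and topic names are placeholders, and the exact adapter constructors depend on your versions; this assumes the Spring Integration 5.x Java DSL and the AmazonSQSAsync-based Spring Integration AWS 2.x line.

import com.amazonaws.services.sqs.AmazonSQSAsync;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.aws.inbound.SqsMessageDrivenChannelAdapter;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;
import org.springframework.integration.kafka.dsl.Kafka;
import org.springframework.kafka.core.KafkaTemplate;

@Configuration
public class SqsToKafkaFlowConfig {

    @Bean
    public IntegrationFlow sqsToKafkaFlow(AmazonSQSAsync amazonSqs,
                                          KafkaTemplate<String, String> kafkaTemplate) {
        // Listen to the SQS queue and push each message payload to the Kafka topic.
        SqsMessageDrivenChannelAdapter sqsAdapter =
                new SqsMessageDrivenChannelAdapter(amazonSqs, "my-source-queue");

        return IntegrationFlows.from(sqsAdapter)
                .handle(Kafka.outboundChannelAdapter(kafkaTemplate)
                        .topic("my-target-topic"))
                .get();
    }
}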
Of course you can use Spring Cloud for AWS and Spring for Apache Kafka directly.
The choice is yours, but it is better to follow best practices and develop a real integration solution.
Apache Kafka offers multiple ways to ingest data from different sources, e.g. Kafka Connect, a custom Kafka producer, etc., and you need to be careful when selecting a specific component, keeping things such as retry behaviour and scalability in mind.
The best solution in this case would be to use the Amazon SQS Source Connector to ingest data from AWS SQS into a Kafka topic, and then write your consumer application to do whatever is necessary with the stream of records on that topic.
We are in the process of integrating .NET applications, which are deployed on VMs in on-premises data centers, with a Pub/Sub topic in Google Cloud Platform. I have a scenario which I am currently not able to decide on and need help and the right direction. Below is a brief description of the use case. Please have a look and share your thoughts.
Currently there is a .NET application deployed on a Windows VM in a legacy on-prem client data center. It publishes XML messages to a Tibco EMS topic on an EMS server deployed in the same data center. A few Java applications deployed on different VMs subscribe to this Tibco topic, pull the messages and process them. This is the legacy flow.
As part of modernization, GCP is coming into the mix. The XML messages that the on-prem .NET application publishes to the Tibco topic should now also be pushed to a Pub/Sub topic on GCP. A Java microservice deployed on GCP infrastructure would subscribe to this topic and consume the messages.
The problem I am facing is how to go about this integration between the on-prem and cloud applications. I thought about a couple of options.
Copy the messages directly from the legacy Tibco topic, to which the .NET app publishes, to the Pub/Sub topic in GCP. I am not a Tibco expert and am not sure if this is supported. I found the link below but am not sure if it suits my use case. Also, the client wants to move away from Tibco, and I am not sure whether the legacy Tibco EMS in the data centers supports the connector feature below.
https://www.tibco.com/connected/google-cloud-pub/sub
Change the .NET code base so that, at the point where it publishes a message to the Tibco topic, additional code also publishes it directly to the Pub/Sub topic in GCP. I am not sure if this is OK, as the .NET application is on a legacy on-prem VM and Pub/Sub is in the cloud. I am not familiar with .NET either, but found that there are .NET Google client libraries which can be added to the .NET code to achieve this flow. Also, is Google Pub/Sub the right integration tool to use here, or should something else be used to connect these two systems together?
This is as far as I could get. Could you let me know whether the above two approaches are right, or whether there is an issue, and which one is the right approach? If there is any other solution apart from the above, it would really help me move forward. Hoping for a positive reply and help from you all.
Thanks, Vikeng21
For the first scenario, the mentioned connector is in fact a TIBCO BusinessWorks plugin. So the approach would be to build a kind of GCP Pub/Sub / TIBCO EMS gateway using TIBCO BusinessWorks. It would then be possible to run this solution on premises or in the cloud (using the TIBCO TCI offering).
The advantage of this approach is that it would be transparent for your applications: the local .NET applications and the cloud applications would receive exactly the same messages.
I think EmmanuelM's answer covers the first scenario; it would probably be the easiest and most transparent approach.
Regarding scenario #2: I think this is a valid approach as well, although it requires modifying your application code to publish messages to Pub/Sub alongside Tibco. I'm no Tibco expert either, but when it comes to Pub/Sub, as you've mentioned, it offers a .NET client library which you can use in your application to easily publish and consume Pub/Sub messages. I see that you've mentioned:
"Not sure if this is ok as .Net application is on legacy on-prem VM and the Pub/Sub is in the Cloud"
It is completely OK: the Cloud Pub/Sub service is used through its API, and you can call that API regardless of whether you do so on-prem or in the cloud.
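For the GCP-side Java microservice, a subscriber built with the google-cloud-pubsub client could look roughly like the sketch below (the project and subscription names are placeholders); the on-prem .NET application would publish through the equivalent .NET client in the same way.

import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;

public class XmlMessageSubscriber {
    public static void main(String[] args) {
        // Placeholder project and subscription names.
        ProjectSubscriptionName subscription =
                ProjectSubscriptionName.of("my-gcp-project", "xml-messages-sub");

        // Called for every message delivered by Pub/Sub.
        MessageReceiver receiver = (PubsubMessage message, AckReplyConsumer consumer) -> {
            String xml = message.getData().toStringUtf8();
            System.out.println("Received XML message: " + xml);
            consumer.ack(); // acknowledge so the message is not redelivered
        };

        Subscriber subscriber = Subscriber.newBuilder(subscription, receiver).build();
        subscriber.startAsync().awaitRunning();
        subscriber.awaitTerminated(); // block for this demo; a real service manages the lifecycle
    }
}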
The thing about this approach is that I'm not sure how consistency would be kept between Tibco and Pub/Sub. I assume this is why the first approach would be easier and more transparent, as this is probably what the integration plugin takes care of. Without it, some custom application logic would probably be required to guarantee that messages are successfully published both to Tibco and to Pub/Sub.
Having said that, I would really recommend that you get in contact with Google Cloud Sales to describe your use case and business requirements in detail and get personalized assistance with your migration plan.
My goal is to develop a repository that provides a Flux of live time series data starting from a certain time in the past. The repository should provide an API as follows:
public interface TimeSeriesRepository {
    // returns a Flux with incoming live data, without considering past data
    public Flux<TimeSeriesData> getLiveData();

    // returns a Flux with incoming live data starting at startTime
    public Flux<TimeSeriesData> getLiveData(Instant startTime);
}
The assumptions and constraints are:
the application is using Java 11, Spring Boot 2/Spring 5
the data is stored in a relational database such as PostgreSQL and is timestamped.
the data is regularly updated with new data from an external actor
a RabbitMQ broker is available and could be used (if appropriate)
should not include components that require a ZooKeeper cluster or similar, e.g. event logs such as Apache Kafka or Apache Pulsar, or stream processing engines such as Apache Storm or Apache Flink, because this is not a large-scale cloud application and should run on a regular PC (e.g. with 8 GB RAM)
My first idea was to use Debezium to forward incoming data to RabbitMQ and to use Reactor RabbitMQ to create a Flux. This was actually my initial plan before I understood that the second repository method, which considers historical data, is also required; this solution alone would not provide historical data.
I then considered using an event log such as Kafka so I could replay data from the past, but found the operational overhead too high. So I dismissed this idea and did not even bother to work out the details of how it could have worked or its potential drawbacks.
Now I have considered using Spring Data R2DBC, but I could not figure out what a query that fulfills my goal should look like.
How could I implement the Interface using any of the mentioned tools or maybe even with plain Java/Spring Data?
I will accept any answer that seems like a feasible approach.
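One feasible direction, sticking to the tools already mentioned (Reactor RabbitMQ for the live feed and a Spring Data R2DBC repository for the history), could be a sketch along these lines. The queue name, the derived query method and parseTimeSeriesData are placeholders for your own schema and mapping, and the simple concat leaves a small gap/overlap around the switch from history to live data that would need deduplication.

import java.time.Instant;
import org.springframework.data.repository.reactive.ReactiveCrudRepository;
import reactor.core.publisher.Flux;
import reactor.rabbitmq.Receiver;

// Hypothetical Spring Data R2DBC repository with a derived query on a "timestamp" column.
interface TimeSeriesDataRepository extends ReactiveCrudRepository<TimeSeriesData, Long> {
    Flux<TimeSeriesData> findByTimestampGreaterThanEqualOrderByTimestamp(Instant startTime);
}

public class RabbitAndR2dbcTimeSeriesRepository implements TimeSeriesRepository {

    private final TimeSeriesDataRepository historicalRepository;
    private final Flux<TimeSeriesData> liveFlux;

    public RabbitAndR2dbcTimeSeriesRepository(TimeSeriesDataRepository historicalRepository,
                                              Receiver rabbitReceiver) {
        this.historicalRepository = historicalRepository;
        // Consume the queue that Debezium (or any other forwarder) publishes to and share one
        // hot Flux among all subscribers.
        this.liveFlux = rabbitReceiver.consumeAutoAck("time-series-updates")
                .map(delivery -> parseTimeSeriesData(delivery.getBody()))
                .share();
    }

    @Override
    public Flux<TimeSeriesData> getLiveData() {
        return liveFlux;
    }

    @Override
    public Flux<TimeSeriesData> getLiveData(Instant startTime) {
        // Replay the database contents first, then switch to the live stream.
        // Events arriving between the end of the query and the live subscription can be
        // missed or duplicated; deduplication by timestamp/id would be needed in practice.
        return Flux.concat(
                historicalRepository.findByTimestampGreaterThanEqualOrderByTimestamp(startTime),
                liveFlux);
    }

    private TimeSeriesData parseTimeSeriesData(byte[] body) {
        // Placeholder: deserialize the message payload into your domain type.
        throw new UnsupportedOperationException("map bytes to TimeSeriesData here");
    }
}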
I am working with Akka and Spring.
I have an actor system that operates on a Kafka stream setup (using akka-stream-kafka_2.12); the actors hold some data in memory and persist their state using akka-persistence.
What I wanted to know is whether I can create a REST endpoint that can interact with my actor system to provide some data or send messages to my actors.
My question is: how can this be achieved?
As said in the comments, I have created a sample working application on GitHub to demonstrate the usage of Spring with Akka.
Please note that:
I have used Spring Boot for quick setup and configuration.
You can't expect any kind of good/best practices in this demo project, as I had to create it in 30 minutes. It just explains one (simple) way to use Akka within Spring.
This sample cannot be used in a microservice architecture because there is no remoting or clustering involved here; API controllers talk directly to actors.
In the controllers, I used GetMapping everywhere instead of PostMapping, for simplicity.
I will update the repository with another sample explaining the usage with clustering, where the way of communication between the API controllers and the actor system changes.
Here is the link to the repo. Hope this will get you started.
You can either build the application yourself or run the api-akka-integration-0.0.1-SNAPSHOT.jar file from the command prompt. It runs on the default port 8080.
This sample includes two kinds of APIs, /Calc/{Operation}/{operand1}/{operand2} and /Chat/{message}
/chat/hello
/calc/add/1/2
/calc/mul/1/2
/calc/div/1/2
/calc/sub/1/2
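To illustrate the general approach (this is only a sketch, not the repo's actual code): a controller along the lines of the /calc endpoints above can forward requests to an actor with the ask pattern and return the reply asynchronously. The CalcRequest message type and the calcActor bean wiring are made up here.

import java.time.Duration;
import java.util.concurrent.CompletionStage;

import akka.actor.ActorRef;
import akka.pattern.Patterns;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class CalcController {

    // Assumed to be exposed as a Spring bean pointing at the calculator actor.
    private final ActorRef calcActor;

    public CalcController(ActorRef calcActor) {
        this.calcActor = calcActor;
    }

    @GetMapping("/calc/{operation}/{operand1}/{operand2}")
    public CompletionStage<Object> calc(@PathVariable String operation,
                                        @PathVariable double operand1,
                                        @PathVariable double operand2) {
        // Ask pattern: send a message and get the actor's reply as a CompletionStage,
        // which Spring can return asynchronously. The Duration-based ask variant exists
        // in recent Akka 2.5.x/2.6; older versions use a Timeout instead.
        return Patterns.ask(calcActor, new CalcRequest(operation, operand1, operand2),
                Duration.ofSeconds(3));
    }

    // Simple immutable message type for the actor; purely illustrative.
    public static class CalcRequest {
        public final String operation;
        public final double operand1;
        public final double operand2;

        public CalcRequest(String operation, double operand1, double operand2) {
            this.operation = operation;
            this.operand1 = operand1;
            this.operand2 = operand2;
        }
    }
}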
Edit 2:
Updated the repo with a sample of Akka Cluster usage in the API:
API-Akka-Cluster
I want to create a broker cluster using the Kafka 0.10 API, preferably in Java. As far as I have read, kafka_2.11-0.10.0.0.jar does support creating a broker using:
import kafka.cluster.Broker;
import kafka.cluster.Cluster;
But I can't find any documentation for doing so. I recently read [1], which explains how to create a topic using the Kafka API in Java. Can we do similar things to create a broker cluster, update partitions, and migrate existing data/partitions to new brokers (as these new brokers will not automatically be assigned any data partitions, so unless partitions are moved to them they won't do any work [2])?
[1] How Can we create a topic in Kafka from the IDE using API
[2] https://kafka.apache.org/0100/ops.html#basic_ops_cluster_expansion
I have some sample code which you may find useful.
For creating a broker, take a look at KafkaTestServer. It is really meant for simpler testing, so it does not create a cluster, just a single broker, but it should not be difficult to extend.
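For reference, starting a 0.10 broker programmatically boils down to something like the sketch below (this is not the KafkaTestServer code itself; the broker id, listener and log directory are placeholders). Starting several such processes with unique broker.id, listeners and log.dirs values against the same ZooKeeper ensemble is what forms a cluster.

import java.util.Properties;
import kafka.server.KafkaConfig;
import kafka.server.KafkaServerStartable;

public class EmbeddedBroker {
    public static void main(String[] args) {
        // Assumes a ZooKeeper instance is already running on localhost:2181,
        // which Kafka 0.10 brokers require.
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181");
        props.put("broker.id", "1");
        props.put("listeners", "PLAINTEXT://localhost:9092");
        props.put("log.dirs", "/tmp/kafka-broker-1");

        KafkaServerStartable broker = new KafkaServerStartable(new KafkaConfig(props));
        broker.startup();

        // Shut the broker down cleanly when the JVM exits.
        Runtime.getRuntime().addShutdownHook(new Thread(broker::shutdown));
    }
}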
Once I had built the ability to create/query/delete topics into the test server, I created a standalone admin client for doing the same against other servers, so if you are already creating a broker cluster, you should be able to use that code to maintain topics on it. Take a look at KafkaAdminClient.
The admin client is basically a pure Java wrapper around the Scala kafka.admin.AdminUtils class, so it handles all the Scala <--> Java conversions under the covers.