I have a use case where I read newline-delimited JSON elements stored in Google Cloud Storage and process each JSON record. While processing each record, I have to call an external API for de-duplication, i.e. to check whether that JSON element was seen previously. I'm applying a ParDo with a DoFn to each record.
I haven't seen any online tutorial that shows how to call an external API endpoint from an Apache Beam DoFn on Dataflow.
I'm using the Beam Java SDK. Some of the tutorials I studied mention using startBundle and finishBundle, but I'm not clear on how to use them.
If you need to check for duplicates in external storage for every JSON record, then you can still use a DoFn for that. There are several annotations, like @Setup, @StartBundle, @FinishBundle, etc., that can be used to annotate methods in your DoFn.
For example, if you need to instantiate a client object to send requests to your external database, you might want to do that in the @Setup method (similar to a POJO constructor) and then use this client object in your @ProcessElement method.
Let's consider a simple example:
static class MyDoFn extends DoFn<Record, Record> {

  // Transient so the client is not serialized with the DoFn; it is created in @Setup on each worker.
  private transient MyClient client;

  @Setup
  public void setup() {
    client = new MyClient("host");
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    // Process your records
    Record r = c.element();
    // Check the record ID against the external store for duplicates
    if (!client.isRecordExist(r.id())) {
      c.output(r);
    }
  }

  @Teardown
  public void teardown() {
    if (client != null) {
      client.close();
      client = null;
    }
  }
}
Also, to avoid making a remote call for every record, you can buffer the records of a bundle internally (Beam splits input data into bundles) and check for duplicates in batch mode (if your client supports this). For this purpose, you can use methods annotated with @StartBundle and @FinishBundle, which are called right before and right after a Beam bundle is processed.
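To make the batching idea concrete, here is a minimal sketch rather than a definitive implementation: it reuses the Record type from above, assumes a hypothetical MyClient with a bulk findExistingIds(Set<String>) lookup, and assumes a globally windowed collection so the buffered elements can all be emitted into the global window in finishBundle.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.GlobalWindow;
import org.joda.time.Instant;

// Sketch only: batches the de-duplication lookups once per bundle.
// MyClient and its bulk findExistingIds(Set<String>) method are hypothetical placeholders.
static class BatchedDedupFn extends DoFn<Record, Record> {

  private transient MyClient client;
  private transient List<Record> buffer;

  @Setup
  public void setup() {
    client = new MyClient("host");
  }

  @StartBundle
  public void startBundle() {
    buffer = new ArrayList<>();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    // Only buffer here; the remote call happens once per bundle in finishBundle()
    buffer.add(c.element());
  }

  @FinishBundle
  public void finishBundle(FinishBundleContext c) {
    // One bulk lookup for the whole bundle instead of one call per record
    Set<String> existingIds =
        client.findExistingIds(buffer.stream().map(Record::id).collect(Collectors.toSet()));
    for (Record r : buffer) {
      if (!existingIds.contains(r.id())) {
        c.output(r, Instant.now(), GlobalWindow.INSTANCE);
      }
    }
    buffer.clear();
  }

  @Teardown
  public void teardown() {
    if (client != null) {
      client.close();
    }
  }
}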
For more complicated examples, I'd recommend taking a look at the sink implementations in the various Beam IOs, like KinesisIO, for instance.
There is an example of calling an external system in batches using a stateful DoFn in the following blog post, which might be helpful: https://beam.apache.org/blog/2017/08/28/timely-processing.html
How can I handle multiple versions of UserDetailDto while processing records from Topic-A to Topic-B with Kafka Streams using the Processor API?
Existing instances/replicas of the aggregation service should not be impacted, and a Kubernetes rolling upgrade should not break anything either (i.e. old replicas of the aggregation service must still be able to handle the modified/new version of UserDetailDto).
For example, suppose we change the userID data type from Integer to String and remove the userPhone field from the UserDetailDto below:
class UserDetailDto {

    @JsonProperty("userID")
    @NotNull(message = "UserId can not be null")
    private int userID;

    @JsonProperty("userPhone")
    @NotNull(message = "User Phone number can not be null")
    private int userPhone;
}
Now, after updating UserDetailDto, old replicas/instances of the aggregation service should be able to handle both the new and the old UserDetailDto, and new replicas/instances should likewise be able to handle both the new and the old UserDetailDto.
My Processor is given below, using a custom Serde for UserDetailDto:
public class AggregationProcessor implements Processor<String, UserDetailDto, String, UserDetailDto> {

    private ProcessorContext<String, UserDetailDto> processorContext;

    public AggregationProcessor() {
        super();
    }

    @Override
    public void init(ProcessorContext<String, UserDetailDto> processorContext) {
        System.out.println("Inside AggregationProcessor init method.");
        Objects.requireNonNull(processorContext, "Processor context should not be null or empty.");
        this.processorContext = processorContext;
    }

    @Override
    public void process(Record<String, UserDetailDto> message) {
        System.out.println("Inside AggregationProcessor process method.");
        Objects.requireNonNull(processorContext, "Processor context should not be null or empty.");
        // Forward the message as-is without any modification
        processorContext.forward(message);
    }

    @Override
    public void close() {
        System.out.println("Inside AggregationProcessor close method.");
    }
}
The topology is given below:
Topology topology = new Topology();

// Adding the source node of the application
topology = topology.addSource(Topology.AutoOffsetReset.EARLIEST,
        sourceName,
        new UsePartitionTimeOnInvalidTimestamp(),
        KEY_SERDE.deserializer(),
        USER_DETAIL_DTO.deserializer(),
        sourceTopic);

// Adding the processor node of the application
topology = topology.addProcessor(
        processorName,
        AggregationProcessor::new,
        parentNames);

// Adding the sink node of the application
topology = topology.addSink(sinkName,
        destinationTopic,
        KEY_SERDE.serializer(),
        USER_DETAIL_DTO.serializer(),
        parentNames);
Please provide all possible suggestions. Thanks!
In Kafka Streams applications that use the Processor API, data feed/Serde validation can be done in the consumer application inside the init() or process() methods of your kafka.streams.processor.api.Processor implementation; this is a standard approach for rolling-upgrade scenarios.
Data feed validation can be handled in the Processor API as described below, and this support should be kept for two consecutive versions so that the rollback scenario is also covered. A rough sketch of such a validating processor follows the scenarios.
Old producer to new consumer
In the new consumer, mark the old field as deprecated by removing its validation from the Processor API while keeping support for it, and add the new field with validation. In this case the data feed cannot be processed by the new consumer and will only be consumed by the old consumer, since it still persists in the source topic; this way, in-flight records are processed only by the old consumer during the rolling upgrade.
New producer to old consumer
In the old consumer, the newly added field in the stream data feed will be ignored, and a dummy value for the deprecated field can be validated in the Processor API.
New producer to new consumer, and old producer to old consumer
No impact.
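For illustration only, here is a rough sketch of that kind of validation inside process(). It assumes the evolved UserDetailDto where userID has become a String (getUserID() is a hypothetical accessor) and it chooses to drop invalid records rather than fail the stream task; both choices are assumptions, not part of the question.

import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;

// Sketch of a validating processor for the rolling-upgrade case.
public class ValidatingAggregationProcessor implements Processor<String, UserDetailDto, String, UserDetailDto> {

    private ProcessorContext<String, UserDetailDto> processorContext;

    @Override
    public void init(ProcessorContext<String, UserDetailDto> processorContext) {
        this.processorContext = processorContext;
    }

    @Override
    public void process(Record<String, UserDetailDto> message) {
        UserDetailDto dto = message.value();

        // Old contract: userPhone may be missing (deprecated) -- no validation on it here.
        // Shared contract: userID must be present for both old and new payloads.
        if (dto == null || dto.getUserID() == null || dto.getUserID().isEmpty()) {
            System.out.println("Dropping invalid record, key=" + message.key());
            return;
        }

        // Forward records that pass validation unchanged
        processorContext.forward(message);
    }

    @Override
    public void close() {
        // Nothing to release in this sketch
    }
}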
Don't change the types of existing fields; otherwise, parsing old data will fail if its type cannot be coerced by the JSON parser.
Mark such fields as deprecated and instead add completely new, nullable fields. You can use @JsonAlias to work around duplicate field names. You could also add a generic int version (or String type) field, run a switch on it, and delegate to intermediate deserializers.
Removal of fields can be handled by configuring Jackson not to fail on unknown properties.
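A minimal sketch of both ideas, assuming Jackson: the new nullable userId String field, the @JsonAlias spelling, and the shared ObjectMapper are illustrative assumptions, not part of the original DTO.

import com.fasterxml.jackson.annotation.JsonAlias;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

class UserDetailDto {

    /** Old numeric id, kept for backward compatibility; no longer validated. */
    @Deprecated
    @JsonProperty("userID")
    private Integer userID;

    /** New String id; nullable so old payloads (without it) still deserialize. */
    @JsonProperty("userId")
    @JsonAlias({"user_id"}) // example alias, in case producers use a different spelling
    private String userId;

    // userPhone was removed from the class entirely; the unknown-property setting below
    // lets old payloads that still contain it be parsed without errors.
}

class UserDetailDtoMapper {
    static final ObjectMapper MAPPER = new ObjectMapper()
            // Ignore fields that old producers still send (e.g. userPhone)
            .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
}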
An alternative solution would be to use a serialization format with built-in evolution guarantees, such as Avro.
Is it possible to use interactive queries (InteractiveQueryService) within Spring Cloud Stream, i.e. in the class annotated with @EnableBinding or within a @StreamListener method? I tried instantiating ReadOnlyKeyValueStore within the provided KStreamMusicSampleApplication class and its process method, but it's always null.
My @StreamListener method listens to a bunch of KTables and KStreams, and during the processing topology (e.g. filtering) I have to check whether a key from a KStream already exists in a particular KTable.
I tried to figure out how to scan an incoming KTable to check whether a key already exists, but had no luck. Then I came across InteractiveQueryService, whose get() method can be used to check whether a key exists inside a state store materialized from a KTable. The problem is that I can't access it from within the processing topology (@EnableBinding or @StreamListener); it can only be accessed outside those annotations, e.g. in a RestController.
Is there a way to scan an incoming KTable to check for the existence of a key or value? If not, can we access InteractiveQueryService within the processing topology?
InteractiveQueryService in Spring Cloud Stream is not available for use within the actual topology in your StreamListener. As you mentioned, it is meant to be used outside of your main topology. However, for the use case you described, you can still use the state store from your main flow. For example, if you have an incoming KStream and a KTable that is materialized as a state store, then you can call process on the KStream and access the state store that way. Here is rough code to achieve that; you will need to adapt it to your specific use case, but it illustrates the idea.
ReadOnlyKeyValueStore<Object, String> store;

input.process(() -> new Processor<Object, Product>() {

    @Override
    public void init(ProcessorContext processorContext) {
        store = (ReadOnlyKeyValueStore<Object, String>) processorContext.getStateStore("my-store");
    }

    @Override
    public void process(Object key, Product value) {
        // Look up the key in the KTable's state store
        store.get(key);
    }

    @Override
    public void close() {
        // The store's lifecycle is managed by Kafka Streams; nothing to close here.
    }
}, "my-store");
I am using Anypoint Studio.
I have used the Esper CEP engine for event detection in a Java file. Once an event is detected, I print the result to the console from the Java file with System.out.println(Object).
I want that object to be sent from the Java code to the Mule flow, either as a message property or as the payload, so that I can store it in MongoDB or reuse it for another event detection.
Here is my flow: [Mule flow diagram]
This is where I want the event.getUnderlying() object to be sent to the Mule flow.
public void update(EventBean[] newData, EventBean[] oldData) {
    EventBean event = newData[0];
    obj = event.getUnderlying();
    if (a2 == 0) {
        i++;
        System.out.println("Event received: " + i + " " + event.getUnderlying());
    }
}
Thanks in Advance :)
Just "post" to the input connector of the flow you want to send to. So for an HTTP input, use something like org.apache.http.client.HttpClient or HttpURLConnection.
(There are many examples of how to use those on this site and elsewhere.)
Other inputs have different libraries you can use; for example, you could simply save the object as a file and have a File inbound endpoint pick it up (it depends on where you are deploying). A rough sketch of the HTTP approach is shown below.
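For illustration, here is a minimal sketch that posts the Esper event as JSON to a Mule HTTP inbound endpoint using HttpURLConnection; the URL http://localhost:8081/esper-events is an assumed placeholder for whatever your flow's HTTP listener is configured with.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class MuleFlowNotifier {

    // Placeholder URL: replace with the host/port/path of your flow's HTTP inbound endpoint.
    private static final String FLOW_URL = "http://localhost:8081/esper-events";

    public static void sendToFlow(String jsonPayload) throws Exception {
        HttpURLConnection connection = (HttpURLConnection) new URL(FLOW_URL).openConnection();
        connection.setRequestMethod("POST");
        connection.setRequestProperty("Content-Type", "application/json");
        connection.setDoOutput(true);

        try (OutputStream out = connection.getOutputStream()) {
            out.write(jsonPayload.getBytes(StandardCharsets.UTF_8));
        }

        // Reading the response code forces the request to complete.
        int status = connection.getResponseCode();
        System.out.println("Mule flow responded with HTTP " + status);
        connection.disconnect();
    }
}

From the Esper update() callback you could then call something like MuleFlowNotifier.sendToFlow(...), passing event.getUnderlying() serialized to JSON.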
If you are calling the Java class via a Component (as you mentioned in your comment), your Java class esper.Test_main must implement the Callable interface. More details on using the Java component correctly: https://docs.mulesoft.com/mule-user-guide/v/3.8/java-component-reference
In that case, you need to implement the method below:
public Object onCall(MuleEventContext eventContext) {
    // your code here
    return someObject; // return event.getUnderlying() in your case
}
The object returned from the onCall() method is passed on as the payload to the next message processor in the Mule flow.
If you need to set a flow variable from the Java class:
public Object onCall(MuleEventContext eventContext) {
    // your code here
    eventContext.getMessage().setInvocationProperty("variableName", "variableValue");
    return someObject; // return event.getUnderlying() in your case
}
Now you will have a flow variable called variableName available in your Mule flow.
HTH.
I am developing an architecture in Java using Tomcat, and I have come across a situation that I believe is very generic and yet, after reading several questions/answers on Stack Overflow, I couldn't find a definitive answer. My architecture has a REST API (running on Tomcat) that receives one or more files and their associated metadata and writes them to storage. The configuration of the storage layer has a 1-1 relationship with the REST API server, so the intuitive approach is to write a Singleton to hold that configuration.
Obviously I am aware that Singletons bring testability problems due to global state and the hardship of mocking Singletons. I also thought of using the Context pattern, but I am not convinced that the Context pattern applies in this case and I worry that I will end up coding using the "Context anti-pattern" instead.
Let me give you some more background on what I am writing. The architecture is comprised of the following components:
Clients that send requests to the REST API uploading or retrieving "preservation objects", or simply put, POs (files + metadata) in JSON or XML format.
The high level REST API that receives requests from clients and stores data in a storage layer.
A storage layer that may contain a combination of OpenStack Swift containers, tape libraries and file systems. Each of these "storage containers" (I'm calling file systems containers for simplicity) is called an endpoint in my architecture. The storage layer obviously does not reside on the same server where the REST API is.
The configuration of endpoints is done through the REST API (e.g. POST /configEndpoint), so that an administrative user can register new endpoints, edit or remove existing endpoints through HTTP calls. Whilst I have only implemented the architecture using an OpenStack Swift endpoint, I anticipate that the information for each endpoint contains at least an IP address, some form of authentication information and a driver name, e.g. "the Swift driver", "the LTFS driver", etc. (so that when new storage technologies arrive they can be easily integrated to my architecture as long as someone writes a driver for it).
My problem is: how do I store and load configuration in a testable, reusable and elegant way? I won't even consider passing a configuration object to all of the various methods that implement the REST API calls.
A few examples of the REST API calls and where the configuration comes into play:
// Retrieve a preservation object's (PO) metadata
@GET
@Path("container/{containername}/{po}")
@Produces({ MediaType.APPLICATION_JSON, MediaType.APPLICATION_XML })
public PreservationObjectInformation getPOMetadata(@PathParam("containername") String containerName, @PathParam("po") String poUUID) {

    // STEP 1 - LOAD THE CONFIGURATION
    // One of the following options:
    // StorageContext.loadContext(containerName);
    // Configuration.getInstance(containerName);
    // Pass a configuration object as an argument of the getPOMetadata() method?
    // Some sort of dependency injection

    // STEP 2 - RETRIEVE THE METADATA FROM THE STORAGE
    // Call the driver depending on the endpoint (JClouds if Swift, Java IO stream if file system, etc.)
    // Pass poUUID as a parameter

    // STEP 3 - CONVERT JSON/XML TO OBJECT
    // Unmarshall the file in JSON format
    PreservationObjectInformation poi = unmarshall(data);

    return poi;
}

// Delete a PO
@DELETE
@Path("container/{containername}/{po}")
public Response deletePO(@PathParam("containername") String containerName, @PathParam("po") String poName) throws IOException, URISyntaxException {

    // STEP 1 - LOAD THE CONFIGURATION
    // One of the following options:
    // StorageContext.loadContext(containerName); // Context
    // Configuration.getInstance(containerName); // Singleton
    // Pass a configuration object as an argument of the deletePO() method?
    // Some sort of dependency injection

    // STEP 2 - CONNECT TO THE STORAGE ENDPOINT
    // Call the driver depending on the endpoint (JClouds if Swift, Java IO stream if file system, etc.)

    // STEP 3 - DELETE THE FILE

    return Response.ok().build();
}

// Submit a PO and its metadata
@POST
@Consumes(MediaType.MULTIPART_FORM_DATA)
@Path("container/{containername}/{po}")
public Response submitPO(@PathParam("containername") String container, @PathParam("po") String poName, @FormDataParam("objectName") String objectName,
        @FormDataParam("inputstream") InputStream inputStream) throws IOException, URISyntaxException {

    // STEP 1 - LOAD THE CONFIGURATION
    // One of the following options:
    // StorageContext.loadContext(containerName);
    // Configuration.getInstance(containerName);
    // Pass a configuration object as an argument of the submitPO() method?
    // Some sort of dependency injection

    // STEP 2 - WRITE THE DATA AND METADATA TO STORAGE
    // Call the driver depending on the endpoint (JClouds if Swift, Java IO stream if file system, etc.)

    return Response.created(new URI("container/" + container + "/" + poName))
            .build();
}
** UPDATE #1 - My implementation based on @mawalker's comment **
Below is my implementation based on the proposed answer. A factory creates concrete strategy objects that implement the lower-level storage actions. The context object (which is passed back and forth by the middleware) contains an object of the abstract type (in this case, an interface) StorageContainerStrategy; its concrete implementation depends on the type of storage in each particular case at runtime.
public interface StorageContainerStrategy {
    public void write();
    public void read();
    // other methods here
}

public class Context {
    public StorageContainerStrategy strategy;
    // other context information here...
}

public class StrategyFactory {
    public static StorageContainerStrategy createStorageContainerStrategy(Container c) {
        if (c.getEndpoint().isSwift())
            return new SwiftStrategy();
        else if (c.getEndpoint().isLtfs())
            return new LtfsStrategy();
        // etc.
        return null;
    }
}

public class SwiftStrategy implements StorageContainerStrategy {
    @Override
    public void write() {
        // OpenStack Swift specific code
    }

    @Override
    public void read() {
        // OpenStack Swift specific code
    }
}

public class LtfsStrategy implements StorageContainerStrategy {
    @Override
    public void write() {
        // LTFS specific code
    }

    @Override
    public void read() {
        // LTFS specific code
    }
}
Here is the paper Doug Schmidt (in full disclosure, my current PhD advisor) wrote on the Context Object pattern:
https://www.dre.vanderbilt.edu/~schmidt/PDF/Context-Object-Pattern.pdf
As dbugger stated, building a factory into your API classes that returns the appropriate 'configuration' object is a pretty clean way of doing this. But if you know the 'context' (yes, overloaded usage) of the paper being discussed, it is mainly intended for middleware, where there are multiple layers of context changes. Note that under the 'Implementation' section it recommends using the Strategy pattern for adding each layer's 'context information' to the 'context object'.
I would recommend a similar approach. Each 'storage container' would have a different strategy associated with it; each "driver" therefore gets its own strategy implementation class. That strategy would be obtained from a factory and then used as needed. (As for how to design your strategies, the best way, I'm guessing, would be to make your 'driver strategy' generic per driver type and then configure it appropriately as new resources arise / the strategy object is assigned.)
But as far as I can tell right now (unless I'm reading your question wrong), there would only be two 'layers' the 'context object' would be aware of: the 'REST server(s)' and the 'storage endpoints'. If I'm mistaken, then so be it... but with only two layers, you can just use the Strategy pattern in the same way you were thinking of the Context pattern, and avoid the issue of singletons / the Context 'anti-pattern'. (You 'could' have a context object that contains the strategy for which driver to use, and then a 'configuration' for that driver... that wouldn't be insane, and might fit well with your dynamic HTTP configuration.)
The strategy factory class doesn't 'have to' be a singleton or have static factory methods either. I've made factories that are plain objects before just fine, even with dependency injection for testing. There are always trade-offs between different approaches, but I've found better testability to be worth it in almost all cases I've run into.
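For illustration, here is a minimal sketch of such a non-static, injectable factory, reusing the StorageContainerStrategy, Container, SwiftStrategy and LtfsStrategy types from the update above; EndpointRegistry is a hypothetical component, not something from the question.

// Sketch only: a factory that is an ordinary object, so it can be mocked in tests
// and swapped via dependency injection. EndpointRegistry is a hypothetical component
// that holds the endpoint configuration registered through the REST API.
public class StorageStrategyFactory {

    private final EndpointRegistry endpointRegistry;

    public StorageStrategyFactory(EndpointRegistry endpointRegistry) {
        this.endpointRegistry = endpointRegistry;
    }

    public StorageContainerStrategy createStrategy(String containerName) {
        Container c = endpointRegistry.lookupContainer(containerName);
        if (c.getEndpoint().isSwift()) {
            return new SwiftStrategy();
        } else if (c.getEndpoint().isLtfs()) {
            return new LtfsStrategy();
        }
        throw new IllegalArgumentException("No driver registered for container " + containerName);
    }
}

// In a REST resource, the factory would be constructor-injected (e.g. by your DI framework)
// instead of being reached through a static getInstance() call:
//
//   PreservationObjectInformation getPOMetadata(String containerName, String poUUID) {
//       StorageContainerStrategy strategy = storageStrategyFactory.createStrategy(containerName);
//       strategy.read();
//       ...
//   }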
I am developing an Android app using GAE on Eclipse.
In one of the endpoint classes I have a method which returns a "Bla"-type object:
public Bla foo() {
    return new Bla();
}
This "Bla" object holds a "Bla2"-type object:
public class Bla {
    private Bla2 bla = new Bla2();

    public Bla2 getBla() {
        return bla;
    }

    public void setBla(Bla2 bla) {
        this.bla = bla;
    }
}
Now, my problem is that I can't access the "Bla2" class from the client side. (Even the method getBla() doesn't exist.)
I managed to trick it by creating a second method on the endpoint class which returns a "Bla2" object:
public Bla2 foo2() {
    return new Bla2();
}
Now I can use the "Bla2" class on the client side, but the "Bla.getBla()" method still doesn't exist. Is there a right way to do this?
This isn't the 'right' way, but keep in mind that just because you are using endpoints, you don't have to stick to the endpoints way of doing things for all of your entities.
Like you, I'm using GAE/J and Cloud Endpoints and have an Android client. It's great running Java on both the client and the server because I can share code between all my projects.
Some of my entities are communicated and shared the normal 'endpoints way', as you are doing. But for other entities I still use JSON: I just stick them in a string, send them through a generic endpoint, and deserialize them on the other side, which is easy because the entity class is in the shared code.
This allows me to send 50 different entity types through a single endpoint, and it makes it easy for me to customize the JSON serializing/deserializing for those entities.
Of course, this solution gets you in trouble if you decide to add an iOS or web client (unless you use GWT), but maybe that isn't important to you.
(edit - added some impl. detail)
Serializing your Java objects (or entities) to/from JSON is very easy, but the details depend on the JSON library you use. Endpoints can use either Jackson or GSON on the client. For my own JSON handling, however, I used json.org, which is built into Android and was easy to download and add to my GAE project.
Here's a tutorial that someone just published:
http://www.survivingwithandroid.com/2013/10/android-json-tutorial-create-and-parse.html
Then I added an endpoint like this:
@ApiMethod(name = "sendData")
public void sendData( @Named("clientId") String clientId, String jsonObject )
(or something with a class that includes a List of Strings so you can send multiple entities in one request.)
Then put an element into your JSON which tells the server which entity type the JSON should be deserialized into.
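Purely as an illustrative sketch (the "type"/"payload" field names and the per-entity parser are assumptions), wrapping an entity with a type tag using org.json could look like this:

import org.json.JSONException;
import org.json.JSONObject;

public class EntityEnvelope {

    // Wrap any entity's JSON with a "type" tag so the server knows how to deserialize it.
    public static String wrap(String entityType, JSONObject entityJson) throws JSONException {
        JSONObject envelope = new JSONObject();
        envelope.put("type", entityType);        // e.g. "Bla"
        envelope.put("payload", entityJson);     // the entity itself
        return envelope.toString();
    }

    // Server side: read the tag, then hand the payload to the matching deserializer.
    public static Object unwrap(String json) throws JSONException {
        JSONObject envelope = new JSONObject(json);
        String type = envelope.getString("type");
        JSONObject payload = envelope.getJSONObject("payload");
        if ("Bla".equals(type)) {
            return parseBla(payload);   // hypothetical per-entity parser
        }
        throw new IllegalArgumentException("Unknown entity type: " + type);
    }

    private static Object parseBla(JSONObject payload) {
        // ... build a Bla from the JSON fields ...
        return null;
    }
}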
Try using @ApiResourceProperty on the field.
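For example, a sketch assuming the Bla class from the question; whether this alone exposes the nested type in the generated client library may depend on your endpoints version, so treat it as a starting point:

import com.google.api.server.spi.config.ApiResourceProperty;

public class Bla {

    // Explicitly expose the nested Bla2 as an API resource property named "bla"
    @ApiResourceProperty(name = "bla")
    private Bla2 bla = new Bla2();

    public Bla2 getBla() {
        return bla;
    }

    public void setBla(Bla2 bla) {
        this.bla = bla;
    }
}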