Flink using multiple structures of data in Java

I am reading data from Kafka in Java, performing some processing in Apache Flink, and sinking the results.
I have a Kafka topic topic_a which has some data like {name: "abc", age: 20} and some data like {pin: 111, number: 999999, address: "some place"}.
When I read the data from Kafka using KafkaSource, I deserialize the records into a POJO that has the fields String name and int age, with their respective getters, setters and constructor.
When I run the Flink code, the deserializer works fine for {name: "abc", age: 20}:
KafkaSource<AllDataPOJO> kafkaAllAlertsSource = KafkaSource.<AllDataPOJO>builder()
.setBootstrapServers(bootstrapServers)
.setTopics(Arrays.asList("topic_a"))
.setProperties(properties)
.setGroupId(allEventsGroupID)
.setStartingOffsets(OffsetsInitializer.earliest())
.setValueOnlyDeserializer(new AllDataDeserializationSchema())
.build();
AllDataPOJO:
private String name;
private int age;
The code runs fine for {name: "abc", age: 20}, but as soon as {pin: 111, number: 999999, address: "some place"} arrives, it starts failing.
2 questions:
Is there any way that I can read such varying formats of messages and perform Flink operations on them? Depending on what kind of message comes in, I wish to route it to a different Kafka topic.
When I get {name: "abc", age: 20}, it should go to the topic user_basic, and {pin: 111, number: 999999, address: "some place"} should go to the topic user_details.
How can I achieve the above with just one Flink Java job?

You might be interested in specifying your Deserialization Schema as:
.setDeserializer(KafkaRecordDeserializationSchema.of(new JSONKeyValueDeserializationSchema(false)))
You would then map and filter that source, validating which fields are present:
Key fields can be accessed by calling objectNode.get("key").get(<name>).as(<type>)
Value fields can be accessed by calling objectNode.get("value").get(<name>).as(<type>)
Or cast the objects to existing POJOs inside your map.
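For illustration, a minimal sketch of that filtering (assuming a StreamExecutionEnvironment env and a KafkaSource<ObjectNode> jsonSource built with the deserializer above; the topic names come from the question):
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.node.ObjectNode; // shaded package path may differ per Flink version
import org.apache.flink.streaming.api.datastream.DataStream;

// Read the raw ObjectNode records produced by JSONKeyValueDeserializationSchema.
DataStream<ObjectNode> raw = env.fromSource(
        jsonSource, WatermarkStrategy.noWatermarks(), "topic_a");

// Route by which fields are present in the JSON value.
DataStream<ObjectNode> userBasic = raw.filter(node -> node.get("value").has("name"));
DataStream<ObjectNode> userDetails = raw.filter(node -> node.get("value").has("pin"));

// Each branch can then be mapped to its own POJO (or back to a JSON string)
// and written to its own KafkaSink: user_basic and user_details respectively.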

You cannot use <AllDataPOJO> if you have other POJO classes with other fields.
Or, you need to add all fields from all POJO types and make them nullable when they don't exist in your data. But that may be error prone: name and pin could then potentially exist in the same record, for example, when they shouldn't.
Otherwise, as the other answer says, use a more generic String/JSON deserializer, and then use filter/map operations to cast your data into more concrete types, depending on the fields that are available.

In situations like this I normally use the SimpleStringSchema, then follow the source with a ProcessFunction where I parse the string, and use side outputs (one per message type). The added benefit of this approach is that if the JSON isn't deserializable, or it doesn't properly map to any of the target types, you have the opportunity to flexibly handle the error (e.g. send it out to an error sink).
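A rough sketch of that pattern, assuming Jackson for parsing; the tag names mirror the question's target topics, and anything unparseable falls through to the main output for error handling:
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class RouteByShape extends ProcessFunction<String, String> {

    // Side-output tags; the main output carries anything that could not be parsed/classified.
    public static final OutputTag<String> USER_BASIC = new OutputTag<String>("user_basic") {};
    public static final OutputTag<String> USER_DETAILS = new OutputTag<String>("user_details") {};

    private transient ObjectMapper mapper;

    @Override
    public void open(Configuration parameters) {
        mapper = new ObjectMapper();
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) {
        try {
            JsonNode node = mapper.readTree(value);
            if (node.has("name")) {
                ctx.output(USER_BASIC, value);      // route to the user_basic sink
            } else if (node.has("pin")) {
                ctx.output(USER_DETAILS, value);    // route to the user_details sink
            } else {
                out.collect(value);                 // unknown shape -> error sink
            }
        } catch (Exception e) {
            out.collect(value);                     // unparseable JSON -> error sink
        }
    }
}
Downstream you would call stream.process(new RouteByShape()) and attach a Kafka sink to each side output via getSideOutput(RouteByShape.USER_BASIC) and getSideOutput(RouteByShape.USER_DETAILS).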

Related

Kafka Streams API GroupBy behaviour

So I've been trying to aggregate some stream data to a KTable using Kafka Streams. My JSON from the topic looks like:
{
  "id": "d04a6184-e805-4ceb-9aaf-b2ab0139ee84",
  "person": {
    "id": "d04a6184-e805-4ceb-9aaf-b2ab0139ee84",
    "createdBy": "user",
    "createdDate": "2023-01-01T00:28:58.161Z",
    "name": "person 1",
    "description": "test1"
  }
}
....
KStream<Object, String> firstStream = builder.stream("topic-1").mapValues(value -> {
    JSONObject json = new JSONObject(String.valueOf(value));
    JSONObject json2 = new JSONObject(json.getJSONObject("person").toString());
    return json2.toString();
});
I get something like
null{"createdDate":"2023-01-01T00:28:58.161Z","createdBy":"user","name":"person 1","description":"test1","id":"d04a6184-e805-4ceb-9aaf-b2ab0139ee84"}
null{"createdDate":"2023-01-01T00:29:07.862Z","createdBy":"user","name":"person 2","description":"test 2","id":"48d8b895-eb27-4977-9dbc-adb8fbf649d8"}
null{"createdDate":"2023-01-01T00:29:12.261Z","createdBy":"anonymousUser","name":"person 2","description":"test 2 updated","id":"d8b895-eb27-4977-9dbc-adb8fbf649d8"}
I want to group this data in such a way that:
person 1 will hold one JSON associated with it
person 2 will hold a List of both JSON associated with it
I have checked this Kafka Streams API GroupBy behaviour question, which describes the same problem, but the solution given there doesn't work for me. Do I have to perform any extra operations? Please help.
In order to groupBy, you need a pairing key. So, use map to extract the name of each person.
Then, as the linked answer says, you need to aggregate after grouping to "combine data per person", across events.
By the way, you should set up the Streams config with a JsonSerde for values rather than the String Serde, in order to reduce the need to manually parse each event.
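A minimal sketch of that map/group/aggregate chain, staying with the String values and org.json parsing from the question (the serdes and the array-as-String aggregation are illustrative choices, not the only way):
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.json.JSONArray;
import org.json.JSONObject;

// Re-key each event by the person's name, then collect all events per name
// into a JSON array (person 1 simply ends up with a one-element array).
KTable<String, String> perPerson = firstStream
        .map((key, value) -> KeyValue.pair(new JSONObject(value).getString("name"), value))
        .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
        .aggregate(
                () -> "[]",                            // start with an empty JSON array
                (name, newValue, aggregate) -> {
                    JSONArray arr = new JSONArray(aggregate);
                    arr.put(new JSONObject(newValue)); // append this event's JSON
                    return arr.toString();
                },
                Materialized.with(Serdes.String(), Serdes.String()));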

Substitute ints into Dataflow via Cloudbuild yaml

I've got a streaming Dataflow pipeline, written in Java with Beam 2.35. It commits data to BigQuery via the Storage Write API. Initially the code looks like:
BigQueryIO.writeTableRows()
.withTimePartitioning(/* some column */)
.withClustering(/* another column */)
.withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
.withTriggeringFrequency(Duration.standardSeconds(30))
.withNumStorageWriteApiStreams(20) // want to make this dynamic
This code runs in different environments, e.g. Dev & Prod. When I deploy in Dev I want 2 StorageWriteApi streams, in Prod I want 20, and I'm trying to pass/resolve these values at the moment I deploy with Cloud Build.
The cloudbuild-dev.yaml looks like
steps:
  - lots-of-steps
    args:
      - --numStorageWriteApiStreams=${_NUM_STORAGEWRITEAPI_STREAMS}
substitutions:
  _PROJECT: dev-project
  _NUM_STORAGEWRITEAPI_STREAMS: '2'
I expose the substitution in the job code with an interface
ValueProvider<String> getNumStorageWriteApiStreams();
void setNumStorageWriteApiStreams(ValueProvider<String> numStorageWriteApiStreams);
I then refactor the writeTableRows() call to invoke getNumStorageWriteApiStreams()
BigQueryIO.writeTableRows()
.withTimePartitioning(/* some column */)
.withClustering(/* another column */)
.withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
.withTriggeringFrequency(Duration.standardSeconds(30))
.withNumStorageWriteApiStreams(Integer.parseInt(String.valueOf(options.getNumStorageWriteApiStreams())))
Now it's dynamic but I get a build failure on account of java.lang.IllegalArgumentException: methods with same signature getNumStorageWriteApiStreams() but incompatible return types: [class java.lang.Integer, interface org.apache.beam.sdk.options.ValueProvider]
My understanding was that Integer.parseInt returns an int, which I want so I can pass it to withNumStorageWriteApiStreams() which requires an int.
I'd appreciate any help I can get here, thanks.
Turns out BigQueryOptions.java already has a method getNumStorageWriteApiStreams() that returns an Integer. I was unknowingly trying to redeclare it with a different return type, oops.
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryOptions.java#L95-L98
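In other words, dropping the custom ValueProvider option and reading the built-in one is enough. A minimal sketch, assuming the pipeline options are created from the Cloud Build args and cast to BigQueryOptions:
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.joda.time.Duration;

// --numStorageWriteApiStreams=2 (Dev) or =20 (Prod) is picked up directly
// by the existing BigQueryOptions getter; no custom option is needed.
BigQueryOptions options = PipelineOptionsFactory.fromArgs(args)   // `args` = the launcher's command-line args
        .withValidation()
        .as(BigQueryOptions.class);

BigQueryIO.writeTableRows()
        // plus the partitioning/clustering settings from the question
        .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
        .withTriggeringFrequency(Duration.standardSeconds(30))
        .withNumStorageWriteApiStreams(options.getNumStorageWriteApiStreams());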

Protobuf repeated fields to json array mapping

I'm using a Java, Spring Boot, Hibernate stack and protocol buffers as DTOs for communication among micro-services. At the reverse proxy, I convert the protobuf object to JSON using protobuf's Java support.
I have the following structure
message Item {
  int64 id = 1;
  string name = 2;
  int64 price = 3;
}

message MultipleItems {
  repeated Item items = 1;
}
Converting the MultipleItems DTO to json gives me the following result:
{
  "items": [
    {
      "id": 1,
      "name": "ABC",
      "price": 10
    },
    {
      "id": 2,
      "name": "XYZ",
      "price": 20
    }
  ]
}
In the generated JSON, I've got the key items that maps to the JSON array.
I want to remove the key and return only the JSON array as the result. Is there a clean way to achieve this?
I think it's not possible.
repeated must appear as a modifier on a field and fields must be named.
https://developers.google.com/protocol-buffers/docs/proto3#json
There's no obvious reason why Protobuf could not support this¹, but it would require that its grammar be extended to support the use of repeated at the message level rather than its current use at the field level. This, of course, makes everything downstream of the proto messages more complex too.
JSON, of course, does permit it.
It's possible that it complicates en/decoding too (an on-the-wire message could be either a message or an array of messages).
¹ Perhaps the concern is that generated code (!) would then necessarily be more complex too? Methods would all need to check whether the message is an array type or a struct type, e.g.:
func (x *X) SomeMethod(ctx context.Context, req []*pb.SomeMethodRequest) ...
And, in Golang pre-generics, it's not possible to overload methods this way and they would need to have distinct names:
func (x *X) SomeMethodArray(ctx context.Context, req []*pb.SomeMethodRequest) ...
func (x *X) SomeMethodMessage(ctx context.Context, req *pb.SomeMethodRequest) ...
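That said, since the flattening only needs to happen where the JSON is produced (the reverse proxy in the question), one possible workaround outside protobuf itself is to print each repeated element individually and join them into a bare array. A sketch, assuming protobuf-java-util's JsonFormat, which the question's conversion already implies:
import com.google.protobuf.InvalidProtocolBufferException;
import com.google.protobuf.util.JsonFormat;

final class ItemsJson {
    private ItemsJson() {}

    // Renders the repeated items as a bare JSON array, without the "items" wrapper key.
    static String toJsonArray(MultipleItems items) throws InvalidProtocolBufferException {
        JsonFormat.Printer printer = JsonFormat.printer();
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < items.getItemsCount(); i++) {
            if (i > 0) sb.append(",");
            sb.append(printer.print(items.getItems(i)));
        }
        return sb.append("]").toString();
    }
}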

Specifying keyword type on String field

I started using hibernate-search-elasticsearch (5.8.2) because it seemed easy to integrate: it keeps the Elasticsearch indices up to date without writing any code. It's a cool lib, but I'm starting to think that it has only a very small set of the Elasticsearch functionalities implemented. I'm executing a query with a painless script filter which needs to access a String field, whose type is 'text' in the index mapping, and this is not possible without enabling fielddata. But I'm not very keen on enabling it as it consumes a lot of heap memory. Here's what the Elasticsearch team suggests to do in my case:
Fielddata documentation
Before you enable fielddata, consider why you are using a text field for aggregations, sorting, or in a script. It usually doesn’t make sense to do so.
A text field is analyzed before indexing so that a value like New York can be found by searching for new or for york. A terms aggregation on this field will return a new bucket and a york bucket, when you probably want a single bucket called New York.
Instead, you should have a text field for full text searches, and an unanalyzed keyword field with doc_values enabled for aggregations, as follows:
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_field": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}
Unfortunately I can't find a way to do it with the Hibernate Search annotations. Can someone tell me if this is possible, or whether I have to migrate to the vanilla Elasticsearch lib and not use any wrappers?
With the current version of Hibernate Search, you need to create a different field for that (i.e. you can't have different flavors of the same field). Note that that's what Elasticsearch is doing under the hood anyway.
@Field(analyzer = "your-text-analyzer") // your default full text search field with the default name
@Field(name = "myPropertyAggregation", index = Index.NO, normalizer = "keyword")
@SortableField(forField = "myPropertyAggregation")
private String myProperty;
It should create an unanalyzed field with doc values. You then need to refer to the myPropertyAggregation field for your aggregations.
Note that we will expose many more Elasticsearch features in the API in the future Search 6. In Search 5, the APIs were designed with Lucene in mind and we couldn't break them.
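For example, once the separate field exists, sorts (and scripts) should target it rather than the analyzed field. A sketch using the Search 5 query DSL; the entity name and the surrounding EntityManager setup are assumptions:
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.hibernate.search.jpa.FullTextEntityManager;
import org.hibernate.search.jpa.FullTextQuery;
import org.hibernate.search.jpa.Search;
import org.hibernate.search.query.dsl.QueryBuilder;

FullTextEntityManager ftem = Search.getFullTextEntityManager(entityManager);
QueryBuilder qb = ftem.getSearchFactory()
        .buildQueryBuilder().forEntity(MyEntity.class).get();

Query luceneQuery = qb.all().createQuery();

// Sort on the unanalyzed "myPropertyAggregation" field, not on the analyzed "myProperty" one.
Sort sort = qb.sort().byField("myPropertyAggregation").createSort();

FullTextQuery query = ftem.createFullTextQuery(luceneQuery, MyEntity.class);
query.setSort(sort);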

MongoDb - Update collection atomically if set does not exist

I have the following document in my collection:
{
  "_id": NumberLong(106379),
  "_class": "x.y.z.SomeObject",
  "name": "Some Name",
  "information": {
    "hotelId": NumberLong(106379),
    "names": [
      {
        "localeStr": "en_US",
        "name": "some Other Name"
      }
    ],
    "address": {
      "address1": "5405 Google Avenue",
      "city": "Mountain View",
      "cityIdInCitiesCodes": "123456",
      "stateId": "CA",
      "countryId": "US",
      "zipCode": "12345"
    },
    "descriptions": [
      {
        "localeStr": "en_US",
        "description": "Some Description"
      }
    ]
  },
  "providers": [],
  "some other set": {
    "a": "bla bla bla",
    "b": "bla,bla bla"
  },
  "another Property": "fdfdfdfdfdf"
}
I need to run through all documents in the collection, and if "providers": [] is empty I need to create a new set based on values from the information section.
I'm far from being a MongoDB expert, so I have a few questions:
Can I do it as an atomic operation?
Can I do this using the MongoDB console? As far as I understood, I can do it using the $addToSet and $each commands?
If not, is there any Java-based driver that can provide such functionality?
Can I do it as an atomic operation?
Every document will be updated in an atomic fashion. There is no "atomic" in MongoDB in the sense of an RDBMS transaction, where all operations succeed or fail together, but you can prevent other writes from interleaving using the $isolated operator.
Can I do this using the MongoDB console?
Sure you can. To find all documents with an empty providers array you can issue a command like:
db.zz.find({ providers: { $size: 0 } })
To update all documents where the array is of zero length with a fixed string, you can issue a query such as:
db.zz.update({ providers: { $size: 0 } }, { $addToSet: { providers: "zz" } }, { multi: true })
If you want to add a portion to your document based on the document's own data, you can use the notorious $where query (do mind the warnings appearing in that link), or - as you had mentioned - query for the empty providers array and use cursor.forEach().
If not, is there any Java-based driver that can provide such functionality?
Sure, there is a Java driver, as for every other major programming language. It can practically do everything described above, and basically everything you can do from the shell. I suggest you get started from the Java Language Center.
Also, there are several frameworks which facilitate working with MongoDB and bridge the object-document worlds. I will not give a list here as I'm pretty biased, but I'm sure a quick Google search will do.
db.so.find({ providers: { $size: 0 } }).forEach(function(doc) {
    doc.providers.push(doc.information.hotelId);
    db.so.save(doc);
});
This will push the information.hotelId of the corresponding document into an empty providers array. Replace that with whatever field you would rather insert into the providers array.
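Since the question also asks about a Java-based driver, here is a rough equivalent using the MongoDB Java driver (a sketch only; the connection string, database name and pushed field are assumptions following the shell example above):
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;

try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
    MongoCollection<Document> so = client.getDatabase("test").getCollection("so");

    // For every document with an empty providers array, push information.hotelId into it.
    for (Document doc : so.find(Filters.size("providers", 0))) {
        Object hotelId = doc.get("information", Document.class).get("hotelId");
        so.updateOne(Filters.eq("_id", doc.get("_id")),
                     Updates.push("providers", hotelId));
    }
}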
