Apache Beam/Dataflow: KvCoder corrupting InputStream for decode - java

I have custom objects, CustomKey and CustomValue, for which I provide Avro-based coders: CustomKeyCoder and CustomValueCoder.
Since I need to group by KV[CustomKey, CustomValue], I registered KvCoder.of(new CustomKeyCoder, new CustomValueCoder). The custom coders wrap the input/output stream in a data input/output stream and use an Avro DatumWriter/DatumReader.
The issue I am having is in the decode of the KvCoder: when it attempts to decode the value part of the KV, I get a Forbidden IOException when reading from InputStream. The key part decodes properly; the error is thrown when the input stream is passed into the value decode. KvCoder reuses the same input stream for both key and value, so I am guessing the key decoding reads the entire stream. Why would this be happening? Is the usage of Avro a problem?
Here is some code to illustrate the above:
// Coder
override def decode(inputStream: InputStream): CustomValue = {
  val dataInputStream = new DataInputStream(inputStream)
  val id = dataInputStream.readShort
  underlying.decode(dataInputStream)
}
// Underlying
override def decode(inputStream: InputStream): CustomValue = {
  val decoder = DecoderFactory.get().binaryDecoder(inputStream, null)
  val record = datumReader.read(null, decoder)
  CustomValue.decode(record)
}
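One plausible cause, offered as an assumption rather than something the post establishes: DecoderFactory.get().binaryDecoder(...) returns a buffered decoder that may read ahead of the record it decodes, so decoding one component of the KV can consume bytes that belong to the next. The unbuffered directBinaryDecoder avoids that. A minimal Java sketch (the class and schema wiring here are hypothetical):

import java.io.IOException;
import java.io.InputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

public class UnbufferedAvroDecode {
    private final GenericDatumReader<GenericRecord> datumReader;

    public UnbufferedAvroDecode(Schema schema) {
        this.datumReader = new GenericDatumReader<>(schema);
    }

    public GenericRecord decode(InputStream inputStream) throws IOException {
        // directBinaryDecoder reads from the stream without buffering ahead,
        // so bytes following this record are left for the next coder.
        BinaryDecoder decoder = DecoderFactory.get().directBinaryDecoder(inputStream, null);
        return datumReader.read(null, decoder);
    }
}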

Related

Creating test data from Confluent Control Center JSON representation

I'm trying to write some unit tests for Kafka Streams and have a number of quite complex schemas that I need to incorporate into my tests.
Instead of just creating objects from scratch each time, I would ideally like to instantiate them using some real data and perform tests on that. We use Confluent with records in Avro format, and can extract both the schema and a text JSON-like representation from the Control Center application. The JSON is valid JSON, but it's not really in the form you'd write if you were hand-writing JSON representations of the data, so I assume it's some representation of the underlying Avro in text form.
I've already used the schema to create a Java SpecificRecord class (price_assessment) and would like to use the JSON string copied from the Control Center message to populate a new instance of that class to feed into my unit test InputTopic.
The code I've tried so far is
var testAvroString = "{JSON copied from Control Center topic}";
Schema schema = price_assessment.getClassSchema();
DecoderFactory decoderFactory = new DecoderFactory();
Decoder decoder = null;
try {
    DatumReader<price_assessment> reader = new SpecificDatumReader<price_assessment>();
    decoder = decoderFactory.get().jsonDecoder(schema, testAvroString);
    return reader.read(null, decoder);
} catch (Exception e) {
    return null;
}
which is adapted from another SO answer that was using GenericRecords. When I try running this, though, I get the exception Cannot invoke "org.apache.avro.Schema.equals(Object)" because "writer" is null on the reader.read(...) step.
I'm not massively familiar with streams testing or Java, and I'm not sure what exactly I've done wrong. Written with Java 17 and Kafka Streams 3.1.0, though I'm flexible on versions.
The solution that I've managed to come up with is the following, which seems to work:
private static <T> T avroStringToInstance(Schema classSchema, String testAvroString) {
    GenericRecord genericRecord = null;
    try {
        // Decode the Control Center JSON text against the schema into a generic record.
        Decoder decoder = DecoderFactory.get().jsonDecoder(classSchema, testAvroString);
        DatumReader<GenericRecord> reader = new GenericDatumReader<>(classSchema);
        genericRecord = reader.read(null, decoder);
    } catch (Exception e) {
        return null;
    }
    // Deep-copy the generic record into the generated SpecificRecord type.
    var specific = (T) SpecificData.get().deepCopy(genericRecord.getSchema(), genericRecord);
    return specific;
}
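A shorter variant may also be possible, offered as an untested assumption: the original failure ("writer" is null) came from the no-argument SpecificDatumReader having no schema, so constructing the reader from the generated class should let it decode straight to the specific type:

// Hedged sketch: give the reader its schema up front so "writer" is no longer null.
DatumReader<price_assessment> reader = new SpecificDatumReader<>(price_assessment.class);
Decoder decoder = DecoderFactory.get().jsonDecoder(price_assessment.getClassSchema(), testAvroString);
price_assessment record = reader.read(null, decoder);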

How to deserialize a Java map

I am trying to deserialize, in Go, bytes of an object that was serialized in Java in the following way:
// myMap is an instance of Java TreeMap<String, Object>
ByteArrayOutputStream a = new ByteArrayOutputStream();
GZIPOutputStream b = new GZIPOutputStream(a);
ObjectOutputStream c = new ObjectOutputStream(b);
c.writeObject(myMap);
c.close();
byte[] bytes = a.toByteArray();
Below are the attempts I made.
Step 1: uncompress the bytes (into the variable result) using
// att is the byte array received
buf := bytes.NewBuffer(att)
reader, _ := gzip.NewReader(buf)
defer reader.Close()
result, _ := ioutil.ReadAll(reader)
Step 2: read the object out of the uncompressed bytes - this failed:
var decodedMap map[string]interface{}
d := gob.NewDecoder(bytes.NewBuffer(result))
err := d.Decode(&decodedMap)
if err != nil {
    panic(err)
}
error = gob: encoded unsigned integer out of range
But when I convert the (byte array) result to a string in Go, I see the encoded TreeMap details and the contents:
map: �� sr java.util.TreeMap��>-%j� Lt NAMEt JOHNt AGEt 32t LOCODEsr java.lang.Long;���̏#� J valuexr java.lang.Number������ xp y
Can someone help me out here?
You can't (easily) deserialize those maps in Go, because the serialized data contains Java-specific data, data required to instantiate and initialize the original Java class (java.util.TreeMap in this case), which is obviously unknown to a Go app. Java object serialization and the encoding implemented by encoding/gob have nothing to do with each other; the former is specific to Java and the latter is specific to Go.
Instead try to serialize the Java object in a language-neutral way, e.g. to JSON, which you can decode in Go (or in any other language).
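For example, a minimal sketch of the Java side, assuming Jackson is available (the keys mirror the ones visible in the dump above); the Go side then becomes a plain json.Unmarshal into a map[string]interface{}:

import java.util.TreeMap;
import com.fasterxml.jackson.databind.ObjectMapper;

TreeMap<String, Object> myMap = new TreeMap<>();
myMap.put("NAME", "JOHN");
myMap.put("AGE", 32);
// JSON bytes are language-neutral, unlike ObjectOutputStream's Java-only format.
byte[] bytes = new ObjectMapper().writeValueAsBytes(myMap);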

Convert Java byte array to Go struct

I have a system designed around a broker, with the producer in Java and the consumer in Go.
I am using Apache Pulsar as my broker.
Java - Producer
A MessageJava object is converted to a byte array before being sent to Pulsar: it calls the getBytes() method defined in the same class to convert itself to a byte[], and this array is then sent to Apache Pulsar.
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Date;

// Must implement Serializable for writeObject to accept it.
class MessageJava implements Serializable {
    String id;
    int entityId;
    Date timestamp;

    public byte[] getBytes() throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(this);
        oos.flush();
        return bos.toByteArray();
    }
}
My consumer is written in Go.
Go - Consumer
The byte array is read from Pulsar and converted to a MessageGo struct using the ConvertAfterReceiving method defined below; I am using gob for decoding:
type MessageGo struct {
    Id        string
    EntityId  int
    Timestamp time.Time
}

func ConvertAfterReceiving(msg pulsar.Message) *MessageGo {
    payload := msg.Payload()
    messageBytes := bytes.NewBuffer(payload)
    dec := gob.NewDecoder(messageBytes)
    var message MessageGo
    err := dec.Decode(&message)
    if err != nil {
        logging.Log.Error("error occurred in consumer while decoding message:", err)
    }
    return &message
}
The issue is that I am not able to decode the byte[] and convert it to a MessageGo struct. It fails with the error encoded unsigned integer out of range.
I have tried changing MessageJava.entityId to short/long and MessageGo.EntityId to int8/int16/int32/int64/uint8/uint16/uint32/uint64 (all permutations), but all in vain.
A Java ObjectOutputStream and a Go Decoder do not speak the same language, even if at the base they're both made up of bytes; the same way that "these words" and "эти слова" (Russian for the same) are made up of lines, yet knowing one doesn't let you know the other.
An ObjectOutputStream transforms objects into a form meant to be read by a Java ObjectInputStream, while a Go Decoder expects data in a format created by a Go Encoder.
What's needed is a language that they both speak, like JSON, which both Java and Go know how to work with. Then, instead of serializing the object straight into bytes, you transform it into a string representation, send the bytes of that string, and convert that string in Go into the desired struct.
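A minimal sketch of the producer side under that approach, assuming Jackson is used for the JSON step (the Go consumer would then json.Unmarshal the payload into MessageGo; encoding/json matches field names case-insensitively):

import java.util.Date;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.SerializationFeature;

class MessageJava {
    public String id;
    public int entityId;
    public Date timestamp;

    public byte[] getBytes() throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // Write the Date as an ISO-8601 string rather than epoch millis;
        // the exact format may still need aligning with Go's time.Time parsing.
        mapper.disable(SerializationFeature.WRITE_DATES_AS_TIMESTAMPS);
        return mapper.writeValueAsBytes(this);
    }
}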

Read S3 object and write into an in-memory buffer

I am trying to read from S3 and write into an in-memory buffer, like:
def inMemoryDownload(bucketName: String, key: String): String = {
  val s3Object = s3client.getObject(new GetObjectRequest(bucketName, key))
  val s3Stream = s3Object.getObjectContent()
  val outputStream = new ByteArrayOutputStream()
  val buffer = new Array[Byte](10 * 1024)
  var bytesRead: Int = s3Stream.read(buffer)
  while (bytesRead > -1) {
    info("writing.......")
    outputStream.write(buffer)
    info("reading.......")
    bytesRead = s3Stream.read(buffer)
  }
  val data = new String(outputStream.toByteArray)
  outputStream.close()
  s3Object.getObjectContent.close()
  data
}
But it is giving me a heap space error (the size of the file on S3 is 4 MB).
You should be using the number of bytes you just read when writing into the stream. The way you have it written, it writes the entire buffer every time. I doubt that is the cause of your memory problem, but it could be. Imagine that read returns a single byte to you every time, and you write 10K into the stream: that's 40G, right there.
Another problem is that, I am not 100% sure, but I suspect that getObjectContent creates a new input stream every time. Basically, you just keep reading the same bytes over and over again in the loop. You should put it into a variable instead.
Also, if I may make a suggestion, try rewriting your code in actual Scala, not just syntactically but idiomatically. Avoid mutable state and use functional transformations. If you are going to write Scala code, you might as well take some time to get into the right mindset. You'll grow to appreciate it eventually, I promise :)
Something like this, perhaps?
val input = s3Object.getObjectContent
val buffer = new Array[Byte](10 * 1024)
val output = new ByteArrayOutputStream()
Stream
  .continually(input.read(buffer))
  .takeWhile(_ > 0)
  .foreach { output.write(buffer, 0, _) }

Avro: SpecificDatumReader/Writer vs ReflectDatumReader/Writer

I have a byte[] representation of Avro payload and the schema which is used to encode/decode this payload. There are a few ways to convert this payload to a SpecificRecord, using ReflectDatumReader/Writer and SpecificDatumReader/Writer.
We get an error when we decode using SpecificDatumReader and then try to encode the returned SpecificRecord using ReflectDatumWriter.
Below is the sample code: a SpecificDatumReader decodes to a SpecificRecord (Testing), then a ReflectDatumWriter attempts to encode it into a byte[] (this is the step that fails), and finally a SpecificDatumReader re-decodes it.
DatumWriter<Testing> refWriter = new ReflectDatumWriter<>(expectedSchema);
DatumReader<Testing> specReader = new SpecificDatumReader<>(expectedSchema);

// Use SpecificDatumReader to decode to a SpecificRecord
Testing r1 = specReader.read(null, DecoderFactory.get().binaryDecoder(expectedBytePayload, null));

// Use ReflectDatumWriter to encode the SpecificRecord as a byte[].
// Note: this step fails
ByteArrayOutputStream outStream = new ByteArrayOutputStream();
refWriter.write(r1, EncoderFactory.get().binaryEncoder(outStream, null));

// Use SpecificDatumReader to re-decode to a SpecificRecord
Testing r2 = specReader.read(null, DecoderFactory.get().binaryDecoder(outStream.toByteArray(), null));
Error Message:
java.io.EOFException
at org.apache.avro.io.BinaryDecoder.ensureBounds(BinaryDecoder.java:473)
at org.apache.avro.io.BinaryDecoder.readDouble(BinaryDecoder.java:243)
at org.apache.avro.io.ResolvingDecoder.readDouble(ResolvingDecoder.java:190)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:185)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:240)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:230)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:174)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:144)
Why aren't they interchangeable?
When do we use one over another?
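One thing worth ruling out first, as an observation about the snippet rather than something the post states: the binary encoder returned by EncoderFactory is buffered, so outStream holds incomplete data until the encoder is flushed, and truncated bytes would produce exactly this kind of EOFException on the re-decode. A minimal sketch of the write step with an explicit flush:

ByteArrayOutputStream outStream = new ByteArrayOutputStream();
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(outStream, null);
refWriter.write(r1, encoder);
// Without this flush, outStream.toByteArray() can be empty or truncated.
encoder.flush();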
