Creating test data from Confluent Control Center JSON representation - java

I'm trying to write some unit tests for Kafka Streams and have a number of quite complex schemas that I need to incorporate into my tests.
Instead of just creating objects from scratch each time, I would ideally like to instantiate using some real data and perform tests on that. We use Confluent with records in Avro format, and can extract both schema and a text JSON-like representation from the Control Center application. The JSON is valid JSON, but it's not really in the form that you'd write it in if you were just writing JSON representations of the data, so I assume it's some representation of the underlying AVRO in text form.
I've already used the schema to create a Java SpecificRecord class (price_assessment) and would like to use the JSON string copied from the Control Center message to populate a new instance of that class to feed into my unit test's InputTopic.
The code I've tried so far is
var testAvroString = "{JSON copied from Control Center topic}";

Schema schema = price_assessment.getClassSchema();
DecoderFactory decoderFactory = new DecoderFactory();
Decoder decoder = null;
try {
    DatumReader<price_assessment> reader = new SpecificDatumReader<price_assessment>();
    decoder = decoderFactory.get().jsonDecoder(schema, testAvroString);
    return reader.read(null, decoder);
} catch (Exception e) {
    return null;
}
which is adapted from another SO answer that was using GenericRecords. When I try running this, though, I get the exception Cannot invoke "org.apache.avro.Schema.equals(Object)" because "writer" is null on the reader.read(...) step.
I'm not massively familiar with streams testing or Java, and I'm not sure exactly what I've done wrong. This is written in Java 17 with Kafka Streams 3.1.0, though I'm flexible on versions.

The solution that I've managed to come up with is the following, which seems to work:
private static <T> T avroStringToInstance(Schema classSchema, String testAvroString) {
    DecoderFactory decoderFactory = new DecoderFactory();
    GenericRecord genericRecord = null;
    try {
        // Decode the Control Center JSON against the schema into a GenericRecord first
        Decoder decoder = decoderFactory.jsonDecoder(classSchema, testAvroString);
        DatumReader<GenericData.Record> reader = new GenericDatumReader<>(classSchema);
        genericRecord = reader.read(null, decoder);
    } catch (Exception e) {
        return null;
    }
    // Deep-copy the generic record into the generated SpecificRecord type
    var specific = (T) SpecificData.get().deepCopy(genericRecord.getSchema(), genericRecord);
    return specific;
}
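In the test itself, the deserialized record can then be piped into the TestInputTopic. A rough sketch of that wiring follows; the topic name, serde, topology and testAvroString are placeholders rather than my real test code:

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.TestInputTopic;
import org.apache.kafka.streams.TopologyTestDriver;
import org.junit.jupiter.api.Test;

@Test
public void feedsControlCenterRecordIntoTopology() {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "price-assessment-test");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234");

    try (TopologyTestDriver driver = new TopologyTestDriver(topology, props)) {
        // "price-topic" and priceAssessmentSerde stand in for the real topic name
        // and a configured Avro serde for price_assessment
        TestInputTopic<String, price_assessment> inputTopic = driver.createInputTopic(
                "price-topic",
                Serdes.String().serializer(),
                priceAssessmentSerde.serializer());

        price_assessment record =
                avroStringToInstance(price_assessment.getClassSchema(), testAvroString);
        inputTopic.pipeInput("some-key", record);

        // ... assertions against the output topic go here ...
    }
}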

Related

Avro vs Protobuf Performance

I wrote a JMH benchmark to compare the serialization performance of Avro (1.8.2) and Protobuf (3.5.0) in Java 1.8. According to JMH, Protobuf can serialize some data 4.7 million times per second, whereas Avro can only do 800k per second.
The test data that was serialized is around 200 bytes and I generated schema for both Avro and Protobuf.
Here is my Avro serialization code; can someone familiar with Avro confirm that I haven't made some cardinal mistake?
The method called serialize is what JMH benchmarked. Also, I have posted this at https://groups.google.com/forum/#!topic/protobuf/skmE78F-XbE
Many thanks.
public final class AvroSerialization {

    private BinaryEncoder encoder;
    private final SpecificDatumWriter<AvroGeneratedClass> writer;

    public AvroSerialization() {
        this.writer = new SpecificDatumWriter<>( AvroGeneratedClass.class );
    }

    // MyDataObject = a POJO that contains the data to be serialized
    public final byte[] serialize( MyDataObject data ) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream( 1024 );
        encoder = EncoderFactory.get().binaryEncoder( out, encoder );
        AvroGeneratedClass avroData = createAvro( data );
        writer.write( avroData, encoder );
        encoder.flush();
        return out.toByteArray();
    }

    // AvroGeneratedClass = class generated by the Avro schema
    public static AvroGeneratedClass createAvro( MyDataObject data ) {
        AvroGeneratedClass avroData = AvroGeneratedClass.newBuilder()
                .setXXX( data.getXXX() )
                .setXXX( data.getXXX() )
                // ... remaining fields ...
                .build();
        return avroData;
    }
}
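The JMH part is essentially just a @Benchmark method wrapping serialize. A simplified sketch of such a harness is below; the state setup and the sample-data helper are illustrative, not the exact benchmark code:

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class AvroSerializationBenchmark {

    private AvroSerialization avroSerialization;
    private MyDataObject data;

    @Setup
    public void setup() {
        avroSerialization = new AvroSerialization();
        data = createSampleData(); // hypothetical helper building a ~200 byte record
    }

    @Benchmark
    public byte[] serialize() throws Exception {
        return avroSerialization.serialize(data);
    }

    private static MyDataObject createSampleData() {
        // Placeholder: populate MyDataObject with representative test values
        return new MyDataObject();
    }
}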
Avro is normally serialized together with its schema.
In the Protobuf approach, the server assumes the client already knows the schema, so it just serializes the data to a binary format.
For transactional workloads, Protobuf is usually better.
Avro is usually better for analytical workloads where you need to serialize a huge number of records. In that case the schema overhead is often negligible, and Avro's serialization is slightly more compact.
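To illustrate the analytical case: an Avro container file writes the schema once in the file header, and every appended record is schema-less binary, so the schema cost is amortized over the number of records. A rough sketch reusing the class names from the question (the output path is just an example):

import java.io.File;
import java.io.IOException;

import org.apache.avro.file.DataFileWriter;
import org.apache.avro.specific.SpecificDatumWriter;

public final class AvroFileDump {

    public static void writeAll(Iterable<AvroGeneratedClass> records) throws IOException {
        SpecificDatumWriter<AvroGeneratedClass> datumWriter =
                new SpecificDatumWriter<>(AvroGeneratedClass.class);
        try (DataFileWriter<AvroGeneratedClass> fileWriter = new DataFileWriter<>(datumWriter)) {
            // The schema is written once here, into the container file header
            fileWriter.create(AvroGeneratedClass.getClassSchema(), new File("/tmp/records.avro"));
            for (AvroGeneratedClass record : records) {
                // Each append writes only the binary-encoded fields, no schema
                fileWriter.append(record);
            }
        }
    }
}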

How to serialize the data to AVRO schema in Spark (with Java)?

I have defined an Avro schema and generated some classes with avro-tools for the schemas. Now I want to serialize the data to disk. I found some answers about Scala for this, but not for Java. The class Article is generated with avro-tools and is made from a schema I defined.
Here's a simplified version of the code of how I try to do it:
JavaPairRDD<String, String> filesRDD = context.wholeTextFiles(inputDataPath);
JavaRDD<Article> processingFiles = filesRDD.map(fileNameContent -> {
    // The name of the file
    String fileName = fileNameContent._1();
    // The content of the file
    String fileContent = fileNameContent._2();
    // An object from my avro schema
    Article a = new Article(fileContent);
    Processing processing = new Processing();
    // .... some processing of the content here ... //
    processing.serializeArticleToDisk(avroFileName);
    return a;
});
where serializeArticleToDisk(avroFileName) is defined as follows:
public void serializeArticleToDisk(String filename) throws IOException {
    // Serialize article to disk
    DatumWriter<Article> articleDatumWriter = new SpecificDatumWriter<Article>(Article.class);
    DataFileWriter<Article> dataFileWriter = new DataFileWriter<Article>(articleDatumWriter);
    dataFileWriter.create(this.article.getSchema(), new File(filename));
    dataFileWriter.append(this.article);
    dataFileWriter.close();
}
where Article is my avro schema.
Now, the mapper throws me the error:
java.io.FileNotFoundException: hdfs:/...path.../avroFileName.avro (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
at org.apache.avro.file.SyncableFileOutputStream.<init>(SyncableFileOutputStream.java:60)
at org.apache.avro.file.DataFileWriter.create(DataFileWriter.java:129)
at sentences.ProcessXML.serializeArticleToDisk(ProcessXML.java:207)
. . . rest of the stacktrace ...
although the file path is correct.
I use a collect() method afterwards, so everything else within the map function works fine (except for the serialization part).
I am quite new to Spark, so I am not sure whether this is actually something trivial. I suspect that I need to use some writing functions rather than doing the writing in the mapper (not sure if this is true, though). Any ideas on how to tackle this?
EDIT:
The last line of the error stack trace I showed is actually on this part:
dataFileWriter.create(this.article.getSchema(), new File(filename));
This is the part that throws the actual error. I am assuming the dataFileWriter needs to be replaced with something else. Any ideas?
This solution does not use DataFrames and does not throw any errors:
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.io.NullWritable;
import org.apache.avro.mapred.AvroKey;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;
. . . . .
// Serializing to AVRO
JavaPairRDD<AvroKey<Article>, NullWritable> javaPairRDD = processingFiles.mapToPair(r -> {
    return new Tuple2<AvroKey<Article>, NullWritable>(new AvroKey<Article>(r), NullWritable.get());
});

Job job = AvroUtils.getJobOutputKeyAvroSchema(Article.getClassSchema());
javaPairRDD.saveAsNewAPIHadoopFile(outputDataPath, AvroKey.class, NullWritable.class,
        AvroKeyOutputFormat.class, job.getConfiguration());
where AvroUtils.getJobOutputKeyAvroSchema is:
public static Job getJobOutputKeyAvroSchema(Schema avroSchema) {
    Job job;
    try {
        job = new Job();
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
    AvroJob.setOutputKeySchema(job, avroSchema);
    return job;
}
Similar things for Spark + Avro can be found here -> https://github.com/CeON/spark-utils.
It seems that you are using Spark in the wrong way.
map is a transformation. Just calling map doesn't trigger computation of the RDD; you have to call an action such as foreach() or collect().
Also note that the lambda supplied to map will be serialized on the driver and transferred to the nodes in the cluster.
ADDED
Try using Spark SQL and spark-avro to save a Spark DataFrame in Avro format:
// Load a text file and convert each line to a JavaBean.
JavaRDD<Person> people = sc.textFile("/examples/people.txt")
        .map(Person::parse);

// Apply a schema to an RDD
DataFrame peopleDF = sqlContext.createDataFrame(people, Person.class);
peopleDF.write()
        .format("com.databricks.spark.avro")
        .save("/output");

Convert RDF/XML into Turtle triples

I've written a Java program that ingests data from a .csv file and converts those data into RDF/XML. I used Sesame's framework when writing this program, and the program successfully does what it was written to do.
However, I am trying to unit test this program using JUnit, and I need to test a method which converts RDF triples (in Turtle format) to RDF/XML. To show that the method works correctly, I would like to convert the RDF/XML back into triples and compare them to the original triples I passed into the method. So far, I have not found anything in Sesame's documentation that does this. Any suggestions?
I just solved the problem a few minutes ago. Here's my solution:
@Test
public void testWriteStmtToRDFPos() {
    RDFParser parser = new RDFXMLParser();
    String baseURI = "";
    Model origStmts = new LinkedHashModel();
    Model processedStmts = new LinkedHashModel();
    StatementCollector collector = new StatementCollector(processedStmts);
    parser.setRDFHandler(collector);
    origStmts.add(sexOffend, predicate, object);
    try {
        converter.writeStmtToRDF(origStmts, rdfFile);
        FileReader reader = new FileReader(rdfFile);
        parser.parse(reader, baseURI);
        // Fail the test if the round-tripped statements differ from the originals
        assertTrue(origStmts.equals(processedStmts));
    } catch (FileNotFoundException e) {
        e.printStackTrace();
        fail();
    } catch (Exception e) {
        e.printStackTrace();
        fail();
    }
}
When you set the collector on the parser above, it simply collects any statements that the parser ingests. After doing this, you can compare the collected statements (processedStmts) with origStmts. This wasn't immediately obvious, but it's really useful once you find it!

Parse ~1 MB JSON on Android very slow

I have an approximately 1 MB JSON file stored in my assets folder that I need to load in my app every time it runs. I find that the built-in JSON parser (org.json) parses the file very slowly, but once it's parsed, I can access and manipulate the data very quickly. I've counted as many as 7 or 8 seconds from the moment I tap the app to the moment Activity1 is brought up, but just a few milliseconds to go from Activity1 to Activity2, which depends on data processed from the data loaded in Activity1.
I'm reading the file into memory and parsing it using:
String jsonString = readFileToMemory(myFilename);
JSONArray array = new JSONArray(jsonString);
where readFileToMemory(String) looks like this:
private String readFileToMemory(String filename) {
    StringBuilder data = new StringBuilder();
    BufferedReader reader = null;
    try {
        InputStream stream = myContext.getAssets().open(filename);
        reader = new BufferedReader(new InputStreamReader(stream, "UTF-8"));
        char[] chunk = new char[512];
        int result;
        // Only append the characters actually read, not the whole buffer
        while ((result = reader.read(chunk)) != -1) {
            data.append(chunk, 0, result);
        }
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (reader != null) {
            try {
                reader.close();
            } catch (IOException ignored) {
            }
        }
    }
    return data.toString();
}
Does anyone have any suggestions on how I can speed up the initial loading and parsing of the data? Should I perhaps mask the whole process behind a loading screen?
JSONObject (the one from json.org) is the simplest API for parsing JSON. However, it comes with a cost: performance. I have done extensive experiments with JSONObject, Gson and Jackson. It seems that no matter what you do, JSONObject (and hence JSONArray) will be the slowest. Please switch to Jackson or Gson; a streaming sketch follows the reference links below.
Here is the relative performance of the three:
(fastest) Jackson > Gson >> JSONObject (slowest)
Refer:
- Jackson
- Gson
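For example, a streaming parse with Gson's JsonReader avoids materialising the whole file as a String and building the full org.json tree in one go. A rough sketch, where the asset name and the "name" field are placeholders for your actual structure:

import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import android.content.Context;

import com.google.gson.stream.JsonReader;

public final class AssetJsonLoader {

    // Streams the JSON array entry by entry instead of loading the whole file first
    public static List<String> loadNames(Context context, String filename) throws Exception {
        try (JsonReader reader = new JsonReader(new InputStreamReader(
                context.getAssets().open(filename), StandardCharsets.UTF_8))) {
            List<String> names = new ArrayList<>();
            reader.beginArray();
            while (reader.hasNext()) {
                reader.beginObject();
                while (reader.hasNext()) {
                    if ("name".equals(reader.nextName())) {
                        names.add(reader.nextString());
                    } else {
                        reader.skipValue();
                    }
                }
                reader.endObject();
            }
            reader.endArray();
            return names;
        }
    }
}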
You should make an SQLite table to store the data and move it from JSON to SQL the first time the app runs. As an added benefit, this makes the data easier to search through and makes it possible for you to modify the data from within the app.
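A rough sketch of that one-time import, with a made-up single-column table, might look like the following; wrapping the inserts in one transaction keeps the import fast:

import android.content.ContentValues;
import android.database.sqlite.SQLiteDatabase;

import org.json.JSONArray;
import org.json.JSONObject;

public final class JsonToSqlite {

    // One-time import: commit once for the whole batch rather than per row.
    // The "items" table and "name" column are placeholders.
    public static void importOnce(SQLiteDatabase db, JSONArray array) throws Exception {
        db.execSQL("CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, name TEXT)");
        db.beginTransaction();
        try {
            for (int i = 0; i < array.length(); i++) {
                JSONObject item = array.getJSONObject(i);
                ContentValues values = new ContentValues();
                values.put("name", item.getString("name"));
                db.insert("items", null, values);
            }
            db.setTransactionSuccessful();
        } finally {
            db.endTransaction();
        }
    }
}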

HL7 parsing to get ORC-2

I am having trouble reading the ORC-2 field from an ORM^O01 order message. I am using HapiStructures-v23-1.2.jar to read it, but this method (getFillerOrderNumber()) is returning a null value
MSH|^~\\&|recAPP|20010|BIBB|HCL|20110923192607||ORM^O01|11D900220|D|2.3|1\r
PID|1|11D900220|11D900220||TEST^FOURTYONE||19980808|M|||\r
ZRQ|1|11D900220||CHARTMAXX TESTING ACCOUNT 2|||||||||||||||||Y\r
ORC|NW|11D900220||||||||||66662^NOT INDICATED^X^^^^^^^^^^U|||||||||CHARTMAXX
TESTING ACCOUNT 2|^695 S.BROADWAY^DENVER^CO^80209\r
OBR|1|11D900220||66^BHL, 9P21 GENOTYPE^L|NORMAL||20110920001800|
||NOTAVAILABLE|N||Y|||66662^NOT INDICATED^X^^^^^^^^^^U\r
I want to parse this message and read the ORC-2 field and save it in the database
public static String getOrderNumber() {
    Message hapiMsg = null;
    Parser p = new GenericParser();
    p.setValidationContext(null);
    try {
        hapiMsg = p.parse(hl7Message);
    } catch (Exception e) {
        logger.error(e);
    }
    Terser terser = new Terser(hapiMsg);
    String fn = null;
    try {
        ORM_O01 getOrc = (ORM_O01) hapiMsg;
        ORC orc = new ORC(getOrc, null);
        fn = orc.getFillerOrderNumber().toString();
    } catch (Exception e) {
        logger.error(e);
    }
    return fn;
}
I read in some posts that I have to ladder through the message to reach the ORC, OBR and NTE segments. Can someone show me how to do this with a piece of code? Thanks in advance.
First I have to point out that ORC-2 is Placer Order Number and ORC-3 is Filler Order Number, not the other way round. So, what you might want to do is this:
ORM_O01 msg = ...
ORC orc = msg.getORDER().getORC();
String placerOrderNumber =
orc.getPlacerOrderNumber().getEntityIdentifier().getValue();
String fillerOrderNumber =
orc.getFillerOrderNumber().getEntityIdentifier().getValue();
I would suggest you read the HAPI documentation yourself: http://hl7api.sourceforge.net/v23/apidocs/index.html
Based on this code:
ORM_O01 getOrc = (ORM_O01)hapiMsg;
ORC orc = new ORC(getOrc, null);
String fn= orc.getFillerOrderNumber().toString();
It looks like you are creating a new ORC rather than pulling out the existing one from the message. I unfortunately can't provide the exact code as I'm only familiar with HL7, not HAPI.
EDIT: It looks like you may be able to do ORC orc = getOrc.getORDER().getORC();
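Putting that together with the first answer, a possible (untested) rewrite of the original method would be the sketch below; it pulls the ORC out of the existing ORDER group instead of constructing a new, empty one, and assumes the same hl7Message and logger fields as the question:

public static String getPlacerOrderNumber(String hl7Message) {
    try {
        Parser parser = new GenericParser();
        ORM_O01 msg = (ORM_O01) parser.parse(hl7Message);
        // Use the ORC segment that already exists in the parsed ORDER group
        ORC orc = msg.getORDER().getORC();
        // ORC-2 = Placer Order Number, ORC-3 = Filler Order Number
        return orc.getPlacerOrderNumber().getEntityIdentifier().getValue();
    } catch (Exception e) {
        logger.error(e);
        return null;
    }
}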
