I have defined an Avro schema and generated classes for it with avro-tools. Now I want to serialize the data to disk. I found some answers covering Scala, but not Java. The class Article is generated with avro-tools from a schema I defined.
Here's a simplified version of how I try to do it:
JavaPairRDD<String, String> filesRDD = context.wholeTextFiles(inputDataPath);

JavaRDD<Article> processingFiles = filesRDD.map(fileNameContent -> {
    // The name of the file
    String fileName = fileNameContent._1();
    // The content of the file
    String fileContent = fileNameContent._2();
    // An object from my avro schema
    Article a = new Article(fileContent);
    Processing processing = new Processing();
    // .... some processing of the content here ... //
    processing.serializeArticleToDisk(avroFileName);
    return a;
});
where serializeArticleToDisk(avroFileName) is defined as follows:
public void serializeArticleToDisk(String filename) throws IOException {
    // Serialize article to disk
    DatumWriter<Article> articleDatumWriter = new SpecificDatumWriter<Article>(Article.class);
    DataFileWriter<Article> dataFileWriter = new DataFileWriter<Article>(articleDatumWriter);
    dataFileWriter.create(this.article.getSchema(), new File(filename));
    dataFileWriter.append(this.article);
    dataFileWriter.close();
}
where Article is the class generated from my Avro schema.
Now the mapper throws this error:
java.io.FileNotFoundException: hdfs:/...path.../avroFileName.avro (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
at org.apache.avro.file.SyncableFileOutputStream.<init>(SyncableFileOutputStream.java:60)
at org.apache.avro.file.DataFileWriter.create(DataFileWriter.java:129)
at sentences.ProcessXML.serializeArticleToDisk(ProcessXML.java:207)
. . . rest of the stacktrace ...
although the file path is correct.
I use a collect() method afterwards, so everything else within the map function works fine (except for the serialization part).
I am quite new to Spark, so this might be something trivial. I suspect that I need to use some write function instead of doing the writing inside the mapper (not sure if this is true, though). Any ideas how to tackle this?
EDIT:
The last line of the stack trace I showed actually points to this part:
dataFileWriter.create(this.article.getSchema(), new File(filename));
This is the part that throws the actual error. I am assuming the dataFileWriter needs to be replaced with something else. Any ideas?
This solution does not use DataFrames and does not throw any errors:
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.io.NullWritable;
import org.apache.avro.mapred.AvroKey;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;
. . . . .
// Serializing to AVRO
JavaPairRDD<AvroKey<Article>, NullWritable> javaPairRDD = processingFiles.mapToPair(r -> {
    return new Tuple2<AvroKey<Article>, NullWritable>(new AvroKey<Article>(r), NullWritable.get());
});

Job job = AvroUtils.getJobOutputKeyAvroSchema(Article.getClassSchema());
javaPairRDD.saveAsNewAPIHadoopFile(outputDataPath, AvroKey.class, NullWritable.class,
        AvroKeyOutputFormat.class, job.getConfiguration());
where AvroUtils.getJobOutputKeyAvroSchema is:
public static Job getJobOutputKeyAvroSchema(Schema avroSchema) {
    Job job;
    try {
        job = new Job();
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
    AvroJob.setOutputKeySchema(job, avroSchema);
    return job;
}
Similar things for Spark + Avro can be found here -> https://github.com/CeON/spark-utils.
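For reference, reading those files back into an RDD should work with the matching AvroKeyInputFormat; this is only a sketch under the same avro-mapred dependency, not part of the original code:

import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;

// ...

Job readJob = Job.getInstance();
AvroJob.setInputKeySchema(readJob, Article.getClassSchema());

JavaRDD<Article> articles = context
        .newAPIHadoopFile(outputDataPath, AvroKeyInputFormat.class,
                AvroKey.class, NullWritable.class, readJob.getConfiguration())
        .keys()
        // Hadoop input formats may reuse the AvroKey instances, so copy the
        // datum before caching the RDD if needed
        .map(k -> (Article) k.datum());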
It seems that you are using Spark in the wrong way.
map is a transformation. Calling map alone does not trigger computation of the RDD; you have to call an action such as foreach() or collect().
Also note that the lambda supplied to map is serialized on the driver and transferred to a node in the cluster.
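To illustrate, reusing the variables from the question (just a sketch):

// map only records the transformation; nothing runs yet and nothing gets written:
JavaRDD<Article> processingFiles = filesRDD.map(fileNameContent -> new Article(fileNameContent._2()));

// only an action triggers the computation, and the lambda then runs on the executors:
List<Article> articles = processingFiles.collect();   // or processingFiles.foreach(...)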
ADDED
Try using Spark SQL and spark-avro to save the Spark DataFrame in Avro format:
// Load a text file and convert each line to a JavaBean.
JavaRDD<Person> people = sc.textFile("/examples/people.txt")
        .map(Person::parse);

// Apply a schema to an RDD
DataFrame peopleDF = sqlContext.createDataFrame(people, Person.class);
peopleDF.write()
        .format("com.databricks.spark.avro")
        .save("/output");
Related
I'm trying to write some unit tests for Kafka Streams and have a number of quite complex schemas that I need to incorporate into my tests.
Instead of creating objects from scratch each time, I would ideally like to instantiate them using some real data and run my tests on that. We use Confluent with records in Avro format, and we can extract both the schema and a text, JSON-like representation from the Control Center application. The JSON is valid JSON, but it's not really in the form you'd write if you were hand-writing JSON representations of the data, so I assume it's some text rendering of the underlying Avro.
I've already used the schema to create a Java SpecificRecord class (price_assessment) and would like to use the JSON string copied from the Control Center message to populate a new instance of that class to feed into my unit test's InputTopic.
The code I've tried so far is
var testAvroString = "{JSON copied from Control Center topic}";

Schema schema = price_assessment.getClassSchema();
DecoderFactory decoderFactory = new DecoderFactory();
Decoder decoder = null;
try {
    DatumReader<price_assessment> reader = new SpecificDatumReader<price_assessment>();
    decoder = decoderFactory.get().jsonDecoder(schema, testAvroString);
    return reader.read(null, decoder);
} catch (Exception e) {
    return null;
}
which is adapted from another SO answer that was using GenericRecords. When I try running this though I get the exception Cannot invoke "org.apache.avro.Schema.equals(Object)" because "writer" is null on the reader.read(...) step.
I'm not massively familiar with streams testing or Java, and I'm not sure exactly what I've done wrong. This is written in Java 17 with Kafka Streams 3.1.0, though I'm flexible on versions.
The solution that I've managed to come up with is the following, which seems to work:
private static <T> T avroStringToInstance(Schema classSchema, String testAvroString) {
    DecoderFactory decoderFactory = new DecoderFactory();
    GenericRecord genericRecord = null;
    try {
        Decoder decoder = decoderFactory.jsonDecoder(classSchema, testAvroString);
        DatumReader<GenericData.Record> reader = new GenericDatumReader<>(classSchema);
        genericRecord = reader.read(null, decoder);
    } catch (Exception e) {
        return null;
    }

    var specific = (T) SpecificData.get().deepCopy(genericRecord.getSchema(), genericRecord);
    return specific;
}
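A hypothetical call site (the variable names are mine; the placeholder string is the JSON copied from Control Center, as in the question):

String testAvroString = "{JSON copied from Control Center topic}";
price_assessment record = avroStringToInstance(price_assessment.getClassSchema(), testAvroString);
// the resulting SpecificRecord can then be piped into the TestInputTopic, e.g.
// inputTopic.pipeInput(recordKey, record);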
I am converting an EDI file to XML. However, my input file, which happens to also be in BIF format, is approximately 100 MB and gives me a Java out-of-memory error.
I tried to consult the Smooks documentation on huge file conversion, but the example there covers converting XML to EDI, not the other way around.
Below is the output I get when running my main method:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
at java.lang.StringBuffer.append(StringBuffer.java:367)
at java.io.StringWriter.write(StringWriter.java:94)
at java.io.Writer.write(Writer.java:127)
at freemarker.core.TextBlock.accept(TextBlock.java:56)
at freemarker.core.Environment.visit(Environment.java:257)
at freemarker.core.MixedContent.accept(MixedContent.java:57)
at freemarker.core.Environment.visitByHiddingParent(Environment.java:278)
at freemarker.core.IteratorBlock$Context.runLoop(IteratorBlock.java:157)
at freemarker.core.Environment.visitIteratorBlock(Environment.java:501)
at freemarker.core.IteratorBlock.accept(IteratorBlock.java:67)
at freemarker.core.Environment.visit(Environment.java:257)
at freemarker.core.Macro$Context.runMacro(Macro.java:173)
at freemarker.core.Environment.visit(Environment.java:686)
at freemarker.core.UnifiedCall.accept(UnifiedCall.java:80)
at freemarker.core.Environment.visit(Environment.java:257)
at freemarker.core.MixedContent.accept(MixedContent.java:57)
at freemarker.core.Environment.visit(Environment.java:257)
at freemarker.core.Environment.process(Environment.java:235)
at freemarker.template.Template.process(Template.java:262)
at org.milyn.util.FreeMarkerTemplate.apply(FreeMarkerTemplate.java:92)
at org.milyn.util.FreeMarkerTemplate.apply(FreeMarkerTemplate.java:86)
at org.milyn.event.report.HtmlReportGenerator.applyTemplate(HtmlReportGenerator.java:76)
at org.milyn.event.report.AbstractReportGenerator.processFinishEvent(AbstractReportGenerator.java:197)
at org.milyn.event.report.AbstractReportGenerator.processLifecycleEvent(AbstractReportGenerator.java:157)
at org.milyn.event.report.AbstractReportGenerator.onEvent(AbstractReportGenerator.java:92)
at org.milyn.Smooks._filter(Smooks.java:558)
at org.milyn.Smooks.filterSource(Smooks.java:482)
at com.***.xfunctional.EdiToXml.runSmooksTransform(EdiToXml.java:40)
at com.***.xfunctional.EdiToXml.main(EdiToXml.java:57)
import java.io.*;
import java.util.Arrays;
import java.util.Locale;
import javax.xml.transform.stream.StreamSource;
import org.milyn.Smooks;
import org.milyn.SmooksException;
import org.milyn.container.ExecutionContext;
import org.milyn.event.report.HtmlReportGenerator;
import org.milyn.io.StreamUtils;
import org.milyn.payload.StringResult;
import org.milyn.payload.SystemOutResult;
import org.xml.sax.SAXException;
public class EdiToXml {

    private static byte[] messageIn = readInputMessage();

    protected static String runSmooksTransform() throws IOException, SAXException, SmooksException {
        Locale defaultLocale = Locale.getDefault();
        Locale.setDefault(new Locale("en", "EN"));

        // Instantiate Smooks with the config...
        Smooks smooks = new Smooks("smooks-config.xml");
        try {
            // Create an exec context - no profiles....
            ExecutionContext executionContext = smooks.createExecutionContext();
            StringResult result = new StringResult();

            // Configure the execution context to generate a report...
            executionContext.setEventListener(new HtmlReportGenerator("target/report/report.html"));

            // Filter the input message to the outputWriter, using the execution context...
            smooks.filterSource(executionContext, new StreamSource(new ByteArrayInputStream(messageIn)), result);

            Locale.setDefault(defaultLocale);
            return result.getResult();
        } finally {
            smooks.close();
        }
    }

    public static void main(String[] args) throws IOException, SAXException, SmooksException {
        System.out.println("\n\n==============Message In==============");
        System.out.println("======================================\n");

        pause("The EDI input stream can be seen above. Press 'enter' to see this stream transformed into XML...");

        String messageOut = EdiToXml.runSmooksTransform();

        System.out.println("==============Message Out=============");
        System.out.println(messageOut);
        System.out.println("======================================\n\n");

        pause("And that's it! Press 'enter' to finish...");
    }

    private static byte[] readInputMessage() {
        try {
            InputStream input = new BufferedInputStream(new FileInputStream("/home/****/Downloads/BifInputFile.DATA"));
            return StreamUtils.readStream(input);
        } catch (IOException e) {
            e.printStackTrace();
            return "<no-message/>".getBytes();
        }
    }

    private static void pause(String message) {
        try {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            System.out.print("> " + message);
            in.readLine();
        } catch (IOException e) {
        }
        System.out.println("\n");
    }
}
<?xml version="1.0"?>
<smooks-resource-list xmlns="http://www.milyn.org/xsd/smooks-1.1.xsd" xmlns:edi="http://www.milyn.org/xsd/smooks/edi-1.4.xsd">
<!--
Configure the EDI Reader to parse the message stream into a stream of SAX events.
-->
<edi:reader mappingModel="edi-to-xml-bif-mapping.xml" validate="false"/>
</smooks-resource-list>
I edited this line in the code to read the input as a stream:
smooks.filterSource(executionContext, new StreamSource(new FileInputStream("/home/***/Downloads/sample-text-file.txt")), result);
However, I now get the error below. Does anybody have a guess at the best approach?
Exception in thread "main" org.milyn.SmooksException: Failed to filter source.
at org.milyn.delivery.sax.SmooksSAXFilter.doFilter(SmooksSAXFilter.java:97)
at org.milyn.delivery.sax.SmooksSAXFilter.doFilter(SmooksSAXFilter.java:64)
at org.milyn.Smooks._filter(Smooks.java:526)
at org.milyn.Smooks.filterSource(Smooks.java:482)
at ****.EdiToXml.runSmooksTransform(EdiToXml.java:41)
at com.***.***.EdiToXml.main(EdiToXml.java:58)
Caused by: org.milyn.edisax.EDIParseException: EDI message processing failed [EDIFACT-BIF-TO-XML][1.0]. Must be a minimum of 1 instances of segment [UNH]. Currently at segment number 1.
at org.milyn.edisax.EDIParser.mapSegments(EDIParser.java:504)
at org.milyn.edisax.EDIParser.mapSegments(EDIParser.java:453)
at org.milyn.edisax.EDIParser.parse(EDIParser.java:428)
at org.milyn.edisax.EDIParser.parse(EDIParser.java:386)
at org.milyn.smooks.edi.EDIReader.parse(EDIReader.java:111)
at org.milyn.delivery.sax.SAXParser.parse(SAXParser.java:76)
at org.milyn.delivery.sax.SmooksSAXFilter.doFilter(SmooksSAXFilter.java:86)
... 5 more
The message was valid and the XML mapping was good. I was just not using the optimal method for reading and writing the message.
I came to realize that the filterSource method of Smooks can be fed an InputStream and an OutputStream directly. Below is the piece of code that let the program run efficiently without hitting the Java memory error.
// Instantiate a FileInputStream
FileInputStream inputStream = new FileInputStream(inputFileName);

// Instantiate a FileOutputStream
FileOutputStream outputStream = new FileOutputStream(outputFileName);

try {
    // Filter the input message straight into the output stream...
    smooks.filterSource(new StreamSource(inputStream), new StreamResult(outputStream));
    Locale.setDefault(defaultLocale);
} finally {
    smooks.close();
    inputStream.close();
    outputStream.close();
}
Thanks to the community.
Regards.
I'm the original author of Smooks and that Edifact parsing stuff. Jason emailed me asking for advice on this, but I haven't been involved in it for a number of years now, so I'm not sure how helpful I'd be.
Smooks doesn't read the full message into memory. It streams it through a parser that converts it into a stream of SAX events, making it "look like" XML to anything downstream of it. If those events are then used to build a big Java object model in memory, that might result in OOM errors etc.
Looking at the Exception message, it simply looks like the EDIFACT input doesn’t match the definition file being used.
Caused by: org.milyn.edisax.EDIParseException: EDI message processing failed [EDIFACT-BIF-TO-XML][1.0]. Must be a minimum of 1 instances of segment [UNH]. Currently at segment number 1.
Those EDIFACT definition files were originally generated directly from the definitions published by the EDIFACT group, but I do remember that many people “tweak” the message formats, which seems like what might be happening here (and hence the above error). One solution to that would be to tweak the pre-generated definitions to match.
I know that a lot of changes have been made in Smooks in this area in the last year or two (using Apache Daffodil for the definitions) but I wouldn’t be the best person to talk about that. You can try the Smooks mailing list for help on that.
The streaming data I am getting from Kafka is the path of an HDFS file, and I need to get the data of that file.
batchInputDStream.map(new Function<Tuple2<String, String>, String>() {
    @Override
    public String call(Tuple2<String, String> arg0) throws Exception {
        StringReader reader = new StringReader(arg0._2);
        JAXBContext jaxbContext = JAXBContext.newInstance(FreshBatchInput.class);
        Unmarshaller jaxbUnmarshaller = jaxbContext.createUnmarshaller();
        FreshBatchInput input = (FreshBatchInput) jaxbUnmarshaller.unmarshal(reader);
        return input.getPath();
    }
});
Here input.getPath() is the HDFS path of the file.
There is no option to collect a JavaDStream object, otherwise I would have used that: first collecting the data and then reading the data from the file.
I am not able to create a new RDD inside the map function; it gives the error Task not serializable.
Is there any other option?
You can use foreachRDD. It is executed on the driver, so RDD actions are allowed:
transformed.foreachRDD(rdd -> {
    String inputPath = doSomethingWithRDD(rdd);   // extract the HDFS path from this batch
    JavaSparkContext jsc = JavaSparkContext.fromSparkContext(rdd.context());
    JavaRDD<String> fileData = jsc.textFile(inputPath);
    // ... work with fileData here ...
});
Remember that you cannot create RDDs inside transformations or actions; RDDs can be created only on the driver. A similar question with an example of foreachRDD is here. This means you cannot use the SparkContext inside map, filter, or foreachPartition.
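For completeness, a slightly fuller sketch of the same idea; it assumes the mapped DStream already holds the extracted HDFS paths, and the names are illustrative rather than taken from the original code:

// 'paths' stands for the JavaDStream<String> produced by the map(...) call above
paths.foreachRDD(rdd -> {
    // foreachRDD runs on the driver, so RDD actions such as collect() are allowed here
    JavaSparkContext jsc = JavaSparkContext.fromSparkContext(rdd.context());
    for (String hdfsPath : rdd.collect()) {
        JavaRDD<String> fileData = jsc.textFile(hdfsPath);
        // ... process the contents of the file the Kafka message pointed to ...
    }
});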
Novice in HDFS and Hadoop:
I am developing a program that should get all the files of a specific directory, where we can find several small files of any type.
It should get every file and append it to a compressed SequenceFile, where the key is the path of the file and the value is the file's content. For now my code is:
import java.net.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.io.compress.BZip2Codec;
public class Compact {
    public static void main(String[] args) throws Exception {
        try {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(new URI("hdfs://quickstart.cloudera:8020"), conf);
            Path destino = new Path("/user/cloudera/data/testPractice.seq"); //test args[1]

            if ((fs.exists(destino))) {
                System.out.println("exist : " + destino);
                return;
            }

            BZip2Codec codec = new BZip2Codec();
            SequenceFile.Writer outSeq = SequenceFile.createWriter(conf
                    , SequenceFile.Writer.file(fs.makeQualified(destino))
                    , SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, codec)
                    , SequenceFile.Writer.keyClass(Text.class)
                    , SequenceFile.Writer.valueClass(FSDataInputStream.class));

            FileStatus[] status = fs.globStatus(new Path("/user/cloudera/data/*.txt")); //args[0]
            for (int i = 0; i < status.length; i++) {
                FSDataInputStream in = fs.open(status[i].getPath());
                outSeq.append(new org.apache.hadoop.io.Text(status[i].getPath().toString()), new FSDataInputStream(in));
                fs.close();
            }
            outSeq.close();
            System.out.println("End Program");
        } catch (Exception e) {
            System.out.println(e.toString());
            System.out.println("File not found");
        }
    }
}
But after every execution I receive this exception:
java.io.IOException: Could not find a serializer for the Value class: 'org.apache.hadoop.fs.FSDataInputStream'. Please ensure that the configuration 'io.serializations' is properly configured, if you're using custom serialization.
File not found
I understand the error must be in the type I am using for the value I add to the SequenceFile, but I don't know which one I should use. Can anyone help me?
FSDataInputStream, like any other InputStream, is not intended to be serialized. What should serializing an "iterator" over a stream of bytes even do?
What you most likely want to do is store the content of the file as the value. For example, you can switch the value type from FSDataInputStream to BytesWritable and just get all the bytes out of the FSDataInputStream. One drawback of using a key/value SequenceFile for such a purpose is that the content of each file has to fit in memory. That is fine for small files, but you have to be aware of this issue.
I am not sure what you are really trying to achieve, but perhaps you could avoid reinventing the wheel by using something like Hadoop Archives?
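To make the suggestion concrete, a rough sketch (the writer's value class also has to be declared as BytesWritable; the buffer sizing assumes each file fits in memory, as noted above):

// declare the value class accordingly when creating the writer:
//   SequenceFile.Writer.valueClass(BytesWritable.class)
for (FileStatus stat : fs.globStatus(new Path("/user/cloudera/data/*.txt"))) {
    byte[] content = new byte[(int) stat.getLen()];
    FSDataInputStream in = fs.open(stat.getPath());
    try {
        in.readFully(content);   // read the whole file into the buffer
    } finally {
        in.close();
    }
    outSeq.append(new Text(stat.getPath().toString()), new BytesWritable(content));
}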
Thanks a lot for your comments. The problem was the serializer, as you said, and finally I used BytesWritable:
FileStatus[] status = fs.globStatus(new Path("/user/cloudera/data/*.txt")); //args[0]
for (int i = 0; i < status.length; i++) {
    FSDataInputStream in = fs.open(status[i].getPath());
    byte[] content = new byte[(int) fs.getFileStatus(status[i].getPath()).getLen()];
    in.readFully(content);   // read the whole file into the buffer
    in.close();
    outSeq.append(new org.apache.hadoop.io.Text(status[i].getPath().toString()),
            new org.apache.hadoop.io.BytesWritable(content));
}
outSeq.close();
Probably there are other, better solutions in the Hadoop ecosystem, but this problem was an exercise for a degree I am working on, and for now we are reinventing the wheel to understand the concepts ;-).
I have a Pig script that ends by storing its contents in a text file.
STORE foo into 'outputLocation';
During a completely different job I want to read lines of this file and parse them back into tuples. The data in foo might contain chararrays with the characters used when Pig saves bags/tuples, like { } ( ) , etc. I can read the previously saved file using code like:
FileSystem fs = FileSystem.get(UDFContext.getUDFContext().getJobConf());
FileStatus[] fileStatuses = fs.listStatus(new Path("outputLocation"));

for (FileStatus fileStatus : fileStatuses) {
    if (fileStatus.getPath().getName().contains("part")) {
        DataInputStream in = fs.open(fileStatus.getPath());
        String line;
        while ((line = in.readLine()) != null) {
            // Do stuff
        }
    }
}
Now where // Do stuff is, I'd like to parse my String back into a Tuple. Is this possible, and does Pig provide an API for it? The closest I could find is the StorageUtil class's textToTuple function, but that just makes a Tuple containing one DataByteArray. I want a tuple containing the bags, tuples, and chararrays it originally had, so I can refetch the original fields easily. I can change the StoreFunc I save the original file with, if that helps.
This is the plain Pig solution, without using JSON or a UDF. I found it out the hard way.
import org.apache.pig.ResourceSchema.ResourceFieldSchema;
import org.apache.pig.builtin.Utf8StorageConverter;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.newplan.logical.relational.LogicalSchema;
import org.apache.pig.impl.util.Utils;
Let's say your string to be parsed is this:
String tupleString = "(quick,123,{(brown,1.0),(fox,2.5)})";
First, parse your schema string. Note that you have an enclosing tuple.
LogicalSchema schema = Utils.parseSchema("a0:(a1:chararray, a2:long, a3:{(a4:chararray, a5:double)})");
Then parse your tuple with your schema.
Utf8StorageConverter converter = new Utf8StorageConverter();
ResourceFieldSchema fieldSchema = new ResourceFieldSchema(schema.getField("a0"));
Tuple tuple = converter.bytesToTuple(tupleString.getBytes("UTF-8"), fieldSchema);
Voila! Check your data.
assertEquals((String) tuple.get(0), "quick");
assertEquals(((DataBag) tuple.get(2)).size(), 2L);
I would just output the data in JSON format. Pig has native support for parsing JSON into tuples. It would save you from having to write a UDF.