Deserialize Avro Data from bytes - java

I am trying to deserialize, i.e., get an object of class org.apache.avro.generic.GenericRecord, from byte-array Avro data. This data contains a header with the full schema.
So far, I have tried this:
public List<GenericRecord> deserializeGenericWithSchema(byte[] message) throws IOException {
    List<GenericRecord> listOfRecords = new ArrayList<>();
    DatumReader<GenericRecord> reader = new GenericDatumReader<>();
    DataFileReader<GenericRecord> fileReader =
            new DataFileReader<>(new SeekableByteArrayInput(message), reader);
    GenericRecord record = null;
    while (fileReader.hasNext()) {
        listOfRecords.add(fileReader.next(record));
    }
    return listOfRecords;
}
But I am getting an error:
java.io.IOException: Invalid int encoding
    at org.apache.avro.io.BinaryDecoder.readInt(BinaryDecoder.java:145)
    at org.apache.avro.io.BinaryDecoder.readBytes(BinaryDecoder.java:282)
    at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:112)
    at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
However, if I write the byte array message to disk and change my function like this:
public List<GenericRecord> deserializeGenericWithSchema(String fileName) throws IOException {
    File file = new File(fileName);
    List<GenericRecord> listOfRecords = new ArrayList<>();
    DatumReader<GenericRecord> reader = new GenericDatumReader<>();
    DataFileReader<GenericRecord> fileReader =
            new DataFileReader<>(file, reader);
    GenericRecord record = null;
    while (fileReader.hasNext()) {
        listOfRecords.add(fileReader.next(record));
    }
    return listOfRecords;
}
It works flawlessly. I really don't want to write every Avro message I get to disk, because this is intended to work in real time.
What am I doing wrong in my first approach?

Do you have any follow-up on the issue? My assumption is an encoding issue. Where did the byte[] come from? Is it the exact byte[] you are writing to disk? Maybe the explanation lies in the default encoding settings of the file writer and reader.
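For reference, the SeekableByteArrayInput approach in the first snippet is sound as long as the byte[] holds the raw, untouched Avro container bytes. Here is a minimal round-trip sketch under that assumption (the schema and record contents are made up for illustration):

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.file.SeekableByteArrayInput;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroRoundTrip {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema, just for the demonstration
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Demo\",\"fields\":"
                + "[{\"name\":\"msg\",\"type\":\"string\"}]}");

        // Write a container file (header + schema + data) into memory
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<>(schema))) {
            writer.create(schema, out);
            GenericRecord r = new GenericData.Record(schema);
            r.put("msg", "hello");
            writer.append(r);
        }
        byte[] message = out.toByteArray();

        // Read it back exactly as in the question; this works because the
        // bytes were never passed through a String or re-encoded
        try (DataFileReader<GenericRecord> reader = new DataFileReader<>(
                new SeekableByteArrayInput(message), new GenericDatumReader<>())) {
            while (reader.hasNext()) {
                System.out.println(reader.next());
            }
        }
    }
}

If this round trip works but your real message fails with the same "Invalid int encoding" error, the bytes were most likely corrupted upstream, e.g. by passing them through a String or a text-mode transport.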

Related

Strings in download file contain weird symbols

I've got a list of Strings that contains the content for a downloadable file. I am converting it to a stream for the download, but there are some random values in the downloaded file. I don't know whether it is due to the encoding and, if so, how I can change it.
var downloadButton = new DownloadLink(btn, "test.csv", () -> {
    try {
        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
        ObjectOutputStream objectOutputStream = new ObjectOutputStream(byteArrayOutputStream);
        for (int i = 0; i < downloadContent.size(); i++) {
            objectOutputStream.writeUTF(downloadContent.get(i));
        }
        objectOutputStream.flush();
        objectOutputStream.close();
        byte[] byteArray = byteArrayOutputStream.toByteArray();
        return new ByteArrayInputStream(byteArray);
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
});
This is the DownloadLink class.
public class DownloadLink extends Anchor {
    public DownloadLink(Button button, String fileName, InputStreamFactory fileProvider) {
        super(new StreamResource(fileName, fileProvider), "");
        getElement().setAttribute("download", fileName);
        add(button);
        getStyle().set("display", "contents");
    }
}
This is the output file (screenshot of the garbled content omitted).
ObjectOutputStream is part of the Java serialization system. In addition to the data itself, it also includes metadata about the original Java types. It's only intended for writing data that will later be read back using ObjectInputStream.
To create a file for others to download, you could instead use a PrintWriter that wraps the original output stream. On the other hand, since you're using the output stream to create a byte[], a more straightforward, but slightly less efficient, way would be to concatenate all the list elements into one String and call getBytes(StandardCharsets.UTF_8) on it to get a byte array directly.
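A minimal sketch of that second suggestion, assuming downloadContent is the list from the question and that the lines should be newline-separated (the separator is my assumption):

var downloadButton = new DownloadLink(btn, "test.csv", () -> {
    // Join the lines and encode them as plain UTF-8 text; no Java
    // serialization metadata ends up in the file this way
    String content = String.join("\n", downloadContent);
    return new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8));
});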

Smooks : return an OutputStream

I am currently writing a Java application that reads an EDI file and returns an OutputStream, using the Smooks library for this purpose. I am struggling to return the output stream and use it without exhausting memory. The goal of the output stream is to let users convert it into an InputStream and use it for object creation, writing files, pushing to a DB, etc. I would really appreciate it if somebody with considerable insight could tell me what I am doing wrong. Thanks in advance.
public class EdiToXml {
    private static final int headerBufferSize = 100;
    private static final byte[] buf = new byte[headerBufferSize];
    private static Smooks smooks;
    private static final String headerVersion1 = "IFLIRR\u001F15\u001F2\u001F1A";
    private static StreamSource stream;

    protected static ByteArrayOutputStream TransformBifToJava(FileInputStream inputStream) throws IOException, SAXException, SmooksException {
        Locale defaultLocale = Locale.getDefault();
        Locale.setDefault(new Locale("en", "EN"));
        // Creating a BufferedInputStream
        BufferedInputStream bufferedInputStream = new BufferedInputStream(inputStream);
        // Marking the bufferedInputStream; the read limit must cover the bytes
        // read before reset() (mark(0) would invalidate the mark)
        bufferedInputStream.mark(headerBufferSize);
        // Obtaining the first 100 bytes to detect the file version
        bufferedInputStream.read(buf);
        String value = new String(buf);
        if (value.indexOf(headerVersion1) >= 0) { // indexOf returns 0 for a match at the very start
            // Instantiate Smooks with the config for 15.2.1A
            smooks = new Smooks("smooks-config.xml");
        }
        bufferedInputStream.reset();
        stream = new StreamSource(bufferedInputStream);
        try {
            return Parse1(defaultLocale, smooks, stream);
        } finally {
            bufferedInputStream.close();
            inputStream.close();
        }
    }

    protected static ByteArrayOutputStream Parse1(Locale locale, Smooks smooks, StreamSource streamSource) throws IOException, SAXException, SmooksException {
        try {
            ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
            // Create an exec context - no profiles....
            ExecutionContext executionContext = smooks.createExecutionContext();
            // Filter the input message
            smooks.filterSource(executionContext, streamSource, new StreamResult(byteArrayOutputStream));
            Locale.setDefault(locale);
            System.out.println(byteArrayOutputStream.size());
            return byteArrayOutputStream;
        } finally {
            smooks.close();
        }
    }

    public static void main(String[] args) throws IOException, SAXException, SmooksException {
        ByteArrayOutputStream byteArrayOutputStream = EdiToXml.TransformBifToJava(new FileInputStream("xxxx/BifInputFile.DATA"));
        InputStream is = new ByteArrayInputStream(byteArrayOutputStream.toByteArray());
        byteArrayOutputStream.close();
        int b = is.read();
        while (b != -1) {
            System.out.printf("%c", b);
            b = is.read();
        }
        is.close();
        System.out.println("======================================\n\n");
        System.out.print("Finished");
        System.out.println("======================================\n\n");
    }
}
Exception in thread "main" org.milyn.SmooksException: Smooks Filtering operation failed.
at org.milyn.Smooks._filter(Smooks.java:548)
at org.milyn.Smooks.filterSource(Smooks.java:482)
at com.maureva.xfunctional.EdiToXml.Parse1(EdiToXml.java:102)
at com.maureva.xfunctional.EdiToXml.TransformBifToJava(EdiToXml.java:86)
at com.maureva.xfunctional.EdiToXml.main(EdiToXml.java:173)
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
at org.milyn.delivery.sax.SAXHandler.flushCurrentWriter(SAXHandler.java:503)
at org.milyn.delivery.sax.SAXHandler.endElement(SAXHandler.java:234)
at org.milyn.delivery.SmooksContentHandler.endElement(SmooksContentHandler.java:96)
at org.milyn.edisax.EDIParser.endElement(EDIParser.java:897)
at org.milyn.edisax.EDIParser.endElement(EDIParser.java:883)
at org.milyn.edisax.EDIParser.mapComponent(EDIParser.java:693)
at org.milyn.edisax.EDIParser.mapField(EDIParser.java:636)
at org.milyn.edisax.EDIParser.mapFields(EDIParser.java:603)
at org.milyn.edisax.EDIParser.mapSegment(EDIParser.java:564)
at org.milyn.edisax.EDIParser.mapSegments(EDIParser.java:535)
at org.milyn.edisax.EDIParser.mapSegments(EDIParser.java:453)
at org.milyn.edisax.EDIParser.mapSegment(EDIParser.java:566)
at org.milyn.edisax.EDIParser.mapSegments(EDIParser.java:535)
at org.milyn.edisax.EDIParser.mapSegments(EDIParser.java:453)
at org.milyn.edisax.EDIParser.mapSegment(EDIParser.java:566)
at org.milyn.edisax.EDIParser.mapSegments(EDIParser.java:535)
at org.milyn.edisax.EDIParser.mapSegments(EDIParser.java:453)
at org.milyn.edisax.EDIParser.mapSegment(EDIParser.java:566)
at org.milyn.edisax.EDIParser.mapSegments(EDIParser.java:535)
at org.milyn.edisax.EDIParser.mapSegments(EDIParser.java:453)
at org.milyn.edisax.EDIParser.mapSegment(EDIParser.java:566)
at org.milyn.edisax.EDIParser.mapSegments(EDIParser.java:535)
at org.milyn.edisax.EDIParser.mapSegments(EDIParser.java:453)
Process finished with exit code 1
I suggest you upgrade to the latest version of Smooks (v2.0.0-RC1), given that the EDI cartridge has been completely overhauled. The app is running out of memory because you're writing to a java.io.ByteArrayOutputStream, which keeps the written bytes in memory. I haven't understood what you're trying to accomplish: the things you mentioned, like creating objects, writing to files, and saving to a database, can be done from within Smooks.
If you only want to use Smooks for converting the EDI into XML, then you should write the result to an output stream that doesn't keep the data in memory, like a FileOutputStream, or implement your own OutputStream should you want to do something funky with the result. Having said this, it doesn't make much sense to me to use Smooks only for transforming the input into XML.
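If a consumer really does need an InputStream, one option (my sketch, not part of the original answer) is to connect the Smooks result to a pipe, so only a small buffer is ever held in memory; smooks and streamSource are assumed to be set up as in the question:

// Sketch: stream the Smooks output through a pipe instead of collecting
// it in a ByteArrayOutputStream. The producer runs on its own thread
// because a pipe blocks once its internal buffer fills up.
PipedInputStream in = new PipedInputStream();
PipedOutputStream out = new PipedOutputStream(in);

new Thread(() -> {
    try {
        ExecutionContext ctx = smooks.createExecutionContext();
        smooks.filterSource(ctx, streamSource, new StreamResult(out));
    } finally {
        try {
            out.close(); // signals end-of-stream to the consumer
        } catch (IOException ignored) {
        }
    }
}).start();

// The consumer can now read `in` incrementally: unmarshal objects,
// copy it to a file, stream it to a database, and so on.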

Convert InputStream from ISO-8859-1 to UTF-8

I have a file in ISO-8859-1 containing German umlauts, and I need to unmarshal it using JAXB. But first I need the content in UTF-8.
@Override
public List<Usage> convert(InputStream input) {
    try {
        InputStream inputWithNamespace = addNamespaceIfMissing(input);
        inputWithNamespace = convertFileToUtf(inputWithNamespace);
        ORDR order = xmlUnmarshaller.unmarshall(inputWithNamespace, ORDR.class);
        ...
I get the "file" as an InputStream. My idea was to read the file's content, convert it to UTF-8, and build another InputStream to use. This is what I've tried:
private InputStream convertFileToUtf(InputStream inputStream) throws IOException {
    byte[] bytesInIso = ByteStreams.toByteArray(inputStream);
    String stringIso = new String(bytesInIso);
    byte[] bytesInUtf = new String(bytesInIso, ISO_8859_1).getBytes(UTF_8);
    String stringUtf = new String(bytesInUtf);
    return new ByteArrayInputStream(bytesInUtf);
}
I have those two Strings to check the contents, but even just reading the ISO file gives question marks (?) where the umlauts are, and converting that to UTF-8 gives strange characters like 1/2 and so on.
UPDATE
byte[] bytesInIso = ByteStreams.toByteArray(inputWithNamespace);
String contentInIso = new String(bytesInIso);
byte[] bytesInUtf = new String(bytesInIso, ISO_8859_1).getBytes(UTF_8);
String contentInUtf = new String(bytesInUtf);
Verifying contentInIso prints question marks instead of the umlauts, and checking contentInUtf shows characters like "�" instead of the umlauts.
@Override
public List<Usage> convert(InputStream input) {
    try {
        InputStream inputWithNamespace = addNamespaceIfMissing(input);
        byte[] bytesInIso = ByteStreams.toByteArray(inputWithNamespace);
        String contentInIso = new String(bytesInIso);
        byte[] bytesInUtf = new String(bytesInIso, ISO_8859_1).getBytes(UTF_8);
        String contentInUtf = new String(bytesInUtf);
        ORDR order = xmlUnmarshaller.unmarshall(inputWithNamespace, ORDR.class);
This convert method is called by another one, processUsageFile:
private void processUsageFile(File usageFile) {
    try (FileInputStream fileInputStream = new FileInputStream(usageFile)) {
        usageImporterService.importUsages(usageFile.getName(), fileInputStream, getUsageTypeValidated(usageFile.getName()));
        log.info("Usage file {} imported successfully. Moving to archive directory", usageFile.getName());
If I take the code written under the UPDATE heading and put it immediately after the try, the first contentInIso has question marks but contentInUtf has the umlauts. Then, inside convert, JAXB throws an exception saying the file has a premature end of file.
Regarding the behaviour you are getting:
String stringIso = new String(bytesInIso);
In this step, you construct a new String by decoding the specified array of bytes using the platform's default charset.
Since the platform default is probably not ISO-8859-1, the String you are looking at becomes garbled at this step.
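A minimal sketch of the conversion with the charsets made explicit (the body reuses the names from the question; the StandardCharsets constants are from the JDK). Note also that ByteStreams.toByteArray consumes inputWithNamespace, so the unmarshaller must be given the new stream returned here, not the exhausted original; that would explain the "premature end of file" error:

private InputStream convertFileToUtf(InputStream inputStream) throws IOException {
    // Decode with the charset the bytes are actually in, then re-encode
    // as UTF-8; never let the platform default charset get involved
    byte[] bytesInIso = ByteStreams.toByteArray(inputStream);
    String text = new String(bytesInIso, StandardCharsets.ISO_8859_1);
    return new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
}

One caveat: if the XML carries an encoding="ISO-8859-1" declaration, it will now disagree with the re-encoded bytes, so it may be simpler to skip the conversion entirely and let JAXB honour the declared encoding of the original stream.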

How to deserialize avro files

I would like to read an HDFS folder containing Avro files with Spark, and then deserialize the Avro events contained in these files. I would like to do it without the com.databricks library (or any other library that makes this easy).
The problem is that I have difficulties with the deserialization.
I assume that my Avro file is compressed with Snappy because at the beginning of the file (just after the schema) I have
avro.codecsnappy
written, followed by readable or unreadable characters.
My first attempt to deserialize the Avro event is the following:
public static String deserialize(String message) throws IOException {
    Schema.Parser schemaParser = new Schema.Parser();
    Schema avroSchema = schemaParser.parse(defaultFlumeAvroSchema);
    DatumReader<GenericRecord> specificDatumReader = new SpecificDatumReader<GenericRecord>(avroSchema);
    byte[] messageBytes = message.getBytes();
    Decoder decoder = DecoderFactory.get().binaryDecoder(messageBytes, null);
    GenericRecord genericRecord = specificDatumReader.read(null, decoder);
    return genericRecord.toString();
}
This function works when I want to deserialize an Avro file that doesn't have avro.codecsnappy in it. When it does, I get the error:
Malformed data : length is negative : -50
So I tried another way of doing it which is :
private static void deserialize2(String path) throws IOException {
    DatumReader<GenericRecord> reader = new GenericDatumReader<>();
    DataFileReader<GenericRecord> fileReader =
            new DataFileReader<>(new File(path), reader);
    System.out.println(fileReader.getSchema().toString());
    GenericRecord record = new GenericData.Record(fileReader.getSchema());
    int numEvents = 0;
    while (fileReader.hasNext()) {
        fileReader.next(record);
        ByteBuffer body = (ByteBuffer) record.get("body");
        CharsetDecoder decoder = Charsets.UTF_8.newDecoder();
        System.out.println("Position of the index " + body.position());
        System.out.println("Size of the array : " + body.array().length);
        String bodyStr = decoder.decode(body).toString();
        System.out.println("THE BODY STRING ---> " + bodyStr);
        numEvents++;
    }
    fileReader.close();
}
and it returns the following output:
Position of the index 0
Size of the array : 127482
THE BODY STRING --->
I can see that the array isn't empty, but it just returns an empty string.
How can I proceed ?
Use this when converting to string:
String bodyStr = new String(body.array());
System.out.println("THE BODY STRING ---> " + bodyStr);
Source: https://www.mkyong.com/java/how-do-convert-byte-array-to-string-in-java/
Well, it seems that you are on the right track. However, your ByteBuffer might not have a proper byte[] array to decode, so let's try the following instead:
byte[] bytes = new byte[body.remaining()];
body.get(bytes);
String result = new String(bytes, "UTF-8"); // maybe you need to change the charset
This should work. You have shown in your question that the ByteBuffer contains actual data; as noted in the code example, you might have to change the charset.
List of charsets: https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html
Also useful: https://docs.oracle.com/javase/7/docs/api/java/nio/ByteBuffer.html
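Combining the two answers, a small helper (my sketch) that copies only the remaining bytes and leaves the original buffer's position untouched:

// Sketch: extract the remaining bytes of the "body" ByteBuffer safely.
// duplicate() shares the content but has its own position/limit, so the
// original buffer can still be read elsewhere afterwards.
static String bodyToString(ByteBuffer body) {
    ByteBuffer dup = body.duplicate();
    byte[] bytes = new byte[dup.remaining()];
    dup.get(bytes);
    return new String(bytes, StandardCharsets.UTF_8); // charset may need changing
}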

Getting different byte array than written to file when reading from file

I'm writing my byte array to a file:
PrintWriter pw = new PrintWriter(new FileOutputStream(fileOutput, true));
pw.write(new String(cryptogram, Charset.defaultCharset()));
pw.close();
Then, I am reading it from the file like this:
String cryptogramString = new String();
while (scPriv.hasNext()) {
    linePriv = scPriv.nextLine();
    cryptogramString += linePriv;
}
But I don't know how to make a byte[] from cryptogramString. I'm trying this:
byte[] b = cryptogramString.getBytes(Charset.defaultCharset());
System.out.println(Arrays.toString(b));
System.out.println(Arrays.toString(cryptogram));
But it doesn't return the same values. Does anyone have an idea how to make this right?
You should decide whether you are writing text or binary.
Encrypted data is always binary, which means you shouldn't be using the Reader/Writer/String classes.
try (FileOutputStream out = new FileOutputStream(filename)) {
    out.write(bytes);
}
To read it back in:
byte[] bytes = new byte[(int) new File(filename).length()];
try (FileInputStream in = new FileInputStream(filename)) {
    in.read(bytes); // note: read() may fill fewer bytes; loop or use readAllBytes() in production
}
I have a file that contains XML and then plain text, so I can't read the file as a whole.
You also can't write binary into a text file. You can encode it using base64.
Storing base64 data in XML?
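A minimal sketch of that Base64 route with java.util.Base64 (fileName is a stand-in for wherever the encoded text should go):

// Encode the binary cryptogram as Base64 text so it can safely live in
// (or alongside) a text file, then decode it to get the exact bytes back
String encoded = Base64.getEncoder().encodeToString(cryptogram);
Files.writeString(Path.of(fileName), encoded, StandardCharsets.US_ASCII);

String readBack = Files.readString(Path.of(fileName), StandardCharsets.US_ASCII);
byte[] restored = Base64.getDecoder().decode(readBack);
// Arrays.equals(cryptogram, restored) is now true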
