java.io.IOException: Not a data file while reading Avro from file - java

The following code is used to serialize the data.
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
BinaryEncoder binaryEncoder = EncoderFactory.get().binaryEncoder(byteArrayOutputStream, null);
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(data.getSchema());
datumWriter.setSchema(data.getSchema());
datumWriter.write(data, binaryEncoder);
binaryEncoder.flush();
byteArrayOutputStream.close();
result = byteArrayOutputStream.toByteArray();
I used the following command
FileUtils.writeByteArrayToFile(new File("D:/sample.avro"), result);
to write avro byte array to a file. But when I try to read the same using
File file = new File("D:/sample.avro");
try {
    dataFileReader = new DataFileReader(file, datumReader);
} catch (IOException exp) {
    System.out.println(exp);
    System.exit(1);
}
it throws exception
java.io.IOException: Not a data file.
at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:105)
at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:89)
What is the problem here? I referred to two other similar Stack Overflow questions (this and this), but they haven't been of much help to me. Can someone help me understand this?

The actual data is encoded in the Avro binary format, but typically what's passed around is more than just the encoded data.
What most people think of as an "Avro file" is a format that includes a header (which contains things like the writer schema) followed by the actual data: https://avro.apache.org/docs/current/spec.html#Object+Container+Files. The first four bytes of an Avro data file should be the ASCII characters "Obj" followed by the byte 0x01, i.e. 0x4F626A01. The error you are getting is because the binary you are trying to read as a data file doesn't start with these standard magic bytes.
Another standard format is the single object encoding: https://avro.apache.org/docs/current/spec.html#single_object_encoding. This type of binary format should start with 0xC301.
But if I had to guess, the binary you have could just be the raw serialized data without any sort of header information. Though it's hard to know for sure without knowing how the byte array that you have was created.
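If it helps to diagnose which case you are in, here is a minimal sketch (the file path is taken from your snippet) that prints the first four bytes of the file so you can compare them against the container-file magic (4F 62 6A 01) and the single-object marker (C3 01):

import java.io.FileInputStream;
import java.io.IOException;

public class AvroMagicCheck {
    public static void main(String[] args) throws IOException {
        try (FileInputStream in = new FileInputStream("D:/sample.avro")) {
            byte[] head = new byte[4];
            int read = in.read(head);
            for (int i = 0; i < read; i++) {
                // container files start with 4F 62 6A 01 ("Obj" + 0x01)
                System.out.printf("%02X ", head[i]);
            }
            System.out.println();
        }
    }
}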

You'd need to use Avro to write the data as well as to read it; otherwise the container header with the schema isn't written, hence the "Not a data file" message. (See: https://cwiki.apache.org/confluence/display/AVRO/FAQ#FAQ-HowcanIserializedirectlyto/fromabytearray?)
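A minimal sketch of that, assuming the same data record, schema and D:/sample.avro path as in your snippet (and the usual org.apache.avro.file, org.apache.avro.generic and org.apache.avro.io imports):

// Write an object container file (header + schema + data blocks).
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(data.getSchema());
try (DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<>(datumWriter)) {
    fileWriter.create(data.getSchema(), new File("D:/sample.avro"));
    fileWriter.append(data);
}

// Read it back; the reader picks up the writer schema from the file header.
DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
try (DataFileReader<GenericRecord> fileReader = new DataFileReader<>(new File("D:/sample.avro"), datumReader)) {
    while (fileReader.hasNext()) {
        System.out.println(fileReader.next());
    }
}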
If you're just looking to serialize an object, see: https://mkyong.com/java/how-to-read-and-write-java-object-to-a-file/

Related

Trying to read a serialized Java object that I did not create

I'm trying to create a web GUI for a Minecraft game server I run. The data I'm trying to read is from CoreProtect, a logging plugin.
I'm mainly using PHP, and I'm trying to write a small Java service that can convert the serialized data into a JSON string I can then use, since I can't deserialize a Java object directly in PHP. It's only some metadata that's stored as a Java serialized object; the rest is normal non-BLOB columns.
I've identified that the CoreProtect plugin uses ObjectOutputStream to serialize the object and then writes it to a MySQL BLOB field. This is the code from CoreProtect that handles this:
try {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    ObjectOutputStream oos = new ObjectOutputStream(bos);
    oos.writeObject(data);
    oos.flush();
    oos.close();
    bos.close();
    byte[] byte_data = bos.toByteArray();
    preparedStmt.setInt(1, time);
    preparedStmt.setObject(2, byte_data);
    preparedStmt.executeUpdate();
    preparedStmt.clearParameters();
} catch (Exception e) {
    e.printStackTrace();
}
This is then outputting the bytes to the database. All of the rows in the database start with the same few characters (from what I've seen, this should be Java's 'magic' header). However, when trying to use the code below to deserialize the data, I receive an error stating that the header is corrupt: 'invalid stream header: C2ACC3AD'.
byte[] serializedData = ctx.bodyAsBytes();
ByteArrayInputStream bais = new ByteArrayInputStream(serializedData);
try {
    ObjectInputStream ois = new ObjectInputStream(bais);
    Object object = ois.readObject();
    Gson gson = new Gson();
    ctx.result(gson.toJson(object));
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
} catch (ClassNotFoundException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
I'm using Javalin as the web service and am just sending the raw output from the BLOB column to a POST route, then reading it with the bodyAsBytes method. I tried testing this by writing the BLOB column to a file and then copying the contents of the file into a test POST request using Postman. I've also tried using a PHP script to read directly from the DB and then send that as a POST request, and I just get the same error.
I've looked into this and everything points to corrupt data. However, the odd thing is that when triggering a 'restore' via the CoreProtect plugin, it correctly restores what it needs to, reading all of the relevant data from the database, including this column. From what I've seen in CoreProtect's JAR, it's just doing the same process with the InputStream method calls.
I'm not very familiar with Java and thought this would be a fairly simple process. Am I missing something here? I don't see anything in the CoreProtect plugin that might be overriding the stream header. It's unfortunately not open source, so I'm having to use a Java decompiler to see how it serializes the object so that I can then try to read it; I assume it's possible the decompiler is not accurately showing how this is serialized/deserialized.
My other thought was that maybe the 'magic' header changed between Java versions, although I couldn't confirm this online. I've also seen the specific header I'm receiving in some other similar posts, although those all lead to data corruption or using a different output stream.
I appreciate any help with this, it's not an essential feature but would be nice if I can read all of the data related to the logs generated by the server/plugin.
I understand the use case is niche, but hoping the issue/resolution is a general Java problem :).
Update: ctx is an instance of Javalin's Context class. Since I'm trying to send a raw POST request to Java, I needed some kind of web service, and Javalin looked easy/lightweight for what I needed. On the PHP side I'm just reading the column from the database and then using Guzzle to send a raw body with the result to the Javalin service that's running.
Something, apparently ctx, is treating the binary data as a String. bodyAsBytes() is converting that String to bytes using the String’s getBytes() method, which immediately corrupts the data.
The first two bytes of a serialized Java object stream are always AC ED. bodyAsBytes() is treating these as characters, namely U+00AC NOT SIGN and U+00ED LATIN SMALL LETTER I WITH ACUTE. When encoded in UTF-8, these two characters produce the bytes C2 AC C3 AD, which is what you're seeing.
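A minimal sketch reproducing that round trip (assuming the bytes were decoded as ISO-8859-1 somewhere along the way and then re-encoded as UTF-8):

import java.nio.charset.StandardCharsets;

public class MagicHeaderDemo {
    public static void main(String[] args) {
        byte[] magic = {(byte) 0xAC, (byte) 0xED};                      // Java serialization stream magic
        String asText = new String(magic, StandardCharsets.ISO_8859_1); // bytes wrongly treated as characters
        byte[] corrupted = asText.getBytes(StandardCharsets.UTF_8);     // re-encoded as UTF-8
        for (byte b : corrupted) {
            System.out.printf("%02X", b);                               // prints C2ACC3AD
        }
        System.out.println();
    }
}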
Solution: Do not treat your data as a String under any circumstances. Binary data should only be treated as a byte stream. (I’m well aware that it was common to store bytes in a string in C, but Java is not C, and String and byte[] are distinct types.)
If you update your question and state what the type of ctx is, we may be able to suggest an exact course of action.

Do I have any performance gain using the BufferedReader in this case?

I want to send a CSV file encoded in base64 from Client to Server, in order to parse it and use the data.
I want to get the InputStream directly from the Request object and pipe it to the reader used by the CSV parser.
Is there any performance or memory gain using this method?
Can the following code achieve this? I feel like there's something missing while decoding the content.
Is BufferedReader really needed in this example ?
/* Suppose I get a Base64 encoded CSV file from the client */
String csvContent = "Column 1;Column 2;Column 3\r\nValue 1;Value 2;Value 3\r\n";
ByteArrayInputStream inputStream = new ByteArrayInputStream(Base64.encodeBase64(csvContent.getBytes()));
/* retrieving the content UPDATED */
Base64InputStream b64InputStream = new Base64InputStream(inputStream, false);
/* Parsing the CSV content */
Reader reader = new BufferedReader(new InputStreamReader(b64InputStream));
CSVParser csvParser = new CSVParser(reader, FORMAT_EXCEL_FR);
/* printing results */
csvParser.forEach(record -> printRecord(record));
Update
I replaced the byte[] array with a Base64InputStream from org.apache.commons.codec
Probably not. A BufferedReader ... uses a buffer. It is commonly used when your data is not in Java memory yet (e.g. socket communication, reading data from a file, ...).
In your case, you are wrapping a byte[], which means that the data is already in memory. So there is no point in adding a buffer.
The javadoc describes a BufferedReader as follows:
Reads text from a character-input stream, buffering characters so as to provide for the efficient reading of characters, arrays, and lines.
Now, let's say for example you want to read the content of a file and check something byte by byte, so you do a lot of in.read() calls. In that case, a buffered reader will actually fetch those bytes in chunks internally.
So, basically, whenever it is more efficient to fetch data in chunks, use a BufferedReader.
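For instance, a minimal sketch of that byte-per-byte pattern (the file name is just a placeholder):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class BufferedReadDemo {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("data.txt"))) {
            int c;
            // Each read() returns one character, but the reader refills its
            // internal buffer in larger chunks, so the underlying file is not
            // hit once per character.
            while ((c = in.read()) != -1) {
                System.out.print((char) c);
            }
        }
    }
}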
Update
In response to your update. No, also in this case it's not necessary to add a BufferedReader. As Holger pointed out:
It's likely that the CSVParser does that already (i.e. buffering).
I checked the source code of the CSVParser, and look what's in the constructor.
public CSVParser(final Reader reader, final CSVFormat format, final long characterOffset, final long recordNumber)
        throws IOException {
    ...
    this.lexer = new Lexer(format, new ExtendedBufferedReader(reader));
    ...
}
It wraps some kind of buffered reader by default. So, there's no need to add one yourself.
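For completeness, here is a minimal sketch of the whole pipeline without an explicit BufferedReader; FORMAT_EXCEL_FR is approximated with CSVFormat.EXCEL plus a ';' delimiter, which is an assumption:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import org.apache.commons.codec.binary.Base64;
import org.apache.commons.codec.binary.Base64InputStream;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;

public class CsvStreamDemo {
    public static void main(String[] args) throws IOException {
        String csvContent = "Column 1;Column 2;Column 3\r\nValue 1;Value 2;Value 3\r\n";
        InputStream encoded = new ByteArrayInputStream(Base64.encodeBase64(csvContent.getBytes(StandardCharsets.UTF_8)));

        // Decode on the fly; CSVParser buffers internally, so no BufferedReader is added.
        Reader reader = new InputStreamReader(new Base64InputStream(encoded, false), StandardCharsets.UTF_8);
        try (CSVParser csvParser = new CSVParser(reader, CSVFormat.EXCEL.withDelimiter(';'))) {
            csvParser.forEach(record -> System.out.println(record));
        }
    }
}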

Avro doesn't provide backward compatibility

I need to send my data through a stream, so I chose Avro for data serialization and deserialization. But the existing implementation using Avro readers doesn't support backward compatibility. Writing serialized data to a file and reading it back from the file does support backward compatibility. How can I achieve backward compatibility without knowing the writer's schema? I found many Stack Overflow questions related to this, but I didn't find any solution for this issue. Can someone help me solve this?
Following is my serializer and deserializer methods.
public static byte[] serialize(String json, Schema schema) throws IOException {
    GenericDatumWriter<Object> writer = new GenericDatumWriter<>(schema);
    ByteArrayOutputStream output = new ByteArrayOutputStream();
    Encoder encoder = EncoderFactory.get().binaryEncoder(output, null);
    DatumReader<Object> reader = new GenericDatumReader<>(schema);
    Decoder decoder = DecoderFactory.get().jsonDecoder(schema, json);
    Object datum = reader.read(null, decoder);
    writer.write(datum, encoder);
    encoder.flush();
    output.flush();
    return output.toByteArray();
}

public static String deserialize(byte[] avro, Schema schema) throws IOException {
    GenericDatumReader<Object> reader = new GenericDatumReader<>(schema);
    Decoder decoder = DecoderFactory.get().binaryDecoder(avro, null);
    Object datum = reader.read(null, decoder);
    ByteArrayOutputStream output = new ByteArrayOutputStream();
    JsonEncoder encoder = EncoderFactory.get().jsonEncoder(schema, output);
    DatumWriter<Object> writer = new GenericDatumWriter<>(schema);
    writer.write(datum, encoder);
    encoder.flush();
    output.flush();
    return new String(output.toByteArray(), "UTF-8");
}
You may have to define what scope of backward compatibility you are looking for. Are you expecting new attributes to be added, or are you going to remove attributes? There are different options available to handle both of these scenarios.
As described on the Confluent blog, the addition of new attributes can be made backward compatible as long as you specify a default value for the new attribute. Something like below:
{"name": "size", "type": "string", "default": "XL"}
The other option is to specify the reader and writer schemas explicitly. But as described in your question, that doesn't seem to be the option you are looking for.
If you are planning to remove an attribute, you can continue to parse the attribute but stop using it in the application. Note that this has to happen for a definite period, and consumers must be given enough time to change their programs before you completely retire the attribute. Make sure to log a statement when the attribute is found even though it was no longer supposed to be sent (or better, send a notification to the client system with a warning).
Besides the above points, there is an excellent blog post that talks about backward/forward compatibility.
Backward compatibility means that you can encode data with an older schema and the data can still be decoded by a reader that knows the latest schema.
Explanation from Confluent's website
So in order to decode Avro data with backward compatibility, your reader needs access to the latest schema. This can be done for example using a Schema Registry.
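As a sketch of that approach (assuming the writer's schema can be looked up, e.g. from a schema registry, and passed in alongside your latest reader schema), GenericDatumReader accepts both schemas and performs the schema resolution:

public static String deserialize(byte[] avro, Schema writerSchema, Schema readerSchema) throws IOException {
    // Resolves differences between the schema the data was written with and the
    // schema the application expects (new fields need defaults, as noted above).
    GenericDatumReader<Object> reader = new GenericDatumReader<>(writerSchema, readerSchema);
    Decoder decoder = DecoderFactory.get().binaryDecoder(avro, null);
    Object datum = reader.read(null, decoder);
    ByteArrayOutputStream output = new ByteArrayOutputStream();
    JsonEncoder encoder = EncoderFactory.get().jsonEncoder(readerSchema, output);
    DatumWriter<Object> writer = new GenericDatumWriter<>(readerSchema);
    writer.write(datum, encoder);
    encoder.flush();
    return new String(output.toByteArray(), "UTF-8");
}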

How to convert a Java serialized file to a JSON file

I have a Java object already serialized and stored as a .ser file, but I want to convert it to a JSON file (.json format). This is because Java serialization seems to be inefficient for appending directly and further causes file corruption due to stream corruption errors. Is there an efficient way to convert this Java serialized file to JSON format?
You can read the .ser file with an ObjectInputStream, convert the object you get back to JSON using Gson, and write it to a .json file:
ObjectInputStream ins = new ObjectInputStream(new FileInputStream("c:\\student.ser"));
Student student = (Student) ins.readObject();
ins.close();
Gson gson = new Gson();
// convert the Java object to JSON format,
// returned as a JSON formatted string
String json = gson.toJson(student);
try {
    // write the converted JSON data to a file named "file.json"
    FileWriter writer = new FileWriter("c:\\file.json");
    writer.write(json);
    writer.close();
} catch (IOException e) {
    e.printStackTrace();
}
There is no standard way to do it in Java, and there is no silver bullet - there are a lot of libraries for this. I prefer Jackson: https://github.com/FasterXML/jackson
ObjectMapper mapper = new ObjectMapper();
// object == ??? read from *.ser
String s = mapper.writeValueAsString(object);
You can see a list of libraries for JSON serialization/deserialization (for Java and not only for Java) here: http://json.org/
This is because Java serialization seems to be inefficient for appending directly
Not sure if JSON is the answer for you. Could you share with us some examples of data and what manipulations you do with it?
You can try Google Protocol Buffers as an alternative to Java serialization and JSON.
My answer in the topic below gives an overview of what GPB is and how to use it, so you may check whether it suits you:
How to write/read binary files that represent objects?

ASN.1 to text file converter in Java

I have to fetch a file from an FTP server; the file is in binary ASN.1 format.
I need to convert it to a text file and parse the relevant data.
I am using JDK 1.7. I can also use a third-party JAR, but it should be license free.
An example would be much appreciated.
I would suggest using Bouncy Castle: http://www.bouncycastle.org/java.html
After fetching the file from the FTP server, for a quick check use the org.bouncycastle.asn1.util.ASN1Dump class:
ASN1InputStream stream = new ASN1InputStream(new ByteArrayInputStream(data));
ASN1Primitive object = stream.readObject();
System.out.println(ASN1Dump.dumpAsString(object));
This will print the structure of your file.
If you know the structure of your file, you will need to use a parser, for example:
ASN1InputStream stream = new ASN1InputStream(new ByteArrayInputStream(data));
DERApplicationSpecific application = (DERApplicationSpecific) stream.readObject();
ASN1Sequence sequence = (ASN1Sequence) application.getObject(BERTags.SEQUENCE);
Enumeration<?> sequenceEntries = sequence.getObjects();
while (sequenceEntries.hasMoreElements()) {
    ASN1Primitive object = (ASN1Primitive) sequenceEntries.nextElement();
    System.out.println(object);
}
By the way, ASN1Primitive is the base ASN.1 object read from a byte stream. It has plenty of subtypes (http://www.borelly.net/cb/docs/javaBC-1.4.8/prov/org/bouncycastle/asn1/ASN1Primitive.html) that you can cast to in order to get the right type.
