Rules regarding automatic base64 encoding of messages when submitting to SQS - Java

I am developing an application in which clients (written in multiple languages - Go, C++, Python, C#, Java, Perl and possibly more in the future) submit protobuf (and in some cases, JSON) messages to SQS. At the other end, the messages are read and decoded by Python and Go clients - depending on the message type. Boto seems to automatically encode the messages into base64, but other language libraries don't seem to do so. Or maybe there are some other rules?
Boto does have an option to submit raw messages.
What is the expected behavior here? Am I supposed to encode messages into base64 on my own - which makes boto an odd case - or am I missing something?
This has caused some subtle bugs in my application because of an extra layer of base64 encoding or decoding. As far as I know, there is no idiomatic way to detect whether a message is base64-encoded or not. The best option is to try to decode it and see if that throws an exception - something I don't really like.
I tried to look for some documentation, but couldn't find anything with clear guidelines. Maybe I was looking at the wrong places?
Thanks in advance for any pointers.

You probably want to encode your messages somehow, because SQS does not accept every possible byte combination in the message payload at the API level. Only valid UTF-8 characters in the ranges allowed by the XML specification (tab, newline, carriage return, and the printable ranges quoted below) are accepted.
Important
The following list shows the characters (in Unicode) allowed in your message, according to the W3C XML specification. For more information, go to http://www.w3.org/TR/REC-xml/#charsets If you send any characters not included in the list, your request will be rejected.
#x9 | #xA | #xD | [#x20 to #xD7FF] | [#xE000 to #xFFFD] | [#x10000 to #x10FFFF]
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/API_SendMessage.html
The base64 alphabet clearly falls within this range, making it impossible for a base64-encoded message to be rejected as invalid. Of course, it also bloats your payload, since base64 expands every 3 bytes of the original message into 4 bytes of output (64 symbols limit each output byte to carrying 6 bits of usable information: 3 x 8 bits in → 4 x 6 bits out).
Presumably boto automatically base64-encodes and decodes messages for you in order to be "helpful."
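If you decide to standardize on explicit base64 in every producer and consumer instead of relying on boto's default, the Java side could look roughly like the sketch below. This is a minimal illustration assuming the AWS SDK for Java v1; the queue URL and the payload bytes are placeholders for whatever your application actually uses.

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import java.util.Base64;

public class SqsBase64Example {
    public static void main(String[] args) {
        // Placeholder queue URL; substitute your own.
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue";
        byte[] payload = {0x08, 0x01};   // stand-in for a real serialized protobuf/JSON message

        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

        // Producer: base64-encode explicitly, so every language behaves the same way.
        String body = Base64.getEncoder().encodeToString(payload);
        sqs.sendMessage(queueUrl, body);

        // Consumer: decode exactly once; no guessing about whether the payload was encoded.
        for (Message m : sqs.receiveMessage(queueUrl).getMessages()) {
            byte[] decoded = Base64.getDecoder().decode(m.getBody());
            // ... hand 'decoded' to your protobuf/JSON parser ...
        }
    }
}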
But there is no reason why base64 has to be used at all.
An example that comes to mind... valid JSON would also comply with the restricted character ranges supported by SQS payloads. (Theoretically, I guess, JSON could be argued not to be an "encoding," but that would be a bit pedantic).
There is no clean way to determine whether a message needs to be decoded more than once, other than the sketchy one you proposed, but it could be argued that if you are in a situation where the need to decode is ambiguous, then that ambiguity is what should be eliminated.
If boto's behavior weren't documented and there were no way to make it behave otherwise, I'd say it is wrong behavior. But, as it is, I'll have to relent a bit and say it's just unusual.

Related

How to use ANTLR4 with binary data?

From the homepage:
ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for [...] or binary files.
I have been reading through the docs for some hours now and think that I have a basic understanding of ANTLR, but I have a hard time finding any references to processing binary files. And it seems I'm not the only one.
I need to create a parser for some binary data and would like to decide if ANTLR is of any help or not.
Binary data structure
The binary data is structured in logical fields: field1 is followed by field2, which is followed by field3, and so on, and each of those fields has a specific purpose. The lengths of the fields may differ and may not be known at the time the parser is generated. For example, I know that field1 is always 4 bytes, field2 might be a single byte, and field3 might be 1 to 10 bytes, possibly followed by additional field3s of n bytes, depending on the actual data. That is the second problem: I know the fields are there, and for field1 I know it is 4 bytes, but I don't know the actual value, and the value is what I'm interested in. The same goes for the other fields; I need the values of all of them.
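To make that layout concrete, here is roughly how it reads as a hand-rolled reader in plain Java. The field names come from the description above, the "low nibble of the first byte decides how many more bytes follow" rule is purely hypothetical, and this is deliberately not ANTLR code; it only illustrates the structure being asked about.

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

class DatagramReader {
    static List<byte[]> readFields(ByteBuffer buf) {
        byte[] field1 = new byte[4];
        buf.get(field1);                    // field1: always 4 bytes
        byte field2 = buf.get();            // field2: a single byte

        List<byte[]> field3s = new ArrayList<>();
        while (buf.hasRemaining()) {
            int first = buf.get() & 0xFF;   // first byte of a field3 block
            int extra = first & 0x0F;       // hypothetical rule: low nibble = number of extra bytes
            byte[] field3 = new byte[1 + extra];
            field3[0] = (byte) first;
            buf.get(field3, 1, extra);      // consume the rest of this field3
            field3s.add(field3);
        }
        return field3s;
    }
}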
What I need in ANTLR
This sounds to me like a common structure and use case for arbitrary binary data, but I don't see any special handling of such data in ANTLR. All the examples use some kind of text, and I don't see any value-extraction callbacks or the like. Additionally, I think I would need callbacks that influence the parsing process itself: for example, a callback is invoked on the first byte of field3, I inspect it, decide that one to N additional bytes need to be consumed and that they logically belong to field3, and tell the parser so it can proceed somehow.
In the end, I would get some higher-level "field" objects, and ANTLR would provide the underlying parse logic, with callbacks, listener infrastructure, tree-walking abilities, etc.
Did anyone ever do something like that and can provide some hints to examples or the concrete documentation I seem to have missed? Thanks!
EN 13757-3:2012
I don't think it makes my question much easier to understand, but the binary data I'm referring to is defined in the standard EN 13757-3:2012:
Communication systems for meters and remote reading of meters - Part 3: Dedicated application layer
The standard is not freely available on the net (anymore?), but the following PDF might give you an overview of what example data looks like; see page 4. In particular, the bytes of the mentioned fields are not constant; only the overall structure of the datagram is defined.
http://fastforward.ag/downloads/docu/FAST_EnergyCam-Protocol-wirelessMBUS.pdf
The tokens for the grammar would be the fields, each consisting of a different number of bytes but carrying a value, etc. Given ANTLR's description of itself, I would have expected such things to work somehow...
Alternative: Kaitai.io
Whoever is in a comparable position to mine, have a look at Kaitai.io, which looks very promising:
https://stackoverflow.com/a/40527106/2055163

thrift character encoding, perl to java

I have a complex situation that I'm trying to deal with involving character encoding.
I have a Perl program which communicates with a Java endpoint via Thrift; the Java side then uses the data to make a request to a legacy PHP service. It's ugly, but it's part of a migration plan, so it needs to work for a short while.
In Perl, a Thrift object is created in which some of the fields are JSON-encoded strings.
The problem is that when Perl makes the request to Java, one of the strings is as follows (this is from Data::Dumper and is subsequently JSON-encoded and added to the Thrift object):
'offer_message' => "<<>>
&&
\x{c3}\x{82}\x{c2}\x{a9}©
<script>alert(\"XSS\");</script>
https://url.com/imghp?hl=uk",
However, when this data is received on the Java side, the sequence \x{c3}\x{82}\x{c2}\x{a9} has been converted, so in Java we receive the following:
<<>>\\n&&\\nÃ�Â�Ã�©©\\n<script>alert(\"XSS\");</script>\\nhttps://www.google.com.ua/imghp?hl=uk
The problem is that if I pass the second string to the legacy PHP program, it fails; if I pass the string taken from the dump of the Perl hash, it succeeds. So my assumption is that I need to convert the received string to another encoding (correct me if I'm wrong; I'm not sure that this is the right solution).
I've tried taking the parameters received in Java and converting them to every encoding I can think of, but it doesn't work. For example:
byte[] utf8 = templateParams.getBytes("UTF8");
normallisedTemplateParams = new String(utf8, "UTF8");
I've been varying the encoding schemes in the hope I find something that works.
What is the correct way to solve this? For a short time this messy solution is my only option while other re-engineering is happening.
The problem was, in the end, difficult to diagnose but simple to resolve. It turned out that the package I was using to do the conversion in Java was using Java's default encoding of UTF-16. I had to modify the package and force it to use UTF-8. After that, everything worked.
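For anyone hitting the same symptom: the essence of the fix is to stop relying on whatever charset the conversion layer picks by default and to name the charset explicitly whenever you cross the byte[]/String boundary on the Java side. A minimal sketch (the method names and the sample bytes are illustrative, not the actual package that was modified):

import java.nio.charset.StandardCharsets;

public class ExplicitUtf8 {
    // Decode bytes explicitly as UTF-8 instead of whatever default the conversion layer picks.
    static String decode(byte[] wireBytes) {
        return new String(wireBytes, StandardCharsets.UTF_8);
    }

    // Encode explicitly as UTF-8 when handing the value on to the next service.
    static byte[] encode(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] wireBytes = {(byte) 0xC2, (byte) 0xA9};   // UTF-8 for the copyright sign
        System.out.println(decode(wireBytes));           // prints the copyright sign, not mojibake
    }
}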

As400Text class vs MQC.MQGMO_CONVERT

Why do some people prefer to use the As400Text class to handle EBCDIC/ASCII conversion (in Java code with the IBM MQ jars) when we already have the MQC.MQGMO_CONVERT option to handle this?
My requirement is to convert ASCII->EBCDIC during the PUT operation, which I am doing by setting the character set to 37 and the write format to "STRING", and to use the MQC.MQGMO_CONVERT option to automatically convert EBCDIC->ASCII during the GET operation.
Is there any downside to using the convert option? Could anyone please let me know if this is not a 100 percent safe option?
Best practice is to write the MQ message in your local code page (where the CCSID and Encoding will normally be filled in automatically with the correct values) and to set the Format field. The getter should then use MQGMO_CONVERT to request the message in the CCSID and Encoding it needs.
Get with convert is safe and will be correct so long as you provide the correct CCSID and Encoding describing the message when you put it.
In your question you describe converting from ASCII to EBCDIC before putting the message, and the getter then converting from EBCDIC back to ASCII on the MQGET. This means you have paid for two data conversion operations when you could have done none (or, if two different ASCII code pages are involved, only one).
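A rough sketch of that pattern with the IBM MQ classes for Java (queue manager connection and queue opening are omitted; the queue object, the requested CCSID, and the message text are placeholders):

import com.ibm.mq.MQC;
import com.ibm.mq.MQException;
import com.ibm.mq.MQGetMessageOptions;
import com.ibm.mq.MQMessage;
import com.ibm.mq.MQPutMessageOptions;
import com.ibm.mq.MQQueue;
import java.io.IOException;

public class ConvertOnGet {
    // Putter: write in the local code page (CCSID/encoding are left at their defaults,
    // which normally resolve to the correct local values) and set the Format field.
    static void putText(MQQueue queue, String text) throws MQException, IOException {
        MQMessage msg = new MQMessage();
        msg.format = MQC.MQFMT_STRING;     // tells MQ the body is character data, enabling conversion
        msg.writeString(text);
        queue.put(msg, new MQPutMessageOptions());
    }

    // Getter: ask the queue manager to convert to the CCSID we want on the way out.
    static String getText(MQQueue queue) throws MQException, IOException {
        MQGetMessageOptions gmo = new MQGetMessageOptions();
        gmo.options = MQC.MQGMO_CONVERT;
        MQMessage msg = new MQMessage();
        msg.characterSet = 1208;           // request UTF-8, for example
        queue.get(msg, gmo);
        byte[] body = new byte[msg.getMessageLength()];
        msg.readFully(body);
        return new String(body, java.nio.charset.StandardCharsets.UTF_8);
    }
}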

Google ProtoBuf serialization / deserialization

I am reading about Google Protocol Buffers. I want to know whether I can serialize a C++ object, send it over the wire to a Java server, deserialize it there in Java, and introspect the fields.
More generally, I want to be able to send objects from any language to the Java server and deserialize them there.
Assume the following is my .proto file:
message Person {
  required int32 id = 1;
  required string name = 2;
  optional string email = 3;
}
I ran protoc on this and created a C++ object.
Basically, I now want to send the serialized stream to the Java server.
Here, on the Java side, can I deserialize the stream so that I can find out that there are 3 fields in it, along with each field's name, type, and value?
You will need to know the schema in advance. Firstly, protobuf does not transmit names; all it uses as identifiers is the numeric key (1, 2, and 3 in your example) of each field. Secondly, it does not explicitly specify the type; there are only a very few wire types in protobuf (varint, 32-bit, 64-bit, length-prefix, group). Actual data types are mapped onto those, but you cannot unambiguously decode data without the schema:
varint is "some form of integer", but could be signed, unsigned or "zigzag" (which allows negative numbers of small magnitude to be cheaply encoded), and could be intended to represent any width of data (64 bit, 32 bit, etc)
32-bit could be an integer, but could be signed or unsigned - or it could be a 32-bit floating-point number
64-bit could be an integer, but could be signed or unsigned - or it could be a 64-bit floating-point number
length-prefix could be a UTF-8 string, a sequence of raw bytes (without any particular meaning), a "packed" set of repeated values of some primitive type (integer, floating point, etc), or a structured sub-message in protobuf format
groups - hoorah! this is always unambiguous! it can only mean one thing; but that one thing is largely deprecated by Google :(
So fundamentally: you need the schema. The encoded data does not include what you want. It is designed this way to avoid unnecessary overhead: if the protocol assumes that the encoder and decoder both know what the message is meant to look like, then a lot less information needs to be sent.
Note, however, that the information that is included is enough to safely round-trip a message even if there are fields that are not expected; it is not necessary to know the name or type if you only need to re-encode it to pass it along / back.
What you can do is use the parser API to scan over the data and reveal that there are three fields: field 1 is a varint, field 2 is length-prefixed, field 3 is length-prefixed. You could make educated guesses about the data beyond that (for example, you could see whether a UTF-8 decode produces something that looks roughly text-like and verify that re-encoding it as UTF-8 gives you back the original bytes; if it does, it is possible that it is a string).
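In the Java API, that kind of schema-less scan might look like the following sketch, using com.google.protobuf.CodedInputStream; the byte array is assumed to hold an encoded Person message from the .proto above.

import com.google.protobuf.CodedInputStream;
import com.google.protobuf.WireFormat;
import java.io.IOException;

public class WireScan {
    // Walks the raw encoding and reports field numbers and wire types only;
    // without the schema, this is all the stream itself can tell you.
    static void scan(byte[] encoded) throws IOException {
        CodedInputStream in = CodedInputStream.newInstance(encoded);
        int tag;
        while ((tag = in.readTag()) != 0) {
            int fieldNumber = WireFormat.getTagFieldNumber(tag);
            int wireType = WireFormat.getTagWireType(tag);
            System.out.println("field " + fieldNumber + " has wire type " + wireType);
            in.skipField(tag);   // skip the payload; we only wanted the tag
        }
    }
}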
Can I serialize a C++ object, send it on the wire to a Java server, deserialize it there in Java, and introspect the fields?
Yes, that is the very goal of protobuf: serialize data in an application developed in any supported language, and deserialize it in an application developed in any supported language. The serialization and deserialization languages can be the same or different.
Keep in mind that protocol buffers are not self-describing, so both sides of your application need to have serializers/deserializers generated from the .proto file.
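For the Person message above, the receiving Java side would look roughly like this sketch. It assumes Person is the class protoc generated from the .proto; the actual outer class or package name depends on the java_package / java_outer_classname options you choose.

import com.google.protobuf.InvalidProtocolBufferException;

public class Receiver {
    // 'bytesFromWire' is whatever the C++ (or any other) client serialized and sent.
    static void handle(byte[] bytesFromWire) throws InvalidProtocolBufferException {
        Person person = Person.parseFrom(bytesFromWire);
        System.out.println(person.getId() + ": " + person.getName());
        if (person.hasEmail()) {
            System.out.println(person.getEmail());
        }
    }
}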
In short: yes you can.
You will need to create .proto files which define the data structures that you want to share. By using the Google Protocol Buffers compiler you can then generate interfaces and (de)serialization code for your structures for both Java and C++ (and almost any other language you can think of).
To transfer your data over the wire you can use, for instance, ZeroMQ, which is an extremely versatile communications framework that also sports a slew of different language APIs, among them Java and C++.
See this question for more details.

How do I identify that I am at the last byte of a serialized Java object?

Question
What terminating characters/byte sequences (if any) are there in serialized Java objects?
Background
I'm working on a small self-education project where I would like to serialize Java objects and write them to a stream, where they are read and then deserialized. Since I will need to identify the borders between serialized objects, and I can't be sure that the current object is not the last one, is there a terminating character that is always present that I can use as my identifier?
I noticed that there is a magic number, ACED, that allows me to identify the start of the stream, so how do I identify the end?
EDIT:
If there is no terminating character, are there any safe terminating characters/sequences that I can insert to identify the end of the object?
In theory you should always be able to find the end of an object; in practice you cannot. As I understand it, the problem is that customised writeObject implementations that don't call either defaultWriteObject or writeFields have a non-standard representation.
I've played about with serialisation in the past, including creating streams for use when I've been doing unusual things with ObjectInputStream. It's not pleasant(!).
You can read the details in the spec, and the source is worth a read.
There are none. AFAIK the only requirement is that the deserialiser knows when to stop reading when given a corresponding serialisation. Subject to that, the serialiser can write whatever it wants -- in any position, not just the last.
If you're old skool, dump a 32-bit length field at the beginning and refuse to handle objects bigger than 4 GiB.
If you're nu skool, you just make sure your read and write logic are consistent and don't care about the length.
You can add a terminating object to your object stream, e.g. null or a special String.
However, I suggest that you instead serialize each object to its own byte[] and write the length of the byte[] followed by its data. This way each object stream is independent and you always know where it finishes.
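A sketch of that length-prefix approach using only plain java.io (the 4-byte length and the framing format are this example's own convention, not anything mandated by Java serialization):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class FramedObjects {
    // Serialize the object to its own byte[], then write "length, bytes" to the shared stream.
    static void writeFrame(DataOutputStream out, Object obj) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(buffer)) {
            oos.writeObject(obj);
        }
        byte[] bytes = buffer.toByteArray();
        out.writeInt(bytes.length);   // 4-byte length prefix
        out.write(bytes);
    }

    // Read the length, then exactly that many bytes, and deserialize from the isolated buffer.
    static Object readFrame(DataInputStream in) throws IOException, ClassNotFoundException {
        int length = in.readInt();
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }
}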
Have you considered applying a record-marking layer similar to HTTP Chunked encoding?
Chunked encoding is intended to solve a generalization of this scenario: identifying the end of a message of indeterminate length that itself contains no identifiable end and that is embedded in a longer stream which it does not terminate.