How to determine if an InputStream contains JSON data?

How do I check if the data behind a java.io.InputStream (from a File, URL, ...) is of type JSON?
Of course, to be thorough, the best approach would be to load the whole stream and try to validate it as JSON (e.g. checking for the closing bracket }). Since the stream source might be very big (a GeoJSON file with a size of 500 MB), this would eventually end in a burning machine.
To avoid this I wrote a small method that only reads the first few bytes of the InputStream, interprets them as UTF-8/16/32, and tests whether the first character is a {, following RFC 4627 (which is referenced and updated by RFC 7159), to determine its JSONness:
JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.
And:
Since the first two characters of a JSON text will always be ASCII
characters [RFC0020], it is possible to determine whether an octet
stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
at the pattern of nulls in the first four octets.
00 00 00 xx UTF-32BE
00 xx 00 xx UTF-16BE
xx 00 00 00 UTF-32LE
xx 00 xx 00 UTF-16LE
xx xx xx xx UTF-8
The method is:
public static boolean mightBeJSON(InputStream stream) {
    try {
        byte[] bytes = new byte[1];

        // 1st octet: '{' here covers UTF-8, UTF-16LE and UTF-32LE
        stream.read(bytes);
        if (bytes[0] == 0x7B) {
            return true;
        }

        // 2nd octet: '{' here covers UTF-16BE
        stream.read(bytes);
        if (bytes[0] == 0x7B) {
            return true;
        }

        // 4th octet: '{' here covers UTF-32BE
        stream.read(bytes);
        stream.read(bytes);
        if (bytes[0] == 0x7B) {
            return true;
        }
    } catch (IOException e) {
        // Nothing to do;
    }
    return false;
}
Until now my machine is still not burning, BUT:
Is there anything wrong with this approach/implementation?
May there be any problems in some situations?
Anything to improve?

RFC 7159 states:
8. String and Character Issues
8.1 Character Encoding
JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The
default encoding is UTF-8, and JSON texts that are encoded in UTF-8
are interoperable in the sense that they will be read successfully by
the maximum number of implementations; there are many implementations
that cannot successfully read texts in other encodings (such as UTF-16
and UTF-32).
Implementations MUST NOT add a byte order mark to the beginning of
a JSON text. In the interests of interoperability, implementations
that parse JSON texts MAY ignore the presence of a byte order mark
rather than treating it as an error.
This doesn't answer your question per se, but I hope it can help in your logic.
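To make that concrete, here is a rough sketch of a BOM-tolerant variant of the check. It is only an illustration (the name mightBeJsonLenient is made up), and it assumes the caller passes a stream that supports mark/reset, e.g. a BufferedInputStream, so that the peeked bytes stay available for the real parser:

import java.io.IOException;
import java.io.InputStream;

public static boolean mightBeJsonLenient(InputStream stream) throws IOException {
    if (!stream.markSupported()) {
        throw new IllegalArgumentException("stream must support mark/reset");
    }
    stream.mark(8);
    byte[] head = new byte[7];
    int read = stream.read(head);
    stream.reset(); // put the bytes back for the real parser
    if (read <= 0) {
        return false;
    }
    int offset = 0;
    // RFC 7159: parsers MAY ignore a byte order mark, so tolerate a UTF-8 BOM here
    if (read >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
        offset = 3;
    }
    // '{' can appear at octet 1 (UTF-8, UTF-16LE, UTF-32LE),
    // octet 2 (UTF-16BE) or octet 4 (UTF-32BE)
    for (int pos : new int[] {0, 1, 3}) {
        if (offset + pos < read && head[offset + pos] == 0x7B) {
            return true;
        }
    }
    return false;
}

The caller would keep using the same wrapped stream afterwards, e.g. InputStream in = new BufferedInputStream(url.openStream()).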

Related

How is it possible to encode String twice?

I was a Python programmer (of course I still am), so I am familiar with Python encoding and decoding.
I was surprised by the fact that Java can encode a String variable twice in a row.
This is example code:
import java.net.URLEncoder;

public class OpenAPITest {
    public static void main(String[] arg) throws Exception {
        String str = "안녕"; // Korean
        String utfStr = URLEncoder.encode(str, "UTF-8");
        System.out.println(utfStr);

        String ms949Str = URLEncoder.encode(utfStr, "MS949");
        System.out.println(ms949Str);
    }
}
I wonder how it can encode a string twice.
In Python 3.x, once you encode a 'str' (which holds a Unicode string), it is converted to 'bytes' (a byte string), and 'bytes' only has a decode() function.
Additionally, I want to get the same string value in Python 3 as the value of ms949Str in my example code. Please give me some advice. Thanks.
I don't know Python, and you didn't say which Python method you were using anyway, but if the Python method converted a Python string into a UTF-8 sequence of bytes, then you're using the wrong conversion method here, because that has nothing to do with URL encoding.
str.getBytes("UTF-8") will return a byte[] with the Java string encoded in UTF-8.
new String(bytes, "UTF-8") will decode the byte array.
URL Encoding is about converting text into a string that is valid as a component of a full URL, meaning that all special characters must be encoded using %NN escapes. Non-ASCII characters have to be encoded too.
As an example, take the string Test & gehört. When URL Encoded, it becomes the following string:
Test+%26+geh%C3%B6rt
The string Test & gehört becomes the following sequence of bytes (displayed in hex) when used with getBytes:
54 65 73 74 20 26 20 67 65 68 c3 b6 72 74
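The reason the double call works at all is that URLEncoder.encode simply returns another String; the output of the first pass is plain ASCII, so it is perfectly valid input for a second pass, which then percent-encodes the % signs themselves. A minimal sketch of this, using UTF-8 for both passes instead of the MS949 from the question so the output is easy to predict:

import java.net.URLEncoder;

public class DoubleEncodeDemo {
    public static void main(String[] args) throws Exception {
        String str = "안녕";

        // First pass: percent-encode the UTF-8 bytes of the text.
        String once = URLEncoder.encode(str, "UTF-8");
        System.out.println(once);   // %EC%95%88%EB%85%95

        // The result is an ordinary ASCII String, so it can be encoded again;
        // this time only the '%' characters need escaping, as %25.
        String twice = URLEncoder.encode(once, "UTF-8");
        System.out.println(twice);  // %25EC%2595%2588%25EB%2585%2595
    }
}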

What is this "socket heading" (bytes,chars) before receiving the actual xml from an msxml service?

I'm using Java JAXB to unmarshal an XML request received via a socket. Before the actual XML
<?xml version="1.0"....
I receive these bytes:
00 00 01 F9 EF BB BF
What are they? The size of the XML? A session ID? ...
The sender is using msxml4 to execute requests against my service.
Furthermore, I can see that the sender expects this type of header (it truncates the first 7 bytes if I send the XML response directly).
So, once I understand what these bytes are: is there any "normal" method using JAXB that can be used to add this header, or do I need to do it manually?
Thanks for any reply
This is a length prefix followed by a BOM.
The first 4 bytes indicate the message size: 00 00 01 F9 = (1 × 256) + 249 = 505, which includes the 3 bytes of the UTF-8 BOM (EF BB BF). Hence the XML length will be 502.
For how to handle this stream with JAXB, see:
Byte order mark screws up file reading in Java
why org.apache.xerces.parsers.SAXParser does not skip BOM in utf8 encoded xml?
JAXB unmarshaller BOM handlle
However, I have preferred to handle the stream byte by byte, reading it into a StringBuffer (since I also need it in string form for logging).
My byte-by-byte reading solution simply waits for the '<' character, i.e. the first character of the XML message.
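Alternatively, since the prefix has a fixed layout, the incoming side could consume it explicitly before handing the payload to a JAXB Unmarshaller. This is only a sketch (readFramedXml is a made-up name), assuming the header really is the 4-byte big-endian length plus the UTF-8 BOM described above:

import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

public static byte[] readFramedXml(InputStream socketIn) throws IOException {
    DataInputStream in = new DataInputStream(socketIn);
    int lengthWithBom = in.readInt();         // e.g. 00 00 01 F9 -> 505
    byte[] bom = new byte[3];
    in.readFully(bom);                        // EF BB BF
    byte[] xml = new byte[lengthWithBom - 3]; // the remaining 502 bytes of XML
    in.readFully(xml);
    return xml;
}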
To add the BOM heading before sending the response, I have used a similar method:
import java.nio.ByteBuffer;

public byte[] getBOMMessage(int xmlLength) {
    byte[] arr = new byte[7];
    ByteBuffer buf = ByteBuffer.wrap(arr);
    buf.putInt(xmlLength + 3);   // 4-byte big-endian length, including the 3 BOM bytes
    arr[4] = (byte) 0xef;        // UTF-8 BOM
    arr[5] = (byte) 0xbb;
    arr[6] = (byte) 0xbf;
    return arr;
}

Is UTF to EBCDIC Conversion lossless?

We have a process which communicates with an external system via MQ. The external system runs on a mainframe machine (IBM z/OS), while we run our process on a CentOS Linux platform. So far we have never had any issues.
Recently we started receiving messages from them with non-printable EBCDIC characters embedded in the message. They use the characters as a compressed ID, 8 bytes long. When we receive it, it arrives on our queue encoded as UTF-8 (CCSID 1208).
They need the original 8 bytes back in order to identify our response messages. I'm trying to find a solution in Java to convert the ID back from UTF-8 to EBCDIC before sending the response.
I've been playing around with the JTOpen library, using the AS400Text class to do the conversion. Also, the counterparty has sent us a snapshot of the ID in bytes. However, when I compare the bytes after conversion, they are different from the original message.
Has anyone ever encountered this issue? Maybe I'm using the wrong code page?
Thanks for any input you may have.
Bytes from counterparty(Positions [5,14]):
00000 F0 40 D9 F0 F3 F0 CB 56--EF 80 04 C9 10 2E C4 D4 |0 R030.....I..DM|
Program output:
UTF String: [R030ôîÕ؜IDMDHP1027W 0510]
EBCDIC String: [R030ôîÃÃÂIDMDHP1027W 0510]
NATIVE CHARSET - HEX: [52303330C3B4C3AEC395C398C29C491006444D44485031303237572030353130]
CP500 CHARSET - HEX: [D9F0F3F066BE66AF663F663F623FC9102EC4D4C4C8D7F1F0F2F7E640F0F5F1F0]
Here is some sample code:
private void readAndPrint(MQMessage mqMessage) throws IOException {
    mqMessage.seek(150);
    byte[] subStringBytes = new byte[32];
    mqMessage.readFully(subStringBytes);
    String msgId = toHexString(mqMessage.messageId).toUpperCase();
    System.out.println("----------------------------------------------------------------");
    System.out.println("MESSAGE_ID: " + msgId);
    String hexString = toHexString(subStringBytes).toUpperCase();
    String subStr = new String(subStringBytes);
    System.out.println("NATIVE CHARSET - HEX: [" + hexString + "] [" + subStr + "]");

    // Transform to EBCDIC
    int codePageNumber = 37;
    String codePage = "CP037";
    AS400Text converter = new AS400Text(subStr.length(), codePageNumber);
    byte[] bytesData = converter.toBytes(subStr);
    String resultedEbcdicText = new String(bytesData, codePage);
    String hexStringEbcdic = toHexString(bytesData).toUpperCase();
    System.out.println("CP500 CHARSET - HEX: [" + hexStringEbcdic + "] [" + resultedEbcdicText + "]");
    System.out.println("----------------------------------------------------------------");
}
If a MQ message has varying sub-message fields that require different encodings, then that's how you should handle those messages, i.e., as separate message pieces.
But as you describe this, the entire message needs to be received without conversion. The first eight bytes need to be extracted and held separately. The remainder of the message can then have its encoding converted (unless other sub-fields also need to be extracted as binary, unconverted bytes).
For any return message, the opposite conversion must be done. The text portion of the message can be converted, and then that sub-string can have the original eight bytes prepended to it. The newly reconstructed message then can be sent back through the queue, again without automatic conversion.
Your partner on the other end is not using the messaging product correctly. (Of course, you probably shouldn't say that out loud.) There should be no part of such a message that cannot automatically survive intact across both directions. Instead of an 8-byte binary field, it should be represented as something more like a 16-byte hex representation of the 8-byte value for one example method. In hex, there'd be no conversion problem either way across the route.
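To illustrate that hex-representation idea (the method names idToHex and hexToId are made up here), the 8 binary bytes would travel as 16 hex characters, which convert losslessly between UTF-8 and EBCDIC because they are ordinary letters and digits:

public static String idToHex(byte[] id) {
    StringBuilder sb = new StringBuilder(id.length * 2);
    for (byte b : id) {
        sb.append(String.format("%02X", b));  // e.g. 0xCB -> "CB"
    }
    return sb.toString();
}

public static byte[] hexToId(String hex) {
    byte[] id = new byte[hex.length() / 2];
    for (int i = 0; i < id.length; i++) {
        id[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
    }
    return id;
}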
It seems to me that the special 8 bytes are not actually EBCDIC characters but simply 8 bytes of binary data. If that is the case, then I believe, as mentioned in another answer, that you should handle those 8 bytes separately, without letting them be converted to UTF-8 and then back to EBCDIC for further processing.
Depending on the EBCDIC variant you are using, it is quite possible that a given EBCDIC byte does not convert to a meaningful UTF-8 character, and hence converting the received UTF-8 character back to EBCDIC will not reproduce the original byte.
A brief search on Google gives me several EBCDIC tables (e.g. http://www.simotime.com/asc2ebc1.htm#AscEbcTables). You can see there are lots of values in EBCDIC that have no character assigned. Hence, when they are converted to UTF-8, you cannot assume each of them will map to a distinct Unicode character. Therefore your proposed way of processing is going to be very dangerous and error-prone.

Decode of base64 string containing zip file gets 8 character codes wrong in result string

I'm receiving a base64-encoded zip file (in the form of a string) from a SOAP request.
I can decode the string successfully using a stand-alone program, b64dec.exe, but I need to do it in a java routine. I'm trying to decode it (theZipString) with Apache commons-codec-1.7.jar routines:
import org.apache.commons.codec.binary.Base64;
import org.apache.commons.codec.binary.StringUtils;
StringUtils.newString(Base64.decodeBase64(theZipString), "ISO-8859-1");
Zip file readers open the resulting file and show the list of content files but the content files have CRC errors.
I compared the result of my java routine with the result of the b64dec.exe program (using UltraEdit) and found that they are identical with the exception that eight different byte-values, where ever they appear in the b64dec.exe result, are replaced by 3F ("?") in mine. The values and their ISO-8859-1 character names are A4 ('currency'), A6 ('broken bar'), A8 ('diaeresis'), B4 ('acute accent'), B8 ('cedilla'), BC ('vulgar fraction 1/4'), BD ('vulgar fraction 1/2'), and BE ('vulgar fraction 3/4').
I'm guessing that the StringUtils.newString function is not translating those eight values to the string output, because I tried other 8-bit character sets: UTF-8, and cp437. Their results are similar but worse, with many more 3F, "?" substitutions.
Any suggestions? What character set should I use for the newString function to convert a .zip string? Is the Apache function incapable of this translation? Is there a better way to do this decode?
Thanks!
A zip file is not a string. It's not encoded text. It may contain text files, but that's not the same thing. It's just binary data.
If you treat arbitrary binary data as a string, bad things will happen. Instead, you should use streams or byte arrays. So this is fine:
byte[] zipData = Base64.decodeBase64(theZipString);
... but don't try to convert that to a string. If you write out that byte[] to a file (probably with FileOutputStream or some utility method) it should be fine.
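For example, something like this (the file name is only illustrative; theZipString is the variable from the question):

import java.io.FileOutputStream;
import org.apache.commons.codec.binary.Base64;

byte[] zipData = Base64.decodeBase64(theZipString);
try (FileOutputStream out = new FileOutputStream("attachment.zip")) {
    out.write(zipData);  // raw bytes, no charset involved
}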

What's the most efficient way to add data mid way through a stream of bytes?

I've been making an image rescaler that uses the ImageIO library in Java to convert images to a BufferedImage. Unfortunately it doesn't recognise every type of JPEG that I may pass to it, so I need to "convert" these other types. The way I'm converting them is to take an existing APP0 tag from a standard JFIF JPEG; what I want to do is, at the 3rd byte into the file, insert 18 bytes of data (the FFE0 marker and the 16-byte APP0 tag) and then append the rest of the original file after that.
So to generalise, what's the most efficient way to add/insert bytes of data mid way through a stream/file?
Thanks in advance,
Alexei Blue.
This question is linked to a previous question of mine and so I'd like to thank onemasse for the answer given there.
Java JPEG Converter for Odd Image Types
If you are reading your images from a stream, you could make a proxy which acts like an InputStream and wraps your input stream. Override the read method so it returns the extra missing bytes at the point where they are missing.
A proxy can be made by extending FilterInputStream http://download.oracle.com/javase/6/docs/api/java/io/FilterInputStream.html
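A rough sketch of that proxy idea (the class name is made up, and it assumes the extra bytes should appear right after the first two bytes of the file, i.e. after the SOI marker):

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class App0InsertingInputStream extends FilterInputStream {
    private final byte[] insert;
    private long pos = 0;      // bytes of the underlying stream delivered so far
    private int insertPos = 0; // bytes of the inserted data delivered so far

    App0InsertingInputStream(InputStream in, byte[] insert) {
        super(in);
        this.insert = insert;
    }

    @Override
    public int read() throws IOException {
        // After the first two original bytes, serve the injected bytes,
        // then fall back to the wrapped stream.
        if (pos >= 2 && insertPos < insert.length) {
            return insert[insertPos++] & 0xFF;
        }
        int b = super.read();
        if (b >= 0) {
            pos++;
        }
        return b;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        // Simplest possible implementation: delegate to the single-byte read.
        int count = 0;
        for (; count < len; count++) {
            int c = read();
            if (c < 0) {
                return count == 0 ? -1 : count;
            }
            b[off + count] = (byte) c;
        }
        return count;
    }
}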
If it is a file, the recommended way to do this is to copy the existing file to a new one, inserting, changing or removing bytes at the appropriate points. Then rename the new file to the old one.
In theory you could try to use RandomAccessFile (or equivalent) perform an in-place update of an existing file. However, it is a bit tricky, not as efficient as you might imagine and ... most important ... it is unsafe. (If your application or the system dies at an inopportune moment, you are left with a broken file, and no way to recover it.)
A PushbackInputStream might be what you need.
Thanks for the suggestions, guys. I used a FilterInputStream at first but then saw there was no need to; I used the following piece of code to insert my APP0 hex tag:
private static final String APP0Marker = "FF E0 00 10 4A 46 49 46 00 01 01 01 00 8B 00 8B 00 00";
And in the desired converter method:
if (isJPEG(path))
{
    fis = new FileInputStream(path);
    APP0 = hexStringToByteArray(APP0Marker.replaceAll(" ", ""));
    // The output is the original file plus the 18 inserted bytes, so the
    // buffer has to be larger than the source file by APP0.length.
    bytes = new byte[(int) (new File(path).length()) + APP0.length];
    for (int index = 0; index < bytes.length; index++)
    {
        if (index >= 2 && index <= (2 + APP0.length - 1))
        {
            // Positions 2..19: the injected FFE0 marker and APP0 tag
            b = APP0[index - 2];
        }
        else
        {
            // Everything else is copied straight from the original file
            b = (byte) fis.read();
        }//if-else
        bytes[index] = b;
    }//for
    fis.close();
    //Write new image file
    out = new FileOutputStream(path);
    out.write(bytes);
    out.flush();
    out.close();
}//if
Hope this helps anyone having a similar problem :)
