Convert Java Byte Array to a UTF-8 Python String - java

I'm pulling JSON data through a REST API using Requests in Python. Unfortunately, one of the fields contains all sorts of unescaped and control characters that break the JSON.
I don't control the data, but I can request it undecoded as a string that the application stores as a Java byte array.
For example: [B#1cf3bd82
The question is: how do I decode the string back into the original UTF-8 text as I'm working through the JSON? All of the examples I've found seem to work with a bytes object, not an encoded string.
Thoughts?

You're currently printing out the result of calling toString() on the byte[]. That's never a good idea - arrays don't override toString().
You should use the new String(byte[], Charset) constructor:
String text = new String(bytes, StandardCharsets.UTF_8);
It's not entirely clear to me from the question what is happening where in terms of the data, but basically you need to modify the Java code; any Python code is probably irrelevant here.
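As a quick illustration of the difference on the Java side (the class and variable names here are made up for the example):

import java.nio.charset.StandardCharsets;

public class ByteArrayToText {
    public static void main(String[] args) {
        byte[] rawBytes = "sample UTF-8 text".getBytes(StandardCharsets.UTF_8);
        System.out.println(rawBytes.toString());                          // prints something like [B@1cf3bd82 - useless
        System.out.println(new String(rawBytes, StandardCharsets.UTF_8)); // prints the original text
    }
}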

Related

How to convert Protobuf ByteString to an octal-escaped String in java?

Could anyone please let me know how to convert protobuf's ByteString to an octal escape sequence String in java?
In my case, I am getting the ByteString value as \376\024\367 so, when I print the string value in console using System.out.println(), I should get "\376\024\367".
Many thanks.
Normally, you'd convert a ByteString to a String using ByteString#toString(Charset). This method lets you specify what charset the text is encoded in. If it's UTF-8, you can also use the method toStringUtf8() as a shortcut.
From your question, though, it sounds like you actually want to produce the escaped format using C-style three-digit octal escapes. AFAIK there's no public function to do this, but you can see the code here. You could copy that code into your own project and use it.
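As a small sketch of the normal decoding path (the byteString value here is just an example built in place):

import java.nio.charset.StandardCharsets;
import com.google.protobuf.ByteString;

ByteString byteString = ByteString.copyFromUtf8("example");
String viaCharset  = byteString.toString(StandardCharsets.UTF_8); // pass the charset explicitly
String viaShortcut = byteString.toStringUtf8();                   // shortcut when the bytes are UTF-8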
I have used http://doc.akka.io/japi/akka/2.3.7/akka/util/ByteString.ByteStrings.html
You will see the method decodeString(java.lang.String charset).
Otherwise, see https://github.com/akka/akka/issues/18738
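For example, with an akka.util.ByteString (a minimal sketch; the values are only for illustration):

import akka.util.ByteString;

ByteString bytes = ByteString.fromString("example"); // build a ByteString for the sake of the example
String decoded = bytes.decodeString("UTF-8");        // decode it with the charset you expect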

Getting UTF8 strings directly from the web service call without conversion to String

I'm using CXF to implement a web services server.
Since I'm low on memory, I don't want the web service call parameters to be translated into Java Strings, which are UTF-16; I'd rather access the original UTF-8 buffers, which in my case are usually half the size.
So if I have a web method:
void addBook(String bookText)
How can I get the bookText without CXF translating it to java string?
The XML parsers used in Java (StAX parsers for CXF) only allow getting the XML contents as either a String or char[]. Thus, it wouldn't be possible to get the raw bytes.
If you have a String object in Java, there is no such thing as it being a UTF-8 or UTF-16 string. The encoding only comes in when you convert a String to or from a byte array.
A String in Java is a character array. If you already have a String object (for example, one passed as a parameter to your addBook() method), it has already been interpreted properly and converted to a character array.
If you want to avoid character encoding conversions, the only way to do that is to define your method to receive a byte array instead of a String:
void addBook(byte[] bookTextUtf16);
But keep in mind that in this way you have to "remember" the encoding in which the byte array is valid (adding it to the name is one way).
If you need a java.lang.String object, then there is nothing you can do. A String is a character array, with each character being a 16-bit value. This is internal to String; there is no way to change the internal representation. Either accept this or don't use java.lang.String to represent your strings.
An alternative would be to create your own Text class, for example, which holds the UTF-8 encoded byte array; as long as you don't need the String representation, keep it as a byte array and store it that way. Only create the java.lang.String instance when you actually need the String.
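A minimal sketch of such a class (the name Utf8Text is made up, not an existing API):

import java.nio.charset.StandardCharsets;

// Hypothetical holder for UTF-8 encoded text; only builds a java.lang.String on demand.
public final class Utf8Text {
    private final byte[] utf8Bytes;

    public Utf8Text(byte[] utf8Bytes) {
        this.utf8Bytes = utf8Bytes.clone();
    }

    public byte[] toBytes() {
        return utf8Bytes.clone();
    }

    @Override
    public String toString() {
        // Only here is the UTF-16 char[] representation created.
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }
}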

How to get 'original' bytes of a Java String when read from DataOutputStream.writeUTF()?

Currently I'm transferring a String across the network, using DataInput/OutputStream's. The String I am transferring needs to be converted into a byte array, to be decrypted.
However, because the string was written using DataOutputStream.writeUTF("foobar"), its byte array contains Java modified UTF-8 encoded data, which stuffs up the encryption process.
How can I get the original bytes from the Java modified UTF-8 String?
Unicode has several variants, where s-with-^ can either be one character or two: s plus combining-^. Java has a Normalizer class to convert to one specific variant.
See http://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html
or look at the API directly.
This requires that the original string adheres to one variant. One cannot take arbitrary bytes and then interpret them as UTF-8, because there are illegal sequences. This was done to prevent recognizing a wrong byte/character when starting in the middle of a byte sequence.
String normalizedString = Normalizer.normalize(s, Normalizer.Form.NFD);
What if you write your string as a byte[] and read it back as a byte[], using http://docs.oracle.com/javase/1.4.2/docs/api/java/io/DataOutputStream.html#write(byte[], int, int)?
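For example, a sketch of a length-prefixed exchange (out and in are assumed to be the existing DataOutputStream and DataInputStream, and encryptedBytes the assumed ciphertext):

// Sender: write the exact bytes, prefixed with their length, instead of writeUTF().
byte[] payload = encryptedBytes;
out.writeInt(payload.length);
out.write(payload, 0, payload.length);

// Receiver: read the same bytes back untouched.
int length = in.readInt();
byte[] received = new byte[length];
in.readFully(received);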

Need help removing strange characters from string

Currently, when I make a signature using java.security.Signature, it passes back a string.
I can't seem to use this string, since there are special characters that can only be seen when I copy the string into Notepad++; if I remove these special characters, I can use the remains of the string in my program.
In Notepad++ they look like black boxes with the words ACK GS STX SI SUB ETB BS VT.
I don't really understand what they are, so it's hard to tell how to get rid of them.
Is there a function that I can run to remove these and potentially similar characters?
When I use the Base64 class supplied in the posts, I can't go back to a signature:
System.out.println(signature);
String base64 = Base64.encodeBytes(sig);
System.out.println(base64);
String sig2 = new String (Base64.decode(base64));
System.out.println(sig2);
gives the output
”zÌý¥y]žd”xKmËY³ÕN´Ìå}ÏBÊNÈ›`Αrp~jÖüñ0…Rõ…•éh?ÞÀ_û_¥ÂçªsÂk{6H7œÉ/”âtTK±Ï…Ã/Ùê²
lHrM/aV5XZ5klHhLbctZs9VOtMzlfc9Cyk7Im2DOkXJwfmoG1vzxMIVS9YWV6Wg/HQLewF/7X6XC56pzwmt7DzZIN5zJL5TidFRLsc+Fwy/Z6rIaNA2uVlCh3XYkWcu882tKt2RySSkn1heWhG0IeNNfopAvbmHDlgszaWaXYzY=
[B#15356d5
The odd characters are there because cryptographic signatures produce bytes rather than strings. Consequently if you want a printable representation you should Base64 encode it (here's a public domain implementation for Java).
Stripping the non-printing characters from a cryptographic signature will render it useless as you will be unable to use it for verification.
Update:
[B#15356d5
This is the result of toString() called on a byte array: "[" means array, "B" means byte, and "15356d5" is the identity hash code of the array, not its contents. You should be passing the array you get out of decode to Signature.verify (http://java.sun.com/j2se/1.4.2/docs/api/java/security/Signature.html#verify(byte[])).
Something like:
Signature sig = Signature.getInstance("DSA"); // Signature has no public constructor
sig.initVerify(key);
sig.update(signedData);                       // signedData = the original data that was signed
boolean valid = sig.verify(Base64.decode(base64)); // <-- decoded signature bytes go here
How are you "making" the signature? If you use the sign method, you get back a byte array, not a string. That's not a binary representation of some text, it's just arbitrary binary data. That's what you should use, and if you need to convert it into a string you should use a base64 conversion to avoid data corruption.
If I understand your problem correctly, you need to get rid of characters with code below 32, except maybe char 9 (tab), char 10 (new line) and char 13 (return).
Edit: I agree with the others as handling a crypto output like this is not what you usually want.
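For completeness, if you ever need to strip such characters from ordinary text (never from a signature, which this would corrupt), a one-line sketch would be (input is the assumed string):

// Removes control characters below 32, keeping tab (9), newline (10) and carriage return (13).
String cleaned = input.replaceAll("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]", "");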

Decoding split 16-bit character in Java

In my application, I receive a URL-UTF8 encoded string of characters, which is split up by the sending client. After splitting, each message part includes some header information which is meant to be used to reconstruct the message.
With English characters, it's pretty straightforward
String content = new String(request.getParameter("content").getBytes("UTF-8"));
I store this in along with the header information in a buffer for each received part. When all parts have been received, I simply recompose the message by concatenating each individual part according to header information.
With languages whose characters take more than one byte in UTF-8, this sometimes does not work as expected. Everything works fine if the split does NOT happen in the middle of a single character.
For instance here's a string of three Hebrew characters being sent by the client:
%D7%93%D7%99%D7%91
If this winds up split as follows: {%D7%93%D7%99} {%D7%91}, reconstruction isn't a problem.
However sometimes the client splits it up in the middle (example: {%D7%93%D7} {%99%D7%91})
When this happens, after reconstruction I get two � characters at the boundary point instead of the single correct Hebrew character.
I thought the inability to correctly retain the single-byte information was related to passing around strings, so I tried passing the byte array from request.getParameter("content").getBytes("UTF-8") to the buffer without wrapping it in a String. In the buffer I joined all these arrays BEFORE converting the final array to a string.
Even after doing this, it appears I still "lost" that information held by the single bytes. I'm guessing this is because the getBytes("UTF-8") method can't correctly resolve the single bytes since they are not valid characters. Is that right?
Is there any way I can get around this and preserve these tail/head bytes?
Your client is the problem here. Apparently it treats the text data as a byte array for the purpose of splitting it up, and then sending the invalid fragments as text (HTTP request parameters are inherently textual). At that point, you have already lost.
You either have to change the client to split the data as text (i.e. along character boundaries), or change your protocol to send the fragments as binary data, i.e. not as a parameter but as the request body, to be retrieved via ServletRequest.getInputStream() - then, concatenating the data before decoding it should work.
(Caveat: the above assumes that you are indeed writing Servlet code, which I inferred from the request.getParameter() method; but even if that's a coincidence the same principles apply: either split the data as a String before any conversion to byte[] happens on the client side, or make sure you concatenate the byte arrays on the server before any conversion to String happens.)
You must first collect all bytes and then convert them all at once into a string.
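A sketch of that approach on the server side (assuming Servlet code and a hypothetical per-message collector class), reading each fragment as raw bytes from the request body and decoding only once everything has arrived:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.servlet.http.HttpServletRequest;

public class FragmentCollector {
    private final ByteArrayOutputStream messageBuffer = new ByteArrayOutputStream();

    // Append one fragment's raw bytes from the request body without decoding it.
    public void appendFragment(HttpServletRequest request) throws IOException {
        try (InputStream in = request.getInputStream()) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                messageBuffer.write(chunk, 0, n);
            }
        }
    }

    // Decode the whole message in one go, so multi-byte UTF-8 sequences are never split.
    public String decodeWhenComplete() {
        return new String(messageBuffer.toByteArray(), StandardCharsets.UTF_8);
    }
}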
The following scheme is a hack, but it should work in your case.
Set your server/page to Latin-1 mode. If this is a GET, the client has no way to set the encoding; you have to do this on the server's end. For example, for Tomcat you need to add URIEncoding="iso-8859-1" to the connector.
Get the content as Latin-1. It will be the wrong value at this point, but don't worry:
String content = request.getParameter("content");
Concatenate the string as Latin-1.
data = data + content;
When you get the whole thing, you need to re-encode the string as UTF-8 like this,
String value = new String(data.getBytes("iso-8859-1"), "utf-8");
The value should contain the correct characters.
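Putting the steps together as one sketch (receivedFragments is an assumed list of the parts, read with the Latin-1 setting described above):

import java.nio.charset.StandardCharsets;
import java.util.List;

static String reassemble(List<String> receivedFragments) {
    StringBuilder data = new StringBuilder();
    for (String fragment : receivedFragments) {
        data.append(fragment);  // concatenate as Latin-1 strings; each byte maps 1:1 to a char
    }
    // Reinterpret the accumulated bytes as UTF-8 once the message is complete.
    return new String(data.toString().getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
}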
You never need to convert a String to bytes and then back to a String in Java; it is completely pointless. Once a series of bytes has been decoded to a String, it is in Java's internal String representation (UTF-16).
The problem you have is that the application server is making an assumption about the encoding of the incoming HTTP request, usually the platform encoding. You can give the application server a hint as to the expected encoding by calling ServletRequest.setCharacterEncoding(String) before anything else calls getParameter().
Browsers assume that form fields should be submitted back to the server using the same encoding that the page was served with. This is a general rule, as the HTTP spec doesn't have a way to specify the encoding of the incoming request, only the response.
Spring has a nice Filter to do this for you, CharacterEncodingFilter; if you define it as the very first filter in web.xml, most of your encoding issues will go away.
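A hand-rolled equivalent of that filter is only a few lines (a sketch against the javax.servlet API; the class name is made up, and newer containers use jakarta.servlet instead):

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class ForceUtf8Filter implements Filter {
    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        // Must run before anything calls getParameter(), or the hint is ignored.
        if (request.getCharacterEncoding() == null) {
            request.setCharacterEncoding("UTF-8");
        }
        chain.doFilter(request, response);
    }

    @Override public void init(FilterConfig config) { }
    @Override public void destroy() { }
}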
