Does Class URLDecoder.decode handle double encoding? - java

I am trying to debug a flaky Java application. I can't (easily) debug it the only way I know how: by putting a log statement in it, re-compiling, and then checking the logs.
(I don't have access to a reliable set of source code.) And I'm not a Java developer.
The actual question:
If I did this:
str = URLDecoder.decode("%25C3%2596");
What would be in str?
Would it realize that this is double-encoded and handle that, i.e. turn it into %C3%96 and then decode that (which decodes into a German umlaut)?
Thanks
--Justin Wyllie

From the Java API URLDecoder:
A sequence of the form "%xy" will be treated as representing a byte where xy is the two-digit hexadecimal representation of the 8 bits.
So my guess would be - most likely not.
You could however call the decode method twice.
str = URLDecoder.decode(URLDecoder.decode("%25C3%2596"));

str = URLDecoder.decode("%25C3%2596");
The result of this operation is system-dependent, because the single-argument decode uses the platform's default encoding (which is the reason the method is deprecated).
The result of this call:
str = URLDecoder.decode("%25C3%2596", "UTF-8");
...would be %C3%96 which is Ö in percent-encoded UTF-8. The API does not try to recursively decode any percent-signs.
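A minimal sketch of the behaviour described above: one call to decode removes exactly one layer of percent-encoding, so a double-encoded value needs two calls. The class and method names (DoubleDecodeDemo, decodeOnce) are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class DoubleDecodeDemo {
    // Removes exactly one layer of percent-encoding, interpreting bytes as UTF-8.
    static String decodeOnce(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String once = decodeOnce("%25C3%2596");      // each %25 becomes '%', nothing more
        String twice = decodeOnce(once);             // %C3%96 becomes U+00D6
        System.out.println(once);                    // %C3%96
        System.out.println(twice.equals("\u00D6"));  // true
    }
}
```

Decoding twice is only safe when you know the input really was encoded twice; a value that legitimately contains a percent sign would be corrupted by the second pass.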

Related

decoding and encoding strings, ISO-8859-1 to UTF-8 in Java

I have read the other posts on this issue, but the solutions they presented did not work for me. Actually, the official Java documentation also did not work as intended (I am using Java 11) : https://docs.oracle.com/javase/tutorial/i18n/text/string.html
My problem is that I am reading one byte at a time from a byte buffer, putting that in a byte array, and making a String out of that byte array. The bytes I read are from an embedded system that can only send ISO-8859-1 bytes, so I end up with a byte array with ISO-8859-1 bytes and the Java String I end up getting is thus ISO-8859-1 encoded. No problem here. The String in IntelliJ looks like this :
The bytes I am trying to convert from ISO-8859-1 to UTF-8 are the ones in yellow. I want them to be UTF-8, so in the end the "C9" byte should be replaced by the "C3A9" bytes.
The first step works correctly, I do this : maintenanceResponseString.getBytes(StandardCharsets.UTF_8) and I get the right bytes that I want, the UTF-8 encoding of the string, that's good :
The problem comes in here , when I try to make a STRING out of these new (and GOOD) bytes, like this :
new String(maintenanceResponseString.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8)
The old bytes are back ?!! It's like the "getBytes(UTF-8)" never actually happened. That is NOT what the documentation says should happen... what am I missing here ? I have done tests and the string really is still ISO-8859-1 encoded... I don't know what is going on here. Where are the bytes from "getBytes" ?
How do you convert a String that contains ISO-8859-1 bytes to UTF-8 bytes ? I'm out of alternatives and I need to get it done real bad for a pro project... this should be easy !
Note : I have tried alternatives like
ByteBuffer buffer = StandardCharsets.UTF_8.encode(s);
return StandardCharsets.UTF_8.decode(buffer).toString();
But the exact same thing happens.
Thank you in advance for your help.
EDIT :
Some info in the comments explained that Strings in Java 9+ are no longer always represented internally as UTF-16, but may be stored as Latin-1 (why...). I think that is what made me think the Strings were "internally encoded in Latin-1", when it is just the default representation of the String if we don't specify the encoding we want to use when displaying the String.
From what I understand now, the String itself is not bound to any encoding, and you can CHOOSE the encoding you want when it gets written.
Actually my issue is that the String ends up written to an XML file via JAXB marshalling in LATIN-1, and I now think the issue lies over there... I will dig further when I access my work computer again and report here.
It turns out there was nothing wrong with Strings and "their encoding". What happened is I got really confused because the debugger shows the contents of the String in a "default internal storage encoding", and that is ISO-8859-1 (but can be UTF-16, depends on the content of the String).
Quote from the JEP-254 :
We propose to change the internal representation of the String class
from a UTF-16 char array to a byte array plus an encoding-flag field.
The new String class will store characters encoded either as
ISO-8859-1/Latin-1 (one byte per character), or as UTF-16 (two bytes
per character), based upon the contents of the string. The encoding
flag will indicate which encoding is used.
But the internal storage encoding doesn't actually matter. When it is time to be written, the String will use whatever encoding you ask for at the time of writing.
My issue actually was when I was sending the String in an HTTP request with Spring RestTemplate. I didn't have the header specifying the "charset" to use in the request, and RestTemplate defaults to ISO-8859-1 if not told otherwise. I added the charset=utf-8, and the String was correctly written as UTF-8 in the request.
Thank you to @VGR @Eugene @skomisa for the help
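For reference, a minimal sketch of the conversion discussed in this thread, using only the JDK. An encoding only comes into play when bytes are turned into a String or a String into bytes, which is why encoding to UTF-8 and decoding again gives back an equal String. The class and method names are made up for illustration. Note that the ISO-8859-1 byte E9 ('é') becomes C3 A9 in UTF-8, while C9 ('É') becomes C3 89.

```java
import java.nio.charset.StandardCharsets;

class Latin1ToUtf8Demo {
    // Decode ISO-8859-1 bytes into a String, then encode that String as UTF-8.
    static byte[] latin1ToUtf8(byte[] latin1Bytes) {
        String text = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02X", b));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(latin1ToUtf8(new byte[] {(byte) 0xE9}))); // C3A9
        System.out.println(hex(latin1ToUtf8(new byte[] {(byte) 0xC9}))); // C389

        // Encoding and decoding with the same charset returns an equal String,
        // so this round trip cannot "change the encoding" of a String.
        String s = "\u00C9";
        String same = new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
        System.out.println(same.equals(s)); // true
    }
}
```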

base64 url safe removes =

The following code(using commons codec Base64):
byte[] a = Hex.decodeHex("9349c513ed080dab".toCharArray());
System.out.println(Base64.encodeBase64URLSafeString(a));
System.out.println(Base64.encodeBase64String(a));
gives the following output:
k0nFE-0IDas //should be k0nFE-0IDas=
k0nFE+0IDas=
Base64.encodeBase64URLSafeString(a) returns k0nFE-0IDas instead of k0nFE-0IDas=. Why is this happening?
Why is this happening?
Because that's what it's documented to do:
Note: no padding is added.
The = characters at the end of a base64 string are called padding. They're used to make sure that the final string's length is a multiple of 4 characters - but they're not really required, in terms of information theory, so it's reasonable to remove them so long as you then convert the data back to binary using a method which doesn't expect padding. The Apache Codec Base64 class claims it transparently handles both regular and URL-safe base64, so presumably does handle a lack of padding.
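As an illustration of the padding point, here is a small sketch that uses the JDK's own java.util.Base64 (Java 8+) instead of Commons Codec, so its API differs from the code in the question. It encodes the same eight bytes in URL-safe form with and without padding and checks that both decode to the original data. The class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.Base64;

class Base64PaddingDemo {
    // The bytes of the hex string 9349c513ed080dab from the question.
    static final byte[] DATA = {
        (byte) 0x93, 0x49, (byte) 0xC5, 0x13, (byte) 0xED, 0x08, 0x0D, (byte) 0xAB
    };

    // URL-safe Base64, optionally without the trailing '=' padding.
    static String urlSafe(byte[] data, boolean padded) {
        Base64.Encoder enc = Base64.getUrlEncoder();
        if (!padded) {
            enc = enc.withoutPadding();
        }
        return enc.encodeToString(data);
    }

    public static void main(String[] args) {
        String padded = urlSafe(DATA, true);
        String unpadded = urlSafe(DATA, false);
        System.out.println(padded);    // k0nFE-0IDas=
        System.out.println(unpadded);  // k0nFE-0IDas

        // The JDK URL decoder accepts both forms and yields the same bytes.
        byte[] a = Base64.getUrlDecoder().decode(padded);
        byte[] b = Base64.getUrlDecoder().decode(unpadded);
        System.out.println(Arrays.equals(a, DATA) && Arrays.equals(b, DATA)); // true
    }
}
```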

UTF-8 Encoding ; Only some Japanese characters are not getting converted

I am receiving a parameter value from a Jersey web service, and the value contains Japanese characters.
Here, 'japaneseString' is the web service parameter containing the Japanese characters.
String name = new String(japaneseString.getBytes(), "UTF-8");
However, I am able to convert a few string literals successfully, while some of them are creating problems.
The following were successfully converted:
1) アップル
2) 赤
3) 世丕且且世两上与丑万丣丕且丗丕
4) 世世丗丈
While these didn't:
1) ひほわれよう
2) 存在する
When I investigated further, I found that these 2 strings are getting converted into junk characters.
1) Input: ひほわれよう Output : �?��?��?れよ�?�
2) Input: 存在する Output: 存在�?�る
Any idea why some of the Japanese characters are not converted properly?
Thanks.
You are mixing concepts here.
A String is just a sequence of characters (chars); a String in itself has no encoding at all. For what it's worth, replace characters in the above with carrier pigeons. Same thing. A carrier pigeon has no encoding. Neither does a char. (1)
What you are doing here:
new String(x.getBytes(), "UTF-8")
is a "poor man's encoding/decoding process". You will probably have noticed that there are two versions of .getBytes(): one where you pass a charset as an argument and the other where you don't.
If you don't, and that is what happens here, it means you will get the result of the encoding process using your default character set; and then you try and re-decode this byte sequence using UTF-8.
Don't do that. Just take in the string as it comes. If, however, you have trouble reading the original byte stream into a string, it means you use a Reader with the wrong charset. Fix that part.
For more information, read this link.
(1) the fact that, in fact, a char is a UTF-16 code unit is irrelevant to this discussion
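A small sketch of why this "poor man's" re-encoding is fragile. To keep it deterministic, the platform default charset is passed in explicitly as a parameter (an assumption of this example; the original code picks it up implicitly). The names ReencodeDemo and reencode are made up. It shows the simplest failure, where the default charset cannot represent the characters at all; the partial corruption in the question is a different variant of the same mistake.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReencodeDemo {
    // Mimics new String(x.getBytes(), "UTF-8") with the default charset made explicit.
    static String reencode(String x, Charset platformDefault) {
        return new String(x.getBytes(platformDefault), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "\u5B58\u5728\u3059\u308B"; // 存在する

        // Harmless only when the default charset happens to be UTF-8.
        System.out.println(reencode(s, StandardCharsets.UTF_8).equals(s)); // true

        // ISO-8859-1 cannot represent these characters, so getBytes
        // substitutes '?' for each one and the text is lost.
        System.out.println(reencode(s, StandardCharsets.ISO_8859_1)); // ????
    }
}
```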
Try setting the JVM parameter file.encoding to UTF-8 at startup of Tomcat (the JVM).
E.g.: -Dfile.encoding=UTF-8
I concur with @fge.
Clarification
In Java, String/char/Reader/Writer handle (Unicode) text, and can combine all scripts in the world.
And byte[]/InputStream/OutputStream are binary data, which need an indication of some encoding to be converted to String.
In your case japaneseString should already be a correct String, or be substituted by the original byte[].
Traps in Java
Encoding often is an optional parameter, which then defaults to the platform encoding. You fell in that trap too:
String s = "...";
byte[] b1 = s.getBytes(); // Platform encoding, non-portable.
byte[] b2 = s.getBytes("UTF-8"); // Explicit, but declares UnsupportedEncodingException
byte[] b3 = s.getBytes(StandardCharsets.UTF_8); // Explicit,
// better (for UTF-8, ISO-8859-1)
In general avoid the overloaded methods without an encoding parameter, as they are for current-computer-only data: non-portable. For completeness: before Java 11 the classes FileReader/FileWriter offered no encoding parameter at all and were best avoided; since Java 11 they have constructors that take a Charset.
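One way to keep the charset explicit when reading and writing files is sketched below, using java.nio.file.Files. The class name, the method name and the temp-file prefix are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitCharsetIo {
    // Writes text to a temp file and reads the first line back,
    // naming the charset explicitly on both sides.
    static String roundTrip(String text, Charset charset) {
        try {
            Path tmp = Files.createTempFile("charset-demo", ".txt");
            try {
                Files.write(tmp, text.getBytes(charset));
                try (BufferedReader reader = Files.newBufferedReader(tmp, charset)) {
                    return reader.readLine();
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String red = "\u8D64"; // 赤
        System.out.println(roundTrip(red, StandardCharsets.UTF_8).equals(red)); // true
    }
}
```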
Error
japaneseString is already wrong. So you have to read that right.
It could have been read erroneously as Windows-1252 (Windows Latin-1) and suffered when recoding to UTF-8. Evidently only some cases get messed up.
Maybe you had:
String japaneseString = new String(bytes);
instead of:
String japaneseString = new String(bytes, StandardCharsets.UTF_8);
At the end:
String name = japaneseString;
Show the code for reading japaneseString for further help.

Why does the Blowfish output in Java and PHP differ by only 2 chars?

I have a Blowfish encryption script in PHP and a matching one in Java that were working fine until today, when I came across a problem.
The same content is encrypted differently in Java vs PHP by only 2 chars, which is really weird.
PHP
wTHzxfxLHdMm/JMFnoh0hciS/JADvFFg
Java
wTHzxfxLHdMm/JMFnoh0hciS/D8DvFFg
-------------------------^^
As you can see, those two positions do not match. Unfortunately the value is a real email address and I can't share it. Also, I was not able to reproduce the problem with the few other values I've tested. I've tried changing the Base64 encoding classes in Java, and that didn't help either.
The source code for PHP is here, and for Java is here.
What could I do to resolve this problem?
Let's have a look at your Java code:
String c = new String(Test.encrypt((new String("thevalue")).getBytes(),
(new String("mykey")).getBytes()));
...
System.out.println("Base64 encoded String:" +
new sun.misc.BASE64Encoder().encode(c.getBytes()));
What you are doing here is:
1) Convert the plaintext string to bytes, using the system's default encoding.
2) Convert the key to bytes, using the system's default encoding.
3) Encrypt the bytes.
4) Convert the encrypted bytes back to a string, using the system's default encoding.
5) Convert the encrypted string back to bytes, using the system's default encoding.
6) Encode these encrypted bytes using Base64.
The problem is in step 4. It assumes that an arbitrary byte array represents a string in your system's default encoding, and that encoding this string back gives the same byte[]. This is valid for some encodings (the ISO-8859 series, for example), but not for others. In Java, when some byte (or byte sequence) is not valid in the given encoding, it is replaced by a substitute character, which is then mapped to byte 63 (ASCII ?) when the string is converted back to bytes. Actually, the documentation even says:
The behavior of this constructor when the given bytes are not valid in the default charset is unspecified.
In your case, there is no reason to do this at all - simply use the bytes which your encrypt method outputs directly to convert them to Base64.
byte[] encrypted = Test.encrypt("thevalue".getBytes(),
"mykey".getBytes());
System.out.println("Base64 encoded String:"+ new sun.misc.BASE64Encoder().encode(encrypted));
(Also note that I removed the superfluous new String("...") constructor calls here, though this does not relate to your problem.)
The point to remember: Never ever convert an arbitrary byte[], which did not come from encoding a string, to a string. Output of an encryption algorithm (and most other cryptographic algorithms, except decryption) certainly belongs to the category of data which should not be converted to a string.
And never ever use the System's default encoding, if you want portable programs.
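The explanation above can be checked against the two outputs in the question. The sketch below decodes the one Base64 block that differs ("/JAD" from PHP, "/D8D" from Java) with java.util.Base64 (Java 8+, used here instead of sun.misc.BASE64Encoder). A single byte, 0x90, has become 0x3F, the ASCII question mark. The viaString method then reproduces that kind of loss, using US-ASCII purely as a stand-in because the asker's actual default charset is not known. All names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class LossyRoundTripDemo {
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02X", b));
        return sb.toString();
    }

    // Pushes arbitrary bytes through a String and back, like steps 4 and 5.
    // US-ASCII is an assumption standing in for the unknown default charset.
    static byte[] viaString(byte[] raw) {
        String asText = new String(raw, StandardCharsets.US_ASCII);
        return asText.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] fromPhp = Base64.getDecoder().decode("/JAD");
        byte[] fromJava = Base64.getDecoder().decode("/D8D");
        System.out.println(hex(fromPhp));   // FC9003
        System.out.println(hex(fromJava));  // FC3F03

        // Bytes that are not valid in the charset come back as '?' (0x3F).
        System.out.println(hex(viaString(fromPhp))); // 3F3F03
    }
}
```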
Your code seems right to me.
It looks like you have a trailing white space in the input to one of these programs, and it is only one. I'll tell you why:
Each of these 4-char blocks represents 3 characters in the encrypted string. The differing part (JA and D8 in the 7th block) actually comes from a single differing character.
wTHz xfxL HdMm /JMF noh0 hciS /JAD vFFg
wTHz xfxL HdMm /JMF noh0 hciS /D8D vFFg
If I have got it right your email address is 19 characters long. The 20th character in one of your input strings is a white space.
Question: Have you tried the associated PHP decryption library to decrypt the PHP generated encrypted text? Have you tried the associated JAVA decryption library to decrypt the JAVA encrypted text?
If both produce differing outputs, then one MUST fail decrypting.
Is that one PHP, or Java?
Whichever one it is -- I would try to duplicate another such failure with a publicly shareable string... give that string as a unit test -- to the developer or developers that created the encrypt/decrypt code in the language that the round-trip encrypt/decrypt fails in.
Then... wait for them to fix it.
Not sure of any faster solutions -- except maybe change encryption/decryption library providers... or roll your own...

Need help removing strange characters from string

Currently, when I make a signature using java.security.Signature, it passes back a string.
I can't seem to use this string since there are special characters that can only be seen when I copy the string into Notepad++; if I remove these special characters there, I can use the rest of the string in my program.
In Notepad++ they look like black boxes with the words ACK GS STX SI SUB ETB BS VT
I don't really understand what they are, so it's hard to tell how to get rid of them.
Is there a function that I can run to remove these and potentially similar characters?
When I use the Base64 class supplied in the posts, I can't go back to a signature:
System.out.println(signature);
String base64 = Base64.encodeBytes(sig);
System.out.println(base64);
String sig2 = new String (Base64.decode(base64));
System.out.println(sig2);
gives the output
”zÌý¥y]žd”xKmËY³ÕN´Ìå}ÏBÊNÈ›`Αrp~jÖüñ0…Rõ…•éh?ÞÀ_û_¥ÂçªsÂk{6H7œÉ/”âtTK±Ï…Ã/Ùê²
lHrM/aV5XZ5klHhLbctZs9VOtMzlfc9Cyk7Im2DOkXJwfmoG1vzxMIVS9YWV6Wg/HQLewF/7X6XC56pzwmt7DzZIN5zJL5TidFRLsc+Fwy/Z6rIaNA2uVlCh3XYkWcu882tKt2RySSkn1heWhG0IeNNfopAvbmHDlgszaWaXYzY=
[B@15356d5
The odd characters are there because cryptographic signatures produce bytes rather than strings. Consequently if you want a printable representation you should Base64 encode it (here's a public domain implementation for Java).
Stripping the non-printing characters from a cryptographic signature will render it useless as you will be unable to use it for verification.
Update:
[B@15356d5
This is the result of toString called on a byte array. "[" means array, "B" means byte and "15356d5" is the array object's identity hash code in hexadecimal (not its contents). You should be passing the array you get out of decode to [Signature.verify](http://java.sun.com/j2se/1.4.2/docs/api/java/security/Signature.html#verify(byte[])).
Something like:
Signature sig = Signature.getInstance("DSA"); // Signature is abstract, so use the factory method
sig.initVerify(key);
sig.update(data); // data: the original message bytes that were signed (name assumed here)
sig.verify(Base64.decode(base64)); // <-- bytes go here
How are you "making" the signature? If you use the sign method, you get back a byte array, not a string. That's not a binary representation of some text, it's just arbitrary binary data. That's what you should use, and if you need to convert it into a string you should use a base64 conversion to avoid data corruption.
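A self-contained sketch of the whole round trip: sign, carry the signature as Base64 text, decode it back to bytes and verify. It uses java.util.Base64 (Java 8+) rather than the third-party Base64 class from the question, and the RSA key pair and the SHA256withRSA algorithm are assumptions chosen only to make the example runnable. The class and method names are made up.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;
import java.util.Base64;

class SignatureBase64Demo {
    // Signs a message, transports the signature as Base64 text,
    // then verifies the decoded signature bytes.
    static boolean signAndVerify(String message) {
        try {
            byte[] data = message.getBytes(StandardCharsets.UTF_8);
            KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
            gen.initialize(2048);
            KeyPair pair = gen.generateKeyPair();

            Signature signer = Signature.getInstance("SHA256withRSA");
            signer.initSign(pair.getPrivate());
            signer.update(data);
            byte[] signatureBytes = signer.sign();  // raw bytes, not text

            // Printable form; safe to log, store or send as a string.
            String base64 = Base64.getEncoder().encodeToString(signatureBytes);

            Signature verifier = Signature.getInstance("SHA256withRSA");
            verifier.initVerify(pair.getPublic());
            verifier.update(data);
            return verifier.verify(Base64.getDecoder().decode(base64));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(signAndVerify("hello")); // true
    }
}
```

The signature bytes are never passed through new String(...), so nothing can be replaced or dropped along the way.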
If I understand your problem correctly, you need to get rid of characters with code below 32, except maybe char 9 (tab), char 10 (new line) and char 13 (return).
Edit: I agree with the others as handling a crypto output like this is not what you usually want.
