How is it possible to encode String twice? - java

I was a Python programmer (and still am), so I am familiar with encoding and decoding in Python.
I was surprised to find that Java can encode a String twice in a row.
Here is some example code:
import java.net.URLEncoder;
public class OpenAPITest {
public static void main(String[] arg) throws Exception {
String str = "안녕"; // Korean
String utfStr = URLEncoder.encode(str, "UTF-8");
System.out.println(utfStr);
String ms949Str = URLEncoder.encode(utfStr, "MS949");
System.out.println(ms949Str);
}
}
I wonder how it can encode a string twice.
In Python 3.x, once you encode a 'str' (a Unicode string), it is converted to 'bytes' (a byte string), and 'bytes' only has a decode() method.
Additionally, I want to get the same String value in Python 3 as ms949Str in my example code. Please give me some advice. Thanks.

I don't know Python, and you didn't say which Python method you were using, but if that method converts a Python string into a UTF-8 sequence of bytes, then URLEncoder is the wrong Java counterpart, because that has nothing to do with URL encoding.
str.getBytes("UTF-8") will return a byte[] with the Java string encoded in UTF-8.
new String(bytes, "UTF-8") will decode the byte array.
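URL encoding is about converting text into a string that is valid as a component of a full URL, meaning that all special characters must be encoded using %NN escapes. Non-ASCII characters have to be encoded too. Because the result is itself a String, it can be URL-encoded again; the second pass simply escapes each % as %25.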
As an example, take the string Test & gehört. When URL Encoded, it becomes the following string:
Test+%26+geh%C3%B6rt
The string Test & gehört becomes the following sequence of bytes (displayed in hex) when used with getBytes:
54 65 73 74 20 26 20 67 65 68 c3 b6 72 74
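To make the difference concrete, here is a minimal sketch that runs both conversions on the same text. The class and helper names (UrlEncodingDemo, urlEncode, toHex) are made up for illustration, and the second pass uses UTF-8 rather than MS949 on the assumption that the charset no longer matters once the input is pure ASCII.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class UrlEncodingDemo {

    // URL-encodes text; the result is again a String, made only of ASCII characters.
    static String urlEncode(String text, String charsetName) {
        try {
            return URLEncoder.encode(text, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    // Formats raw bytes as space-separated lowercase hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "Test & geh\u00f6rt"; // the escape stands for o-umlaut

        String once = urlEncode(text, "UTF-8");
        String twice = urlEncode(once, "UTF-8"); // input is plain ASCII by now
        System.out.println(once);  // Test+%26+geh%C3%B6rt
        System.out.println(twice); // Test%2B%2526%2Bgeh%25C3%25B6rt

        // Charset encoding, by contrast, leaves the String world and yields bytes.
        System.out.println(toHex(text.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The second call succeeds for the same reason the first does: URLEncoder maps a String to a String, so there is nothing comparable to Python's str/bytes split to stop it.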


Java - decode base64 - Illegal base64 character 1

I have the following data in a file:
I want to decode the UserData. After reading the line as a string named comment, I do the following:
String[] split = comment.split("=");
if(split[0].equals("UserData")) {
System.out.println(split[1]);
byte[] callidArray = Arrays.copyOf(java.util.Base64.getDecoder().decode(split[1]), 9);
System.out.println("UserData:" + Hex.encodeHexString(callidArray).toString());
}
But I'm getting the following exception:
java.lang.IllegalArgumentException: Illegal base64 character 1
What could be the reason?
The image suggests that the string you are trying to decode contains characters like SOH and BEL. These are ASCII control characters, and will not ever appear in a Base64 encoded string.
(Base64 typically consists of letters, digits, and +, / and =. There are some variant formats, but control characters are never included.)
This is confirmed by the exception message:
java.lang.IllegalArgumentException: Illegal base64 character 1
The SOH character has ASCII code 1.
Conclusions:
You cannot decode that string as if it was Base64. It won't work.
It looks like the string is not "encoded" at all ... in the normal sense of what "encoding" means in Java.
We can't advise you on what you should do with it without a clear explanation of:
where the (binary) data comes from,
what you expected it to contain, and
how you read the data and turned it into a Java String object: show us the code that did that!
The UserData field in the picture in the question actually contains raw bytes, not Base64 text.
So I don't need to decode Base64. I need to copy the string into a byte array and print the hexadecimal representation of those bytes.
String[] split = comment.split("=");
if(split[0].equals("UserData")) {
System.out.println(split[1]);
byte[] callidArray = Arrays.copyOf(split[1].getBytes(), 9);
System.out.println("UserData:" + Hex.encodeHexString(callidArray).toString());
}
Output:
UserData:010a20077100000000
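The two steps can be sketched without Commons Codec as follows. This is only an illustration: the sample value is a hypothetical reconstruction chosen to match the output above, the names UserDataDemo, isBase64 and firstBytesAsHex are invented, and it assumes the line was read so that each byte became exactly one char (ISO-8859-1), since getBytes() without an argument depends on the platform default charset.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

final class UserDataDemo {

    // True only if the standard Base64 decoder accepts the text.
    static boolean isBase64(String text) {
        try {
            Base64.getDecoder().decode(text);
            return true;
        } catch (IllegalArgumentException e) {
            return false; // e.g. "Illegal base64 character 1"
        }
    }

    // Hex-encodes the first `count` bytes, zero-padding like Arrays.copyOf does.
    static String firstBytesAsHex(byte[] bytes, int count) {
        StringBuilder sb = new StringBuilder();
        for (byte b : Arrays.copyOf(bytes, count)) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample: SOH, LF, space, BEL, 'q', then four NUL characters.
        String userData = "\u0001\n \u0007q\0\0\0\0";

        System.out.println(isBase64(userData)); // false

        // ISO-8859-1 maps chars 0..255 to the same byte values, one to one.
        byte[] raw = userData.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("UserData:" + firstBytesAsHex(raw, 9)); // UserData:010a20077100000000
    }
}
```

Naming the charset explicitly keeps the result the same on every machine; reading the file as bytes in the first place would avoid the String detour entirely.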

Java String internal representation

I understand that Java's internal representation of String is UTF-16 (see "What is the Java string representation?").
Also, I know that in a UTF-16 String, each 'character' is encoded with one or two 16-bit code units.
However, when I debug the following java code
String hello = "Hello";
the variable hello is an array of 5 bytes: 0x48, 0x65, 0x6C, 0x6C, 0x6F,
which is ASCII for "Hello".
How can this be?
I took a gcore dump of a mini java process with this code:
class Hi {
public static void main(String args[]) {
String hello = "Hello";
try {
Thread.sleep(60_000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
Then I took a gcore memory dump on Ubuntu (using jps to get the pid and passing that to gcore).
I found this: 48 65 6C 6C 6F in the dump using a hex editor, so it is somewhere in memory as ASCII.
But I also found 48 00 65 00 6C 00 6C, which is part of the UTF-16 representation of the String.
The internal representation of String is not specified; it is an implementation detail, so you cannot rely on it. It's very likely that in JDK 9 it will change to a dual representation (Latin-1 for strings that can be encoded in Latin-1, UTF-16 for all others). See JEP 254 for details. This feature is already integrated into the OpenJDK master codebase, so if you are using Java 9 early-access builds, you will actually see 5 bytes.
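The internal storage cannot be inspected through the String API, but the two byte patterns found in the dump are easy to reproduce by encoding the string explicitly. The sketch below only shows what each layout looks like; it says nothing about which one a given JVM uses internally, and the class name StringBytesDemo is made up.

```java
import java.nio.charset.StandardCharsets;

final class StringBytesDemo {

    // Formats raw bytes as space-separated uppercase hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String hello = "Hello";

        // The API always exposes UTF-16 code units, whatever the storage is.
        System.out.println(hello.length()); // 5

        // One byte per character, as in a Latin-1 (or ASCII) layout.
        System.out.println(toHex(hello.getBytes(StandardCharsets.ISO_8859_1)));
        // 48 65 6C 6C 6F

        // Two bytes per character, low byte first, as in a little-endian UTF-16 layout.
        System.out.println(toHex(hello.getBytes(StandardCharsets.UTF_16LE)));
        // 48 00 65 00 6C 00 6C 00 6F 00
    }
}
```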

ASC Visual Basic for Java

I need a function in Java that does the same as the ASC function in Visual Basic. I have been looking for it on the internet, but I can't find a solution.
The String whose codes I need was created in Visual Basic. It uses ISO 8859-1 and Microsoft Windows Latin-1 characters. The ASC function in Visual Basic knows those codes, but in Java I can't find a function that does the same thing.
I know this approach in Java:
String myString = "ÅÛ–ßÕÅÝ•ÞÃ";
int first = (int) myString.charAt(0); // "Å" - VB and Java return: 197
int second = (int) myString.charAt(1); // "Û" - VB and Java return: 219
int third = (int) myString.charAt(2); // "–" - VB returns: 150 and Java returns: 8211
The first two characters are no problem, but for the third character Java does not return the code that VB does.
How can I get the same codes in VB and Java?
First of all, note that ISO 8859-1 != Windows Latin-1 (Windows-1252). (See http://en.wikipedia.org/wiki/Windows-1252)
The problem is that Java represents characters as UTF-16 code units, so casting a char to int will generally give you its Unicode value.
To get the Windows-1252 code of a char, first convert the string to a Windows-1252 encoded byte array:
import java.nio.charset.Charset;
public class Encoding {
public static void main(String[] args) {
// Cp1252 is Windows codepage 1252
byte[] bytes = "ÅÛ–ßÕÅÝ•ÞÃ".getBytes(Charset.forName("Cp1252"));
for (byte b: bytes) {
System.out.println(b & 255);
}
}
}
prints:
197
219
150
223
213
197
221
149
222
195
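If a per-character function is more convenient, the same idea can be wrapped in a small helper. This is a sketch, not an exact port of VB's Asc: the name asc is hypothetical, it assumes the VB side was running with the Windows-1252 code page, and characters that do not exist in Windows-1252 come back as 63 ('?') because that is how getBytes substitutes them.

```java
import java.nio.charset.Charset;

final class VbAsc {

    private static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    // Hypothetical helper: returns the Windows-1252 code (0..255) of one char.
    static int asc(char c) {
        byte[] encoded = String.valueOf(c).getBytes(WINDOWS_1252);
        return encoded[0] & 0xff; // mask so the byte is not read as negative
    }

    public static void main(String[] args) {
        char aRing = '\u00c5';  // A with ring above
        char enDash = '\u2013'; // en dash

        System.out.println(asc(aRing));   // 197
        System.out.println(asc(enDash));  // 150
        System.out.println((int) enDash); // 8211, the Unicode value
    }
}
```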

Chinese character 数 encodes into too many bytes

I'm trying to encode some Chinese characters using the GB18030 code page in Java, and I ran into this character 数, which translates to "Number" in Google Translate.
The issue is, it's turning into 10 bytes (!) when encoded:
81 30 81 34 81 30 83 31 ca fd
import java.math.BigInteger;
import java.nio.charset.Charset;
public class Test3
{
public static void main(String[] args)
{
String s = new String("数");
System.out.println( "source file: "+String.format("%x ",
new BigInteger(1, s.getBytes(Charset.forName("GB18030"))) ));
}
}
When I try to decode that using the GB18030, it results in ? characters appearing beside the Chinese Number character (??数). When I try to decode only "CA FD", the last two bytes from above, it correctly decodes to the character.
Google translate notes the above character is Simplified. My source file is also saved in UTF8.
I thought GB18030 has a max of 4 bytes per character? Is there any particular reason this character behaves so strangely? (I'm not Chinese, BTW)
The most likely things are either:
There's an issue with the encoding of your source file, or
You have "invisible" characters prior to the 数 in it.
You can check both of those by completely deleting the string literal on this line:
String s = new String("数");
so it looks like this (note I removed the quotes as well as the character):
String s = new String();
and then adding back "\u6570" to get this:
String s = new String("\u6570");
and seeing if your output changes (as 数 is Unicode code point U+6570 and so that escape sequence should be the same character). If it changes, either there's an encoding problem or you had invisible characters in the string prior to the character. You can probably differentiate the two cases by then adding back just that character (via copy and paste from this page rather than your previous source code). If the problem reappears, it's an encoding issue. If not, you had hidden characters.
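Another way to look for hidden characters is to print the code points the compiler actually put into the string. The sketch below is only an aid for that check: the class and method names are made up, and the "suspect" value is a hypothetical example of a string carrying two invisible control characters, not a claim about what your source file contains.

```java
import java.nio.charset.Charset;
import java.util.stream.Collectors;

final class CodePointDump {

    // Lists every code point in the string as U+XXXX, separated by spaces.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String clean = "\u6570";
        // Hypothetical: two invisible control characters before the real one.
        String suspect = "\u0084\u0095\u6570";

        System.out.println(describe(clean));   // U+6570
        System.out.println(describe(suspect)); // U+0084 U+0095 U+6570

        // GB18030 may not be installed in every runtime, so check first.
        if (Charset.isSupported("GB18030")) {
            Charset gb18030 = Charset.forName("GB18030");
            System.out.println(clean.getBytes(gb18030).length + " byte(s) for clean");
            System.out.println(suspect.getBytes(gb18030).length + " byte(s) for suspect");
        }
    }
}
```

Running describe on the original string literal shows right away whether it holds one code point or several.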

Decode of base64 string containing zip file gets 8 character codes wrong in result string

I'm receiving a base64-encoded zip file (in the form of a string) from a SOAP request.
I can decode the string successfully using a stand-alone program, b64dec.exe, but I need to do it in a java routine. I'm trying to decode it (theZipString) with Apache commons-codec-1.7.jar routines:
import org.apache.commons.codec.binary.Base64;
import org.apache.commons.codec.binary.StringUtils;
StringUtils.newString(Base64.decodeBase64(theZipString), "ISO-8859-1");
Zip file readers open the resulting file and show the list of content files but the content files have CRC errors.
I compared the result of my java routine with the result of the b64dec.exe program (using UltraEdit) and found that they are identical with the exception that eight different byte-values, wherever they appear in the b64dec.exe result, are replaced by 3F ("?") in mine. The values and their ISO-8859-1 character names are A4 ('currency'), A6 ('broken bar'), A8 ('diaeresis'), B4 ('acute accent'), B8 ('cedilla'), BC ('vulgar fraction 1/4'), BD ('vulgar fraction 1/2'), and BE ('vulgar fraction 3/4').
I'm guessing that the StringUtils.newString function is not translating those eight values to the string output, because I tried other 8-bit character sets: UTF-8, and cp437. Their results are similar but worse, with many more 3F, "?" substitutions.
Any suggestions? What character set should I use for the newString function to convert a .zip string? Is the Apache function incapable of this translation? Is there a better way to do this decode?
Thanks!
A zip file is not a string. It's not encoded text. It may contain text files, but that's not the same thing. It's just binary data.
If you treat arbitrary binary data as a string, bad things will happen. Instead, you should use streams or byte arrays. So this is fine:
byte[] zipData = Base64.decodeBase64(theZipString);
... but don't try to convert that to a string. If you write out that byte[] to a file (probably with FileOutputStream or some utility method) it should be fine.
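A minimal sketch of the bytes-only route follows. It swaps in the JDK's own java.util.Base64 (Java 8+) instead of Commons Codec so that it runs without extra jars, and it uses the MIME decoder on the assumption that the SOAP payload may contain line breaks. The class name, method names and sample bytes are invented for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Base64;

final class ZipDecodeDemo {

    // Decodes Base64 text straight to bytes; no String is built from the result.
    static byte[] decode(String base64) {
        return Base64.getMimeDecoder().decode(base64);
    }

    // Writes the decoded bytes to the target file unchanged.
    static Path save(String base64, Path target) {
        try {
            return Files.write(target, decode(base64));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for real zip content, including bytes that were turning into '?'.
        byte[] original = {0x50, 0x4b, (byte) 0xa4, (byte) 0xa6, (byte) 0xbe};
        String theZipString = Base64.getEncoder().encodeToString(original);

        Path target = Files.createTempFile("decoded", ".bin");
        save(theZipString, target);

        // Every byte survives because no charset conversion is involved.
        System.out.println(Arrays.equals(original, Files.readAllBytes(target))); // true
        Files.delete(target);
    }
}
```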
