Chinese character 数 encodes into too many bytes - java

I'm trying to encode some Chinese characters using the GB18030 charset in Java, and I ran into this character 数, which translates to "Number" in Google Translate.
The issue is, it's turning into 10 bytes (!) when encoded:
81 30 81 34 81 30 83 31 ca fd
import java.math.BigInteger;
import java.nio.charset.Charset;
public class Test3
{
    public static void main(String[] args)
    {
        String s = new String("数");
        System.out.println("source file: " + String.format("%x ",
                new BigInteger(1, s.getBytes(Charset.forName("GB18030")))));
    }
}
When I try to decode that using the GB18030, it results in ? characters appearing beside the Chinese Number character (??数). When I try to decode only "CA FD", the last two bytes from above, it correctly decodes to the character.
Google translate notes the above character is Simplified. My source file is also saved in UTF8.
I thought GB18030 has a max of 4 bytes per character? Is there any particular reason this character behaves so strangely? (I'm not Chinese, BTW)

The most likely things are either:
There's an issue with the encoding of your source file, or
You have "invisible" characters prior to the 数 in it.
You can check both of those by completely deleting the string literal on this line:
String s = new String("数");
so it looks like this (note I removed the quotes as well as the character):
String s = new String();
and then adding back "\u6570" to get this:
String s = new String("\u6570");
and seeing if your output changes (as 数 is Unicode code point U+6570 and so that escape sequence should be the same character). If it changes, either there's an encoding problem or you had invisible characters in the string prior to the character. You can probably differentiate the two cases by then adding back just that character (via copy and paste from this page rather than your previous source code). If the problem reappears, it's an encoding issue. If not, you had hidden characters.
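As a minimal sketch of that check (the class name Gb18030Check and its helper methods are made up for illustration), the following prints the GB18030 bytes and the code points of the escaped string. Running the same helpers on the literal pasted from your source file should make any extra, invisible code points show up in the second line of output.

```java
import java.nio.charset.Charset;

public class Gb18030Check {
    // Formats bytes as space-separated lowercase hex, e.g. "ca fd".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Lists every code point of the string, which exposes invisible characters.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        Charset gb18030 = Charset.forName("GB18030");
        String escaped = "\u6570"; // the character written as an escape, independent of file encoding
        System.out.println(hex(escaped.getBytes(gb18030))); // ca fd
        System.out.println(codePoints(escaped));            // U+6570
    }
}
```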


File.toURI does not encode plus sign

I just want to check my own sanity with this question here. I have a filename which has a + (plus) character in it, which is perfectly valid on some operating systems and filesystems (e.g. MacOS and HFS+).
However, I am seeing an issue where I think that java.io.File#toURI() is not operating correctly.
For example:
new File("hello+world.txt").toURI().toString()
On my Mac machine returns:
file:/Users/aretter/code/rocksdb/hello+world.txt
However IMHO, that is not correct, because the + (plus) character from the filename has not been encoded in the URI. The URI does not represent the original filename at all, a + in a URI has a very different meaning to a + character in a filename.
So if we decode the URI, the plus will now be replaced with a (space) character, and we have lost information. e.g.:
URLDecoder.decode(new File("hello+world.txt").toURI().toURL().toString())
Which results in:
file:/Users/aretter/code/rocksdb/hello world.txt
What I would have expected instead would be something like:
new File("hello+world.txt").toURI().toString()
resulting in:
file:/Users/aretter/code/rocksdb/hello%2Bworld.txt
So that when it is later used and decoded the plus sign is preserved.
I am struggling to believe that such an obvious bug could be present in Java SE. Can someone point out where I am mistaken?
Also, if there is a workaround, I would like to hear about it please? Keep in mind that I am not actually providing static strings as filenames to File, but rather reading a directory of files from disk, of which some of those files may contain a + (plus) character.
Let me try to clarify.
The '+' (plus) character is used to encode a ' ' (space) in the context of an HTML form (a.k.a. the application/x-www-form-urlencoded MIME format).
The '%20' sequence is used to encode a ' ' (space) in the context of the URL/URI format.
The '+' (plus) character is treated as a normal character in the context of a URL and is not percent-encoded (e.g. as %2B).
So calling new File("hello+world.txt").toURI().toString() does not encode the '+' character, simply because it is not required.
Now to URLDecoder: this class is a utility class for HTML form decoding. It treats '+' as an encoded character and hence decodes it to a ' ' (space) character. In your example, this class treats the URI's string value as a normal HTML form field value (not as a URI). This class should never be used to decode a full URI/URL value, as it is not designed for that purpose.
From java docs of URLDecoder#decode(String),
Decodes a x-www-form-urlencoded string. The platform's default
encoding is used to determine what characters are represented by any
consecutive sequences of the form "%xy".
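A minimal sketch of the difference, assuming a made-up path /tmp/hello+world 1.txt and a hypothetical class name PlusInUriDemo: the multi-argument java.net.URI constructor percent-encodes the space but leaves the plus sign alone, URI.getPath() reverses exactly that, and URLDecoder applies HTML form rules instead.

```java
import java.net.URI;
import java.net.URLDecoder;

public class PlusInUriDemo {
    // Builds a file URI from a raw path; illegal characters such as spaces are percent-encoded.
    static String toUriString(String rawPath) throws Exception {
        return new URI("file", null, rawPath, null).toString();
    }

    // Decodes the path with URI rules: %20 becomes a space, '+' stays '+'.
    static String pathOf(String uri) throws Exception {
        return new URI(uri).getPath();
    }

    // Decodes with HTML form rules: '+' becomes a space as well.
    static String formDecode(String s) throws Exception {
        return URLDecoder.decode(s, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        String uri = toUriString("/tmp/hello+world 1.txt");
        System.out.println(uri);             // file:/tmp/hello+world%201.txt
        System.out.println(pathOf(uri));     // /tmp/hello+world 1.txt
        System.out.println(formDecode(uri)); // file:/tmp/hello world 1.txt
    }
}
```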
Hope it helps.
Update #1 based on comments:
As per section 2.2 of RFC 3986, if data for a URI component conflicts with a reserved character's purpose as a delimiter, the conflicting data must be percent-encoded before the URI is formed.
It is also important that different parts of a URI have different sets of reserved characters depending on their context. For example, the / sign is reserved only in the path part of a URI, and the + sign is reserved in the query string part. So there is no need to escape / in the query part and, similarly, there is no need to escape + in the path part.
In your example, the URI producer File.toURI does not encode the + sign in the path part of the URI (since + is not considered a reserved character in the path part), so you see the + sign in the URI's string representation.
You may refer to the URI recommendation for more details.
Related answer:
https://stackoverflow.com/a/1006074/1700467
https://stackoverflow.com/a/2678602/1700467
https://stackoverflow.com/a/4571518/1700467
I'm assuming you want to encode the + sign in your filename as %2B, so that you get a + sign back when you decode it.
If that is the case, then you need to use URLEncoder.encode:
System.out.println(URLEncoder.encode(new File("hello+world.txt").toURI().toString()));
It will encode all special characters including + sign. The output would be
file%3A%2Fhome%2FT8hvs7%2Fhello%2Bworld.txt
Now, to decode use URLDecoder.decode
System.out.println(URLDecoder.decode("file%3A%2Fhome%2FwQCXni%2Fhello%2Bworld.txt"));
It will display
file:/home/wQCXni/hello+world.txt
Obviously this is not a bug, documentation clearly says
The plus sign "+" is converted into a space character " " .
You can do something like that: https://ideone.com/JHDkM4
import java.util.*;
import java.lang.*;
import java.io.*;
import static java.lang.System.out;
class Ideone
{
    public static void main (String[] args) throws java.lang.Exception
    {
        out.println(new File("hello+world.txt").toURI().toString());
        out.println(java.net.URLDecoder.decode(new File("hello+world.txt").toURI().toURL().toString()));
        out.println(new File("hello+world.txt").toURI().toString().replaceAll("\\+", "%2B"));
    }
}
If the URI represents a file, let the File class decode the URI.
Let's say we have a URI for a file, for example to get the filepath of a jar file :
URI uri = MyClass.class.getProtectionDomain().getCodeSource().getLocation().toURI();
System.out.println(uri.toString());
=> BAD : will display the plus sign, but %20 for spaces
System.out.println(URLDecoder.decode(uri.toString(), StandardCharsets.UTF_8.toString()));
=> BAD : will display spaces instead of %20, but also instead of the plus sign
System.out.println(new File(uri).getAbsolutePath());
=> GOOD
Try to escape the plus sign with a backslash \
So do
new File("hello\+world.txt").toURI().toString()

How is it possible to encode String twice?

I was a Python programmer (and still am), so I am familiar with Python encoding and decoding.
I was surprised that Java can encode a String twice in a row.
This is example code:
import java.net.URLEncoder;
public class OpenAPITest {
    public static void main(String[] arg) throws Exception {
        String str = "안녕"; // Korean
        String utfStr = URLEncoder.encode(str, "UTF-8");
        System.out.println(utfStr);
        String ms949Str = URLEncoder.encode(utfStr, "MS949");
        System.out.println(ms949Str);
    }
}
I wonder how it can encode a string twice.
In Python 3.x, once you encode a 'str' (a Unicode string), it is converted to 'bytes' (a byte string), and 'bytes' only has a decode() function.
Additionally, I want to get the same String value in Python 3 as ms949Str in my example code. Please give me some advice. Thanks.
Don't know Python, besides you didn't say what Python method you were using anyway, but if the Python method converted a Python string into a UTF-8 sequence of bytes, then you're using the wrong conversion method here, because that has nothing to do with URL Encoding.
str.getBytes("UTF-8") will return a byte[] with the Java string encoded in UTF-8.
new String(bytes, "UTF-8") will decode the byte array.
URL Encoding is about converting text into a string that is valid as a component of a full URL, meaning that all special characters must be encoded using %NN escapes. Non-ASCII characters have to be encoded too.
As an example, take the string Test & gehört. When URL Encoded, it becomes the following string:
Test+%26+geh%C3%B6rt
The string Test & gehört becomes the following sequence of bytes (displayed in hex) when used with getBytes:
54 65 73 74 20 26 20 67 65 68 c3 b6 72 74
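The contrast can be sketched as follows (the class name EncodeTwiceDemo and its helpers are made up for illustration). URLEncoder.encode returns a String, not bytes, so nothing stops you from feeding its result back in. The second pass only sees ASCII characters, so any ASCII-compatible charset name gives the same result there; UTF-8 is used for both passes here.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class EncodeTwiceDemo {
    // URL-encodes text; the result is again a String, so it can be encoded a second time.
    static String urlEncode(String s) throws Exception {
        return URLEncoder.encode(s, "UTF-8");
    }

    // Formats bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String text = "Test & geh\u00f6rt";
        String once = urlEncode(text);
        String twice = urlEncode(once);
        System.out.println(once);  // Test+%26+geh%C3%B6rt
        System.out.println(twice); // Test%2B%2526%2Bgeh%25C3%25B6rt
        // Character encoding proper: String to bytes, the analogue of Python's str.encode().
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_8)));
    }
}
```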

Japanese Character Encoding in Base64

I have been asked to fix a bug in our email processing software.
When a message whose subject is encoded in RFC 2047 like this:
=?ISO-2022-JP?B?GyRCR1s/LiVGJTklSC1qRnxLXDhsGyhC?=
is received, it is incorrectly decoded - one of the Japanese characters is not rendered properly. It is rendered like this: 配信テスト?日本語 when it should be 配信テスト㈱日本語
(I do not understand Japanese.) Clearly one of the characters, the one that looks like it is in brackets, has not been rendered.
The decoding is carried out by javax.mail.internet.MimeUtility.decodeText()
If I try it with an on-line decoder (the only one I've found is here) it seems to work OK, so I was suspecting a bug in MimeUtility.
So I tried some experiments, in the form of this little program:
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import org.apache.commons.codec.binary.Base64;
public class Encoding {
    private static final Charset CHARSET = Charset.forName("ISO-2022-JP");
    public static void main(String[] args) throws UnsupportedEncodingException {
        String control = "繋がって";
        String subject = "配信テスト㈱日本語";
        // Round-trip the control string through ISO-2022-JP and Base64
        String controlBase64 = japaneseToBase64(control);
        System.out.println(controlBase64);
        System.out.println(base64ToJapanese(controlBase64));
        // Round-trip the problematic subject the same way
        String subjectBase64 = japaneseToBase64(subject);
        System.out.println(subjectBase64);
        System.out.println(base64ToJapanese(subjectBase64));
    }
    private static String japaneseToBase64(String in) {
        return Base64.encodeBase64String(in.getBytes(CHARSET));
    }
    private static String base64ToJapanese(String in) {
        return new String(Base64.decodeBase64(in), CHARSET);
    }
}
(The Base64 and Hex classes are in org.apache.commons.codec)
When I run it, here's the output:
GyRCN1IkLCRDJEYbKEI=
繋がって
GyRCR1s/LiVGJTklSCEpRnxLXDhsGyhC
配信テスト?日本語
The first, shorter Japanese string is a control, and this returns the same as the input, having been converted into Base64 and back again, using Charset ISO-2022-JP. All OK there.
The second Japanese string is the one with the dodgy character. As you see, it returns with a ? instead of the character. The Base64 encoding output is also different from the original subject encoding.
Sorry if this is long, I wanted to be thorough. What's going on, and how can I decode this character correctly?
The bug is not in your software, but the subject string itself is incorrectly encoded. Other software may be able to decode the text by making further assumptions about the content, just as it is often assumed that characters in the range 0x80-0x9f are Cp1252-encoded, although ISO-8859-1 or ISO-8859-15 is specified.
ISO-2022-JP is a multi-charset encoding that uses escape sequences to switch between the character sets actually in use. Your encoded string starts with ESC $ B, indicating that the character set JIS X 0208-1983 is used. The offending character is encoded as 0x2d6a. That code point is not defined in the referenced character set, but was later added in JIS X 0213:2000, a newer version of the JIS X character set specifications.
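To see those bytes for yourself, here is a small sketch that decodes the Base64 payload from the subject line with java.util.Base64 (swapped in for commons-codec) and dumps it as hex. The class name Iso2022JpBytes, the helper methods and the hard-coded offset 13 are illustration only. The row and cell numbers are computed with the usual one-based convention (byte value minus 0x20); counting from zero gives 12 and 73 instead.

```java
import java.util.Base64;

public class Iso2022JpBytes {
    // Decodes the Base64 payload of the RFC 2047 encoded word into raw bytes.
    static byte[] decode(String base64Text) {
        return Base64.getDecoder().decode(base64Text);
    }

    // Formats bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Converts a two-byte JIS code to one-based row/cell numbers (byte value minus 0x20).
    static int[] kuten(byte hi, byte lo) {
        return new int[] { (hi & 0xff) - 0x20, (lo & 0xff) - 0x20 };
    }

    public static void main(String[] args) {
        byte[] bytes = decode("GyRCR1s/LiVGJTklSC1qRnxLXDhsGyhC");
        // Starts with 1b 24 42 (ESC $ B) and ends with 1b 28 42 (ESC ( B).
        System.out.println(hex(bytes));
        // After the 3-byte escape sequence, the sixth two-byte pair sits at offsets 13 and 14.
        int[] kt = kuten(bytes[13], bytes[14]);
        System.out.println("ku=" + kt[0] + " ten=" + kt[1]); // ku=13 ten=74
    }
}
```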
Try using "MS932" or "Shift-JIS" as your encoding, i.e.
private static final Charset CHARSET = Charset.forName("MS932");
Japanese uses several scripts, such as kanji and katakana. Some encodings, like Cp132, do not support some Japanese characters. The problem you face is caused by the "ISO-2022-JP" encoding you have used in your code.
ISO-2022-JP uses pairs of bytes, called ku and ten, that index into a 94×94 table of characters. The pair that fails has ku 12 and ten 73 counting from zero (13-74 in the usual one-based notation), which is not listed in the table of valid characters I have (based on JIS X 0208). The whole of that row seems to be unused.
Wikipedia doesn't list any updates to JIS X 0208, either. Perhaps the sender is using some sort of vendor-defined extension?
Despite the fact that ISO-2022-JP is a variable width encoding, it seems as though Java doesn't support the section of the character set that it lies in (possibly as a result of the missing escape sequences in ISO-2022-JP-2 that are present in ISO-2022-JP-3 and ISO-2022-JP-2004 which aren't supported). UTF-8, UTF-16 and UTF-32 do however support all of the characters.
UTF-32:
AAB+SwAAMEwAADBjAAAwZg==
繋がって
AACRTQAAT+EAADDGAAAwuQAAMMgAADIxAABl5QAAZywAAIqe
配信テスト㈱日本語
As an extra tidbit, regardless of whether UTF-32 was used, when the strings were printed as-is they retained their natural encoding and appeared normally.

Decode of base64 string containing zip file gets 8 character codes wrong in result string

I'm receiving a base64-encoded zip file (in the form of a string) from a SOAP request.
I can decode the string successfully using a stand-alone program, b64dec.exe, but I need to do it in a java routine. I'm trying to decode it (theZipString) with Apache commons-codec-1.7.jar routines:
import org.apache.commons.codec.binary.Base64;
import org.apache.commons.codec.binary.StringUtils;
StringUtils.newString(Base64.decodeBase64(theZipString), "ISO-8859-1");
Zip file readers open the resulting file and show the list of content files but the content files have CRC errors.
I compared the result of my java routine with the result of the b64dec.exe program (using UltraEdit) and found that they are identical with the exception that eight different byte-values, where ever they appear in the b64dec.exe result, are replaced by 3F ("?") in mine. The values and their ISO-8859-1 character names are A4 ('currency'), A6 ('broken bar'), A8 ('diaeresis'), B4 ('acute accent'), B8 ('cedilla'), BC ('vulgar fraction 1/4'), BD ('vulgar fraction 1/2'), and BE ('vulgar fraction 3/4').
I'm guessing that the StringUtils.newString function is not translating those eight values to the string output, because I tried other 8-bit character sets: UTF-8, and cp437. Their results are similar but worse, with many more 3F, "?" substitutions.
Any suggestions? What character set should I use for the newString function to convert a .zip string? Is the Apache function incapable of this translation? Is there a better way to do this decode?
Thanks!
A zip file is not a string. It's not encoded text. It may contain text files, but that's not the same thing. It's just binary data.
If you treat arbitrary binary data as a string, bad things will happen. Instead, you should use streams or byte arrays. So this is fine:
byte[] zipData = Base64.decodeBase64(theZipString);
... but don't try to convert that to a string. If you write out that byte[] to a file (probably with FileOutputStream or some utility method) it should be fine.
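A minimal sketch of the byte-array route, using java.util.Base64 from the JDK in place of commons-codec; the class name ZipFromBase64, the temp file and the stand-in payload are assumptions for illustration. The payload consists of the eight byte values from the question, which survive because no String conversion ever happens.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Base64;

public class ZipFromBase64 {
    // Decodes Base64 text straight to bytes; no charset is involved at any point.
    static byte[] decode(String base64Text) {
        return Base64.getDecoder().decode(base64Text);
    }

    // Writes the decoded bytes to the target file unchanged.
    static Path writeZip(String base64Text, Path target) throws IOException {
        return Files.write(target, decode(base64Text));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in payload: the eight byte values that were turned into '?' in the question.
        byte[] original = {
            (byte) 0xA4, (byte) 0xA6, (byte) 0xA8, (byte) 0xB4,
            (byte) 0xB8, (byte) 0xBC, (byte) 0xBD, (byte) 0xBE
        };
        String theZipString = Base64.getEncoder().encodeToString(original);

        Path out = Files.createTempFile("decoded", ".zip");
        writeZip(theZipString, out);
        System.out.println(Arrays.equals(original, Files.readAllBytes(out))); // true
    }
}
```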

why is String.split("£", 2) not working?

I have a text file with 1000 lines in the following format:
19 x 75 Bullnose Architrave/Skirting £1.02
I am writing a method that reads the file in line by line. This works OK.
I then want to split each string using "£" as a delimiter and write it out to
an ArrayList<String> in the following format:
19 x 75 Bullnose Architrave/Skirting, Metre, 1.02
This is how I have approached it (productList is the ArrayList, declared/instantiated outside the try block):
try {
    br = new BufferedReader(new FileReader(aFile));
    String inputLine = br.readLine();
    String delim = "£";
    while (inputLine != null) {
        String[] halved = inputLine.split(delim, 2);
        String lineOut = halved[0] + ", Metre, " + halved[1]; // Array out of bounds
        productList.add(lineOut);
        inputLine = br.readLine();
    }
}
The String is not splitting and I keep getting an ArrayIndexOutOfBoundsException. I'm not very familiar with regex. I've also tried using the old StringTokenizer but get the same result.
Is there an issue with £ as a delim or is it something else? I did wonder if it is something to do with the second token not being read as a String?
Any ideas would be helpful.
Here are some of the possible causes:
The encoding of the file doesn't match the encoding that you are using to read it, and the "pound" character in the file is getting "mangled" into something else.
The file and your source code are using different pound-like characters. For instance, Unicode has two code points that look like a "pound sign": the Pound Sterling character (00A3) and the Lira character (20A4) ... and then there is the Roman semuncia character (10192).
You are trying to compile a UTF-8 encoded source file without telling the compiler that it is UTF-8 encoded.
Judging from your comments, this is an encoding mismatch problem; i.e. the "default" encoding being used by Java doesn't match the actual encoding of the file. There are two ways to address this:
Change the encoding of the file to match Java's default encoding. You seem to have tried that and failed. (And it wouldn't be the way I'd do this ...)
Change the program to open the file with a specific (non default) encoding; e.g. change
new FileReader(aFile)
to
new InputStreamReader(new FileInputStream(aFile), encoding)
where encoding is the name of the file's actual character encoding. The names of the encodings understood by Java are listed here, but my guess is that it is "ISO-8859-1" (aka Latin-1).
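Put together, reading with an explicit encoding might look like the sketch below. The class name PriceListReader, the helper names and the temp file are made up, and ISO-8859-1 is only the guess from above, not something confirmed about your file. The pound sign is written as a Unicode escape so that the source file's own encoding cannot interfere.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

public class PriceListReader {
    // Pound sign as an escape, so it does not depend on how this source file is saved.
    static final String POUND = "\u00a3";

    // Splits one line at the first pound sign and rebuilds it in the target format.
    static String reformat(String line) {
        String[] halved = line.split(Pattern.quote(POUND), 2);
        if (halved.length < 2) {
            throw new IllegalArgumentException("no pound sign in: " + line);
        }
        return halved[0].trim() + ", Metre, " + halved[1].trim();
    }

    // Reads the file with an explicit charset instead of the platform default.
    static List<String> read(Path file, Charset charset) throws IOException {
        List<String> out = new ArrayList<>();
        try (BufferedReader br = Files.newBufferedReader(file, charset)) {
            String line;
            while ((line = br.readLine()) != null) {
                out.add(reformat(line));
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the real file: one line written as ISO-8859-1.
        Path tmp = Files.createTempFile("prices", ".txt");
        Files.write(tmp,
                Collections.singletonList("19 x 75 Bullnose Architrave/Skirting " + POUND + "1.02"),
                StandardCharsets.ISO_8859_1);
        System.out.println(read(tmp, StandardCharsets.ISO_8859_1));
    }
}
```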
This is probably a case of encoding mismatch. To check for this,
Print delim.length() and make sure it is 1.
Print inputLine.length() and make sure it is the right value (42).
If one of them is not the expected value, then you have to make sure you are using UTF-8 everywhere.
You say delim.length() is 1, so this is good. On the other hand, if inputLine.length() is 34, this is very wrong. For "19 x 75 Bullnose Architrave/Skirting £1.02" you should get 42 if all was as expected. If your file was UTF-8 encoded but read as ISO-8859-1 or similar, you would have gotten 43.
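The 42 versus 43 arithmetic can be reproduced with a short sketch (the class name LengthCheck and the misread helper are made up for illustration). In UTF-8 the pound sign takes two bytes, and each of them turns into its own character when the bytes are decoded as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

public class LengthCheck {
    // Simulates reading UTF-8 encoded text with an ISO-8859-1 decoder.
    static String misread(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String line = "19 x 75 Bullnose Architrave/Skirting \u00a31.02";
        System.out.println(line.length());          // 42
        System.out.println(misread(line).length()); // 43
    }
}
```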
Now I am a little at a loss. To debug this you could print individually each character of the string and check what is wrong with them.
for (int i = 0; i < inputLine.length(); i++)
    System.err.println("debug: " + i + ": " + inputLine.charAt(i) + " (" + inputLine.codePointAt(i) + ")");
Many thanks for all your replies.
Specifying the encoding when reading and saving the original text file as UTF-8 has worked.
However, the experience has taught me that delimiting text using "£" or indeed other characters that may have multiple representations in different encodings is a poor strategy.
I have decided to take a different approach:
1) Find the last space in the input string & replace it with "xxx" or similar.
2) Split this using the delimiter "xxx." which should split the strings & rip out the "£".
3) Carry on..
