I need a function in Java that does the same thing as the Asc function in Visual Basic. I've been looking for it on the internet, but I can't find a solution.
The string whose codes I need to know was created in Visual Basic. It uses the ISO 8859-1 and Microsoft Windows Latin-1 character sets. The Asc function in Visual Basic knows those codes, but in Java I can't find a function that does the same thing.
For example, I know that in Java:
String myString = "ÅÛ–ßÕÅÝ•ÞÃ";
int first = (int) myString.charAt(0);  // "Å" - VB and Java both return: 197
int second = (int) myString.charAt(1); // "Û" - VB and Java both return: 219
int third = (int) myString.charAt(2);  // "–" - VB returns: 150, Java returns: 8211
I have no problem with the first two characters, but the third one does not come back as a single-byte (ANSI) code.
How can I get the same codes in VB and Java?
First of all, note that ISO 8859-1 != Windows Latin-1. (See http://en.wikipedia.org/wiki/Windows-1252)
The problem is that Java represents characters as UTF-16, so casting a char to int will generally give you its Unicode (UTF-16 code unit) value.
To get the Windows Latin-1 (Cp1252) encoding of a char, first convert the string to a Cp1252-encoded byte array:
import java.nio.charset.Charset;

public class Encoding {
    public static void main(String[] args) {
        // Cp1252 is Windows code page 1252 (Windows Latin-1)
        byte[] bytes = "ÅÛ–ßÕÅÝ•ÞÃ".getBytes(Charset.forName("Cp1252"));
        for (byte b : bytes) {
            System.out.println(b & 0xFF); // print the unsigned byte value
        }
    }
}
prints:
197
219
150
223
213
197
221
149
222
195
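If you want something shaped like VB's Asc for a single character, here is a minimal sketch built on the same idea (the class and method names are my own, and it assumes the VB side used Windows-1252):

import java.nio.charset.Charset;

public class Asc {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Returns the Windows-1252 code (0-255) of the character, like VB's Asc().
    static int asc(char c) {
        byte[] b = String.valueOf(c).getBytes(CP1252);
        return b[0] & 0xFF; // unmappable characters come back as '?' (63)
    }

    public static void main(String[] args) {
        System.out.println(asc('\u00C5')); // 'Å' -> 197
        System.out.println(asc('\u2013')); // '–' (EN DASH) -> 150
    }
}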
According to the Java documentation for String.length:
public int length()
Returns the length of this string.
The length is equal to the number of Unicode code units in the string.
Specified by:
length in interface CharSequence
Returns:
the length of the sequence of characters represented by this object.
But then I don't understand why the following program, HelloUnicode.java, produces different results on different platforms. According to my understanding, the number of Unicode code units should be the same, since Java supposedly always represents strings in UTF-16:
public class HelloWorld {
public static void main(String[] args) {
String myString = "I have a 🙂 in my string";
System.out.println("String: " + myString);
System.out.println("Bytes: " + bytesToHex(myString.getBytes()));
System.out.println("String Length: " + myString.length());
System.out.println("Byte Length: " + myString.getBytes().length);
System.out.println("Substring 9 - 13: " + myString.substring(9, 13));
System.out.println("Substring Bytes: " + bytesToHex(myString.substring(9, 13).getBytes()));
}
// Code from https://stackoverflow.com/a/9855338/4019986
private final static char[] hexArray = "0123456789ABCDEF".toCharArray();
public static String bytesToHex(byte[] bytes) {
char[] hexChars = new char[bytes.length * 2];
for ( int j = 0; j < bytes.length; j++ ) {
int v = bytes[j] & 0xFF;
hexChars[j * 2] = hexArray[v >>> 4];
hexChars[j * 2 + 1] = hexArray[v & 0x0F];
}
return new String(hexChars);
}
}
The output of this program on my Windows box is:
String: I have a 🙂 in my string
Bytes: 492068617665206120F09F998220696E206D7920737472696E67
String Length: 26
Byte Length: 26
Substring 9 - 13: 🙂
Substring Bytes: F09F9982
The output on my CentOS 7 machine is:
String: I have a 🙂 in my string
Bytes: 492068617665206120F09F998220696E206D7920737472696E67
String Length: 24
Byte Length: 26
Substring 9 - 13: 🙂 i
Substring Bytes: F09F99822069
I ran both with Java 1.8. Same byte length, different String length. Why?
UPDATE
By replacing the "🙂" in the string with "\uD83D\uDE42", I get the following results:
Windows:
String: I have a ? in my string
Bytes: 4920686176652061203F20696E206D7920737472696E67
String Length: 24
Byte Length: 23
Substring 9 - 13: ? i
Substring Bytes: 3F2069
CentOS:
String: I have a 🙂 in my string
Bytes: 492068617665206120F09F998220696E206D7920737472696E67
String Length: 24
Byte Length: 26
Substring 9 - 13: 🙂 i
Substring Bytes: F09F99822069
Why "\uD83D\uDE42" ends up being encoded as 0x3F on the Windows machine is beyond me...
Java Versions:
Windows:
java version "1.8.0_211"
Java(TM) SE Runtime Environment (build 1.8.0_211-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.211-b12, mixed mode)
CentOS:
openjdk version "1.8.0_201"
OpenJDK Runtime Environment (build 1.8.0_201-b09)
OpenJDK 64-Bit Server VM (build 25.201-b09, mixed mode)
Update 2
Using .getBytes("utf-8"), with the "🙂" embedded in the string literal, here are the outputs.
Windows:
String: I have a 🙂 in my string
Bytes: 492068617665206120C3B0C5B8E284A2E2809A20696E206D7920737472696E67
String Length: 26
Byte Length: 32
Substring 9 - 13: 🙂
Substring Bytes: C3B0C5B8E284A2E2809A
CentOS:
String: I have a 🙂 in my string
Bytes: 492068617665206120F09F998220696E206D7920737472696E67
String Length: 24
Byte Length: 26
Substring 9 - 13: 🙂 i
Substring Bytes: F09F99822069
So yes it appears to be a difference in system encoding. But then that means string literals are encoded differently on different platforms? That sounds like it could be problematic in certain situations.
Also... where is the byte sequence C3B0C5B8E284A2E2809A coming from to represent the smiley in Windows? That doesn't make sense to me.
For completeness, using .getBytes("utf-16"), with the "🙂" embedded in the string literal, here are the outputs.
Windows:
String: I have a 🙂 in my string
Bytes: FEFF00490020006800610076006500200061002000F001782122201A00200069006E0020006D007900200073007400720069006E0067
String Length: 26
Byte Length: 54
Substring 9 - 13: 🙂
Substring Bytes: FEFF00F001782122201A
CentOS:
String: I have a 🙂 in my string
Bytes: FEFF004900200068006100760065002000610020D83DDE4200200069006E0020006D007900200073007400720069006E0067
String Length: 24
Byte Length: 50
Substring 9 - 13: 🙂 i
Substring Bytes: FEFFD83DDE4200200069
You have to be careful about specifying the encodings:
when you compile the Java file, the compiler uses some encoding to read the source file. My guess is that this already broke your original String literal at compile time. This can be fixed by using the escape sequence.
after you use the escape sequence, the String lengths are the same. The bytes inside the String are also the same, but what you are printing out does not show that.
the bytes printed are different because you called getBytes(), and that again uses the environment- or platform-specific default encoding. So the output was also broken (unencodable smileys were replaced with a question mark). You need to call getBytes("UTF-8") to be platform-independent, as sketched below.
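For illustration, here is a minimal sketch (class name is mine) of the platform-independent version: the emoji is written as an escape sequence so the source file is pure ASCII, and the charset is named explicitly. Compiling with javac -encoding UTF-8 achieves the same for literals you cannot escape.

import java.nio.charset.StandardCharsets;

public class PortableEncoding {
    public static void main(String[] args) {
        // "\uD83D\uDE42" is the UTF-16 surrogate pair for U+1F642 (the smiley),
        // so the literal itself survives any compiler encoding
        String myString = "I have a \uD83D\uDE42 in my string";

        byte[] utf8 = myString.getBytes(StandardCharsets.UTF_8); // explicit charset, never platform-dependent

        System.out.println("String Length: " + myString.length()); // 24 on every platform
        System.out.println("Byte Length: " + utf8.length);         // 26 on every platform
    }
}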
So to answer the specific questions posed:
Same byte length, different String length. Why?
Because the string literal is being encoded by the Java compiler, and the Java compiler often uses a different default encoding on different systems. This may result in a different number of UTF-16 code units per Unicode character, which results in a different string length. Passing the -encoding command line option with the same value on both platforms will make them encode consistently.
Why "\uD83D\uDE42" ends up being encoded as 0x3F on the Windows machine is beyond me...
It's not encoded as 0x3F in the string. 0x3F is the question mark. Java puts it in when it is asked to output characters that the target encoding cannot represent, via System.out.println or getBytes(); that was the case here, because the default Windows encoding cannot represent the emoji, so both printing it to the console and calling getBytes() on it produced the replacement character.
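As a quick sketch of that substitution (not from the original post): encoding the escaped smiley with a charset that cannot represent it yields exactly one replacement byte.

import java.nio.charset.Charset;

public class ReplacementChar {
    public static void main(String[] args) {
        // U+1F642 cannot be represented in windows-1252, so the encoder
        // substitutes a single '?' byte for the whole code point
        byte[] b = "\uD83D\uDE42".getBytes(Charset.forName("windows-1252"));
        for (byte x : b) {
            System.out.printf("%02X ", x); // prints: 3F
        }
        System.out.println();
    }
}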
But then that means string literals are encoded differently on different platforms?
By default, yes.
Also... where is the byte sequence C3B0C5B8E284A2E2809A coming from to represent the smiley in Windows?
This is quite convoluted. The "🙂" character (Unicode code point U+1F642) is stored in the Java source file with UTF-8 encoding using the byte sequence F0 9F 99 82. The Java compiler then reads the source file using the platform default encoding, Cp1252 (Windows-1252), so it treats these UTF-8 bytes as though they were Cp1252 characters, making a 4-character string by translating each byte from Cp1252 to Unicode, resulting in U+00F0 U+0178 U+2122 U+201A. The getBytes("utf-8") call then converts this 4-character string into bytes by encoding them as utf-8. Since every character of the string is higher than hex 7F, each character is converted into 2 or more UTF-8 bytes; hence the resulting string being this long. The value of this string is not significant; it's just the result of using an incorrect encoding.
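That chain can be reproduced directly; here is a sketch (class name is mine) that mis-decodes the real UTF-8 bytes of the smiley as windows-1252 and re-encodes the result as UTF-8:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Mojibake {
    public static void main(String[] args) {
        // the genuine UTF-8 encoding of U+1F642 as stored in the source file
        byte[] utf8Smiley = {(byte) 0xF0, (byte) 0x9F, (byte) 0x99, (byte) 0x82};

        // what the compiler did: read those bytes as if they were Cp1252 text
        String misread = new String(utf8Smiley, Charset.forName("windows-1252"));
        // misread is now the 4-character string U+00F0 U+0178 U+2122 U+201A

        // what getBytes("utf-8") then produced
        StringBuilder hex = new StringBuilder();
        for (byte b : misread.getBytes(StandardCharsets.UTF_8)) {
            hex.append(String.format("%02X", b));
        }
        System.out.println(hex); // C3B0C5B8E284A2E2809A
    }
}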
You didn't take into account that getBytes() returns the bytes in the platform's default encoding. That default differs between Windows and CentOS.
See also How to Find the Default Charset/Encoding in Java? and the API documentation on String.getBytes().
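A one-class sketch (name is mine) to check which default each machine actually uses:

import java.nio.charset.Charset;

public class DefaultCharset {
    public static void main(String[] args) {
        // this is the charset a bare getBytes() call will use
        System.out.println(Charset.defaultCharset()); // e.g. windows-1252 vs UTF-8
    }
}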
I was a Python programmer (and of course I still am), so I am familiar with encoding and decoding in Python.
I was surprised by the fact that Java can encode a String variable twice in a row.
Here is some example code:
import java.net.URLEncoder;
public class OpenAPITest {
public static void main(String[] arg) throws Exception {
String str = "안녕"; // Korean
String utfStr = URLEncoder.encode(str, "UTF-8");
System.out.println(utfStr);
String ms949Str = URLEncoder.encode(utfStr, "MS949");
System.out.println(ms949Str);
}
}
I wonder how it can encode a string twice.
In Python 3.x, once you encode a 'str' (which holds a Unicode string), it is converted to 'bytes' (a byte string), and 'bytes' only has a decode() method.
Additionally, I want to get the same string value in Python 3 as the value of ms949Str in my example code. Please give me some advice. Thanks.
I don't know Python, and you didn't say which Python method you were using anyway, but if that Python method converted a Python string into a UTF-8 sequence of bytes, then you're using the wrong conversion method here, because that has nothing to do with URL encoding.
str.getBytes("UTF-8") will return a byte[] with the Java string encoded in UTF-8.
new String(bytes, "UTF-8") will decode the byte array.
URL encoding is about converting text into a string that is valid as a component of a full URL, meaning that all special characters must be encoded using %NN escapes. Non-ASCII characters have to be encoded too.
As an example, take the string Test & gehört. When URL Encoded, it becomes the following string:
Test+%26+geh%C3%B6rt
The string Test & gehört becomes the following sequence of bytes (displayed in hex) when used with getBytes:
54 65 73 74 20 26 20 67 65 68 c3 b6 72 74
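To make the difference concrete, here is a small sketch (class name is mine; the ö is escaped to avoid the source-encoding pitfalls discussed elsewhere on this page) that prints both forms of that example string:

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class UrlVsBytes {
    public static void main(String[] args) throws Exception {
        String s = "Test & geh\u00F6rt"; // "Test & gehört"

        // URL encoding: a String that is safe to embed in a URL
        System.out.println(URLEncoder.encode(s, "UTF-8")); // Test+%26+geh%C3%B6rt

        // character encoding: raw bytes, not URL-safe text
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02x ", b); // 54 65 73 74 20 26 20 67 65 68 c3 b6 72 74
        }
        System.out.println();
    }
}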
I understand that Java's internal representation for String is UTF-16 (see "What is java string representation?").
Also, I know that in a UTF-16 String, each 'character' is encoded with one or two 16-bit code units.
However, when I debug the following java code
String hello = "Hello";
the variable hello is shown as an array of 5 byte values: 0x48, 0x65, 0x6C, 0x6C, 0x6F,
which is ASCII for "Hello".
How can this be?
I took a gcore dump of a mini java process with this code:
class Hi {
public static void main(String args[]) {
String hello = "Hello";
try {
Thread.sleep(60_000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
And did a gcore memory dump on Ubuntu (using jps to get the PID and passing that to gcore).
I found this: 48 65 6C 6C 6F in the dump using a hex editor, so it is somewhere in memory as ASCII.
But I also found 48 00 65 00 6C 00 6C, which is part of the UTF-16 (little-endian) representation of the String.
The internal representation of String is not specified; it is an implementation detail, so you cannot rely on it. It is very likely that in JDK 9 it will change to a dual representation (Latin-1 for strings that can be encoded in Latin-1, UTF-16 for everything else). See JEP 254 for details. This feature is already integrated in the OpenJDK master codebase, so if you are using Java 9 early-access builds, the string above will actually take 5 bytes.
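Whatever the JVM stores internally, the public String API always exposes UTF-16 code units, so a sketch like this (name is mine) prints the same values on every Java version:

public class CodeUnits {
    public static void main(String[] args) {
        String hello = "Hello";
        for (int i = 0; i < hello.length(); i++) {
            // charAt() returns UTF-16 code units regardless of internal storage
            System.out.printf("%04X ", (int) hello.charAt(i)); // 0048 0065 006C 006C 006F
        }
        System.out.println();
    }
}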
I'm trying to encode some Chinese characters using the GB18030 code page in Java, and I ran into this character 数, which translates to "Number" in Google Translate.
The issue is, it's turning into 10 bytes (!) when encoded:
81 30 81 34 81 30 83 31 ca fd
import java.math.BigInteger;
import java.nio.charset.Charset;
public class Test3
{
public static void main(String[] args)
{
String s = new String("数");
System.out.println( "source file: "+String.format("%x ",
new BigInteger(1, s.getBytes(Charset.forName("GB18030"))) ));
}
}
When I try to decode those bytes using GB18030, I get ? characters appearing beside the Chinese character (??数). When I decode only "CA FD", the last two bytes from above, it correctly decodes to the character.
Google Translate notes the above character is Simplified Chinese. My source file is also saved as UTF-8.
I thought GB18030 had a maximum of 4 bytes per character? Is there any particular reason this character behaves so strangely? (I'm not Chinese, by the way.)
The most likely things are either:
There's an issue with the encoding of your source file, or
You have "invisible" characters prior to the 数 in it.
You can check both of those by completely deleting the string literal on this line:
String s = new String("数");
so it looks like this (note I removed the quotes as well as the character):
String s = new String();
and then adding back "\u6570" to get this:
String s = new String("\u6570");
and seeing if your output changes (as 数 is Unicode code point U+6570 and so that escape sequence should be the same character). If it changes, either there's an encoding problem or you had invisible characters in the string prior to the character. You can probably differentiate the two cases by then adding back just that character (via copy and paste from this page rather than your previous source code). If the problem reappears, it's an encoding issue. If not, you had hidden characters.
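Here is a sketch of that check (class name is mine) with only the escape sequence in the source; with a clean source file, the full literal should print the same two bytes:

import java.nio.charset.Charset;

public class Gb18030Check {
    public static void main(String[] args) {
        String s = "\u6570"; // 数 (U+6570)
        for (byte b : s.getBytes(Charset.forName("GB18030"))) {
            System.out.printf("%02X ", b); // expected: CA FD
        }
        System.out.println();
    }
}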
We have a process which communicates with an external system via MQ. The external system runs on a mainframe machine (IBM z/OS), while our process runs on a CentOS Linux platform. So far we have never had any issues.
Recently we started receiving messages from them with non-printable EBCDIC characters embedded in the message. They use those characters as a compressed ID, 8 bytes long. When we receive it, it arrives on our queue encoded as UTF-8 (CCSID 1208).
They need the original 8 bytes back in order to identify our response messages. I'm trying to find a way in Java to convert the ID back from UTF-8 to EBCDIC before sending the response.
I've been playing around with the JTOpen library, using the AS400Text class to do the conversion. The counterparty has also sent us a snapshot of the ID in bytes. However, when I compare the bytes after conversion, they are different from the original message.
Has anyone ever encountered this issue? Maybe I'm using the wrong code page?
Thanks for any input you may have.
Bytes from counterparty (positions [5,14]):
00000 F0 40 D9 F0 F3 F0 CB 56--EF 80 04 C9 10 2E C4 D4 |0 R030.....I..DM|
Program output:
UTF String: [R030ôîÕ؜IDMDHP1027W 0510]
EBCDIC String: [R030ôîÃÃÂIDMDHP1027W 0510]
NATIVE CHARSET - HEX: [52303330C3B4C3AEC395C398C29C491006444D44485031303237572030353130]
CP500 CHARSET - HEX: [D9F0F3F066BE66AF663F663F623FC9102EC4D4C4C8D7F1F0F2F7E640F0F5F1F0]
Here is some sample code:
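// Note: this uses the IBM MQ classes for Java (com.ibm.mq.MQMessage) and JTOpen
// (com.ibm.as400.access.AS400Text); toHexString(...) is a helper defined elsewhere (not shown).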
private void readAndPrint(MQMessage mqMessage) throws IOException {
mqMessage.seek(150);
byte[] subStringBytes = new byte[32];
mqMessage.readFully(subStringBytes);
String msgId = toHexString(mqMessage.messageId).toUpperCase();
System.out.println("----------------------------------------------------------------");
System.out.println("MESSAGE_ID: " + msgId);
String hexString = toHexString(subStringBytes).toUpperCase();
String subStr = new String(subStringBytes);
System.out.println("NATIVE CHARSET - HEX: [" + hexString + "] [" + subStr + "]");
// Transform to EBCDIC
int codePageNumber = 37;
String codePage = "CP037";
AS400Text converter = new AS400Text(subStr.length(), codePageNumber);
byte[] bytesData = converter.toBytes(subStr);
String resultedEbcdicText = new String(bytesData, codePage);
String hexStringEbcdic = toHexString(bytesData).toUpperCase();
System.out.println("CP500 CHARSET - HEX: [" + hexStringEbcdic + "] [" + resultedEbcdicText + "]");
System.out.println("----------------------------------------------------------------");
}
If an MQ message has varying sub-message fields that require different encodings, then that's how you should handle those messages, i.e., as separate message pieces.
But as you describe this, the entire message needs to be received without conversion. The first eight bytes need to be extracted and held separately. The remainder of the message can then have its encoding converted (unless other sub-fields also need to be extracted as binary, unconverted bytes).
For any return message, the opposite conversion must be done. The text portion of the message can be converted, and then that sub-string can have the original eight bytes prepended to it. The newly reconstructed message then can be sent back through the queue, again without automatic conversion.
Your partner on the other end is not using the messaging product correctly. (Of course, you probably shouldn't say that out loud.) There should be no part of such a message that cannot automatically survive intact in both directions. Instead of an 8-byte binary field, the ID should be represented as something like a 16-character hex representation of the 8-byte value, to give one example method. In hex there would be no conversion problem in either direction along the route.
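For illustration, a sketch of that hex idea (the class and method names are mine, and the 8 example bytes are taken from the counterparty snapshot above, assuming those are the ID):

public class HexId {
    // turn the opaque 8-byte ID into a 16-character hex string that survives any codepage conversion
    static String toHex(byte[] id) {
        StringBuilder sb = new StringBuilder();
        for (byte b : id) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    // parse the hex string back into the original 8 bytes on receipt
    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(i * 2, i * 2 + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] id = {(byte) 0xCB, 0x56, (byte) 0xEF, (byte) 0x80, 0x04, (byte) 0xC9, 0x10, 0x2E};
        String hex = toHex(id);
        System.out.println(hex);                                        // CB56EF8004C9102E
        System.out.println(java.util.Arrays.equals(id, fromHex(hex)));  // true
    }
}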
It seems to me that the special 8 bytes are not actually EBCDIC characters but simply 8 bytes of binary data. If that is the case, then I believe, as mentioned in another answer, that you should handle those 8 bytes separately, without letting them be converted to UTF-8 and then back to EBCDIC for further processing.
Depending on the EBCDIC variant you are using, it is quite possible that a byte in EBCDIC does not convert to a meaningful UTF-8 character, and hence you will fail to recover the original byte by converting the received UTF-8 character back to EBCDIC.
A brief search on Google gives several EBCDIC tables (e.g. http://www.simotime.com/asc2ebc1.htm#AscEbcTables). You can see there are lots of values in EBCDIC that have no character assigned. Hence, when they are converted to UTF-8, you cannot assume each of them will map to a distinct Unicode character. Your proposed way of processing is therefore very dangerous and error-prone.
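As a sketch of that danger (using the same assumed 8 ID bytes as above): treating the raw bytes as text and round-tripping them through a String does not give the original bytes back.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTrip {
    public static void main(String[] args) {
        byte[] original = {(byte) 0xCB, 0x56, (byte) 0xEF, (byte) 0x80, 0x04, (byte) 0xC9, 0x10, 0x2E};

        // pretend the binary ID is text: byte sequences that are not valid UTF-8
        // are silently replaced with U+FFFD during decoding
        String asText = new String(original, StandardCharsets.UTF_8);
        byte[] roundTripped = asText.getBytes(StandardCharsets.UTF_8);

        System.out.println(Arrays.equals(original, roundTripped)); // false - the ID is destroyed
    }
}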