Java String internal representation

I understand that Java's internal representation of a String is UTF-16, and I know that in a UTF-16 string each 'character' is encoded with one or two 16-bit code units.
However, when I debug the following Java code
String hello = "Hello";
the variable hello shows up as an array of 5 values: 72, 101, 108, 108, 111 (hex 48 65 6C 6C 6F),
which is ASCII for "Hello".
How can this be?

I took a gcore dump of a minimal Java process running this code:
class Hi {
    public static void main(String[] args) {
        String hello = "Hello";
        try {
            Thread.sleep(60_000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
I took the memory dump with gcore on Ubuntu (using jps to get the PID and passing it to gcore).
I found 48 65 6C 6C 6F in the dump using a hex editor, so the string is somewhere in memory as one byte per character.
But I also found 48 00 65 00 6C 00 6C, which is part of the UTF-16 representation of the string.

The internal representation of String is not specified; it is an implementation detail, so you cannot rely on it. It's very likely that in JDK 9 it will change to a dual encoding (Latin-1 for strings that can be encoded in Latin-1, UTF-16 for all other strings). See JEP 254 (Compact Strings) for details. This feature is already integrated in the OpenJDK master codebase, so if you are using a Java 9 early-access build, the string will actually occupy 5 bytes.
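You can see both byte patterns from your dump by encoding the string explicitly. This is only an illustration of the two encodings involved; it is not how the JVM lays out the String field internally:

```java
import java.nio.charset.StandardCharsets;

public class HelloBytes {
    public static void main(String[] args) {
        String hello = "Hello";
        // One byte per character -- matches the 48 65 6C 6C 6F found in the dump
        byte[] latin1 = hello.getBytes(StandardCharsets.ISO_8859_1);
        // Two bytes per character, little-endian -- matches 48 00 65 00 6C 00 6C ...
        byte[] utf16le = hello.getBytes(StandardCharsets.UTF_16LE);
        System.out.println(latin1.length);   // 5
        System.out.println(utf16le.length);  // 10
    }
}
```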

Print albanian characters in cmd using Java program

I have a string "Përshëndetje botë!" in a .java file and I am trying to print it using System.out.println(). The file is in ISO-8859-1 encoding. In cmd I do
chcp 28591
to change the encoding to ISO-8859-1 (per the list).
Then I compile a .java file using
javac -encoding ISO-8859-1 C:\...\Hello.java
and run it using
java -Dfile.encoding=ISO-8859-1 packagename.Hello
In this case the ë are replaced with spaces. I also tried
java -Dfile.encoding=ISO88591 packagename.Hello
and the ë were replaced with wrong foreign symbols.
How would I get it running?
Actual answer
Per the OP's comment, the actual issue was that the font cmd was using didn't have the relevant symbols.
Original post
I'm posting this as an answer because what I want to say is too long for a comment. :)
First, please edit your question to include a minimal example of the printing code. For example, if you could write a separate Java program that did nothing but print the message, that would be much easier to debug. (Maybe packagename.Hello is such an example, but I can't tell.)
Second, please try the below, and edit your question to include the results of each step.
Check the actual bytes in your source file to confirm its encoding, then edit your question to include that information. You can use, e.g., the FileFormat.info hex dumper (I am not affiliated with it). For example, here is the output for your string, pasted into a UTF-8 text file:
file name: foo.txt
mime type:
0000-0010: 50 c3 ab 72-73 68 c3 ab-6e 64 65 74-6a 65 20 62 P..rsh.. ndetje.b
0000-0017: 6f 74 c3 ab-21 0d 0a ot..!..
^^ ^^ ^^
Note, at the ^^ markers, that ë in UTF-8 is 0xc3 0xab.
By contrast, in ISO 8859-1 (aka "latin1" in vim), the same text is:
file name: foo.txt
mime type:
0000-0010: 50 eb 72 73-68 eb 6e 64-65 74 6a 65-20 62 6f 74 P.rsh.nd etje.bot
0000-0014: eb 21 0d 0a .!..
^^ ^
Note that ë is now 0xeb.
Try running your command as java packagename.Hello, without any -D option. In my answer that you read, the -D option to java was not necessary.
Try code page 1250, as in the earlier question.
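One more thing you could try: take file.encoding out of the picture entirely by giving the output stream an explicit charset. This is just a sketch (ISO-8859-1 is the encoding under discussion, not a recommendation):

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Hello {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Force the output encoding instead of relying on the -Dfile.encoding default
        PrintStream out = new PrintStream(System.out, true, "ISO-8859-1");
        out.println("Përshëndetje botë!");
    }
}
```

With this, each ë goes to the console as the single byte 0xEB regardless of the JVM's default charset; whether it displays correctly then depends only on the console's code page and font.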

How is it possible to encode String twice?

I used to be a Python programmer (and still am), so I'm familiar with encoding and decoding in Python.
I was surprised that Java can encode a String variable twice in a row.
This is example code:
import java.net.URLEncoder;

public class OpenAPITest {
    public static void main(String[] arg) throws Exception {
        String str = "안녕"; // Korean
        String utfStr = URLEncoder.encode(str, "UTF-8");
        System.out.println(utfStr);
        String ms949Str = URLEncoder.encode(utfStr, "MS949");
        System.out.println(ms949Str);
    }
}
I wonder how it can encode a string twice.
In Python 3, once you encode a str (which holds a Unicode string), it becomes bytes (a byte string), and bytes only has a decode() method.
Additionally, I'd like to produce the same value as ms949Str from my example code in Python 3. Please give me some advice. Thanks.
I don't know Python (and you didn't say which Python method you were using anyway), but if that method converts a Python string into a UTF-8 sequence of bytes, then you're looking at the wrong conversion here, because that has nothing to do with URL encoding.
str.getBytes("UTF-8") will return a byte[] with the Java string encoded in UTF-8.
new String(bytes, "UTF-8") will decode the byte array.
URL encoding is about converting text into a string that is valid as a component of a full URL, meaning that all special characters must be encoded using %NN escapes. Non-ASCII characters have to be encoded too.
As an example, take the string Test & gehört. When URL Encoded, it becomes the following string:
Test+%26+geh%C3%B6rt
The string Test & gehört becomes the following sequence of bytes (displayed in hex) when used with getBytes:
54 65 73 74 20 26 20 67 65 68 c3 b6 72 74
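The key point is that URLEncoder.encode takes a String and returns a String, so nothing stops you from feeding the result back in. A small sketch of what your double encoding actually produces (the expected values assume the standard UTF-8 byte sequences for 안녕):

```java
import java.net.URLEncoder;

public class DoubleEncode {
    public static void main(String[] args) throws Exception {
        // Each Korean syllable is 3 UTF-8 bytes, each percent-escaped
        String once = URLEncoder.encode("안녕", "UTF-8");
        System.out.println(once);   // %EC%95%88%EB%85%95
        // That result is pure ASCII, so the second pass merely escapes the '%' signs
        String twice = URLEncoder.encode(once, "MS949");
        System.out.println(twice);  // %25EC%2595%2588%25EB%2585%2595
    }
}
```

So the second encode isn't re-encoding the Korean text at all; it treats the already-escaped ASCII string as plain input.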

Write hex numbers stored as strings to hex file directly in Java?

I'm writing a small assembler in Java that converts ARM assembly code to hex file that will be executed by a virtual ARM CPU on FPGA.
For example
Sub R0, r15, R15
Add R2, R0, #5
aDD R3, R0, #12
SUB R7, r3, #9
will be translated to machine (hex) code as
E04F000F
E2802005
E280300C
E2437009
which are stored as strings, line by line, in a string array called output in my code.
So how can I write the machine code to the file as raw bytes, rather than encoded as text?
The goal is that when I open the output file with a tool like a hex editor, it shows exactly this (without newlines, of course):
E0 4F 00 0F E2 80 20 05 E2 80 30 0C E2 43 70 09
Currently I tried:
OutputStream os = new FileOutputStream("hexcode.hex");
for (String s : output) {
    int value = Integer.decode(s);
    os.write(value);
}
os.close();
but it gives me this error:
Exception in thread "main" java.lang.NumberFormatException: For input string: "E04F000F"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.valueOf(Integer.java:740)
at java.lang.Integer.decode(Integer.java:1197)
at assembler.Assembler.main(Assembler.java:220)
Any help? Million thanks!
Use Integer.parseUnsignedInt(s, 16) instead of Integer.decode(s). The 16 tells it the string is hex, and the unsigned variant (Java 8+) is needed because values like E04F000F are larger than Integer.MAX_VALUE, so plain Integer.parseInt(s, 16) would also throw a NumberFormatException.
Also note that OutputStream.write(int) writes only the lowest byte of its argument, so you must write all four bytes of each word.
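Putting both fixes together, a sketch along these lines should produce the byte layout you want (hexcode.hex and the output array are taken from your question; DataOutputStream.writeInt writes the four bytes of each word big-endian, matching the E0 4F 00 0F order):

```java
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class WriteHex {
    public static void main(String[] args) throws IOException {
        String[] output = {"E04F000F", "E2802005", "E280300C", "E2437009"};
        try (DataOutputStream os = new DataOutputStream(new FileOutputStream("hexcode.hex"))) {
            for (String s : output) {
                // parseUnsignedInt handles words with the high bit set (Java 8+)
                int value = Integer.parseUnsignedInt(s, 16);
                os.writeInt(value); // writes all 4 bytes, big-endian
            }
        }
    }
}
```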

ASC Visual Basic for Java

I need a Java function that does the same as the Asc function in Visual Basic. I've looked for it on the internet, but I can't find a solution.
The string whose codes I need to know was created in Visual Basic. It uses ISO 8859-1 and Microsoft Windows Latin-1 characters. The Asc function in Visual Basic knows those codes, but in Java I can't find a function that does the same thing.
I know this much Java:
String myString = "ÅÛ–ßÕÅÝ•ÞÃ";
int first = (int) myString.charAt(0);  // "Å" - VB and Java both return 197
int second = (int) myString.charAt(1); // "Û" - VB and Java both return 219
int third = (int) myString.charAt(2);  // "–" - VB returns 150, Java returns 8211
For the first two characters I have no problem, but the third character is not an ASCII code.
How can I get same codes in VB and Java?
First of all, note that ISO 8859-1 != Windows Latin-1 (see http://en.wikipedia.org/wiki/Windows-1252).
The problem is that Java stores characters as UTF-16, so casting a char to int will generally give you its Unicode code point.
To get the Latin-1 encoding of a char, first convert it to a Latin-1 encoded byte array:
import java.nio.charset.Charset;

public class Encoding {
    public static void main(String[] args) {
        // Cp1252 is Windows codepage 1252
        byte[] bytes = "ÅÛ–ßÕÅÝ•ÞÃ".getBytes(Charset.forName("Cp1252"));
        for (byte b : bytes) {
            System.out.println(b & 255);
        }
    }
}
prints:
197
219
150
223
213
197
221
149
222
195
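If you want a reusable Asc-style helper, something like this sketch should do. The method name asc is mine, and windows-1252 is assumed as the VB codepage; characters that don't exist in that codepage come back as 63, the code for the '?' replacement character:

```java
import java.nio.charset.Charset;

public class Asc {
    // Returns the Windows-1252 code of a single character, like VB's Asc
    static int asc(char c) {
        byte[] b = String.valueOf(c).getBytes(Charset.forName("windows-1252"));
        return b[0] & 0xFF; // mask to get the unsigned 0..255 value
    }

    public static void main(String[] args) {
        System.out.println(asc('Å')); // 197
        System.out.println(asc('–')); // 150, unlike (int) '–' which is 8211
    }
}
```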

How to determine if an InputStream contains JSON data?

How do I check if the data behind a java.io.InputStream (from File, URL, ..) is of type JSON?
Of course, to be complete, the best approach would be to load all of the stream's data and try to validate it as JSON (e.g. checking for the closing bracket }). But since the stream source might be very big (a GeoJSON file with a size of 500 MB), this would eventually end in a burning machine.
To avoid this I wrote a small method that only reads the first character of the InputStream (as UTF-8/16/32) and tests whether it is a {, relying on RFC 4627 (which is referenced and updated by RFC 7159) to determine its JSON-ness:
JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.
And:
Since the first two characters of a JSON text will always be ASCII
characters [RFC0020], it is possible to determine whether an octet
stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
at the pattern of nulls in the first four octets.
00 00 00 xx UTF-32BE
00 xx 00 xx UTF-16BE
xx 00 00 00 UTF-32LE
xx 00 xx 00 UTF-16LE
xx xx xx xx UTF-8
The method is:
public static boolean mightBeJSON(InputStream stream) {
    try {
        byte[] bytes = new byte[1];
        stream.read(bytes);
        if (bytes[0] == 0x7B) {
            return true;
        }
        stream.read(bytes);
        if (bytes[0] == 0x7B) {
            return true;
        }
        stream.read(bytes);
        stream.read(bytes);
        if (bytes[0] == 0x7B) {
            return true;
        }
    } catch (IOException e) {
        // Nothing to do;
    }
    return false;
}
Until now my machine is still not burning, BUT:
Is there anything wrong with this approach/implementation?
May there be any problems in some situations?
Anything to improve?
RFC 7159 states:
8. String and Character Issues
8.1 Character Encoding
JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The
default encoding is UTF-8, and JSON texts that are encoded in UTF-8
are interoperable in the sense that they will be read successfully by
the maximum number of implementations; there are many implementations
that cannot successfully read texts in other encodings (such as UTF-16
and UTF-32).
Implementations MUST NOT add a byte order mark to the beginning of
a JSON text. In the interests of interoperability, implementations
that parse JSON texts MAY ignore the presence of a byte order mark
rather than treating it as an error.
This doesn't answer your question per se, but I hope it helps with your logic.
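One practical note on the quoted encoding table: a JSON text can also begin with [ (an array), which your method would reject. A variant sketch that reads the first four octets in a single call and accepts both openers (mightBeJSON2 is a hypothetical name, and this remains a heuristic, not a validator):

```java
import java.io.IOException;
import java.io.InputStream;

public class JsonSniffer {
    public static boolean mightBeJSON2(InputStream stream) throws IOException {
        byte[] b = new byte[4];
        int n = stream.read(b); // one read of up to 4 octets; n may be -1 or short
        for (int i = 0; i < n; i++) {
            if (b[i] == '{' || b[i] == '[') {
                return true;  // first non-null octet is a JSON opener
            }
            if (b[i] != 0) {
                return false; // some other leading character: not JSON
            }
            // a 0x00 octet is consistent with UTF-16/32 padding; keep scanning
        }
        return false;
    }
}
```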
