I have a ""binary"" string String s1 = "10011000" and want to print the corresponding Character (Ф) of this Byte, how can I make this?
I have read and tested so many solutions and tutorials...and can't find exactly what I want! Moreover, I think therected is an encoding problem.
For example, this code doesn't work, but why (I have "?" in output, so encoding problem?)?
int j = Integer.parseInt("10011000", 2);
System.out.println(new Character ((char)j));
10011000 is unicode code point 152 which is an extended unicode character which will only appear if its encoding is supported by your console
The character Ф is a Cyrillic capital letter; in Unicode, the hexadecimal value is \u0424. The binary string you are trying to parse is 152 decimal. The binary string for \u0424 is 010000100100 (1060 decimal) and so I would fix that first. And as others noted, until your environment character set supports Unicode output, Java will substitute a "?" character for any character that the current character set doesn't support. See Unicode characters in Eclipse for setting up Eclipse console to Unicode.
You have used wrong code. If you want to see in output Ф you need to change your code into this:
int j = Integer.parseInt("10000100100", 2);
System.out.println((char) j);
Related
While programming I encountered a weird behavior of Strings that are converted to bytes and then back to Strings again. Some chars are converted wrongly, and therefore the hashCode of the String is also changed. The length of the Strings remain the same.
The problem seems to occur with chars from 55296 - 57343 (U+D800 to U+DFFF). Other chars work fine. Is it because they are surrogates?
String string = new String(new char[] { 56000 });
System.out.println((int)string.charAt(0));
System.out.println((int)new String(string.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8).charAt(0));
The console output is:
56000
63
What is going on here? Is this a java bug, or am I misunderstanding something?
That's because these values are not characters but surrogates. Two of these values form a surrogate pair that in turn represents one character. If you have just one low or high surrogate value this is an invalid encoding and not a character.
Since this is an invalid encoding, it is replaced by a "?" character when you convert it to UTF-8.
You can read more about it for example here https://en.wikipedia.org/wiki/UTF-16
I want to convert decimal to ascii and this is the code returns the unexpected results. Here is the code I am using.
public static void main(String[] args) {
char ret= (char)146;
System.out.println(ret);// returns nothing.
I expect to get character single "'" as per http://www.ascii-code.com/
Anyone came across this? Thanks.
So, a couple of things.
First of all the page you linked to says this about the code point range in question:
The extended ASCII codes (character code 128-255)
There are several different variations of the 8-bit ASCII table. The table below is according to ISO 8859-1, also called ISO Latin-1. Codes 128-159 contain the Microsoft® Windows Latin-1 extended characters.
This is incorrect, or at least, to me, misleadingly worded. ISO 8859-1 / Latin-1 does not define code point 146 (and another reference just because). So that's already asking for trouble. You can see this also if you do the conversion through String:
String s = new String(new byte[] {(byte)146}, "iso-8859-1");
System.out.println(s);
Outputs the same "unexpected" result. It appears that what they are actually referring to is the Windows-1252 set (aka "Windows Latin-1", but this name is almost completely obsolete these days), which does define that code point as a right single quote (for other charsets that provide this character at 146 see this list and look for encodings that provide it at 0x92), and we can verify this as such:
String s = new String(new byte[] {(byte)146}, "windows-1252");
System.out.println(s);
So the first mistake is that page is confusing.
But the big mistake is you can't do what you're trying to do in the way you are doing it. A char in Java is a UTF-16 code point (or half of one, if you're representing the supplementary characters > 0xFFFF, a single char corresponds to a BMP point, a pair of them or an int corresponds to the full range, including the supplementary ones).
Unfortunately, Java doesn't really expose a lot of API for single-character conversions. Even Character doesn't have any readily available ways to convert from the charset of your choice to UTF-16.
So one option is to do it via String as hinted at in the examples above, e.g. express your code points as a raw byte[] array and convert from there:
String s = new String(new byte[] {(byte)146}, "windows-1252");
System.out.println(s);
char c = s.charAt(0);
System.out.println(c);
You could grab the char again via s.charAt(0). Note that you have to be mindful of your character set when doing this. Here we know that our byte sequence is valid for the specified encoding, and we know that the result is only one char long, so we can do this.
However, you have to watch out for things in the general case. For example, perhaps your byte sequence and character set yield a result that is in the UTF-16 supplementary character range. In that case s.charAt(0) would not be sufficient and s.codePointAt(0) stored in an int would be required instead.
As an alternative, with the same caveats, you could use Charset to decode, although it's just as clunky, e.g.:
Charset cs = Charset.forName("windows-1252");
CharBuffer cb = cs.decode(ByteBuffer.wrap(new byte[] {(byte)146}));
char c = cb.get(0);
System.out.println(c);
Note that I am not entirely sure how Charset#decode handles supplementary characters and can't really test right now (but anybody, feel free to chime in).
As an aside: In your case, 146 (0x92) cast directly to char corresponds to the UTF-16 character "PRIVATE USE TWO" (see also), and all bets are off for what you'll end up displaying there. This character is classified by Unicode as a control character, and seems to fall in the range of characters reserved for ANSI terminal control (although AFAIK isn't actually used, but it's in that range regardless). I wouldn't be surprised if perhaps browsers in some locales rendered it as a right-single-quote for compatibility, but terminals did something weird with it.
Also, fyi, the official UTF-16 code point for right single quote is 0x2019. You could reliably store that in a char by using that value, e.g.:
System.out.println((char)0x2019);
You can also see this for yourself by looking at the value after the conversion from windows-1252:
String s = new String(new byte[] {(byte)146}, "windows-1252");
char c = s.charAt(0);
System.out.printf("0x%x\n", (int)c); // outputs 0x2019
Or, for completeness:
String s = new String(new byte[] {(byte)146}, "windows-1252");
int cp = s.codePointAt(0);
System.out.printf("0x%x\n", cp); // outputs 0x2019
The page you refer mention that values 160 to 255 correspond to the ISO-8859-1 (aka Latin 1) table; as for values in the range 128 to 159, they are from the Windows specific variant of the Latin 1 (ISO-8859-1 leave that range undefined, to be assigned by operating system).
Java characters are based on UTF16, which is itself based on the Unicode table. If you want to specifically refer to the right quote character, it is you can specify it as '\u2019' in Java (see http://www.fileformat.info/info/unicode/char/2019/index.htm).
So I'm making a basic GUI with the NetBeans IDE (in Java), and I want to make a button with a √ sign in it. It didn't let me copypaste it in, so I tried using its ASCII code - char sqrt = (char) 251. Instead of the square root sign, however, it gave me "û", and I have no idea why. Can someone please explain why this is happening, as well as offer an idea as to how I should go about this?
Java characters are Unicode, not ASCII. Unicode codepoint 251 (U+00FB) is "Latin Small Letter U with Circumflex". To make input of various Unicode characters using a character set with only the basic ASCII symbols, Java provides a way to input Unicode characters using a literal format. So, you can do this:
char sqrt = '\u221a';
since U+221A is the Unicode codepoint for the square root symbol.
This \uXXXX format can also be used in String literals:
String s = "The square root of 2 (\u221a2) is approximately 1.4142";
If you print that String, you will see
The square root of 2 (√2) is 1.4142
Java uses Unicode, and the Unicode value for '√' is 8730. So, this should do it:
char sqrt = 8730;
I have a Java String like this: "peque\u00f1o". Note that it has an embedded Unicode character: '\u00f1'.
Is there a method in Java that will replace these Unicode character sequences with the actual characters? That is, a method that would return "pequeño" if you gave it "peque\u00f1o" as input?
Note that I have a string that has 12 chars (those that we see, that happen to be in the ASCII range).
Actually the string is "pequeño".
String s = "peque\u00f1o";
System.out.println(s.length());
System.out.println(s);
yields
7
pequeño
i.e. seven chars and the correct representation on System.out.
I remember giving the same response last week, use org.apache.commons.lang.StringEscapeUtils.
If you have the appropriate fonts, a println or setting the string in a JLabel or JTextArea should do the trick. The escaping is only for the compiler.
If you plan to copy-paste the readable strings in source, remember to also choose a suitable file encoding like UTF8.
I am trying to parse some word documents in java. Some of the values are things like a date range and instead of showing up like Startdate - endDate I am getting some funky characters like so
StartDate ΓÇô EndDate
This is where word puts in a special character hypen. Can you search for these characters and replace them with a regular - or something int he string so that I can then tokenize on a "-" and what is that character - ascii? unicode or what?
Edited to add some code:
String projDateString = "08/2010 ΓÇô Present"
Charset charset = Charset.forName("Cp1252");
CharsetDecoder decoder = charset.newDecoder();
ByteBuffer buf = ByteBuffer.wrap(projDateString.getBytes("Cp1252"));
CharBuffer cbuf = decoder.decode(buf);
String s = cbuf.toString();
println ("S: " + s)
println("projDatestring: " + projDateString)
Outputs the following:
S: 08/2010 ΓÇô Present
projDatestring: 08/2010 ΓÇô Present
Also, using the same projDateString above, if I do:
projDateString.replaceAll("\u0096", "\u2013");
projDateString.replaceAll("\u0097", "\u2014");
and then print out projDateString, it still prints as
projDatestring: 08/2010 ΓÇô Present
You are probably getting Windows-1252 which is a character set, not an encoding. (Torgamus - Googling for Windows-1232 didn't give me anything.)
Windows-1252, formerly "Cp1252" is almost Unicode, but keeps some characters that came from Cp1252 in their same places. The En Dash is character 150 (0x96) which falls within the Unicode C1 reserved control character range and shouldn't be there.
You can search for char 150 and replace it with \u2013 which is the proper Unicode code point for En Dash.
There are quite a few other character that MS has in the 0x80 to 0x9f range, which is reserved in the Unicode standard, including Em Dash, bullets, and their "smart" quotes.
Edit: By the way, Java uses Unicode code point values for characters internally. UTF-8 is an encoding, which Java uses as the default encoding when writing Strings to files or network connections.
Say you have
String stuff = MSWordUtil.getNextChunkOfText();
Where MSWordUtil would be something that you've written to somehow get pieces of an MS-Word .doc file. It might boil down to
File myDocFile = new File(pathAndFileFromUser);
InputStream input = new FileInputStream(myDocFile);
// and then start reading chunks of the file
By default, as you read byte buffers from the file and make Strings out of them, Java will treat it as UTF-8 encoded text. There are ways, as Lord Torgamus says, to tell what encoding should be used, but without doing that Windows-1252 is pretty close to UTF-8, except there are those pesky characters that are in the C1 control range.
After getting some String like stuff above, you won't find \u2013 or \u2014 in it, you'll find 0x96 and 0x97 instead.
At that point you should be able to do
stuff.replaceAll("\u0096", "\u2013");
I don't do that in my code where I've had to deal with this issue. I loop through an input CharSequence one char at a time, decide based on 0x80 <= charValue <= 0x9f if it has to be replaced, and look up in an array what to replace it with. The above replaceAll() is far easier if all you care about is the 1252 En Dash vs. the Unicode En Dash.
s = s.replace( (char)145, (char)'\'');
s = s.replace( (char)8216, (char)'\''); // left single quote
s = s.replace( (char)146, (char)'\'');
s = s.replace( (char)8217, (char)'\''); // right single quote
s = s.replace( (char)147, (char)'\"');
s = s.replace( (char)148, (char)'\"');
s = s.replace( (char)8220, (char)'\"'); // left double
s = s.replace( (char)8221, (char)'\"'); // right double
s = s.replace( (char)8211, (char)'-' ); // em dash??
s = s.replace( (char)150, (char)'-' );
http://www.coderanch.com/how-to/java/WeirdWordCharacters
Your problem almost certainly has to do with your encoding scheme not matching the encoding scheme Word saves in. Your code is probably using the Java default, likely UTF-8 if you haven't done anything to it. Your input, on the other hand, is likely Windows-1252, the default for Microsoft Word's .doc documents. See this site for more info. Notably,
Within Windows, ISO-8859-1 is replaced by Windows-1252, which often means that text copied from, say, a Microsoft Word document and pasted straight into a web page produces HTML validation errors.
So what does this mean for you? You'll have to tell your program that the input is using Windows-1252 encoding, and convert it to UTF-8. You can do this in varying flavors of "manually." Probably the most natural way is to take advantage of Java's built-in Charset class.
Windows-1252 is recognized by the IANA Charset Registry
Name: windows-1252
MIBenum: 2252
Source: Microsoft (http://www.iana.org/assignments/charset-reg/windows-1252) [Wendt]
Alias: None
so you it should be Charset-compatible. I haven't done this before myself, so I can't give you a code sample, but I will point out that there is a String constructor that takes a byte[] and a Charset as arguments.
Probably, that character is an en dash, and the strange blurb you see is due to a difference between the way Word encodes that character and the way that character is decoded by whatever (other) system you are using to display it.
If I remember correctly from when I did some work on character encodings in Java, String instances always internally use UTF-8; so, within such an instance, you may search and replace a single character by its Unicode form. For example, let's say you would like to substitute smart quotes with plain double quotes: given a String s, you may write
s = s.replace('\u201c', '"');
s = s.replace('\u201d', '"');
where 201c and 201d are the Unicode code points for the opening and closing smart quotes. According to the link above on Wikipedia, the Unicode code point for the en dash is 2013.