How to display string from Latin-Extended-A charset - java

I can't correctly display a string that contains Latin Extended-A characters.
I have tried getting the bytes in different encodings and creating a new String from them in another encoding.
For example, given the string "ăăăăăăăăă", how can I output it correctly?

Java supports Unicode characters. If you have:
String x = "ăăăăăăăăă";
System.out.println(x);
You'll get
ăăăăăăăăă
If you get question marks or odd-looking characters, the problem is most likely not Java or your code but the environment: the console encoding or the fonts on your computer do not support these characters.
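If the terminal itself understands UTF-8 but Java's default encoding is something else, one option is to print through a stream with an explicit encoding. The following is only a sketch: the class name Utf8Console and the helper printedBytes are made up for illustration, and it assumes the terminal is set to UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Utf8Console {
    // Hypothetical helper: returns the bytes a UTF-8 PrintStream writes for the given text.
    static byte[] printedBytes(String text) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            PrintStream out = new PrintStream(sink, true, "UTF-8");
            out.print(text);
            out.flush();
            return sink.toByteArray();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Wrap System.out so characters are encoded as UTF-8 instead of the platform default.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        // Three copies of LATIN SMALL LETTER A WITH BREVE, written as escapes so the
        // example does not depend on the encoding of this source file.
        out.println("\u0103\u0103\u0103");
    }
}
```

Whether the characters then show up correctly still depends on the terminal and its font.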

Related

? is the only output for all unicode above U+0080 in java [duplicate]

When printing certain Unicode characters in Java, we get '?' as output. Why is that, and is there any way to print these characters?
This is my code
String symbol1="\u200d";
StringBuilder strg = new StringBuilder("unicodecharacter");
strg.insert(5,symbol1);
System.out.println("After insertion...");
System.out.println(strg.toString());
Output is
After insertion...
unico?decharacter
Here's a great article, written by Joel Spolsky, on the topic. It won't directly help you solve your problem, but it will help you understand what's going on. It'll also show you how involved the situation really is.
You have a character encoding which doesn't match the character you have or the supported characters on the screen.
I would check which encoding you are using throughout and try to determine whether you are reading, storing and printing the value correctly.
Are you sure which encoding you need? You may need to explicitly encode your output as UTF-8 or ISO 8859-1 if you are dealing with European characters.
Java's default behaviour when decoding an invalid byte sequence is to substitute the replacement character (\uFFFD), which is often rendered as a question mark. Conversely, when a character cannot be represented in the output encoding, the encoder writes a literal '?' in its place.
In your case, the text you're reading is probably not encoded as Unicode but as something else (Windows-1252 and ISO-8859-1 are the most common alternatives if your text is in English).
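Both ways of ending up with a question mark can be reproduced in a few lines. This is a sketch for diagnosis only; the class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;

class ReplacementDemo {
    // Decoding: bytes that are not valid UTF-8 become the replacement character U+FFFD.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Encoding: characters the target charset cannot represent become the byte for '?'.
    static byte[] encodeAsLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // 0xE9 is a complete character in ISO-8859-1 but an incomplete sequence in UTF-8.
        String decoded = decodeAsUtf8(new byte[] {(byte) 0xE9});
        System.out.println(Integer.toHexString(decoded.charAt(0))); // prints fffd

        // ZERO WIDTH JOINER (U+200D) does not exist in ISO-8859-1.
        byte[] encoded = encodeAsLatin1("unico\u200ddecharacter");
        System.out.println(new String(encoded, StandardCharsets.ISO_8859_1)); // prints unico?decharacter
    }
}
```

The second case matches the output in the question: the string itself is intact, and the '?' is introduced only when it is encoded for a console whose charset lacks the character.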
I wrote an open-source library that has a utility for converting any String to a Unicode escape sequence and vice versa. It helps to diagnose such issues. For instance, to print your String you can use something like this:
String str= StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString("\\u0197" +
StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("Test"));
You can read about the library, where to download it and how to use it in the article "Open Source Java library with stack trace filtering, Silent String parsing Unicode converter and Version comparison". See the paragraph "String Unicode converter".

Split combined arabic characters in individual characters

I'm trying to split "combined Arabic characters" (like ﻼ) into the individual characters that compose them (e.g. ﻝ ا). I wasn't able to do this in Java or C#, and I need to split the complete list of such characters.
In C#, I'm trying to take the Unicode character and convert it to Windows-1256, expecting to get 2 or 3 bytes that correspond to the individual characters the combined character uses, but I wasn't able to do this.
string unicodeWord = ((char)sc).ToString();
byte[] arabicBytes = System.Text.Encoding.GetEncoding(1256).GetBytes(unicodeWord);
but the result is always ?.
Can you help me with this? I have no problem using either Java or C#.
Thanks a lot!
string input = "ﻼ";
string normalized = input.Normalize(NormalizationForm.FormKC);
Note that there are different normalization forms with different results; FormKC results in ل and ا
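Since the question also allows Java, the same normalization is available there through java.text.Normalizer. A minimal sketch (the class and method names are only illustrative), using the presentation form U+FEFC written as an escape:

```java
import java.text.Normalizer;

class LigatureSplitter {
    // Compatibility normalization (NFKC) replaces presentation-form ligatures
    // with the ordinary letters they are made of.
    static String split(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FEFC is ARABIC LIGATURE LAM WITH ALEF FINAL FORM.
        String parts = split("\uFEFC");
        for (char c : parts.toCharArray()) {
            // Print each resulting code unit as hex: 644 (lam), then 627 (alef).
            System.out.println(Integer.toHexString(c));
        }
    }
}
```

As with the C# version, the other normalization forms behave differently: NFC and NFD leave the ligature untouched because its decomposition is a compatibility mapping.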

Chinese Strings Handling in Java? [duplicate]

In my assigned project, the original author has written a function:
public String asString() throws DataException
{
if (getData() == null) return null;
CharBuffer charBuf = null;
try
{
charBuf = s_charset.newDecoder().decode(ByteBuffer.wrap(f_data));
}
catch (CharacterCodingException e)
{
throw new DataException("You can't have a string from this ParasolBlob: " + this, e);
}
return charBuf.toString()+"你好";
}
Please note that the constant s_charset is defined as:
private static final Charset s_charset = Charset.forName("UTF-8");
Please also note that I have hard-coded a Chinese string in the return string.
Now when the program flow reaches this method, it will throw the following exception:
java.nio.charset.UnmappableCharacterException: Input length = 2
More interestingly, the hard-coded Chinese string is shown as "??" on the console if I call System.out.println().
I think this problem is quite interesting with regard to localization. I've tried changing the charset to
Charset.forName("GBK");
but that does not seem to be the solution. I have also set the encoding of the Java class file to UTF-8.
Does anyone have experience with this? Could you share some advice? Thanks in advance!
And more interestingly, the hard-coded Chinese strings will be shown as
"??" at the console if I do a System.out.println().
System.out transcodes UTF-16 strings to the default JRE character encoding. If this does not match the encoding used by the device receiving the character data, the output is corrupted. So the console must be set to the right character encoding (UTF-8) to render the Chinese characters properly.
If you are using Eclipse, you can change the console encoding under
Run Configurations -> Common -> Encoding (select UTF-8 from the dropdown)
Java strings are Unicode:
System.out.println("你好");
As Kevin stated, the compiler converts the literals in your source file to UTF-16 (the internal encoding of a Java String) using the encoding it assumes for that file (the javac -encoding option, or the platform default). So when you see "??", it is most likely a simple conversion error.
Now, if you want to convert a plain byte array to a String using a given character encoding, I believe there is a much easier way to do it than using a raw CharsetDecoder:
byte[] bytes = {0x61};
String string = new String(bytes, Charset.forName("UTF-8"));
System.out.println(string);
This will work provided that the byte array really contains a UTF-8 encoded stream of bytes. It should also be without a BOM: Java does not strip it, so the resulting String would start with the invisible character \uFEFF. Make sure that what you are trying to convert does not start with the sequence 0xEF 0xBB 0xBF.
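If the input may or may not carry a BOM, it can be dropped after decoding. This sketch relies on the JDK's UTF-8 decoder passing the BOM through as the character U+FEFF at the start of the string; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class BomStripper {
    // Decodes UTF-8 bytes and removes a leading byte order mark, if present.
    static String decodeUtf8(byte[] bytes) {
        String text = new String(bytes, StandardCharsets.UTF_8);
        // The bytes EF BB BF decode to the single character U+FEFF.
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x61};
        System.out.println(decodeUtf8(withBom)); // prints a
    }
}
```

Note that this constructor never throws on bad input; it silently substitutes replacement characters. The CharsetDecoder in the question is the right tool when malformed bytes should be reported instead.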

string to encode in CodePage 857

I need a function that can print Turkish characters.
public String convert(String input) {
String output = new String(s.getBytes(input), "CodePage-857");
return output;
}
Can anyone show me how to achieve this?
Thank you
Java strings are UTF-16, which covers the Turkish character set. You can display the strings in UTF-8, UTF-16, ISO-8859-9 or the older ISO-8859-3.
What view technology are you working with? It's probably configured for ISO-8859-1, which doesn't support Turkish.
This is how I achieved it.
System.Text.Encoding CP857 = System.Text.Encoding.GetEncoding(857);
return CP857.GetBytes("Text goes here");
Java strings are always stored as UTF-16, so it won't help to create a new string from your input string. If you want to output a string in a different encoding, you need to handle that where it is actually displayed or written. For example, if you want to display it in a JSP page, the page's encoding must be set to code page 857 (Java's name for this charset is IBM857, with the alias Cp857).
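In Java, the point where the encoding matters is the conversion from String to bytes. A sketch, assuming the JDK in use ships the IBM857 charset (it is part of the extended charsets, so the code checks for it first); the class and method names are made up for the example.

```java
import java.nio.charset.Charset;

class Cp857Encoder {
    // Encodes text as code page 857 bytes, e.g. for a device that expects that encoding.
    static byte[] toCp857(String text) {
        return text.getBytes(Charset.forName("IBM857"));
    }

    public static void main(String[] args) {
        if (!Charset.isSupported("IBM857")) {
            System.out.println("IBM857 is not available in this runtime");
            return;
        }
        // Turkish letters g-breve, dotless i and s-cedilla, written as escapes.
        byte[] bytes = toCp857("\u011F\u0131\u015F");
        // Code page 857 is a single-byte encoding, so three letters give three bytes.
        System.out.println(bytes.length);
    }
}
```

The resulting byte array is what should be written to the output stream or device; wrapping it back into a new String only reverses the conversion.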

Replacing Unicode character codes with characters in String in Java

I have a Java String like this: "peque\u00f1o". Note that it has an embedded Unicode character: '\u00f1'.
Is there a method in Java that will replace these Unicode character sequences with the actual characters? That is, a method that would return "pequeño" if you gave it "peque\u00f1o" as input?
Note that I have a string that has 12 chars (those that we see, that happen to be in the ASCII range).
Actually the string is "pequeño".
String s = "peque\u00f1o";
System.out.println(s.length());
System.out.println(s);
yields
7
pequeño
i.e. seven chars and the correct representation on System.out.
I remember giving the same response last week: use org.apache.commons.lang.StringEscapeUtils (its unescapeJava method converts such sequences).
If you have the appropriate fonts, a println or setting the string in a JLabel or JTextArea should do the trick. The escaping is only for the compiler.
If you plan to copy-paste the readable strings into source code, remember to also choose a suitable file encoding such as UTF-8.
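The answers above cover the case where the escape appears in Java source, where the compiler already turns it into ñ. If the backslash and the letter u really are in the string at runtime (for example because the text was read from a file), they can be converted without an extra library. A minimal sketch, with a hypothetical class name, that handles only the four-hex-digit form:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeUnescaper {
    // Matches a literal backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    static String unescape(String input) {
        Matcher matcher = ESCAPE.matcher(input);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            // Parse the hex digits and turn them into the corresponding UTF-16 code unit.
            char decoded = (char) Integer.parseInt(matcher.group(1), 16);
            // quoteReplacement guards against '$' or '\' being treated as group references.
            matcher.appendReplacement(result, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        // The doubled backslash keeps the escape as 12 literal characters in the input.
        String raw = "peque\\u00f1o";
        System.out.println(raw.length());           // prints 12
        System.out.println(unescape(raw).length()); // prints 7
    }
}
```

Unlike a full unescaper, this does not treat an escaped backslash before the u specially, so it is only suitable for input that is known to contain plain escapes.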
