Unicode to String in Java, but tricky

I was fetching data from a website using its API which was returning the data in JSON format.
The issue was when there were some umlaut characters in the JSON: the API would return their Unicode escapes, e.g. Münich would come back as Mu\u0308nich.
When I passed this JSON string to the constructor of org.codehaus.jettison.json.JSONObject, Mu\u0308nich was converted to the decomposed form of Münich (a plain u followed by the combining umlaut U+0308) rather than a precomposed ü. Wrong.
I realized this very late (after fetching the entire data). Now I use the following method to convert it back to the escaped form, i.e. I pass the decomposed Münich to the method and it returns Mu\u0308nich.
I want to somehow convert this Mu\u0308nich to Münich. Any ideas?
Please note the conversion is needed only for u\u0308 to ü, o\u0308 to ö, a\u0308 to ä, and so on.
Method used to convert back:
// requires java.util.Formatter
public static String escapeUnicode(String input) {
    StringBuilder b = new StringBuilder(input.length());
    Formatter f = new Formatter(b);
    for (char c : input.toCharArray()) {
        if (c < 128) {
            b.append(c);                  // plain ASCII passes through unchanged
        } else {
            f.format("\\u%04x", (int) c); // everything else becomes a \uXXXX escape
        }
    }
    return b.toString();
}

These are combining diacritical marks, and you can use java.text.Normalizer to combine a base letter plus diacritic into a single Unicode character.
Use the normalize method with Normalizer.Form.NFKC. This will first decompose the full string and then recompose it, yielding 'real' precomposed umlauts.
So 'München' stays 'München' and 'Mu\u0308nchen' becomes 'München'.
You then have the string in a single format, no longer using combining diacritics, and easily portable and displayable.
If you work with texts from different platforms, some normalization is crucial, or you will end up with the problems you described.
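A minimal sketch of this (the class name NormalizeDemo is made up here; the Normalizer API itself is standard since Java 6):

import java.text.Normalizer;

public class NormalizeDemo {
    public static void main(String[] args) {
        String decomposed = "Mu\u0308nchen"; // 'u' + combining diaeresis (U+0308)
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFKC);
        System.out.println(composed);            // München, with a single precomposed 'ü'
        System.out.println(decomposed.length()); // 8 (the umlaut is a separate char)
        System.out.println(composed.length());   // 7
    }
}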


Is this format "U+043E;U+006F,U+004D" some sort of encoding standard, and does Java offer a standard library method to convert it to char?

I am investigating some mess that has been done to our language support (it is used in our IDN functionality, if that rings a bell)...
I used an SQL GUI client to quickly see the structure of our language definitions. So, when I do select charcodes from ourCharCodesTable where language = 'myLanguage';, I get results for some values of 'myLanguage', e.g.:
myLanguage = "ASCII":
result = "-0123456789abcdefghijklmnopqrstuvwxyz"
myLanguage = "Russian":
result = "-0123456789абвгдежзийклмнопрстуфхцчшщъьюяѐѝ"
(BTW: you can already spot a language mistake here if you are a polyglot like me!)
I thought: "OK, I can work with this! Let's write a Java program and put some logic to find mistakes..."
I need my logic to receive one char at a time from the 'result' and, according to the current table context, apply my logic to flag if it should or should not be there...
However! When I am at:
myLanguage = "Belarusian" :
One would think this language is rather similar to Russian, but the very format of the result coming from the database is totally different: result = "U+002D\nU+0030\nU+0030..."!
And, there's another format!
myLanguage = "Chinese" :
result = "#\nU+002D;U+002D;U+003D,U+004D,U+002D\nU+0030;U+0030;U+0030"
FWIW: charcodes column is of CLOB type.
I know U+002D is '-' and U+0030 is '0'...
My current idea is to:
1] Check if the entire response is in 'щ' format or 'U+0449' format (whether the 'U+****'s are separated with ';', ',' or '\n' - I am just going to treat them as standalone chars)
a. If it is the "easy one", just send the char on to my testing method
b. If it is the "hard one", get the hex part (0449), convert to decimal (1097) and cast to char (щ)
So, again, my questions are:
What is this "U+043E;U+006F,U+004D" format?
If it is a widely-used standard, does Java offer any methods to convert a whole String of these into a char array?
UPDATED
What is this "U+043E;U+006F,U+004D" format?
In a comment, OP provided a link to https://www.iana.org/domains/idn-tables/tables/academy_zh_1.0.txt, which has the following text:
This table conforms to the format specified in RFC 3743.
RFC 3743 can be found at https://www.rfc-editor.org/rfc/rfc3743
If it is a widely-used standard, does Java offer any methods to convert a whole String of these into a char array?
It is not a widely-used standard, so Java does not offer that natively, but it is easy to convert to a regular String using a regex, after which you can process the string normally.
// uses java.util.regex.Pattern (and java.util.regex.Matcher for the pre-Java-9 variant)

// Java 11+
static String decodeUnicode(String input) {
    return Pattern.compile("U\\+[0-9A-F]{4,6}").matcher(input).replaceAll(mr ->
            Character.toString(Integer.parseInt(mr.group().substring(2), 16)));
}

// Java 9+
static String decodeUnicode(String input) {
    return Pattern.compile("U\\+[0-9A-F]{4,6}").matcher(input).replaceAll(mr ->
            new String(new int[] { Integer.parseInt(mr.group().substring(2), 16) }, 0, 1));
}

// Java 1.5+
static String decodeUnicode(String input) {
    StringBuffer buf = new StringBuffer();
    Matcher m = Pattern.compile("U\\+[0-9A-F]{4,6}").matcher(input);
    while (m.find()) {
        String hexString = m.group().substring(2);
        int codePoint = Integer.parseInt(hexString, 16);
        String unicodeCharacter = new String(new int[] { codePoint }, 0, 1);
        // quoteReplacement guards against decoded '\' or '$', which
        // appendReplacement would otherwise treat as special
        m.appendReplacement(buf, Matcher.quoteReplacement(unicodeCharacter));
    }
    return m.appendTail(buf).toString();
}
Test
System.out.println(decodeUnicode("#\nU+002D;U+002D;U+003D,U+004D,U+002D\nU+0030;U+0030;U+0030"));
Output
#
-;-;=,M,-
0;0;0
U+0000 is a representation of a Unicode code point, and the format is defined in Appendix A of the Unicode Standard. The numbers are simply the hex-encoded number of the represented code point. For historical reasons they are always left-padded to at least 4 digits with 0, but can be up to 6 digits long.
It is not primarily meant as a machine-readable encoding, but rather as a human-readable representation of Unicode code points for use in running text (i.e. paragraphs such as this one). Note especially that this format has no way to distinguish a 4-digit number followed by some digits from a 5- or 6-digit number. So U+123456 could be interpreted in 3 different ways: U+1234 followed by the text 56, U+12345 followed by the text 6, or U+123456. This makes it unsuited for automatic replacement and use as a general-purpose encoding.
As such there is no built-in functionality to parse this into its equivalent String or similar in Java.
The following code can be used to parse a single Unicode codepoint reference into the appropriate codepoint in a String:
public static String codePointToString(String input) {
    if (!input.startsWith("U+")) {
        throw new IllegalArgumentException("Malformed input, doesn't start with U+");
    }
    int codepoint = Integer.parseInt(input.substring(2), 16);
    if (codepoint < 0 || codepoint > Character.MAX_CODE_POINT) {
        throw new IllegalArgumentException("Malformed input, codepoint value out of valid range: " + codepoint);
    }
    return Character.toString(codepoint);
}
(Before Java 11 the return line needs to use new String(new int[] { codepoint }, 0, 1) instead).
And if you want to replace all Unicode codepoints represented in a text by their actual text (which might render it unreadable in some cases) you can use this (together with the method above):
// note: the character class is restricted to hex digits; a broader class like
// [0-9A-Za-z] would match non-hex letters and make Integer.parseInt throw
private static final Pattern PATTERN = Pattern.compile("U\\+[0-9A-Fa-f]{4,6}");

public static String decodeCodePoints(String input) {
    return PATTERN
            .matcher(input)
            .replaceAll(result -> codePointToString(result.group()));
}
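For example, applying it to the format from the question ('о' in the output is the Cyrillic letter U+043E):

System.out.println(decodeCodePoints("U+043E;U+006F,U+004D")); // о;o,M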
Actually, I wrote an open-source library called MgntUtils that has a utility that can very much help you. The codes that you see are Unicode sequences where each U+XXXX represents a character. The utility in the library can convert any string in any language (including special characters) into Unicode sequences and vice versa. Here is a sample of how it works:
result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);
The output of this code is:
\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World
The library can be found at Maven Central or on GitHub. It comes as a Maven artifact, with sources and javadoc.
Here is the javadoc for the class StringUnicodeEncoderDecoder.

Java - read UTF-8 file with a single emoji symbol

I have a file with a single unicode symbol.
The file is encoded in UTF-8.
It contains a single symbol represented as 4 bytes.
https://www.fileformat.info/info/unicode/char/1f60a/index.htm
F0 9F 98 8A
When I read the file I get two symbols/chars.
The program below prints
?
2
?
?
55357
56842
======================================
😊
16
&
======================================
?
2
?
======================================
Is this normal... or a bug? Or am I misusing something?
How do I get that single emoji symbol in my code?
EDIT: And also... how do I escape it for XML?
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class Test008 {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(
                        new FileInputStream("D:\\DATA\\test1.txt"), "UTF8"));
        String s = "";
        while ((s = in.readLine()) != null) {
            System.out.println(s);
            System.out.println(s.length());
            System.out.println(s.charAt(0));
            System.out.println(s.charAt(1));
            System.out.println((int) (s.charAt(0)));
            System.out.println((int) (s.charAt(1)));
            String z = org.apache.commons.lang.StringEscapeUtils.escapeXml(s);
            String z3 = org.apache.commons.lang3.StringEscapeUtils.escapeXml(s);
            System.out.println("======================================");
            System.out.println(z);
            System.out.println(z.length());
            System.out.println(z.charAt(0));
            System.out.println("======================================");
            System.out.println(z3);
            System.out.println(z3.length());
            System.out.println(z3.charAt(0));
            System.out.println("======================================");
        }
        in.close();
    }
}
Yes, this is normal: that Unicode symbol lies outside the BMP, so it occupies 2 UTF-16 chars (1 char is 2 bytes).
int codePoint = s.codePointAt(0); // your code point
System.out.printf("U+%04X, chars: %d%n", codePoint, Character.charCount(codePoint));
U+1F60A, chars: 2
After comments
Java, using a Stream:
public static String escapeToAsciiHTML(String s) {
    StringBuilder sb = new StringBuilder();
    s.codePoints().forEach(cp -> {
        if (cp < 128) {
            sb.append((char) cp);                   // plain ASCII as-is
        } else {
            sb.append("&#").append(cp).append(";"); // numeric character reference
        }
    });
    return sb.toString();
}
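For example (output shown as a comment; the emoji is U+1F60A, decimal 128522):

System.out.println(escapeToAsciiHTML("smile 😊")); // smile &#128522;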
StringEscapeUtils is broken. Don't use it. Try NumericEntityEscaper.
Or, better yet, since Apache Commons libraries tend to be bad API** and broken*** anyway, use Guava*'s XmlEscapers.
Java is Unicode, yes, but 'char' is a lie. 'char' does not represent characters; it represents a single, unsigned 16-bit number. The actual method to get a character out of, say, a j.l.String object isn't charAt, which is a misnomer; it's codePointAt, and friends.
This (char being a fakeout) normally doesn't matter; most actual characters fit in the 16-bit char type. But when they don't, this matters, and that emoji doesn't fit. In the Unicode model used by Java and the char type, you then get 2 char values representing a single Unicode character. This pair is called a 'surrogate pair'.
Note that the right methods tend to work in int (you need the 32 bits to represent one single Unicode symbol, after all).
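A quick sketch of the difference, using the emoji from the question:

String s = "😊";                                     // U+1F60A, outside the BMP
System.out.println(s.length());                      // 2 (UTF-16 code units, not characters)
System.out.println(s.codePointCount(0, s.length())); // 1 (actual Unicode characters)
System.out.printf("U+%X%n", s.codePointAt(0));       // U+1F60A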
*) Guava has its own issues: by being aggressively not backwards compatible with itself, it tends to lead to dependency hell. It's a pick-your-poison kind of deal, unfortunately.
**) Utils-anything is usually a sign of bad API design; 'util' is almost meaningless as a term and usually implies you've broken the object-oriented model. The right model is of course to have an object representing the process of translating data in one form (say, a raw string) to another (say, a string that can be dumped straight into an XML file, escaped and all) - and such a thing would thus be called an 'escaper', and would live perhaps in a package named 'escapers' or 'text'. Later editions of the Apache libraries, as well as Guava, fortunately 'fixed' this.
***) As this very example shows, these APIs often don't do what you want them to. Note that Apache is open source; if you want these APIs to be better, they accept pull requests :)

How can I save a String Byte without losing information?

I'm developing a JPEG decoder (I'm in the Huffman phase) and I want to write binary strings into a file.
For example, let's say we've this:
String huff = "00010010100010101000100100";
I've tried converting it to an integer, splitting it into groups of 8 and saving the integer representation, as I can't write individual bits:
for (String str : huff.split("(?<=\\G.{8})")) {
    int val = Integer.parseInt(str, 2);
    out.write(val); // writes to a FileOutputStream
}
The problem is that, in my example, if I try to save "00010010" it gets converted to 18 (10010), and I need the leading 0's.
And finally, when I read:
int enter;
String code = "";
while ((enter = in.read()) != -1) {
    code += Integer.toBinaryString(enter);
}
I got :
Code = 10010
instead of:
Code = 00010010
I've also tried converting it to a BitSet and then to Byte[], but I have the same problem.
Your example is that you have the string "10010" and you want the string "00010010". That is, you need to left-pad the string with zeroes. Note that since you're joining the results of many calls to Integer.toBinaryString in a loop, you need to left-pad each string inside the loop, before concatenating.
while ((enter = in.read()) != -1) {
    String binary = Integer.toBinaryString(enter);
    // left-pad to length 8
    binary = ("00000000" + binary).substring(binary.length());
    code += binary;
}
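The same padding can also be done with a format string, if you prefer (a small sketch with identical behavior):

String binary = String.format("%8s", Integer.toBinaryString(enter)).replace(' ', '0');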
You might want to look at the UTF-8 algorithm, since it does exactly what you want: it stores large amounts of data while discarding leading zeros, keeping the relevant data, and encoding it to take up less disk space.
Works with: Java version 7+
import java.nio.charset.StandardCharsets;
import java.util.Formatter;

public class UTF8EncodeDecode {

    public static byte[] utf8encode(int codepoint) {
        return new String(new int[]{codepoint}, 0, 1).getBytes(StandardCharsets.UTF_8);
    }

    public static int utf8decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8).codePointAt(0);
    }

    public static void main(String[] args) {
        System.out.printf("%-7s %-43s %7s\t%s\t%7s%n",
                "Char", "Name", "Unicode", "UTF-8 encoded", "Decoded");
        for (int codepoint : new int[]{0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E}) {
            byte[] encoded = utf8encode(codepoint);
            Formatter formatter = new Formatter();
            for (byte b : encoded) {
                formatter.format("%02X ", b);
            }
            String encodedHex = formatter.toString();
            int decoded = utf8decode(encoded);
            System.out.printf("%-7c %-43s U+%04X\t%-12s\tU+%04X%n",
                    codepoint, Character.getName(codepoint), codepoint, encodedHex, decoded);
        }
    }
}
https://rosettacode.org/wiki/UTF-8_encode_and_decode#Java
UTF-8 is a variable width character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes. The encoding is defined by the Unicode Standard, and was originally designed by Ken Thompson and Rob Pike. The name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit.
It was designed for backward compatibility with ASCII. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well. Since ASCII bytes do not occur when encoding non-ASCII code points into UTF-8, UTF-8 is safe to use within most programming and document languages that interpret certain ASCII characters in a special way, such as "/" (slash) in filenames, "\" (backslash) in escape sequences, and "%" in printf.
https://en.wikipedia.org/wiki/UTF-8
Binary 11110000 10010000 10001101 10001000 becomes F0 90 8D 88 in hex. Since you are storing it as text, you go from having to store 32 characters to storing 8. And because it's a well-known and well-designed encoding, you can reverse it easily. All the math is done for you.
Your example of 00010010100010101000100100 (or rather 00000001 0010100 0101010 00100100) converts to *$ (two unprintable characters on my machine). That's the UTF-8 encoding of the binary. I had mistakenly used a different site that was using the data I put in as decimal instead of binary.
https://onlineutf8tools.com/convert-binary-to-utf8
For a really good explanation of UTF-8 and how it can apply to the answer:
https://hackaday.com/2013/09/27/utf-8-the-most-elegant-hack/
Edit:
I took this question as a way to reduce the amount of characters needed to store values, which is a type of encoding. UTF-8 is a type of encoding. Used in a "non-standard" way, the OP can use UTF-8 to encode their strings of 0's & 1's in a much shorter format. That's how this answer is relevant.
If you concatenate the characters, you can go from 4x 8 bits (32 bits) to 8x 8 bits (64 bits) easily and encode a value as large as 9,223,372,036,854,775,807.

Java NIO server receives random string [duplicate]

I'm writing a web application in Google app Engine. It allows people to basically edit html code that gets stored as an .html file in the blobstore.
I'm using fetchData to return a byte[] of all the characters in the file. I'm trying to print it into an HTML page so that the user can edit the HTML code. Everything works great!
Here's my only problem now:
The byte array is having some issues when converted back to a string. Smart quotes and a couple of other characters come out looking funky (?'s or Japanese symbols, etc.). Specifically, several bytes I'm seeing have negative values, which is causing the problem.
The smart quotes are coming back as -108 and -109 in the byte array. Why is this, and how can I decode the negative bytes to show the correct character encoding?
The byte array contains characters in a specific encoding (that you should know). The way to convert it to a String is:
String decoded = new String(bytes, "UTF-8"); // example for one encoding type
By the way - the raw bytes may appear as negative decimals just because the Java datatype byte is signed; it covers the range from -128 to 127.
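A tiny illustration of the sign issue, using the value from the question:

byte b = (byte) 0x93;    // the Windows-1252 "smart quote" byte
System.out.println(b);   // prints -109, because byte is signed
int unsigned = b & 0xFF; // 147 == 0x93: mask to recover the unsigned value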
-109 = 0x93: Control Code "Set Transmit State"
The value (-109) is a non-printable control character in Unicode, so UTF-8 is not the correct encoding for that character stream.
0x93 in "Windows-1252" is the "smart quote" that you're looking for, so the Java name of that encoding is "Cp1252". The next line provides test code:
System.out.println(new String(new byte[]{-109}, "Cp1252"));
Java 7 and above
You can also pass your desired encoding to the String constructor as a Charset constant from StandardCharsets. This may be safer than passing the encoding as a String, as suggested in the other answers.
For example, for UTF-8 encoding
String bytesAsString = new String(bytes, StandardCharsets.UTF_8);
You can try this.
String s = new String(bytearray); // note: this uses the platform default charset
public class Main {

    /**
     * Example method for converting a byte to a String.
     */
    public void convertByteToString() {
        byte b = 65;
        // Using the static toString method of the Byte class
        System.out.println(Byte.toString(b));
        // Using simple concatenation with an empty String
        System.out.println(b + "");
        // Creating a byte array and passing it to the String constructor
        System.out.println(new String(new byte[] { b }));
    }

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        new Main().convertByteToString();
    }
}
Output
65
65
A
public static String readFile(String fn) throws IOException {
    File f = new File(fn);
    byte[] buffer = new byte[(int) f.length()];
    FileInputStream is = new FileInputStream(fn);
    is.read(buffer); // caution: a single read() is not guaranteed to fill the buffer
    is.close();
    return new String(buffer, "UTF-8"); // use desired encoding
}
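On Java 7+ the same thing can be done more safely in one line (a sketch using java.nio.file.Files, java.nio.file.Paths and java.nio.charset.StandardCharsets):

String content = new String(Files.readAllBytes(Paths.get(fn)), StandardCharsets.UTF_8);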
I suggest Arrays.toString(byte_array);
It depends on your purpose. For example, I wanted to save a byte array in exactly the format you see at debug time, something like [1, 2, 3]. If you want to save exactly that value without converting the bytes to character format, Arrays.toString(byte_array) does this. But if you want to save characters instead of bytes, you should use String s = new String(byte_array); in that case, s is the character equivalent of [1, 2, 3].
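The contrast in one example (using java.util.Arrays and java.nio.charset.StandardCharsets; outputs shown as comments):

byte[] bytes = { 72, 105 };
System.out.println(Arrays.toString(bytes));                       // [72, 105]
System.out.println(new String(bytes, StandardCharsets.US_ASCII)); // Hi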
The previous answer from Andreas_D is good. I'll just add that wherever you are displaying the output there will be a font and a character encoding, and it may not support some characters.
To work out whether it is Java or your display that is a problem, do this:
for (int i = 0; i < str.length(); i++) {
    char ch = str.charAt(i);
    System.out.println(i + " : " + ch + " " + Integer.toHexString(ch)
            + ((ch == '\ufffd') ? " Unknown character" : ""));
}
Java will have mapped any characters it cannot understand to 0xfffd, the official replacement character for unknown characters. If you see a '?' in the output, but it is not mapped to 0xfffd, it is your display font or encoding that is the problem, not Java.

How to replace/remove 4(+)-byte characters from a UTF-8 string in Java?

Because MySQL 5.1 does not support 4-byte UTF-8 sequences, I need to replace/drop the 4-byte sequences in these strings.
I'm looking for a clean way to replace these characters.
Replacing the characters with a question mark, as the Apache libraries do, is fine for this case, although an ASCII equivalent would be nicer, of course.
N.B. The input is from external sources (e-mail names) and upgrading the database is not a solution at this point in time.
We ended up implementing the following method in Java for this problem.
Basically, we replace any character with a higher code point than the last 3-byte UTF-8 char (U+FFFF).
The offset calculations make sure we stay on Unicode code point boundaries.
public static final String LAST_3_BYTE_UTF_CHAR = "\uFFFF";
public static final String REPLACEMENT_CHAR = "\uFFFD";

public static String toValid3ByteUTF8String(String s) {
    final int length = s.length();
    StringBuilder b = new StringBuilder(length);
    for (int offset = 0; offset < length; ) {
        final int codepoint = s.codePointAt(offset);
        if (codepoint > LAST_3_BYTE_UTF_CHAR.codePointAt(0)) {
            // needs 4+ bytes in UTF-8: replace it
            b.append(REPLACEMENT_CHAR);
        } else if (Character.isValidCodePoint(codepoint)) {
            b.appendCodePoint(codepoint);
        } else {
            b.append(REPLACEMENT_CHAR);
        }
        offset += Character.charCount(codepoint);
    }
    return b.toString();
}
Another simple solution is to use the regular expression [^\u0000-\uFFFF]. For example, in Java:
text.replaceAll("[^\\u0000-\\uFFFF]", "\uFFFD");
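For what it's worth, both approaches agree on a quick test (😊 is U+1F60A, a 4-byte UTF-8 sequence; outputs shown as comments):

String input = "smile 😊!";
System.out.println(toValid3ByteUTF8String(input));                    // smile �!
System.out.println(input.replaceAll("[^\\u0000-\\uFFFF]", "\uFFFD")); // smile �!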
5-byte UTF-8 sequences begin with a 111110xx byte, and 6-byte UTF-8 sequences begin with a 1111110x byte. It is important to note that no follow-up bytes of 1-4-byte UTF-8 sequences contain bytes that large, because follow-up bytes are always of the form 10xxxxxx.
Therefore you can just go through the bytes, and every time you see a byte of the form 111110xx, emit a single '?' to the output stream/array while skipping the next 4 bytes of the input; analogously for the 6-byte sequences.
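A minimal sketch of that byte-level scan, widened to also catch the 4-byte sequences from the question title (replaceLongSequences is a hypothetical name; assumes the input is otherwise valid UTF-8):

import java.io.ByteArrayOutputStream;

static byte[] replaceLongSequences(byte[] utf8) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int i = 0;
    while (i < utf8.length) {
        int b = utf8[i] & 0xFF;
        if (b >= 0xF0) { // 11110xxx, 111110xx or 1111110x: lead byte of a 4/5/6-byte sequence
            out.write('?');
            i++;         // skip the lead byte ...
            while (i < utf8.length && (utf8[i] & 0xC0) == 0x80) {
                i++;     // ... and all of its 10xxxxxx continuation bytes
            }
        } else {
            out.write(b); // 1-3-byte sequences pass through untouched
            i++;
        }
    }
    return out.toByteArray();
}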
