Java - read UTF-8 file with a single emoji symbol

I have a file with a single unicode symbol.
The file is encoded in UTF-8.
It contains a single symbol represented as 4 bytes.
https://www.fileformat.info/info/unicode/char/1f60a/index.htm
F0 9F 98 8A
When I read the file I get two symbols/chars.
The program below prints
?
2
?
?
55357
56842
======================================
😊
16
&
======================================
?
2
?
======================================
Is this normal... or a bug? Or am I misusing something?
How do I get that single emoji symbol in my code?
EDIT: And also... how do I escape it for XML?
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class Test008 {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(
                        new FileInputStream("D:\\DATA\\test1.txt"), "UTF8"));
        String s = "";
        while ((s = in.readLine()) != null) {
            System.out.println(s);
            System.out.println(s.length());
            System.out.println(s.charAt(0));
            System.out.println(s.charAt(1));
            System.out.println((int) (s.charAt(0)));
            System.out.println((int) (s.charAt(1)));
            String z = org.apache.commons.lang.StringEscapeUtils.escapeXml(s);
            String z3 = org.apache.commons.lang3.StringEscapeUtils.escapeXml(s);
            System.out.println("======================================");
            System.out.println(z);
            System.out.println(z.length());
            System.out.println(z.charAt(0));
            System.out.println("======================================");
            System.out.println(z3);
            System.out.println(z3.length());
            System.out.println(z3.charAt(0));
            System.out.println("======================================");
        }
        in.close();
    }
}

Yes, this is normal: that Unicode symbol lies outside the Basic Multilingual Plane, so it takes 2 UTF-16 chars (1 char is 2 bytes).
int codePoint = s.codePointAt(0); // your code point
System.out.printf("U+%04X, chars: %d%n", codePoint, Character.charCount(codePoint));
which prints:
U+1F60A, chars: 2
Updated after the comments. Java, using a stream of code points:
public static String escapeToAsciiHTML(String s) {
    StringBuilder sb = new StringBuilder();
    s.codePoints().forEach(cp -> {
        if (cp < 128) {
            sb.append((char) cp);
        } else {
            sb.append("&#").append(cp).append(";");
        }
    });
    return sb.toString();
}
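For example, fed the emoji from the question (U+1F60A is 128522 in decimal), the method above produces a numeric character reference:
System.out.println(escapeToAsciiHTML("😊")); // prints: &#128522;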

StringEscapeUtils is broken. Don't use it. Try NumericEntityEscaper. Or, better yet, since the Apache Commons libraries tend to be bad API** and broken*** anyway, use Guava*'s XmlEscapers.
Java is Unicode, yes, but 'char' is a lie. 'char' does not represent characters; it represents a single, unsigned 16-bit number. The actual method to get a character out of, say, a j.l.String object isn't charAt, which is a misnomer; it's codePointAt, and friends.
This (char being a fakeout) normally doesn't matter; most actual characters fit in the 16-bit char type. But when they don't, it matters, and that emoji doesn't fit. In the Unicode model used by Java and its char type, you then get 2 char values representing a single Unicode character. This pair is called a 'surrogate pair'.
Note that the right methods tend to work in int (you need the 32 bits to represent one single unicode symbol, after all).
*) guava has its own issues, by being aggressively not backwards compatible with itself, it tends to lead to dependency hell. It's a pick your poison kind of deal, unfortunately.
**) Utils-anything is usually a sign of bad API design; 'util' is almost meaningless as a term and usually implies you've broken the object oriented model. The right model is of course to have an object representing the process of translating data in one form (say, a raw string) to another (say, a string that can be dumped straight into an XML file, escaped and well) - and such a thing would thus be called an 'escaper', and would live perhaps in a package named 'escapers' or 'text'. Later editions of apache libraries, as well as guava, fortunately 'fixed' this.
***) As this very example shows, these APIs often don't do what you want them to. Note that apache is open source; if you want these APIs to be better, they accept pull requests :)
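For completeness, a minimal sketch with Guava's XmlEscapers (note: it escapes markup characters; the emoji itself is a legal XML character in a UTF-8 document, so it passes through unescaped):
import com.google.common.xml.XmlEscapers;

String s = "a < b & \uD83D\uDE0A"; // the emoji as its surrogate pair
System.out.println(XmlEscapers.xmlContentEscaper().escape(s));
// prints: a &lt; b &amp; 😊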

Related

Is this format: "U+043E;U+006F,U+004D" some sort of encoding standard and does java offer a standard library method to convert it to char?

I am investigating some mess that has been done to our languages-support (it is used in our IDN functionality, if that rings a bell)...
I used an SQL GUI client to quickly see the structure of our language definitions. So, when I do select charcodes from ourCharCodesTable where language = 'myLanguage';, I get results for some values of 'myLanguage', E.G.:
myLanguage = "ASCII":
result = "-0123456789abcdefghijklmnopqrstuvwxyz"
myLanguage = "Russian":
result = "-0123456789абвгдежзийклмнопрстуфхцчшщъьюяѐѝ"
(BTW: you can already see a language mistake here, if you are a polyglot like me!)
I thought: "OK, I can work with this! Let's write a Java program and put in some logic to find the mistakes..."
I need my logic to receive one char at a time from the 'result' and, according to the current table context, flag whether it should or should not be there...
However! When I am at:
myLanguage = "Belarusian" :
One would think this language is rather similar to Russian, but the very format of the result, as coming from the database is totally different: result = "U+002D\nU+0030\nU+0030..." !
And, there's another format!
myLanguage = "Chinese" :
result = "#\nU+002D;U+002D;U+003D,U+004D,U+002D\nU+0030;U+0030;U+0030"
FWIW: charcodes column is of CLOB type.
I know U+002D is '-' and U+0030 is '0'...
My current idea is to:
1] Check if the entire response is in 'щ' format or 'U+0449' format (whether the 'U+****'s are separated with ';', ',' or '\n' - I am just going to treat them as standalone chars)
a. If it is the "easy one", just send the char on to my testing method
b. If it is the "hard one", get the hex part (0449), convert to decimal (1097) and cast to char (щ)
So, again, my questions are:
What is this "U+043E;U+006F,U+004D" format?
If it is a widely-used standard, does Java offer any methods to convert a whole String of these into a char array?
UPDATED
What is this "U+043E;U+006F,U+004D" format?
In a comment, OP provided a link to https://www.iana.org/domains/idn-tables/tables/academy_zh_1.0.txt, which has the following text:
This table conforms to the format specified in RFC 3743.
RFC 3743 can be found at https://www.rfc-editor.org/rfc/rfc3743
If it is a widely-used standard, does Java offer any methods to convert a whole String of these into a char array?
It is not a widely-used standard, so Java does not offer that natively, but it is easy to convert to a regular String using a regex, after which you can process the string normally.
// Java 11+
static String decodeUnicode(String input) {
    return Pattern.compile("U\\+[0-9A-F]{4,6}").matcher(input).replaceAll(mr ->
            Character.toString(Integer.parseInt(mr.group().substring(2), 16)));
}

// Java 9+
static String decodeUnicode(String input) {
    return Pattern.compile("U\\+[0-9A-F]{4,6}").matcher(input).replaceAll(mr ->
            new String(new int[] { Integer.parseInt(mr.group().substring(2), 16) }, 0, 1));
}

// Java 1.5+
static String decodeUnicode(String input) {
    StringBuffer buf = new StringBuffer();
    Matcher m = Pattern.compile("U\\+[0-9A-F]{4,6}").matcher(input);
    while (m.find()) {
        String hexString = m.group().substring(2);
        int codePoint = Integer.parseInt(hexString, 16);
        String unicodeCharacter = new String(new int[] { codePoint }, 0, 1);
        m.appendReplacement(buf, unicodeCharacter);
    }
    return m.appendTail(buf).toString();
}
Test
System.out.println(decodeUnicode("#\nU+002D;U+002D;U+003D,U+004D,U+002D\nU+0030;U+0030;U+0030"));
Output
#
-;-;=,M,-
0;0;0
U+0000 is a representation of a Unicode code point, and the format is defined in Appendix A of the Unicode Standard. The number is simply the hex-encoded number of the represented code point. For historical reasons it is always left-padded with 0 to at least 4 digits, but it can be up to 6 digits long.
It is not primarily meant as a machine-readable encoding, but rather as a human-readable representation of Unicode code points for use in running text (i.e. paragraphs such as this one). Note especially that this format has no way to distinguish a four-digit number followed by some digits from a 5- or 6-digit number. So U+123456 could be interpreted in 3 different ways: U+1234 followed by the text 56, U+12345 followed by the text 6, or U+123456. This makes it unsuited for automatic replacement and for use as a general-purpose encoding.
As such there is no built-in functionality to parse this into its equivalent String or similar in Java.
The following code can be used to parse a single Unicode codepoint reference into the appropriate codepoint in a String:
public static String codePointToString(String input) {
    if (!input.startsWith("U+")) {
        throw new IllegalArgumentException("Malformed input, doesn't start with U+");
    }
    int codepoint = Integer.parseInt(input.substring(2), 16);
    if (codepoint < 0 || codepoint > Character.MAX_CODE_POINT) {
        throw new IllegalArgumentException("Malformed input, codepoint value out of valid range: " + codepoint);
    }
    return Character.toString(codepoint);
}
(Before Java 11 the return line needs to use new String(new int[] { codepoint }, 0, 1) instead).
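For example (output shown as a comment):
System.out.println(codePointToString("U+1F60A")); // prints: 😊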
And if you want to replace all Unicode codepoints represented in a text by their actual text (which might render it unreadable in some cases) you can use this (together with the method above):
// Hex digits only: [A-Za-z] would also match non-hex letters and make parseInt throw.
private static final Pattern PATTERN = Pattern.compile("U\\+[0-9A-Fa-f]{4,6}");

public static String decodeCodePoints(String input) {
    return PATTERN
            .matcher(input)
            .replaceAll(result -> codePointToString(result.group()));
}
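Applied to the string from the question (U+043E is the Cyrillic 'о'):
System.out.println(decodeCodePoints("U+043E;U+006F,U+004D")); // prints: о;o,M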
Actually, I wrote an open-source library called MgntUtils that has a utility that can very much help you. The codes that you see are Unicode sequences where each U+XXXX represents a character. The utility in the library can convert any string in any language (including special characters) into Unicode sequences and vice versa. Here is a sample of how it works:
result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);
The output of this code is:
\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World
The library can be found on Maven Central or on GitHub. It comes as a Maven artifact, with sources and Javadoc.
Here is javadoc for the class StringUnicodeEncoderDecoder

The readChar() method displays japanese character

I'm trying to write code that picks up a word from a file according to an index entered by the user. The problem is that the readChar() method from the RandomAccessFile class is returning Japanese characters. I must admit it's not the first time I've seen this on my Lenovo laptop; sometimes in installation wizards I see normal characters mixed with Japanese characters. Do you think it comes from the laptop or rather from the code?
This is the code:
package com.project;

import java.io.*;
import java.util.StringTokenizer;

public class Main {
    public static void main(String[] args) throws IOException {
        int N, i = 0;
        char C;
        char[] charArray = new char[100];
        String fileLocation = "file.txt";
        BufferedReader buffer = new BufferedReader(new InputStreamReader(System.in));
        do {
            System.out.println("enter the index of the word");
            N = Integer.parseInt(buffer.readLine());
            if (N != 0) {
                RandomAccessFile word = new RandomAccessFile(new File(fileLocation), "r");
                do {
                    word.seek((2 * (N - 1)) + i);
                    C = word.readChar();
                    charArray[i] = C;
                    i++;
                } while (charArray[i - 1] != ' ');
                System.out.println("the word of index " + N + " is: ");
                for (char carTemp : charArray)
                    System.out.print(carTemp);
                System.out.print("\n");
            }
        } while (N != 0);
        buffer.close();
    }
}
I get this output:
瑯潕啰灰灥敲牃䍡慳獥攨⠩⤍ഊੴ瑯潌䱯潷睥敲牃䍡慳獥攨⠩⤍ഊ੣捯潭浣捡慴琨⡓却瑲物楮湧朩⤍ഊ੣捨桡慲牁䅴琨⡩楮湴琩⤍ഊੳ獵畢扳獴瑲物楮湧木⠠⁳獴瑡慲牴琠⁩楮湤摥數砬Ⱐ⁥敮湤搠⁩楮湤摥數砩⤍ഊੴ瑲物業洨⠩Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 100 out of bounds for length 100
at Main.main(Main.java:21)
There are many things wrong, all of which have to do with fundamental misconceptions.
First off: A file on your disk - never mind the File interface in Java, or any other programming language; the file itself - does not and cannot store text. Ever. It stores bytes. That is, raw data, as (on every machine that's been relevant for decades, but historically there have been other ways to do it) quantified in bits, which are organized into groups of 8 that are called bytes.
Text is an abstraction; an interpretation of some particular sequence of byte values. It depends - fundamentally and unavoidably - on an encoding. Because this isn't a blog, I'll spare you the history lesson here, but suffice to say that Java's char type does not simply store a character of text. It stores an unsigned two-byte value, which may represent a character of text. Because there are more characters of text in Unicode than two bytes can represent, sometimes two adjacent chars in an array are required to represent a character of text. (And, of course, there is probably code out there that abuses the char type simply because someone wanted an unsigned equivalent of short. I may even have written some myself. That era is a blur for me.)
Anyway, the point is: using .readChar() is going to read two bytes from your file, and store them into a char within your char[], and the corresponding numeric value is not going to be anything like the one you wanted - unless your file happens to be encoded using the same encoding that Java uses natively, called UTF-16.
You cannot properly read and interpret the file without knowing the file encoding. Full stop. You can at best delude yourself into believing that you can read it. You also cannot have "random access" to a text file - i.e., indexing according to a number of characters of text - unless the encoding in question is constant width. (Otherwise, of course, you can't just calculate the distance-in-bytes into the file where a given character of text is; it depends on how many bytes the previous characters took up, which depends on which characters they are.) Many text encodings are not constant width. One of the most popular, which frankly is the sane default recommendation for most tasks these days, is not. In which case you are simply out of luck for the problem you describe.
At any rate, once you know the encoding of your file, the expected way to retrieve a character of text from a file in Java is to use one of the Reader classes, such as InputStreamReader:
An InputStreamReader is a bridge from byte streams to character streams: It reads bytes and decodes them into characters using a specified charset. The charset that it uses may be specified by name or may be given explicitly, or the platform's default charset may be accepted.
(Here, charset simply means an instance of the class that Java uses to represent text encodings.)
You may be able to fudge your problem description a little bit: seek to a byte offset, and then grab the text characters starting at that offset. However, there is no guarantee that the "text characters starting at that offset" make any sense, or in fact can be decoded at all. If the offset happens to be in the middle of a multi-byte encoding for a character, the remaining part isn't necessarily valid encoded text.
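To illustrate, a minimal sketch of the Reader approach described above; the UTF-8 charset and the file name are assumptions here, so substitute whatever your file actually uses:
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream("file.txt"), StandardCharsets.UTF_8))) {
    int c;
    while ((c = reader.read()) != -1) { // one decoded UTF-16 unit at a time
        System.out.print((char) c);
    }
}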
char is 16 bits, i.e. 2 bytes.
seek seeks to a byte boundary.
If the file contains chars then they are at even offsets: 0, 2, 4...
The expression (2*(N-1))+i is even iff i is even; when i is odd, you are sure to land in the middle of a char, and thus read garbage.
i starts at zero, but you increment by 1, i.e., half a character.
Your seek argument should probably be (2*(N-1+i)).
Alternative explanation: your file does not contain chars at all; for example, you created an ASCII file in which a character is a single byte.
In that case, the error is attempting to read ASCII (an obsolete character encoding) with a readChar function.
But if the file contains ASCII, the purpose of multiplying by 2 in the seek argument is obscure. It apparently serves no useful purpose.
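For the UTF-16 case, the corrected seek suggested above would look like this sketch (reading the i-th char of the word starting at index N):
word.seek(2L * (N - 1 + i)); // always an even byte offset: each char is exactly 2 bytes
C = word.readChar();         // readChar() reads 2 bytes, big-endian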
I changed the encoding of the file to UTF-16 and modified the program to display the right indexes, those that represent the beginning of each word. Now it works fine, thank you guys.
import java.io.*;

public class Main {
    public static void main(String[] args) throws IOException {
        int N, i = 0, j = 0, k = 0;
        char C;
        char[] charArray = new char[100];
        String fileLocation = "file.txt";
        BufferedReader buffer = new BufferedReader(new InputStreamReader(System.in));
        DataInputStream in = new DataInputStream(new FileInputStream(fileLocation));
        boolean EOF = false;
        do {
            try {
                j++;
                C = in.readChar();
                if ((C == ' ') || (C == '\n')) {
                    System.out.print(j + 1 + "\t");
                }
            } catch (IOException e) {
                EOF = true;
            }
        } while (EOF != true);
        System.out.println("\n");
        do {
            System.out.println("enter the index of the word");
            N = Integer.parseInt(buffer.readLine());
            if (N != 0) {
                RandomAccessFile word = new RandomAccessFile(new File(fileLocation), "r");
                do {
                    word.seek((2 * (N - 1 + i)));
                    C = word.readChar();
                    charArray[i] = C;
                    i++;
                } while (charArray[i - 1] != ' ' && charArray[i - 1] != '\n');
                System.out.print("the word of index " + N + " is: ");
                for (char carTemp : charArray)
                    System.out.print(carTemp);
                System.out.print("\n");
                i = 0;
                charArray = new char[100];
            }
        } while (N != 0);
        buffer.close();
    }
}

How can I save a String Byte without losing information?

I'm developing a JPEG decoder (I'm in the Huffman phase) and I want to write binary strings into a file.
For example, let's say we've this:
String huff = "00010010100010101000100100";
I've tried to convert it to an integer, splitting it by 8 and saving its integer representation, as I can't write bits:
for (String str : huff.split("(?<=\\G.{8})")) {
    int val = Integer.parseInt(str, 2);
    out.write(val); // writes to a FileOutputStream
}
The problem is that, in my example, if I try to save "00010010" it converts it to 18 (10010), and I need the 0's.
And finally, when I read :
int enter;
String code = "";
while ((enter = in.read()) != -1) {
    code += Integer.toBinaryString(enter);
}
I got :
Code = 10010
instead of:
Code = 00010010
Also, I've tried to convert it to a BitSet and then to Byte[], but I have the same problem.
Your example is that you have the string "10010" and you want the string "00010010". That is, you need to left-pad this string with zeroes. Note that since you're joining the results of many calls to Integer.toBinaryString in a loop, you need to left-pad these strings inside the loop, before concatenating them.
while ((enter = in.read()) != -1) {
    String binary = Integer.toBinaryString(enter);
    // left-pad to length 8
    binary = ("00000000" + binary).substring(binary.length());
    code += binary;
}
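The same left-padding is often written as a one-liner with String.format (an equivalent sketch):
String binary = String.format("%8s", Integer.toBinaryString(enter)).replace(' ', '0');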
You might want to look at the UTF-8 algorithm, since it does exactly what you want. It stores massive amounts of data while discarding zeros, keeping relevant data, and encoding it to take up less disk space.
Works with: Java version 7+
import java.nio.charset.StandardCharsets;
import java.util.Formatter;

public class UTF8EncodeDecode {

    public static byte[] utf8encode(int codepoint) {
        return new String(new int[]{codepoint}, 0, 1).getBytes(StandardCharsets.UTF_8);
    }

    public static int utf8decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8).codePointAt(0);
    }

    public static void main(String[] args) {
        System.out.printf("%-7s %-43s %7s\t%s\t%7s%n",
                "Char", "Name", "Unicode", "UTF-8 encoded", "Decoded");
        for (int codepoint : new int[]{0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E}) {
            byte[] encoded = utf8encode(codepoint);
            Formatter formatter = new Formatter();
            for (byte b : encoded) {
                formatter.format("%02X ", b);
            }
            String encodedHex = formatter.toString();
            int decoded = utf8decode(encoded);
            System.out.printf("%-7c %-43s U+%04X\t%-12s\tU+%04X%n",
                    codepoint, Character.getName(codepoint), codepoint, encodedHex, decoded);
        }
    }
}
https://rosettacode.org/wiki/UTF-8_encode_and_decode#Java
UTF-8 is a variable width character encoding capable of encoding all 1,112,064[nb 1] valid code points in Unicode using one to four 8-bit bytes.[nb 2] The encoding is defined by the Unicode Standard, and was originally designed by Ken Thompson and Rob Pike.[1][2] The name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit.[3]
It was designed for backward compatibility with ASCII. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well. Since ASCII bytes do not occur when encoding non-ASCII code points into UTF-8, UTF-8 is safe to use within most programming and document languages that interpret certain ASCII characters in a special way, such as "/" (slash) in filenames, "\" (backslash) in escape sequences, and "%" in printf.
https://en.wikipedia.org/wiki/UTF-8
Binary 11110000 10010000 10001101 10001000 becomes F0 90 8D 88 in UTF-8. Since you are storing it as text, you go from having to store 32 characters to storing 8. And because it's a well known and well designed encoding, you can reverse it easily. All the math is done for you.
Your example of 00010010100010101000100100 (or rather, left-padded to whole bytes, 00000000 01001010 00101010 00100100) converts to *$ (two unprintable characters on my machine). That's the UTF-8 encoding of the binary. I had mistakenly used a different site that was using the data I put in as decimal instead of binary.
https://onlineutf8tools.com/convert-binary-to-utf8
For a really good explanation of UTF-8 and how it can apply to the answer:
https://hackaday.com/2013/09/27/utf-8-the-most-elegant-hack/
Edit:
I took this question as a way to reduce the amount of characters needed to store values, which is a type of encoding. UTF-8 is a type of encoding. Used in a "non-standard" way, the OP can use UTF-8 to encode their strings of 0's & 1's in a much shorter format. That's how this answer is relevant.
If you concatenate the characters, you can go from 4x 8 bits (32 bits) to 8x 8 bits (64 bits) easily and encode a value as large as 9,223,372,036,854,775,807.

In what format is data sent on, getInputStream on a URLconnection Object?

I'm trying to connect to a PHP script on a server and retrieve the text the script echoes. To accomplish this I used the following code.
Code:
import java.net.*;
import java.io.*;

class con {
    public static void main(String[] args) {
        try {
            int c;
            URL tj = new URL("http://www.thejoint.cf/test.php");
            URLConnection tjcon = tj.openConnection();
            InputStream input = tjcon.getInputStream();
            while ((c = input.read()) != -1) {
                System.out.print((char) c);
            }
            input.close();
        } catch (Exception e) {
            System.out.println("Caught this Exception:" + e);
        }
    }
}
I do get the desired output, that is, the text "You will be Very successful". But when I remove the (char) type cast it yields a 76-digit-long number,
8911111732119105108108329810132118101114121321151179999101115115102117108108
which I'm not able to make sense of. I read that getInputStream is a byte stream, so shouldn't the output then be eight digits for every byte?
Any insight would be very helpful. Thank you.
It does not print one number 76 digits long. You have a loop there; it prints many numbers, each up to three digits long (one byte each).
In ASCII, 89 = "Y", 111 = "o" ....
What the cast to char that you removed did was interpret each number as a Unicode code point and print the corresponding character instead (also one at a time).
This way of reading text byte by byte is very fragile. It basically only works with ASCII. You should be using a Reader to wrap the InputStream. Then you can read char and String directly (and it will take care of character sets such as Unicode).
Oh I thought it would give out the byte representation of the individual letter.
But that's exactly what it does.
You can see it more clearly if you use println instead of print (then it will print each number on its own line).
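Putting the suggestion into code, a minimal sketch (it assumes the server sends UTF-8; robust code would take the charset from the Content-Type header):
BufferedReader reader = new BufferedReader(
        new InputStreamReader(tjcon.getInputStream(), StandardCharsets.UTF_8));
String line;
while ((line = reader.readLine()) != null) {
    System.out.println(line);
}
reader.close();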

Unicode to String in java but tricky

I was fetching data from a website using its API which was returning the data in JSON format.
The issue arises when there are umlaut characters in the JSON: the API returns them in decomposed form, e.g. Münich would be Mu\u0308nich.
When I passed this JSON string to the constructor of org.codehaus.jettison.json.JSONObject, Mu\u0308nich was converted to Munich (the n has an umlaut). Wrong.
I realized this very late (after fetching the entire data set). Now I use the following method to convert it back to the Unicode form, i.e. I pass Munich (the n has an umlaut) to the method and it returns Mu\u0308nich.
I want to somehow convert this Mu\u0308nich to Münich. Any ideas?
Please note the conversion is needed only for u\u0308 to ü and o\u0308 to ö and a\u0308 to ä and so on.
Method used to convert back -
public static String escapeUnicode(String input) {
    StringBuilder b = new StringBuilder(input.length());
    Formatter f = new Formatter(b);
    for (char c : input.toCharArray()) {
        if (c < 128) {
            b.append(c);
        } else {
            f.format("\\u%04x", (int) c);
        }
    }
    return b.toString();
}
These are called diacritics, and you can use java.text.Normalizer to combine diacritics into single Unicode characters.
Use the normalize method with Form.NFKC. This will first decompose the full string into diacritics and then do a composition to return 'real' Unicode umlauts.
So: 'München' stays 'München' and 'Mu\u0308nchen' will become 'München'.
You will then have the string in a single format, not using diacritics anymore, and easily portable and displayable.
If you work with texts from different platforms, some normalization is crucial or you will end up with the problems you described.
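A minimal sketch of the suggested normalization:
import java.text.Normalizer;

String decomposed = "Mu\u0308nchen"; // 'u' followed by U+0308 COMBINING DIAERESIS
String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFKC);
System.out.println(composed);        // prints: München (with a single precomposed ü)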
