UTF-8 and characters in Java and Eclipse IDE - java

public static void main(String[] args) throws UnsupportedEncodingException {
    String str = "अ";
    byte[] bytes = str.getBytes("UTF-8");
    for (byte b : bytes) {
        System.out.print(b + "\t");
    }
    String hindi = new String(bytes, "UTF-8");
    System.out.println("\nHindi = " + hindi);
    System.out.println((int) 'अ');
}
OUTPUT:
-32 -92 -123
Hindi = अ
2309
I need an explanation of those three outputs, especially the last one.
Also, I copy-pasted this character अ from a web page. How do I type it manually in the Eclipse IDE? For example, ALT + 65 gives 'A', but ALT + 2309 does not give me 'अ' (I copy-pasted it again here).

The first print:
See public byte[] getBytes(Charset charset):
Encodes this String into a sequence of bytes using the given charset,
storing the result into a new byte array.
The second print:
See public String(byte[] bytes,
Charset charset):
Constructs a new String by decoding the specified array of bytes using
the specified charset.
The third print:
Casting the char 'अ' to int gives its UTF-16 code unit, which for this letter equals its Unicode code point U+0905; printed in decimal that is 2309.
The documentation quoted above should help you understand the output you're getting in each case.
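To tie the three outputs together, here is a small sketch of my own (not from the original answer) that prints the same bytes as unsigned hex, showing that -32 -92 -123 are just the signed views of 0xE0 0xA4 0x85, the UTF-8 encoding of U+0905 (2309 in decimal):
String str = "\u0905"; // अ, DEVANAGARI LETTER A
for (byte b : str.getBytes(java.nio.charset.StandardCharsets.UTF_8)) {
    System.out.printf("%02X ", b & 0xFF); // E0 A4 85
}
System.out.println();
System.out.println((int) 'अ');                // 2309
System.out.println(Integer.toHexString('अ')); // 905, i.e. U+0905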

Ideally you need to type the letter's associated Unicode escape "\uNNNN" in Java source, in any IDE.
Further, for a single character, the straightforward approach would be:
char c = '\uNNNN';
System.out.println((int) c);
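For the letter from the question, the escape is '\u0905' (0x0905 is 2309 in decimal); a minimal sketch of mine, not from the original answer:
char c = '\u0905';           // अ, DEVANAGARI LETTER A
System.out.println(c);       // prints अ if the console encoding/font supports it
System.out.println((int) c); // 2309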
Update: List of Unicode ranges for Devanagari http://www.unicode.org/charts/PDF/U0900.pdf
To type in Hindi in Notepad, an IDE, etc., search for software that maps keyboard keys to specific Hindi letters and the related grammatical punctuation.

Related

How can I save a String Byte without losing information?

I'm developing a JPEG decoder (I'm in the Huffman phase) and I want to write BinaryStrings into a file.
For example, let's say we have this:
String huff = "00010010100010101000100100";
I've tried to convert it to an integer, splitting it into groups of 8 and saving its integer representation, as I can't write bits:
for (String str : huff.split("(?<=\\G.{8})")) {
    int val = Integer.parseInt(str, 2);
    out.write(val); // writes to a FileOutputStream
}
The problem is that, in my example, if I try to save "00010010" it converts it to 18 (10010), and I need the 0's.
And finally, when I read:
int enter;
String code = "";
while ((enter = in.read()) != -1) {
    code += Integer.toBinaryString(enter);
}
I got:
Code = 10010
instead of:
Code = 00010010
Also, I've tried to convert it to a BitSet and then to a Byte[], but I have the same problem.
Your example is that you have the string "10010" and you want the string "00010010". That is, you need to left-pad this string with zeroes. Note that since you're joining the results of many calls to Integer.toBinaryString in a loop, you need to left-pad these strings inside the loop, before concatenating them.
while ((enter = in.read()) != -1) {
    String binary = Integer.toBinaryString(enter);
    // left-pad to length 8
    binary = ("00000000" + binary).substring(binary.length());
    code += binary;
}
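An equivalent left-padding idiom (a sketch of my own, not part of the original answer) uses String.format:
String binary = Integer.toBinaryString(enter);
binary = String.format("%8s", binary).replace(' ', '0'); // pad to 8 digits with zeroes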
You might want to look at the UTF-8 algorithm, since it does exactly what you want. It stores massive amounts of data while discarding zeros, keeping relevant data, and encoding it to take up less disk space.
Works with: Java version 7+
import java.nio.charset.StandardCharsets;
import java.util.Formatter;

public class UTF8EncodeDecode {

    public static byte[] utf8encode(int codepoint) {
        return new String(new int[]{codepoint}, 0, 1).getBytes(StandardCharsets.UTF_8);
    }

    public static int utf8decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8).codePointAt(0);
    }

    public static void main(String[] args) {
        System.out.printf("%-7s %-43s %7s\t%s\t%7s%n",
                "Char", "Name", "Unicode", "UTF-8 encoded", "Decoded");
        for (int codepoint : new int[]{0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E}) {
            byte[] encoded = utf8encode(codepoint);
            Formatter formatter = new Formatter();
            for (byte b : encoded) {
                formatter.format("%02X ", b);
            }
            String encodedHex = formatter.toString();
            int decoded = utf8decode(encoded);
            System.out.printf("%-7c %-43s U+%04X\t%-12s\tU+%04X%n",
                    codepoint, Character.getName(codepoint), codepoint, encodedHex, decoded);
        }
    }
}
https://rosettacode.org/wiki/UTF-8_encode_and_decode#Java
UTF-8 is a variable width character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes. The encoding is defined by the Unicode Standard, and was originally designed by Ken Thompson and Rob Pike. The name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit.
It was designed for backward compatibility with ASCII. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well. Since ASCII bytes do not occur when encoding non-ASCII code points into UTF-8, UTF-8 is safe to use within most programming and document languages that interpret certain ASCII characters in a special way, such as "/" (slash) in filenames, "\" (backslash) in escape sequences, and "%" in printf.
https://en.wikipedia.org/wiki/UTF-8
Binary 11110000 10010000 10001101 10001000 becomes F0 90 8D 88 in UTF-8. Since you are storing it as text, you go from having to store 32 characters to storing 8. And because it's a well known and well designed encoding, you can reverse it easily. All the math is done for you.
Your example of 00010010100010101000100100 (or rather 00000001 0010100 0101010 00100100) converts to *$ (two unprintable characters on my machine). That's the UTF-8 encoding of the binary. I had mistakenly used a different site that was using the data I put in as decimal instead of binary.
https://onlineutf8tools.com/convert-binary-to-utf8
For a really good explanation of UTF-8 and how it can apply to the answer:
https://hackaday.com/2013/09/27/utf-8-the-most-elegant-hack/
Edit:
I took this question as being about reducing the number of characters needed to store values, which is a type of encoding. UTF-8 is a type of encoding. Used in a "non-standard" way, the OP can use UTF-8 to encode their strings of 0's & 1's in a much shorter format. That's how this answer is relevant.
If you concatenate the characters, you can go from 4x 8 bits (32 bits) to 8x 8 bits (64 bits) easily and encode a value as large as 9,223,372,036,854,775,807.
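As a rough sketch of that idea (my own illustration with a hypothetical file name, not code from the answer), a bit string of up to 63 bits can be packed into a long and written as raw bytes:
// requires java.io.DataOutputStream and java.io.FileOutputStream
long packed = Long.parseLong("00010010100010101000100100", 2); // up to 63 bits
try (DataOutputStream out = new DataOutputStream(new FileOutputStream("huff.bin"))) {
    out.writeLong(packed); // writes 8 big-endian bytes, leading zero bits preserved
}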

Converting unicode chars like 0x00 to bytes

I am trying to port a small library from Java to C#.
I ran into a problem when converting Unicode strings to bytes. It can be seen in the snippets below:
import java.io.*;

public class Test {

    public static void method(String x) {
        System.out.println(x);
        byte[] bytes = x.getBytes();
        for (byte z : bytes) {
            System.out.println(z);
        }
        System.out.println("Array length: " + bytes.length);
    }

    public static void main(String args[]) {
        method("" + (char) 0xEE + (char) 0x00 + "testowy wydruk");
    }
}
This will do 3 things:
print the string
get the bytes
print that array and its length
I rewrote this snippet to C#:
string x = "" + (char)0xEE + (char)0x00 + "testowy wydruk";
Console.WriteLine(x);
byte[] d = System.Text.Encoding.ASCII.GetBytes(x);
foreach(byte z in d)
{
Console.WriteLine(z);
}
Console.WriteLine("Array length: "+d.Count());
I don't know why the array has 17 elements in Java but 16 in C#.
The difference is in the first elements of the byte arrays.
Unfortunately these differences can cause problems later, because this array is sent to another API.
(char)0xEE is î, aka Unicode Character 'LATIN SMALL LETTER I WITH CIRCUMFLEX' (U+00EE), which is encoded to UTF-8 as 0xC3 0xAE, aka -61 -82.
Your Java code doesn't specify which encoding you wanted the bytes in, so Java apparently converted to UTF-8 for you (default varies by installation).
You explicitly specified ASCII in the C# code, so the EE character was converted to ?, aka 0x3F aka 63, since there is no such character in ASCII.
If you change Java code to use getBytes("ASCII") or getBytes(StandardCharsets.US_ASCII), then you get same result as C#.
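A small sketch (not from the original answer) that makes the 17-vs-16 difference visible by requesting both encodings explicitly in Java:
import java.nio.charset.StandardCharsets;

String x = "" + (char) 0xEE + (char) 0x00 + "testowy wydruk";
// î needs two bytes in UTF-8, so 2 + 1 + 14 = 17 bytes
System.out.println(x.getBytes(StandardCharsets.UTF_8).length);    // 17
// in US-ASCII, î is replaced by a single '?' (0x3F), so 1 + 1 + 14 = 16 bytes
System.out.println(x.getBytes(StandardCharsets.US_ASCII).length); // 16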

Java, trying to create a specific network byte header based on length of command

I'm running into some trouble when attempting to create a network byte header. The header should be 2 bytes long and simply defines the length of the following command.
For example, the command String "HED>0123456789ABCDEF" is 20 characters long, which is 0x0014 as a hex signed two's-complement short. Creating the network byte header for this command works as long as the command is under 124 characters: the snippet of code below works out the byte header and adds the prefix \u0000\u0014 to the command when the command is under 124 characters.
However, for commands that are 124 characters or longer, the code in the if block doesn't work. I therefore looked into possible alternatives and tried a couple of things around generating hex characters and setting them as the network byte header, but as they aren't bytes it isn't going to work (as seen in the else block). Instead the else block simply returns 0090 for a command that is 153 characters long, which is technically correct, but I'm not able to use this 'length' header the same way as the if block's length header.
public static void main(String[] args) {
    final String commandHeader = "HED>";
    final String command = "0123456789ABCDEF";
    short commandLength = (short) (commandHeader.length() + command.length());
    char[] array;
    if (commandLength < 124)
    {
        final ByteBuffer bb = ByteBuffer.allocate(2).putShort(commandLength);
        array = new String(bb.array()).toCharArray();
    }
    else
    {
        final ByteBuffer bb = ByteBuffer.allocate(2).putShort(commandLength);
        array = convertToHex(bb.array());
    }
    // renamed from 'command' so the variable above is not re-declared
    final String message = new String(array) + commandHeader + command;
    System.out.println(message);
}
private static char[] convertToHex(byte[] data) {
    final StringBuilder buf = new StringBuilder();
    for (byte b : data) {
        int halfByte = (b >>> 4) & 0x0F;
        int twoHalves = 0;
        do {
            if ((0 <= halfByte) && (halfByte <= 9))
                buf.append((char) ('0' + halfByte));
            else
                buf.append((char) ('a' + (halfByte - 10)));
            halfByte = b & 0x0F;
        } while (twoHalves++ < 1);
    }
    return buf.toString().toCharArray();
}
Furthermore, I have managed to get this working in Python 2 with the following three lines, no less! For a 153-character command this returns the network byte header \x00\x99:
msg_length = len(str_header + str_command)
command_length = pack('>h', msg_length)
command = command_length + str_header + str_command
Also simply replicated by running Python 2 and entering the following commands:
In [1]: import struct
In [2]: struct.pack('>h', 153)
Out[2]: '\x00\x99'
Any assistance, or light that could be shed to resolve this issue would be greatly appreciated.
The basic problem is that you (try to) convert fundamentally binary data to character data. Furthermore, you do it using the platform's default charset, which varies from machine to machine.
I think you have mischaracterized the problem slightly, however. I am confident that it arises when command.length() is at least 124, so that commandLength, which includes the length of commandHeader, too, is at least 128. You would also find that there are some (much) larger command lengths that worked, too.
The key point here is that when any of the bytes in the binary representation of the length have their most-significant bit set, that is meaningful to some character encodings, especially UTF-8, which is a common (but not universal) default. Unless you get very lucky, binary lengths that have any such bytes will not be correctly decoded into characters in UTF-8. Moreover, they may get decoded into characters successfully but differently on machines that use different charsets for the purpose.
You also have another, related inconsistency. You are formatting data for transmission over the network, which is a byte-oriented medium. The transmission will be a sequence of bytes. But you are measuring and reporting the number of characters in the decoded internal representation, not the number of bytes in the encoded representation that will go over the wire. The two counts are the same for your example command, but they would differ for some strings that you could express in Java.
Additionally, your code is inconsistent with your description of the format wanted. You say that the "network byte header" should be four bytes long, but your code emits only two.
You can address all these issues by taking character encoding explicitly into account, and by avoiding the unneeded and inappropriate conversion of raw binary data to character data. The ByteBuffer class you're already using can help with that. For example:
public static void main(String[] args) throws IOException {
    String commandHeader = "HED>";
    // a 128-byte command
    String command = "0123456789ABCDEF"
            + "0123456789ABCDEF"
            + "0123456789ABCDEF"
            + "0123456789ABCDEF"
            + "0123456789ABCDEF"
            + "0123456789ABCDEF"
            + "0123456789ABCDEF"
            + "0123456789ABCDEF";

    // Convert characters to bytes, and do so with a specified charset
    // Note that ALL Java implementations are required to support UTF-8
    byte[] commandHeaderBytes = commandHeader.getBytes("UTF-8");
    byte[] commandBytes = command.getBytes("UTF-8");

    // Measure the command length in bytes, since that's what the receiver
    // will need to know
    int commandLength = commandHeaderBytes.length + commandBytes.length;

    // Build the whole message in your ByteBuffer
    // Allow a 4-byte length field, per spec
    ByteBuffer bb = ByteBuffer.allocate(commandLength + 4);
    bb.putInt(commandLength)
      .put(commandHeaderBytes)
      .put(commandBytes);

    // DO NOT convert to a String or other character type. Output the
    // bytes directly.
    System.out.write(bb.array());
    System.out.println();
}

Java NIO server receives random string [duplicate]

I'm writing a web application in Google App Engine. It allows people to basically edit HTML code that gets stored as an .html file in the blobstore.
I'm using fetchData to return a byte[] of all the characters in the file. I'm trying to print it into an HTML page so that the user can edit the HTML code. Everything works great!
Here's my only problem now:
The byte array is having some issues when converting back to a string. Smart quotes and a couple of characters are coming out looking funky (?'s or Japanese symbols, etc.). Specifically, it's several bytes with negative values that are causing the problem.
The smart quotes are coming back as -108 and -109 in the byte array. Why is this, and how can I decode the negative bytes to show the correct character encoding?
The byte array contains characters in a special encoding (that you should know). The way to convert it to a String is:
String decoded = new String(bytes, "UTF-8"); // example for one encoding type
By the way, the raw bytes may appear as negative decimals just because the Java datatype byte is signed; it covers the range from -128 to 127.
-109 = 0x93: Control Code "Set Transmit State"
The value (-109) is a non-printable control character in UNICODE. So UTF-8 is not the correct encoding for that character stream.
0x93 in "Windows-1252" is the "smart quote" that you're looking for, so the Java name of that encoding is "Cp1252". The next line provides a test code:
System.out.println(new String(new byte[]{-109}, "Cp1252"));
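As a side sketch (my addition, not part of the answer), masking with 0xFF shows the unsigned value behind a negative byte:
byte b = (byte) 0x93;         // the Windows-1252 smart quote byte
System.out.println(b);        // -109 (signed view)
System.out.println(b & 0xFF); // 147, i.e. 0x93 (unsigned view)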
Java 7 and above
You can also pass your desired encoding to the String constructor as a Charset constant from StandardCharsets. This may be safer than passing the encoding as a String, as suggested in the other answers.
For example, for UTF-8 encoding
String bytesAsString = new String(bytes, StandardCharsets.UTF_8);
You can try this.
String s = new String(bytearray);
public class Main {
/**
* Example method for converting a byte to a String.
*/
public void convertByteToString() {
byte b = 65;
//Using the static toString method of the Byte class
System.out.println(Byte.toString(b));
//Using simple concatenation with an empty String
System.out.println(b + "");
//Creating a byte array and passing it to the String constructor
System.out.println(new String(new byte[] {b}));
}
/**
* #param args the command line arguments
*/
public static void main(String[] args) {
new Main().convertByteToString();
}
}
Output
65
65
A
public static String readFile(String fn) throws IOException
{
    File f = new File(fn);
    byte[] buffer = new byte[(int) f.length()];
    // read() alone may return before the buffer is full; readFully() keeps reading until it is
    try (DataInputStream is = new DataInputStream(new FileInputStream(f))) {
        is.readFully(buffer);
    }
    return new String(buffer, "UTF-8"); // use desired encoding
}
I suggest Arrays.toString(byte_array);
It depends on your purpose. For example, I wanted to save a byte array in exactly the format you see at debug time, something like this: [1, 2, 3]. If you want to save exactly the same values without converting the bytes to character format, Arrays.toString(byte_array) does this. But if you want to save characters instead of bytes, you should use String s = new String(byte_array); in this case, s is the character equivalent of [1, 2, 3].
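A tiny sketch of the difference (my own, not from the answer):
// requires java.util.Arrays and java.nio.charset.StandardCharsets
byte[] bytes = {72, 105};
System.out.println(Arrays.toString(bytes));                    // [72, 105]
System.out.println(new String(bytes, StandardCharsets.UTF_8)); // Hi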
The previous answer from Andreas_D is good. I'm just going to add that wherever you are displaying the output there will be a font and a character encoding and it may not support some characters.
To work out whether it is Java or your display that is a problem, do this:
for (int i = 0; i < str.length(); i++) {
    char ch = str.charAt(i);
    System.out.println(i + " : " + ch + " " + Integer.toHexString(ch) + ((ch == '\ufffd') ? " Unknown character" : ""));
}
Java will have mapped any characters it cannot understand to 0xfffd, the official character for unknown characters. If you see a '?' in the output, but it is not mapped to 0xfffd, then your display font or encoding is the problem, not Java.

How can I check whether a byte array contains a Unicode string in Java?

Given a byte array that is either a UTF-8 encoded string or arbitrary binary data, what approaches can be used in Java to determine which it is?
The array may be generated by code similar to:
byte[] utf8 = "Hello World".getBytes("UTF-8");
Alternatively it may have been generated by code similar to:
byte[] messageContent = new byte[256];
for (int i = 0; i < messageContent.length; i++) {
    messageContent[i] = (byte) i;
}
The key point is that we don't know what the array contains but need to find out in order to fill in the following function:
public final String getString(final byte[] dataToProcess) {
    // Determine whether dataToProcess contains arbitrary data or a UTF-8 encoded string
    // If dataToProcess contains arbitrary data then we will BASE64 encode it and return.
    // If dataToProcess contains an encoded string then we will decode it and return.
}
How would this be extended to also cover UTF-16 or other encoding mechanisms?
It's not possible to make that decision with full accuracy in all cases, because a UTF-8 encoded string is one kind of arbitrary binary data, but you can look for byte sequences that are invalid in UTF-8. If you find any, you know that it's not UTF-8.
If your array is large enough, this should work out well, since such sequences are very likely to appear in "random" binary data such as compressed data or image files.
However, it is possible to get valid UTF-8 data that decodes to a totally nonsensical string of characters (probably from all kinds of different scripts). This is more likely with short sequences. If you're worried about that, you might have to do a closer analysis to see whether the characters that are letters all belong to the same code chart. Then again, this may yield false negatives when you have valid text input that mixes scripts.
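One way to implement that check (a sketch of my own, not code from this answer) is a CharsetDecoder configured to report malformed input instead of silently replacing it:
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

static boolean isValidUtf8(byte[] data) {
    try {
        StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
        return true;
    } catch (CharacterCodingException e) {
        return false; // contains byte sequences that are invalid in UTF-8
    }
}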
Here's a way to use the UTF-8 "binary" regex from the W3C site
static boolean looksLikeUTF8(byte[] utf8) throws UnsupportedEncodingException
{
    Pattern p = Pattern.compile("\\A(\n" +
        "  [\\x09\\x0A\\x0D\\x20-\\x7E]            # ASCII\n" +
        "| [\\xC2-\\xDF][\\x80-\\xBF]              # non-overlong 2-byte\n" +
        "| \\xE0[\\xA0-\\xBF][\\x80-\\xBF]         # excluding overlongs\n" +
        "| [\\xE1-\\xEC\\xEE\\xEF][\\x80-\\xBF]{2} # straight 3-byte\n" +
        "| \\xED[\\x80-\\x9F][\\x80-\\xBF]         # excluding surrogates\n" +
        "| \\xF0[\\x90-\\xBF][\\x80-\\xBF]{2}      # planes 1-3\n" +
        "| [\\xF1-\\xF3][\\x80-\\xBF]{3}           # planes 4-15\n" +
        "| \\xF4[\\x80-\\x8F][\\x80-\\xBF]{2}      # plane 16\n" +
        ")*\\z", Pattern.COMMENTS);

    String phonyString = new String(utf8, "ISO-8859-1");
    return p.matcher(phonyString).matches();
}
As originally written, the regex is meant to be used on a byte array, but you can't do that with Java's regexes; the target has to be something that implements the CharSequence interface (so a char[] is out, too). By decoding the byte[] as ISO-8859-1, you create a String in which each char has the same unsigned numeric value as the corresponding byte in the original array.
As others have pointed out, tests like this can only tell you the byte[] could contain UTF-8 text, not that it does. But the regex is so exhaustive, it seems extremely unlikely that raw binary data could slip past it. Even an array of all zeroes wouldn't match, since the regex never matches NUL. If the only possibilities are UTF-8 and binary, I'd be willing to trust this test.
And while you're at it, you could strip the UTF-8 BOM if there is one; otherwise, the UTF-8 CharsetDecoder will pass it through as if it were text.
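Stripping the UTF-8 BOM (the three bytes EF BB BF at the start of the array) could look roughly like this; a hedged sketch of mine, not the answer's code:
static byte[] stripUtf8Bom(byte[] data) {
    if (data.length >= 3
            && (data[0] & 0xFF) == 0xEF
            && (data[1] & 0xFF) == 0xBB
            && (data[2] & 0xFF) == 0xBF) {
        return java.util.Arrays.copyOfRange(data, 3, data.length);
    }
    return data;
}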
UTF-16 would be much more difficult, because there are very few byte sequences that are always invalid. The only ones I can think of offhand are high-surrogate characters that are missing their low-surrogate companions, or vice versa. Beyond that, you would need some context to decide whether a given sequence is valid. You might have a Cyrillic letter followed by a Chinese ideogram followed by a smiley-face dingbat, but it would be perfectly valid UTF-16.
The question assumes that there is a fundamental difference between a string and binary data. While this is intuitively so, it is next to impossible to define precisely what that difference is.
A Java String is a sequence of 16 bit quantities that correspond to one of the (almost) 2**16 Unicode basic codepoints. But if you look at those 16 bit 'characters', each one could equally represent an integer, a pair of bytes, a pixel, and so on. The bit patterns don't have anything intrinsic about that says what they represent.
Now suppose that you rephrased your question as asking for a way to distinguish UTF-8 encoded TEXT from arbitrary binary data. Does this help? In theory no, because the bit patterns that encode any written text can also be a sequence of numbers. (It is hard to say what "arbitrary" really means here. Can you tell me how to test if a number is "arbitrary"?)
The best we can do here is the following:
Test if the bytes are a valid UTF-8 encoding.
Test if the decoded 16-bit quantities are all legal, "assigned" Unicode code points. (Some 16-bit quantities are illegal (e.g. 0xffff) and others are not currently assigned to correspond to any character.) But what if a text document really uses an unassigned code point?
Test if the Unicode code points belong to the "planes" that you expect based on the assumed language of the document. But what if you don't know what language to expect, or if a document uses multiple languages?
Test if the sequences of code points look like words, sentences, or whatever. But what if we had some "binary data" that happened to include embedded text sequences?
In summary, you can tell that a byte sequence is definitely not UTF-8 if the decode fails. Beyond that, if you make assumptions about language, you can say that a byte sequence is probably or probably not a UTF-8 encoded text document.
IMO, the best thing you can do is to avoid getting into a situation where your program needs to make this decision. And if you cannot avoid it, recognize that your program may get it wrong. With thought and hard work, you can make that unlikely, but the probability will never be zero.
In the original question (How can I check whether a byte array contains a Unicode string in Java?), I found that the term "Java Unicode" essentially refers to UTF-16 code units. I went through this problem myself and created some code that could help anyone with this type of question find some answers.
I have created 2 main methods: one will display UTF-8 code units and the other will create UTF-16 code units. UTF-16 code units are what you will encounter with Java and JavaScript, commonly seen in the form "\ud83d".
For more help with code units and conversion, try the website:
https://r12a.github.io/apps/conversion/
Here is code...
byte[] array_bytes = text.toString().getBytes();
char[] array_chars = text.toString().toCharArray();
System.out.println();
byteArrayToUtf8CodeUnits(array_bytes);
System.out.println();
charArrayToUtf16CodeUnits(array_chars);
public static void byteArrayToUtf8CodeUnits(byte[] byte_array)
{
    /*for (int k = 0; k < array.length; k++)
    {
        System.out.println(name + "[" + k + "] = " + "0x" + byteToHex(array[k]));
    }*/
    System.out.println("array.length: = " + byte_array.length);
    //------------------------------------------------------------------------------------------
    for (int k = 0; k < byte_array.length; k++)
    {
        System.out.println("array byte: " + "[" + k + "]" + " converted to hex" + " = " + byteToHex(byte_array[k]));
    }
    //------------------------------------------------------------------------------------------
}

public static void charArrayToUtf16CodeUnits(char[] char_array)
{
    /*Utf16 code units are also known as Java Unicode*/
    System.out.println("array.length: = " + char_array.length);
    //------------------------------------------------------------------------------------------
    for (int i = 0; i < char_array.length; i++)
    {
        System.out.println("array char: " + "[" + i + "]" + " converted to hex" + " = " + charToHex(char_array[i]));
    }
    //------------------------------------------------------------------------------------------
}

public static String byteToHex(byte b)
{
    //Returns hex String representation of byte b
    char[] hexDigit =
    {
        '0', '1', '2', '3', '4', '5', '6', '7',
        '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'
    };
    char[] array = { hexDigit[(b >> 4) & 0x0f], hexDigit[b & 0x0f] };
    return new String(array);
}

public static String charToHex(char c)
{
    //Returns hex String representation of char c
    byte hi = (byte) (c >>> 8);
    byte lo = (byte) (c & 0xff);
    return byteToHex(hi) + byteToHex(lo);
}
If the byte array begins with a Byte Order Mark (BOM) then it will be easy to distinguish what encoding has been used. The standard Java classes for processing text streams will probably deal with this for you automatically.
If you do not have a BOM in your byte data this will be substantially more difficult — .NET classes can perform statistical analysis to try and work out the encoding, but I think this is on the assumption that you know that you are dealing with text data (just don't know which encoding was used).
If you have any control over the format for your input data your best choice would be to ensure that it contains a Byte Order Mark.
Try decoding it. If you do not get any errors, then it is a valid UTF-8 string.
I think Michael has explained it well in his answer; this may be the only way to find out whether a byte array contains all valid UTF-8 sequences. I am using the following code in PHP:
function is_utf8($string) {
return preg_match('%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs', $string);
}
Taken from W3.org
