Among the vast number of Unicode characters, there are some that actually represent more than one character, like the ligature ﬀ (U+FB00) for two 'f' characters. Is there any easy way to convert characters like these into multiple single characters? Preferably something available in the standard Java API, but I can refer to an external library if need be.
U+FB00 is a compatibility character. Normally Unicode doesn't support separate codepoints for ligatures (arguing that it's a layout decision if and when a ligature should be used and should not influence how the data is stored). A few of those still exist to allow round-trip conversion compatibility with older encodings that do represent ligatures as separate entities.
Luckily, the information about which characters a ligature represents is present in the Unicode data files, and most capable string-handling systems have that data built in.
In Java, you'll need to use the Normalizer class and the NFKC form:
import java.text.Normalizer;
import java.text.Normalizer.Form;

String ff = "\uFB00";
String normalized = Normalizer.normalize(ff, Form.NFKC);
System.out.println(ff + " = " + normalized);
This will print
ﬀ = ff
The process you are talking about is called normalization and is specified in Unicode Standard Annex #15, Unicode Normalization Forms.
There is a class in the Java SE class library called java.text.Normalizer which implements this process. However, you need to read the Unicode document linked above to figure out which of the "normalization forms" you need to use to get the result you want. It is not straightforward ....
You could try the java.text.Normalizer, but I am not really sure if that works for ligatures.
Related
I would like to make my own Charset in Java and then use it for the encoding purpose.
I need to add some particular symbols to my Charset as well as all of the numbers and 4 languages (Traditional Chinese, US English, Polish and Russian).
I tried to browse the Charset class but didn't really find a solution.
Basil's answer explains that you don't need to define a custom Charset in order to support some non-standard symbols.
But if you really do need to do it, you will have to write a custom class that extends Charset. There are 3 abstract methods that you have to implement:
boolean contains(Charset cs) - Tells whether or not this charset contains the given charset.
CharsetDecoder newDecoder() - Constructs a new decoder for this charset.
CharsetEncoder newEncoder() - Constructs a new encoder for this charset.
The other methods in the Charset API most likely don't need to be overridden.
The decoder and encoder need to be able to convert between a ByteBuffer containing text in your charset's encoding and Unicode codepoints in a CharBuffer. Both CharsetDecoder and CharsetEncoder are also abstract classes; they only require you to implement a decodeLoop or encodeLoop method (respectively), but those methods have fairly involved contracts.
I am not aware of any specific documentation or tutorials on how to implement a custom Charset and its CharsetDecoder and CharsetEncoder class. But you should be able to find example code in the OpenJDK Java SE codebase. (They will be internal classes ...)
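To give a sense of the shape involved, here is a minimal, hypothetical sketch. The charset name X-MY-CHARSET and its one-byte-per-ASCII-character mapping are invented purely for illustration; a real implementation would need proper mapping tables for your symbols and languages, plus more careful error handling:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

public class MyCharset extends Charset {

    public MyCharset() {
        super("X-MY-CHARSET", null);   // canonical name, no aliases
    }

    @Override
    public boolean contains(Charset cs) {
        return cs instanceof MyCharset;   // we only claim to contain ourselves
    }

    @Override
    public CharsetDecoder newDecoder() {
        return new CharsetDecoder(this, 1.0f, 1.0f) {
            @Override
            protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
                // Toy mapping: bytes 0x00-0x7F map straight to the same code points.
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) {
                        return CoderResult.OVERFLOW;       // caller drains the output and retries
                    }
                    byte b = in.get();
                    if ((b & 0x80) != 0) {
                        in.position(in.position() - 1);    // put the bad byte back
                        return CoderResult.malformedForLength(1);
                    }
                    out.put((char) b);
                }
                return CoderResult.UNDERFLOW;              // consumed all available input
            }
        };
    }

    @Override
    public CharsetEncoder newEncoder() {
        return new CharsetEncoder(this, 1.0f, 1.0f) {
            @Override
            protected CoderResult encodeLoop(CharBuffer in, ByteBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) {
                        return CoderResult.OVERFLOW;
                    }
                    char c = in.get();
                    if (c > 0x7F) {
                        in.position(in.position() - 1);    // unmappable in this toy charset
                        return CoderResult.unmappableForLength(1);
                    }
                    out.put((byte) c);
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }
}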
I tried to browse Charset class but didn't really find a solution.
Well the "solution" is that you will need to study existing examples ... or conclude that you don't need to solve this problem at all. See above.
Private Use Areas within Unicode
You’ve not really explained what goal you are trying to achieve, but likely there is no need to invent either:
a character set (a collection of numbers each assigned to a particular character)
a character encoding (a way to represent instances of those numbers as bits and bytes).
Unicode defines over 144,000 characters, each assigned a number from a range of zero to just over a million. That leaves large gaps of numbers unassigned. Some of those empty sub-ranges are reserved for future use. But, of interest to you, some of those sub-ranges are set aside for “private use”, never ever to be assigned to a character by the Unicode Consortium. See Wikipedia.
👉 You are free to assign any meaning you wish to any number within those “private use areas”. So that works as your character set.
👉 As for your character encoding, using UTF-8 is almost always best. This is true for several reasons, as discussed here.
Java supports all of Unicode, so no extra programming is needed to support your characters. Everything works the same whether you encounter characters from inside or outside the private use areas.
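For example, a minimal sketch of the round trip; the choice of U+E000 (the first code point of the BMP Private Use Area) is arbitrary, and any private-use code point would do:

import java.nio.charset.StandardCharsets;

// Hypothetical illustration: treat U+E000 as one of your own custom symbols
// and round-trip it through UTF-8.
String custom = new String(Character.toChars(0xE000));
byte[] utf8 = custom.getBytes(StandardCharsets.UTF_8);
String back = new String(utf8, StandardCharsets.UTF_8);
System.out.println(back.codePointAt(0) == 0xE000);   // prints true; no special handling needed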
If you want to involve other people in your endeavor, or want to share documents, then you should be aware that there is an unofficial registry of characters assigned to Private Use numbers. This unofficial registry is a volunteer effort, made outside of the Unicode Consortium. This registry is for characters that would never be accepted for inclusion in Unicode. This includes imaginary languages such as Klingon from Star Trek. When selecting code point numbers for your characters, you may want to avoid these unofficially registered code points.
I am creating a set of tests for the size of a String. To do so, I am using something like myString.getBytes("UTF-8").length > MAX_SIZE, for which Java has the checked exception UnsupportedEncodingException.
Just for curiosity, and to further consider other possible test scenarios, is there a text that cannot be represented by UTF-8 character encoding?
BTW: I did my homework, but nowhere (that I can find) does it state that UTF-8/Unicode indeed includes ALL the characters that are possible. I know that the code space is large (just over a million code points) and many of them are still unassigned, but the question remains.
The official FAQ from the Unicode Consortium is pretty clear on the matter, and is a great source of information on all questions related to UTF-8, UTF-16, etc.
In particular, notice the following quote (emphasis mine):
Q: What is a UTF?
A: A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence. The ISO/IEC 10646 standard uses the term “UCS transformation format” for UTF; the two terms are merely synonyms for the same concept.

Each UTF is reversible, thus every UTF supports lossless round tripping: mapping from any Unicode coded character sequence S to a sequence of bytes and back will produce S again. To ensure round tripping, a UTF mapping must map all code points (except surrogate code points) to unique byte sequences. This includes reserved (unassigned) code points and the 66 noncharacters (including U+FFFE and U+FFFF).
So, as you can see, by definition, all UTF encodings (including UTF-8) must be able to handle all Unicode code points (except the surrogate code points of course, but they are not real characters anyways).
Additionally, here is a quote directly from the Unicode Standard that also talks about this:
The Unicode Standard supports three character encoding forms: UTF-32, UTF-16, and UTF-8. Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences.
As you can see, that range covers the entire Unicode code space (excluding the surrogate range, of course), not just the code points that are currently assigned.
is there a text that cannot be represented by UTF-8 character encoding?
Java strings use UTF-16, and standard UTF-8 is designed to handle every Unicode codepoint that UTF-16 can handle (and then some).
However, do be careful, because Java also uses a Modified UTF-8 in some areas, and that does have some differences/limitations from standard UTF-8.
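As a quick illustration of both points, the following sketch round-trips a string containing a supplementary character through UTF-8. Note that the StandardCharsets.UTF_8 overload of getBytes also avoids the checked UnsupportedEncodingException mentioned in the question:

import java.nio.charset.StandardCharsets;

// A quick check (not a proof) that a string, including a character outside
// the BMP stored as a surrogate pair, survives a round trip through UTF-8.
String s = "naïve \uD83D\uDE00 text";                  // \uD83D\uDE00 is U+1F600
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);     // no checked exception with this overload
String back = new String(bytes, StandardCharsets.UTF_8);
System.out.println(s.equals(back));                    // true
System.out.println(bytes.length);                      // the value the size test would compare against MAX_SIZE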
Since Java holds characters internally in UTF-16, what if you need to output in a certain encoding that includes characters that are not in unicode at all?
Java can only handle characters which are present in Unicode, basically. Text outside the BMP (i.e. above U+FFFF) is encoded as surrogate pairs (as each char is a UTF-16 code unit)... but if you want characters which aren't in Unicode at all, you're on your own - you could probably find some area of Unicode which is reserved for private use, and map the characters there... but you may well have "fun" in all kinds of odd ways.
Do you definitely need to handle characters which aren't in Unicode? I thought it covered almost everything these days...
I am trying to do the above. One option is to get a set of chars which are special characters and then accomplish this with some Java logic. But then I have to make sure I include all special chars.
Is there any better way of doing this ?
You need to decide what constitutes a special character. One method that may be of interest is Character.getType(char) which returns an int which will match one of the constant values of Character such as Character.LOWERCASE_LETTER or Character.CURRENCY_SYMBOL. This lets you determine the general category of a character, and then you need to decide which categories count as 'special' characters and which you will accept as part of text.
Note that Java uses UTF-16 to encode its char and String values, and consequently you may need to deal with supplementary characters (see the link in the description of the getType method). This is a nuisance, but the Character class does offer methods which help you detect this situation and work around it. See the Character.isSupplementaryCodePoint(int) and Character.codePointAt(char[], int) methods.
Also be aware that Java 6 is far less knowledgeable about Unicode than Java 7. Java 7 added far more to its Unicode database, so code running on Java 6 will not recognise some (actually quite a few) exotic codepoints as being part of a Unicode block or general category; you need to bear this in mind when writing your code.
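A rough sketch of that approach, iterating by code point so supplementary characters are handled; which categories count as "special" (currency, math and other symbols here) is just an example choice:

// Classify each code point in a string by its Unicode general category.
String text = "Héllo € \uD83D\uDE00!";
for (int i = 0; i < text.length(); ) {
    int cp = text.codePointAt(i);
    int type = Character.getType(cp);
    boolean special = type == Character.CURRENCY_SYMBOL
            || type == Character.MATH_SYMBOL
            || type == Character.OTHER_SYMBOL;          // your own definition of "special"
    System.out.printf("U+%04X %s -> type=%d special=%b%n",
            cp, new String(Character.toChars(cp)), type, special);
    i += Character.charCount(cp);                       // step over surrogate pairs correctly
}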
It sounds like you would like to remove all control characters from a Unicode string. You can accomplish this by using a Unicode character category identifier in a regex. The category "Cc" contains those characters, see http://www.fileformat.info/info/unicode/category/Cc/list.htm.
myString = myString.replaceAll("[\\p{Cc}]+", "");
I couldn't find any documentation about this...
I want to write a bunch of chars to a file and make sure that the file's size in bytes equals the number of chars.
does anyone know what class to use?
I want to write a bunch of chars to a file and make sure that the file's size in bytes equals the number of chars.
Okay - so you need to pick an encoding which only uses a single byte per character, such as ISO-8859-1. Create a FileOutputStream, wrap it in an OutputStreamWriter specifying the encoding, and you're away. However, you need to be aware that you're limiting the range of characters which can be represented in your file.
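A minimal sketch of that approach (the file name out.txt is just a placeholder), assuming every character in the text actually fits in ISO-8859-1:

import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// One byte per char: a 12-character string produces a 12-byte file.
try (Writer writer = new OutputStreamWriter(
        new FileOutputStream("out.txt"), StandardCharsets.ISO_8859_1)) {
    writer.write("Hello, world");
}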
Use a Writer - a Writer outputs chars:
http://docs.oracle.com/javase/1.4.2/docs/api/java/io/FileWriter.html
http://docs.oracle.com/javase/1.4.2/docs/api/java/io/OutputStreamWriter.html
An OutputStream, by contrast, outputs bytes.
You may also try to use another encoding. In that case you should supply a CharsetEncoder, as it has an onUnmappableCharacter method:
http://docs.oracle.com/javase/1.4.2/docs/api/java/nio/charset/CharsetEncoder.html#onUnmappableCharacter%28java.nio.charset.CodingErrorAction%29
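A hedged sketch of that idea: configure the encoder to report unmappable characters instead of silently substituting them (again, out.txt is just a placeholder file name):

import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

// Fail loudly if a character cannot be represented in the chosen encoding.
CharsetEncoder encoder = Charset.forName("ISO-8859-1").newEncoder()
        .onUnmappableCharacter(CodingErrorAction.REPORT)
        .onMalformedInput(CodingErrorAction.REPORT);
try (Writer writer = new OutputStreamWriter(new FileOutputStream("out.txt"), encoder)) {
    writer.write("€");   // U+20AC is not in ISO-8859-1, so this throws an
                         // UnmappableCharacterException (possibly at flush/close time)
}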
First figure out which kinds of chars you are going to be talking about.
In C, a char is eight bits, even if you need two or more chars in sequence to represent one glyph or, in human terms, one typed character. It gets worse: there are also glyphs that represent two "typed" characters, like the conjoined ff and ll ligatures you often see in typesetting.
If you are talking about C chars, then by definition every file contains the same number of chars as chars. If you are talking about any other meaning of the word character, then you need to make some choices.
Single-byte characters are guaranteed for the ASCII range in UTF-8, which is by far the best character set to choose going forward, as it has explicit support in web protocols (thank you W3C!). This means that as long as you verify that every Java char in your string is less than 128 (as an integer value), you will get one byte per char with UTF-8.
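A small sketch of that check:

import java.nio.charset.StandardCharsets;

// If every char is below 128 (ASCII), UTF-8 uses exactly one byte per char.
String s = "plain ASCII text";
boolean allAscii = s.chars().allMatch(c -> c < 128);
int utf8Length = s.getBytes(StandardCharsets.UTF_8).length;
System.out.println(allAscii && utf8Length == s.length());   // true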
ISO-8859-1 is a character set which also uses only one byte per character. The downside to ISO-8859-1 is that it tends not to be the default character set of anything other than Microsoft systems. Even within the Microsoft realm, UTF-8 has been making a lot of headway.
The cost to convert between the two is not overly high, but the extensibility of the two differ dramatically. Basically, if you are using ISO-8859-1 and someone tells you that the next product must support language "X", then in some cases, you must first convert to a different character set and then add the language support. With UTF-8 such a need to convert to another character set prior to adding support is rare. I mean very rare, like so rare that you should consider just using images because the language is likely dead, is likely of historical interest only, and is likely to have been documented as a dialect from a lesser tribe on an island where the primary language has full support.