I'm trying to concatenate several strings containing both Arabic and Western characters (mixed in the same string). The problem is that the result is a String that is most likely semantically correct, but different from what I want to obtain, because the display order of the characters is altered by the Unicode Bidirectional Algorithm. Basically, I just want to concatenate as if they were all LTR, ignoring the fact that some are RTL; a sort of "agnostic" concatenation.
I'm not sure if I was clear in my explanation, but I don't think I can do it any better.
Hope someone can help me.
Kind regards,
Carlos Ferreira
BTW, the strings are being obtained from the database.
EDIT
The first 2 Strings are the strings I want to concatenate and the third is the result.
EDIT 2
Actually, the concatenated String is a little different from the one in the image; it got altered during the copy+paste: the 1 is after the first A, not immediately before the second A.
You can embed bidi regions using unicode format control codepoints:
Left-to-right embedding (U+202A)
Right-to-left embedding (U+202B)
Pop directional formatting (U+202C)
So in Java, to embed an RTL language like Arabic in an LTR language like English, you would do
myEnglishString + "\u202B" + myArabicString + "\u202C" + moreEnglish
and to do the reverse
myArabicString + "\u202A" + myEnglishString + "\u202C" + moreArabic
See Bidirectional General Formatting for more details, or the Unicode specification chapter on "Directional Formatting Codes" for the source material.
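A minimal, self-contained sketch of the first case (the variable names and sample strings are placeholders):

public class EmbedRtl {
    public static void main(String[] args) {
        String english = "Total: ";
        String arabic = "\u0645\u0631\u062D\u0628\u0627"; // an arbitrary Arabic word
        String moreEnglish = " items";
        // RLE (U+202B) opens a right-to-left embedding; PDF (U+202C) closes it.
        String s = english + "\u202B" + arabic + "\u202C" + moreEnglish;
        System.out.println(s);
    }
}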
It's very likely that you need to insert Unicode directional formatting codes into your string to get it to display correctly. For details see Directional Formatting Codes of the Unicode Bidirectional Algorithm specification.
Maybe the java.text.Bidi class can help you determine the correct sequence, as it implements the Unicode Bidirectional Algorithm.
It's not changing the order of the code points. What's happening is that when the string is displayed, the renderer sees that it starts with a right-to-left script, so it displays it right-to-left.
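A quick way to convince yourself of this is to dump the code points of a concatenated string (the sample strings are arbitrary):

String s = "abc" + "\u0645\u0631\u062D\u0628\u0627"; // ASCII followed by Arabic
// The code points come out in exactly the order they were concatenated;
// only the on-screen rendering reorders them.
s.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));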
I have a String named fancy; its value is "𝖑𝖒𝖆𝖔", but I need to make "lmao" out of it.
I've tried calling String#trim, with no success.
Example code:
var fancy = "𝖑𝖒𝖆𝖔";
var normal = ...; // magic to convert 𝖑𝖒𝖆𝖔 to lmao
EDIT: So I figured out that if I take the code point of one of these fancy characters and subtract 120101 from it, I get the original character. However, there are more types of these fancy texts, so this does not seem like a solution to my problem.
You can take advantage of the fact that your "𝖆" character decomposes to a regular "a":
Decomposition: LATIN SMALL LETTER A (U+0061)
Java's java.text.Normalizer class supports the different normalization forms. The NFKD and NFKC forms use the above decomposition rule.
String normal = Normalizer.normalize(fancy, Normalizer.Form.NFKC);
Using compatibility equivalence is what you need here:
Compatibility equivalence is a weaker type of equivalence between characters or sequences of characters which represent the same abstract character (or sequence of abstract characters), but which may have distinct visual appearances or behaviors.
(The reason you do not lose diacritics is because this process simply separates these diacritic marks from their base letters - and then re-combines them if you use the relevant form.)
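A minimal runnable version of this approach:

import java.text.Normalizer;

public class Defancify {
    public static void main(String[] args) {
        String fancy = "𝖑𝖒𝖆𝖔";
        // NFKC applies the compatibility decomposition (U+1D586 -> U+0061, etc.)
        // and then recomposes canonically.
        String normal = Normalizer.normalize(fancy, Normalizer.Form.NFKC);
        System.out.println(normal); // lmao
    }
}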
Those are Unicode characters; https://unicode-table.com also provides reverse lookup to identify them (copy-paste them into the search).
The fancy characters identify as:
𝖑 Mathematical Bold Fraktur Small L (U+1D591)
𝖒 Mathematical Bold Fraktur Small M (U+1D592)
𝖆 Mathematical Bold Fraktur Small A (U+1D586)
𝖔 Mathematical Bold Fraktur Small O (U+1D594)
You also find them as 'old style english alphabet' on this list: https://unicode-table.com/en/sets/fancy-letters. There we notice that they are ordered in the same way as the ASCII letters, so the characters have a fixed offset:
int offset = 0x1D586 - 'a'; // 𝖆 is U+1D586
You can thus transform the characters back by subtracting that offset.
Now comes the tricky part: these Unicode code points cannot be represented by a single char data type, which is only 16 bits wide and thus cannot represent every Unicode character on its own (1 or 2 chars are needed, depending on the code point).
The proper way to deal with this is to work with the code points directly:
String fancy = "𝖑𝖒𝖆𝖔";
int offset = 0x1D586 - 'a'; // 𝖆 is U+1D586
String plain = fancy.codePoints()
        .map(i -> i - offset)
        .mapToObj(c -> String.valueOf((char) c))
        .collect(java.util.stream.Collectors.joining());
System.out.println(plain);
This then prints lmao. Note that the offset trick only covers this particular Fraktur block; for the many other kinds of "fancy" alphabets, the Normalizer approach above is more general.
I am having some trouble with encoding this string into barcode symbology - Code 128.
Text to encode:
1021448642241082212700794828592311
I am using the universal encoder from idautomation.com:
https://www.bcgen.com/fontencoder/
I get the following output for the encoded text for Code 128:
Í*5LvJ8*r5;ÂoP<[7+.Î
However, in ";Âo" the character between the semi-colon and o (let us call it special A) - is not part of the extended character set used in Code128. (See the Latin Supplements at https://www.fonts2u.com/code-128.font)
Yet the same string shows a valid barcode at
https://www.bcgen.com/linear-barcode-creator.html
How?
If I use the output with the Special A on a webpage with a font face for barcodes, the special A character does not show up as the barcode (and that seems correct since the special A is not part of the character set).
What gives? Please help.
I am using the IDAutomation utility to encode the string to 128c symbology. If you can share code to do the encoding (in Java/Python/C/Perl) that would help too.
There are multiple fonts for Code128 that may use different characters to represent the barcode symbols. Make sure the font and the encoding logic match each other.
I used this one: http://www.jtbarton.com/Barcodes/Code128.aspx (there is also sample code on the site showing how to encode it, but you have to translate it from VB). The font works for all three encodings (A, B and C).
Sorry, this is very late.
When you are dealing with the encoding of code 128, in any subset, it's a good idea to think of that coding in terms of numbers, not characters. At this level, when you have shifts, code-changes, checksums and stuff, intermixed with the data, the whole concept of "character" is lost.
However, this is what is happening:
The semicolon in the output corresponds to "27"
The lowercase o corresponds to "79" and the P to "48"
The "A with Macron" corresponds to your "00" sequence. This is why you should be dealing with numbers, not characters, at this level of encoding.
How would you expect it to show a character with a code of 00 ? That would be a space of NULL, neither of which is particularly visible.
Your software has simply rendered it the best way it can, which is to make the character 'visible' by adding 0x80 to it. If you look at charmap, you will see that code 0x80 is indeed A with macron.
The rest (indeed all) of your encoded string looks correct for a set C encodation.
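For anyone who wants to do the encoding themselves rather than through the IDAutomation utility, here is a sketch of set C encodation in Java. The value-to-glyph mapping in toGlyph (value + 32 for values 0 to 94, value + 100 above that) is an assumption that happens to match the font discussed above; other Code 128 fonts may map symbol values differently, so verify it against your font's documentation:

public class Code128C {
    // Encode an even-length digit string as Code 128 set C:
    // start symbol, one symbol per digit pair, modulo-103 check symbol, stop.
    static String encode(String digits) {
        if (digits.length() % 2 != 0 || !digits.matches("\\d+"))
            throw new IllegalArgumentException("set C needs an even number of digits");
        StringBuilder out = new StringBuilder();
        int checksum = 105;                  // start C has symbol value 105
        out.append(toGlyph(105));
        int position = 1;
        for (int i = 0; i < digits.length(); i += 2) {
            int value = Integer.parseInt(digits.substring(i, i + 2));
            checksum += value * position++;  // weighted sum for the check symbol
            out.append(toGlyph(value));
        }
        out.append(toGlyph(checksum % 103)); // check symbol
        out.append(toGlyph(106));            // stop
        return out.toString();
    }

    // Assumed font mapping: values 0-94 print as ASCII 32-126, values 95+ as 195+.
    // Value 00 therefore prints as a space; fonts like IDAutomation's accept
    // character 194 (Â) as a visible stand-in for it.
    static char toGlyph(int value) {
        return (char) (value < 95 ? value + 32 : value + 100);
    }

    public static void main(String[] args) {
        System.out.println(encode("1021448642241082212700794828592311"));
    }
}

Run on the question's digits, this prints the same glyph string as the output above, except that the "00" pair comes out as a space rather than Â.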
I want to output Arabic and English text at the same time in Java, for example the following statement: مرحبا I am Adham.
I searched the internet and found that the BiDi algorithm is needed in this case. Are there any Java classes for BiDi?
I have tried the class BidiReferenceJava and tested it, but when I call runSample() in the class BidiReferenceTest with an Arabic string as the parameter, I get an IndexOutOfBoundsException because the character count is doubled (exactly at this line of code in the class BidiReferenceTestCharmap):
byte[] result = new byte[count];
Where if the string length is 4 the count is 8!
ICU4J is more or less the standard comprehensive Unicode library for Java, and thus supports the bidirectional algorithm. I really wonder why you need this, though; BiDi is usually applied by the display layer, unless you're writing a word processor or something.
BidiReference.java is apparently a demonstration piece; it's designed to show how the algorithm works on ASCII characters instead of using actual Unicode characters.
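If all you need is the run structure rather than the full reference implementation, the JDK's built-in java.text.Bidi class may be enough; a small sketch (the sample string is arbitrary):

import java.text.Bidi;

public class BidiRuns {
    public static void main(String[] args) {
        String text = "I am Adham \u0645\u0631\u062D\u0628\u0627";
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        // Each run is a maximal stretch of text with a single direction;
        // even embedding levels are LTR, odd levels are RTL.
        for (int i = 0; i < bidi.getRunCount(); i++) {
            String dir = bidi.getRunLevel(i) % 2 == 0 ? "LTR" : "RTL";
            System.out.println(dir + ": "
                    + text.substring(bidi.getRunStart(i), bidi.getRunLimit(i)));
        }
    }
}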
I am trying to do the above. One option is to get a set of chars which are special characters and then, with some Java logic, accomplish this. But then I have to make sure I include all special chars.
Is there any better way of doing this?
You need to decide what constitutes a special character. One method that may be of interest is Character.getType(char) which returns an int which will match one of the constant values of Character such as Character.LOWERCASE_LETTER or Character.CURRENCY_SYMBOL. This lets you determine the general category of a character, and then you need to decide which categories count as 'special' characters and which you will accept as part of text.
Note that Java uses UTF-16 to encode its char and String values, and consequently you may need to deal with supplementary characters (see the link in the description of the getType method). This is a nuisance, but the Character class offers methods which help you detect this situation and work around it. See the Character.isSupplementaryCodePoint(int) and Character.codePointAt(char[], int) methods.
Also be aware that Java 6 is far less knowledgeable about Unicode than is Java 7. The newest version of Java has added far more to its Unicode database, but code running on Java 6 will not recognise some (actually quite a few) exotic codepoints as being part of a Unicode block or general category, so you need to bear this in mind when writing your code.
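As a sketch of that advice, here is a category-based filter that also handles supplementary characters by working on code points rather than chars. The class and method names are made up for illustration, and the whitelist of categories is an assumption you would adjust to your own definition of "special":

public class SpecialChars {
    // Keep letters, digits and spaces; treat everything else as special.
    static String stripSpecial(String s) {
        return s.codePoints()
                .filter(cp -> Character.isLetterOrDigit(cp)
                        || Character.getType(cp) == Character.SPACE_SEPARATOR)
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        // The supplementary character 𝖆 survives because it is a letter.
        System.out.println(stripSpecial("héllo, wörld! 𝖆 €42"));
    }
}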
It sounds like you would like to remove all control characters from a Unicode string. You can accomplish this by using a Unicode character category identifier in a regex. The category "Cc" contains those characters, see http://www.fileformat.info/info/unicode/category/Cc/list.htm.
myString = myString.replaceAll("[\\p{Cc}]+", "");
Among the vast number of Unicode characters, there are some that actually represent more than one character, like the ligature ﬀ (U+FB00) for two 'f' characters. Is there any easy way to convert characters like these into multiple single characters? Preferably something available in the standard Java API, but I can refer to an external library if need be.
U+FB00 is a compatibility character. Normally Unicode doesn't support separate codepoints for ligatures (arguing that it's a layout decision if and when a ligature should be used and should not influence how the data is stored). A few of those still exist to allow round-trip conversion compatibility with older encodings that do represent ligatures as separate entities.
Luckily, the information which characters the ligature represents is present in the Unicode data file and most capable string handling systems have that data built-in.
In Java, you'll need to use the Normalizer class and the NFKC form:
String ff ="\uFB00";
String normalized = Normalizer.normalize(ff, Form.NFKC);
System.out.println(ff + " = " + normalized);
This will print
ﬀ = ff
The process you are talking about is called normalization and is specified in Unicode Normalization Forms (UAX #15).
There is a class in the Java SE class library called java.text.Normalizer which implements this process. However, you need to read the Unicode document linked above to figure out which of the "normalization forms" you need to use to get the result you want. It is not straightforward.
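To see why the choice of form matters here, compare a canonical form with a compatibility form (a small sketch; the sample word is arbitrary):

import java.text.Normalizer;

String lig = "di\uFB00erent"; // contains the single ligature character ﬀ
// NFC applies only canonical compositions, so U+FB00 survives:
System.out.println(Normalizer.normalize(lig, Normalizer.Form.NFC));  // diﬀerent
// NFKC also applies compatibility decompositions, expanding ﬀ to "ff":
System.out.println(Normalizer.normalize(lig, Normalizer.Form.NFKC)); // different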
You could try java.text.Normalizer, but I am not really sure whether that works for ligatures.