Resetting fancy font to normal - Java

I have a String named fancy whose value is "𝖑𝖒𝖆𝖔"; however, I need to make "lmao" out of it.
I've tried calling String#trim, but with no success.
Example code:
var fancy = "𝖑𝖒𝖆𝖔";
var normal = ...; // magic to convert 𝖑𝖒𝖆𝖔 to lmao
EDIT: I figured out that if I take the Unicode code point of such a fancy character and subtract 120101 from it, I get the original character. However, there are more types of these fancy alphabets, so a fixed offset does not seem like a general solution to my problem.

You can take advantage of the fact that your "𝖆" character decomposes to a regular "a":
Decomposition: LATIN SMALL LETTER A (U+0061)
Java's java.text.Normalizer class supports the different normalization forms. The NFKD and NFKC forms use the above decomposition rule.
String normal = Normalizer.normalize(fancy, Normalizer.Form.NFKC);
Using compatibility equivalence is what you need here:
Compatibility equivalence is a weaker type of equivalence between characters or sequences of characters which represent the same abstract character (or sequence of abstract characters), but which may have distinct visual appearances or behaviors.
(The reason you do not lose diacritics is because this process simply separates these diacritic marks from their base letters - and then re-combines them if you use the relevant form.)
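Putting the above together, a minimal runnable sketch (using the string from the question):

```java
import java.text.Normalizer;

public class Defancify {
    public static void main(String[] args) {
        String fancy = "𝖑𝖒𝖆𝖔"; // Mathematical Bold Fraktur letters
        // NFKC applies the compatibility decomposition, which maps each
        // styled letter to its plain equivalent, then recomposes.
        String normal = Normalizer.normalize(fancy, Normalizer.Form.NFKC);
        System.out.println(normal); // lmao
    }
}
```

Because this relies on the compatibility mappings in the Unicode database, it works for all the styled mathematical alphabets, not just this one.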

Those are Unicode characters: https://unicode-table.com also provides a reverse lookup to identify them (copy and paste them into the search).
The fancy characters identify as:
𝖑 Mathematical Bold Fraktur Small L (U+1D591)
𝖒 Mathematical Bold Fraktur Small M (U+1D592)
𝖆 Mathematical Bold Fraktur Small A (U+1D586)
𝖔 Mathematical Bold Fraktur Small O (U+1D594)
You can also find them as 'old style English alphabet' on this list: https://unicode-table.com/en/sets/fancy-letters. There we notice that they are ordered in the same way as the alphabetic characters, so the characters have a fixed offset:
int offset = 0x1D586 - 'a'; // 𝖆 is U+1D586
You can thus transform the characters back by subtracting that offset.
Now comes the tricky part: these Unicode code points cannot be represented by a single char data type, which is only 16 bits wide and thus cannot represent every Unicode character on its own (a supplementary code point such as U+1D586 takes two chars, a surrogate pair).
The proper way to deal with this is to work with the code points directly:
String fancy = "𝖑𝖒𝖆𝖔";
int offset = 0x1D586 - 'a'; // 𝖆 is U+1D586
String plain = fancy.codePoints()
        .map(cp -> cp - offset)
        .mapToObj(cp -> String.valueOf((char) cp))
        .collect(java.util.stream.Collectors.joining());
System.out.println(plain);
This then prints lmao.

Related

Java string text normalizer duplicates Korean characters

I have a project involving string character changes. When I use the following code with Korean characters, the result string has duplicate characters. How can I fix it?
@Test
public void testKoreanCharacters() {
    String test = "카디코이";
    String replacedStr = Normalizer.normalize(test, Normalizer.Form.NFD)
            .replaceAll("\\p{Mn}", "");
    Assert.assertEquals(test.length(), replacedStr.length());
}
Output:
java.lang.AssertionError:
Expected :4
Actual :8
Korean characters represent syllables, not single phonetic sounds; each syllable block therefore decomposes into two or three jamo letters. See the first yellow block in the Introduction section of the Unicode Normalization Forms annex (UAX #15):
The Unicode Standard defines two equivalences between characters: canonical equivalence and compatibility equivalence. Canonical equivalence is a basic equivalency between characters or sequences of characters. The following figure illustrates this equivalence:
So it is correct behavior to make two characters out of one.
However, you have chosen the NFD form, which already says 'canonical decomposition'.
I think you don't have to remove \\p{Mn}, because you never get the canonical composition at all.
NFC
Canonical decomposition, followed by canonical composition.
NFD
Canonical decomposition.
NFKC
Compatibility decomposition, followed by canonical composition.
NFKD
Compatibility decomposition.
Your test assumption is incorrect: the input and output sequences need not be the same length.
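A minimal sketch showing the decomposition and recomposition on the question's string:

```java
import java.text.Normalizer;

public class HangulNormalize {
    public static void main(String[] args) {
        String test = "카디코이";
        // NFD splits each Hangul syllable block into its jamo letters
        String nfd = Normalizer.normalize(test, Normalizer.Form.NFD);
        System.out.println(nfd.length()); // 8
        // NFC recomposes the jamo back into syllable blocks
        String nfc = Normalizer.normalize(nfd, Normalizer.Form.NFC);
        System.out.println(nfc.length()); // 4
        System.out.println(nfc.equals(test)); // true
    }
}
```

So if you want the original length back after decomposition, normalize again with NFC instead of stripping \\p{Mn}.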

How can I tell if a Unicode code point is one complete printable glyph(or grapheme cluster)?

Let's say there's a Unicode String object, and I want to print each Unicode character in that String one by one.
In my simple test with a very limited set of languages, I could successfully achieve this by assuming one code point is always the same as one glyph.
But I know this is not the case, and the code logic above may easily cause unexpected results in some countries or languages.
So my question is, is there any way to tell if one Unicode code point is one complete printable glyph in Java or C#?
If I have to write code in C/C++, that's fine too.
I googled for hours but all I got is about code units and code points. It's very easy to tell if a code unit is part of a surrogate pair, but I found nothing about graphemes.
Could anyone point me in the right direction, please?
You're definitely right that a single glyph is often composed of more than one code point. For example, the letter é (e with acute accent) may be equivalently written \u00E9 or with a combining accent as \u0065\u0301. Unicode normalization cannot always merge things like this into one code point, especially if there are multiple combining characters. So you'll need to use some Unicode segmentation rules to identify the boundaries you want.
What you are calling a "printable glyph" is called a user-perceived character or (extended) grapheme cluster. In Java, the way to iterate over these is with BreakIterator.getCharacterInstance(Locale):
BreakIterator boundary = BreakIterator.getCharacterInstance(Locale.getDefault()); // or a specific Locale
boundary.setText(yourString);
for (int start = boundary.first(), end = boundary.next();
     end != BreakIterator.DONE;
     start = end, end = boundary.next()) {
    String chunk = yourString.substring(start, end);
}
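A runnable sketch of that loop on a string that mixes a combining-accent form and a precomposed form (the sample string is an illustrative choice):

```java
import java.text.BreakIterator;
import java.util.Locale;

public class Graphemes {
    public static void main(String[] args) {
        String s = "e\u0301x\u00E9"; // "éxé": combining form, x, precomposed form
        BreakIterator boundary = BreakIterator.getCharacterInstance(Locale.getDefault());
        boundary.setText(s);
        int count = 0;
        for (int start = boundary.first(), end = boundary.next();
             end != BreakIterator.DONE;
             start = end, end = boundary.next()) {
            System.out.println(s.substring(start, end)); // one user-perceived character per line
            count++;
        }
        System.out.println(count); // 3, even though s.length() == 4
    }
}
```

The combining accent stays attached to its base letter, so the string iterates as three grapheme clusters despite containing four char code units.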

Java regex to distinguish special characters while allowing non english chars

I am trying to do the above. One option is to get a set of chars which are special characters and then accomplish this with some Java logic, but then I have to make sure I include all special chars.
Is there any better way of doing this ?
You need to decide what constitutes a special character. One method that may be of interest is Character.getType(char) which returns an int which will match one of the constant values of Character such as Character.LOWERCASE_LETTER or Character.CURRENCY_SYMBOL. This lets you determine the general category of a character, and then you need to decide which categories count as 'special' characters and which you will accept as part of text.
Note that Java uses UTF-16 to encode its char and String values, and consequently you may need to deal with supplementary characters (see the link in the description of the getType method). This is a nuisance, but the Character class does offer methods which help you detect this situation and work around it. See the Character.isSupplementaryCodePoint(int) and Character.codePointAt(char[], int) methods.
Also be aware that Java 6 is far less knowledgeable about Unicode than is Java 7. The newest version of Java has added far more to its Unicode database, but code running on Java 6 will not recognise some (actually quite a few) exotic codepoints as being part of a Unicode block or general category, so you need to bear this in mind when writing your code.
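As a sketch of the category-based approach (the accepted category set here is an assumption; adjust it to your own definition of "special"):

```java
import java.util.Set;

public class SpecialChars {
    // Accept letters and decimal digits; treat everything else as "special".
    private static final Set<Integer> ACCEPTED = Set.of(
            (int) Character.UPPERCASE_LETTER,
            (int) Character.LOWERCASE_LETTER,
            (int) Character.OTHER_LETTER,          // covers e.g. CJK ideographs
            (int) Character.DECIMAL_DIGIT_NUMBER);

    static boolean hasSpecial(String s) {
        // codePoints() handles supplementary characters correctly,
        // avoiding the surrogate-pair pitfalls mentioned above
        return s.codePoints()
                .anyMatch(cp -> !ACCEPTED.contains(Character.getType(cp)));
    }

    public static void main(String[] args) {
        System.out.println(hasSpecial("héllo42")); // false: letters and digits only
        System.out.println(hasSpecial("a_b"));     // true: '_' is CONNECTOR_PUNCTUATION
    }
}
```

Working on code points rather than chars means the check stays correct for characters outside the Basic Multilingual Plane.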
It sounds like you would like to remove all control characters from a Unicode string. You can accomplish this by using a Unicode character category identifier in a regex. The category "Cc" contains those characters, see http://www.fileformat.info/info/unicode/category/Cc/list.htm.
myString = myString.replaceAll("[\\p{Cc}]+", "");

Checking for specific strings with regex

I have a list of arbitrary length of type String. I need to ensure each String element in the list is alphanumeric or numeric with no spaces, possibly containing special characters such as - \ / _ etc.
Example of accepted strings include:
J0hn-132ss/sda
Hdka349040r38yd
Hd(ersd)3r4y743-2\d3
123456789
Examples of unacceptable strings include:
Hello
Joe
King
etc basically no words.
I’m currently using stringInstance.matches("regex") but I’m not too sure how to write the appropriate expression.
return str.matches("^[a-zA-Z0-9_/-\\|]*$");
This method will always return true for words that don't conform to the format I mentioned.
A description of the regex I’m looking for in English would be something like:
Any String, where the String contains characters from (a-zA-Z AND 0-9 AND special characters)
OR (0-9 AND Special characters)
OR (0-9)
Edit: I have come up with the following expression, which works, but I feel that it may be bad in terms of being unclear or too complex.
The expression:
(([\\pL\\pN\\pP]+[\\pN]+|[\\pN]+[\\pL\\pN\\pP]+)|([\\pN]+[\\pP]*)|([\\pN]+))+
I've used this website to help me: http://xenon.stanford.edu/~xusch/regexp/analyzer.html
Note that I’m still new to regex
WARNING: β€œNever” Write A-Z
All instances of ranges like A-Z or 0-9 that occur outside an RFC definition are virtually always ipso facto wrong in Unicode. In particular, things like [A-Za-z] are horrible antipatterns: they’re sure giveaways that the programmer has a caveman mentality about text that is almost wholly inappropriate this side of the Millennium. The Unicode patterns work on ASCII, but the ASCII patterns break on Unicode, sometimes in ways that leave you open to security violations. Always write the Unicode version of the pattern no matter whether you are using 1970s data or modern Unicode, because that way you won’t screw up when you actually use real Java character data. It’s like the way you use your turn signal even when you “know” there is no one behind you, because if you’re wrong, you do no harm, whereas the other way, you very most certainly do. Get used to using the 7 Unicode categories:
\pL for Letters. Notice how \pL is a lot shorter to type than [A-Za-z].
\pN for Numbers.
\pM for Marks that combine with other code points.
\pS for Symbols, Signs, and Sigils. :)
\pP for Punctuation.
\pZ for Separators like spaces (but not control characters)
\pC for other invisible formatting and Control characters, including unassigned code points.
Solution
If you just want a pattern, you want
^[\pL\pN]+$
although in Java 7 you can do this:
(?U)^\w+$
assuming you don’t mind underscores and letters with arbitrary combining marks. Otherwise you have to write the very awkward:
(?U)^[[:alpha:]\pN]+$
The (?U) is new to Java 7. It corresponds to the Pattern class’s UNICODE_CHARACTER_CLASSES compilation flag. It switches the POSIX character classes like [:alpha:] and the simple shortcuts like \w to actually work with the full Java character set. Normally, they work only on the 1970sish ASCII set, which can be a security hole.
There is no way to make Java 7 always do this with its patterns without being told to, but you can write a frontend function that does this for you. You just have to remember to call yours instead.
Note that patterns in Java before v1.7 cannot be made to work according to the way UTS#18 on Unicode Regular Expressions says they must. Because of this, you leave yourself open to a wide range of bugs, infelicities, and paradoxes if you do not use the new Unicode flag. For example, the trivial and common pattern \b\w+\b will not be found to match anywhere at all within the string "élève", let alone in its entirety.
Therefore, if you are using patterns in pre-1.7 Java, you need to be extremely careful, far more careful than anyone ever is. You cannot use any of the POSIX charclasses or charclass shortcuts, including \w, \s, and \b, all of which break on anything but stone-age ASCII data. They cannot be used on Java’s native character set.
In Java 7, they can β€” but only with the right flag.
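To make the difference concrete, here is a minimal sketch of the same pattern with and without the flag (Java 7+):

```java
public class UnicodeFlagDemo {
    public static void main(String[] args) {
        String word = "élève";
        // Default: \w means only [A-Za-z0-9_], so the accented letters fail
        System.out.println(word.matches("\\w+"));     // false
        // (?U) switches \w to the full Unicode definition of a word character
        System.out.println(word.matches("(?U)\\w+")); // true
    }
}
```

The inline (?U) is equivalent to compiling the pattern with Pattern.UNICODE_CHARACTER_CLASS.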
It is possible to rephrase the description of the needed regex as "contains at least one number", so the following would work: /.*[\pN].*/. Or, if you would like to limit your search to letters, numbers and punctuation, you should use /[\pL\pN\pP]*[\pN][\pL\pN\pP]*/. I've tested it on your examples and it works fine.
You can further refine your regexp by using lazy quantifiers like this: /.*?[\pN].*?/. This way it will fail faster if there are no numbers.
I would like to recommend a great book on regular expressions: Mastering Regular Expressions. It has a great introduction, an in-depth explanation of how regular expressions work, and a chapter on regular expressions in Java.
It looks like you just want to make sure that there are no spaces in the string. If so, you can do this very simply:
return str.indexOf(" ") == -1;
This will return true if there are no spaces (valid by my understanding of your rules), and false if there is a space anywhere in the string (invalid).
Here is a partial answer, which does 0-9 and special characters OR 0-9.
^([\d]+|[\\/\-_]*)*$
This can be read as ((1 or more digits) OR (0 or more of the special chars \ / - _)), repeated 0 or more times. Note that it does not strictly require a digit: a string consisting only of special characters (or the empty string) also matches the second alternative, so an additional check for at least one digit may still be needed.
I used regex tester to test several of the strings.
Adding alphabetic characters seems easy, but a repetition of the given regexp may be required.

String class internals - caching character offset to byte relationship if using UTF-8

When writing from scratch a custom string class that stores UTF-8 internally (to save memory) rather than UTF-16, is it feasible to cache, to some extent, the relationship between byte offset and character offset, to increase performance when applications use the class with random access?
Does Perl do this kind of caching of character offset to byte offset relationship? How do Python strings work internally?
What about Objective-C and Java? Do they use UTF-8 internally?
EDIT
Found this reference to Perl 5 using UTF-8 internally:
"$flag = utf8::is_utf8(STRING)
(Since Perl 5.8.1) Test whether STRING is in UTF-8 internally. Functionally the same as Encode::is_utf8()."
On page
http://perldoc.perl.org/utf8.html
EDIT
In the applications I have in mind, the strings hold 1-2K XML stanzas from an XMPP stream. I expect about 1% of the messages to have up to 50% (by character count) of Unicode values > 127 (this is XML). In servers, the messages are rule-checked and routed conditionally on a small (character-volume-wise) subset of fields. The servers are Wintel boxes operating in a farm. In clients, the data comes from and is fed into UI toolkits.
EDIT
But the app will inevitably evolve and want to do some random access too. Can the performance hit when this happens be minimised? I was also interested in whether a more general class design exists that e.g. manages B-trees of character offset <-> byte offset relationships for big UTF-8 strings (or some other algorithm found to be efficient in the general case).
Perl distinguishes between Unicode and non-Unicode strings. Unicode strings are implemented using UTF-8 internally. Non-Unicode does not necessarily mean 7-bit ASCII, though, it could be any character that can be represented in the current locale as a single byte.
I think the answer is: in general, it's not really worth trying to do this. In your specific case, maybe.
If most of your characters are plain ASCII, and you rarely have multi-byte UTF-8 sequences, then it might be worth building some kind of sparse data structure with the offsets.
In the general case, every single character might be non-ASCII and you might have many many offsets to store. Really, the most general case would be to make a string of bytes that is exactly as long as your string of Unicode characters, and have each byte value be the offset of the next character. But this means one whole byte per character, and thus a net savings of only one byte per Unicode character; probably not worth the effort. And that implies that indexing into your string is now an O(n) operation, as you run through these offsets and sum them to find the actual index.
If you do want to try the sparse data structure, I suggest an array of pairs of values, the first value being the index within the Unicode string of a character, and the second one being the index within the byte sequence where this character actually appears. Then after each UTF-8 multi-byte sequence, you would add the two values to find the next character in the string. Finally, when given an index to a Unicode character, your code could do a binary search of this array, to find the highest index within the sparse array that is lower than the requested index, and then use that to find the actual byte that represents the start of the desired character.
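A hypothetical sketch of that sparse structure in Java (the class name, checkpoint stride, and the at-most-four-byte UTF-8 assumption are all illustrative choices, not an established API):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class Utf8Index {
    private final byte[] bytes;
    private final int[][] checkpoints; // sparse {codePointIndex, byteOffset} pairs

    public Utf8Index(String s, int stride) {
        this.bytes = s.getBytes(StandardCharsets.UTF_8);
        List<int[]> cps = new ArrayList<>();
        int off = 0, cp = 0;
        while (off < bytes.length) {
            if (cp % stride == 0) cps.add(new int[] { cp, off }); // record a checkpoint
            off += seqLength(bytes[off]);
            cp++;
        }
        checkpoints = cps.toArray(new int[0][]);
    }

    // Length in bytes of a UTF-8 sequence, read from its leading byte.
    private static int seqLength(byte lead) {
        int b = lead & 0xFF;
        return b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
    }

    /** Byte offset of the code point at index cp: binary-search the nearest
     *  checkpoint at or before cp, then walk forward from there. */
    public int byteOffsetOf(int cp) {
        int lo = 0, hi = checkpoints.length - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (checkpoints[mid][0] <= cp) lo = mid; else hi = mid - 1;
        }
        int idx = checkpoints[lo][0], off = checkpoints[lo][1];
        while (idx < cp) {
            off += seqLength(bytes[off]);
            idx++;
        }
        return off;
    }
}
```

Lookup then costs O(log n) for the search plus at most `stride` forward steps, so the stride parameter trades index memory against random-access speed.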
If you need to save memory, you might want to consider using a data compression library. Slurp in the Unicode strings as full Unicode, then compress them; then to index into a string, first you uncompress that string. This will really save memory, and it will be easy and fast to get the code correct to make it work; but it may add too much CPU overhead to be reasonable.
Java's strings are UTF-16 internally:
A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.
java.lang.String
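The consequence is easy to demonstrate (a minimal sketch; U+1D586 is the 𝖆 from the first question above):

```java
public class SupplementaryDemo {
    public static void main(String[] args) {
        String s = "\uD835\uDD86"; // 𝖆 (U+1D586) stored as a surrogate pair
        System.out.println(s.length());                      // 2 char code units
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
    }
}
```

So with Java's native String, length() and index-based access already count UTF-16 code units rather than characters, which is the same unit-vs-character mismatch the question asks about for UTF-8.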
