I have a project involving string character conversion. When I use the following code with Korean characters, the result string has duplicate characters. How can I fix it?
@Test
public void testKoreanCharacters() {
    String test = "가나다라"; // a representative four-syllable Hangul string
    String replacedStr = Normalizer.normalize(test, Normalizer.Form.NFD).replaceAll("\\p{Mn}", "");
    Assert.assertEquals(test.length(), replacedStr.length());
}
Output:
java.lang.AssertionError:
Expected :4
Actual :8
Korean Hangul characters represent syllables, not single phonetic sounds, so most characters decompose into two or three jamo (letter) characters. See the first yellow block in section 1, Introduction, of the Unicode Normalization Annex (UAX #15):
The Unicode Standard defines two equivalences between characters: canonical equivalence and compatibility equivalence. Canonical equivalence is a basic equivalency between characters or sequences of characters. (A figure in the standard illustrates this equivalence.)
So it is correct behavior to make two characters out of one.
However, you have chosen the NFD form, whose name already says 'canonical decomposition', with no recomposition step. I don't think removing \\p{Mn} changes anything here: the jamo produced by the Hangul decomposition are letters (category Lo), not combining marks, so \\p{Mn} does not match them, and you never get the canonical composition back anyway. For reference, the four forms are:
NFC: Canonical decomposition, followed by canonical composition.
NFD: Canonical decomposition.
NFKC: Compatibility decomposition, followed by canonical composition.
NFKD: Compatibility decomposition.
Your test assumption is incorrect: the input and output sequences need not be the same length.
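If you want the original syllables back, normalize again with NFC. A minimal sketch (the Hangul sample string is mine, chosen so each syllable decomposes into exactly two jamo):

import java.text.Normalizer;

String test = "가나다라";
String nfd = Normalizer.normalize(test, Normalizer.Form.NFD);
String nfc = Normalizer.normalize(nfd, Normalizer.Form.NFC);
System.out.println(test.length()); // 4
System.out.println(nfd.length());  // 8: each syllable split into two jamo
System.out.println(nfc.length());  // 4: NFC recomposes the jamo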
I have a String named fancy; the String fancy is "𝖑𝖒𝖆𝖔", but I need to make "lmao" out of it.
I've tried calling String#trim, but with no success.
Example code:
var fancy = "𝖑𝖒𝖆𝖔";
var normal = ...; // magic to convert 𝖑𝖒𝖆𝖔 to lmao
EDIT: I figured out that if I take the code point of one of these fancy characters and subtract 120101, I get the original character. However, there are more types of these fancy texts, so this does not seem like a general solution to my problem.
You can take advantage of the fact that your "𝖆" character decomposes to a regular "a":
Decomposition: LATIN SMALL LETTER A (U+0061)
Java's java.text.Normalizer class implements the different normalization forms. The NFKD and NFKC forms use the above decomposition rule.
String normal = Normalizer.normalize(fancy, Normalizer.Form.NFKC);
Using compatibility equivalence is what you need here:
Compatibility equivalence is a weaker type of equivalence between characters or sequences of characters which represent the same abstract character (or sequence of abstract characters), but which may have distinct visual appearances or behaviors.
(The reason you do not lose diacritics is because this process simply separates these diacritic marks from their base letters - and then re-combines them if you use the relevant form.)
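Putting it together, a minimal sketch (the class name is mine):

import java.text.Normalizer;

public class Defancify {
    public static void main(String[] args) {
        String fancy = "𝖑𝖒𝖆𝖔";
        // NFKC applies the compatibility decomposition, then recomposes;
        // the Fraktur letters fold to their plain ASCII equivalents
        System.out.println(Normalizer.normalize(fancy, Normalizer.Form.NFKC)); // lmao
    }
}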
Those are Unicode characters: https://unicode-table.com also provides reverse lookup to identify them (copy-paste them into the search).
The fancy characters identify as:
𝖑 Mathematical Bold Fraktur Small L (U+1D591)
𝖒 Mathematical Bold Fraktur Small M (U+1D592)
𝖆 Mathematical Bold Fraktur Small A (U+1D586)
𝖔 Mathematical Bold Fraktur Small O (U+1D594)
You can also find them listed as 'old style english alphabet' at https://unicode-table.com/en/sets/fancy-letters. There we notice that they are ordered in the same way as the ASCII alphabetic characters, so the characters have a fixed offset:
int offset = 0x1D586 - 'a'; // 𝖆 is U+1D586
You can thus transform the characters back by subtracting that offset.
Now comes the tricky part: these Unicode code points cannot be represented by a single char, which is only 16 bits wide and thus cannot represent every Unicode character on its own (in UTF-16, one or two chars are needed, depending on the code point).
The proper way to deal with this is to work with the code points directly:
String fancy = "𝖑𝖒𝖆𝖔";
int offset = 0x1D586 - 'a'; // 𝖆 is U+1D586
String plain = fancy.codePoints()
        .map(i -> i - offset)
        .mapToObj(c -> (char) c) // the cast is safe here: the results are ASCII
        .map(String::valueOf)
        .collect(java.util.stream.Collectors.joining());
System.out.println(plain);
This then prints lmao.
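Note that the (char) cast above only works because the results land in ASCII. A more general variant (my own sketch) assembles the result from code points, so mappings that produce non-BMP characters would survive too:

String plain = fancy.codePoints()
        .map(cp -> cp - offset)
        .collect(StringBuilder::new,
                 StringBuilder::appendCodePoint,
                 StringBuilder::append)
        .toString();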
I have a huge file, and that file contains a lot of illegal characters like the ones in the image below, but these are not all of them. They come in many different kinds, so it's not possible to search for them all and replace them.
Is there a way I can remove these characters? I've tried a lot of solutions, like converting to ANSI or various regex expressions, but they didn't work. Please help.
EDIT: Even if anyone can tell me how to remove these characters in Java, that will be fine too.
Instead of removing specific characters, it's easier to implement a whitelist filter if you know which types of characters you are expecting.
As per this answer, which explains how to remove emoticons, you can try:
String characterFilter = "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]";
String emotionless = aString.replaceAll(characterFilter, "");
To understand which \\p{...} groups are available, look at the Classes for Unicode scripts, blocks, categories and binary properties section of the Pattern docs:
\p{IsLatin} A Latin script character (script)
\p{InGreek} A character in the Greek block (block)
\p{Lu} An uppercase letter (category)
\p{IsAlphabetic} An alphabetic character (binary property)
\p{Sc} A currency symbol
\P{InGreek} Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]] Any letter except an uppercase letter (subtraction)
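For example (the sample input is made up), symbols fall outside the whitelist and get dropped, while letters, punctuation and whitespace survive:

String characterFilter = "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]";
String input = "olá, mundo ☺ → ok";
// ☺ (So, Other Symbol) and → (Sm, Math Symbol) are not whitelisted
System.out.println(input.replaceAll(characterFilter, "")); // prints "olá, mundo   ok"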
I found an interesting regex in a Java project: "[\\p{C}&&\\S]"
I understand that the && means "set intersection", and \S is "non-whitespace", but what is \p{C}, and is it okay to use?
The java.util.regex.Pattern documentation doesn't mention it. The only similar class on the list is \p{Cntrl}, but they behave differently: they both match on control characters, but \p{C} matches twice on Unicode characters above U+FFFF, such as PILE OF POO:
public class StrangePattern {
public static void main(String[] argv) {
// As far as I can tell, this is the simplest way to create a String
// with code points above U+FFFF.
String poo = new String(Character.toChars(0x1F4A9));
System.out.println(poo); // prints `💩`
System.out.println(poo.replaceAll("\\p{C}", "?")); // prints `??`
System.out.println(poo.replaceAll("\\p{Cntrl}", "?")); // prints `💩`
}
}
The only mention I've found anywhere is here:
\p{C} or \p{Other}: invisible control characters and unused code points.
However, \p{Other} does not seem to exist in Java, and the matching code points are not unused.
My Java version info:
$ java -version
java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode)
Bonus question: what is the likely intent of the original pattern, "[\\p{C}&&\\S]"? It occurs in a method which validates a string before it is sent in an email: if that pattern is matched, an exception with the message "Invalid string" is raised.
Buried down in the Pattern docs under Unicode Support, we find the following:
This class is in conformance with Level 1 of Unicode Technical Standard #18: Unicode Regular Expression, plus RL2.1 Canonical Equivalents.
...
Categories may be specified with the optional prefix Is: Both \p{L}
and \p{IsL} denote the category of Unicode letters. Same as scripts
and blocks, categories can also be specified by using the keyword
general_category (or its short form gc) as in general_category=Lu or
gc=Lu.
The supported categories are those of The Unicode Standard in the
version specified by the Character class. The category names are those
defined in the Standard, both normative and informative.
From Unicode Technical Standard #18, we find that C is defined to match any Other General_Category value, and that support for this is part of the requirements for Level 1 conformance. Java implements \p{C} because it claims conformance to Level 1 of UTS #18.
It probably should support \p{Other}, but apparently it doesn't.
Worse, it's violating RL1.7, required for Level 1 conformance, which requires that matching happen by code point instead of code unit:
To meet this requirement, an implementation shall handle the full range of Unicode code points, including values from U+FFFF to U+10FFFF. In particular, where UTF-16 is used, a sequence consisting of a leading surrogate followed by a trailing surrogate shall be handled as a single code point in matching.
There should be no matches for \p{C} in your test string, because your test string should be matched as a single emoji code point with General_Category=So (Other Symbol) instead of as two surrogates.
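You can verify the intended category by decoding the surrogate pair yourself:

// The pair decodes to a single code point whose category is So (Other Symbol)
int cp = "💩".codePointAt(0);
System.out.println(Integer.toHexString(cp));                         // 1f4a9
System.out.println(Character.getType(cp) == Character.OTHER_SYMBOL); // true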
According to https://regex101.com/, \p{C} matches
Invisible control characters and unused code points
(The backslash has to be escaped because this is a Java string literal, so the string "\\p{C}" is the regex \p{C}.)
I'm guessing this is a 'hacked string check', since \p{C} should probably never appear inside a valid (character-filled) string, but the author should have left a comment: what they checked and what they wanted to check are usually two different things.
Anything other than a valid two-letter Unicode category code, or a single letter that begins a Unicode category code, is illegal, since Java supports only single-letter and two-letter abbreviations for Unicode categories. That's why \p{Other} doesn't work here.
\p{C} matches twice on Unicode characters above U+FFFF, such as PILE
OF POO.
Right. Java internally uses the UTF-16 encoding for Unicode characters, and 💩 is encoded as two 16-bit code units (0xD83D 0xDCA9) that form a surrogate pair (a high and a low surrogate). Since \p{C} matches each half separately:
\p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16
encoding.
you see two matches in the result set.
What is the likely intent of the original pattern, [\\p{C}&&\\S]?
I don't see a very strong reason, but it seems the developer was worried about characters in category Other (for example, to keep spammy emoji out of an email subject) and simply tried to block them.
As for the bonus question: the expression [\\p{C}&&\\S] finds control characters, excluding whitespace characters like tabs or line feeds, in Java. These characters have no value in regular mails, so it is a good idea to filter them out (or, as in this case, declare the email content as faulty). Be aware that the double backslashes (\\) are only necessary to escape the expression for Java processing; the actual regular expression is [\p{C}&&\S].
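The surrounding method probably looked something like this (my reconstruction, not the original code):

import java.util.regex.Pattern;

// Reject strings containing non-whitespace "Other" characters
// (controls, format characters, surrogates, unassigned code points)
static final Pattern INVALID = Pattern.compile("[\\p{C}&&\\S]");

static void validate(String s) {
    if (INVALID.matcher(s).find()) {
        throw new IllegalArgumentException("Invalid string");
    }
}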
So I'm currently using the Apache commons-lang library.
When I tried unescaping this string: &#128512;
it returns the same string:
String characters = "&#128512;";
StringEscapeUtils.unescapeHtml(characters);
Output: &#128512;
But when I tried unescaping a String with a smaller code point, it works:
String characters = "&#12531;";
StringEscapeUtils.unescapeHtml(characters);
Output: ン
Any ideas? When I tried unescaping the String "&#128512;" in an online unescaping utility, it worked, so maybe it's a bug in the Apache commons-lang library? Or can anyone recommend another library?
Thanks.
UPDATES:
I'm now able to unescape the String successfully. The problem now is that when I try to escape the result of that unescape, it doesn't bring back the original String ("&#128512;").
unescapeHtml() leaves "&#128512;" untouched because, as the documentation says, it only unescapes HTML 4.0 entities, which are limited to the first 65,536 characters. Unfortunately, 128,512 is far beyond that limit.
Have you tried using unescapeXml()?
XML supports up to 1,114,111 (10FFFFh) character entities (link).
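A quick sketch, assuming commons-lang3 is on the classpath:

import org.apache.commons.lang3.StringEscapeUtils;

String escaped = "&#128512;";
String unescaped = StringEscapeUtils.unescapeXml(escaped);
System.out.println(unescaped); // 😀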
This is the Unicode character with code point U+1F600 (decimal 128512): GRINNING FACE.
Refer to the URL for details.
The String you have mentioned is the HTML escape of U+1F600. If you unescape it using Apache commons-lang, it will produce the required smiley, as shown in the screenshot.
The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
Regarding your update that escaping is not converting back to "&#128512;":
You can also represent a character using a Numeric Character Reference, of the form &#dddd;, where dddd is the decimal value representing the character's Unicode scalar value. You can alternatively use a hexadecimal representation &#xhhhh;, where hhhh is the hexadecimal value equivalent to the decimal value.
A good site for this
I have added a few System.out.println calls to help you understand this Unicode handling better.
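A sketch of such prints (the exact statements are mine):

String smiley = "😀"; // U+1F600

// Decimal numeric character reference: &#128512;
StringBuilder ncr = new StringBuilder();
smiley.codePoints().forEach(cp -> ncr.append("&#").append(cp).append(';'));
System.out.println(ncr); // &#128512;

// Hexadecimal form: &#x1F600;
System.out.println("&#x" + Integer.toHexString(smiley.codePointAt(0)).toUpperCase() + ";");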
Well, the solution is pretty easy:
Use org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4 instead (unless you're on Java < 1.5, which you probably aren't):
String characters = "&#128512;";
String unescaped = StringEscapeUtils.unescapeHtml4(characters); // 😀
I think the problem is that unescapeHtml() does not know the Unicode character "😀", so the method simply returns the input string.
The doc of the function only says:
Returns: a new unescaped String, null if null string input
If it's an HTML-specific question, then you can just use JavaScript for this purpose.
You can do:
escape("&#128512;"), which gives you %26%23128512%3B
unescape("%26%23128512%3B"), which gives you back &#128512;
I have the following regular expression that works fine when the user inputs English, but it always fails with Portuguese characters.
Pattern p = Pattern.compile("^[a-zA-Z]*$");
Matcher matcher = p.matcher(fieldName);
if (!matcher.matches())
{
....
}
Is there any way to get the Pattern object to recognise valid Portuguese characters such as áàâãéêíóôõúüç...?
Thanks
You want a regular expression that will match the class of all alphabetic letters. Across all the scripts of the world there are loads of those, but luckily we can tell Java 6's RE engine that we're after a letter, and it will use the magic of Unicode classes to do the rest. In particular, the L class matches all types of letters: upper, lower, and "oh, that concept doesn't apply in my language":
Pattern p = Pattern.compile("^\\p{L}*$");
// the rest is identical, so won't repeat it...
When reading the docs, remember that backslashes will need to be doubled up if placed in a Java literal so as to stop the Java compiler from interpreting them as something else. (Also be aware that that RE is not suitable for things like validating the names of people, which is an entirely different and much more difficult problem.)
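For instance (the sample inputs are mine):

Pattern p = Pattern.compile("^\\p{L}*$");
System.out.println(p.matcher("coração").matches()); // true: all letters
System.out.println(p.matcher("abc123").matches());  // false: digits are not letters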
It should work with "^\\p{IsAlphabetic}*$", which takes Unicode characters into account. For reference, see the options in http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Check out the Pattern doc and particularly the section on Unicode:
Unicode blocks and categories are written with the \p and \P
constructs as in Perl. \p{prop} matches if the input has the property
prop, while \P{prop} does not match if the input has that property.
Blocks are specified with the prefix In, as in InMongolian. Categories
may be specified with the optional prefix Is: Both \p{L} and \p{IsL}
denote the category of Unicode letters. Blocks and categories can be
used both inside and outside of a character class.
(For Java 1.4.x.) I suspect you're interested in identifying Unicode letters, not specifically Portuguese letters?