How to compare Chinese characters in Java using 'equals()'

I want to compare a string portion (i.e. a character) against a Chinese character. I assume that due to the Unicode encoding it counts as two characters, so I'm looping through the string with increments of two. Now I ran into a roadblock where I'm trying to detect the '兒' character, but equals() doesn't match it, so what am I missing? This is the code snippet:
for (int CharIndex = 0; CharIndex < tmpChar.length(); CharIndex=CharIndex+2) {
// Account for 'r' like in dianr/huir
if (tmpChar.substring(CharIndex,CharIndex+2).equals("兒")) {
Also, feel free to suggest a more elegant way to parse this ...
[UPDATE] Some pics from the debugger, showing that it doesn't match, even though it should. I pasted the Chinese character from the spreadsheet I use as input, so I don't think it's a copy and paste issue (unless the unicode gets lost along the way)
oh, dang, apparently it does not work simply copy and pasting:

Use CharSequence.codePoints(), which returns a stream of the codepoints, rather than having to deal with chars:
tmpChar.codePoints().forEach(c -> {
if (c == '兒') {
// ...
}
});
(Of course, you could have used tmpChar.codePoints().filter(c -> c == '兒').forEach(c -> { /* ... */ })).

Either work with chars, treating 兒 as a substring:
String s = ...;
if (s.contains("兒")) { ... }
int position = s.indexOf("兒");
if (position != -1) {
int position2 = position + "兒".length();
s = s.substring(0, position) + "*" + s.substring(position2);
}
if (s.startsWith("兒", i)) {
// At position i there is a 兒.
}
Or work with code points, where 兒 is a single code point. As that is not really easier, the substring-based variants above seem fine.

if (tmpChar.substring(CharIndex,CharIndex+2).equals("兒")) {
Is your problem. 兒 is only one UTF-16 code unit (a single Java char), so a two-char substring can never equal it. Many Chinese characters can be represented in UTF-16 in one code unit; Java uses UTF-16. However, other characters take two code units.
There are a variety of APIs on the String class for coping.
As offered in another answer, obtaining the IntStream from codepoints allows you to get a 32-bit code point for each character. You can compare that to the code point value for the character you are looking for.
Or, you can use the ICU4J library with a richer set of facilities for all of this.
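To make the code-unit point concrete, here is a minimal sketch of the code point comparison. The class and method names are made up for illustration, and the target character is written as the escape \u5152 (兒) only to keep the source file ASCII-safe.

```java
final class CodePointMatch {
    // Counts how often the code point `target` occurs in `text`.
    static long count(String text, int target) {
        return text.codePoints().filter(cp -> cp == target).count();
    }

    public static void main(String[] args) {
        String er = "\u5152"; // the character from the question, written as an escape
        System.out.println(er.length());                            // 1: a single UTF-16 code unit
        System.out.println(count("dian" + er, er.codePointAt(0)));  // 1
    }
}
```

Because the comparison is done on int code points, the same method also works for characters that need a surrogate pair.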

Related

Create string with emoji unicode flag countries

i need to create a String with a country flag unicode emoji..I did this:
StringBuffer sb = new StringBuffer();
sb.append(StringEscapeUtils.unescapeJava("\\u1F1EB"));
sb.append(StringEscapeUtils.unescapeJava("\\u1F1F7"));
Expecting one country flag but i havent..How can i get a unicode country flag emoji in String with the unicodes characters?

The problem is, that the "\uXXXX" notation is for 4 hexadecimal digits, forming a 16 bit char.
You have Unicode code points above the 16 bit range, both U+1F1EB and U+1F1F7. Each of these will be represented with two chars, a so-called surrogate pair.
You can either use the codepoints to create a string:
int[] codepoints = {0x1F1EB, 0x1F1F7};
String s = new String(codepoints, 0, codepoints.length);
Or use the surrogate pairs, derivable like this:
System.out.print("\"");
for (char ch : s.toCharArray()) {
System.out.printf("\\u%04X", (int)ch);
}
System.out.println("\"");
Giving
"\uD83C\uDDEB\uD83C\uDDF7"
Response to the comment: How to Decode
"\uD83C\uDDEB" are two surrogate 16 bit chars representing U+1F1EB and "\uD83C\uDDF7" is the surrogate pair for U+1F1F7.
private static final int CP_REGIONAL_INDICATOR = 0x1F1E6; // Regional indicator A; A-Z flag codes.
/**
* Get the flag codes of two (or one) regional indicator symbols.
* @param s string starting with 1 or 2 regional indicator symbols.
* @return one or two ASCII letters for the flag, or null.
*/
public static String regionalIndicator(String s) {
int cp0 = regionalIndicatorCodePoint(s);
if (cp0 == -1) {
return null;
}
StringBuilder sb = new StringBuilder();
sb.append((char)(cp0 - CP_REGIONAL_INDICATOR + 'A'));
int n0 = Character.charCount(cp0);
int cp1 = regionalIndicatorCodePoint(s.substring(n0));
if (cp1 != -1) {
sb.append((char)(cp1 - CP_REGIONAL_INDICATOR + 'A'));
}
return sb.toString();
}
private static int regionalIndicatorCodePoint(String s) {
if (s.isEmpty()) {
return -1;
}
int cp0 = s.codePointAt(0);
return CP_REGIONAL_INDICATOR > cp0 || cp0 >= CP_REGIONAL_INDICATOR + 26 ? -1 : cp0;
}
System.out.println("Flag: " + regionalIndicator("\uD83C\uDDEB\uD83C\uDDF7"));
Flag: FR

You should be able to do that simply using toChars from java.lang.Character.
This works for me:
StringBuffer sb = new StringBuffer();
sb.append(Character.toChars(127467));
sb.append(Character.toChars(127479));
System.out.println(sb);
prints 🇫🇷, which the client can choose to display as a French flag, or in other ways.
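If you would rather not hard-code the decimal code points, the same idea can be generalized. This is only a sketch: FlagEmoji and fromCountryCode are hypothetical names, and it assumes the input is a two-letter ASCII country code. It relies on U+1F1E6 being the regional indicator for the letter A.

```java
final class FlagEmoji {
    // U+1F1E6 is REGIONAL INDICATOR SYMBOL LETTER A; B..Z follow consecutively.
    private static final int REGIONAL_INDICATOR_A = 0x1F1E6;

    // Maps a two-letter code such as "FR" to its pair of regional indicator symbols.
    static String fromCountryCode(String code) {
        if (code == null || code.length() != 2) {
            throw new IllegalArgumentException("expected two letters A-Z");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2; i++) {
            char c = code.charAt(i);
            if (c >= 'a' && c <= 'z') {
                c = (char) (c - 'a' + 'A'); // accept lowercase ASCII input
            }
            if (c < 'A' || c > 'Z') {
                throw new IllegalArgumentException("expected two letters A-Z");
            }
            // appendCodePoint writes the surrogate pair for us.
            sb.appendCodePoint(REGIONAL_INDICATOR_A + (c - 'A'));
        }
        return sb.toString();
    }
}
```

FlagEmoji.fromCountryCode("FR") yields the same four chars as the two toChars calls above. Whether they render as a flag still depends on the client's font.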

If you want to use emojis often, it could be good to use a library that would handle that unicode stuff for you: emoji-java
You would just add the maven dependency:
<dependency>
<groupId>com.vdurmont</groupId>
<artifactId>emoji-java</artifactId>
<version>1.0.0</version>
</dependency>
And call the EmojiManager:
Emoji emoji = EmojiManager.getForAlias("fr");
System.out.println("HEY: " + emoji.getUnicode());
The entire list of supported emojis is here.

I suppose you want to achieve something like this
Let me give you 2 example of unicodes for country flags:
for ROMANIA ---> \uD83C\uDDF7\uD83C\uDDF4
for AMERICA ---> \uD83C\uDDFA\uD83C\uDDF8
You can get this and other country flags unicodes from this site Emoji Unicodes
Once you are on the site, you will see a table with a lot of emoji. Select the FLAGS tab in that table (it is easy to find) and all the country flags will appear. Select one flag from the list, any flag you want, but only ONE. A text code will then appear in the message box; that is not important. What matters is the right side of the page, where the flag and country name of your selected flag appear. CLICK on that, and on the page that opens, find the TABLE named Emoji Character Encoding Data. Scroll to the last part of the table, where it says C/C++/Java Src; there you will find the correct Unicode escape sequence for the flag. Attention: always select the long escape sequence like the ones above; if you are not careful you may pick a short, single escape instead. So, keep that in mind.
Finally, here is some sample code from an Android app of mine; it works the same way in plain Java.
ArrayList<String> listLanguages = new ArrayList<>();
listLanguages.add("\uD83C\uDDFA\uD83C\uDDF8 " + getString(R.string.English));
listLanguages.add("\uD83C\uDDF7\uD83C\uDDF4 " + getString(R.string.Romanian));
Another simple custom example:
String flagCountryName = "\uD83C\uDDEF\uD83C\uDDF2 Jamaica";
You can use this variable where you need it. This will show you the flag of Jamaica in front of the text.
This is all, if you did not understand something just ask.

Look at Creating Unicode character from its number
Could not get my machine to print the Unicode you have there, but for other values it works.

How to verify whether an instance of CharSequence is a sequence of Unicode scalar values?

I have an instance of java.lang.CharSequence. I need to determine whether this instance is a sequence of Unicode scalar values (that is, whether the instance is in UTF-16 encoding form). Despite the assurances of java.lang.String, a Java string is not necessarily in UTF-16 encoding form (at least not according to the latest Unicode specification, currently 6.2), since it may contain isolated surrogate code units. (A Java string is, however, a Unicode 16-bit string.)
There are several obvious ways in which to go about this, including:
Iterate over the code points of the sequence, explicitly validating each as a Unicode scalar value.
Use a regular expression to search for isolated surrogate code points.
Pipe the character sequence through a character-set encoder that reports encoding errors.
It seems as though something like this should already exist as a library function, however. I just can't find it in the standard API. Am I missing it, or do I need to implement it?

Try this function:
static boolean isValidUTF16(String s) {
for (int i = 0; i < s.length(); i++) {
if (Character.isLowSurrogate(s.charAt(i)) && (i == 0 || !Character.isHighSurrogate(s.charAt(i - 1)))
|| Character.isHighSurrogate(s.charAt(i)) && (i == s.length() -1 || !Character.isLowSurrogate(s.charAt(i + 1)))) {
return false;
}
}
return true;
}
Here's a test:
public static void main(String args[]) {
System.out.println(isValidUTF16("\uDC00\uDBFF")); // false: the low surrogate comes first
System.out.println(isValidUTF16("\uDBFF\uDC00")); // true: a well-formed surrogate pair
}
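The encoder-based option listed in the question can also be sketched with the standard java.nio.charset API. This is a rough illustration, not a library function: Utf16Check and isWellFormed are made-up names, and it assumes that encoding to UTF-8 is an acceptable way to detect isolated surrogates.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf16Check {
    // Returns true if `s` contains no isolated surrogate code units.
    static boolean isWellFormed(CharSequence s) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // An unpaired surrogate is malformed input for the UTF-8 encoder.
            encoder.encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

This allocates an output buffer for the encoded bytes, so the explicit loop above is likely cheaper for long inputs; the sketch mainly shows that it works on any CharSequence.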

Java: looking for the fastest way to check String for presence of Unicode chars in certain range

I need to implement a very crude language identification algorithm. In my world, there are only two languages: English and not-English. I have an ArrayList<String> and I need to determine if each String is likely in English or in the other language, which has its Unicode chars in a certain range. So what I want to do is to check each String against this range using some type of "presence" test. If it passes the test, I say the String is not English, otherwise it's English. I want to try two types of tests:
TEST-ANY: If any char in the string falls within the range, the string passes the test
TEST-ALL: If all chars in the string fall within the range, the string passes the test
Since the array might be very long, I need to implement this very efficiently. What would be the fastest way of doing this in Java?
Thx
UPDATE: I am specifically checking for non-English by looking at a specific range of Unicode code points rather than checking whether the characters are ASCII, in part to take care of the "resume" problem mentioned below. What I am trying to figure out is whether Java provides any classes/methods that essentially implement TEST-ANY or TEST-ALL (or another similar test) as efficiently as possible. In other words, I am trying to avoid reinventing the wheel, especially if the wheel invented before me is better anyway.

Here's how I ended up implementing TEST-ANY:
// TEST-ANY
String str = "wordToTest";
int UrangeLow = 1234; // can get range from e.g. http://www.utf8-chartable.de/unicode-utf8-table.pl
int UrangeHigh = 2345;
for(int iLetter = 0; iLetter < str.length() ; ) {
int cp = str.codePointAt(iLetter);
if (cp >= UrangeLow && cp <= UrangeHigh) {
// word is NOT English
return;
}
iLetter += Character.charCount(cp); // step over both halves of a surrogate pair
}
// word is English
return;
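For comparison, both tests can be written with the code point stream that String offers since Java 8. This is only a sketch under that assumption; RangeTest, anyInRange and allInRange are hypothetical names, and I have not measured it against the plain loop.

```java
final class RangeTest {
    // TEST-ANY: true if at least one code point lies in [low, high].
    static boolean anyInRange(String s, int low, int high) {
        return s.codePoints().anyMatch(cp -> cp >= low && cp <= high);
    }

    // TEST-ALL: true if every code point lies in [low, high].
    // Note that an empty string passes, because allMatch is vacuously true.
    static boolean allInRange(String s, int low, int high) {
        return s.codePoints().allMatch(cp -> cp >= low && cp <= high);
    }
}
```

Both methods stop at the first code point that decides the answer, and codePoints() already handles surrogate pairs.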

I really don't think that this solution is ideal for determining language, but if you want to check to see if a string is all ascii, you could do something like this:
public static boolean isASCII(String s){
boolean ret = true;
for(int i = 0; i < s.length() ; i++) {
if(s.charAt(i)>=128){
ret = false;
break;
}
}
return ret;
}
So then if you try this:
boolean r = isASCII("Hello");
r would equal true. But if you try:
boolean r = isASCII("Grüß dich");
then r would equal false. I haven't tested performance, but this would work reasonably fast, because all it does is compare a character to the number 128.
But as @AlexanderPogrebnyak mentioned in the comments above, this will return false if you give it "résumé". Be aware of that.
Update:
I am specifically checking for non-English by looking at a specific range of Unicodes rather than checking for whether the characters are ASCII
But ASCII is a range in Unicode: its 128 characters are the first 128 Unicode code points, so Unicode is a superset of ASCII. What the code @mP. and I provided does is check whether each character is in a certain range. I chose that range to be ASCII, which is any Unicode character that has a decimal value of less than 128. You can just as well choose any other range. But the reason I chose ASCII is that it's the one with the Latin alphabet, the Arabic numerals, and some other common characters that would normally be in an 'English' string.

public static boolean isAscii( String s ){
int length = s.length();
for( int i = 0; i < length; i++){
final char c = s.charAt( i );
if( c > 'z' ){
return false;
}
}
return true;
}
@Hassan, thanks for spotting the typo; I replaced the test against big Z with little z.

Trim String in Java while preserve full word

I need to trim a String in java so that:
The quick brown fox jumps over the lazy dog.
becomes
The quick brown...
In the example above, I'm trimming to 12 characters. If I just use substring I would get:
The quick br...
I already have a method for doing this using substring, but I wanted to know what is the fastest (most efficient) way to do this because a page may have many trim operations.
The only way I can think of is to split the string on spaces and put it back together until its length passes the given length. Is there another way? Perhaps a more efficient way in which I can use the same method to do a "soft" trim where I preserve the last word (as shown in the example above) and a hard trim which is pretty much a substring.
Thanks,

Below is a method I use to trim long strings in my webapps.
The "soft" boolean as you put it, if set to true will preserve the last word.
This is the most concise way of doing it that I could come up with that uses a StringBuffer which is a lot more efficient than recreating a string which is immutable.
public static String trimString(String string, int length, boolean soft) {
if(string == null || string.trim().isEmpty()){
return string;
}
StringBuffer sb = new StringBuffer(string);
int actualLength = length - 3;
if(sb.length() > actualLength){
// -3 because we add 3 dots at the end. Returned string length has to be length including the dots.
if(!soft)
return escapeHtml(sb.insert(actualLength, "...").substring(0, actualLength+3));
else {
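int endIndex = sb.indexOf(" ",actualLength);
// No space after the cut-off point: the rest is the last word, so keep the whole string.
if(endIndex == -1)
return escapeHtml(string);
return escapeHtml(sb.insert(endIndex,"...").substring(0, endIndex+3));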
}
}
return string;
}
Update
I've changed the code so that the ... is appended in the StringBuffer, this is to prevent needless creations of String implicitly which is slow and wasteful.
Note: escapeHtml is a static import from apache commons:
import static org.apache.commons.lang.StringEscapeUtils.escapeHtml;
You can remove it and the code should work the same.

Here is a simple, regex-based, 1-line solution:
str.replaceAll("(?<=.{12})\\b.*", "..."); // How easy was that!? :)
Explanation:
(?<=.{12}) is a positive look behind, which asserts that there are at least 12 characters to the left of the match, but it is a non-capturing (ie zero-width) match
\b.* matches the first word boundary (after at least 12 characters - above) to the end
This is replaced with "..."
Here's a test:
public static void main(String[] args) {
String input = "The quick brown fox jumps over the lazy dog.";
String trimmed = input.replaceAll("(?<=.{12})\\b.*", "...");
System.out.println(trimmed);
}
Output:
The quick brown...
If performance is an issue, pre-compile the regex for an approximately 5x speed up (YMMV) by compiling it once:
static Pattern pattern = Pattern.compile("(?<=.{12})\\b.*");
and reusing it:
String trimmed = pattern.matcher(input).replaceAll("...");

Please try following code:
private String trim(String src, int size) {
if (src.length() <= size) return src;
int pos = src.lastIndexOf(" ", size - 3);
if (pos < 0) return src.substring(0, size);
return src.substring(0, pos) + "...";
}

Try searching for the last occurrence of a space before position 11, or the first one after it, and trim the string there, appending "...".
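A minimal sketch of the "first space after the cut-off" variant, which reproduces the example from the question. SoftTrim and softTrim are made-up names, and it assumes that words are separated by single spaces.

```java
final class SoftTrim {
    // Cuts `s` at the first space at or after index `minLength` and appends "...".
    static String softTrim(String s, int minLength) {
        if (s.length() <= minLength) {
            return s; // short enough already
        }
        int cut = s.indexOf(' ', minLength);
        if (cut < 0) {
            return s; // the cut-off falls inside the last word, so keep everything
        }
        return s.substring(0, cut) + "...";
    }
}
```

With the sentence from the question and a limit of 12, this gives "The quick brown...". It does one indexOf and one substring per call, with no splitting or regex.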

Your requirements aren't clear. If you have trouble articulating them in a natural language, it's no surprise that they'll be difficult to translate into a computer language like Java.
"preserve the last word" implies that the algorithm will know what a "word" is, so you'll have to tell it that first. The split is a way to do it. A scanner/parser with a grammar is another.
I'd worry about making it work before I concerned myself with efficiency. Make it work, measure it, then see what you can do about performance. Everything else is speculation without data.

How about:
mystring = mystring.replaceAll("^(.{12}.*?)\\b.*$", "$1...");

I use this hack. Suppose the trimmed string should be about 120 characters long:
String textToDisplay = textToTrim.substring(0, Math.min(120, textToTrim.length()));
if (textToDisplay.length() != textToTrim.length()) {
// Extend to the end of the word that was cut, or to the end of the text if no space follows.
int nextSpace = textToTrim.indexOf(" ", textToDisplay.length() - 1);
textToDisplay = textToTrim.substring(0, nextSpace < 0 ? textToTrim.length() : nextSpace) + " ...";
}

String capitalize - better way

What method of capitalizing is better?
mine:
char[] charArray = string.toCharArray();
charArray[0] = Character.toUpperCase(charArray[0]);
return new String(charArray);
or
commons lang - StringUtils.capitalize:
return new StringBuffer(strLen)
.append(Character.toTitleCase(str.charAt(0)))
.append(str.substring(1))
.toString();
I think mine is better, but i would rather ask.

I guess your version will be a little bit more performant, since it does not allocate as many temporary String objects.
I'd go for this (assuming the string is not empty):
StringBuilder strBuilder = new StringBuilder(string);
strBuilder.setCharAt(0, Character.toUpperCase(strBuilder.charAt(0)));
return strBuilder.toString();
However, note that they are not equivalent in that one uses toUpperCase() and the other uses toTitleCase().
From a forum post:
Titlecase <> uppercase
Unicode defines three kinds of case mapping: lowercase, uppercase, and titlecase. The difference between uppercasing and titlecasing a character or character sequence can be seen in compound characters (that is, a single character that represents a compound of two characters).
For example, in Unicode, character U+01F3 is LATIN SMALL LETTER DZ. (Let us write this compound character using ASCII as "dz".) This character uppercases to character U+01F1, LATIN CAPITAL LETTER DZ. (Which is basically "DZ".) But it titlecases to character U+01F2, LATIN CAPITAL LETTER D WITH SMALL LETTER Z. (Which we can write "Dz".)
character  uppercase  titlecase
---------  ---------  ---------
dz         DZ         Dz
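The difference described in the quote is easy to check from Java itself. A small sketch (CaseMappingDemo and mappings are illustrative names, not part of any library):

```java
final class CaseMappingDemo {
    // Formats the uppercase and titlecase mappings of `c` as code points.
    static String mappings(char c) {
        return String.format("upper=U+%04X title=U+%04X",
                (int) Character.toUpperCase(c),
                (int) Character.toTitleCase(c));
    }

    public static void main(String[] args) {
        // U+01F3 is LATIN SMALL LETTER DZ, the example from the quote.
        System.out.println(mappings('\u01F3')); // upper=U+01F1 title=U+01F2
        // For a plain ASCII letter both mappings agree.
        System.out.println(mappings('a'));      // upper=U+0041 title=U+0041
    }
}
```

For ordinary Latin letters the two calls return the same character, so the choice only matters for the few characters that have a distinct titlecase form.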

If I were to write a library, I'd try to make sure I got my Unicode right before worrying about performance. Off the top of my head:
int len = str.length();
if (len == 0) {
return str;
}
int head = Character.toUpperCase(str.codePointAt(0));
String tail = str.substring(str.offsetByCodePoints(0, 1));
return new String(Character.toChars(head)).concat(tail); // String has no constructor taking only an int[]
(I'd probably also look up the difference between title and upper case before I committed.)

Performance is equal.
Your code copies the char[] calling string.toCharArray() and new String(charArray).
The Apache code copies in buffer.append(str.substring(1)) and in buffer.toString(). The Apache code also has an extra String instance holding the [1, length) content, but its char[] is not copied when that String is created (this holds only where substring shares the parent's array; since Java 7u6 substring makes a copy).

StringBuffer is declared to be thread safe, so it might be less efficient to use it (but one shouldn't bet on it before actually doing some practical tests).

StringBuilder (from Java 5 onwards) is faster than StringBuffer if you don't need it to be thread safe but as others have said you need to test if this is better than your solution in your case.

Have you timed both?
Honestly, they're equivalent.. so the one that performs better for you is the better one :)

Not sure what the difference between toUpperCase and toTitleCase is, but it looks as if your solution requires one less instantiation of the String class, while the commons lang implementation requires two (substring and toString create new Strings I assume, since String is immutable).
Whether that's "better" (I guess you mean faster) I don't know. Why don't you profile both solutions?

look at this question titlecase-conversion . apache FTW.

/**
* Capitalize the first letter of a string.
*
* @param s the string to capitalize
* @return the capitalized string, or "" if s is null or empty
*/
public static String capitalizeFirst(String s) {
if (s == null || s.length() == 0) {
return "";
}
char first = s.charAt(0);
if (Character.isUpperCase(first)) {
return s;
} else {
return Character.toUpperCase(first) + s.substring(1);
}
}

If you only capitalize a limited set of words, you'd better cache the results.
@Test
public void testCase()
{
String all = "At its base, a shell is simply a macro processor that executes commands. The term macro processor means functionality where text and symbols are expanded to create larger expressions.\n" +
"\n" +
"A Unix shell is both a command interpreter and a programming language. As a command interpreter, the shell provides the user interface to the rich set of GNU utilities. The programming language features allow these utilities to be combined. Files containing commands can be created, and become commands themselves. These new commands have the same status as system commands in directories such as /bin, allowing users or groups to establish custom environments to automate their common tasks.\n" +
"\n" +
"Shells may be used interactively or non-interactively. In interactive mode, they accept input typed from the keyboard. When executing non-interactively, shells execute commands read from a file.\n" +
"\n" +
"A shell allows execution of GNU commands, both synchronously and asynchronously. The shell waits for synchronous commands to complete before accepting more input; asynchronous commands continue to execute in parallel with the shell while it reads and executes additional commands. The redirection constructs permit fine-grained control of the input and output of those commands. Moreover, the shell allows control over the contents of commands’ environments.\n" +
"\n" +
"Shells also provide a small set of built-in commands (builtins) implementing functionality impossible or inconvenient to obtain via separate utilities. For example, cd, break, continue, and exec cannot be implemented outside of the shell because they directly manipulate the shell itself. The history, getopts, kill, or pwd builtins, among others, could be implemented in separate utilities, but they are more convenient to use as builtin commands. All of the shell builtins are described in subsequent sections.\n" +
"\n" +
"While executing commands is essential, most of the power (and complexity) of shells is due to their embedded programming languages. Like any high-level language, the shell provides variables, flow control constructs, quoting, and functions.\n" +
"\n" +
"Shells offer features geared specifically for interactive use rather than to augment the programming language. These interactive features include job control, command line editing, command history and aliases. Each of these features is described in this manual.";
String[] split = all.split("[\\W]");
// 10000000
// upper Used 606
// hash Used 114
// 100000000
// upper Used 5765
// hash Used 1101
HashMap<String, String> cache = Maps.newHashMap();
long start = System.currentTimeMillis();
for (int i = 0; i < 100000000; i++)
{
String upper = split[i % split.length].toUpperCase();
// String s = split[i % split.length];
// String upper = cache.get(s);
// if (upper == null)
// {
// cache.put(s, upper = s.toUpperCase());
//
// }
}
System.out.println("Used " + (System.currentTimeMillis() - start));
}
The text is picked from here.
Currently I need to uppercase table and column names many, many times, but there is only a limited set of them, so caching the results in a HashMap works better.
:-)
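The commented-out cache lookup in the benchmark can be packaged roughly like this. It is only a sketch: UpperCaseCache is a made-up name, it is not thread safe, and it assumes the set of distinct inputs stays small enough to keep in memory.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class UpperCaseCache {
    private final Map<String, String> cache = new HashMap<>();

    // Returns the uppercased form of `s`, computing it only on the first request.
    String upper(String s) {
        return cache.computeIfAbsent(s, k -> k.toUpperCase(Locale.ROOT));
    }
}
```

Locale.ROOT is used so that identifiers are uppercased the same way regardless of the default locale; that is my choice here, not something the benchmark above does.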

Use this method to capitalize the first letter of every word in a string; it works for me.
public String capitalizeString(String value)
{
String string = value;
String capitalizedString = "";
System.out.println(string);
for(int i = 0; i < string.length(); i++)
{
char ch = string.charAt(i);
if(i == 0 || string.charAt(i-1)==' ')
ch = Character.toUpperCase(ch);
capitalizedString += ch;
}
return capitalizedString;
}
