How to use Unicode to split Japanese from English - java

I have a string variable holding a paragraph that contains both English and Japanese words.
I want to split the Japanese from the English.
My idea is to check each character's code point to decide whether it falls into U+0000–U+007F (the Basic Latin block).
But I don't know how to write the Java code to get a char's code point, or how to compare code points.
Can anyone give me a sample?
public void split(String str) {
    char[] cstr = str.toCharArray();
    String en = "";
    String jp = "";
    for (char c : cstr) {
        // (1) To Unicode?
        // (2) How to check whether it falls into \u0000 ~ \u007F
        if (is_en) en += c;
        else jp += c;
    }
}

Assuming the string you have is 16-bit Unicode, and that you aren't trying to go to full Unicode, you can use:
if ('\u0000' <= c && c <= '\u007f') {
    // c is English
} else {
    // c is other
}
I don't know, however, whether this does exactly what you want. Many of the characters in that range are actually punctuation, for instance. And there is a set of Unicode characters (the Halfwidth and Fullwidth Forms block) that mixes Roman letters with half-width katakana. Just be aware that reliably differentiating the Unicode characters that might represent English letters from all others may not be this simple; it will depend on your environment.
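As a rough sketch of going beyond the raw range check (the class and method names here are my own, and the list of blocks is illustrative, not exhaustive), `Character.UnicodeBlock` can identify the Japanese-specific blocks directly:

```java
public class Splitter {
    // Returns true if c belongs to one of the blocks commonly used
    // for Japanese text (kana, kanji, and their punctuation/forms).
    static boolean isJapanese(char c) {
        Character.UnicodeBlock b = Character.UnicodeBlock.of(c);
        return b == Character.UnicodeBlock.HIRAGANA
            || b == Character.UnicodeBlock.KATAKANA
            || b == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || b == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
            || b == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS;
    }

    public static String[] split(String str) {
        StringBuilder en = new StringBuilder();
        StringBuilder jp = new StringBuilder();
        for (char c : str.toCharArray()) {
            if (isJapanese(c)) jp.append(c);
            else en.append(c);
        }
        return new String[] { en.toString(), jp.toString() };
    }

    public static void main(String[] args) {
        String[] parts = split("Hello こんにちは world 世界");
        System.out.println("EN: " + parts[0]);
        System.out.println("JP: " + parts[1]);
    }
}
```

Note the same caveat as above: the Halfwidth and Fullwidth Forms block also contains full-width Latin letters, so block-based classification is still only an approximation.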

Related

Java check if string only contains english keyboard letters

I want to disallow users from using any special characters in their name.
They should be able to use the whole English keyboard, so
a-z, 0-9, [], (), &, ", %, $, ^, °, #, *, +, ~, §, ., ,, -, ', =, }{
and so on. In other words, they should be allowed to use every "normal" English character that you can type with your keyboard.
How can I check that?
Use a regex to match the name against English letters.
Solution 1:
if (name.matches("[a-zA-Z]+")) {
    // Accept name
} else {
    // Ask to enter again
}
Solution 2:
while (!name.matches("[a-zA-Z]+")) {
    // Ask to enter again
}
// Accept name
For example:
String str = "My string";
System.out.println(str.matches("^[a-zA-Z][a-zA-Z\\s]+$"));//true
str = "My string1";
System.out.println(str.matches("^[a-zA-Z][a-zA-Z\\s]+$"));//false
You can use a regular expression for this.
Since you have lots of characters that have special meaning in a regular expression, I recommend putting them in a separate string and quoting them:
String specialCharacters = "-[]()&...";
Pattern allowedCharactersPattern =
        Pattern.compile("[A-Za-z0-9" + Pattern.quote(specialCharacters) + "]*");

boolean containsOnlyAllowedCharacters(String str) {
    return allowedCharactersPattern.matcher(str).matches();
}
As for how to obtain the string of special characters in the first place, there is no way to list all the characters that can be typed with the user's current keyboard layout. In fact, since there are ways to type any Unicode character at all such a list would be useless anyway.
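A self-contained sketch of the `Pattern.quote` approach (the allow-list below is illustrative, not the questioner's exact set):

```java
import java.util.regex.Pattern;

public class AllowedChars {
    // Hypothetical allow-list of special characters; extend as needed.
    static final String SPECIAL = "-[]()&\"%$^#*+~.,'=}{";
    // Pattern.quote wraps the string in \Q...\E, so every character
    // in it is treated literally inside the character class.
    static final Pattern ALLOWED =
            Pattern.compile("[A-Za-z0-9" + Pattern.quote(SPECIAL) + "]*");

    static boolean containsOnlyAllowedCharacters(String str) {
        return ALLOWED.matcher(str).matches();
    }

    public static void main(String[] args) {
        System.out.println(containsOnlyAllowedCharacters("Alice-42")); // true
        System.out.println(containsOnlyAllowedCharacters("Алиса"));    // false
    }
}
```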
I find the requirement to be quite strange, in that I can't see the rationale behind accepting § but not, say, å; I have not checked the list of characters you want to accept in any detail.
But, it seems to me that what you're asking is to accept any character whose codepoint value is less than 0x0080, with the oddball exception of § (0x00A7). So I'd code it to make that check explicitly, and not get involved with regular expressions. I assume you want to exclude control characters, even though they can be typed on an English keyboard.
Pseudocode:
for each character ch in string
    if ch < 0x0020 || (ch >= 0x007F && ch != '§')
        then it's not allowed
Your requirements are oddly stated, though, in that you want to disallow "special characters" but allow `!@#$%^&*()_+', for example. What's your definition of "special character"?
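The pseudocode above can be sketched in Java as (class and method names are my own):

```java
public class AsciiCheck {
    // Allow printable ASCII plus '§' (U+00A7); reject control
    // characters and everything else.
    static boolean isAllowed(char ch) {
        if (ch < 0x0020) return false;         // control characters
        return ch < 0x007F || ch == '\u00A7';  // DEL and above, except §
    }

    public static void main(String[] args) {
        System.out.println(isAllowed('A'));   // true
        System.out.println(isAllowed('§'));   // true
        System.out.println(isAllowed('å'));   // false
        System.out.println(isAllowed('\t'));  // false
    }
}
```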
For an arbitrary definition of 'allowable characters' I'd use a BitSet.
static BitSet valid = new BitSet();
static {
    valid.set('A', 'Z' + 1);
    valid.set('a', 'z' + 1);
    valid.set('0', '9' + 1);
    valid.set('.');
    valid.set('_');
    // ...etc...
}
then
for (int j = 0; j < str.length(); j++) {
    if (!valid.get(str.charAt(j))) {
        // ...illegal...
    }
}
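Filled out into a runnable sketch (the allow-list here is a small illustrative subset, not the full keyboard set):

```java
import java.util.BitSet;

public class NameValidator {
    static final BitSet VALID = new BitSet();
    static {
        // BitSet.set(from, to) is exclusive of 'to', hence the + 1.
        VALID.set('A', 'Z' + 1);
        VALID.set('a', 'z' + 1);
        VALID.set('0', '9' + 1);
        VALID.set('.');
        VALID.set('_');
    }

    static boolean isValid(String str) {
        for (int j = 0; j < str.length(); j++) {
            if (!VALID.get(str.charAt(j))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid("user_name.42")); // true
        System.out.println(isValid("user name"));    // false (space not set)
    }
}
```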

Test for English only A-Z upper case of a character

I need to test a character for uppercase A-Z only, not any other special Unicode characters or other languages.
I was reading the documentation for Character.isUpperCase. It seems like it would pass a Unicode character that is considered uppercase but not technically between A-Z, and it seems like it would pass uppercase characters from languages besides English.
Do I just need to use regular expressions, or am I reading Character.isUpperCase incorrectly?
Thanks
From the documentation you linked:
Many other Unicode characters are uppercase too.
So yes, using isUpperCase will match things other than A-Z.
One way to do the test though is like this.
boolean isUpperCaseEnglish(char c) {
    return c >= 'A' && c <= 'Z';
}
isUpperCase indeed does not promise the character is between 'A' and 'Z'. You could use a regex:
String s = ...;
Pattern p = Pattern.compile("[A-Z]*");
Matcher m = p.matcher(s);
boolean matches = m.matches();
Character.isUpperCase() does accept things based off of other languages. For instance, Ω would be considered uppercase.
But you can do a check to make sure it is between A and Z:
public static boolean isUpperCaseInEnglish(char c) {
    return (c >= 'A' && c <= 'Z');
}

Remove non-ASCII non-printable characters from a String

I get user input including non-ASCII characters and non-printable characters, such as
\xc2d
\xa0
\xe7
\xc3\ufffdd
\xc3\ufffdd
\xc2\xa0
\xc3\xa7
\xa0\xa0
for example:
email : abc#gmail.com\xa0\xa0
street : 123 Main St.\xc2\xa0
desired output:
email : abc#gmail.com
street : 123 Main St.
What is the best way to removing them using Java?
I tried the following, but it doesn't seem to work:
public static void main(String args[]) throws UnsupportedEncodingException {
    String s = "abc#gmail\\xe9.com";
    String email = "abc#gmail.com\\xa0\\xa0";
    System.out.println(s.replaceAll("\\P{Print}", ""));
    System.out.println(email.replaceAll("\\P{Print}", ""));
}
Output
abc#gmail\xe9.com
abc#gmail.com\xa0\xa0
Your requirements are not clear. All characters in a Java String are Unicode characters, so if you remove them, you'll be left with an empty string. I assume what you mean is that you want to remove any non-ASCII, non-printable characters.
String clean = str.replaceAll("\\P{Print}", "");
Here, \p{Print} represents a POSIX character class for printable ASCII characters, while \P{Print} is the complement of that class. With this expression, all characters that are not printable ASCII are replaced with the empty string. (The extra backslash is because \ starts an escape sequence in string literals.)
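For example, with a real U+00A0 non-breaking space in the data (as opposed to the literal four characters \xa0; the street value is illustrative):

```java
public class PrintDemo {
    public static void main(String[] args) {
        // Ends with a genuine non-breaking space character.
        String street = "123 Main St.\u00A0";
        // U+00A0 is not printable ASCII, so \P{Print} matches and removes it.
        String clean = street.replaceAll("\\P{Print}", "");
        System.out.println(clean.equals("123 Main St."));  // true
    }
}
```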
Apparently, all the input characters are actually ASCII characters that represent a printable encoding of non-printable or non-ASCII characters. Mongo shouldn't have any trouble with these strings, because they contain only plain printable ASCII characters.
This all sounds a little fishy to me. What I believe is happening is that the data really do contain non-printable and non-ASCII characters, and another component (like a logging framework) is replacing these with a printable representation. In your simple tests, you are failing to translate the printable representation back to the original string, so you mistakenly believe the first regular expression is not working.
That's my guess, but if I've misread the situation and you really do need to strip out literal \xHH escapes, you can do it with the following regular expression.
String clean = str.replaceAll("\\\\x\\p{XDigit}{2}", "");
The API documentation for the Pattern class does a good job of listing all of the syntax supported by Java's regex library. For more elaboration on what all of the syntax means, I have found the Regular-Expressions.info site very helpful.
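If the escapes really are literal backslash text, a minimal sketch of that second regex in action (the sample string is from the question):

```java
public class StripEscapes {
    public static void main(String[] args) {
        // This string contains the literal characters '\', 'x', 'a', '0',
        // not a real U+00A0.
        String email = "abc#gmail.com\\xa0\\xa0";
        // \\\\x in the literal is the regex \\x: a backslash, then 'x',
        // then exactly two hex digits.
        String clean = email.replaceAll("\\\\x\\p{XDigit}{2}", "");
        System.out.println(clean);  // abc#gmail.com
    }
}
```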
I know it's maybe late but for future reference:
String clean = str.replaceAll("\\P{Print}", "");
Removes all non-printable characters, but that includes \n (line feed), \t (tab), and \r (carriage return), and sometimes you want to keep those characters.
For that problem use inverted logic:
String clean = str.replaceAll("[^\\n\\r\\t\\p{Print}]", "");
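For instance (U+0007 is the non-printable BEL control character; the tab survives because it is excluded from the negated class):

```java
public class KeepWhitespace {
    public static void main(String[] args) {
        String str = "a\tb\u0007c";
        // Remove everything that is neither \n, \r, \t, nor printable ASCII.
        String clean = str.replaceAll("[^\\n\\r\\t\\p{Print}]", "");
        System.out.println(clean.equals("a\tbc"));  // true
    }
}
```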
With Google Guava's CharMatcher, you can remove any non-printable characters and then retain all ASCII characters (dropping any accents) like this:
String printable = CharMatcher.INVISIBLE.removeFrom(input);
String clean = CharMatcher.ASCII.retainFrom(printable);
Not sure if that's what you really want, but it removes anything expressed as escape sequences in your question's sample data.
You can try this code:
public String cleanInvalidCharacters(String in) {
    if (in == null || "".equals(in)) {
        return "";
    }
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < in.length(); i++) {
        char current = in.charAt(i);
        // Keep tab, LF, CR, and the valid non-surrogate ranges.
        if ((current == 0x9)
                || (current == 0xA)
                || (current == 0xD)
                || ((current >= 0x20) && (current <= 0xD7FF))
                || ((current >= 0xE000) && (current <= 0xFFFD))
                // Unreachable for char (max 0xFFFF); supplementary
                // characters would require iterating code points.
                || ((current >= 0x10000) && (current <= 0x10FFFF))) {
            out.append(current);
        }
    }
    return out.toString().replaceAll("\\s", " ");
}
It works for me to remove invalid characters from String.
You can use java.text.Normalizer.
Input => "This \u7279text \u7279is what I need"
Output => "This text is what I need"
If you are trying to remove literal Unicode escapes (like \u7279) from a string as above, this code will work:
Pattern unicodeCharsPattern = Pattern.compile("\\\\u(\\p{XDigit}{4})");
Matcher unicodeMatcher = unicodeCharsPattern.matcher(data);
String cleanData = null;
if (unicodeMatcher.find()) {
    cleanData = unicodeMatcher.replaceAll("");
}

Convert Latin characters to Normal text in Java

I have the following characters.
Ą¢¥ŚŠŞŤŹŽŻąľśšşťźžżÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ
I need to convert to
AcYSSSTZZZalssstzzzAAAAAAACEEEEIIIIDNOOOOOOUUUUYTSaaaaaaaceeeeiiiionoooooouuuuyty
I am using Java 1.4.
Normalizer.decompose(text, true, 0).replaceAll(
"\\p{InCombiningDiacriticalMarks}+", ""); only replaces characters with diacritics.
Characters like ¢¥ÆÐÞßæðøþ are not getting converted.
How can I do that? What is an efficient way to do the conversion in JDK 1.4?
Please help.
Regards,
Sridevi
Check out the ICU project, especially the icu4j part.
The Transliterator class will solve your problem.
Here is an example of a Transliterator that converts any script to Latin characters and removes any accents and non-ASCII chars:
Transliterator accentsConverter = Transliterator.getInstance("Any-Latin; NFD; [:M:] Remove; NFC; [^\\p{ASCII}] Remove");
The Any-Latin part performs the conversion, NFD; [:M:] Remove; NFC removes the accents and [^\\p{ASCII}] Remove removes any non-ascii chars remaining.
You just call accentsConverter.transliterate(yourString) to get the results.
You can read more about how to build the transformation ID (the parameter of Transliterator.getInstance) in the ICU Transformations guide.
How can I do that? What is an efficient way to do the conversion in JDK 1.4?
The most efficient way is to use a lookup table implemented as either an array or a HashMap. But, of course, you need to populate the table.
Characters like ¢¥ÆÐÞßæðøþ are not getting converted.
Well, none of those characters is really a Roman letter, and they can't be translated to Roman letters without taking outrageous liberties with the semantics. For example:
¢ and ¥ are currency symbols,
Æ and æ are ligatures that in some languages represent two letters, and in others are a distinct letter,
ß is the German representation of a double-s.
I would do something like this:
UPDATED FOR Java 1.4 (removed generics)
public class StringConverter {
    char[] source = new char[]{'Ą', '¢', '¥', 'Ś'}; // all your chars here...
    char[] target = new char[]{'A', 'c', 'Y', 'S'}; // all your chars here...

    // Build a map
    HashMap map;

    public StringConverter() {
        map = new HashMap();
        for (int i = 0; i < source.length; i++) {
            map.put(new Character(source[i]), new Character(target[i]));
        }
    }

    public String convert(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            // Look up the replacement; leave unmapped characters alone.
            // (Pre-1.5, so no autoboxing: wrap and cast explicitly.)
            Character mapped = (Character) map.get(new Character(chars[i]));
            if (mapped != null) {
                chars[i] = mapped.charValue();
            }
        }
        return new String(chars);
    }
}

Java Regexp to Match ASCII Characters

What regex would match any ASCII character in java?
I've already tried:
^[\\p{ASCII}]*$
but found that it didn't match lots of things that I wanted (like spaces, parentheses, etc...). I'm hoping to avoid explicitly listing all 127 ASCII characters in a format like:
^[a-zA-Z0-9!##$%^*(),.<>~`[]{}\\/+=-\\s]*$
The first try was almost correct:
"^\\p{ASCII}*$"
I have never used \\p{ASCII} but I have used ^[\\u0000-\\u007F]*$
If you only want the printable ASCII characters you can use ^[ -~]*$ - i.e. all characters between space and tilde.
https://en.wikipedia.org/wiki/ASCII#ASCII_printable_code_chart
For JavaScript it'll be /^[\x00-\x7F]*$/.test('blah')
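The Java variants above are equivalent for ASCII-only tests; a quick comparison sketch (String.matches anchors implicitly, so the ^ and $ are optional):

```java
public class AsciiRegex {
    public static void main(String[] args) {
        String ascii = "Hello, world! (123)";
        String mixed = "café"; // é is U+00E9, outside ASCII
        System.out.println(ascii.matches("\\p{ASCII}*"));        // true
        System.out.println(ascii.matches("[\\u0000-\\u007F]*")); // true
        System.out.println(ascii.matches("[ -~]*"));             // true
        System.out.println(mixed.matches("\\p{ASCII}*"));        // false
    }
}
```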
I think the question is about extracting the ASCII characters from a raw string which has both ASCII and special characters...
public String getOnlyASCII(String raw) {
    Pattern asciiPattern = Pattern.compile("\\p{ASCII}*$");
    Matcher matcher = asciiPattern.matcher(raw);
    String asciiString = null;
    if (matcher.find()) {
        asciiString = matcher.group();
    }
    return asciiString;
}
The above method drops the leading non-ASCII characters and returns the trailing ASCII portion of the string. Thanks to @Oleg Pavliv for the pattern.
For ex:
raw = ��+919986774157
asciiString = +919986774157
