I have the following characters:
Ą¢¥ŚŠŞŤŹŽŻąľśšşťźžżÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ
I need to convert them to:
AcYSSSTZZZalssstzzzAAAAAAACEEEEIIIIDNOOOOOOUUUUYTSaaaaaaaceeeeiiiionoooooouuuuyty
I am using Java 1.4.
Normalizer.decompose(text, true, 0).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
only replaces characters with diacritics.
Characters like ¢¥ÆÐÞßæðøþ are not getting converted.
How can I do that? What is an efficient way to do the conversion in JDK 1.4?
Please help.
Regards,
Sridevi
Check out the ICU project, especially the icu4j part.
The Transliterator class will solve your problem.
Here is an example of a Transliterator that converts any script to Latin chars and removes any accents and non-ASCII chars:
Transliterator accentsConverter = Transliterator.getInstance("Any-Latin; NFD; [:M:] Remove; NFC; [^\\p{ASCII}] Remove");
The Any-Latin part performs the conversion, NFD; [:M:] Remove; NFC removes the accents, and [^\\p{ASCII}] Remove removes any remaining non-ASCII chars.
You just call accentsConverter.transliterate(yourString) to get the results.
You can read more about how to build the transformation ID (the parameter of Transliterator.getInstance) in the ICU Transformations guide.
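For completeness, here is a minimal self-contained sketch of that approach (the Asciifier class name and toAscii method are made up for illustration; it assumes icu4j is on the classpath):
import com.ibm.icu.text.Transliterator;

public class Asciifier {
    // Build the transliterator once; instances are relatively expensive to create.
    private static final Transliterator ACCENTS_CONVERTER = Transliterator.getInstance(
            "Any-Latin; NFD; [:M:] Remove; NFC; [^\\p{ASCII}] Remove");

    public static String toAscii(String input) {
        return ACCENTS_CONVERTER.transliterate(input);
    }

    public static void main(String[] args) {
        // Accented Latin letters are stripped to their base letters;
        // anything still non-ASCII after that is removed.
        System.out.println(toAscii("ĄŚŠŞŤŹŽŻ àáâãä"));
    }
}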
How can I do that? What is an efficient way to do the conversion in JDK 1.4?
The most efficient way is to use a lookup table implemented as either an array or a HashMap. But, of course, you need to populate the table.
Characters like ¢¥ÆÐÞßæðøþ are not getting converted.
Well, none of those characters is really a Roman letter, and they can't be translated to Roman letters ... without taking outrageous liberties with the semantics. For example:
¢ and ¥ are currency symbols,
Æ and æ are ligatures that in some languages represent two letters, and in others are a distinct letter,
ß is the German representation of a double s.
I would do something like this:
UPDATED FOR Java 1.4 (removed generics)
import java.util.HashMap;

public class StringConverter {
    char[] source = new char[] {'Ą', '¢', '¥', 'Ś'}; // all your chars here...
    char[] target = new char[] {'A', 'c', 'Y', 'S'}; // all your chars here...

    // Build a map from source char to target char (no generics in Java 1.4)
    HashMap map;

    public StringConverter() {
        map = new HashMap();
        for (int i = 0; i < source.length; i++) {
            map.put(new Character(source[i]), new Character(target[i]));
        }
    }

    public String convert(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            // Look up the replacement; leave the char alone if it is not mapped
            Character replacement = (Character) map.get(new Character(chars[i]));
            if (replacement != null) {
                chars[i] = replacement.charValue();
            }
        }
        return new String(chars);
    }
}
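Usage is then straightforward (output shown for the four sample mappings above; the real table needs every character pair filled in):
StringConverter converter = new StringConverter();
System.out.println(converter.convert("Ą¢¥Ś")); // prints "AcYS"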
I'm using Java 6. Using an Amazon AWS library, I'm dynamically creating domains. However, I'm looking for a function that can strip out illegal characters from a subdomain. E.g. if my function were about to create
dave'ssite.mydomain.com
I would like to pass the string "dave'ssite" to some function, which would strip out the apostrophe, or whatever other illegal characters lurked in the subdomain.
How do I do that? The more specific question is: how do I identify what the illegal subdomain characters are?
Subdomains are the same as domains, so most likely the allowed characters are A-Z, a-z, 0-9 and -. Therefore you can use a regex.
...
String s = "dave's-site.mydomain.com";
//prints daves-sitemydomaincom
System.out.println(s.replaceAll("[^A-Za-z0-9\\-]",""));
...
Here is something I made for a game. It's basically the same thing, except I used it to remove invalid characters from a username.
// Put all valid characters in this array; for subdomains, that means all letters, digits and '-'
char[] validChars = {'a', 'b', 'c', 'd' /* etc... */};

public static String cleanString(String text) {
    StringBuilder sb = new StringBuilder("");
    for (int i = 0; i < text.length(); i++) {
        for (int j = 0; j < validChars.length; j++) {
            if (validChars[j] == text.charAt(i)) {
                sb.append(text.charAt(i));
                break;
            }
        }
    }
    return sb.toString();
}
As the comment in the code says, the char array contains all the valid characters, and anything else will be removed. Keep in mind that this method returns the cleaned string rather than modifying its argument.
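For example, assuming validChars has been filled in with the letters, digits and '-':
System.out.println(cleanString("dave's-site")); // prints "daves-site"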
I stumbled on here looking for a C# solution to the same problem, and well, here it is. Look how elegant C# LINQ makes this ;)
if (model.UserName.All(char.IsLetterOrDigit) && !model.UserName.StartsWith("-"))
{
//oh yeah, valid subdomain
}
Not sure what the spec says about length though.
I have a string variable which is a paragraph containing both English and Japanese words.
I want to split Japanese from English.
So I want to use the Unicode value to decide whether each character falls into \u0000 ~ \u007F (the Basic Latin Unicode block).
But I don't know how to write the Java code to get a char's Unicode value, or how to compare Unicode values.
Can anyone give me a sample?
public void split(String str) {
    char[] cstr = str.toCharArray();
    String en = "";
    String jp = "";
    for (char c : cstr) {
        // (1) To Unicode?
        // (2) How to check whether it falls into \u0000 ~ \u007F?
        if (is_en) en += c;
        else jp += c;
    }
}
Assuming the string you have is 16-bit Unicode, and that you aren't trying to go to full Unicode, you can use:
if ('\u0000' <= c && c <= '\u007f') {
    // c is English
} else {
    // c is other
}
I don't know, however, that this does exactly what you want. Many of the characters in that range are actually punctuation, for instance. And I found a reference here to a set of Unicode characters that are a mix of Roman and "half-width kanji". Just be aware that actually differentiating between all the Unicode characters that might represent English letters and all others might not be this simple; it will depend on your environment.
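Putting that range check into the skeleton from the question gives something like this (a rough sketch; as noted above, treating everything outside Basic Latin as Japanese is only an approximation):
public void split(String str) {
    StringBuffer en = new StringBuffer();
    StringBuffer jp = new StringBuffer();
    char[] cstr = str.toCharArray();
    for (int i = 0; i < cstr.length; i++) {
        char c = cstr[i];
        if ('\u0000' <= c && c <= '\u007f') {
            en.append(c); // Basic Latin: treat as English
        } else {
            jp.append(c); // everything else: treat as Japanese
        }
    }
    System.out.println("en: " + en + " / jp: " + jp);
}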
I have a method like the one below (please ignore the code optimization issues). The method swaps the given Unicode character (a Bengali character) with the character that precedes it:
static String swap(String temp, char c)
{
    Integer length = temp.length();
    char[] charArray = temp.toCharArray();
    for (int u = 0; u < length; u++)
    {
        if (charArray[u] == c)
        {
            char g = charArray[u];
            charArray[u] = charArray[u - 1];
            charArray[u - 1] = g;
        }
    }
    String string2 = new String(charArray);
    return string2;
}
While debugging, I could see that the values of charArray were in the sequence I want. But after the method executes, the value stored in the String variable is mismatched. I want to display the string as "রেরেরে" but it is displaying "েরেরের", which is not what I want. Please tell me what I am doing wrong.
Note - I don't know Bengali, but I know a bit (or a lot, depending on whom you ask) about Unicode and how Java supports it. The answer assumes knowledge of the latter and not the former.
Going by the Unicode 6.0 Bengali chart, রে is a combination of the dependent vowel sign ে (0x09C7) and the consonant র (0x09B0) and is represented as a sequence of two characters in the character array.
If you are getting the dependent vowel sign alone in the resulting character sequence (and hence the string), then your optimization is likely to be kooky, as it appears to assume that Bengali characters in Unicode can be represented as a single Unicode codepoint or a single char variable in Java; this would result in the scenario where a consonant would be replaced by another consonant, but the dependent vowel preceding the consonant would never be replaced.
I think a correct optimization must therefore consider the presence of dependent vowels and compare the following consonant in addition to the vowel, i.e. it must compare two characters in the character array instead of comparing individual characters. This might also imply that your method signature must be changed to allow a char[] to be passed instead of a single char, so that Bengali characters can be replaced with the intended Bengali character, instead of replacing one Unicode codepoint with another, which is what is being done currently.
The notes in other answers on the ArrayIndexOutOfBoundsException are valid. The following example, which uses your character replacement algorithm, demonstrates that not only is your algorithm incorrect, but it is also quite possible for the exception to be thrown:
class CodepointReplacer
{
    public static void main(String[] args)
    {
        String str1 = "রেরেরে";
        /*
         * The following is a linguistically invalid sequence,
         * but Java does not concern itself with linguistic correctness
         * if the String or char sequence has been constructed incorrectly.
         */
        String str2 = "েরেরের";
        /*
         * Replacement character র for our strings.
         * It is not রে as one would anticipate.
         */
        char c = str1.charAt(1);
        optimizeKookily(str1, c);
        optimizeKookily(str2, c);
    }

    private static void optimizeKookily(String temp, char c)
    {
        Integer length = temp.length();
        char[] charArray = temp.toCharArray();
        for (int u = 0; u < length; u++)
        {
            if (charArray[u] == c)
            {
                char g = charArray[u];
                charArray[u] = charArray[u - 1]; // throws ArrayIndexOutOfBoundsException on the second invocation of this method
                charArray[u - 1] = g;
            }
        }
    }
}
A better character replacement strategy would therefore be to use the String.replace (the CharSequence variant) or String.replaceAll functions, assuming that you would know how to use these with Bengali characters.
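For instance, a minimal sketch (the replacement sequence "কে" is purely illustrative):
String temp = "রেরেরে";
// Replace the full two-char cluster (consonant + dependent vowel) as a unit,
// instead of swapping individual chars.
String result = temp.replace("রে", "কে");
System.out.println(result); // prints "কেকেকে"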
The problem is in
for (int u = 0; u < length; u++)
{
    if (charArray[u] == c)
    {
        char g = charArray[u];
        charArray[u] = charArray[u - 1];
        charArray[u - 1] = g;
    }
}
See, when u=0, charArray[u-1] refers to index -1. Modify your for loop, or just add a condition for the case where u=0.
Your code will cause an ArrayIndexOutOfBoundsException.
When u=0, charArray[u-1] accesses index -1.
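A minimal fix is to start the loop at 1, so charArray[u - 1] is always a valid index (a sketch; whether skipping a match at position 0 is linguistically right is a separate question):
for (int u = 1; u < length; u++)
{
    if (charArray[u] == c)
    {
        char g = charArray[u];
        charArray[u] = charArray[u - 1];
        charArray[u - 1] = g;
    }
}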
I want to detect and remove high-ASCII characters like ®, ©, ™ from a String in Java. Is there any open-source library that can do this?
If you need to remove all non-US-ASCII (i.e. outside 0x0-0x7F) characters, you can do something like this:
s = s.replaceAll("[^\\x00-\\x7f]", "");
If you need to filter many strings, it would be better to use a precompiled pattern:
private static final Pattern nonASCII = Pattern.compile("[^\\x00-\\x7f]");
...
s = nonASCII.matcher(s).replaceAll("");
And if it's really performance-critical, perhaps Alex Nikolaenkov's suggestion would be better.
I think that you can easily filter your string by hand, checking the code of each particular character. If it fits your requirements, append it to a StringBuilder and call toString() at the end.
public static String filter(String str) {
    StringBuilder filtered = new StringBuilder(str.length());
    for (int i = 0; i < str.length(); i++) {
        char current = str.charAt(i);
        // keep only printable US-ASCII (space through tilde)
        if (current >= 0x20 && current <= 0x7e) {
            filtered.append(current);
        }
    }
    return filtered.toString();
}
A nice way to do this is to use Google Guava CharMatcher:
String newString = CharMatcher.ASCII.retainFrom(string);
newString will contain only the ASCII characters (code point < 128) from the original string.
This reads more naturally than a regular expression; regular expressions can take more effort for subsequent readers of your code to understand.
I understand that you need to delete ç, ã, Ã, but for everybody who needs to convert ç, ã, Ã ---> c, a, A, please have a look at this piece of code:
Example Code:
final String input = "Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ";
System.out.println(
        Normalizer
                .normalize(input, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "")
);
Output:
This is a funky String
With the help of tucuxi from the existing post Java remove HTML from String without regular expressions, I have built a method that will parse any basic HTML tags out of a string. Sometimes, however, the original string contains HTML character escapes, such as the one for é (an accented e). I have started to add functionality to translate these escaped characters into real characters.
You're probably asking: why not use regular expressions, or a third-party library? Unfortunately I cannot, as I am developing on a BlackBerry platform which does not support regular expressions, and I have never been able to successfully add a third-party library to my project.
So, I have gotten to the point where any escaped é is replaced with "e". My question now is: how do I add an actual accented e to a string?
Here is my code:
public static String removeHTML(String synopsis) {
    char[] cs = synopsis.toCharArray();
    String sb = new String();
    boolean tag = false;
    for (int i = 0; i < cs.length; i++) {
        switch (cs[i]) {
        case '<':
            if (!tag) {
                tag = true;
                break;
            }
        case '>':
            if (tag) {
                tag = false;
                break;
            }
        case '&':
            // grab the next few chars and compare them against the escape sequence for é
            char[] copyTo = new char[7];
            System.arraycopy(cs, i, copyTo, 0, 7);
            String result = new String(copyTo);
            if (result.equals("é")) {
                sb += "e";
            }
            i += 7;
            break;
        default:
            if (!tag)
                sb += cs[i];
        }
    }
    return sb.toString();
}
Thanks!
Java Strings are Unicode.
sb += '\u00E9'; // lowercase e with acute accent (é)
sb += '\u00C9'; // uppercase E with acute accent (É)
You can print out just about any character you like in Java as it uses the Unicode character set.
To find the character you want take a look at the charts here:
http://www.unicode.org/charts/
In the Latin Supplement document you'll see all the unicode numbers for the accented characters. You should see the hex number 00E9 listed for é for example. The numbers for all Latin accented characters are in this document so you should find this pretty useful.
To use such a character in a String, just use the Unicode escape sequence of \u followed by the character code, like so:
System.out.print("Let's go to the caf\u00E9");
Would produce: "Let's go to the café"
Depending on which version of Java you're using, you might also find StringBuilder (or StringBuffer if you're multi-threaded) more efficient than using the + operator to concatenate Strings.
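For example, a quick sketch of the StringBuilder approach:
StringBuilder sb = new StringBuilder();
sb.append("Let's go to the caf").append('\u00E9'); // appends é as a single char
System.out.println(sb.toString()); // prints "Let's go to the café"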
Try this:
if (result.equals("é")) {
    sb += (char) 130;
}
instead of
if (result.equals("é")) {
    sb += "e";
}
The thing is that you're not adding an accent to the top of the 'e' character; rather, that is a separate character altogether. This site lists out the ASCII codes for characters.
For a table of accented characters in Java, take a look at this reference.
To decode the HTML part, use StringEscapeUtils from Apache Commons Lang:
import org.apache.commons.lang.StringEscapeUtils;
...
String withCharacters = StringEscapeUtils.unescapeHtml(yourString);
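For example (assuming commons-lang 2.x on the classpath):
String decoded = StringEscapeUtils.unescapeHtml("caf&eacute;");
System.out.println(decoded); // prints "café"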
See also this Stack Overflow thread:
Replace HTML codes with equivalent characters in Java