I have a huge file that contains a lot of illegal characters, like the ones in the image below, but these are not all of them. There are many different kinds, so it's not possible to search for and replace each one.
Is there a way I can remove these characters? I've tried a lot of solutions, such as converting to ANSI or using regular expressions, but they didn't work. Please help.
EDIT: Even if anyone can tell me how to remove these characters in Java, that will be fine too.
Instead of removing specific characters it's easier to implement a white-list filter if you know which types of characters you are expecting.
As per this answer, which explains how to remove emoticons you can try:
String characterFilter = "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]";
String emotionless = aString.replaceAll(characterFilter, "");
To understand what \p{} groups are available look at Classes for Unicode scripts, blocks, categories and binary properties docs:
\p{IsLatin} A Latin script character (script)
\p{InGreek} A character in the Greek block (block)
\p{Lu} An uppercase letter (category)
\p{IsAlphabetic} An alphabetic character (binary property)
\p{Sc} A currency symbol
\P{InGreek} Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]] Any letter except an uppercase letter (subtraction)
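As a runnable illustration of the white-list idea, here is a minimal sketch that applies the same filter to a sample string. The class name WhitelistFilter, the helper clean and the sample input are made up for this example; adjust the allowed categories to whatever your data legitimately contains.

```java
public class WhitelistFilter {
    // Same character class as above: matches anything that is NOT a letter,
    // mark, number, punctuation, separator, format char, surrogate or whitespace.
    private static final String NOT_ALLOWED =
            "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]";

    // Hypothetical helper: removes every character outside the white-list.
    public static String clean(String input) {
        return input.replaceAll(NOT_ALLOWED, "");
    }

    public static void main(String[] args) {
        // Sample input containing a control character (U+0001) and a symbol (U+00A9).
        String dirty = "caf\u00e9 \u0001ok\u00a9";
        System.out.println(clean(dirty));
    }
}
```

With this input the control character and the copyright sign are dropped, while the accented letter and the space survive.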
I'm using JFlex and I have to recognize characters, which can be:
Normal chars, like 'a'
Numbers, like '\126'
I've made this regular expression (Integer is a macro already defined):
Character = (\'.\')|(\'\\{Integer}\')
I don't know if it's OK, but my real problem is that I don't know what code I have to write to turn both types of strings into Characters, because this doesn't work:
{Character} { this.yylval = new Character(yytext());
return Parser.CHARACTER; }
Any idea?
You have to write valid Java: the only constructor for Character is Character(char) but you are invoking Character(String).
You need to extract what you want from yytext().
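One way to do that extraction, as a minimal sketch: strip the surrounding quotes from yytext() and, for the backslash form, parse the digits. The helper name CharLiteral.parse is hypothetical, and it assumes the number after the backslash is a decimal character code; if your language treats it as octal, pass radix 8 to Integer.parseInt.

```java
public class CharLiteral {
    // Hypothetical helper: converts the matched text, e.g. 'a' or '\126',
    // into the character it denotes.
    public static char parse(String text) {
        // Drop the opening and closing single quotes.
        String body = text.substring(1, text.length() - 1);
        if (body.length() > 1 && body.charAt(0) == '\\') {
            // Numeric form: assumption is that the digits are a decimal code.
            return (char) Integer.parseInt(body.substring(1));
        }
        // Plain form: exactly one character between the quotes.
        return body.charAt(0);
    }

    public static void main(String[] args) {
        System.out.println(parse("'a'"));
        System.out.println(parse("'\\126'"));
    }
}
```

In the lexer action this could then be called as CharLiteral.parse(yytext()), assuming yylval can hold a boxed Character.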
I have a Java String like this: "peque\u00f1o". Note that it has an embedded Unicode character: '\u00f1'.
Is there a method in Java that will replace these Unicode character sequences with the actual characters? That is, a method that would return "pequeño" if you gave it "peque\u00f1o" as input?
Note that I have a string that has 12 chars (those that we see, that happen to be in the ASCII range).
Actually the string is "pequeño".
String s = "peque\u00f1o";
System.out.println(s.length());
System.out.println(s);
yields
7
pequeño
i.e. seven chars and the correct representation on System.out.
I remember giving the same response last week, use org.apache.commons.lang.StringEscapeUtils.
If you have the appropriate fonts, a println or setting the string in a JLabel or JTextArea should do the trick. The escaping is only for the compiler.
If you plan to copy-paste the readable strings into source code, remember to also choose a suitable file encoding such as UTF-8.
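If the string really does contain the six literal characters backslash, u, 0, 0, f, 1 (for example because it was read from a file rather than typed as a Java literal), the escapes have to be decoded at run time. A minimal sketch using only the standard library follows. The class and method names are made up here, and it handles only the four-hex-digit form, not other Java escapes; StringEscapeUtils.unescapeJava from Commons Lang covers the general case.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeUnescape {
    // Matches a literal backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Hypothetical helper: replaces each escape sequence with the character it names.
    public static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '$' or '\' in the decoded text.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "peque\\u00f1o";            // 12 chars: the escape is literal text
        System.out.println(raw.length());
        System.out.println(unescape(raw).length()); // 7 chars after decoding
    }
}
```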
Testing out someone else's code, I noticed a few JSP pages printing funky non-ASCII characters. Taking a dip into the source I found this tidbit:
// remove any periods from first name e.g. Mr. John --> Mr John
firstName = firstName.trim().replace('.','\0');
Does replacing a character in a String with a null character even work in Java? I know that '\0' will terminate a C-string. Would this be the culprit to the funky characters?
Does replacing a character in a String with a null character even work in Java? I know that '\0' will terminate a c-string.
That depends on how you define what is working. Does it replace all occurrences of the target character with '\0'? Absolutely!
String s = "food".replace('o', '\0');
System.out.println(s.indexOf('\0')); // "1"
System.out.println(s.indexOf('d')); // "3"
System.out.println(s.length()); // "4"
System.out.println(s.hashCode() == 'f'*31*31*31 + 'd'); // "true"
Everything seems to work fine to me! indexOf can find it, it counts as part of the length, and its value for hash code calculation is 0; everything is as specified by the JLS/API.
It DOESN'T work if you expect replacing a character with the null character would somehow remove that character from the string. Of course it doesn't work like that. A null character is still a character!
String s = Character.toString('\0');
System.out.println(s.length()); // "1"
assert s.charAt(0) == 0;
It also DOESN'T work if you expect the null character to terminate a string. It's evident from the snippets above, but it's also clearly specified in JLS (10.9. An Array of Characters is Not a String):
In the Java programming language, unlike C, an array of char is not a String, and neither a String nor an array of char is terminated by '\u0000' (the NUL character).
Would this be the culprit to the funky characters?
Now we're talking about an entirely different thing, i.e. how the string is rendered on screen. Truth is, even "Hello world!" will look funky if you use a dingbats font. A Unicode string may look funky in one locale but not in another. Even a properly rendered Unicode string containing, say, Chinese characters, may still look funky to someone from, say, Greenland.
That said, the null character will probably look funky regardless; it is usually not a character that you want to display. Still, since the null character is not a string terminator, Java is more than capable of handling it one way or another.
Now to address what we assume is the intended effect, i.e. removing all periods from a string: the simplest solution is to use the replace(CharSequence, CharSequence) overload.
System.out.println("A.E.I.O.U".replace(".", "")); // AEIOU
The replaceAll solution is mentioned here too, but that works with a regular expression, which is why you need to escape the dot metacharacter, and it is likely to be slower.
It should probably be changed to
firstName = firstName.trim().replaceAll("\\.", "");
I think that is the case. To erase the character, you should use replace(".", "") instead.
Does replacing a character in a String
with a null character even work in
Java?
No.
Would this be the culprit to the funky characters?
Quite likely.
This does cause "funky characters":
System.out.println( "Mr. Foo".trim().replace('.','\0'));
produces:
Mr[] Foo
in my Eclipse console, where the [] is shown as a square box. As others have posted, use String.replace().
The problem is that, as you know, there are thousands of characters in the Unicode chart, and I want to convert all the similar-looking characters to the letters of the English alphabet.
For instance here are a few conversions:
ҥ->H
Ѷ->V
Ȳ->Y
Ǭ->O
Ƈ->C
tђє Ŧค๓เℓy --> the Family
...
I also saw that there are more than 20 versions of the letter A/a, and I don't know how to classify them. They look like needles in a haystack.
The complete list of unicode chars is at http://www.ssec.wisc.edu/~tomw/java/unicode.html or http://unicode.org/charts/charindex.html . Just try scrolling down and see the variations of letters.
How can I convert all these with Java? Please help me :(
Reposting my post from How do I remove diacritics (accents) from a string in .NET?
This method works fine in java (purely for the purpose of removing diacritical marks aka accents).
It basically converts all accented characters into their deAccented counterparts followed by their combining diacritics. Now you can use a regex to strip off the diacritics.
import java.text.Normalizer;
import java.util.regex.Pattern;
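public String deAccent(String str) {
    // NFD splits each accented character into its base letter plus combining marks.
    String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
    // Remove the combining marks, leaving only the base letters.
    Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    return pattern.matcher(nfdNormalizedString).replaceAll("");
}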
It's a part of Apache Commons Lang as of ver. 3.0.
org.apache.commons.lang3.StringUtils.stripAccents("Añ");
returns An
Also see http://www.drillio.com/en/software-development/java/removing-accents-diacritics-in-any-language/
Attempting to "convert them all" is the wrong approach to the problem.
Firstly, you need to understand the limitations of what you are trying to do. As others have pointed out, diacritics are there for a reason: they are essentially unique letters in the alphabet of that language with their own meaning / sound etc.: removing those marks is just the same as replacing random letters in an English word. This is before you even go onto consider the Cyrillic languages and other script based texts such as Arabic, which simply cannot be "converted" to English.
If you must, for whatever reason, convert characters, then the only sensible way to approach this is to first reduce the scope of the task at hand. Consider the source of the input - if you are coding an application for "the Western world" (to use as good a phrase as any), it would be unlikely that you would ever need to parse Arabic characters. Similarly, the Unicode character set contains hundreds of mathematical and pictorial symbols: there is no (easy) way for users to directly enter these, so you can assume they can be ignored.
By taking these logical steps you can reduce the number of possible characters to parse to the point where a dictionary based lookup / replace operation is feasible. It then becomes a small amount of slightly boring work creating the dictionaries, and a trivial task to perform the replacement. If your language supports native Unicode characters (as Java does) and optimises static structures correctly, such find and replaces tend to be blindingly quick.
This comes from experience of having worked on an application that was required to allow end users to search bibliographic data that included diacritic characters. The lookup arrays (as it was in our case) took perhaps 1 man day to produce, to cover all diacritic marks for all Western European languages.
Since the encoding that turns "the Family" into "tђє Ŧค๓เℓy" is effectively random and not following any algorithm that can be explained by the information of the Unicode codepoints involved, there's no general way to solve this algorithmically.
You will need to build the mapping of Unicode characters into latin characters which they resemble. You could probably do this with some smart machine learning on the actual glyphs representing the Unicode codepoints. But I think the effort for this would be greater than manually building that mapping. Especially if you have a good amount of examples from which you can build your mapping.
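A hand-built mapping of this kind can be sketched as follows. The table covers only the look-alike characters from the "tђє Ŧค๓เℓy" example in the question; the class and method names are invented for the illustration, and which Latin letter each character "resembles" is a subjective choice made here, not something Unicode defines.

```java
import java.util.HashMap;
import java.util.Map;

public class LookalikeMapper {
    // Hand-built table: look-alike character -> intended Latin letter.
    // Only the characters from the example are covered.
    private static final Map<Character, Character> LOOKALIKES = new HashMap<>();
    static {
        LOOKALIKES.put('\u0452', 'h'); // CYRILLIC SMALL LETTER DJE
        LOOKALIKES.put('\u0454', 'e'); // CYRILLIC SMALL LETTER UKRAINIAN IE
        LOOKALIKES.put('\u0166', 'F'); // LATIN CAPITAL LETTER T WITH STROKE
        LOOKALIKES.put('\u0E04', 'a'); // THAI CHARACTER KHO KHWAI
        LOOKALIKES.put('\u0E53', 'm'); // THAI DIGIT THREE
        LOOKALIKES.put('\u0E40', 'i'); // THAI CHARACTER SARA E
        LOOKALIKES.put('\u2113', 'l'); // SCRIPT SMALL L
    }

    // Replaces every character found in the table; all others pass through unchanged.
    public static String toLatin(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            Character mapped = LOOKALIKES.get(c);
            out.append(mapped != null ? mapped : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The example string from the question, written with escapes.
        System.out.println(toLatin("t\u0452\u0454 \u0166\u0E04\u0E53\u0E40\u2113y"));
    }
}
```

For the example string this prints "the Family". Every further look-alike has to be added to the table by hand, which is exactly the manual effort described above.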
To clarify: a few of the substitutions can actually be solved via the Unicode data (as the other answers demonstrate), but some letters simply have no reasonable association with the latin characters which they resemble.
Examples:
"ђ" (U+0452 CYRILLIC SMALL LETTER DJE) is more related to "d" than to "h", but is used to represent "h".
"Ŧ" (U+0166 LATIN CAPITAL LETTER T WITH STROKE) is somewhat related to "T" (as the name suggests) but is used to represent "F".
"ค" (U+0E04 THAI CHARACTER KHO KHWAI) is not related to any latin character at all and in your example is used to represent "a"
String tested : ÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝß
Tested :
Output from Apache Commons Lang3 : AAAAAÆCEEEEIIIIÐNOOOOOØUUUUYß
Output from ICU4j : AAAAAÆCEEEEIIIIÐNOOOOOØUUUUYß
Output from JUnidecode : AAAAAAECEEEEIIIIDNOOOOOOUUUUUss (problem with Ý and another issue)
Output from Unidecode : AAAAAAECEEEEIIIIDNOOOOOOUUUUYss
The last choice is the best.
The original request has been answered already.
However, I am posting the below answer for those who might be looking for generic transliteration code to transliterate any charset to Latin/English in Java.
Naive meaning of transliteration:
The translated string in its final form/target charset sounds like the string in its original form.
If we want to transliterate any charset to Latin (the English alphabet), then ICU (the ICU4J library in Java) will do the job.
Here is the code snippet in java:
import com.ibm.icu.text.Transliterator; // ICU4J library import

// The wrapper class name is illustrative; the original snippet showed only the members.
public class TransliterationUtil {
    public static final String TRANSLITERATE_ID = "NFD; Any-Latin; NFC";
    public static final String NORMALIZE_ID = "NFD; [:Nonspacing Mark:] Remove; NFC";

    /**
     * Returns the input transliterated to Latin script, with nonspacing marks removed.
     */
    public static String transliterate(String input) {
        Transliterator transliterator = Transliterator.getInstance(TRANSLITERATE_ID + "; " + NORMALIZE_ID);
        return transliterator.transliterate(input);
    }
}
If the need is to convert "òéışöç->oeisoc", you can use this as a starting point:
public class AsciiUtils {
    private static final String PLAIN_ASCII =
          "AaEeIiOoUu"    // grave
        + "AaEeIiOoUuYy"  // acute
        + "AaEeIiOoUuYy"  // circumflex
        + "AaOoNn"        // tilde
        + "AaEeIiOoUuYy"  // umlaut
        + "Aa"            // ring
        + "Cc"            // cedilla
        + "OoUu"          // double acute
        ;

    private static final String UNICODE =
          "\u00C0\u00E0\u00C8\u00E8\u00CC\u00EC\u00D2\u00F2\u00D9\u00F9"
        + "\u00C1\u00E1\u00C9\u00E9\u00CD\u00ED\u00D3\u00F3\u00DA\u00FA\u00DD\u00FD"
        + "\u00C2\u00E2\u00CA\u00EA\u00CE\u00EE\u00D4\u00F4\u00DB\u00FB\u0176\u0177"
        + "\u00C3\u00E3\u00D5\u00F5\u00D1\u00F1"
        + "\u00C4\u00E4\u00CB\u00EB\u00CF\u00EF\u00D6\u00F6\u00DC\u00FC\u0178\u00FF"
        + "\u00C5\u00E5"
        + "\u00C7\u00E7"
        + "\u0150\u0151\u0170\u0171"
        ;

    // private constructor, can't be instantiated!
    private AsciiUtils() { }

    // remove accented characters from a string and replace them with their ASCII equivalents
    public static String convertNonAscii(String s) {
        if (s == null) return null;
        StringBuilder sb = new StringBuilder();
        int n = s.length();
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            // the two tables are aligned: position in UNICODE = position in PLAIN_ASCII
            int pos = UNICODE.indexOf(c);
            if (pos > -1) {
                sb.append(PLAIN_ASCII.charAt(pos));
            }
            else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String args[]) {
        String s =
            "The result : È,É,Ê,Ë,Û,Ù,Ï,Î,À,Â,Ô,è,é,ê,ë,û,ù,ï,î,à,â,ô,ç";
        System.out.println(AsciiUtils.convertNonAscii(s));
        // output :
        // The result : E,E,E,E,U,U,I,I,A,A,O,e,e,e,e,u,u,i,i,a,a,o,c
    }
}
The JDK 1.6 provides the java.text.Normalizer class that can be used for this task.
See an example here
The problem with "converting" arbitrary Unicode to ASCII is that the meaning of a character is culture-dependent. For example, “ß” to a German-speaking person should be converted to "ss" while an English-speaker would probably convert it to “B”.
Add to that the fact that Unicode has multiple code points for the same glyphs.
The upshot is that the only way to do this is create a massive table with each Unicode character and the ASCII character you want to convert it to. You can take a shortcut by normalizing characters with accents to normalization form KD, but not all characters normalize to ASCII. In addition, Unicode does not define which parts of a glyph are "accents".
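The normalization shortcut and its limits can be seen in a minimal sketch using java.text.Normalizer. The class and method names are invented for the example, and treating every character in the Unicode "Mark" category as an accent is an assumption of this sketch, not something Unicode itself defines.

```java
import java.text.Normalizer;

public class NfkdFold {
    // Decomposes to NFKD, then drops all combining marks (category M).
    public static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("\u00C0\u00E9")); // A with grave, e with acute
        System.out.println(fold("\uFB01"));       // the "fi" ligature
        System.out.println(fold("\u00DF"));       // sharp s has no decomposition
    }
}
```

The accented letters and the ligature come out as plain ASCII ("Ae" and "fi"), but the sharp s passes through unchanged, which is why a lookup table is still needed for such characters.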
Here is a tiny excerpt from an app that does this:
switch (c)
{
    case 'A':
    case '\u00C0': // À LATIN CAPITAL LETTER A WITH GRAVE
    case '\u00C1': // Á LATIN CAPITAL LETTER A WITH ACUTE
    case '\u00C2': // Â LATIN CAPITAL LETTER A WITH CIRCUMFLEX
    // and so on for about 20 lines...
        return "A";
    case '\u00C6': // Æ LATIN CAPITAL LETTER AE
        return "AE";
    // And so on for pages...
}
You could try using unidecode, which is available as a Ruby gem and as a Perl module on CPAN. Essentially, it works as a huge lookup table, where each Unicode code point relates to an ASCII character or string.
There is no easy or general way to do what you want because it is just your subjective opinion that these letters look like the Latin letters you want to convert to. They are actually separate letters with their own distinct names and sounds which just happen to superficially look like a Latin letter.
If you want that conversion, you have to create your own translation table based on what Latin letters you think the non-Latin letters should be converted to.
(If you only want to remove diacritical marks, there are some answers in this thread: How do I remove diacritics (accents) from a string in .NET? However, you describe a more general problem.)
I'm late to the party, but after facing this issue today, I found this answer to be very good:
String asciiName = Normalizer.normalize(unicodeName, Normalizer.Form.NFD)
.replaceAll("[^\\p{ASCII}]", "");
Reference:
https://stackoverflow.com/a/16283863
The following class does the trick:
org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter