How can non-ASCII characters be removed from a string? - java

I have strings "A função", "Ãugent" in which I need to replace characters like ç, ã, and à with empty strings.
How can I remove those non-ASCII characters from my string?
I have attempted to implement this using the following function, but it is not working properly. One problem is that the unwanted characters are getting replaced by the space character.
public static String matchAndReplaceNonEnglishChar(String tmpsrcdta) {
String newsrcdta = null;
char array[] = Arrays.stringToCharArray(tmpsrcdta);
if (array == null)
return newsrcdta;
for (int i = 0; i < array.length; i++) {
int nVal = (int) array[i];
boolean bISO =
// Is character ISO control
Character.isISOControl(array[i]);
boolean bIgnorable =
// Is Ignorable identifier
Character.isIdentifierIgnorable(array[i]);
// Remove tab and other unwanted characters..
if (nVal == 9 || bISO || bIgnorable)
array[i] = ' ';
else if (nVal > 255)
array[i] = ' ';
}
newsrcdta = Arrays.charArrayToString(array);
return newsrcdta;
}

This finds and removes every non-ASCII character:
String resultString = subjectString.replaceAll("[^\\x00-\\x7F]", "");

FailedDev's answer is good, but can be improved. If you want to preserve the ascii equivalents, you need to normalize first:
String subjectString = "öäü";
subjectString = Normalizer.normalize(subjectString, Normalizer.Form.NFD);
String resultString = subjectString.replaceAll("[^\\x00-\\x7F]", "");
=> will produce "oau"
That way, characters like "öäü" are mapped to "oau", which at least preserves some information. Without normalization, the resulting string would be empty.
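To make the effect of normalizing first concrete, here is a minimal self-contained sketch. The class name AsciiFold and the method name foldToAscii are made up for illustration, and the inputs are written with Unicode escapes only so the example does not depend on the source file encoding.

```java
import java.text.Normalizer;

class AsciiFold {
    // Decompose accented characters (NFD), then drop everything outside ASCII.
    // The second step removes the combining marks produced by the decomposition.
    static String foldToAscii(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("[^\\x00-\\x7F]", "");
    }

    public static void main(String[] args) {
        // U+00E7 is c-cedilla, U+00E3 is a-tilde, U+00C3 is A-tilde
        System.out.println(foldToAscii("A fun\u00E7\u00E3o, \u00C3ugent")); // A funcao, Augent
        // U+00DF (sharp s) has no canonical decomposition, so it is removed outright
        System.out.println(foldToAscii("Stra\u00DFe")); // Strae
    }
}
```

Characters that have no canonical decomposition, such as the sharp s above, are still dropped, so this is a partial transliteration rather than a complete one.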

This would be the Unicode solution
String s = "A função, Ãugent";
String r = s.replaceAll("\\P{InBasic_Latin}", "");
\p{InBasic_Latin} is the Unicode block that contains all characters in the range U+0000..U+007F (see regular-expressions.info)
\P{InBasic_Latin} is the negated \p{InBasic_Latin}

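You can try something like this. In Latin-1, the accented letters start at code 192 (À), so characters at or above that value are skipped. Note that this keeps the characters from 128 to 191 (for example © and ±), which are not ASCII; use val < 128 for a strict ASCII filter.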
String name = "A função";
StringBuilder result = new StringBuilder();
for(char val : name.toCharArray()) {
if(val < 192) result.append(val);
}
System.out.println("Result "+result.toString());

[Updated solution]
You can combine Normalizer (canonical decomposition) with replaceAll to strip the diacritical marks and keep the base characters:
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;
public final class NormalizeUtils {
public static String normalizeASCII(final String string) {
final String normalize = Normalizer.normalize(string, Form.NFD);
return Pattern.compile("\\p{InCombiningDiacriticalMarks}+")
.matcher(normalize)
.replaceAll("");
} ...
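Since the snippet above is cut off, here is a complete variant as a rough sketch. The class and method names (DiacriticStripper, stripDiacritics) are hypothetical and not taken from the original answer. It removes only combining marks after decomposition, so non-ASCII characters without a decomposition are kept.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class DiacriticStripper {
    // Compiled once; matches runs of characters in the Combining Diacritical Marks block
    private static final Pattern DIACRITICS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    static String stripDiacritics(String input) {
        // NFD splits each accented letter into a base letter plus combining marks
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return DIACRITICS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        // U+00E7 is c-cedilla, U+00E3 is a-tilde
        System.out.println(stripDiacritics("A fun\u00E7\u00E3o")); // A funcao
        // U+00DF (sharp s) is not a combining mark and does not decompose, so it stays
        System.out.println(stripDiacritics("Stra\u00DFe").equals("Stra\u00DFe")); // true
    }
}
```

If the result must be pure ASCII, a second pass with a non-ASCII regex would still be needed after this step.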

Or you can use the function below to remove non-ASCII characters from the string.
It also shows how the filtering works internally.
private static String removeNonASCIIChar(String str) {
StringBuilder buff = new StringBuilder();
char chars[] = str.toCharArray();
for (int i = 0; i < chars.length; i++) {
// ASCII covers the code units 0 to 127 inclusive
if (chars[i] < 128) {
buff.append(chars[i]);
}
}
return buff.toString();
}

The ASCII table contains 128 codes, with a total of 95 printable characters, of which only 52 characters are letters:
[0-127] ASCII codes
[32-126] printable characters
[48-57] digits [0-9]
[65-90] uppercase letters [A-Z]
[97-122] lowercase letters [a-z]
You can use the String.codePoints method to get a stream of the int code points of this string and filter out the non-ASCII ones (the Character.toString(int) overload used below requires Java 11 or later):
String str1 = "A função, Ãugent";
String str2 = str1.codePoints()
.filter(ch -> ch < 128)
.mapToObj(Character::toString)
.collect(Collectors.joining());
System.out.println(str2); // A funo, ugent
Or you can explicitly specify character ranges. For example, to keep only letters:
String str3 = str1.codePoints()
.filter(ch -> ch >= 'A' && ch <= 'Z'
|| ch >= 'a' && ch <= 'z')
.mapToObj(Character::toString)
.collect(Collectors.joining());
System.out.println(str3); // Afunougent
See also: How do I not take Special Characters in my Password Validation (without Regex)?
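On Java 8 through 10, where Character.toString(int) is not available, the same code point filter can be written by collecting into a StringBuilder. This is a sketch only; the class name CodePointFilter and the method name keepAscii are illustrative.

```java
class CodePointFilter {
    // Keep only code points below 128 and rebuild the string without
    // creating an intermediate String per character.
    static String keepAscii(String input) {
        return input.codePoints()
                .filter(cp -> cp < 128)
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        // U+00E7 is c-cedilla, U+00E3 is a-tilde, U+00C3 is A-tilde
        System.out.println(keepAscii("A fun\u00E7\u00E3o, \u00C3ugent")); // A funo, ugent
    }
}
```

Because it works on code points rather than char values, a supplementary character encoded as a surrogate pair is dropped as one unit.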

String s = "A função";
String stripped = s.replaceAll("\\P{ASCII}", "");
System.out.println(stripped); // Prints "A funo"
or
private static final Pattern NON_ASCII_PATTERN = Pattern.compile("\\P{ASCII}");
public static String matchAndReplaceNonEnglishChar(String tmpsrcdta) {
return NON_ASCII_PATTERN.matcher(tmpsrcdta).replaceAll("");
}
public static void main(String[] args) {
System.out.println(matchAndReplaceNonEnglishChar("A função")); // Prints "A funo"
}
Explanation
The method String.replaceAll(String regex, String replacement) replaces all matches of a given regular expression (regex) with a given replacement string. In the words of its Javadoc:
Replaces each substring of this string that matches the given regular expression with the given replacement.
Java has the "\p{ASCII}" regular expression construct which matches any ASCII character, and its inverse, "\P{ASCII}", which matches any non-ASCII character. The matched characters can then be replaced with the empty string, effectively removing them from the resulting string.
String s = "A função";
String stripped = s.replaceAll("\\P{ASCII}", "");
System.out.println(stripped); // Prints "A funo"
The full list of valid regex constructs is documented in the Pattern class.
Note: If you are going to be calling this pattern multiple times within a run, it will be more efficient to use a compiled Pattern directly, rather than String.replaceAll. This way the pattern is compiled only once and reused, rather than each time replaceAll is called:
public class AsciiStripper {
private static final Pattern NON_ASCII_PATTERN = Pattern.compile("\\P{ASCII}");
public static String stripNonAscii(String s) {
return NON_ASCII_PATTERN.matcher(s).replaceAll("");
}
}

An easily-readable, ascii-printable, streams solution:
String result = str.chars()
.filter(c -> isAsciiPrintable((char) c))
.mapToObj(c -> String.valueOf((char) c))
.collect(Collectors.joining());
private static boolean isAsciiPrintable(char ch) {
return ch >= 32 && ch < 127;
}
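To replace the unwanted characters with "_" instead of removing them, swap the filter for: .map(c -> isAsciiPrintable((char) c) ? c : '_')
Keeping the range 32 to 126 is equivalent to removing everything that matches the regex [^\\x20-\\x7E] (from a comment on the regex solution)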
Source for isAsciiPrintable: http://www.java2s.com/Code/Java/Data-Type/ChecksifthestringcontainsonlyASCIIprintablecharacters.htm
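Putting the fragments above together, a self-contained sketch of both variants (removing and masking) could look like the following. The class name PrintableAscii and the two method names are illustrative, not from the original answer.

```java
import java.util.stream.Collectors;

class PrintableAscii {
    static boolean isAsciiPrintable(char ch) {
        return ch >= 32 && ch < 127;
    }

    // Drop every char that is not printable ASCII
    static String removeNonPrintable(String str) {
        return str.chars()
                .filter(c -> isAsciiPrintable((char) c))
                .mapToObj(c -> String.valueOf((char) c))
                .collect(Collectors.joining());
    }

    // Replace every char that is not printable ASCII with an underscore
    static String maskNonPrintable(String str) {
        return str.chars()
                .map(c -> isAsciiPrintable((char) c) ? c : '_')
                .mapToObj(c -> String.valueOf((char) c))
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        // U+00E7 is c-cedilla, U+00E3 is a-tilde; the input ends with a tab
        String input = "A fun\u00E7\u00E3o\t";
        System.out.println(removeNonPrintable(input)); // A funo
        System.out.println(maskNonPrintable(input));   // A fun__o_
    }
}
```

Note that the tab is treated as non-printable here, since the range starts at the space character.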

CharMatcher.retainFrom can be used, if you're using the Google Guava library:
String s = "A função";
String stripped = CharMatcher.ascii().retainFrom(s);
System.out.println(stripped); // Prints "A funo"

Related

Removing dashes in a Java string and capitalizing the character after the dash [duplicate]

I have a String nba-west-teams blazers and I want to convert the string into a format like nbaWestTeams blazers. Essentially, I want to remove all the dashes and replace the character after each dash with its uppercase equivalent.
I know I can use the String method replaceAll to remove all the dashes, but how do I get the character after the dash and uppercase it?
// Input
String withDashes = "nba-west-teams blazers"
String removeDashes = withDashes.replaceAll(....?)
// Expected conversion
String withoutDashes = "nbaWestTeams blazers"
Check out the indexOf and the replace method of the StringBuilder class. StringBuilder allows fast editing of Strings.
When you are finished use toString.
If you need more help just make a comment.
You can use a Pattern with a regex such as \-([a-z]):
String str = "nba-west-teams blazers";
String regex = "\\-([a-z])";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
str = str.replaceFirst(matcher.group(), matcher.group(1).toUpperCase());
}
System.out.println(str);//Output = nbaWestTeams blazers
This matches the first letter after each dash and replaces the whole match (dash plus letter) with the uppercase letter.
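The same regex can also be applied in a single pass with Matcher.appendReplacement and Matcher.appendTail, which avoids calling replaceFirst on the whole string for every match. This is a sketch under the same assumption as above (only lowercase a-z after a dash is converted); the class name DashToCamel is made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DashToCamel {
    private static final Pattern DASH_LETTER = Pattern.compile("-([a-z])");

    static String toCamelCase(String input) {
        Matcher m = DASH_LETTER.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Copies the text before the match, then the uppercase letter in place of "-x"
            m.appendReplacement(out, m.group(1).toUpperCase());
        }
        // Copies whatever follows the last match
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toCamelCase("nba-west-teams blazers")); // nbaWestTeams blazers
    }
}
```

A dash that is not followed by a lowercase letter, such as a trailing dash, is left untouched by this pattern.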
You can iterate through the string and when a hyphen is found, just skip the hyphen and transform the next character to uppercase. You can use a StringBuilder to store the partial results as follows:
public static String toCamelCase(String str) {
// Guard against empty input, where charAt(length - 1) would throw
if (str.isEmpty()) return str;
// If the last char is '-', shorten the length by one to avoid reading out of bounds
final int len = str.charAt(str.length() - 1) == '-' ? str.length() - 1 : str.length();
StringBuilder builder = new StringBuilder(len);
for (int i = 0; i < len; ++i) {
char c = str.charAt(i);
if (c == '-') {
++i;
builder.append(Character.toUpperCase(str.charAt(i)));
} else {
builder.append(c);
}
}
return builder.toString();
}
You can split the string at the space and use https://github.com/google/guava/wiki/StringsExplained#caseformat to convert the dashed substring into a camel cased string.

Remove all punctuation from the end of a string

Examples:
// A B C. -> A B C
// !A B C! -> !A B C
// A? B?? C??? -> A? B?? C
Here's what I have so far:
while (endsWithRegex(word, "\\p{P}")) {
word = word.substring(0, word.length() - 1);
}
public static boolean endsWithRegex(String word, String regex) {
return word != null && !word.isEmpty() &&
word.substring(word.length() - 1).replaceAll(regex, "").isEmpty();
}
This current solution works, but since it's already calling String.replaceAll within endsWithRegex, we should be able to do something like this:
word = word.replaceAll(/* regex */, "");
Any advice?
I suggest using
\s*\p{Punct}+\s*$
It will match optional whitespace and punctuation at the end of the string.
If you do not care about the whitespace, just use \p{Punct}+$.
Do not forget that in Java strings, backslashes should be doubled to denote literal backslashes (that must be used as regex escape symbols).
Java demo
String word = "!Words word! ";
word = word.replaceAll("\\s*\\p{Punct}+\\s*$", "");
System.out.println(word); // => !Words word
You can use:
str = str.replaceFirst("\\p{P}+$", "");
To include space also:
str = str.replaceFirst("[\\p{Space}\\p{P}]+$", "");
How about this, if you can take a minor hit in efficiency:
reverse the input string
keep removing characters until you hit a letter
reverse the string again and return it
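The same idea can be sketched without actually reversing the string, by scanning backwards from the end. Two assumptions are made here that the answer does not state: digits also stop the scan, and anything else at the end (including whitespace and symbols) is removed. The class and method names are illustrative.

```java
class TrailingTrim {
    // Walk backwards from the end until a letter or digit is found,
    // then cut the string at that position.
    static String stripTrailingNonAlphanumerics(String s) {
        int end = s.length();
        while (end > 0 && !Character.isLetterOrDigit(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(stripTrailingNonAlphanumerics("A B C."));      // A B C
        System.out.println(stripTrailingNonAlphanumerics("!A B C!"));     // !A B C
        System.out.println(stripTrailingNonAlphanumerics("A? B?? C???")); // A? B?? C
    }
}
```

This is broader than a punctuation-only regex: a trailing space or a symbol such as "$" is removed as well.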
I have modified the logic of your method:
public static boolean endsWithRegex(String word, String regex) {
return word != null && !word.isEmpty() && word.matches(regex);
}
and the regex becomes: regex = ".*[^a-zA-Z]$";

Java Unicode replacement char with hex code

I need to replace all special unicode chars in a string with an escape char "\" and its hex value.
For example, this string:
String test="This is a string with the special è unicode char";
Should be replaced with:
"This is a string with the special \E8 unicode char";
where E8 is the hex value of unicode value of char "è".
I have the following questions:
1) How to find "special chars", may I check for each char value if it is >127?
That depends what your definition of "special" character is. Testing for characters greater than 127 is testing for non-ASCII characters. It is up to you to decide if that is what you want.
2) And how to get hex value as string.
The Integer.toHexString method can be used for that.
3) May I use a regex to find it or a loop for every char of the string?
A loop is simpler.
I'm trying this solution with a regex replacement (I don't know if it is better to loop for each char in string or to use the regex...)
String test = "This is a string with the special è unicode char";
final char ESCAPE_CHAR = '\\'; // the escape character named in the question
Pattern pat = Pattern.compile("([^\\x20-\\x7E])");
Matcher m = pat.matcher(test);
StringBuilder out = new StringBuilder();
int offset = 0;
while (m.find()) {
// Copy the unchanged text between the previous match and this one
out.append(test, offset, m.start());
out.append(ESCAPE_CHAR);
out.append(Integer.toHexString(m.group(1).charAt(0)).toUpperCase());
offset = m.end();
}
// Copy the remaining text after the last match
out.append(test, offset, test.length());
return out.toString();
If your unicode characters are all in the range \u0080 - \u00FF you might use this solution.
public class Convert {
public static void main(String[] args) {
String in = "This is a string with the special è unicode char";
StringBuilder out = new StringBuilder(in.length());
for (int i = 0; i < in.length(); i++) {
char charAt = in.charAt(i);
if (charAt > 127) {
// with a large number of conversions (e.g. 100,000), this creates
// fewer objects for the garbage collector than String.format does
out.append('\\').append(Integer.toHexString(charAt).toUpperCase());
// out.append(String.format("\\%X", (int) charAt));
} else {
out.append(charAt);
}
}
System.out.println(out);
}
}
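For input that may contain characters outside that range, including supplementary characters stored as surrogate pairs, the loop can iterate over code points instead of char values. The following is only a sketch; the class name HexEscaper and the method name escapeNonAscii are hypothetical.

```java
class HexEscaper {
    // Replace every code point above 127 with a backslash followed by its
    // uppercase hex value; ASCII characters are copied unchanged.
    static String escapeNonAscii(String in) {
        StringBuilder out = new StringBuilder(in.length());
        in.codePoints().forEach(cp -> {
            if (cp > 127) {
                out.append('\\').append(Integer.toHexString(cp).toUpperCase());
            } else {
                out.append((char) cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // U+00E8 is e-grave
        System.out.println(escapeNonAscii("special \u00E8 char")); // special \E8 char
    }
}
```

Because the hex value has a variable number of digits, the output cannot always be decoded unambiguously; a fixed-width format would be needed if the escaping has to be reversible.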

Handling delimiter with escape characters in Java String.split() method

I have searched the web for my query, but didn't get the answer which fits my requirement exactly. I have my string like below:
A|B|C|The Steading\|Keir Allan\|Braco|E
My Output should look like below:
A
B
C
The Steading|Keir Allan|Braco
E
My requirement is to skip the delimiter if it is preceded by the escape sequence. I have tried the following using negative lookbehinds in String.split():
(?<!\\)\|
But, my problem is the delimiter will be defined by the end user dynamically and it need not be always |. It can be any character on the keyboard (no restrictions). Hence, my doubt is that the above regex might fail for some of the special characters which are not allowed in regex.
I just wanted to know if this is the perfect way to do it.
You can use Pattern.quote():
String regex = "(?<!\\\\)" + Pattern.quote(delim);
Using your example:
String delim = "|";
String regex = "(?<!\\\\)" + Pattern.quote(delim);
for (String s : "A|B|C|The Steading\\|Keir Allan\\|Braco|E".split(regex))
System.out.println(s);
A
B
C
The Steading\|Keir Allan\|Braco
E
You can extend this to use a custom escape sequence as well:
String delim = "|";
String esc = "+";
String regex = "(?<!" + Pattern.quote(esc) + ")" + Pattern.quote(delim);
for (String s : "A|B|C|The Steading+|Keir Allan+|Braco|E".split(regex))
System.out.println(s);
A
B
C
The Steading+|Keir Allan+|Braco
E
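The pieces printed above still contain the escape character in front of each kept delimiter, whereas the question expects The Steading|Keir Allan|Braco. One possible follow-up step, sketched here with a hypothetical helper named splitAndUnescape, is to replace each escaped delimiter with the bare delimiter after splitting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class EscapedSplit {
    // Split on delimiters that are not preceded by the escape sequence,
    // then turn each "escape + delimiter" pair into a plain delimiter.
    static List<String> splitAndUnescape(String input, String delim, String esc) {
        String regex = "(?<!" + Pattern.quote(esc) + ")" + Pattern.quote(delim);
        List<String> parts = new ArrayList<>();
        for (String piece : input.split(regex)) {
            parts.add(piece.replace(esc + delim, delim));
        }
        return parts;
    }

    public static void main(String[] args) {
        String input = "A|B|C|The Steading\\|Keir Allan\\|Braco|E";
        // Prints: [A, B, C, The Steading|Keir Allan|Braco, E]
        System.out.println(splitAndUnescape(input, "|", "\\"));
    }
}
```

Like the lookbehind it is built on, this sketch does not support escaping the escape character itself.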
I know this is an old thread, but the lookbehind solution has an issue: it doesn't allow escaping the escape character itself (the split would not occur on A|B|C|The Steading\\|Keir Allan\|Braco|E).
The positive matching solution in thread Regex and escaped and unescaped delimiter works better (with modification using Pattern.quote() if the delimiter is dynamic).
private static void splitString(String str, char escapeCharacter, char delimiter, Consumer<String> resultConsumer) {
final StringBuilder sb = new StringBuilder();
boolean isEscaped = false;
for (int i = 0; i < str.length(); i++) {
char c = str.charAt(i);
if (c == escapeCharacter) {
isEscaped = ! isEscaped;
sb.append(c);
} else if (c == delimiter) {
if (isEscaped) {
sb.append(c);
isEscaped = false;
} else {
resultConsumer.accept(sb.toString());
sb.setLength(0);
}
} else {
isEscaped = false;
sb.append(c);
}
}
resultConsumer.accept(sb.toString());
}
