Remove non-ASCII non-printable characters from a String - java

I get user input including non-ASCII characters and non-printable characters, such as
\xc2d
\xa0
\xe7
\xc3\ufffdd
\xc3\ufffdd
\xc2\xa0
\xc3\xa7
\xa0\xa0
for example:
email : abc#gmail.com\xa0\xa0
street : 123 Main St.\xc2\xa0
desired output:
email : abc#gmail.com
street : 123 Main St.
What is the best way to remove them using Java?
I tried the following, but it doesn't seem to work:
public static void main(String args[]) throws UnsupportedEncodingException {
String s = "abc#gmail\\xe9.com";
String email = "abc#gmail.com\\xa0\\xa0";
System.out.println(s.replaceAll("\\P{Print}", ""));
System.out.println(email.replaceAll("\\P{Print}", ""));
}
Output
abc#gmail\xe9.com
abc#gmail.com\xa0\xa0

Your requirements are not clear. All characters in a Java String are Unicode characters, so if you remove them, you'll be left with an empty string. I assume what you mean is that you want to remove any non-ASCII, non-printable characters.
String clean = str.replaceAll("\\P{Print}", "");
Here, \p{Print} represents a POSIX character class for printable ASCII characters, while \P{Print} is the complement of that class. With this expression, all characters that are not printable ASCII are replaced with the empty string. (The extra backslash is because \ starts an escape sequence in string literals.)
Apparently, all the input characters are actually ASCII characters that represent a printable encoding of non-printable or non-ASCII characters. Mongo shouldn't have any trouble with these strings, because they contain only plain printable ASCII characters.
This all sounds a little fishy to me. What I believe is happening is that the data really do contain non-printable and non-ASCII characters, and another component (like a logging framework) is replacing these with a printable representation. In your simple tests, you are failing to translate the printable representation back to the original string, so you mistakenly believe the first regular expression is not working.
That's my guess, but if I've misread the situation and you really do need to strip out literal \xHH escapes, you can do it with the following regular expression.
String clean = str.replaceAll("\\\\x\\p{XDigit}{2}", "");
The API documentation for the Pattern class does a good job of listing all of the syntax supported by Java's regex library. For more elaboration on what all of the syntax means, I have found the Regular-Expressions.info site very helpful.
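To make the distinction above concrete, here is a minimal sketch contrasting a string that holds real non-breaking spaces with one that holds the literal text \xa0. The class and method names are made up for illustration; the two patterns are the ones quoted in this answer.

```java
import java.util.regex.Pattern;

final class PrintableAsciiDemo {
    // \P{Print} matches anything that is not printable ASCII (0x20 to 0x7E).
    private static final Pattern NON_PRINTABLE = Pattern.compile("\\P{Print}");
    // Matches the literal text: a backslash, the letter x, and two hex digits.
    private static final Pattern LITERAL_HEX_ESCAPE = Pattern.compile("\\\\x\\p{XDigit}{2}");

    static String stripNonPrintable(String s) {
        return NON_PRINTABLE.matcher(s).replaceAll("");
    }

    static String stripLiteralHexEscapes(String s) {
        return LITERAL_HEX_ESCAPE.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String real = "abc#gmail.com\u00a0\u00a0";   // two real non-breaking spaces
        String literal = "abc#gmail.com\\xa0\\xa0";  // plain ASCII text that only looks like an escape
        System.out.println(stripNonPrintable(real));         // the two trailing characters are removed
        System.out.println(stripNonPrintable(literal));      // unchanged: every character is printable ASCII
        System.out.println(stripLiteralHexEscapes(literal)); // the literal escapes are removed
    }
}
```

If the second call leaves your data unchanged, the input most likely contains the printable representation rather than the characters themselves.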

I know it's maybe late but for future reference:
String clean = str.replaceAll("\\P{Print}", "");
Removes all non-printable characters, but that includes \n (line feed), \t (tab) and \r (carriage return), and sometimes you want to keep those characters.
For that problem use inverted logic:
String clean = str.replaceAll("[^\\n\\r\\t\\p{Print}]", "");
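A small side-by-side sketch of the two expressions, in case the difference is not obvious. The class and method names are hypothetical; only the two regular expressions come from the answer.

```java
final class KeepLineBreaksDemo {
    // Removes everything that is not printable ASCII, including tab, LF and CR.
    static String stripAll(String s) {
        return s.replaceAll("\\P{Print}", "");
    }

    // Same idea, but tab, LF and CR are listed as allowed inside the negated class.
    static String stripButKeepLineBreaks(String s) {
        return s.replaceAll("[^\\n\\r\\t\\p{Print}]", "");
    }

    public static void main(String[] args) {
        String input = "a\tb\nc\u0007d"; // contains a tab, a line feed and a BEL control character
        System.out.println(stripAll(input).length());               // 4: only a, b, c, d remain
        System.out.println(stripButKeepLineBreaks(input).length()); // 6: the tab and line feed survive
    }
}
```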

With Google Guava's CharMatcher, you can remove any non-printable characters and then retain all ASCII characters (dropping any accents) like this:
String printable = CharMatcher.invisible().removeFrom(input);
String clean = CharMatcher.ascii().retainFrom(printable);
Not sure if that's what you really want, but it removes anything expressed as escape sequences in your question's sample data.

You can try this code:
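public String cleanInvalidCharacters(String in) {
    if (in == null || in.isEmpty()) {
        return "";
    }
    StringBuilder out = new StringBuilder();
    // Iterate by code point rather than by char: a char never exceeds 0xFFFF,
    // so the supplementary range check below could not match otherwise.
    for (int i = 0; i < in.length(); ) {
        int current = in.codePointAt(i);
        i += Character.charCount(current);
        if ((current == 0x9)
                || (current == 0xA)
                || (current == 0xD)
                || ((current >= 0x20) && (current <= 0xD7FF))
                || ((current >= 0xE000) && (current <= 0xFFFD))
                || ((current >= 0x10000) && (current <= 0x10FFFF))) {
            out.appendCodePoint(current);
        }
    }
    // Replace each remaining whitespace character (tab, newline, ...) with a plain space.
    return out.toString().replaceAll("\\s", " ");
}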
It works for me to remove invalid characters from String.

You can use java.text.Normalizer.
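The answer gives no code, so the following is only a sketch of one common way to use it, assuming the goal is to turn accented letters into their base letters and drop whatever non-ASCII remains. AccentStripper and toPlainAscii are made-up names.

```java
import java.text.Normalizer;

final class AccentStripper {
    static String toPlainAscii(String input) {
        // NFD decomposes an accented letter into its base letter plus combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Drop everything outside ASCII: the combining marks and any other non-ASCII character.
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        System.out.println(toPlainAscii("caf\u00e9")); // e with acute accent becomes a plain e
    }
}
```

Note that \p{ASCII} still includes control characters, so this would need to be combined with a \P{Print} pass if those have to go as well.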

Input => "This \u7279text \u7279is what I need"
Output => "This text is what I need"
If you are trying to remove literal Unicode escape sequences from a string like the one above, this code will work:
Pattern unicodeCharsPattern = Pattern.compile("\\\\u(\\p{XDigit}{4})");
Matcher unicodeMatcher = unicodeCharsPattern.matcher(data);
// replaceAll returns the input unchanged when nothing matches, so no find() check is needed.
String cleanData = unicodeMatcher.replaceAll("");

Related

Regex to consolidate multiple rules

I'm looking at optimising my string manipulation code and consolidating all of my replaceAll's to just one pattern if possible
Rules -
strip all special chars except -
replace space with -
condense consecutive - 's to just one -
Remove leading and trailing -'s
My code -
public static String slugifyTitle(String value) {
String slugifiedVal = null;
if (StringUtils.isNotEmpty(value))
slugifiedVal = value
.replaceAll("[ ](?=[ ])|[^-A-Za-z0-9 ]+", "") // strips all special chars except -
.replaceAll("\\s+", "-") // converts spaces to -
.replaceAll("--+", "-"); // replaces consecutive -'s with just one -
slugifiedVal = StringUtils.stripStart(slugifiedVal, "-"); // strips leading -
slugifiedVal = StringUtils.stripEnd(slugifiedVal, "-"); // strips trailing -
return slugifiedVal;
}
Does the job but obviously looks shoddy.
My test assertions -
Heading with symbols *~!##$%^&()_+-=[]{};',.<>?/ ==> heading-with-symbols
Heading with an asterisk* ==> heading-with-an-asterisk
Custom-id-&-stuff ==> custom-id-stuff
--Custom-id-&-stuff-- ==> custom-id-stuff
Disclaimer: I don't think a regex approach to this problem is wrong, or that this is an objectively better approach. I am merely presenting an alternative approach as food for thought.
I have a tendency against regex approaches to problems where you have to ask how to solve with regex, because that implies you're going to struggle to maintain that solution in the future. There is an opacity to regexes where "just do this" is obvious, when you know just to do this.
Some problems typically solved with regex, like this one, can be solved using imperative code. It tends to be more verbose, but it uses simple, apparent, code constructs; it's easier to debug; and can be faster because it doesn't involve the full "machinery" of the regex engine.
static String slugifyTitle(String value) {
boolean appendHyphen = false;
StringBuilder sb = new StringBuilder(value.length());
// Go through value one character at a time...
for (int i = 0; i < value.length(); i++) {
char c = value.charAt(i);
if (isAppendable(c)) {
// We have found a character we want to include in the string.
if (appendHyphen) {
// We previously found character(s) that we want to append a single
// hyphen for.
sb.append('-');
appendHyphen = false;
}
sb.append(c);
} else if (requiresHyphen(c)) {
// We want to replace hyphens or spaces with a single hyphen.
// Only append a hyphen if it's not going to be the first thing in the output.
// Doesn't matter if this is set for trailing hyphen/whitespace,
// since we then never hit the "isAppendable" condition.
appendHyphen = sb.length() > 0;
} else {
// Other characters are simply ignored.
}
}
// You can lowercase when appending the character, but `Character.toLowerCase()`
// recommends using `String.toLowerCase` instead.
return sb.toString().toLowerCase(Locale.ROOT);
}
// Some predicate on characters you want to include in the output.
static boolean isAppendable(char c) {
return (c >= 'A' && c <= 'Z')
|| (c >= 'a' && c <= 'z')
|| (c >= '0' && c <= '9');
}
// Some predicate on characters you want to replace with a single '-'.
static boolean requiresHyphen(char c) {
return c == '-' || Character.isWhitespace(c);
}
(This code is wildly over-commented, for the purpose of explaining it in this answer. Strip out the comments and unnecessary things like the else, it's actually not super complicated).
Consider the following regex parts:
Any special chars other than -: [\p{S}\p{P}&&[^-]]+ (character class subtraction)
Any one or more whitespace or hyphens: [\s-]+ (this will be used to replace with a single -)
You will still need to remove leading/trailing hyphens, it will be a separate post-processing step. If you wish, you can use a ^-+|-+$ regex.
So, you can only reduce this to three .replaceAll invocations keeping the code precise and readable:
public static String slugifyTitle(String value) {
String slugifiedVal = null;
if (value != null && !value.trim().isEmpty())
slugifiedVal = value.toLowerCase()
.replaceAll("[\\p{S}\\p{P}&&[^-]]+", "") // strips all special chars except -
.replaceAll("[\\s-]+", "-") // converts spaces/hyphens to -
.replaceAll("^-+|-+$", ""); // remove trailing/leading hyphens
return slugifiedVal;
}
See the Java demo:
List<String> strs = Arrays.asList("Heading with symbols *~!##$%^&()_+-=[]{};',.<>?/",
"Heading with an asterisk*",
"Custom-id-&-stuff",
"--Custom-id-&-stuff--");
for (String str : strs) {
    System.out.println("\"" + str + "\" => " + slugifyTitle(str));
}
Output:
"Heading with symbols *~!##$%^&()_+-=[]{};',.<>?/" => heading-with-symbols
"Heading with an asterisk*" => heading-with-an-asterisk
"Custom-id-&-stuff" => custom-id-stuff
"--Custom-id-&-stuff--" => custom-id-stuff
NOTE: if your strings can contain any Unicode whitespace, replace "[\\s-]+" with "(?U)[\\s-]+".
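A quick sketch of what that flag changes, using an em space (U+2003) as the test character. The class and method names are hypothetical; (?U) is the embedded form of Pattern.UNICODE_CHARACTER_CLASS.

```java
final class UnicodeWhitespaceDemo {
    // Default behaviour: \s only covers the ASCII whitespace characters.
    static String hyphenateAscii(String s) {
        return s.replaceAll("[\\s-]+", "-");
    }

    // With (?U), \s follows the Unicode White_Space property instead.
    static String hyphenateUnicode(String s) {
        return s.replaceAll("(?U)[\\s-]+", "-");
    }

    public static void main(String[] args) {
        String title = "Custom\u2003id - stuff"; // em space between "Custom" and "id"
        System.out.println(hyphenateAscii(title).equals("Custom-id-stuff"));   // false: the em space is left in
        System.out.println(hyphenateUnicode(title).equals("Custom-id-stuff")); // true
    }
}
```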

unwrapping String within String

I received a message from a queuing service, which I thought would be a UTF-8 encoded String. It turned out to be a quoted and escaped String within a String. That is, the first and last characters of the String itself are ", each newline is two characters \n, quotation marks (numerous because this is XML) are \", and single UTF-8 characters in foreign languages are represented as six characters (e.g., \uABCD). I know I can unwrap all this by rolling my own, but I thought there must be a combination of methods that can do this already. What might that incantation be?
After feedback from @JonSkeet and @njzk2, I came up with this, which worked:
// gradle: 'org.apache.commons:commons-lang3:3.3.2'
import org.apache.commons.lang3.StringEscapeUtils;
String s = serviceThatSometimesReturnsQuotedStringWithinString();
String usable = null;
if (s.length() > 0 && s.charAt(0) == '"' && s.charAt(s.length()-1) == '"') {
usable = StringEscapeUtils.unescapeEcmaScript(s.substring(1, s.length()-1));
} else {
usable = s;
}

Remove Special Characters For A Pattern Java

I want to remove these characters from a String:
+ - ! ( ) { } [ ] ^ ~ : \
I also want to remove these sequences:
/*
*/
&&
||
That is, I do not want to remove a single & or |; they should be removed only when the second character follows the first one (/* */ && ||).
How can I do that efficiently in Java?
Example:
a:b+c1|x||c*(?)
will be:
abc1|xc*?
This can be done via a long, but actually very simple regex.
String aString = "a:b+c1|x||c*(?)";
String sanitizedString = aString.replaceAll("[+\\-!(){}\\[\\]^~:\\\\]|/\\*|\\*/|&&|\\|\\|", "");
System.out.println(sanitizedString);
I think that the java.lang.String.replaceAll(String regex, String replacement) is all you need:
http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#replaceAll(java.lang.String, java.lang.String).
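Since the question asks about speed: String.replaceAll compiles its regex on every call, so if this runs often it may be worth compiling the pattern once. A minimal sketch under that assumption, reusing the regex from the answer above (QuerySanitizer and sanitize are made-up names):

```java
import java.util.regex.Pattern;

final class QuerySanitizer {
    // Compiled once and reused; a Pattern is immutable and safe to share between threads.
    private static final Pattern SPECIAL =
            Pattern.compile("[+\\-!(){}\\[\\]^~:\\\\]|/\\*|\\*/|&&|\\|\\|");

    static String sanitize(String input) {
        return SPECIAL.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("a:b+c1|x||c*(?)")); // abc1|xc*?
    }
}
```

Whether this makes a measurable difference depends on how often the method is called; for a one-off replacement the plain replaceAll call is fine.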
There are two ways to do that:
1)
ArrayList<String> arrayList = new ArrayList<String>();
arrayList.add("+");
arrayList.add("-");
arrayList.add("||");
arrayList.add("&&");
arrayList.add("(");
arrayList.add(")");
arrayList.add("{");
arrayList.add("}");
arrayList.add("[");
arrayList.add("]");
arrayList.add("~");
arrayList.add("^");
arrayList.add(":");
arrayList.add("!");
arrayList.add("\\");
arrayList.add("/*");
arrayList.add("*/");
String string = "a:b+c1|x||c*(?)";
for (int i = 0; i < arrayList.size(); i++) {
    // replace() is a no-op when the token is absent, so no contains() check is needed.
    string = string.replace(arrayList.get(i), "");
}
System.out.println(string);
2)
String string = "a:b+c1|x||c*(?)";
string = string.replaceAll("[+\\-!(){}\\[\\]^~:\\\\]|/\\*|\\*/|&&|\\|\\|", "");
System.out.println(string);
Thomas wrote on How to remove special characters from a string?:
That depends on what you define as special characters, but try
replaceAll(...):
String result = yourString.replaceAll("[-+.^:,]","");
Note that the ^ character must not be the first one in the list, since
you'd then either have to escape it or it would mean "any but these
characters".
Another note: the - character needs to be the first or last one on the
list, otherwise you'd have to escape it or it would define a range (
e.g. :-, would mean "all characters in the range : to ,").
So, in order to keep consistency and not depend on character
positioning, you might want to escape all those characters that have a
special meaning in regular expressions (the following list is not
complete, so be aware of other characters like (, {, $ etc.):
String result = yourString.replaceAll("[\\-\\+\\.\\^:,]","");
If you want to get rid of all punctuation and symbols, try this regex:
[\p{P}\p{S}] (keep in mind that in Java strings you'd have to escape
backslashes: "[\\p{P}\\p{S}]").
A third way could be something like this, if you can exactly define
what should be left in your string:
String result = yourString.replaceAll("[^\\w\\s]","");
Here's less restrictive alternative to the "define allowed characters"
approach, as suggested by Ray:
String result = yourString.replaceAll("[^\\p{L}\\p{Z}]","");
The regex matches everything that is not a letter in any language and
not a separator (whitespace, linebreak etc.). Note that you can't use
[\P{L}\P{Z}] (upper case P means not having that property), since that
would mean "everything that is not a letter or not whitespace", which
almost matches everything, since letters are not whitespace and vice
versa.
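To see how the last two suggestions differ in practice, here is a small sketch (hypothetical class and method names, regexes taken from the quoted answer):

```java
final class LetterFilterDemo {
    // Keeps Unicode letters and separators; digits and punctuation are dropped.
    static String keepLettersAndSeparators(String s) {
        return s.replaceAll("[^\\p{L}\\p{Z}]", "");
    }

    // Keeps ASCII word characters (letters, digits, underscore) and whitespace.
    static String keepWordCharsAndWhitespace(String s) {
        return s.replaceAll("[^\\w\\s]", "");
    }

    public static void main(String[] args) {
        String input = "Hello, w\u00f6rld 42!"; // o with diaeresis in "world"
        System.out.println(keepLettersAndSeparators(input).length());   // 12: the accented letter stays, "42" goes
        System.out.println(keepWordCharsAndWhitespace(input).length()); // 13: "42" stays, the accented letter goes
    }
}
```

By default \w is ASCII-only in Java, which is why the accented letter is removed by the second method.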

Why am I getting an incompatible types error when trying to see if a String ends with a specific character?

I'm working on a dinky bit of Java code, in which I have to create a program that: 1) capitalizes the first word of the input sentence, 2) capitalizes the word "I", and 3) punctuates the sentence if there is no proper punctuation. I wrote the code easily, but it's a bit messy. Specifically, I was wondering how you would use a special character in a conditional.
for example,
String sentence = IO.readString(); /* IO.readstring is irrelevant here, it's just a scanning class that reads a user input*/
int length = sentence.length();
char punctuation = sentence.charAt(length - 1);
if (punctuation != "." || punctuation != "?" || punctuation != "!")
{
sentence = sentence + ".";
}
This gives me an incompatible types error when I try to compile it (incompatible types: char and java.lang.String).
How would I go about writing this conditional?
When you use "" that implies a String.
For characters, use '.' (the single quote).
Use single quotes for characters in Java. Note that the conditions must also be joined with && rather than ||; with ||, the test is true for every character, so a period would always be appended:
if (punctuation != '.' && punctuation != '?' && punctuation != '!')
Literal characters are delimited by single quotes: 'x'.
Literal strings are delimited by double quotes: "x".
Characters are primitives, while strings are objects (of the java.lang.String class), and you cannot compare primitives to objects.
A shorthand for checking multiple characters is to use String.indexOf:
if (".?!".indexOf(punctuation) < 0)
sentence += '.';
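Wrapped into a method, the indexOf check might look like the sketch below. The method name and the decision to return an empty sentence unchanged are my own assumptions, not something the question specifies.

```java
final class SentencePunctuation {
    private static final String TERMINATORS = ".?!";

    static String ensureTerminated(String sentence) {
        if (sentence.isEmpty()) {
            return sentence; // assumption: nothing to punctuate, and charAt would throw
        }
        char last = sentence.charAt(sentence.length() - 1);
        // indexOf returns -1 when the character is not one of the terminators.
        if (TERMINATORS.indexOf(last) < 0) {
            sentence += '.';
        }
        return sentence;
    }

    public static void main(String[] args) {
        System.out.println(ensureTerminated("Hello there")); // Hello there.
        System.out.println(ensureTerminated("Really?"));     // Really?
    }
}
```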

Remove "empty" character from String

I'm using a framework which returns malformed Strings with "empty" characters from time to time.
"foobar" for example is represented by:
[,f,o,o,b,a,r]
The first character is NOT a whitespace (' '), so a System.out.println() would print "foobar" and not " foobar". Yet, the length of the String is 7 instead of 6. Obviously this makes most String methods (equals, split, substring, ...) useless. Is there a way to remove empty characters from a String?
I tried to build a new String like this:
StringBuilder sb = new StringBuilder();
for (final char character : malformedString.toCharArray()) {
if (Character.isDefined(character)) {
sb.append(character);
}
}
sb.toString();
Unfortunately this doesn't work. Same with the following code:
StringBuilder sb = new StringBuilder();
for (final Character character : malformedString.toCharArray()) {
if (character != null) {
sb.append(character);
}
}
sb.toString();
I also can't check for an empty character like this:
if (character == ''){
//
}
Obviously there is something wrong with the String, but I can't change the framework I'm using or wait for them to fix it (if it is a bug within their framework). I need to handle this String and sanitize it.
Any ideas?
Regex would be an appropriate way to sanitize the string from unwanted Unicode characters in this case.
String sanitized = dirty.replaceAll("[\uFEFF-\uFFFF]", "");
This will replace all char in \uFEFF-\uFFFF range with the empty string.
The [...] construct is called a character class, e.g. [aeiou] matches one of any of the lowercase vowels, [^aeiou] matches anything but.
You can do one of these two approaches:
replaceAll("[blacklist]", "")
replaceAll("[^whitelist]", "")
References
regular-expressions.info
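As a rough illustration of the two approaches, assuming the unwanted characters are U+FEFF and U+200B and that the whitelist is printable ASCII (both are my assumptions, pick whatever fits your data):

```java
final class CharClassDemo {
    // Blacklist: remove only the characters named in the class.
    static String removeBlacklisted(String s) {
        return s.replaceAll("[\\uFEFF\\u200B]", "");
    }

    // Whitelist: remove everything except printable ASCII (0x20 to 0x7E).
    static String keepWhitelisted(String s) {
        return s.replaceAll("[^\\x20-\\x7E]", "");
    }

    public static void main(String[] args) {
        String dirty = "\uFEFFfoobar";
        System.out.println(dirty.length());                    // 7
        System.out.println(removeBlacklisted(dirty).length()); // 6
        System.out.println(keepWhitelisted(dirty).length());   // 6
    }
}
```

The whitelist version is stricter: it would also drop legitimate accented letters, which may or may not be what you want.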
It's probably the NULL character which is represented by \0. You can get rid of it by String#trim().
To nail down the exact codepoint, do so:
for (char c : string.toCharArray()) {
System.out.printf("U+%04x ", (int) c);
}
Then you can find the exact character here.
Update: as per the update:
Anyone know of a way to just include a range of valid characters instead of excluding 95% of the UTF8 range?
You can do that with the help of a regex. See the answer of @polygenelubricants here and this answer.
On the other hand, you can also fix the problem at its root instead of working around it. Either update the files to get rid of the BOM, which is a legacy way to distinguish UTF-8 files from others and is nowadays worthless, or use a Reader which recognizes and skips the BOM. Also see this question.
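For the "Reader which skips the BOM" route, here is one possible sketch built on the standard PushbackReader. The class and method names are hypothetical, and it assumes the stream has already been decoded, so the BOM shows up as a single leading U+FEFF character.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.UncheckedIOException;

final class BomSkippingReader {
    private static final int BOM = 0xFEFF;

    /** Wraps a reader so that a single leading U+FEFF, if present, is dropped. */
    static Reader skipBom(Reader in) throws IOException {
        PushbackReader reader = new PushbackReader(in, 1);
        int first = reader.read();
        if (first != -1 && first != BOM) {
            reader.unread(first); // not a BOM: put the character back
        }
        return reader;
    }

    /** Demo helper: reads everything that follows the optional BOM into a String. */
    static String readAllSkippingBom(Reader in) {
        try (Reader reader = skipBom(in)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In real code you would wrap the InputStreamReader you already have with skipBom and keep reading from the returned Reader as usual.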
A very simple way to remove the UTF-8 BOM from a string, using substring as Denis Tulskiy suggested. No looping needed. Just checks the first character for the mark and skips it if needed.
public static String removeUTF8BOM(String s) {
if (s.startsWith("\uFEFF")) {
s = s.substring(1);
}
return s;
}
I needed to add this to my code when using the Apache HttpClient EntityUtils to read from a webserver. The webserver was not sending the blank mark, but it was getting pulled in while reading the input stream. Original article can be found here.
Thank you Johannes Rössel. It actually was '\uFEFF'
The following code works:
final StringBuilder sb = new StringBuilder();
for (final char character : body.toCharArray()) {
if (character != '\uFEFF') {
sb.append(character);
}
}
final String sanitzedString = sb.toString();
Anyone know of a way to just include a range of valid characters instead of excluding 95% of the UTF8 range?
trim() removes whitespace from both ends. Does the string have a colon before the space?
What's more:
long code = string.charAt(0); will show you the char code, and you can then use replace() or substring().
This is what worked for me:-
StringBuilder sb = new StringBuilder();
for (char character : myString.toCharArray()) {
int i = (int) character;
if (i > 0 && i <= 256) {
sb.append(character);
}
}
return sb.toString();
The int value of my NULL characters was in the region of 8103 or something.
You can try replace:
s.replace("\u200B", "")
or
s.replace("\uFEFF", "")
Kotlin:
s.filter { it != '\u200B' }
for (int i = 0; i < s.length(); i++)
if (s.charAt(i) == ' ') {
your code....
}
Simply malformedString.trim() will solve the issue.
You could check for the whitespace like this:
if (character.equals(' ')){ // }
