I'm using a framework which returns malformed Strings with "empty" characters from time to time.
"foobar" for example is represented by:
[,f,o,o,b,a,r]
The first character is NOT a whitespace (' '), so a System.out.println() would print "foobar" and not " foobar". Yet, the length of the String is 7 instead of 6. Obviously this makes most String methods (equals, split, substring, ...) useless. Is there a way to remove empty characters from a String?
I tried to build a new String like this:
StringBuilder sb = new StringBuilder();
for (final char character : malformedString.toCharArray()) {
if (Character.isDefined(character)) {
sb.append(character);
}
}
sb.toString();
Unfortunately this doesn't work. Same with the following code:
StringBuilder sb = new StringBuilder();
for (final Character character : malformedString.toCharArray()) {
if (character != null) {
sb.append(character);
}
}
sb.toString();
I also can't check for an empty character like this:
if (character == ''){
//
}
Obviously there is something wrong with the String, but I can't change the framework I'm using or wait for them to fix it (if it is a bug within their framework). I need to handle this String and sanitize it.
Any ideas?
Regex would be an appropriate way to sanitize the string from unwanted Unicode characters in this case.
String sanitized = dirty.replaceAll("[\uFEFF-\uFFFF]", "");
This will replace all char in \uFEFF-\uFFFF range with the empty string.
The [...] construct is called a character class, e.g. [aeiou] matches one of any of the lowercase vowels, [^aeiou] matches anything but.
You can do one of these two approaches:
replaceAll("[blacklist]", "")
replaceAll("[^whitelist]", "")
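To make the two approaches concrete, here is a minimal sketch. The class and method names are made up for illustration, and the chosen character ranges (U+FEFF and U+200B for the blacklist, printable ASCII for the whitelist) are assumptions rather than something the question prescribes:

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Blacklist: remove only the characters named in the class
    // (here the BOM U+FEFF and the zero-width space U+200B).
    private static final Pattern BLACKLIST = Pattern.compile("[\\uFEFF\\u200B]");

    // Whitelist: remove everything that is NOT printable ASCII (0x20-0x7E).
    private static final Pattern NOT_WHITELISTED = Pattern.compile("[^\\x20-\\x7E]");

    static String stripBlacklisted(String s) {
        return BLACKLIST.matcher(s).replaceAll("");
    }

    static String keepWhitelisted(String s) {
        return NOT_WHITELISTED.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String dirty = "\uFEFFfoobar";
        System.out.println(dirty.length());                   // length including the BOM
        System.out.println(stripBlacklisted(dirty).length()); // length after blacklist removal
        System.out.println(keepWhitelisted(dirty).length());  // length after whitelist filtering
    }
}
```

The whitelist is stricter: it also drops tabs, line breaks and every non-ASCII letter, so it only fits input that is expected to be plain ASCII.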
References
regular-expressions.info
It's probably the NULL character which is represented by \0. You can get rid of it by String#trim().
To nail down the exact codepoint, do so:
for (char c : string.toCharArray()) {
System.out.printf("U+%04x ", (int) c);
}
Then you can find the exact character here.
Update: as per the update:
Anyone know of a way to just include a range of valid characters instead of excluding 95% of the UTF8 range?
You can do that with the help of regex. See the answer of @polygenelubricants here and this answer.
On the other hand, you can also fix the problem at its root instead of working around it. Either update the files to get rid of the BOM, which is a legacy way to distinguish UTF-8 files from others and is nowadays worthless, or use a Reader which recognizes and skips the BOM. Also see this question.
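A Reader that skips the BOM could look roughly like the sketch below. It assumes the input is UTF-8 and uses only JDK classes; the class and method names are hypothetical, not part of any framework mentioned in the question:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class BomSkippingReaderDemo {
    // Wraps the stream in a UTF-8 reader and consumes a leading U+FEFF if present.
    static Reader openSkippingBom(InputStream in) throws IOException {
        PushbackReader reader =
                new PushbackReader(new InputStreamReader(in, StandardCharsets.UTF_8), 1);
        int first = reader.read();
        if (first != -1 && first != '\uFEFF') {
            reader.unread(first); // not a BOM, so put the character back
        }
        return reader;
    }

    // Convenience helper for the demo: decode a byte array, dropping a leading BOM.
    static String decodeWithoutBom(byte[] bytes) {
        try (Reader reader = openSkippingBom(new ByteArrayInputStream(bytes))) {
            StringBuilder sb = new StringBuilder();
            for (int c = reader.read(); c != -1; c = reader.read()) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'f', 'o', 'o'};
        System.out.println(decodeWithoutBom(withBom).length());
    }
}
```

Only the first character is inspected, so a U+FEFF that appears later in the stream is left untouched.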
A very simple way to remove the UTF-8 BOM from a string, using substring as Denis Tulskiy suggested. No looping needed. Just checks the first character for the mark and skips it if needed.
public static String removeUTF8BOM(String s) {
if (s.startsWith("\uFEFF")) {
s = s.substring(1);
}
return s;
}
I needed to add this to my code when using the Apache HttpClient EntityUtils to read from a webserver. The webserver was not sending the blank mark, but it was getting pulled in while reading the input stream. Original article can be found here.
Thank you Johannes Rössel. It actually was '\uFEFF'
The following code works:
final StringBuilder sb = new StringBuilder();
for (final char character : body.toCharArray()) {
if (character != '\uFEFF') {
sb.append(character);
}
}
final String sanitizedString = sb.toString();
Anyone know of a way to just include a range of valid characters instead of excluding 95% of the UTF8 range?
trim() removes whitespace from the left and right ends. Does the string have a colon before the space?
Even more:
int code = string.charAt(0); will show you the char code, and then you can use replace() or substring().
This is what worked for me:-
StringBuilder sb = new StringBuilder();
for (char character : myString.toCharArray()) {
int i = (int) character;
if (i > 0 && i <= 256) {
sb.append(character);
}
}
return sb.toString();
The int value of my NULL characters was in the region of 8103 or something.
You can try replace:
s.replace("\u200B", "")
or
s.replace("\uFEFF", "")
Kotlin:
s.filter { it != '\u200B' }
for (int i = 0; i < s.length(); i++)
if (s.charAt(i) == ' ') {
your code....
}
malformedString.trim() will solve the issue if the stray character is a whitespace or control character (U+0020 or below).
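Whether trim() helps depends on the code point, because String.trim() only strips leading and trailing characters whose code is U+0020 or lower. A small sketch to check this for a given character (the class and method names are illustrative):

```java
class TrimDemo {
    // Returns true if trim() removes the given character from the front of a string.
    static boolean trimRemoves(char c) {
        String candidate = c + "foobar";
        return candidate.trim().equals("foobar");
    }

    public static void main(String[] args) {
        System.out.println(trimRemoves('\0'));     // NUL character
        System.out.println(trimRemoves(' '));      // ordinary space
        System.out.println(trimRemoves('\uFEFF')); // byte order mark
    }
}
```

For a byte order mark, which sits far above U+0020, trim() leaves the string unchanged, so one of the replace-based approaches is needed instead.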
You could check for the whitespace like this:
if (character.equals(' ')){ // }
Related
I've seen questions on how to prefix zeros here in SO. But not the other way!
Can you guys suggest me how to remove the leading zeros in alphanumeric text? Are there any built-in APIs or do I need to write a method to trim the leading zeros?
Example:
01234 converts to 1234
0001234a converts to 1234a
001234-a converts to 1234-a
101234 remains as 101234
2509398 remains as 2509398
123z remains as 123z
000002829839 converts to 2829839
Regex is the best tool for the job; what it should be depends on the problem specification. The following removes leading zeroes, but leaves one if necessary (i.e. it wouldn't just turn "0" to a blank string).
s.replaceFirst("^0+(?!$)", "")
The ^ anchor will make sure that the 0+ being matched is at the beginning of the input. The (?!$) negative lookahead ensures that not the entire string will be matched.
Test harness:
String[] in = {
"01234", // "[1234]"
"0001234a", // "[1234a]"
"101234", // "[101234]"
"000002829839", // "[2829839]"
"0", // "[0]"
"0000000", // "[0]"
"0000009", // "[9]"
"000000z", // "[z]"
"000000.z", // "[.z]"
};
for (String s : in) {
System.out.println("[" + s.replaceFirst("^0+(?!$)", "") + "]");
}
See also
regular-expressions.info
repetitions, lookarounds, and anchors
String.replaceFirst(String regex)
You can use the StringUtils class from Apache Commons Lang like this:
StringUtils.stripStart(yourString,"0");
If you are using Kotlin, this is the only code that you need:
yourString.trimStart('0')
How about the regex way:
String s = "001234-a";
s = s.replaceFirst ("^0*", "");
The ^ anchors to the start of the string (I'm assuming from context your strings are not multi-line here, otherwise you may need to look into \A for start of input rather than start of line). The 0* means zero or more 0 characters (you could use 0+ as well). The replaceFirst just replaces all those 0 characters at the start with nothing.
And if, like Vadzim, your definition of leading zeros doesn't include turning "0" (or "000" or similar strings) into an empty string (a rational enough expectation), simply put it back if necessary:
String s = "00000000";
s = s.replaceFirst ("^0*", "");
if (s.isEmpty()) s = "0";
A clear way without any need of regExp and any external libraries.
public static String trimLeadingZeros(String source) {
for (int i = 0; i < source.length(); ++i) {
char c = source.charAt(i);
if (c != '0') {
return source.substring(i);
}
}
return ""; // or return "0";
}
To go with thelost's Apache Commons answer: using guava-libraries (Google's general-purpose Java utility library which I would argue should now be on the classpath of any non-trivial Java project), this would use CharMatcher:
CharMatcher.is('0').trimLeadingFrom(inputString);
You could just do:
String s = Integer.valueOf("0001007").toString();
Use this:
String x = "00123".replaceAll("^0*", ""); // -> 123
Use Apache Commons StringUtils class:
StringUtils.stripStart(String str, String stripChars);
Using Regexp with groups:
Pattern pattern = Pattern.compile("(0*)(.*)");
String result = "";
Matcher matcher = pattern.matcher(content);
if (matcher.matches())
{
// first group contains 0, second group the remaining characters
// 000abcd - > 000, abcd
result = matcher.group(2);
}
return result;
Using regex as some of the answers suggest is a good way to do that. If you don't want to use regex then you can use this code:
String s = "00a0a121";
while(s.length()>0 && s.charAt(0)=='0')
{
s = s.substring(1);
}
If you (like me) need to remove all the leading zeros from each "word" in a string, you can modify @polygenelubricants' answer to the following:
String s = "003 d0g 00ss 00 0 00";
s.replaceAll("\\b0+(?!\\b)", "");
which results in:
3 d0g ss 0 0 0
I think this is easy to do. You can just loop over the string from the start and skip zeros until you find a non-zero char.
// Index of the first character that is not '0' (0 if there are no leading zeros).
int firstNonZeroIndex = 0;
while (firstNonZeroIndex < str.length() && str.charAt(firstNonZeroIndex) == '0') {
    firstNonZeroIndex++;
}
str = str.substring(firstNonZeroIndex);
Without using Regex or substring() function on String which will be inefficient -
public static String removeZero(String str){
StringBuffer sb = new StringBuffer(str);
while (sb.length()>1 && sb.charAt(0) == '0')
sb.deleteCharAt(0);
return sb.toString(); // return in String
}
Using kotlin it is easy
value.trimStart('0')
You could replace "^0*(.*)" to "$1" with regex
String s="0000000000046457657772752256266542=56256010000085100000";
String removeString="";
for(int i =0;i<s.length();i++){
if(s.charAt(i)=='0')
removeString=removeString+"0";
else
break;
}
System.out.println("original string - "+s);
System.out.println("after removing 0's -"+s.replaceFirst(removeString,""));
If you don't want to use regex or external library.
You can do with "for":
String input="0000008008451";
String output = input.trim();
for( ;output.length() > 1 && output.charAt(0) == '0'; output = output.substring(1));
System.out.println(output);//8008451
I made some benchmark tests and found, that the fastest way (by far) is this solution:
private static String removeLeadingZeros(String s) {
try {
Integer intVal = Integer.parseInt(s);
s = intVal.toString();
} catch (Exception ex) {
// whatever
}
return s;
}
Especially regular expressions are very slow in a long iteration. (I needed to find out the fastest way for a batchjob.)
And what about just searching for the first non-zero character?
[1-9]\d*
This regex finds the first digit between 1 and 9 followed by any number of digits, so for "00012345" it returns "12345".
It can be easily adapted for alphanumeric strings.
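One possible adaptation for alphanumeric input, as a sketch: match from the first character that is not '0' to the end of the string. The pattern, the class name and the choice to return an all-zero input unchanged are assumptions for illustration, and the input is assumed to be a single line:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstNonZeroDemo {
    // First non-zero character followed by the rest of the line.
    private static final Pattern FROM_FIRST_NON_ZERO = Pattern.compile("[^0].*");

    static String stripLeadingZeros(String s) {
        Matcher m = FROM_FIRST_NON_ZERO.matcher(s);
        // If the string consists only of zeros (or is empty), return it unchanged.
        return m.find() ? m.group() : s;
    }

    public static void main(String[] args) {
        System.out.println(stripLeadingZeros("0001234a"));
        System.out.println(stripLeadingZeros("001234-a"));
        System.out.println(stripLeadingZeros("101234"));
    }
}
```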
I have a string like this: "abcd !#&$%^^&*()<>!/". I have a list of the entity codes for characters in a separate string, i.e. only the characters listed in that other string ("!=&4....^=9...") should be encoded. I want to convert all special characters, except alphanumerics, into their entities with a regex, because looping over the characters one by one is too slow.
e.g. it should show "abc ..;.."; in other words, convert all the special characters on the keyboard.
Is there an efficient regex I can write? I have tried this with loops, but it is too slow to look at each character one by one and maintain a list of all special-character entities in another string.
There are libraries but they do not convert all of the characters.
The code I wrote
// String to be encoded
String sDecoded = "abcd !##$%^&*();'m,";
// Special character entity list to put instead to special character. It is tokenized on cross and divide symbol as it cannot be entered by user on keyboard
String specialCharacters = "&÷&amp;×–÷&ndash;";
// Check the input
if (sDecoded == null || sDecoded.trim ().length () == 0)
return (sDecoded);
// Use StringTokenizer which is faster than split method
StringTokenizer st = new StringTokenizer(specialCharacters, "×");
String[] reg = null;
String[] charactersArray = sDecoded.split("");
String sEncoded = "";
// now loop on it and in each iteration, we will be getting a decodedCharacter:EncodedEntity pair
for(int i = 0; i < charactersArray.length; i++)
{
st = new StringTokenizer(specialCharacters, "×");
while(st.hasMoreElements())
{
reg = st.nextElement().toString().split("÷");
// This is an error, the character should not be blank ever because it will be character that we will encode
if(StringUtils.isBlank(reg[0]))
return sDecoded;
String c = charactersArray[i];
if(c.equalsIgnoreCase(reg[0]))
{
sEncoded = sEncoded + c.replace(reg[0], reg[1]);
break;
}
if(st.countTokens() == 0)
sEncoded = sEncoded + c.toString();
}
}
return (sEncoded);
I don't know what definition of "efficient" you are using, but there's the "don't reinvent the wheel" efficiency of using a simple call to Apache commons-text StringEscapeUtils utility class:
String encoded = StringEscapeUtils.escapeXml11(str);
or
String encoded = StringEscapeUtils.escapeHtml4(str);
and a variety of other similar methods, depending on which exact encoding you want.
Note: This class was originally in the commons-lang3 library, but was deprecated there and moved to the commons-text library.
Your approach is quite slow and inefficient. Maybe it looks elegant nowadays to use regex like a silver bullet for everything, but it is definitely not the right tool for this task. I see you are also using a tokenizer, which is also slow. A loop inside a loop will degrade performance as well.
I would recommend an iterative approach with a StringBuilder, which should be much faster; try it for yourself. For each special character, write an 'if' statement. Even if it looks like a lot of code, it will be very fast. Test it yourself.
Try this :
class Scratch {
public static void main(String[] args) {
System.out.println(escapeSpecials("abc &"));
}
public static String escapeSpecials(String origin) {
StringBuilder result = new StringBuilder();
char[] chars = origin.toCharArray();
for (char c : chars) {
if (c == '&') {
result.append("&amp;");
} else if (c == '\u2013') {
result.append("&ndash;");
} else {
// not a special character
result.append(c);
}
}
return result.toString();
}
}
I get user input including non-ASCII characters and non-printable characters, such as
\xc2d
\xa0
\xe7
\xc3\ufffdd
\xc3\ufffdd
\xc2\xa0
\xc3\xa7
\xa0\xa0
for example:
email : abc#gmail.com\xa0\xa0
street : 123 Main St.\xc2\xa0
desired output:
email : abc#gmail.com
street : 123 Main St.
What is the best way to removing them using Java?
I tried the following, but doesn't seem to work
public static void main(String args[]) throws UnsupportedEncodingException {
String s = "abc#gmail\\xe9.com";
String email = "abc#gmail.com\\xa0\\xa0";
System.out.println(s.replaceAll("\\P{Print}", ""));
System.out.println(email.replaceAll("\\P{Print}", ""));
}
Output
abc#gmail\xe9.com
abc#gmail.com\xa0\xa0
Your requirements are not clear. All characters in a Java String are Unicode characters, so if you remove them, you'll be left with an empty string. I assume what you mean is that you want to remove any non-ASCII, non-printable characters.
String clean = str.replaceAll("\\P{Print}", "");
Here, \p{Print} represents a POSIX character class for printable ASCII characters, while \P{Print} is the complement of that class. With this expression, all characters that are not printable ASCII are replaced with the empty string. (The extra backslash is because \ starts an escape sequence in string literals.)
Apparently, all the input characters are actually ASCII characters that represent a printable encoding of non-printable or non-ASCII characters. Mongo shouldn't have any trouble with these strings, because they contain only plain printable ASCII characters.
This all sounds a little fishy to me. What I believe is happening is that the data really do contain non-printable and non-ASCII characters, and another component (like a logging framework) is replacing these with a printable representation. In your simple tests, you are failing to translate the printable representation back to the original string, so you mistakenly believe the first regular expression is not working.
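The difference between a real non-printable character and its printable representation can be checked with a short sketch (the class and method names are illustrative, and the sample values are taken from the question):

```java
class PrintableDemo {
    // Removes every character that is not printable ASCII.
    static String stripNonPrintable(String s) {
        return s.replaceAll("\\P{Print}", "");
    }

    public static void main(String[] args) {
        // A real no-break space (U+00A0) at the end: it is removed.
        String real = "123 Main St.\u00a0";
        // The six printable ASCII characters backslash, x, a, 0 and so on: nothing to remove.
        String escapedText = "123 Main St.\\xa0";

        System.out.println("[" + stripNonPrintable(real) + "]");
        System.out.println("[" + stripNonPrintable(escapedText) + "]");
    }
}
```

The second string is what the test code in the question actually builds, which is why the expression appears to do nothing there.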
That's my guess, but if I've misread the situation and you really do need to strip out literal \xHH escapes, you can do it with the following regular expression.
String clean = str.replaceAll("\\\\x\\p{XDigit}{2}", "");
The API documentation for the Pattern class does a good job of listing all of the syntax supported by Java's regex library. For more elaboration on what all of the syntax means, I have found the Regular-Expressions.info site very helpful.
I know it's maybe late but for future reference:
String clean = str.replaceAll("\\P{Print}", "");
Removes all non-printable characters, but that includes \n (line feed), \t (tab) and \r (carriage return), and sometimes you want to keep those characters.
For that problem use inverted logic:
String clean = str.replaceAll("[^\\n\\r\\t\\p{Print}]", "");
With Google Guava's CharMatcher, you can remove any non-printable characters and then retain all ASCII characters (dropping any accents) like this:
String printable = CharMatcher.invisible().removeFrom(input);
String clean = CharMatcher.ascii().retainFrom(printable);
Not sure if that's what you really want, but it removes anything expressed as escape sequences in your question's sample data.
You can try this code:
public String cleanInvalidCharacters(String in) {
    if (in == null || in.isEmpty()) {
        return "";
    }
    StringBuilder out = new StringBuilder();
    // Iterate by code point: a single char can never reach the 0x10000-0x10FFFF range.
    for (int i = 0; i < in.length(); ) {
        int current = in.codePointAt(i);
        i += Character.charCount(current);
        if ((current == 0x9)
                || (current == 0xA)
                || (current == 0xD)
                || ((current >= 0x20) && (current <= 0xD7FF))
                || ((current >= 0xE000) && (current <= 0xFFFD))
                || ((current >= 0x10000) && (current <= 0x10FFFF))) {
            out.appendCodePoint(current);
        }
    }
    // Normalize every remaining whitespace character to a plain space.
    return out.toString().replaceAll("\\s", " ");
}
It works for me to remove invalid characters from String.
You can use java.text.Normalizer
Input => "This \u7279text \u7279is what I need"
Output => "This text is what I need"
If you are trying to remove Unicode characters from a string like above this code will work
Pattern unicodeCharsPattern = Pattern.compile("\\\\u(\\p{XDigit}{4})");
Matcher unicodeMatcher = unicodeCharsPattern.matcher(data);
// replaceAll returns the input unchanged when nothing matches, so no find() check is needed.
String cleanData = unicodeMatcher.replaceAll("");
I want to detect and remove high-ASCII characters like ®, ©, ™ from a String in Java. Is there any open-source library that can do this?
If you need to remove all non-US-ASCII (i.e. outside 0x0-0x7F) characters, you can do something like this:
s = s.replaceAll("[^\\x00-\\x7f]", "");
If you need to filter many strings, it would be better to use a precompiled pattern:
private static final Pattern nonASCII = Pattern.compile("[^\\x00-\\x7f]");
...
s = nonASCII.matcher(s).replaceAll("");
And if it's really performance-critical, perhaps Alex Nikolaenkov's suggestion would be better.
I think that you can easily filter your string by hand and check code of the particular character. If it fits your requirements then add it to a StringBuilder and do toString() to it in the end.
public static String filter(String str) {
StringBuilder filtered = new StringBuilder(str.length());
for (int i = 0; i < str.length(); i++) {
char current = str.charAt(i);
if (current >= 0x20 && current <= 0x7e) {
filtered.append(current);
}
}
return filtered.toString();
}
A nice way to do this is to use Google Guava CharMatcher:
String newString = CharMatcher.ascii().retainFrom(string);
newString will contain only the ASCII characters (code point < 128) from the original string.
This reads more naturally than a regular expression. Regular expressions can take more effort to understand for subsequent readers of your code.
I understand that you need to delete: ç,ã,Ã , but for everybody that need to convert ç,ã,Ã ---> c,a,A please have a look at this piece of code:
Example Code:
final String input = "Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ";
System.out.println(
Normalizer
.normalize(input, Normalizer.Form.NFD)
.replaceAll("[^\\p{ASCII}]", "")
);
Output:
This is a funky String
Using Java, I want to go through the lines of a text and replace all ampersand symbols (&) with the XML entity reference &amp;.
I scan the lines of the text and then each word in the text with the Scanner class. Then I use the CharacterIterator to iterate over each character of the word. However, how can I replace the character? First, Strings are immutable objects. Second, I want to replace a character (&) with several characters (&amp;). How should I approach this?
CharacterIterator it = new StringCharacterIterator(token);
for(char ch = it.first(); ch != CharacterIterator.DONE; ch = it.next()) {
if(ch == '&') {
}
}
Try using String.replace() or String.replaceAll() instead.
String my_new_str = my_str.replace("&", "&amp;");
(Both replace all occurrences; replaceAll allows use of regex.)
The simple answer is:
token = token.replace("&", "&amp;");
Despite the name as compared to replaceAll, replace does do a replaceAll, it just doesn't use a regular expression, which seems to be in order here (both from a performance and a good practice perspective - don't use regular expressions by accident as they have special character requirements which you won't be paying attention to).
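The difference shows up as soon as the search string contains a regex metacharacter. A minimal sketch (class and method names are illustrative):

```java
class ReplaceDemo {
    // replace() treats the target as literal text.
    static String literal(String s) {
        return s.replace(".", "-");
    }

    // replaceAll() treats the target as a regular expression, where "." matches any character.
    static String regex(String s) {
        return s.replaceAll(".", "-");
    }

    // For the ampersand case, the literal form is all that is needed.
    static String escapeAmpersands(String s) {
        return s.replace("&", "&amp;");
    }

    public static void main(String[] args) {
        System.out.println(literal("a.b"));            // only the dot is replaced
        System.out.println(regex("a.b"));              // every character is replaced
        System.out.println(escapeAmpersands("a & b")); // every ampersand is escaped
    }
}
```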
Absent a specific performance target and measurements, Sean Bright's answer is probably as good as is worth thinking about from a performance perspective, if you already know this code is a hot spot and that is where your question is coming from. It certainly doesn't deserve the downvotes. Just use StringBuilder instead of StringBuffer unless you need the synchronization.
That being said, there is a somewhat deeper potential problem here. Escaping characters is a known problem which lots of libraries out there address. You may want to consider wrapping the data in a CDATA section in the XML, or you may prefer to use an XML library (including the one that comes with the JDK now) to actually generate the XML properly (so that it will handle the encoding).
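As a sketch of the XML-library route, the StAX writer that ships with the JDK (javax.xml.stream) escapes character data when it writes it. The element name and the helper below are made up for illustration:

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class XmlWriterDemo {
    // Writes <name>text</name> and lets the writer take care of escaping the text.
    static String element(String name, String text) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartElement(name);
            xml.writeCharacters(text); // special characters in the text are escaped here
            xml.writeEndElement();
            xml.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(element("note", "fish & chips"));
    }
}
```

With this approach the caller never handles entity references directly, which also avoids escaping the same text twice by accident.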
Apache also has an escaping library as part of Commons Lang.
StringBuilder s = new StringBuilder(token.length());
CharacterIterator it = new StringCharacterIterator(token);
for (char ch = it.first(); ch != CharacterIterator.DONE; ch = it.next()) {
switch (ch) {
case '&':
s.append("&amp;");
break;
case '<':
s.append("&lt;");
break;
case '>':
s.append("&gt;");
break;
default:
s.append(ch);
break;
}
}
token = s.toString();
You may also want to check to make sure you're not replacing an occurrence that has already been replaced. You can use a regular expression with negative lookahead to do this.
For example:
String str = "sdasdasa&amp;adas&dasdasa";
str = str.replaceAll("&(?!amp;)", "&amp;");
This would result in the string "sdasdasa&amp;adas&amp;dasdasa".
The regex pattern "&(?!amp;)" basically says: Match any occurrence of '&' that is not followed by 'amp;'.
Just create a string that contains all of the data in question and then use String.replaceAll() like below.
String result = yourString.replaceAll("&", "&amp;");
You can use stream and flatMap to map & to &amp;
String str = "begin&end";
String newString = str.chars()
.flatMap(ch -> (ch == '&') ? "&amp;".chars() : IntStream.of(ch))
.collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
.toString();
Escaping strings can be tricky - especially if you want to take unicode into account. I suppose XML is one of the simpler formats/languages to escape but still. I would recommend taking a look at the StringEscapeUtils class in Apache Commons Lang, and its handy escapeXml method.
Try this code. You can replace any character with another given character.
Here I replace the letter 'a' with the "_" character in the given string "abcdefaa".
Output --> _bcdef__
public class Replace {
public static void replaceChar(String str,String target){
String result = str.replaceAll(target, "_");
System.out.println(result);
}
public static void main(String[] args) {
replaceChar("abcdefaa","a");
}
}
If you're using Spring you can simply call HtmlUtils.htmlEscape(String input) which will handle the '&' to '&amp;' translation.
//I think this will work, you don't have to replace on the even, it's just an example.
public void emphasize(String phrase, char ch)
{
char phraseArray[] = phrase.toCharArray();
for(int i=0; i< phrase.length(); i++)
{
if(i%2==0)// even number
{
String value = Character.toString(phraseArray[i]);
value = value.replace(value,"*");
phraseArray[i] = value.charAt(0);
}
}
}
String taskLatLng = task.getTask_latlng().replaceAll( "\\(","").replaceAll("\\)","").replaceAll("lat/lng:", "").trim();