How to remove high-ASCII characters from string like ®, ©, ™ in Java

How to remove high-ASCII characters from string like ®, ©, ™ in Java - java

I want to detect and remove high-ASCII characters like ®, ©, ™ from a String in Java. Is there any open-source library that can do this?

If you need to remove all non-US-ASCII (i.e. outside 0x0-0x7F) characters, you can do something like this:
s = s.replaceAll("[^\\x00-\\x7f]", "");
If you need to filter many strings, it would be better to use a precompiled pattern:
private static final Pattern nonASCII = Pattern.compile("[^\\x00-\\x7f]");
...
s = nonASCII.matcher(s).replaceAll();
And if it's really performance-critical, perhaps Alex Nikolaenkov's suggestion would be better.

I think that you can easily filter your string by hand and check code of the particular character. If it fits your requirements then add it to a StringBuilder and do toString() to it in the end.
public static String filter(String str) {
StringBuilder filtered = new StringBuilder(str.length());
for (int i = 0; i < str.length(); i++) {
char current = str.charAt(i);
if (current >= 0x20 && current <= 0x7e) {
filtered.append(current);
}
}
return filtered.toString();
}

A nice way to do this is to use Google Guava CharMatcher:
String newString = CharMatcher.ASCII.retainFrom(string);
newString will contain only the ASCII characters (code point < 128) from the original string.
This reads more naturally than a regular expression. Regular expressions can take more effort to understand for subsequent readers of your code.

I understand that you need to delete: ç,ã,Ã , but for everybody that need to convert ç,ã,Ã ---> c,a,A please have a look at this piece of code:
Example Code:
final String input = "Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ";
System.out.println(
Normalizer
.normalize(input, Normalizer.Form.NFD)
.replaceAll("[^\\p{ASCII}]", "")
);
Output:
This is a funky String

Related

How to remove leading 0 in the time timestamp 02:25PM using java? [duplicate]

I've seen questions on how to prefix zeros here in SO. But not the other way!
Can you guys suggest me how to remove the leading zeros in alphanumeric text? Are there any built-in APIs or do I need to write a method to trim the leading zeros?
Example:
01234 converts to 1234
0001234a converts to 1234a
001234-a converts to 1234-a
101234 remains as 101234
2509398 remains as 2509398
123z remains as 123z
000002829839 converts to 2829839

Regex is the best tool for the job; what it should be depends on the problem specification. The following removes leading zeroes, but leaves one if necessary (i.e. it wouldn't just turn "0" to a blank string).
s.replaceFirst("^0+(?!$)", "")
The ^ anchor will make sure that the 0+ being matched is at the beginning of the input. The (?!$) negative lookahead ensures that not the entire string will be matched.
Test harness:
String[] in = {
"01234", // "[1234]"
"0001234a", // "[1234a]"
"101234", // "[101234]"
"000002829839", // "[2829839]"
"0", // "[0]"
"0000000", // "[0]"
"0000009", // "[9]"
"000000z", // "[z]"
"000000.z", // "[.z]"
};
for (String s : in) {
System.out.println("[" + s.replaceFirst("^0+(?!$)", "") + "]");
}
See also
regular-expressions.info
repetitions, lookarounds, and anchors
String.replaceFirst(String regex)

You can use the StringUtils class from Apache Commons Lang like this:
StringUtils.stripStart(yourString,"0");

If you are using Kotlin This is the only code that you need:
yourString.trimStart('0')

How about the regex way:
String s = "001234-a";
s = s.replaceFirst ("^0*", "");
The ^ anchors to the start of the string (I'm assuming from context your strings are not multi-line here, otherwise you may need to look into \A for start of input rather than start of line). The 0* means zero or more 0 characters (you could use 0+ as well). The replaceFirst just replaces all those 0 characters at the start with nothing.
And if, like Vadzim, your definition of leading zeros doesn't include turning "0" (or "000" or similar strings) into an empty string (a rational enough expectation), simply put it back if necessary:
String s = "00000000";
s = s.replaceFirst ("^0*", "");
if (s.isEmpty()) s = "0";

A clear way without any need of regExp and any external libraries.
public static String trimLeadingZeros(String source) {
for (int i = 0; i < source.length(); ++i) {
char c = source.charAt(i);
if (c != '0') {
return source.substring(i);
}
}
return ""; // or return "0";
}

To go with thelost's Apache Commons answer: using guava-libraries (Google's general-purpose Java utility library which I would argue should now be on the classpath of any non-trivial Java project), this would use CharMatcher:
CharMatcher.is('0').trimLeadingFrom(inputString);

You could just do:
String s = Integer.valueOf("0001007").toString();

Use this:
String x = "00123".replaceAll("^0*", ""); // -> 123

Use Apache Commons StringUtils class:
StringUtils.strip(String str, String stripChars);

Using Regexp with groups:
Pattern pattern = Pattern.compile("(0*)(.*)");
String result = "";
Matcher matcher = pattern.matcher(content);
if (matcher.matches())
{
// first group contains 0, second group the remaining characters
// 000abcd - > 000, abcd
result = matcher.group(2);
}
return result;

Using regex as some of the answers suggest is a good way to do that. If you don't want to use regex then you can use this code:
String s = "00a0a121";
while(s.length()>0 && s.charAt(0)=='0')
{
s = s.substring(1);
}

If you (like me) need to remove all the leading zeros from each "word" in a string, you can modify #polygenelubricants' answer to the following:
String s = "003 d0g 00ss 00 0 00";
s.replaceAll("\\b0+(?!\\b)", "");
which results in:
3 d0g ss 0 0 0

I think that it is so easy to do that. You can just loop over the string from the start and removing zeros until you found a not zero char.
int lastLeadZeroIndex = 0;
for (int i = 0; i < str.length(); i++) {
char c = str.charAt(i);
if (c == '0') {
lastLeadZeroIndex = i;
} else {
break;
}
}
str = str.subString(lastLeadZeroIndex+1, str.length());

Without using Regex or substring() function on String which will be inefficient -
public static String removeZero(String str){
StringBuffer sb = new StringBuffer(str);
while (sb.length()>1 && sb.charAt(0) == '0')
sb.deleteCharAt(0);
return sb.toString(); // return in String
}

Using kotlin it is easy
value.trimStart('0')

You could replace "^0*(.*)" to "$1" with regex

String s="0000000000046457657772752256266542=56256010000085100000";
String removeString="";
for(int i =0;i<s.length();i++){
if(s.charAt(i)=='0')
removeString=removeString+"0";
else
break;
}
System.out.println("original string - "+s);
System.out.println("after removing 0's -"+s.replaceFirst(removeString,""));

If you don't want to use regex or external library.
You can do with "for":
String input="0000008008451"
String output = input.trim();
for( ;output.length() > 1 && output.charAt(0) == '0'; output = output.substring(1));
System.out.println(output);//8008451

I made some benchmark tests and found, that the fastest way (by far) is this solution:
private static String removeLeadingZeros(String s) {
try {
Integer intVal = Integer.parseInt(s);
s = intVal.toString();
} catch (Exception ex) {
// whatever
}
return s;
}
Especially regular expressions are very slow in a long iteration. (I needed to find out the fastest way for a batchjob.)

And what about just searching for the first non-zero character?
[1-9]\d+
This regex finds the first digit between 1 and 9 followed by any number of digits, so for "00012345" it returns "12345".
It can be easily adapted for alphanumeric strings.

Efficient encoding all the special characters in a string into entities

I have a string like this "abcd !#&$%^^&*()<>!/". I have list of all the entity codes for characters in a separate string i.e. only encode those characters which are in another string "!=&4....^=9...". I want to convert all of special characters into their entities except alphanumeric by regex as using loop on characters on by one is too slow.
e.g. it should show "abc &#4..;&#4.." in other convert words all the special characters on keyboard.
Is there an efficient regex I can write ? I have tried this with loops but it is too slow to look at each character one by one and maintain a list of all special characters entities in other string
There are libraries but they do not convert all of the characters.
The code I wrote
// String to be encoded
String sDecoded = "abcd !##$%^&*();'m,";
// Special character entity list to put instead to special character. It is tokenized on cross and divide symbol as it cannot be entered by user on keyboard
String specialCharacters = "&÷$amp;×–÷–"
// Check the input
if (sDecoded == null || sDecoded.trim ().length () == 0)
return (sDecoded);
// Use StringTokenizer which is faster than split method
StringTokenizer st = new StringTokenizer(specialCharacters, "×");
String[] reg = null;
String[] charactersArray = sDecoded.split("");
String sEncoded = "";
// now loop on it and in each iteration, we will be getting a decodedCharacter:EncodedEntity pair
for(int i = 0; i < charactersArray.length; i++)
{
st = new StringTokenizer(specialCharacters, "×");
while(st.hasMoreElements())
{
reg = st.nextElement().toString().split("÷");
// This is an error, the character should not be blank ever because it will be character that we will encode
if(StringUtils.isBlank(reg[0]))
return sDecoded;
String c = charactersArray[i];
if(c.equalsIgnoreCase(reg[0]))
{
sEncoded = sEncoded + c.replace(reg[0], reg[1]);
break;
}
if(st.countTokens() == 0)
sEncoded = sEncoded + c.toString();
}
}
return (sEncoded);

I don't know what definition of "efficient" you are using, but there's the "don't reinvent the wheel" efficiency of using a simple call to Apache commons-text StringEscapeUtils utility class:
String encoded = StringEscapeUtils.escapeXml11(str);
or
String encoded = StringEscapeUtils.escapeHtml4(str);
and a variety of other similar methods, depending on which exact encoding you want.
Note: This class was originally in the commons-lang3 library, but was deprecated there and moved to the commons-text library.

Your approach is quite slow and inefficient. Maybe it looks elegant nowadays to use regex like a silver bullet for everything, but it is definitely not for this task. I see you are also using tokenizer which is also slow.Also loop inside a loop will degrade performance.
I would recomment using an iterative way with string builder which will produce blazing fast results, you will try for yourself. For each special character make an 'if' statement. Even if it looks too much code it will be very fast. Test yourself.
Try this :
class Scratch {
public static void main(String[] args) {
System.out.println(escapeSpecials("abc &"));
}
public static String escapeSpecials(String origin) {
StringBuilder result = new StringBuilder();
char[] chars = origin.toCharArray();
for (char c : chars) {
if (c == '&') {
result.append("&");
} else if (c == '\u2013') {
result.append("–");
} else {
// not a special character
result.append(c);
}
}
return result.toString();
}
}

the best way for character replacement in String in java

I want to check a string for each character I replace it with other characters or keep it in the string. and also because it's a long string the time to do this task is so important. what is the best way of these, or any better idea?
for all of them I append the result to an StringBuilder.
check all of the characters with a for and charAt commands.
use switch like the previous way.
use replaceAll twice.
and if one of the first to methods is better is there any way to check a character with a group of characters, like :
if (st.charAt(i)=='a'..'z') ....
Edit:
please tell the less consuming in time way and tell the reason.I know all of these ways you said!

If you want to replace a single character (or a single sequence), use replace(), as other answers have suggested.
If you want to replace several characters (e.g., 'a', 'b', and 'c') with a single substitute character or character sequence (e.g., "X"), you should use a regular expression replace:
String result = original.replaceAll("[abc]", "X");
If you want to replace several characters, each with a different replacement (e.g., 'a' with 'A', 'b' with 'B'), then looping through the string yourself and building the result in a StringBuilder will probably be the most efficient. This is because, as you point out in your question, you will be going through the string only once.
String sb = new StringBuilder();
String targets = "abc";
String replacements = "ABC";
for (int i = 0; i < result.length; ++i) {
char c = original.charAt(i);
int loc = targets.indexOf(c);
sb.append(loc >= 0 ? replacements.charAt(loc) : c);
}
String result = sb.toString();

Check the documentation and find some good methods:
char from = 'a';
char to = 'b';
str = str.replace(from, to);

String replaceSample = "This String replace Example shows
how to replace one char from String";
String newString = replaceSample.replace('r', 't');
Output: This Stting teplace Example shows how to teplace one chat ftom Stting
Also, you could use contains:
str1.toLowerCase().contains(str2.toLowerCase())
To check if the substring str2 exists in str1
Edit.
Just read that the String come from a file. You can use Regex for this. That would be the best method.
http://docs.oracle.com/javase/tutorial/essential/regex/literals.html

This is your comment:
I want to replace all of the uppercases to lower cases and replace all
of the characters except a-z with space.
You can do it like this:
str = str.toLowerCase().replaceAll("[^a-z]", " ");
Your requirement should be part of the question, not in comment #7 under a posted answer...

You should look into regex for Java. You can match an entire set of characters. Strings have several functions: replace, replaceAll, and match, which you may find useful here.
You can match the set of alphanumeric, for instance, using [a-zA-Z], which may be what you're looking for.

Efficently Replacement of all unsupported chars in a String [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Converting Symbols, Accent Letters to English Alphabet
I need to replace all accented characters, such as
"à", "é", "ì", "ò", "ù"
with
"a'", "e'", "i'", "o'", "u'"...
because of an issue with reloading nested strings with accented characters after they've been saved.
Is there a way to do this without using different string replacement for all chars?
For example, I would prefer to avoid doing
text = text.replace("a", "a'");
text2 = text.replace("è", "e'");
text3 = text2.replace("ì", "i'");
text4 = text3.replace("ò", "o'");
text5 = text4.replace("ù", "u'");
etc.

I tried this from this post it seems to work.
String str= Normalizer.normalize(str, Normalizer.Form.NFD);
str= str.replaceAll("\\p{InCombiningDiacriticalMarks}+", "'");
Edit:
But replacing the Combining diacritical marks, has a side effect that you cannot distinguish between À Á Â

If you don't mind adding commons-lang as a dependency, try StringUtils.replaceEach
I believe the following perform the same task:
import org.apache.commons.lang.StringUtils;
public class ReplaceEachTest
{
public static void main(String [] args)
{
String text = "àéìòùàéìòù";
String [] searchList = {"à", "é", "ì", "ò", "ù"};
String [] replaceList = {"a'", "e'", "i'", "o'", "u'"};
String newtext = StringUtils.replaceEach(text, searchList, replaceList);
System.out.println(newtext);
}
}
This example prints a'e'i'o'u'a'e'i'o'u'
However in general I agree that since you are creating a custom character translation, you will need a solution where your explicitly specify the replacement for each character of interest.
My previous answer using replaceChars is no good because it only handles one-to-one character replacement.

After reading the comments in the main approach, I think a better option would be fix the problem - which is encoding related? - and not try to cover up the symptoms.
Also, this still requires a manual explicit mapping, which might make it less ideal than nandeesh's answer with a regular expression unicode character class.
Here is a skeleton for code to perform the mapping. It is slightly more complicated than a char-char.
This code tries to avoid extra Strings. It may or not be "more efficient". Try it with the real data/usage. YMMV.
String mapAccentChar (char ch) {
switch (ch) {
case 'à': return "a'";
// etc
}
return null;
}
String mapAccents (String input) {
StringBuilder sb = new StringBuilder();
int l = input.length();
for (int i = 0; i < l; i++) {
char ch = input.charAt(i);
String mapped = mapAccentChar(ch);
if (mapped != null) {
sb.append(mapped);
} else {
sb.append(ch);
}
return sb.toString();
}

Since there is no strict correlation between ASCII value of a char and its accented version, your replacement seems to me the most straightforward way.

How do I replace a character in a string in Java?

Using Java, I want to go through the lines of a text and replace all ampersand symbols (&) with the XML entity reference &.
I scan the lines of the text and then each word in the text with the Scanner class. Then I use the CharacterIterator to iterate over each characters of the word. However, how can I replace the character? First, Strings are immutable objects. Second, I want to replace a character (&) with several characters(amp&;). How should I approach this?
CharacterIterator it = new StringCharacterIterator(token);
for(char ch = it.first(); ch != CharacterIterator.DONE; ch = it.next()) {
if(ch == '&') {
}
}

Try using String.replace() or String.replaceAll() instead.
String my_new_str = my_str.replace("&", "&");
(Both replace all occurrences; replaceAll allows use of regex.)

The simple answer is:
token = token.replace("&", "&");
Despite the name as compared to replaceAll, replace does do a replaceAll, it just doesn't use a regular expression, which seems to be in order here (both from a performance and a good practice perspective - don't use regular expressions by accident as they have special character requirements which you won't be paying attention to).
Sean Bright's answer is probably as good as is worth thinking about from a performance perspective absent some further target requirement on performance and performance testing, if you already know this code is a hot spot for performance, if that is where your question is coming from. It certainly doesn't deserve the downvotes. Just use StringBuilder instead of StringBuffer unless you need the synchronization.
That being said, there is a somewhat deeper potential problem here. Escaping characters is a known problem which lots of libraries out there address. You may want to consider wrapping the data in a CDATA section in the XML, or you may prefer to use an XML library (including the one that comes with the JDK now) to actually generate the XML properly (so that it will handle the encoding).
Apache also has an escaping library as part of Commons Lang.

StringBuilder s = new StringBuilder(token.length());
CharacterIterator it = new StringCharacterIterator(token);
for (char ch = it.first(); ch != CharacterIterator.DONE; ch = it.next()) {
switch (ch) {
case '&':
s.append("&");
break;
case '<':
s.append("<");
break;
case '>':
s.append(">");
break;
default:
s.append(ch);
break;
}
}
token = s.toString();

You may also want to check to make sure your not replacing an occurrence that has already been replaced. You can use a regular expression with negative lookahead to do this.
For example:
String str = "sdasdasa&adas&dasdasa";
str = str.replaceAll("&(?!amp;)", "&");
This would result in the string "sdasdasa&adas&dasdasa".
The regex pattern "&(?!amp;)" basically says: Match any occurrence of '&' that is not followed by 'amp;'.

Just create a string that contains all of the data in question and then use String.replaceAll() like below.
String result = yourString.replaceAll("&", "&");

You can use stream and flatMap to map & to &
String str = "begin&end";
String newString = str.chars()
.flatMap(ch -> (ch == '&') ? "&".chars() : IntStream.of(ch))
.collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
.toString();

Escaping strings can be tricky - especially if you want to take unicode into account. I suppose XML is one of the simpler formats/languages to escape but still. I would recommend taking a look at the StringEscapeUtils class in Apache Commons Lang, and its handy escapeXml method.

Try this code.You can replace any character with another given character.
Here I tried to replace the letter 'a' with "-" character for the give string "abcdeaa"
OutPut -->_bcdef__
public class Replace {
public static void replaceChar(String str,String target){
String result = str.replaceAll(target, "_");
System.out.println(result);
}
public static void main(String[] args) {
replaceChar("abcdefaa","a");
}
}

If you're using Spring you can simply call HtmlUtils.htmlEscape(String input) which will handle the '&' to '&' translation.

//I think this will work, you don't have to replace on the even, it's just an example.
public void emphasize(String phrase, char ch)
{
char phraseArray[] = phrase.toCharArray();
for(int i=0; i< phrase.length(); i++)
{
if(i%2==0)// even number
{
String value = Character.toString(phraseArray[i]);
value = value.replace(value,"*");
phraseArray[i] = value.charAt(0);
}
}
}

String taskLatLng = task.getTask_latlng().replaceAll( "\\(","").replaceAll("\\)","").replaceAll("lat/lng:", "").trim();

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How to remove high-ASCII characters from string like ®, ©, ™ in Java - java

I want to detect and remove high-ASCII characters like ®, ©, ™ from a String in Java. Is there any open-source library that can do this?

Related

How to remove leading 0 in the time timestamp 02:25PM using java? [duplicate]

Efficient encoding all the special characters in a string into entities

the best way for character replacement in String in java

Efficently Replacement of all unsupported chars in a String [duplicate]

How do I replace a character in a string in Java?

Categories

Resources