Using Java, I want to go through the lines of a text and replace all ampersand symbols (&) with the XML entity reference &amp;.
I scan the lines of the text and then each word in the text with the Scanner class. Then I use a CharacterIterator to iterate over each character of the word. However, how can I replace the character? First, Strings are immutable objects. Second, I want to replace one character (&) with several characters (&amp;). How should I approach this?
CharacterIterator it = new StringCharacterIterator(token);
for(char ch = it.first(); ch != CharacterIterator.DONE; ch = it.next()) {
if(ch == '&') {
}
}
Try using String.replace() or String.replaceAll() instead.
String my_new_str = my_str.replace("&", "&amp;");
(Both replace all occurrences; replaceAll allows use of regex.)
The simple answer is:
token = token.replace("&", "&amp;");
Despite its name, replace replaces every occurrence, just as replaceAll does; it simply doesn't treat its argument as a regular expression. That is what you want here, both for performance and as good practice: don't use regular expressions by accident, because they give some characters a special meaning that you won't be paying attention to.
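To make the difference concrete, here is a minimal sketch (the class and method names are made up for illustration). It shows that replace treats "." as a plain dot while replaceAll treats it as "any character", and that replaceAll also gives "$" and "\" a special meaning in the replacement unless it is quoted with Matcher.quoteReplacement.

```java
import java.util.regex.Matcher;

final class ReplaceDemo {
    // Literal replacement: "." is just a dot; every occurrence is replaced.
    static String literal(String s) {
        return s.replace(".", "-");
    }

    // Regex replacement: "." matches any character, so every character is replaced.
    static String regex(String s) {
        return s.replaceAll(".", "-");
    }

    // In replaceAll the replacement is special too ("$1" is a group reference),
    // so a literal "$" has to be quoted.
    static String toDollar(String s) {
        return s.replaceAll("EUR", Matcher.quoteReplacement("$"));
    }
}
```

With these definitions, ReplaceDemo.literal("a.b") gives "a-b", ReplaceDemo.regex("a.b") gives "---", and ReplaceDemo.toDollar("5 EUR") gives "5 $".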
Sean Bright's answer is probably as good as is worth thinking about from a performance perspective, unless you have a specific performance target and measurements showing that this code is a hot spot, if that is where your question is coming from. It certainly doesn't deserve the downvotes. Just use StringBuilder instead of StringBuffer unless you need the synchronization.
That being said, there is a somewhat deeper potential problem here. Escaping characters is a known problem which lots of libraries out there address. You may want to consider wrapping the data in a CDATA section in the XML, or you may prefer to use an XML library (including the one that comes with the JDK now) to actually generate the XML properly (so that it will handle the encoding).
Apache also has an escaping library as part of Commons Lang.
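As a rough sketch of the "let an XML library do the escaping" idea, the following uses the StAX writer that ships with the JDK (javax.xml.stream). The class name and the element name passed in are placeholders, not anything from the question; the point is only that writeCharacters takes raw text and the writer produces the escaped form.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class XmlWriterDemo {
    // Writes <name>text</name> and lets the XML writer escape the text.
    static String element(String name, String text) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartElement(name);
            xml.writeCharacters(text); // raw text in, escaped markup out
            xml.writeEndElement();
            xml.flush();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

Calling XmlWriterDemo.element("title", "Tom & Jerry") should produce an element whose content reads Tom &amp; Jerry, without any manual string replacement.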
StringBuilder s = new StringBuilder(token.length());
CharacterIterator it = new StringCharacterIterator(token);
for (char ch = it.first(); ch != CharacterIterator.DONE; ch = it.next()) {
switch (ch) {
case '&':
s.append("&amp;");
break;
case '<':
s.append("&lt;");
break;
case '>':
s.append("&gt;");
break;
default:
s.append(ch);
break;
}
}
token = s.toString();
You may also want to check to make sure you're not replacing an occurrence that has already been replaced. You can use a regular expression with negative lookahead to do this.
For example:
String str = "sdasdasa&adas&dasdasa";
str = str.replaceAll("&(?!amp;)", "&amp;");
This would result in the string "sdasdasa&amp;adas&amp;dasdasa".
The regex pattern "&(?!amp;)" basically says: Match any occurrence of '&' that is not followed by 'amp;'.
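If the text may already contain other entities, the same lookahead idea can be widened. The sketch below is an assumption about what "already replaced" should cover: the five predefined XML entities plus numeric character references. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class AmpersandEscaper {
    // An '&' that does not already start a predefined XML entity
    // or a numeric character reference such as &#38; or &#x26;.
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:amp|lt|gt|quot|apos|#\\d+|#x[0-9a-fA-F]+);)");

    static String escapeBareAmpersands(String s) {
        return BARE_AMP.matcher(s).replaceAll("&amp;");
    }
}
```

Note that this kind of check cannot tell an already-escaped "&amp;" apart from raw text that happens to contain those five characters, so it is only safe when you know which of the two your input is.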
Just create a string that contains all of the data in question and then use String.replaceAll() like below.
String result = yourString.replaceAll("&", "&amp;");
You can use a stream and flatMap to map & to &amp;
String str = "begin&end";
String newString = str.chars()
.flatMap(ch -> (ch == '&') ? "&amp;".chars() : IntStream.of(ch))
.collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
.toString();
Escaping strings can be tricky - especially if you want to take unicode into account. I suppose XML is one of the simpler formats/languages to escape but still. I would recommend taking a look at the StringEscapeUtils class in Apache Commons Lang, and its handy escapeXml method.
Try this code. You can replace any character with another given character.
Here I replace the letter 'a' with the "_" character in the given string "abcdefaa".
Output --> _bcdef__
public class Replace {
public static void replaceChar(String str,String target){
String result = str.replaceAll(target, "_");
System.out.println(result);
}
public static void main(String[] args) {
replaceChar("abcdefaa","a");
}
}
If you're using Spring you can simply call HtmlUtils.htmlEscape(String input) which will handle the '&' to '&amp;' translation.
// I think this will work. You don't have to replace at the even indexes; it's just an example.
public String emphasize(String phrase, char ch)
{
    char[] phraseArray = phrase.toCharArray();
    for (int i = 0; i < phraseArray.length; i++)
    {
        if (i % 2 == 0) // even index
        {
            phraseArray[i] = '*'; // overwrite the character at this position
        }
    }
    // Strings are immutable, so the modified characters go into a new String
    return new String(phraseArray);
}
String taskLatLng = task.getTask_latlng().replaceAll( "\\(","").replaceAll("\\)","").replaceAll("lat/lng:", "").trim();
Related
I have a string like this: "abcd !#&$%^^&*()<>!/". I have a list of all the entity codes for characters in a separate string, i.e. only those characters that appear in another string, "!=&4....^=9...", should be encoded. I want to convert all special characters, i.e. everything except alphanumerics, into their entities using a regex, because looping over the characters one by one is too slow.
e.g. it should show "abc ..;..", in other words convert all the special characters on the keyboard.
Is there an efficient regex I can write? I have tried this with loops, but it is too slow to look at each character one by one and maintain a list of all special-character entities in another string.
There are libraries, but they do not convert all of the characters.
The code I wrote
// String to be encoded
String sDecoded = "abcd !##$%^&*();'m,";
// Special character entity list to put instead to special character. It is tokenized on cross and divide symbol as it cannot be entered by user on keyboard
String specialCharacters = "&÷&amp;×–÷&ndash;";
// Check the input
if (sDecoded == null || sDecoded.trim ().length () == 0)
return (sDecoded);
// Use StringTokenizer which is faster than split method
StringTokenizer st = new StringTokenizer(specialCharacters, "×");
String[] reg = null;
String[] charactersArray = sDecoded.split("");
String sEncoded = "";
// now loop on it and in each iteration, we will be getting a decodedCharacter:EncodedEntity pair
for(int i = 0; i < charactersArray.length; i++)
{
st = new StringTokenizer(specialCharacters, "×");
while(st.hasMoreElements())
{
reg = st.nextElement().toString().split("÷");
// This is an error, the character should not be blank ever because it will be character that we will encode
if(StringUtils.isBlank(reg[0]))
return sDecoded;
String c = charactersArray[i];
if(c.equalsIgnoreCase(reg[0]))
{
sEncoded = sEncoded + c.replace(reg[0], reg[1]);
break;
}
if(st.countTokens() == 0)
sEncoded = sEncoded + c.toString();
}
}
return (sEncoded);
I don't know what definition of "efficient" you are using, but there's the "don't reinvent the wheel" efficiency of using a simple call to Apache commons-text StringEscapeUtils utility class:
String encoded = StringEscapeUtils.escapeXml11(str);
or
String encoded = StringEscapeUtils.escapeHtml4(str);
and a variety of other similar methods, depending on which exact encoding you want.
Note: This class was originally in the commons-lang3 library, but was deprecated there and moved to the commons-text library.
Your approach is quite slow and inefficient. Maybe it looks elegant nowadays to use regex as a silver bullet for everything, but it is definitely not the right tool for this task. I see you are also using a tokenizer, which is slow as well. A loop inside a loop will also degrade performance.
I would recommend an iterative approach with a StringBuilder, which will be very fast, as you can verify for yourself. Write an 'if' statement for each special character. Even if it looks like a lot of code, it will be very fast. Test it yourself.
Try this :
class Scratch {
public static void main(String[] args) {
System.out.println(escapeSpecials("abc &"));
}
public static String escapeSpecials(String origin) {
StringBuilder result = new StringBuilder();
char[] chars = origin.toCharArray();
for (char c : chars) {
if (c == '&') {
result.append("&amp;");
} else if (c == '\u2013') {
result.append("&ndash;");
} else {
// not a special character
result.append(c);
}
}
return result.toString();
}
}
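When the list of special characters grows, the chain of 'if' statements can be swapped for a lookup table while keeping the single pass over the input. This is only a sketch of that variation, not the answer's own code: the class name and the particular entries in the map are illustrative, and Map.of needs Java 9 or later.

```java
import java.util.Map;

final class EntityEncoder {
    // Hypothetical mapping; extend it with whatever characters you need.
    private static final Map<Character, String> ENTITIES = Map.of(
            '&', "&amp;",
            '<', "&lt;",
            '>', "&gt;",
            '"', "&quot;",
            '\u2013', "&ndash;");

    static String encode(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String entity = ENTITIES.get(c); // null when c is not a special character
            if (entity != null) {
                out.append(entity);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

For example, EntityEncoder.encode("abc & <d>") returns "abc &amp; &lt;d&gt;". Whether the map lookup or the 'if' chain is faster depends on the data, so measure with your real input.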
I want to check each character of a string and either replace it with other characters or keep it in the string. Because it's a long string, the time this task takes is very important. Which of these ways is best, or is there a better idea?
For all of them I append the result to a StringBuilder.
Check all of the characters with a for loop and charAt.
Use a switch, like the previous way.
Use replaceAll twice.
And if one of the first two methods is better, is there any way to check a character against a group of characters, like:
if (st.charAt(i)=='a'..'z') ....
Edit:
Please tell me the least time-consuming way and explain why. I already know all of the ways you mentioned!
If you want to replace a single character (or a single sequence), use replace(), as other answers have suggested.
If you want to replace several characters (e.g., 'a', 'b', and 'c') with a single substitute character or character sequence (e.g., "X"), you should use a regular expression replace:
String result = original.replaceAll("[abc]", "X");
If you want to replace several characters, each with a different replacement (e.g., 'a' with 'A', 'b' with 'B'), then looping through the string yourself and building the result in a StringBuilder will probably be the most efficient. This is because, as you point out in your question, you will be going through the string only once.
StringBuilder sb = new StringBuilder();
String targets = "abc";
String replacements = "ABC";
for (int i = 0; i < original.length(); ++i) {
char c = original.charAt(i);
int loc = targets.indexOf(c);
sb.append(loc >= 0 ? replacements.charAt(loc) : c);
}
String result = sb.toString();
Check the documentation and find some good methods:
char from = 'a';
char to = 'b';
str = str.replace(from, to);
String replaceSample = "This String replace Example shows how to replace one char from String";
String newString = replaceSample.replace('r', 't');
Output: This Stting teplace Example shows how to teplace one chat ftom Stting
Also, you could use contains:
str1.toLowerCase().contains(str2.toLowerCase())
To check if the substring str2 exists in str1
Edit.
Just read that the String come from a file. You can use Regex for this. That would be the best method.
http://docs.oracle.com/javase/tutorial/essential/regex/literals.html
This is your comment:
I want to replace all of the uppercases to lower cases and replace all
of the characters except a-z with space.
You can do it like this:
str = str.toLowerCase().replaceAll("[^a-z]", " ");
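Since the question also asks how to test a character against a range such as 'a'..'z', here is a one-pass sketch of the same transformation with an explicit range check. The class and method names are made up, and the result is only claimed to match the regex version for ASCII input; some non-ASCII letters lower-case differently in String.toLowerCase() than in Character.toLowerCase().

```java
final class LettersOnly {
    // Lower-cases each character and replaces everything outside a-z with a space.
    static String normalize(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = Character.toLowerCase(input.charAt(i));
            // Java has no 'a'..'z' range syntax; two comparisons do the same job.
            out.append(c >= 'a' && c <= 'z' ? c : ' ');
        }
        return out.toString();
    }
}
```

For example, LettersOnly.normalize("Ab-1c") returns "ab  c". The loop visits each character once and allocates a single buffer, which is why it is a reasonable candidate when the string is long.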
Your requirement should be part of the question, not in comment #7 under a posted answer...
You should look into regex for Java. You can match an entire set of characters. Strings have several methods, replace, replaceAll, and matches, which you may find useful here.
You can match the set of letters, for instance, using [a-zA-Z] (or [a-zA-Z0-9] for alphanumerics), which may be what you're looking for.
I need to replace all accented characters, such as
"à", "é", "ì", "ò", "ù"
with
"a'", "e'", "i'", "o'", "u'"...
because of an issue with reloading nested strings with accented characters after they've been saved.
Is there a way to do this without using different string replacement for all chars?
For example, I would prefer to avoid doing
text = text.replace("à", "a'");
text2 = text.replace("è", "e'");
text3 = text2.replace("ì", "i'");
text4 = text3.replace("ò", "o'");
text5 = text4.replace("ù", "u'");
etc.
I tried this from this post it seems to work.
str = Normalizer.normalize(str, Normalizer.Form.NFD);
str= str.replaceAll("\\p{InCombiningDiacriticalMarks}+", "'");
Edit:
But replacing the Combining diacritical marks, has a side effect that you cannot distinguish between À Á Â
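A small sketch of that side effect, using Unicode escapes so the source does not depend on file encoding (the class and method names are illustrative): À, Á and Â each decompose into 'A' plus a different combining mark, and all three marks are replaced by the same apostrophe.

```java
import java.text.Normalizer;

final class AccentDemo {
    // Decomposes accented letters, then replaces the combining marks with an apostrophe.
    static String markAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "'");
    }
}
```

Here AccentDemo.markAccents("\u00C0\u00C1\u00C2") (À Á Â) returns "A'A'A'", so the grave, acute and circumflex accents can no longer be told apart.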
If you don't mind adding commons-lang as a dependency, try StringUtils.replaceEach
I believe the following perform the same task:
import org.apache.commons.lang.StringUtils;
public class ReplaceEachTest
{
public static void main(String [] args)
{
String text = "àéìòùàéìòù";
String [] searchList = {"à", "é", "ì", "ò", "ù"};
String [] replaceList = {"a'", "e'", "i'", "o'", "u'"};
String newtext = StringUtils.replaceEach(text, searchList, replaceList);
System.out.println(newtext);
}
}
This example prints a'e'i'o'u'a'e'i'o'u'
However, in general I agree that since you are creating a custom character translation, you will need a solution where you explicitly specify the replacement for each character of interest.
My previous answer using replaceChars is no good because it only handles one-to-one character replacement.
After reading the comments in the main approach, I think a better option would be fix the problem - which is encoding related? - and not try to cover up the symptoms.
Also, this still requires a manual explicit mapping, which might make it less ideal than nandeesh's answer with a regular expression unicode character class.
Here is a skeleton for code to perform the mapping. It is slightly more complicated than a char-char.
This code tries to avoid extra Strings. It may or may not be "more efficient". Try it with the real data/usage. YMMV.
String mapAccentChar (char ch) {
switch (ch) {
case 'à': return "a'";
// etc
}
return null;
}
String mapAccents (String input) {
StringBuilder sb = new StringBuilder();
int l = input.length();
for (int i = 0; i < l; i++) {
char ch = input.charAt(i);
String mapped = mapAccentChar(ch);
if (mapped != null) {
sb.append(mapped);
} else {
sb.append(ch);
}
}
return sb.toString();
}
Since there is no strict correlation between ASCII value of a char and its accented version, your replacement seems to me the most straightforward way.
I want to detect and remove high-ASCII characters like ®, ©, ™ from a String in Java. Is there any open-source library that can do this?
If you need to remove all non-US-ASCII (i.e. outside 0x0-0x7F) characters, you can do something like this:
s = s.replaceAll("[^\\x00-\\x7f]", "");
If you need to filter many strings, it would be better to use a precompiled pattern:
private static final Pattern nonASCII = Pattern.compile("[^\\x00-\\x7f]");
...
s = nonASCII.matcher(s).replaceAll("");
And if it's really performance-critical, perhaps Alex Nikolaenkov's suggestion would be better.
I think that you can easily filter your string by hand and check code of the particular character. If it fits your requirements then add it to a StringBuilder and do toString() to it in the end.
public static String filter(String str) {
StringBuilder filtered = new StringBuilder(str.length());
for (int i = 0; i < str.length(); i++) {
char current = str.charAt(i);
if (current >= 0x20 && current <= 0x7e) {
filtered.append(current);
}
}
return filtered.toString();
}
A nice way to do this is to use Google Guava CharMatcher:
String newString = CharMatcher.ascii().retainFrom(string); // older Guava versions: CharMatcher.ASCII
newString will contain only the ASCII characters (code point < 128) from the original string.
This reads more naturally than a regular expression. Regular expressions can take more effort to understand for subsequent readers of your code.
I understand that you need to delete characters such as ç, ã, Ã, but for everybody who needs to convert ç, ã, Ã to c, a, A, please have a look at this piece of code:
Example Code:
final String input = "Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ";
System.out.println(
Normalizer
.normalize(input, Normalizer.Form.NFD)
.replaceAll("[^\\p{ASCII}]", "")
);
Output:
This is a funky String
I'm using a framework which returns malformed Strings with "empty" characters from time to time.
"foobar" for example is represented by:
[,f,o,o,b,a,r]
The first character is NOT a whitespace (' '), so System.out.println() would print "foobar" and not " foobar". Yet, the length of the String is 7 instead of 6. Obviously this makes most String methods (equals, split, substring, ...) useless. Is there a way to remove empty characters from a String?
I tried to build a new String like this:
StringBuilder sb = new StringBuilder();
for (final char character : malformedString.toCharArray()) {
if (Character.isDefined(character)) {
sb.append(character);
}
}
sb.toString();
Unfortunately this doesn't work. Same with the following code:
StringBuilder sb = new StringBuilder();
for (final Character character : malformedString.toCharArray()) {
if (character != null) {
sb.append(character);
}
}
sb.toString();
I also can't check for an empty character like this:
if (character == ''){
//
}
Obviously there is something wrong with the String, but I can't change the framework I'm using or wait for them to fix it (if it is a bug within their framework). I need to handle this String and sanitize it.
Any ideas?
Regex would be an appropriate way to sanitize the string from unwanted Unicode characters in this case.
String sanitized = dirty.replaceAll("[\uFEFF-\uFFFF]", "");
This will replace all char in \uFEFF-\uFFFF range with the empty string.
The [...] construct is called a character class, e.g. [aeiou] matches one of any of the lowercase vowels, [^aeiou] matches anything but.
You can do one of these two approaches:
replaceAll("[blacklist]", "")
replaceAll("[^whitelist]", "")
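As a concrete sketch of the two approaches (the class name, method names and the particular character sets are illustrative choices, not part of the answer above): the blacklist removes only two known invisible characters, the whitelist keeps only printable ASCII.

```java
final class Sanitizer {
    // Blacklist: remove only characters known to cause trouble
    // (here the BOM / zero-width no-break space and the zero-width space).
    static String dropInvisible(String s) {
        return s.replaceAll("[\\uFEFF\\u200B]", "");
    }

    // Whitelist: keep printable ASCII (space through '~') and drop everything else.
    static String keepPrintableAscii(String s) {
        return s.replaceAll("[^\\x20-\\x7E]", "");
    }
}
```

Sanitizer.dropInvisible("\uFEFFfoobar") returns "foobar" with length 6. The whitelist version is stricter: it would also strip accented letters and line breaks, so pick the set to match what your data may legitimately contain.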
References
regular-expressions.info
It's probably the NULL character which is represented by \0. You can get rid of it by String#trim().
To nail down the exact codepoint, do so:
for (char c : string.toCharArray()) {
System.out.printf("U+%04x ", (int) c);
}
Then you can find the exact character here.
Update: as per the update:
Anyone know of a way to just include a range of valid characters instead of excluding 95% of the UTF8 range?
You can do that with help of regex. See the answer of #polygenelubricants here and this answer.
On the other hand, you can also fix the problem at its root instead of working around it. Either update the files to get rid of the BOM, which is a legacy way to distinguish UTF-8 files from others and is nowadays worthless, or use a Reader which recognizes and skips the BOM. Also see this question.
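A minimal sketch of a BOM-skipping Reader, assuming the input has already been decoded to characters so that the BOM shows up as a leading U+FEFF. BomSkipper, skipBom and readAll are hypothetical names, not an existing library API.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.UncheckedIOException;

final class BomSkipper {
    // Returns a Reader positioned after a leading U+FEFF, if there is one.
    static Reader skipBom(Reader in) throws IOException {
        PushbackReader reader = new PushbackReader(in, 1);
        int first = reader.read();
        if (first != -1 && first != 0xFEFF) {
            reader.unread(first); // not a BOM: put the character back
        }
        return reader;
    }

    // Convenience helper: reads everything from the Reader, minus a leading BOM.
    static String readAll(Reader in) {
        try (Reader reader = skipBom(in)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, BomSkipper.readAll(new java.io.StringReader("\uFEFFfoobar")) returns "foobar", and input without a BOM comes back unchanged. Only the first character is inspected, so a U+FEFF in the middle of the text is left alone.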
A very simple way to remove the UTF-8 BOM from a string, using substring as Denis Tulskiy suggested. No looping needed. Just checks the first character for the mark and skips it if needed.
public static String removeUTF8BOM(String s) {
if (s.startsWith("\uFEFF")) {
s = s.substring(1);
}
return s;
}
I needed to add this to my code when using the Apache HttpClient EntityUtils class to read from a web server. The web server was not sending the blank mark, but it was getting pulled in while reading the input stream. Original article can be found here.
Thank you Johannes Rössel. It actually was '\uFEFF'
The following code works:
final StringBuilder sb = new StringBuilder();
for (final char character : body.toCharArray()) {
if (character != '\uFEFF') {
sb.append(character);
}
}
final String sanitzedString = sb.toString();
Anyone know of a way to just include a range of valid characters instead of excluding 95% of the UTF8 range?
Trimming left or right removes white space. Does it have a colon before the space?
Even more:
Casting the first character to a number, e.g. (int) string.charAt(0), will show you the char code, and you can then use replace() or substring().
This is what worked for me:-
StringBuilder sb = new StringBuilder();
for (char character : myString.toCharArray()) {
int i = (int) character;
if (i > 0 && i <= 256) {
sb.append(character);
}
}
return sb.toString();
The int value of my NULL characters was in the region of 8103 or something.
You can try replace:
s.replace("\u200B", "")
or
s.replace("\uFEFF", "")
Kotlin:
s.filter { it != '\u200B' }
for (int i = 0; i < s.length(); i++)
if (s.charAt(i) == ' ') {
your code....
}
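Simply calling malformedString.trim() will solve the issue if the stray character is a leading or trailing control character (code point U+0020 or below); note that trim() does not remove characters such as '\uFEFF'.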
You could check for the whitespace like this:
if (character.equals(' ')){ // }