With the help of tucuxi from the existing post "Java remove HTML from String without regular expressions", I have built a method that strips basic HTML tags from a string. Sometimes, however, the original string contains HTML hexadecimal character references like &#x00E9; (an accented e, é). I have started to add functionality that translates these escaped characters into real characters.
You're probably asking: why not use regular expressions, or a third-party library? Unfortunately I cannot: I am developing on a BlackBerry platform, which does not support regular expressions, and I have never managed to add a third-party library to my project.
So, I have gotten to the point where any &#x00E9; is replaced with "e". My question now is: how do I add an actual accented e to a string?
Here is my code:
public static String removeHTML(String synopsis) {
    char[] cs = synopsis.toCharArray();
    String sb = new String();
    boolean tag = false;
    for (int i = 0; i < cs.length; i++) {
        switch (cs[i]) {
            case '<':
                if (!tag) {
                    tag = true;
                    break;
                }
            case '>':
                if (tag) {
                    tag = false;
                    break;
                }
            case '&':
                char[] copyTo = new char[7];
                System.arraycopy(cs, i, copyTo, 0, 7);
                String result = new String(copyTo);
                if (result.equals("&#x00E9")) {
                    sb += "e";
                }
                i += 7;
                break;
            default:
                if (!tag)
                    sb += cs[i];
        }
    }
    return sb.toString();
}
Thanks!
Java Strings are Unicode.
sb += '\u00E9'; // lower-case e with acute accent (é)
sb += '\u00C9'; // upper-case E with acute accent (É)
You can print out just about any character you like in Java as it uses the Unicode character set.
To find the character you want take a look at the charts here:
http://www.unicode.org/charts/
In the Latin-1 Supplement document you'll see the Unicode numbers for the accented characters. You should see the hex number 00E9 listed for é, for example. The numbers for all Latin accented characters are in this document, so you should find it pretty useful.
To use the character in a String, just use the Unicode escape sequence: \u followed by the character code, like so:
System.out.print("Let's go to the caf\u00E9");
This would produce: "Let's go to the café"
Depending on which version of Java you're using, you might also find StringBuilders (or StringBuffers if you're multi-threaded) more efficient than using the + operator to concatenate Strings.
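Since regular expressions and libraries are not available to the asker, the numeric character references could also be decoded by hand instead of being matched one by one. The following is only a minimal sketch, not the original poster's code: the class and method names are hypothetical, it assumes standard Java SE (StringBuilder.appendCodePoint), and it handles only numeric references of the form &#xHHHH; and &#DDD;, leaving anything else untouched.

```java
final class NumericEntityDecoder {
    /** Replaces numeric references like &#x00E9; or &#233; with the character they name. */
    static String decode(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            // Only look for a terminating ';' when we are standing on '&'.
            int end = (c == '&') ? input.indexOf(';', i) : -1;
            if (end > i + 2 && input.charAt(i + 1) == '#') {
                char marker = input.charAt(i + 2);
                boolean hex = (marker == 'x' || marker == 'X');
                String digits = input.substring(i + (hex ? 3 : 2), end);
                try {
                    out.appendCodePoint(Integer.parseInt(digits, hex ? 16 : 10));
                    i = end + 1; // skip past the ';'
                    continue;
                } catch (IllegalArgumentException e) {
                    // Not a valid number or code point: copy the '&' literally below.
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }
}
```

Called as NumericEntityDecoder.decode("caf&#x00E9;"), this should yield "café". On a platform without StringBuilder, a StringBuffer and a cast to char would be the likely substitutes, at the cost of not supporting characters outside the Basic Multilingual Plane.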
Try this:
if (result.equals("&#x00E9")) {
    sb += (char) 233; // é is U+00E9, decimal 233
}
instead of
if (result.equals("&#x00E9")) {
    sb += "e";
}
The thing is that you're not adding an accent on top of the 'e' character; the accented e is a separate character altogether. This site lists the character codes. Note that extended-ASCII tables often list é as 130 (code page 437), which is not its Unicode value.
For a table of accented characters in Java, take a look at this reference.
To decode the HTML part, use StringEscapeUtils from Apache Commons Lang:
import org.apache.commons.lang.StringEscapeUtils;
...
String withCharacters = StringEscapeUtils.unescapeHtml(yourString);
See also this Stack Overflow thread:
Replace HTML codes with equivalent characters in Java
I had to export a bunch of strings to a CSV file that I opened in Excel. The strings contained '\n' and '\t', which I needed to keep visible in the CSV, so I did the following before exporting the data:
public static String unEscapString(String s)
{
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); i++)
    {
        switch (s.charAt(i))
        {
            case '\n': sb.append("\\n"); break;
            case '\t': sb.append("\\t"); break;
            default: sb.append(s.charAt(i));
        }
    }
    return sb.toString();
}
The problem is that I am now reimporting the data into Java but I can't figure out how to get the newline and tab to print correctly again. I've tried:
s.replaceAll("\\n", "\n");
but it still ignores the newlines. Help?
EDIT: Example of what I'm trying to do:
Say one string in the CSV is "foo \n bar". When I import it using Java, I want to print the same string to the console and have the newline behave correctly.
replaceAll's first argument is a regular expression, and the regex \n matches an actual newline character, not a backslash followed by 'n'. You have two choices. You can use plain old replace, like so:
s = s.replace("\\n", "\n");
or you can escape the backslash for the regex parser:
s = s.replaceAll("\\\\n", "\n");
or, equivalently,
s = s.replaceAll(Pattern.quote("\\n"), "\n");
Either way, assign the result, since Strings are immutable. I would opt for replace since you're not using a regular expression.
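If both \n and \t need to be restored, the export loop from the question can also be mirrored in a single pass. This is a hedged sketch rather than a drop-in fix: the class and method names are hypothetical, and it assumes the exported text contains no literal backslash that happens to be followed by 'n' or 't', because the original export did not escape backslashes.

```java
final class CsvEscapes {
    /** Turns the two-character sequences \n and \t back into a real newline and tab. */
    static String restoreEscapes(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                char next = s.charAt(i + 1);
                if (next == 'n') { sb.append('\n'); i++; continue; }
                if (next == 't') { sb.append('\t'); i++; continue; }
            }
            sb.append(c); // any other character, including a lone backslash
        }
        return sb.toString();
    }
}
```

With this, System.out.print(CsvEscapes.restoreEscapes(line)) should print "foo", a line break, and " bar" for the CSV value from the question.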
It should be
sb.append("\n");
Otherwise, you will get a '\' and a 'n' by using "\\n".
But I recommend using:
sb.append(System.getProperty("line.separator"));
Here System.getProperty("line.separator") gives you the platform's line separator. Since Java 7 there is also a method that returns the value directly: System.lineSeparator().
If you want an actual newline in the string, it should be \n, not \\n. The way you have it, it is being interpreted as a backslash and then an 'n'.
I want to check each character of a string and either replace it with other characters or keep it. Because the string is long, the time this takes matters a lot. Which of these approaches is best, or is there a better idea?
In all of them I append the result to a StringBuilder.
Check every character with a for loop and charAt.
Use a switch in the same kind of loop.
Use replaceAll twice.
And if one of the first two methods is better, is there any way to check a character against a group of characters, like:
if (st.charAt(i)=='a'..'z') ....
Edit:
Please tell me the least time-consuming way, and give the reason. I already know all of the ways you have mentioned!
If you want to replace a single character (or a single sequence), use replace(), as other answers have suggested.
If you want to replace several characters (e.g., 'a', 'b', and 'c') with a single substitute character or character sequence (e.g., "X"), you should use a regular expression replace:
String result = original.replaceAll("[abc]", "X");
If you want to replace several characters, each with a different replacement (e.g., 'a' with 'A', 'b' with 'B'), then looping through the string yourself and building the result in a StringBuilder will probably be the most efficient. This is because, as you point out in your question, you will be going through the string only once.
StringBuilder sb = new StringBuilder();
String targets = "abc";
String replacements = "ABC";
for (int i = 0; i < original.length(); ++i) {
    char c = original.charAt(i);
    int loc = targets.indexOf(c);
    sb.append(loc >= 0 ? replacements.charAt(loc) : c);
}
String result = sb.toString();
Check the documentation and find some good methods:
char from = 'a';
char to = 'b';
str = str.replace(from, to);
String replaceSample = "This String replace Example shows how to replace one char from String";
String newString = replaceSample.replace('r', 't');
Output: This Stting teplace Example shows how to teplace one chat ftom Stting
Also, you could use contains:
str1.toLowerCase().contains(str2.toLowerCase())
To check if the substring str2 exists in str1
Edit.
I just read that the String comes from a file. You can use a regex for this. That would be the best method.
http://docs.oracle.com/javase/tutorial/essential/regex/literals.html
This is your comment:
I want to replace all of the uppercases to lower cases and replace all
of the characters except a-z with space.
You can do it like this:
str = str.toLowerCase().replaceAll("[^a-z]", " ");
Your requirement should be part of the question, not in comment #7 under a posted answer...
You should look into regex for Java. You can match an entire set of characters. Strings have several methods that you may find useful here: replace, replaceAll, and matches.
You can match the set of ASCII letters, for instance, using [a-zA-Z], which may be what you're looking for.
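The range test asked about in the question can be written with two comparisons, since char values are ordered. Below is a minimal single-pass sketch of the stated requirement (lower-case everything, replace anything outside a-z with a space); the class and method names are hypothetical, and it is only claimed to match the toLowerCase().replaceAll("[^a-z]", " ") one-liner for ASCII input. Whether it is actually faster should be measured on the real data.

```java
final class AsciiNormalizer {
    /** Lower-cases ASCII letters and replaces every other character with a space. */
    static String lowerLettersOnly(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 'a' && c <= 'z') {
                sb.append(c);                          // already a lower-case letter
            } else if (c >= 'A' && c <= 'Z') {
                sb.append((char) (c + ('a' - 'A')));   // shift into the lower-case range
            } else {
                sb.append(' ');                        // everything else becomes a space
            }
        }
        return sb.toString();
    }
}
```

The loop visits each character once and allocates a single buffer, whereas the regex version makes one pass for toLowerCase and another for replaceAll.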
I need to replace all accented characters, such as
"à", "é", "ì", "ò", "ù"
with
"a'", "e'", "i'", "o'", "u'"...
because of an issue with reloading nested strings with accented characters after they've been saved.
Is there a way to do this without using different string replacement for all chars?
For example, I would prefer to avoid doing
text = text.replace("à", "a'");
text2 = text.replace("è", "e'");
text3 = text2.replace("ì", "i'");
text4 = text3.replace("ò", "o'");
text5 = text4.replace("ù", "u'");
etc.
I tried this, taken from this post, and it seems to work.
str = Normalizer.normalize(str, Normalizer.Form.NFD);
str = str.replaceAll("\\p{InCombiningDiacriticalMarks}+", "'");
Edit:
But replacing the combining diacritical marks has a side effect: you can no longer distinguish between À, Á, and Â.
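To make the mechanism concrete: NFD normalization splits each accented letter into its base letter followed by one or more combining marks, and the regular expression then swaps the run of marks for an apostrophe. A small self-contained sketch (the class and method names are hypothetical) that also shows the side effect mentioned above:

```java
import java.text.Normalizer;

final class AccentToApostrophe {
    /** Decomposes accented letters and replaces each run of combining marks with an apostrophe. */
    static String convert(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "'");
    }

    public static void main(String[] args) {
        // The accented vowels are written as Unicode escapes to avoid source-encoding problems.
        System.out.println(convert("\u00E0\u00E9\u00EC\u00F2\u00F9")); // a'e'i'o'u'
        // Grave and acute versions of the same letter collapse to the same output.
        System.out.println(convert("\u00C0").equals(convert("\u00C1"))); // true
    }
}
```

Because of the + quantifier, a letter carrying several marks still receives a single apostrophe.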
If you don't mind adding commons-lang as a dependency, try StringUtils.replaceEach
I believe the following perform the same task:
import org.apache.commons.lang.StringUtils;
public class ReplaceEachTest
{
    public static void main(String [] args)
    {
        String text = "àéìòùàéìòù";
        String [] searchList = {"à", "é", "ì", "ò", "ù"};
        String [] replaceList = {"a'", "e'", "i'", "o'", "u'"};
        String newtext = StringUtils.replaceEach(text, searchList, replaceList);
        System.out.println(newtext);
    }
}
This example prints a'e'i'o'u'a'e'i'o'u'
However, in general I agree that since you are creating a custom character translation, you will need a solution where you explicitly specify the replacement for each character of interest.
My previous answer using replaceChars is no good because it only handles one-to-one character replacement.
After reading the comments on the question, I think a better option would be to fix the underlying problem (which is encoding related?) rather than try to cover up the symptoms.
Also, this still requires a manual explicit mapping, which might make it less ideal than nandeesh's answer with a regular expression Unicode character class.
Here is a skeleton for code to perform the mapping. It is slightly more complicated than a char-to-char mapping.
This code tries to avoid extra Strings. It may or may not be "more efficient". Try it with the real data/usage. YMMV.
String mapAccentChar (char ch) {
    switch (ch) {
        case 'à': return "a'";
        // etc
    }
    return null;
}
String mapAccents (String input) {
    StringBuilder sb = new StringBuilder();
    int l = input.length();
    for (int i = 0; i < l; i++) {
        char ch = input.charAt(i);
        String mapped = mapAccentChar(ch);
        if (mapped != null) {
            sb.append(mapped);
        } else {
            sb.append(ch);
        }
    }
    return sb.toString();
}
Since there is no strict correlation between ASCII value of a char and its accented version, your replacement seems to me the most straightforward way.
When I have a string such as:
String x = "hello\nworld";
How do I get Java to print the actual escape character (and not interpret it as an escape character) when using System.out?
For example, when calling
System.out.print(x);
I would like to see:
hello\nworld
And not:
hello
world
I would like to see the actual escape characters for debugging purposes.
Use the StringEscapeUtils.escapeJava method from the Apache Commons Lang library (package org.apache.commons.lang):
String x = "hello\nworld";
System.out.print(StringEscapeUtils.escapeJava(x));
One way to do this is:
public static String unEscapeString(String s){
StringBuilder sb = new StringBuilder();
for (int i=0; i<s.length(); i++)
switch (s.charAt(i)){
case '\n': sb.append("\\n"); break;
case '\t': sb.append("\\t"); break;
// ... rest of escape characters
default: sb.append(s.charAt(i));
}
return sb.toString();
}
and you run System.out.print(unEscapeString(x)).
You have to escape the slash itself:
String x = "hello\\nworld";
System.out.println("hello \\nworld");
Java uses the same escape sequences as C.
use String x = "hello\\nworld";
Just escape the escape character.
String x = "hello\\nworld";
Try to escape the backslash like \\n
You might want to check out this method. Although this may do more than you intend. Alternatively, use String replace methods for new lines, carriage returns and tab characters. Do keep in mind that there are also such things as unicode and hex sequences.
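For debugging output without a library, the hand-written approach can be extended to cover the backslash itself and any non-printable or non-ASCII character. This is a rough sketch under my own assumptions (hypothetical class name; anything outside printable ASCII is shown as a \uXXXX escape), not a full reimplementation of what a library escaper does:

```java
final class DebugEscaper {
    /** Returns s with control, backslash and non-ASCII characters shown as Java-style escapes. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\n': sb.append("\\n"); break;
                case '\t': sb.append("\\t"); break;
                case '\r': sb.append("\\r"); break;
                case '\\': sb.append("\\\\"); break;
                default:
                    if (c < 0x20 || c > 0x7E) {
                        // Outside printable ASCII: emit a four-digit hexadecimal escape.
                        sb.append(String.format("\\u%04X", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }
}
```

System.out.print(DebugEscaper.escape("hello\nworld")) should then show hello\nworld on a single line.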
Using Java, I want to go through the lines of a text and replace all ampersand symbols (&) with the XML entity reference &amp;.
I scan the lines of the text, and then each word, with the Scanner class. Then I use a CharacterIterator to iterate over the characters of each word. However, how can I replace the character? First, Strings are immutable objects. Second, I want to replace one character (&) with several characters (&amp;). How should I approach this?
CharacterIterator it = new StringCharacterIterator(token);
for(char ch = it.first(); ch != CharacterIterator.DONE; ch = it.next()) {
if(ch == '&') {
}
}
Try using String.replace() or String.replaceAll() instead.
String my_new_str = my_str.replace("&", "&amp;");
(Both replace all occurrences; replaceAll allows use of regex.)
The simple answer is:
token = token.replace("&", "&amp;");
Despite its name, replace also replaces every occurrence, just like replaceAll; it simply doesn't use a regular expression, which is appropriate here, both for performance and as good practice: don't use regular expressions by accident, because they give special meaning to characters you won't be paying attention to.
Sean Bright's answer is probably as good as is worth thinking about from a performance perspective, absent a specific performance requirement and performance testing, if you already know this code is a hot spot and that is where your question is coming from. It certainly doesn't deserve the downvotes. Just use StringBuilder instead of StringBuffer unless you need the synchronization.
That being said, there is a somewhat deeper potential problem here. Escaping characters is a known problem which lots of libraries out there address. You may want to consider wrapping the data in a CDATA section in the XML, or you may prefer to use an XML library (including the one that comes with the JDK now) to actually generate the XML properly (so that it will handle the encoding).
Apache also has an escaping library as part of Commons Lang.
StringBuilder s = new StringBuilder(token.length());
CharacterIterator it = new StringCharacterIterator(token);
for (char ch = it.first(); ch != CharacterIterator.DONE; ch = it.next()) {
    switch (ch) {
        case '&':
            s.append("&amp;");
            break;
        case '<':
            s.append("&lt;");
            break;
        case '>':
            s.append("&gt;");
            break;
        default:
            s.append(ch);
            break;
    }
}
token = s.toString();
You may also want to make sure you're not replacing an occurrence that has already been replaced. You can use a regular expression with negative lookahead to do this.
For example:
String str = "sdasdasa&amp;adas&dasdasa";
str = str.replaceAll("&(?!amp;)", "&amp;");
This would result in the string "sdasdasa&amp;adas&amp;dasdasa".
The regex pattern "&(?!amp;)" basically says: match any occurrence of '&' that is not followed by 'amp;'.
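The same lookahead idea can be widened so that the other predefined XML entities are not double-escaped either. A small sketch, with a hypothetical class name and my own choice of the five predefined entity names (numeric references such as &#233; are deliberately not covered):

```java
import java.util.regex.Pattern;

final class BareAmpersandEscaper {
    // An '&' that does not start one of the five predefined XML entities.
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:amp|lt|gt|quot|apos);)");

    /** Escapes only ampersands that are not already part of a predefined entity. */
    static String escape(String s) {
        return BARE_AMP.matcher(s).replaceAll("&amp;");
    }
}
```

Running escape twice gives the same result as running it once. The trade-off is that input which is meant to contain the literal text "&amp;" will be left as it is, so this only suits data known to be partially escaped already.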
Just create a string that contains all of the data in question and then use String.replaceAll() like below.
String result = yourString.replaceAll("&", "&amp;");
You can use a stream and flatMap to map & to &amp; (this needs an import of java.util.stream.IntStream):
String str = "begin&end";
String newString = str.chars()
    .flatMap(ch -> (ch == '&') ? "&amp;".chars() : IntStream.of(ch))
    .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
    .toString();
Escaping strings can be tricky - especially if you want to take unicode into account. I suppose XML is one of the simpler formats/languages to escape but still. I would recommend taking a look at the StringEscapeUtils class in Apache Commons Lang, and its handy escapeXml method.
Try this code. You can replace any character with another given character.
Here I replace the letter 'a' with the "_" character in the given string "abcdefaa".
Output --> _bcdef__
public class Replace {
    public static void replaceChar(String str,String target){
        String result = str.replaceAll(target, "_");
        System.out.println(result);
    }
    public static void main(String[] args) {
        replaceChar("abcdefaa","a");
    }
}
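One caveat with the code above: replaceAll treats its first argument as a regular expression, so a target such as "." or "(" will not behave as a literal. A hedged sketch of two ways around that, using hypothetical helper names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LiteralReplace {
    /** Literal replacement: String.replace never interprets the target as a regex. */
    static String replaceLiteral(String str, String target, String replacement) {
        return str.replace(target, replacement);
    }

    /** Same effect through the regex API, with both arguments quoted. */
    static String replaceQuoted(String str, String target, String replacement) {
        return str.replaceAll(Pattern.quote(target), Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println("a.b".replaceAll(".", "_"));    // ___  ('.' matches any character)
        System.out.println(replaceLiteral("a.b", ".", "_")); // a_b
        System.out.println(replaceQuoted("a.b", ".", "_"));  // a_b
    }
}
```

For a plain character-for-character swap, replaceLiteral is the simpler choice.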
If you're using Spring you can simply call HtmlUtils.htmlEscape(String input), which will handle the '&' to '&amp;' translation.
// I think this will work. You don't have to replace at the even positions; it's just an example.
public String emphasize(String phrase, char ch)
{
    char[] phraseArray = phrase.toCharArray();
    for (int i = 0; i < phraseArray.length; i++)
    {
        if (i % 2 == 0) // even index
        {
            phraseArray[i] = '*';
        }
    }
    return new String(phraseArray); // return the modified copy so the caller can use it
}
String taskLatLng = task.getTask_latlng().replaceAll( "\\(","").replaceAll("\\)","").replaceAll("lat/lng:", "").trim();