I want to compare a string portion (i.e. a character) against a Chinese character. I assume that, due to the Unicode encoding, it counts as two characters, so I'm looping through the string in increments of two. Now I've run into a roadblock where I'm trying to detect the '兒' character, but equals() doesn't match it. What am I missing? This is the code snippet:
for (int CharIndex = 0; CharIndex < tmpChar.length(); CharIndex=CharIndex+2) {
// Account for 'r' like in dianr/huir
if (tmpChar.substring(CharIndex,CharIndex+2).equals("兒")) {
Also, feel free to suggest a more elegant way to parse this ...
[UPDATE] Some pics from the debugger, showing that it doesn't match, even though it should. I pasted the Chinese character from the spreadsheet I use as input, so I don't think it's a copy-and-paste issue (unless the Unicode gets lost along the way).
Oh, dang, apparently it does not work by simply copying and pasting:
Use CharSequence.codePoints(), which returns a stream of the codepoints, rather than having to deal with chars:
tmpChar.codePoints().forEach(c -> {
if (c == '兒') {
// ...
}
});
(Of course, you could have used tmpChar.codePoints().filter(c -> c == '兒').forEach(c -> { /* ... */ })).
Either work with chars, treating "兒" as a substring:
String s = ...;
if (s.contains("兒")) { ... }
int position = s.indexOf("兒");
if (position != -1) {
int position2 = position + "兒".length();
s = s.substring(0, position) + "*" + s.substring(position2);
}
if (s.startsWith("兒", i)) {
// At position i there is a 兒.
}
Or work with code points, where 兒 would be a single code point. As that is not really easier here, the substring-based variants above seem fine.
if (tmpChar.substring(CharIndex,CharIndex+2).equals("兒")) {
That is your problem: 兒 is only one UTF-16 code unit. Many Chinese characters can be represented in UTF-16 as a single code unit (Java strings use UTF-16 internally); however, other characters need two code units (a surrogate pair).
There are a variety of APIs on the String class for coping with this.
As offered in another answer, obtaining the IntStream from codePoints() lets you get a 32-bit code point for each character. You can compare that to the code point value of the character you are looking for.
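For example, a small sketch of that comparison (the helper name is mine):
// True if the input contains 兒 (U+5152); codePoints() also copes with
// supplementary characters that need two chars in UTF-16.
static boolean containsEr(String input) {
    int target = "兒".codePointAt(0);
    return input.codePoints().anyMatch(cp -> cp == target);
}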
Or, you can use the ICU4J library with a richer set of facilities for all of this.
I am running a GWT application on Google App Engine which passes text input from the GUI via GWT-RPC/Servlet to an API. But umlauts like ä,ö,ü are misinterpreted by the API and the API shows only a ? instead of an umlaut.
I am pretty sure that the problem is the default character encoding on the Google App Engine, which is US-ASCII: US-ASCII does not know any umlaut.
Using umlauts with the API from JUnit-Tests on my local machine works. The default character encoding there is UTF-8.
The problem does not come from GWT or the encoding of any HTML file; I used a constant Java String within the application containing some umlauts and passed it to the API: the problem appears when the application is deployed on Google App Engine.
Is there any way to change the Character Encoding in the Google App Engine? Or does anyone know another solution to my problem?
Storing umlauts from the GUI in the GAE Datastore and bringing them back to the GUI works funnily enough.
I was having the same problem: the default charset of a web application deployed to Google App Engine was set to US-ASCII, but I needed it to be UTF-8.
After a bit of head scratching, I found that adding:
<system-properties>
<property name="appengine.file.encoding" value="UTF-8" />
</system-properties>
to appengine-web.xml correctly sets the charset to UTF-8. More details can be found on Google Issue Tracker - Setting of default encoding.
Workaround (safe)
I wrote this class to encode UTF strings to ASCII strings (replacing every char that is not in the ASCII table with its code number, preceded and followed by a marker). Use AsciiEncoder.encode(yourUtfString).
The String can then be decoded back to UTF with AsciiEncoder.decode(yourAsciiEncodedUtfString) where UTF is supported.
package <your_package>;
import java.util.ArrayList;
/**
* Created by Micha F. aka Peracutor.
* 04.06.2017
*/
public class AsciiEncoder {
public static final char MARK = '%'; // use whatever ASCII char you like (it should not occur often in regular text)
public static String encode(String s) {
StringBuilder result = new StringBuilder(s.length() + 4 * 10); //buffer for 10 special characters (4 additional chars for every special char that gets replaced)
for (char c : s.toCharArray()) {
if ((int) c > 127 || c == MARK) {
result.append(MARK).append((int) c).append(MARK);
} else {
result.append(c);
}
}
return result.toString();
}
public static String decode(String s) {
int lastMark = -1;
ArrayList<Character> chars = new ArrayList<>();
try {
// Collect every number that appears between two marks. The loop is deliberately
// endless; it is terminated by the exception thrown once no further
// well-formed "%number%" sequence can be found.
//noinspection InfiniteLoopStatement
while (true) {
String charString = s.substring(lastMark = s.indexOf(MARK, lastMark + 1) + 1, lastMark = s.indexOf(MARK, lastMark));
char c = (char) Integer.parseInt(charString);
chars.add(c);
}
} catch (IndexOutOfBoundsException | NumberFormatException ignored) {}
// Replace each "%number%" sequence with the character it encodes.
for (char c : chars) {
s = s.replace("" + MARK + ((int) c) + MARK, String.valueOf(c));
}
return s;
}
}
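A quick round-trip usage example (the sample string is arbitrary; ü is code 252 and ß is code 223):
String encoded = AsciiEncoder.encode("Grüße");   // "Gr%252%%223%e"
String decoded = AsciiEncoder.decode(encoded);   // back to "Grüße"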
Hope this helps someone.
If you (like myself) are using the Java flexible environment on Google AppEngine, the default encoding can "simply" be fixed by setting the file.encoding system property through your app.yaml (via an environment variable that is automatically picked up by the runtime) like this:
env_variables:
JAVA_USER_OPTS: -Dfile.encoding=UTF-8
I was unable to insert a Chinese character into MySQL, so I thought of doing this. I have an Excel sheet with Chinese characters, like 秀昭 and so on.
I converted them to Unicode representations like \uXXXX using the code below, which I got from SO, and then stored them in MySQL.
private static String escapeNonAscii(String str) {
List<String> arr = new ArrayList<String>();
StringBuilder retStr = new StringBuilder();
for (int i = 0; i < str.length(); i++) {
int cp = Character.codePointAt(str, i);
System.out.println("cp="+cp);
int charCount = Character.charCount(cp);
if (charCount > 1) {
i += charCount - 1; // 2.
if (i >= str.length()) {
throw new IllegalArgumentException("truncated unexpectedly");
}
}
if (cp < 128) {
retStr.appendCodePoint(cp);
} else {
retStr.append(String.format("\\u%x", cp));
arr.add(String.format("\\\\u%x", cp));
}
}
return retStr.toString();
}
The values have been stored properly. So now I need to display them back. When I tried
System.out.println("\u8BF7\u5728\u6B64\u5904");
It gives me proper output like,
`请在此`
But when I read from DB and did like
System.out.println(rs.getString(1).trim().toString() + " from DB");
It printed
`\u8BF7\u5728\u6B64\u5904`
What might be the problem? Have I missed anything? Please help.
Escape sequences are only processed by the compiler, in source-code literals. To store and retrieve the data from a database, you only have to consider two things: make sure the data you read has the correct encoding, and make sure the correct encoding is set when printing the data.
If you read data on a Windows machine, it is possible you have to use one of the cp* encodings. Just use an InputStreamReader and set the charset. Now you have the data in the JVM; the internal encoding is UTF-16. If you use a type 4 JDBC driver, you do not have to worry about encoding, except that your database needs an encoding capable of storing the data. UTF-8 or Unicode will do the trick. Consult your JDBC documentation for properties to set; sometimes you have to set the encoding explicitly (jdbc:mysql://localhost:3306/?useUnicode=yes&characterEncoding=UTF-8).
When outputting the data, sometimes the output must have a specific encoding. Normally, your JVM runs with the default system charset, but you may need another one, for example when rendering an HTML file.
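A minimal sketch of both ends of that pipeline (the file name is a placeholder; the database side is handled by the driver once the URL above is used):
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) throws Exception {
        // Read the input with an explicit charset instead of the platform default.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream("input.txt"), StandardCharsets.UTF_8))) {
            String line = in.readLine();
            // Write the output with an explicit charset as well.
            PrintStream out = new PrintStream(System.out, true, "UTF-8");
            out.println(line);
        }
    }
}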
I encountered a strange behavior when I parsed an HTML page which contains a Unicode/ASCII element. Here is the example: git://gist.github.com/2995626.git.
What I performed is:
File layout = new File(html_file);
Document doc = Jsoup.parse(layout, "UTF-8");
System.out.println(doc.toString());
What I expected was the HTML triangle, but it is converted to "â–¼". Do you have any suggestions?
Thanks in advance.
Jsoup is perfectly capable of parsing HTML using UTF-8. Even more, it's its default character encoding already. Your problem is caused elsewhere. Based on the information provided so far, I can see two possible problem causes:
The HTML file was originally not saved using UTF-8 (or perhaps the problem is one step earlier: it was originally not read using UTF-8).
The stdout (wherever System.out goes to) does not use UTF-8.
If you make sure that both are correctly set, then your problem should disappear. If not, then there's another possible cause which is not guessable based on the information provided so far in your question. At least, this blog should bring a lot of new insight: Unicode - How to get the characters right?
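If it turns out to be either of those, here is a minimal sketch of pinning both sides to UTF-8 (the file name is a placeholder; only force the charset if the file really is UTF-8):
import java.io.File;
import java.io.PrintStream;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupUtf8Demo {
    public static void main(String[] args) throws Exception {
        // Parse the file explicitly as UTF-8.
        Document doc = Jsoup.parse(new File("layout.html"), "UTF-8");
        // Print through a UTF-8 PrintStream so the console side does not mangle the output.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(doc.toString());
    }
}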
It is a problem caused by Unicode escapes. You can try the code below; the result will show you why the code you wrote is not working.
public static void main(String[] argv) {
String test = "Ch\u00e0o bu\u1ed5i s\u00e1ng";
System.out.println(unicode2String(test));
}
/**
* Converts Unicode escape sequences (\\uXXXX) in a string into the corresponding characters.
*/
public static String unicode2String(String unicode) {
StringBuffer string = new StringBuffer();
String[] hex = unicode.split("\\\\u");
string.append(hex[0]);
for (int i = 1; i < hex.length; i++) {
// Convert each hex value into a code point
int data = Integer.parseInt(hex[i], 16);
// Append it to the string
string.append((char) data);
}
return string.toString();
}
Maybe your code should be as follows:
System.out.println(unicode2String(doc.toString()));
Very similar to this question, except for Java.
What is the recommended way of encoding strings for XML output in Java? The strings might contain characters like "&", "<", etc.
As others have mentioned, using an XML library is the easiest way. If you do want to escape yourself, you could look into StringEscapeUtils from the Apache Commons Lang library.
Very simply: use an XML library. That way it will actually be right instead of requiring detailed knowledge of bits of the XML spec.
Just use.
<![CDATA[ your text here ]]>
This will allow any characters except the ending
]]>
So you can include characters that would be illegal such as & and >. For example.
<element><![CDATA[ characters such as & and > are allowed ]]></element>
However, attributes will need to be escaped as CDATA blocks can not be used for them.
This question is eight years old and there is still no fully correct answer! No, you should not have to import an entire third-party API to do this simple task. Bad advice.
The following method will:
correctly handle characters outside the basic multilingual plane
escape characters required in XML
escape any non-ASCII characters, which is optional but common
replace illegal characters in XML 1.0 with the Unicode replacement character. There is no best option here - removing them is just as valid.
I've tried to optimise for the most common case, while still ensuring you could pipe /dev/random through this and get a valid string in XML.
public static String encodeXML(CharSequence s) {
StringBuilder sb = new StringBuilder();
int len = s.length();
for (int i=0;i<len;i++) {
int c = s.charAt(i);
if (c >= 0xd800 && c <= 0xdbff && i + 1 < len) {
c = ((c-0xd7c0)<<10) | (s.charAt(++i)&0x3ff); // UTF16 decode
}
if (c < 0x80) { // ASCII range: test most common case first
if (c < 0x20 && (c != '\t' && c != '\r' && c != '\n')) {
// Illegal XML character, even encoded. Skip or substitute
sb.append("&#xfffd;"); // Unicode replacement character
} else {
switch(c) {
case '&': sb.append("&amp;"); break;
case '>': sb.append("&gt;"); break;
case '<': sb.append("&lt;"); break;
// Uncomment next two if encoding for an XML attribute
// case '\'': sb.append("&apos;"); break;
// case '\"': sb.append("&quot;"); break;
// Uncomment next three if you prefer, but not required
// case '\n': sb.append("&#10;"); break;
// case '\r': sb.append("&#13;"); break;
// case '\t': sb.append("&#9;"); break;
default: sb.append((char)c);
}
}
} else if ((c >= 0xd800 && c <= 0xdfff) || c == 0xfffe || c == 0xffff) {
// Illegal XML character, even encoded. Skip or substitute
sb.append("&#xfffd;"); // Unicode replacement character
} else {
sb.append("&#x");
sb.append(Integer.toHexString(c));
sb.append(';');
}
}
return sb.toString();
}
Edit: for those who continue to insist it foolish to write your own code for this when there are perfectly good Java APIs to deal with XML, you might like to know that the StAX API included with Oracle Java 8 (I haven't tested others) fails to encode CDATA content correctly: it doesn't escape ]]> sequences in the content. A third party library, even one that's part of the Java core, is not always the best option.
This has worked well for me to provide an escaped version of a text string:
public class XMLHelper {
/**
 * Returns the string where all non-ASCII characters and <, &, > are encoded as numeric entities. I.e. "<A & B >"
 * .... (insert result here). The result is safe to include anywhere in a text field in an XML string. If there were
 * no characters to protect, the original string is returned.
 *
 * @param originalUnprotectedString
 *            original string which may contain characters either reserved in XML or with different representation
 *            in different encodings (like 8859-1 and UTF-8)
 * @return the escaped string, or the original string if nothing needed protecting
 */
public static String protectSpecialCharacters(String originalUnprotectedString) {
if (originalUnprotectedString == null) {
return null;
}
boolean anyCharactersProtected = false;
StringBuffer stringBuffer = new StringBuffer();
for (int i = 0; i < originalUnprotectedString.length(); i++) {
char ch = originalUnprotectedString.charAt(i);
boolean controlCharacter = ch < 32;
boolean unicodeButNotAscii = ch > 126;
boolean characterWithSpecialMeaningInXML = ch == '<' || ch == '&' || ch == '>';
if (characterWithSpecialMeaningInXML || unicodeButNotAscii || controlCharacter) {
stringBuffer.append("&#" + (int) ch + ";");
anyCharactersProtected = true;
} else {
stringBuffer.append(ch);
}
}
if (anyCharactersProtected == false) {
return originalUnprotectedString;
}
return stringBuffer.toString();
}
}
Try this:
String xmlEscapeText(String t) {
StringBuilder sb = new StringBuilder();
for(int i = 0; i < t.length(); i++){
char c = t.charAt(i);
switch(c){
case '<': sb.append("&lt;"); break;
case '>': sb.append("&gt;"); break;
case '\"': sb.append("&quot;"); break;
case '&': sb.append("&amp;"); break;
case '\'': sb.append("&apos;"); break;
default:
if(c>0x7e) {
sb.append("&#"+((int)c)+";");
}else
sb.append(c);
}
}
return sb.toString();
}
StringEscapeUtils.escapeXml() does not escape control characters (< 0x20). XML 1.1 allows control characters; XML 1.0 does not. For example, XStream.toXML() will happily serialize a Java object's control characters into XML, which an XML 1.0 parser will reject.
To escape control characters with Apache commons-lang, use
NumericEntityEscaper.below(0x20).translate(StringEscapeUtils.escapeXml(str))
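Spelled out with its imports, that one-liner might look like this (class names are from commons-lang3; assumes it is on the classpath):
import org.apache.commons.lang3.StringEscapeUtils;
import org.apache.commons.lang3.text.translate.NumericEntityEscaper;

public final class ControlCharXmlEscaper {
    // Escapes the XML specials first, then turns any remaining characters below
    // 0x20 into numeric entities (acceptable to XML 1.1 parsers, not XML 1.0).
    public static String escape(String str) {
        return NumericEntityEscaper.below(0x20).translate(StringEscapeUtils.escapeXml(str));
    }
}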
public String escapeXml(String s) {
return s.replaceAll("&", "&amp;").replaceAll(">", "&gt;").replaceAll("<", "&lt;").replaceAll("\"", "&quot;").replaceAll("'", "&apos;");
}
For those looking for the quickest-to-write solution: use methods from apache commons-lang:
StringEscapeUtils.escapeXml10() for xml 1.0
StringEscapeUtils.escapeXml11() for xml 1.1
StringEscapeUtils.escapeXml() is now deprecated, but was used commonly in the past
Remember to include dependency:
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.5</version> <!--check current version! -->
</dependency>
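Typical usage, assuming that dependency is on the classpath:
import org.apache.commons.lang3.StringEscapeUtils;

public class EscapeDemo {
    public static void main(String[] args) {
        String raw = "Fish & Chips <deluxe>";
        // XML 1.0 escaping: & < > " ' become entities.
        System.out.println(StringEscapeUtils.escapeXml10(raw));
        // Prints: Fish &amp; Chips &lt;deluxe&gt;
    }
}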
While idealism says use an XML library, IMHO if you have a basic idea of XML then common sense and performance says template it all the way. It's arguably more readable too. Though using the escaping routines of a library is probably a good idea.
Consider this: XML was meant to be written by humans.
Use libraries for generating XML when having your XML as an "object" better models your problem. For example, if pluggable modules participate in the process of building this XML.
Edit: as for how to actually escape XML in templates, use of CDATA or escapeXml(string) from JSTL are two good solutions; escapeXml(string) can be used like this:
<%@taglib prefix="fn" uri="http://java.sun.com/jsp/jstl/functions"%>
<item>${fn:escapeXml(value)}</item>
The behavior of StringEscapeUtils.escapeXml() has changed from Commons Lang 2.5 to 3.0.
It now no longer escapes Unicode characters greater than 0x7f.
This is a good thing; the old method was a bit too eager to escape characters that could just be inserted into a UTF-8 document.
The new escapers to be included in Google Guava 11.0 also seem promising:
http://code.google.com/p/guava-libraries/issues/detail?id=799
While I agree with Jon Skeet in principle, sometimes I don't have the option to use an external XML library. And I find it peculiar that the two functions to escape/unescape a simple value (attribute or tag, not a full document) are not available in the standard XML libraries included with Java.
As a result and based on the different answers I have seen posted here and elsewhere, here is the solution I've ended up creating (nothing worked as a simple copy/paste):
public final static String ESCAPE_CHARS = "<>&\"\'";
public final static List<String> ESCAPE_STRINGS = Collections.unmodifiableList(Arrays.asList(new String[] {
"&lt;"
, "&gt;"
, "&amp;"
, "&quot;"
, "&apos;"
}));
private static String UNICODE_NULL = "" + ((char)0x00); //null
private static String UNICODE_LOW = "" + ((char)0x20); //space
private static String UNICODE_HIGH = "" + ((char)0x7f);
//should only be used for the content of an attribute or tag
public static String toEscaped(String content) {
String result = content;
if ((content != null) && (content.length() > 0)) {
boolean modified = false;
StringBuilder stringBuilder = new StringBuilder(content.length());
for (int i = 0, count = content.length(); i < count; ++i) {
String character = content.substring(i, i + 1);
int pos = ESCAPE_CHARS.indexOf(character);
if (pos > -1) {
stringBuilder.append(ESCAPE_STRINGS.get(pos));
modified = true;
}
else {
if ( (character.compareTo(UNICODE_LOW) > -1)
&& (character.compareTo(UNICODE_HIGH) < 1)
) {
stringBuilder.append(character);
}
else {
//Per URL reference below, Unicode null character is always restricted from XML
//URL: https://en.wikipedia.org/wiki/Valid_characters_in_XML
if (character.compareTo(UNICODE_NULL) != 0) {
stringBuilder.append("&#" + ((int)character.charAt(0)) + ";");
}
modified = true;
}
}
}
if (modified) {
result = stringBuilder.toString();
}
}
return result;
}
The above accommodates several different things:
avoids using char based logic until it absolutely has to - improves unicode compatibility
attempts to be as efficient as possible, given that the second "if" condition is likely the most used pathway
is a pure function; i.e. is thread-safe
optimizes nicely with the garbage collector by only returning the contents of the StringBuilder if something actually changed - otherwise, the original string is returned
At some point, I will write the inversion of this function, toUnescaped(). I just don't have time to do that today. When I do, I will come update this answer with the code. :)
Note: Your question is about escaping, not encoding. Escaping is using &lt;, etc., to allow the parser to distinguish between "this is an XML command" and "this is some text". Encoding is the stuff you specify in the XML header (UTF-8, ISO-8859-1, etc.).
First of all, like everyone else said, use an XML library. XML looks simple, but the encoding plus escaping stuff is dark voodoo (which you'll notice as soon as you encounter umlauts and Japanese and other weird stuff like "full width digits" (&#xFF11; is 1)). Keeping XML human-readable is a Sisyphean task.
I suggest never to try to be clever about text encoding and escaping in XML. But don't let that stop you from trying; just remember when it bites you (and it will).
That said, if you use only UTF-8, to make things more readable you can consider this strategy:
If the text does contain '<', '>' or '&', wrap it in <![CDATA[ ... ]]>
If the text doesn't contain these three characters, don't wrap it.
I'm using this in an SQL editor and it allows the developers to cut&paste SQL from a third party SQL tool into the XML without worrying about escaping. This works because the SQL can't contain umlauts in our case, so I'm safe.
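A minimal sketch of that wrap-or-don't-wrap strategy (the method name is mine; as noted above, it assumes the text never contains the CDATA terminator "]]>"):
// Wrap in CDATA only when the text contains one of the three characters that
// would otherwise need escaping; plain text is left untouched for readability.
static String wrapInCdataIfNeeded(String text) {
    if (text.contains("<") || text.contains(">") || text.contains("&")) {
        return "<![CDATA[" + text + "]]>";
    }
    return text;
}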
If you are looking for a library to get the job done, try:
Guava 26.0 documented here
return XmlEscapers.xmlContentEscaper().escape(text);
Note: There is also an xmlAttributeEscaper()
Apache Commons Text 1.4 documented here
StringEscapeUtils.escapeXml11(text)
Note: There is also an escapeXml10() method
To escape XML characters, the easiest way is to use the Apache Commons Lang project, JAR downloadable from: http://commons.apache.org/lang/
The class is this: org.apache.commons.lang3.StringEscapeUtils;
It has a method named "escapeXml", that will return an appropriately escaped String.
You could use the Enterprise Security API (ESAPI) library, which provides methods like encodeForXML and encodeForXMLAttribute. Take a look at the documentation of the Encoder interface; it also contains examples of how to create an instance of DefaultEncoder.
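Usage might look roughly like this, assuming ESAPI and its configuration files are on the classpath (the input string is just an example):
import org.owasp.esapi.ESAPI;

public class EsapiXmlDemo {
    public static void main(String[] args) {
        String raw = "Fish & Chips <deluxe>";
        // encodeForXML escapes element content; encodeForXMLAttribute is for attribute values.
        String safe = ESAPI.encoder().encodeForXML(raw);
        System.out.println(safe);
    }
}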
Use JAXP and forget about text handling; it will be done for you automatically.
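For instance, a minimal JAXP sketch where the DOM serializer does the escaping (the element name and sample text are illustrative):
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.io.StringWriter;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class JaxpEscapeDemo {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element item = doc.createElement("item");
        item.setTextContent("Fish & Chips <deluxe>");   // raw text, no manual escaping
        doc.appendChild(item);

        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        System.out.println(out);   // the serializer emits &amp; and &lt; itself
    }
}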
Here's an easy solution and it's great for encoding accented characters too!
String in = "Hi Lârry & Môe!";
StringBuilder out = new StringBuilder();
for(int i = 0; i < in.length(); i++) {
char c = in.charAt(i);
if(c < 31 || c > 126 || "<>\"'\\&".indexOf(c) >= 0) {
out.append("&#" + (int) c + ";");
} else {
out.append(c);
}
}
System.out.printf("%s%n", out);
Outputs
Hi L&#226;rry &#38; M&#244;e!
Try to encode the XML using Apache XML serializer
//Serialize DOM
OutputFormat format = new OutputFormat (doc);
// as a String
StringWriter stringOut = new StringWriter ();
XMLSerializer serial = new XMLSerializer (stringOut,
format);
serial.serialize(doc);
// Display the XML
System.out.println(stringOut.toString());
Just replace
& with &amp;
And for other characters:
> with &gt;
< with &lt;
\" with &quot;
' with &apos;
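If you go that plain-replacement route, the order matters: the ampersand must be replaced first, otherwise the ampersands introduced by the other replacements get escaped again. A small sketch:
// Ampersand first, then the rest; otherwise "&lt;" would become "&amp;lt;".
static String escapeXml(String s) {
    return s.replace("&", "&amp;")
            .replace("<", "&lt;")
            .replace(">", "&gt;")
            .replace("\"", "&quot;")
            .replace("'", "&apos;");
}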
Here's what I found after searching everywhere looking for a solution:
Get the Jsoup library:
<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.12.1</version>
</dependency>
Then:
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.nodes.Entities
import org.jsoup.parser.Parser
String xml = '''<?xml version = "1.0"?>
<SOAP-ENV:Envelope
xmlns:SOAP-ENV = "http://www.w3.org/2001/12/soap-envelope"
SOAP-ENV:encodingStyle = "http://www.w3.org/2001/12/soap-encoding">
<SOAP-ENV:Body xmlns:m = "http://www.example.org/quotations">
<m:GetQuotation>
<m:QuotationsName> MiscroSoft#G>>gle.com </m:QuotationsName>
</m:GetQuotation>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>'''
Document doc = Jsoup.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")), "UTF-8", "", Parser.xmlParser())
doc.outputSettings().charset("UTF-8")
doc.outputSettings().escapeMode(Entities.EscapeMode.base)
println doc.toString()
Hope this helps someone
I have created my wrapper, hope it helps a lot: Click here. You can modify it depending on your requirements.