How should I escape strings in JSON? - java

When creating JSON data manually, how should I escape string fields? Should I use something like Apache Commons Lang's StringEscapeUtils.escapeHtml, StringEscapeUtils.escapeXml, or should I use java.net.URLEncoder?
The problem is that when I use StringEscapeUtils.escapeHtml, it doesn't escape quotes, and when I wrap the whole string in a pair of 's, malformed JSON is generated.

Ideally, find a JSON library in your language that you can feed some appropriate data structure to, and let it worry about how to escape things. It'll keep you much saner. If for whatever reason you don't have a library in your language, you don't want to use one (I wouldn't suggest this¹), or you're writing a JSON library, read on.
Escape it according to the RFC. JSON is pretty liberal: The only characters you must escape are \, ", and control codes (anything less than U+0020).
This structure of escaping is specific to JSON. You'll need a JSON specific function. All of the escapes can be written as \uXXXX where XXXX is the UTF-16 code unit¹ for that character. There are a few shortcuts, such as \\, which work as well. (And they result in a smaller and clearer output.)
For full details, see the RFC.
¹JSON's escaping is built on JS, so it uses \uXXXX, where XXXX is a UTF-16 code unit. For code points outside the BMP, this means encoding surrogate pairs, which can get a bit hairy. (Or, you can just output the character directly, since JSON's encoded form is Unicode text, and allows these particular characters.)
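As a rough illustration of the rules above, here is a minimal sketch of a JSON string quoter. The class and method names (JsonEscapeSketch, quoteAscii) are hypothetical, and escaping every non-ASCII character is a choice made for this example, not something JSON requires. Because a Java String is already a sequence of UTF-16 code units, a character outside the BMP comes out as two backslash-u escapes (a surrogate pair) without any extra work.

```java
final class JsonEscapeSketch {
    /** Returns s as a quoted JSON string literal containing only ASCII characters. */
    static String quoteAscii(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 2).append('"');
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i); // one UTF-16 code unit
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20 || c > 0x7E) {
                        // Control codes must be escaped; escaping non-ASCII
                        // characters is optional and done here by assumption.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }
}
```

The shortcut escapes keep common cases readable; everything else that needs escaping falls through to the generic four-hex-digit form.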

Extract From Jettison:
public static String quote(String string) {
if (string == null || string.length() == 0) {
return "\"\"";
}
char c = 0;
int i;
int len = string.length();
StringBuilder sb = new StringBuilder(len + 4);
String t;
sb.append('"');
for (i = 0; i < len; i += 1) {
c = string.charAt(i);
switch (c) {
case '\\':
case '"':
sb.append('\\');
sb.append(c);
break;
case '/':
// if (b == '<') {
sb.append('\\');
// }
sb.append(c);
break;
case '\b':
sb.append("\\b");
break;
case '\t':
sb.append("\\t");
break;
case '\n':
sb.append("\\n");
break;
case '\f':
sb.append("\\f");
break;
case '\r':
sb.append("\\r");
break;
default:
if (c < ' ') {
t = "000" + Integer.toHexString(c);
sb.append("\\u" + t.substring(t.length() - 4));
} else {
sb.append(c);
}
}
}
sb.append('"');
return sb.toString();
}

Try this org.codehaus.jettison.json.JSONObject.quote("your string").
Download it here: http://mvnrepository.com/artifact/org.codehaus.jettison/jettison

org.json.simple.JSONObject.escape() escapes quotes, \, /, \r, \n, \b, \f, \t and other control characters. It can be used to escape JavaScript code.
import org.json.simple.JSONObject;
String test = JSONObject.escape("your string");

There is now a StringEscapeUtils#escapeJson(String) method in the Apache Commons Text library.
The methods of interest are as follows:
StringEscapeUtils#escapeJson(String)
StringEscapeUtils#unescapeJson(String)
This functionality was initially released as part of Apache Commons Lang version 3.2 but has since been deprecated and moved to Apache Commons Text. So if the method is marked as deprecated in your IDE, you're importing the implementation from the wrong library (both libraries use the same class name: StringEscapeUtils).
The implementation isn't pure Json. As per the Javadoc:
Escapes the characters in a String using Json String rules.
Escapes any values it finds into their Json String form. Deals
correctly with quotes and control-chars (tab, backslash, cr, ff, etc.)
So a tab becomes the characters '\' and 't'.
The only difference between Java strings and Json strings is that in
Json, forward-slash (/) is escaped.
See http://www.ietf.org/rfc/rfc4627.txt for further details.

org.json.JSONObject quote(String data) method does the job
import org.json.JSONObject;
String jsonEncodedString = JSONObject.quote(data);
Extract from the documentation:
Encodes data as a JSON string. This applies quotes and any necessary character escaping. [...] Null will be interpreted as an empty string

StringEscapeUtils.escapeJavaScript / StringEscapeUtils.escapeEcmaScript should do the trick too.

If you are using fasterxml jackson, you can use the following:
com.fasterxml.jackson.core.io.JsonStringEncoder.getInstance().quoteAsString(input)
If you are using codehaus jackson, you can use the following:
org.codehaus.jackson.io.JsonStringEncoder.getInstance().quoteAsString(input)

Not sure what you mean by "creating json manually", but you can use something like gson (http://code.google.com/p/google-gson/), and that would transform your HashMap, Array, String, etc, to a JSON value. I recommend going with a framework for this.

I have not spent the time to make 100% certain, but it worked for my inputs enough to be accepted by online JSON validators:
new org.apache.velocity.tools.generic.EscapeTool().java("input")
although it does not look any better than org.codehaus.jettison.json.JSONObject.quote("your string")
I simply use velocity tools in my project already - my "manual JSON" building was within a velocity template

For those who came here looking for a command-line solution, like me, cURL's --data-urlencode works fine:
curl -G -v -s --data-urlencode 'query={"type" : "/music/artist"}' 'https://www.googleapis.com/freebase/v1/mqlread'
sends
GET /freebase/v1/mqlread?query=%7B%22type%22%20%3A%20%22%2Fmusic%2Fartist%22%7D HTTP/1.1
, for example. Larger JSON data can be put in a file, and you'd use the @ syntax to specify the file from which to read the data to be escaped. For example, if
$ cat 1.json 
{
  "type": "/music/artist",
  "name": "The Police",
  "album": []
}
you'd use
curl -G -v -s --data-urlencode query@1.json 'https://www.googleapis.com/freebase/v1/mqlread'
And now, this is also a tutorial on how to query Freebase from the command line :-)
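Note that this is URL percent-encoding for sending JSON in a query string, not JSON string escaping. For comparison, a minimal Java sketch of the same idea using java.net.URLEncoder follows; the class and method names are made up for the example. One difference from the curl output above is that URLEncoder produces form encoding, so a space becomes + instead of %20.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class UrlEncodeSketch {
    /** Percent-encodes a value for use in a URL query string (form encoding). */
    static String encodeQueryValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

For example, encodeQueryValue("{\"type\" : \"/music/artist\"}") yields %7B%22type%22+%3A+%22%2Fmusic%2Fartist%22%7D.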

Use the StringEscapeUtils class in the Commons Lang API.
StringEscapeUtils.escapeJavaScript("Your JSON string");

Consider Moshi's JsonWriter class. It has a wonderful API and it reduces copying to a minimum; everything can be nicely streamed to a file, OutputStream, etc.
OutputStream os = ...;
JsonWriter json = new JsonWriter(Okio.buffer(Okio.sink(os)));
json.beginObject();
json.name("id").value(getId());
json.name("scores");
json.beginArray();
for (Double score : getScores()) {
json.value(score);
}
json.endArray();
json.endObject();
If you want the string in hand:
Buffer b = new Buffer(); // okio.Buffer
JsonWriter writer = new JsonWriter(b);
//...
String jsonString = b.readUtf8();

If you need to escape JSON inside a JSON string, org.json.JSONObject.quote("your json string that needs to be escaped") seems to work well.

Apache commons-text now has a
StringEscapeUtils.escapeJson(String).

Using the \uXXXX syntax can solve this problem. Search for "UTF-16" together with the name of the character to find XXXX, for example: utf-16 double quote.

The methods here that show the actual implementation are all faulty.
I don't have Java code, but just for the record, you could easily convert this C#-code:
Courtesy of the Mono project:
https://github.com/mono/mono/blob/master/mcs/class/System.Web/System.Web/HttpUtility.cs
public static string JavaScriptStringEncode(string value, bool addDoubleQuotes)
{
if (string.IsNullOrEmpty(value))
return addDoubleQuotes ? "\"\"" : string.Empty;
int len = value.Length;
bool needEncode = false;
char c;
for (int i = 0; i < len; i++)
{
c = value[i];
if (c >= 0 && c <= 31 || c == 34 || c == 39 || c == 60 || c == 62 || c == 92)
{
needEncode = true;
break;
}
}
if (!needEncode)
return addDoubleQuotes ? "\"" + value + "\"" : value;
var sb = new System.Text.StringBuilder();
if (addDoubleQuotes)
sb.Append('"');
for (int i = 0; i < len; i++)
{
c = value[i];
if (c >= 0 && c <= 7 || c == 11 || c >= 14 && c <= 31 || c == 39 || c == 60 || c == 62)
sb.AppendFormat("\\u{0:x4}", (int)c);
else switch ((int)c)
{
case 8:
sb.Append("\\b");
break;
case 9:
sb.Append("\\t");
break;
case 10:
sb.Append("\\n");
break;
case 12:
sb.Append("\\f");
break;
case 13:
sb.Append("\\r");
break;
case 34:
sb.Append("\\\"");
break;
case 92:
sb.Append("\\\\");
break;
default:
sb.Append(c);
break;
}
}
if (addDoubleQuotes)
sb.Append('"');
return sb.ToString();
}
This can be compacted into
// https://github.com/mono/mono/blob/master/mcs/class/System.Json/System.Json/JsonValue.cs
public class SimpleJSON
{
private static bool NeedEscape(string src, int i)
{
char c = src[i];
return c < 32 || c == '"' || c == '\\'
// Broken lead surrogate
|| (c >= '\uD800' && c <= '\uDBFF' &&
(i == src.Length - 1 || src[i + 1] < '\uDC00' || src[i + 1] > '\uDFFF'))
// Broken tail surrogate
|| (c >= '\uDC00' && c <= '\uDFFF' &&
(i == 0 || src[i - 1] < '\uD800' || src[i - 1] > '\uDBFF'))
// To produce valid JavaScript
|| c == '\u2028' || c == '\u2029'
// Escape "</" for <script> tags
|| (c == '/' && i > 0 && src[i - 1] == '<');
}
public static string EscapeString(string src)
{
System.Text.StringBuilder sb = new System.Text.StringBuilder();
int start = 0;
for (int i = 0; i < src.Length; i++)
if (NeedEscape(src, i))
{
sb.Append(src, start, i - start);
switch (src[i])
{
case '\b': sb.Append("\\b"); break;
case '\f': sb.Append("\\f"); break;
case '\n': sb.Append("\\n"); break;
case '\r': sb.Append("\\r"); break;
case '\t': sb.Append("\\t"); break;
case '\"': sb.Append("\\\""); break;
case '\\': sb.Append("\\\\"); break;
case '/': sb.Append("\\/"); break;
default:
sb.Append("\\u");
sb.Append(((int)src[i]).ToString("x04"));
break;
}
start = i + 1;
}
sb.Append(src, start, src.Length - start);
return sb.ToString();
}
}

I think the best answer in 2017 is to use the javax.json APIs. Use javax.json.JsonBuilderFactory to create your json objects, then write the objects out using javax.json.JsonWriterFactory. Very nice builder/writer combination.

Related

parser with integer literals

I am looking for an easy and efficient way to implement a set of numbers in a lexical parser in java. For example my input code is as follows :
"6+9" ,
the output would have to be a little like this :
Number : 6
Sign : +
Number: 9
The issue I have is that I have no way to recognize the number other than to implement it as follows:
static char INTVALUE = ('0') ;
which means I would have to manually enter each digit from 0 to 9, and I don't know if such a method would even allow a number such as 85 in my input.
This is for a homework assignment by the way
Thanks .
For the simplest grammars you can indeed use regular expressions:
import java.util.regex.*;
// ...
String expression = "(10+9)*2";
Pattern pattern = Pattern.compile("\\s*(\\d+|\\D)\\s*");
Matcher matcher = pattern.matcher(expression);
while (matcher.find()) {
String token = matcher.group(1);
System.out.printf("%s: '%s'%n",
token.matches("\\d+") ? "Number" : "Symbol",
token);
}
In a compiler construction course you will probably be expected to construct an NFA and then transform that into a minimal DFA by implementing an algorithm like this one. In real life you would normally use a tool like ANTLR or JLex.
Why not use regular expressions for this? They sound like the best fit for what you are attempting to do.
They are fairly simple to learn. Look at character classes (\d) and quantifiers (+ ?) in this cheatsheet.
To check for integers and doubles, use the following:
aStr.matches("-?\\d+(\\.\\d+)?");
For just integers:
aStr.matches("-?\\d+");
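A small sketch wrapping these two checks in helper methods may make their behaviour easier to see; the class and method names are invented for the example. String.matches() only succeeds when the whole string matches the pattern, so no ^ or $ anchors are needed.

```java
final class NumberCheckSketch {
    /** True for an optional minus sign followed by one or more digits. */
    static boolean isInteger(String s) {
        return s.matches("-?\\d+");
    }

    /** True for an integer, optionally followed by a dot and more digits. */
    static boolean isIntegerOrDecimal(String s) {
        return s.matches("-?\\d+(\\.\\d+)?");
    }
}
```

With these, isInteger("85") is true, isInteger("8.5") is false, and isIntegerOrDecimal("-8.5") is true.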
You can also do something simple like this:
public List<Token> lex(String s) {
List<Token> result = new ArrayList<Token>();
int pos = 0;
int len = s.length();
while (pos < len) {
switch (s.charAt(pos)) {
case '0':
case '1':
case '2':
case '3':
case '4':
case '5':
case '6':
case '7':
case '8':
case '9':
{
int end = pos;
do {
++end;
} while (end < len && s.charAt(end) >= '0' && s.charAt(end) <= '9');
result.add(new Number(s.substring(pos, end)));
pos = end;
break;
}
case '+':
{
result.add(new Operator("+"));
++pos;
break;
}
// ...
}
}
return result;
}
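The snippet above leaves out the token types and the remaining cases. A self-contained variant might look like the sketch below. The Token class, its "Number"/"Sign" kinds, and the set of accepted operator characters are assumptions made for the example, not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;

final class LexerSketch {
    /** Hypothetical token type: a kind ("Number" or "Sign") plus its text. */
    static final class Token {
        final String kind;
        final String text;
        Token(String kind, String text) { this.kind = kind; this.text = text; }
        @Override public String toString() { return kind + " : " + text; }
    }

    static List<Token> lex(String s) {
        List<Token> result = new ArrayList<>();
        int pos = 0;
        while (pos < s.length()) {
            char c = s.charAt(pos);
            if (Character.isWhitespace(c)) {
                pos++; // whitespace separates tokens but produces none
            } else if (c >= '0' && c <= '9') {
                int end = pos;
                // Consume the whole run of digits, so "85" is a single token.
                while (end < s.length() && s.charAt(end) >= '0' && s.charAt(end) <= '9') {
                    end++;
                }
                result.add(new Token("Number", s.substring(pos, end)));
                pos = end;
            } else if ("+-*/()".indexOf(c) >= 0) {
                result.add(new Token("Sign", String.valueOf(c)));
                pos++;
            } else {
                // Failing loudly avoids looping forever on unknown input.
                throw new IllegalArgumentException("Unexpected character '" + c + "' at " + pos);
            }
        }
        return result;
    }
}
```

Calling LexerSketch.lex("6+9") gives three tokens that print as Number : 6, Sign : + and Number : 9.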

Convert a html string into a java "out.println"-statement

How can i convert a html string into a java "out.println" statement (with java)?
e.g.
<h1>Hello world</h1><p style="background-color:red">hello</p>
into
out.println("<h1>Hello world</h1>");
out.println("<p style=\"background-color:red\">hello</p>");
If you are trying to ease editing, there is an option for this in Eclipse. Look in: Window -> Preferences -> Java -> Editor -> Typing -> In string literals -> Escape text when pasting into a string literal.
If you are trying to do this programmatically, this will suffice:
public static String escapeForQuotes(String s) {
return escapeForQuotes(s, '\uFFFF');
}
public static String escapeForQuotes(String s, char ignore) {
int len = s.length();
StringBuilder sb = new StringBuilder(len * 6 / 5);
for (int i = 0; i < len; i++) {
char c = s.charAt(i);
if (c == ignore) { sb.append(c); continue; }
switch (c) {
case '\\': case '\"': case '\'': break;
case '\n': c = 'n'; break;
case '\r': c = 'r'; break;
case '\0': c = '0'; break;
default: sb.append(c); continue;
}
sb.append('\\').append(c);
}
return sb.toString();
}
The function returns its input with backslashes inserted before quotes, backslashes, line breaks and nulls. The optional 'ignore' parameter allows you to specify one character that need not be escaped. E.g., ' could be but need not be escaped in a "-quoted string, and vice-versa.
E.g.,
System.out.println("System.out.println(\"" + escapeForQuotes(html, '\'') + "\");");
Try below mentioned links
JSoup
Html Parser Tutorial
Regex replace. I take it you want to do the replace in your IDE. (Doing it in Java would need backslashes escaped, typically like "\w" becoming "\\w".)
Replace
<(\w+)[^>]*>([.\r\n]*)</\1>
by
out.println("$2");\r\n
The \1 matching the first group (the tag name).

converting a string of declared variables (e.g. "x+y") to a double as a mathematical function

What I'm trying to do is read a line (string) and use it as a mathematical function to get (double) values or answers to it at different points (like a calculator basically)
I included a very simplistic code of what I'm trying to do just for the sake of being direct and straight forward:
double x, y, z;
String function;
x = 5;
y = 4;
function = "(x*y)+y";
z = Double.parseDouble(function);
/*
I want z to equal this
z = (x*y)+y;
*/
System.out.print("z= " + z);
Again, this is only a sample code to be clearer about my question. My question again is: how can I set z = function when z is a double and function is a string?
NOTE: I tried parse as you can see, but it didn't work. I also tried to read the string character by character, but it didn't work either because it added the value of the characters together.
I guess you are looking for a lexer and a parser.
These are basic components of every compiler or interpreter:
the lexer splits the input (your string) into tokens
the parser builds a tree that represents the syntactic structure of your tokens, to be interpreted semantically afterwards
This field is quite broad, and I suggest you start with something like ANTLR for Java; it is a parser generator that will generate both a lexer and a parser according to rules you specify through a grammar. There are many others; this is just the first that came to mind.
If you want to forget about all this theory just embed something like JavaScript or Groovy in your Java program, they are able to interpret code that is given at runtime so that you can just go that way.
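For a grammar this small, a parser can also be written by hand. The following is a minimal recursive-descent sketch that evaluates +, -, *, / with parentheses, decimal numbers and named variables. The class name ExprSketch and passing the variables in a Map are assumptions for the example; unary minus and functions are deliberately left out.

```java
import java.util.Map;

final class ExprSketch {
    private final String src;
    private final Map<String, Double> vars;
    private int pos;

    private ExprSketch(String src, Map<String, Double> vars) {
        this.src = src.replaceAll("\\s+", ""); // ignore whitespace
        this.vars = vars;
    }

    static double eval(String expression, Map<String, Double> vars) {
        ExprSketch p = new ExprSketch(expression, vars);
        double value = p.expr();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("Unexpected input at " + p.pos);
        }
        return value;
    }

    // expr := term (('+' | '-') term)*
    private double expr() {
        double v = term();
        while (pos < src.length() && (src.charAt(pos) == '+' || src.charAt(pos) == '-')) {
            char op = src.charAt(pos++);
            double r = term();
            v = (op == '+') ? v + r : v - r;
        }
        return v;
    }

    // term := factor (('*' | '/') factor)*
    private double term() {
        double v = factor();
        while (pos < src.length() && (src.charAt(pos) == '*' || src.charAt(pos) == '/')) {
            char op = src.charAt(pos++);
            double r = factor();
            v = (op == '*') ? v * r : v / r;
        }
        return v;
    }

    // factor := number | variable | '(' expr ')'
    private double factor() {
        if (pos >= src.length()) throw new IllegalArgumentException("Unexpected end of input");
        char c = src.charAt(pos);
        if (c == '(') {
            pos++;
            double v = expr();
            if (pos >= src.length() || src.charAt(pos) != ')') {
                throw new IllegalArgumentException("Expected ')' at " + pos);
            }
            pos++;
            return v;
        }
        int start = pos;
        if (Character.isLetter(c)) {
            while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) pos++;
            Double v = vars.get(src.substring(start, pos));
            if (v == null) throw new IllegalArgumentException("Unknown variable at " + start);
            return v;
        }
        while (pos < src.length() && (Character.isDigit(src.charAt(pos)) || src.charAt(pos) == '.')) pos++;
        if (start == pos) throw new IllegalArgumentException("Unexpected character at " + pos);
        return Double.parseDouble(src.substring(start, pos));
    }
}
```

With x = 5 and y = 4 in the map, ExprSketch.eval("(x*y)+y", vars) returns 24.0. Each grammar rule maps to one method, which is what gives * and / higher precedence than + and -.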
Java does not have something like eval builtin. But you can use an expression language like spEL, mvel or Jexl for this.
Maybe this SO question can help you.
I suggest you have a look at Parboiled. Unlike nearly all other parser solutions for Java, you write your grammars... In Java.
What is more, among the Java examples, there are working calculators.
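// Evaluates a postfix (reverse Polish) expression of single-digit operands, e.g. "54*4+".
// It does not handle infix notation, parentheses, variables or multi-digit numbers.
static float eval(String exp)
{
    float[] stack = new float[exp.length()];
    int top = 0; // index of the next free slot on the stack
    for (char c : exp.toCharArray())
    {
        if (c >= '0' && c <= '9') // operand: push its numeric value
        {
            stack[top++] = c - '0';
        }
        else if (c == '+' || c == '-' || c == '*' || c == '/') // operator
        {
            float right = stack[--top]; // pop the two most recent values
            float left = stack[--top];
            float result;
            switch (c)
            {
                case '+': result = left + right; break;
                case '-': result = left - right; break;
                case '*': result = left * right; break;
                default: result = left / right; break;
            }
            stack[top++] = result; // push the result back
        }
    }
    return stack[top - 1]; // the final value is on top of the stack
}
Use a method like this. It is implemented using a stack data structure and evaluates postfix input, so an infix string such as "(x*y)+y" has to be converted to postfix form (with the variables replaced by their values) first.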

Java String Special character replacement

I have a string which contains alphanumeric and special characters.
I need to replace each and every special character with some string.
For example,
Input string = "ja*va st&ri%n@&"
Expected o/p = "jaasteriskvaspacestandripercentagenatand"
* = "asterisk"
& = "and"
% = "percentage"
@ = "at"
thanks,
Unless you're absolutely desperate for performance, I'd use a very simple approach:
String result = input.replace("*", "asterisk")
    .replace("%", "percentage")
    .replace("@", "at"); // Add more to taste :)
(Note that there's a big difference between replace and replaceAll - the latter takes a regular expression. It's easy to get the wrong one and see radically different effects!)
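A short sketch of that difference, with made-up class and method names: the same "." argument is literal text for replace() but a regular expression for replaceAll(), where it matches any character. A lone "*" is not even a valid regular expression, so it has to be quoted (here with Pattern.quote) before replaceAll() will accept it.

```java
import java.util.regex.Pattern;

final class ReplaceSketch {
    // replace() treats its first argument as literal text.
    static String literal(String s) {
        return s.replace(".", "dot");
    }

    // replaceAll() treats it as a regex, where "." matches any character.
    static String regex(String s) {
        return s.replaceAll(".", "dot");
    }

    // Pattern.quote() turns arbitrary text into a regex that matches it literally.
    static String quoted(String s) {
        return s.replaceAll(Pattern.quote("*"), "asterisk");
    }
}
```

Here literal("a.b") gives "adotb", while regex("a.b") gives "dotdotdot".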
An alternative would be something like:
public static String replaceSpecial(String input)
{
    // Output will be at least as long as input
    StringBuilder builder = new StringBuilder(input.length());
    for (int i = 0; i < input.length(); i++)
    {
        char c = input.charAt(i);
        switch (c)
        {
            case '*': builder.append("asterisk"); break;
            case '%': builder.append("percentage"); break;
            case '@': builder.append("at"); break;
            default: builder.append(c); break;
        }
    }
    return builder.toString();
}
Take a look at the following java.lang.String methods:
replace()
replaceAll()

Best way to encode text data for XML in Java?

Very similar to this question, except for Java.
What is the recommended way of encoding strings for an XML output in Java. The strings might contain characters like "&", "<", etc.
As others have mentioned, using an XML library is the easiest way. If you do want to escape yourself, you could look into StringEscapeUtils from the Apache Commons Lang library.
Very simply: use an XML library. That way it will actually be right instead of requiring detailed knowledge of bits of the XML spec.
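As one possible illustration of letting a library do the escaping, here is a minimal sketch using the StAX XMLStreamWriter that ships with the JDK. The class name, the method name and the "label" attribute are invented for the example, and the exact output may vary slightly between StAX implementations.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class StaxEscapeSketch {
    /** Writes one element with an attribute and text; the writer does the escaping. */
    static String element(String name, String label, String text) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartElement(name);
            w.writeAttribute("label", label); // attribute value is escaped by the writer
            w.writeCharacters(text);          // text content is escaped by the writer
            w.writeEndElement();
            w.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Passing the text "A & B < C" produces element content in which & and < appear as the corresponding entities, without any escaping code in the caller.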
Just use.
<![CDATA[ your text here ]]>
This will allow any characters except the ending
]]>
So you can include characters that would be illegal such as & and >. For example.
<element><![CDATA[ characters such as & and > are allowed ]]></element>
However, attributes will need to be escaped as CDATA blocks can not be used for them.
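If the text might itself contain the terminator, a common workaround is to split it across two CDATA sections. A minimal sketch with a hypothetical helper name follows; it does not deal with characters that are illegal in XML altogether, such as most control characters.

```java
final class CdataSketch {
    /** Wraps text in a CDATA section, splitting any "]]>" so it cannot end the section early. */
    static String wrap(String text) {
        // "]]>" becomes "]]" + end of section + start of new section + ">".
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }
}
```

For example, wrap("a ]]> b") returns <![CDATA[a ]]]]><![CDATA[> b]]>, which a parser reads back as the original text.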
This question is eight years old and still not a fully correct answer! No, you should not have to import an entire third party API to do this simple task. Bad advice.
The following method will:
correctly handle characters outside the basic multilingual plane
escape characters required in XML
escape any non-ASCII characters, which is optional but common
replace illegal characters in XML 1.0 with the Unicode substitution character. There is no best option here - removing them is just as valid.
I've tried to optimise for the most common case, while still ensuring you could pipe /dev/random through this and get a valid string in XML.
public static String encodeXML(CharSequence s) {
    StringBuilder sb = new StringBuilder();
    int len = s.length();
    for (int i = 0; i < len; i++) {
        int c = s.charAt(i);
        if (c >= 0xd800 && c <= 0xdbff && i + 1 < len) {
            c = ((c - 0xd7c0) << 10) | (s.charAt(++i) & 0x3ff); // UTF16 decode
        }
        if (c < 0x80) { // ASCII range: test most common case first
            if (c < 0x20 && (c != '\t' && c != '\r' && c != '\n')) {
                // Illegal XML character, even encoded. Skip or substitute
                sb.append("&#xfffd;"); // Unicode replacement character
            } else {
                switch (c) {
                    case '&': sb.append("&amp;"); break;
                    case '>': sb.append("&gt;"); break;
                    case '<': sb.append("&lt;"); break;
                    // Uncomment next two if encoding for an XML attribute
                    // case '\'': sb.append("&apos;"); break;
                    // case '\"': sb.append("&quot;"); break;
                    // Uncomment next three if you prefer, but not required
                    // case '\n': sb.append("&#10;"); break;
                    // case '\r': sb.append("&#13;"); break;
                    // case '\t': sb.append("&#9;"); break;
                    default: sb.append((char) c);
                }
            }
        } else if ((c >= 0xd800 && c <= 0xdfff) || c == 0xfffe || c == 0xffff) {
            // Illegal XML character, even encoded. Skip or substitute
            sb.append("&#xfffd;"); // Unicode replacement character
        } else {
            sb.append("&#x");
            sb.append(Integer.toHexString(c));
            sb.append(';');
        }
    }
    return sb.toString();
}
Edit: for those who continue to insist it foolish to write your own code for this when there are perfectly good Java APIs to deal with XML, you might like to know that the StAX API included with Oracle Java 8 (I haven't tested others) fails to encode CDATA content correctly: it doesn't escape ]]> sequences in the content. A third party library, even one that's part of the Java core, is not always the best option.
This has worked well for me to provide an escaped version of a text string:
public class XMLHelper {
/**
* Returns the string where all non-ascii and <, &, > are encoded as numeric entities. I.e. "<A & B >"
* .... (insert result here). The result is safe to include anywhere in a text field in an XML-string. If there was
* no characters to protect, the original string is returned.
*
* @param originalUnprotectedString
* original string which may contain characters either reserved in XML or with different representation
* in different encodings (like 8859-1 and UTF-8)
* @return
*/
public static String protectSpecialCharacters(String originalUnprotectedString) {
if (originalUnprotectedString == null) {
return null;
}
boolean anyCharactersProtected = false;
StringBuffer stringBuffer = new StringBuffer();
for (int i = 0; i < originalUnprotectedString.length(); i++) {
char ch = originalUnprotectedString.charAt(i);
boolean controlCharacter = ch < 32;
boolean unicodeButNotAscii = ch > 126;
boolean characterWithSpecialMeaningInXML = ch == '<' || ch == '&' || ch == '>';
if (characterWithSpecialMeaningInXML || unicodeButNotAscii || controlCharacter) {
stringBuffer.append("&#" + (int) ch + ";");
anyCharactersProtected = true;
} else {
stringBuffer.append(ch);
}
}
if (anyCharactersProtected == false) {
return originalUnprotectedString;
}
return stringBuffer.toString();
}
}
Try this:
String xmlEscapeText(String t) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < t.length(); i++) {
        char c = t.charAt(i);
        switch (c) {
            case '<': sb.append("&lt;"); break;
            case '>': sb.append("&gt;"); break;
            case '\"': sb.append("&quot;"); break;
            case '&': sb.append("&amp;"); break;
            case '\'': sb.append("&apos;"); break;
            default:
                if (c > 0x7e) {
                    sb.append("&#" + ((int) c) + ";");
                } else {
                    sb.append(c);
                }
        }
    }
    return sb.toString();
}
StringEscapeUtils.escapeXml() does not escape control characters (< 0x20). XML 1.1 allows control characters; XML 1.0 does not. For example, XStream.toXML() will happily serialize a Java object's control characters into XML, which an XML 1.0 parser will reject.
To escape control characters with Apache commons-lang, use
NumericEntityEscaper.below(0x20).translate(StringEscapeUtils.escapeXml(str))
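For reference, the XML 1.0 specification defines the legal characters as #x9, #xA, #xD, #x20 to #xD7FF, #xE000 to #xFFFD and #x10000 to #x10FFFF. A minimal sketch of a check based on that list is shown below; the class and method names are hypothetical, and dropping the illegal characters (instead of substituting them) is just one possible policy.

```java
final class Xml10CharsSketch {
    /** True if the code point matches the Char production of XML 1.0. */
    static boolean isLegal(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Removes every code point that an XML 1.0 parser would reject. */
    static String stripIllegal(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        // codePoints() yields unpaired surrogates as-is, so they are filtered out too.
        s.codePoints().filter(Xml10CharsSketch::isLegal).forEach(sb::appendCodePoint);
        return sb.toString();
    }
}
```

Running text through stripIllegal before the usual entity escaping keeps a stray control character from producing a document that an XML 1.0 parser refuses to read.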
public String escapeXml(String s) {
    return s.replaceAll("&", "&amp;").replaceAll(">", "&gt;").replaceAll("<", "&lt;").replaceAll("\"", "&quot;").replaceAll("'", "&apos;");
}
For those looking for the quickest-to-write solution: use methods from apache commons-lang:
StringEscapeUtils.escapeXml10() for xml 1.0
StringEscapeUtils.escapeXml11() for xml 1.1
StringEscapeUtils.escapeXml() is now deprecated, but was used commonly in the past
Remember to include dependency:
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.5</version> <!--check current version! -->
</dependency>
While idealism says use an XML library, IMHO if you have a basic idea of XML then common sense and performance says template it all the way. It's arguably more readable too. Though using the escaping routines of a library is probably a good idea.
Consider this: XML was meant to be written by humans.
Use libraries for generating XML when having your XML as an "object" better models your problem. For example, if pluggable modules participate in the process of building this XML.
Edit: as for how to actually escape XML in templates, use of CDATA or escapeXml(string) from JSTL are two good solutions, escapeXml(string) can be used like this:
<%@taglib prefix="fn" uri="http://java.sun.com/jsp/jstl/functions"%>
<item>${fn:escapeXml(value)}</item>
The behavior of StringEscapeUtils.escapeXml() has changed from Commons Lang 2.5 to 3.0.
It now no longer escapes Unicode characters greater than 0x7f.
This is a good thing; the old method was a bit too eager to escape entities that could just be inserted into a UTF-8 document.
The new escapers to be included in Google Guava 11.0 also seem promising:
http://code.google.com/p/guava-libraries/issues/detail?id=799
While I agree with Jon Skeet in principle, sometimes I don't have the option to use an external XML library. And I find it peculiar the two functions to escape/unescape a simple value (attribute or tag, not full document) are not available in the standard XML libraries included with Java.
As a result and based on the different answers I have seen posted here and elsewhere, here is the solution I've ended up creating (nothing worked as a simple copy/paste):
public final static String ESCAPE_CHARS = "<>&\"\'";
public final static List<String> ESCAPE_STRINGS = Collections.unmodifiableList(Arrays.asList(new String[] {
"&lt;"
, "&gt;"
, "&amp;"
, "&quot;"
, "&apos;"
}));
private static String UNICODE_NULL = "" + ((char)0x00); //null
private static String UNICODE_LOW = "" + ((char)0x20); //space
private static String UNICODE_HIGH = "" + ((char)0x7f);
//should only be used for the content of an attribute or tag
public static String toEscaped(String content) {
String result = content;
if ((content != null) && (content.length() > 0)) {
boolean modified = false;
StringBuilder stringBuilder = new StringBuilder(content.length());
for (int i = 0, count = content.length(); i < count; ++i) {
String character = content.substring(i, i + 1);
int pos = ESCAPE_CHARS.indexOf(character);
if (pos > -1) {
stringBuilder.append(ESCAPE_STRINGS.get(pos));
modified = true;
}
else {
if ( (character.compareTo(UNICODE_LOW) > -1)
&& (character.compareTo(UNICODE_HIGH) < 1)
) {
stringBuilder.append(character);
}
else {
//Per URL reference below, Unicode null character is always restricted from XML
//URL: https://en.wikipedia.org/wiki/Valid_characters_in_XML
if (character.compareTo(UNICODE_NULL) != 0) {
stringBuilder.append("&#" + ((int)character.charAt(0)) + ";");
}
modified = true;
}
}
}
if (modified) {
result = stringBuilder.toString();
}
}
return result;
}
The above accommodates several different things:
avoids using char based logic until it absolutely has to - improves unicode compatibility
attempts to be as efficient as possible given the probability is the second "if" condition is likely the most used pathway
is a pure function; i.e. is thread-safe
optimizes nicely with the garbage collector by only returning the contents of the StringBuilder if something actually changed - otherwise, the original string is returned
At some point, I will write the inversion of this function, toUnescaped(). I just don't have time to do that today. When I do, I will come update this answer with the code. :)
Note: Your question is about escaping, not encoding. Escaping is using &lt;, etc. to allow the parser to distinguish between "this is an XML command" and "this is some text". Encoding is the stuff you specify in the XML header (UTF-8, ISO-8859-1, etc).
First of all, like everyone else said, use an XML library. XML looks simple but the encoding+escaping stuff is dark voodoo (which you'll notice as soon as you encounter umlauts and Japanese and other weird stuff like "full width digits" (&#xFF11; is the full-width digit １)). Keeping XML human readable is a Sisyphean task.
I suggest never to try to be clever about text encoding and escaping in XML. But don't let that stop you from trying; just remember when it bites you (and it will).
That said, if you use only UTF-8, to make things more readable you can consider this strategy:
If the text does contain '<', '>' or '&', wrap it in <![CDATA[ ... ]]>
If the text doesn't contain these three characters, don't wrap it.
I'm using this in an SQL editor and it allows the developers to cut&paste SQL from a third party SQL tool into the XML without worrying about escaping. This works because the SQL can't contain umlauts in our case, so I'm safe.
If you are looking for a library to get the job done, try:
Guava 26.0 documented here
return XmlEscapers.xmlContentEscaper().escape(text);
Note: There is also an xmlAttributeEscaper()
Apache Commons Text 1.4 documented here
StringEscapeUtils.escapeXml11(text)
Note: There is also an escapeXml10() method
To escape XML characters, the easiest way is to use the Apache Commons Lang project, JAR downloadable from: http://commons.apache.org/lang/
The class is this: org.apache.commons.lang3.StringEscapeUtils;
It has a method named "escapeXml", that will return an appropriately escaped String.
You could use the Enterprise Security API (ESAPI) library, which provides methods like encodeForXML and encodeForXMLAttribute. Take a look at the documentation of the Encoder interface; it also contains examples of how to create an instance of DefaultEncoder.
Use JAXP and forget about text handling it will be done for you automatically.
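A minimal sketch of what that can look like with the JAXP classes in the JDK: build a DOM node, set its text, and let a Transformer serialize it. The class name, the method name and the "item" element are assumptions for the example.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class JaxpEscapeSketch {
    /** Serializes <item>text</item>; the serializer escapes the text content. */
    static String itemXml(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element item = doc.createElement("item");
            item.setTextContent(text); // raw text, no manual escaping
            doc.appendChild(item);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The caller passes plain text such as "A & B < C" and gets back markup in which the special characters are already written as entities.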
Here's an easy solution and it's great for encoding accented characters too!
String in = "Hi Lârry & Môe!";
StringBuilder out = new StringBuilder();
for(int i = 0; i < in.length(); i++) {
char c = in.charAt(i);
if(c < 31 || c > 126 || "<>\"'\\&".indexOf(c) >= 0) {
out.append("&#" + (int) c + ";");
} else {
out.append(c);
}
}
System.out.printf("%s%n", out);
Outputs
Hi L&#226;rry &#38; M&#244;e!
Try to encode the XML using Apache XML serializer
//Serialize DOM
OutputFormat format = new OutputFormat (doc);
// as a String
StringWriter stringOut = new StringWriter ();
XMLSerializer serial = new XMLSerializer (stringOut,
format);
serial.serialize(doc);
// Display the XML
System.out.println(stringOut.toString());
Just replace
& with &amp;
And for other characters:
> with &gt;
< with &lt;
\" with &quot;
' with &apos;
Here's what I found after searching everywhere looking for a solution:
Get the Jsoup library:
<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.12.1</version>
</dependency>
Then:
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.nodes.Entities
import org.jsoup.parser.Parser
String xml = '''<?xml version = "1.0"?>
<SOAP-ENV:Envelope
xmlns:SOAP-ENV = "http://www.w3.org/2001/12/soap-envelope"
SOAP-ENV:encodingStyle = "http://www.w3.org/2001/12/soap-encoding">
<SOAP-ENV:Body xmlns:m = "http://www.example.org/quotations">
<m:GetQuotation>
<m:QuotationsName> MiscroSoft#G>>gle.com </m:QuotationsName>
</m:GetQuotation>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>'''
Document doc = Jsoup.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")), "UTF-8", "", Parser.xmlParser())
doc.outputSettings().charset("UTF-8")
doc.outputSettings().escapeMode(Entities.EscapeMode.base)
println doc.toString()
Hope this helps someone
I have created my own wrapper for this; I hope it helps. You can modify it depending on your requirements.
