I have a Java string
String t = "Region S\u00FCdost SER";
where \u00FC is a replacement for the unicode character "ü"
If i add a new escape char to the above string, i would still want my below function to escape other chars excluding the current .
For example, the below function on re running would return the result as "Region S\\u00FCdost SER" and "Region S\\\\u00FCdost SER" on subsequent iterations.
How do we prevent this?
public static String escapeString(String str)
{
StringBuffer result = new StringBuffer();
// char is 16 bits long and can hold an UTF-16 code
// i iterate on chars and not on code points
// i guess this will be enough until we need to support surrogate pairs
for (int i = 0; i < str.length(); i++)
{
char c = str.charAt(i);
switch (c) {
case '"':
result.append("\\\""); //$NON-NLS-1$
break;
case '\b':
result.append("\\b"); //$NON-NLS-1$
break;
case '\t':
result.append("\\t"); //$NON-NLS-1$
break;
case '\n':
result.append("\\n"); //$NON-NLS-1$
break;
case '\f':
result.append("\\f"); //$NON-NLS-1$
break;
case '\r':
result.append("\\r"); //$NON-NLS-1$
break;
case '\'':
result.append("\\'"); //$NON-NLS-1$
break;
case '\\':
result.append("\\\\"); //$NON-NLS-1$
break;
default:
if (c < 128)
{
//is ascii
result.append(c);
}
else
{
result.append(
String.format("\\u%04X", (int) c)); //$NON-NLS-1$
}
}
}
return result.toString();
}
}
You can do:
case '\\':
if(str.charAt(i+1)!='u')
result.append("\\\\");
else
result.append("\\");
break;
Assuming that \u will always denote a unicode character sequence in your string.
When you write a Java string literal as "Region S\u00FCdost SER", the Java compiler will interpret that as the string value Region Südost SER, which is what the escape() method will see when called on t.
If you wanted the string Region S\u00FCdost SER, you should have escaped the \ , i.e. "Region S\\u00FCdost SER".
If you keep running the escape() method, I believe you'll see what you want.
String s = "Region S\u00FCdost SER";
System.out.println(s); // print original text
for (int i = 0; i < 4; i++) {
s = escapeString(s);
System.out.println(s);
}
Output:
Region Südost SER <-- original text
Region S\u00FCdost SER
Region S\\u00FCdost SER
Region S\\\\u00FCdost SER
Region S\\\\\\\\u00FCdost SER
If you change input to "He'd say: \"Bitte schön\"", you get:
He'd say: "Bitte schön" <-- original text
He\'d say: \"Bitte sch\u00F6n\"
He\\\'d say: \\\"Bitte sch\\u00F6n\\\"
He\\\\\\\'d say: \\\\\\\"Bitte sch\\\\u00F6n\\\\\\\"
He\\\\\\\\\\\\\\\'d say: \\\\\\\\\\\\\\\"Bitte sch\\\\\\\\u00F6n\\\\\\\\\\\\\\\"
I mean, this is what you wanted, right? If not, please clarify question by actually showing example output of what you want.
I get a set of chars, e.g. as a String containing all of them and need a charclass Pattern matching any of them. For example
for "abcde" I want "[a-e]"
for "[]^-" I want "[-^\\[\\]]"
How can I create a compact solution and how to handle border cases like empty set and set of all chars?
What chars need to be escaped?
Clarification
I want to create a charclass Pattern, i.e. something like "[...]", no repetitions and no such stuff. It must work for any input, that's why I'm interested in the corner cases, too.
Here's a start:
import java.util.*;
public class RegexUtils {
private static String encode(char c) {
switch (c) {
case '[':
case ']':
case '\\':
case '-':
case '^':
return "\\" + c;
default:
return String.valueOf(c);
}
}
public static String createCharClass(char[] chars) {
if (chars.length == 0) {
return "[^\\u0000-\\uFFFF]";
}
StringBuilder builder = new StringBuilder();
boolean includeCaret = false;
boolean includeMinus = false;
List<Character> set = new ArrayList<Character>(new TreeSet<Character>(toCharList(chars)));
if (set.size() == 1<<16) {
return "[\\w\\W]";
}
for (int i = 0; i < set.size(); i++) {
int rangeLength = discoverRange(i, set);
if (rangeLength > 2) {
builder.append(encode(set.get(i))).append('-').append(encode(set.get(i + rangeLength)));
i += rangeLength;
} else {
switch (set.get(i)) {
case '[':
case ']':
case '\\':
builder.append('\\').append(set.get(i));
break;
case '-':
includeMinus = true;
break;
case '^':
includeCaret = true;
break;
default:
builder.append(set.get(i));
break;
}
}
}
builder.append(includeCaret ? "^" : "");
builder.insert(0, includeMinus ? "-" : "");
return "[" + builder + "]";
}
private static List<Character> toCharList(char[] chars) {
List<Character> list = new ArrayList<Character>();
for (char c : chars) {
list.add(c);
}
return list;
}
private static int discoverRange(int index, List<Character> chars) {
int range = 0;
for (int i = index + 1; i < chars.size(); i++) {
if (chars.get(i) - chars.get(i - 1) != 1) break;
range++;
}
return range;
}
public static void main(String[] args) {
System.out.println(createCharClass("daecb".toCharArray()));
System.out.println(createCharClass("[]^-".toCharArray()));
System.out.println(createCharClass("".toCharArray()));
System.out.println(createCharClass("d1a3e5c55543b2000".toCharArray()));
System.out.println(createCharClass("!-./0".toCharArray()));
}
}
As you can see, the input:
"daecb".toCharArray()
"[]^-".toCharArray()
"".toCharArray()
"d1a3e5c55543b2000".toCharArray()
prints:
[a-e]
[-\[\]^]
[^\u0000-\uFFFF]
[0-5a-e]
[!\--0]
The corner cases in a character class are:
\
[
]
which will need a \ to be escaped. The character ^ doesn't need an escape if it's not placed at the start of a character class, and the - does not need to be escaped when it's placed at the start, or end of the character class (hence the boolean flags in my code).
The empty set is [^\u0000-\uFFFF], and the set of all the characters is [\u0000-\uFFFF]. Not sure what you need the former for as it won't match anything. I'd throw an IllegalArgumentException() on an empty string instead.
What chars need to be escaped?
- ^ \ [ ] - that's all of them, I've actually tested it. And unlike some other regex implementations [ is considered a meta character inside a character class, possibly due to the possibility of using inner character classes with operators.
The rest of task sounds easy, but rather tedious. First you need to select unique characters. Then loop through them, appending to a StringBuilder, possibly escaping. If you want character ranges, you need to sort the characters first and select contiguous ranges while looping. If you want the - to be at the beginning of the range with no escaping, then set a flag, but don't append it. After the loop, if the flag is set, prepend - to the result before wrapping it in [].
Match all characters ".*" (zero or more repeitions * of matching any character . .
Match a blank line "^$" (match start of a line ^ and end of a line $. Note the lack of stuff to match in the middle of the line).
Not sure if the last pattern is exactly what you wanted, as there's different interpretations to "match nothing".
A quick, dirty, and almost-not-pseudo-code answer:
StringBuilder sb = new StringBuilder("[");
Set<Character> metaChars = //...appropriate initialization
while (sourceString.length() != 0) {
char c = sourceString.charAt(0);
sb.append(metaChars.contains(c) ? "\\"+c : c);
sourceString.replace(c,'');
}
sb.append("]");
Pattern p = Pattern.compile(sb.toString());
//...can check here for the appropriate sb.length cases
// e.g, 2 = empty, all chars equals the count of whatever set qualifies as all chars, etc
Which gives you the unique string of char's you want to match, with meta-characters replaced. It will not convert things into ranges (which I think is fine - doing so smells like premature optimization to me). You can do some post tests for simple set cases - like matching sb against digits, non-digits, etc, but unless you know that's going to buy you a lot of performance (or the simplification is the point of this program), I wouldn't bother.
If you really want to do ranges, you could instead sourceString.toCharArray(), sort that, iterate deleting repetitions and doing some sort of range check and replacing meta characters as you add the contents to StringBuilder.
EDIT: I actually kind of liked the toCharArray version, so pseudo-coded it out as well:
//...check for empty here, if not...
char[] sourceC = sourceString.toCharArray();
Arrays.sort(sourceC);
lastC = sourceC[0];
StringBuilder sb = new StringBuilder("[");
StringBuilder range = new StringBuilder();
for (int i=1; i<sourceC.length; i++) {
if (lastC == sourceC[i]) continue;
if (//.. next char in sequence..//) //..add to range
else {
// check range size, append accordingly to sb as a single item, range, etc
}
lastC = sourceC[i];
}
My group could either be of the form x/y, x.y or x_y.z. Each group is separated by an underscore. The groups are unordered.
Example:
ABC/DEF_abc.def_PQR/STU_ghi_jkl.mno
I would like to capture the following:
ABC/DEF
abc.def
PQR/STU
ghi_jkl.mno
I have done this using a fairly verbose string iteration and parsing method (shown below), but am wondering if a simple regex can accomplish this.
private static ArrayList<String> go(String s){
ArrayList<String> list = new ArrayList<String>();
boolean inSlash = false;
int pos = 0 ;
boolean inDot = false;
for(int i = 0 ; i < s.length(); i++){
char c = s.charAt(i);
switch (c) {
case '/':
inSlash = true;
break;
case '_':
if(inSlash){
list.add(s.substring(pos,i));
inSlash = false;
pos = i+1 ;
}
else if (inDot){
list.add(s.substring(pos,i));
inDot = false;
pos = i+1;
}
break;
case '.':
inDot = true;
break;
default:
break;
}
}
list.add(s.substring(pos));
System.out.println(list);
return list;
}
Have a try with:
((?:[^_./]+/[^_./]+)|(?:[^_./]+\.[^_./]+)|(?:[^_./]+(?:_[^_./]+)+\.[^_./]+))
I don't know java syntax but in Perl:
#!/usr/bin/perl
use 5.10.1;
use strict;
use warnings;
my $str = q!ABC/DEF_abc.def_PQR/STU_ghi_jkl.mno_a_b_c.z_a_b_c_d.z_a_b_c_d_e.z!;
my $re = qr!((?:[^_./]+/[^_./]+)|(?:[^_./]+\.[^_./]+)|(?:[^_./]+(?:_[^_./]+)+\.[^_./]+))!;
while($str=~/$re/g) {
say $1;
}
will produce:
ABC/DEF
abc.def
PQR/STU
ghi_jkl.mno
a_b_c.z
a_b_c_d.z
a_b_c_d_e.z
There might be a problem with the underscore since it's not always a separator.
Maybe: ((?<=_)\w+_)?\w+[./]\.w+
This regex would probably do (tested with .Net regular expressions):
[a-zA-Z]+[./][a-zA-Z]+|[a-zA-Z]+_[a-zA-Z]+\.[a-zA-Z]+
(If you know your input is well formed there is no need to explicitly match the separator)
This one goes with positive lookahead instead of alternations
[A-Za-z]+(_(?=[A-Za-z]+\.[A-Za-z]+))?[A-Za-z]+[/.][A-Za-z]+
When creating JSON data manually, how should I escape string fields? Should I use something like Apache Commons Lang's StringEscapeUtilities.escapeHtml, StringEscapeUtilities.escapeXml, or should I use java.net.URLEncoder?
The problem is that when I use SEU.escapeHtml, it doesn't escape quotes and when I wrap the whole string in a pair of 's, a malformed JSON will be generated.
Ideally, find a JSON library in your language that you can feed some appropriate data structure to, and let it worry about how to escape things. It'll keep you much saner. If for whatever reason you don't have a library in your language, you don't want to use one (I wouldn't suggest this¹), or you're writing a JSON library, read on.
Escape it according to the RFC. JSON is pretty liberal: The only characters you must escape are \, ", and control codes (anything less than U+0020).
This structure of escaping is specific to JSON. You'll need a JSON specific function. All of the escapes can be written as \uXXXX where XXXX is the UTF-16 code unit¹ for that character. There are a few shortcuts, such as \\, which work as well. (And they result in a smaller and clearer output.)
For full details, see the RFC.
¹JSON's escaping is built on JS, so it uses \uXXXX, where XXXX is a UTF-16 code unit. For code points outside the BMP, this means encoding surrogate pairs, which can get a bit hairy. (Or, you can just output the character directly, since JSON's encoded for is Unicode text, and allows these particular characters.)
Extract From Jettison:
public static String quote(String string) {
if (string == null || string.length() == 0) {
return "\"\"";
}
char c = 0;
int i;
int len = string.length();
StringBuilder sb = new StringBuilder(len + 4);
String t;
sb.append('"');
for (i = 0; i < len; i += 1) {
c = string.charAt(i);
switch (c) {
case '\\':
case '"':
sb.append('\\');
sb.append(c);
break;
case '/':
// if (b == '<') {
sb.append('\\');
// }
sb.append(c);
break;
case '\b':
sb.append("\\b");
break;
case '\t':
sb.append("\\t");
break;
case '\n':
sb.append("\\n");
break;
case '\f':
sb.append("\\f");
break;
case '\r':
sb.append("\\r");
break;
default:
if (c < ' ') {
t = "000" + Integer.toHexString(c);
sb.append("\\u" + t.substring(t.length() - 4));
} else {
sb.append(c);
}
}
}
sb.append('"');
return sb.toString();
}
Try this org.codehaus.jettison.json.JSONObject.quote("your string").
Download it here: http://mvnrepository.com/artifact/org.codehaus.jettison/jettison
org.json.simple.JSONObject.escape() escapes quotes,\, /, \r, \n, \b, \f, \t and other control characters. It can be used to escape JavaScript codes.
import org.json.simple.JSONObject;
String test = JSONObject.escape("your string");
There is now a StringEscapeUtils#escapeJson(String) method in the Apache Commons Text library.
The methods of interest are as follows:
StringEscapeUtils#escapeJson(String)
StringEscapeUtils#unescapeJson(String)
This functionality was initially released as part of Apache Commons Lang version 3.2 but has since been deprecated and moved to Apache Commons Text. So if the method is marked as deprecated in your IDE, you're importing the implementation from the wrong library (both libraries use the same class name: StringEscapeUtils).
The implementation isn't pure Json. As per the Javadoc:
Escapes the characters in a String using Json String rules.
Escapes any values it finds into their Json String form. Deals
correctly with quotes and control-chars (tab, backslash, cr, ff, etc.)
So a tab becomes the characters '\' and 't'.
The only difference between Java strings and Json strings is that in
Json, forward-slash (/) is escaped.
See http://www.ietf.org/rfc/rfc4627.txt for further details.
org.json.JSONObject quote(String data) method does the job
import org.json.JSONObject;
String jsonEncodedString = JSONObject.quote(data);
Extract from the documentation:
Encodes data as a JSON string. This applies quotes and any necessary character escaping. [...] Null will be interpreted as an empty string
StringEscapeUtils.escapeJavaScript / StringEscapeUtils.escapeEcmaScript should do the trick too.
If you are using fastexml jackson, you can use the following:
com.fasterxml.jackson.core.io.JsonStringEncoder.getInstance().quoteAsString(input)
If you are using codehaus jackson, you can use the following:
org.codehaus.jackson.io.JsonStringEncoder.getInstance().quoteAsString(input)
Not sure what you mean by "creating json manually", but you can use something like gson (http://code.google.com/p/google-gson/), and that would transform your HashMap, Array, String, etc, to a JSON value. I recommend going with a framework for this.
I have not spent the time to make 100% certain, but it worked for my inputs enough to be accepted by online JSON validators:
org.apache.velocity.tools.generic.EscapeTool.EscapeTool().java("input")
although it does not look any better than org.codehaus.jettison.json.JSONObject.quote("your string")
I simply use velocity tools in my project already - my "manual JSON" building was within a velocity template
For those who came here looking for a command-line solution, like me, cURL's --data-urlencode works fine:
curl -G -v -s --data-urlencode 'query={"type" : "/music/artist"}' 'https://www.googleapis.com/freebase/v1/mqlread'
sends
GET /freebase/v1/mqlread?query=%7B%22type%22%20%3A%20%22%2Fmusic%2Fartist%22%7D HTTP/1.1
, for example. Larger JSON data can be put in a file and you'd use the # syntax to specify a file to slurp in the to-be-escaped data from. For example, if
$ cat 1.json
{
"type": "/music/artist",
"name": "The Police",
"album": []
}
you'd use
curl -G -v -s --data-urlencode query#1.json 'https://www.googleapis.com/freebase/v1/mqlread'
And now, this is also a tutorial on how to query Freebase from the command line :-)
Use EscapeUtils class in commons lang API.
EscapeUtils.escapeJavaScript("Your JSON string");
Consider Moshi's JsonWriter class. It has a wonderful API and it reduces copying to a minimum, everything can be nicely streamed to a filed, OutputStream, etc.
OutputStream os = ...;
JsonWriter json = new JsonWriter(Okio.buffer(Okio.sink(os)));
json.beginObject();
json.name("id").value(getId());
json.name("scores");
json.beginArray();
for (Double score : getScores()) {
json.value(score);
}
json.endArray();
json.endObject();
If you want the string in hand:
Buffer b = new Buffer(); // okio.Buffer
JsonWriter writer = new JsonWriter(b);
//...
String jsonString = b.readUtf8();
If you need to escape JSON inside JSON string, use org.json.JSONObject.quote("your json string that needs to be escaped") seem to work well
Apache commons-text now has a
StringEscapeUtils.escapeJson(String).
using the \uXXXX syntax can solve this problem, google UTF-16 with the name of the sign, you can find out XXXX, for example:utf-16 double quote
The methods here that show the actual implementation are all faulty.
I don't have Java code, but just for the record, you could easily convert this C#-code:
Courtesy of the mono-project #
https://github.com/mono/mono/blob/master/mcs/class/System.Web/System.Web/HttpUtility.cs
public static string JavaScriptStringEncode(string value, bool addDoubleQuotes)
{
if (string.IsNullOrEmpty(value))
return addDoubleQuotes ? "\"\"" : string.Empty;
int len = value.Length;
bool needEncode = false;
char c;
for (int i = 0; i < len; i++)
{
c = value[i];
if (c >= 0 && c <= 31 || c == 34 || c == 39 || c == 60 || c == 62 || c == 92)
{
needEncode = true;
break;
}
}
if (!needEncode)
return addDoubleQuotes ? "\"" + value + "\"" : value;
var sb = new System.Text.StringBuilder();
if (addDoubleQuotes)
sb.Append('"');
for (int i = 0; i < len; i++)
{
c = value[i];
if (c >= 0 && c <= 7 || c == 11 || c >= 14 && c <= 31 || c == 39 || c == 60 || c == 62)
sb.AppendFormat("\\u{0:x4}", (int)c);
else switch ((int)c)
{
case 8:
sb.Append("\\b");
break;
case 9:
sb.Append("\\t");
break;
case 10:
sb.Append("\\n");
break;
case 12:
sb.Append("\\f");
break;
case 13:
sb.Append("\\r");
break;
case 34:
sb.Append("\\\"");
break;
case 92:
sb.Append("\\\\");
break;
default:
sb.Append(c);
break;
}
}
if (addDoubleQuotes)
sb.Append('"');
return sb.ToString();
}
This can be compacted into
// https://github.com/mono/mono/blob/master/mcs/class/System.Json/System.Json/JsonValue.cs
public class SimpleJSON
{
private static bool NeedEscape(string src, int i)
{
char c = src[i];
return c < 32 || c == '"' || c == '\\'
// Broken lead surrogate
|| (c >= '\uD800' && c <= '\uDBFF' &&
(i == src.Length - 1 || src[i + 1] < '\uDC00' || src[i + 1] > '\uDFFF'))
// Broken tail surrogate
|| (c >= '\uDC00' && c <= '\uDFFF' &&
(i == 0 || src[i - 1] < '\uD800' || src[i - 1] > '\uDBFF'))
// To produce valid JavaScript
|| c == '\u2028' || c == '\u2029'
// Escape "</" for <script> tags
|| (c == '/' && i > 0 && src[i - 1] == '<');
}
public static string EscapeString(string src)
{
System.Text.StringBuilder sb = new System.Text.StringBuilder();
int start = 0;
for (int i = 0; i < src.Length; i++)
if (NeedEscape(src, i))
{
sb.Append(src, start, i - start);
switch (src[i])
{
case '\b': sb.Append("\\b"); break;
case '\f': sb.Append("\\f"); break;
case '\n': sb.Append("\\n"); break;
case '\r': sb.Append("\\r"); break;
case '\t': sb.Append("\\t"); break;
case '\"': sb.Append("\\\""); break;
case '\\': sb.Append("\\\\"); break;
case '/': sb.Append("\\/"); break;
default:
sb.Append("\\u");
sb.Append(((int)src[i]).ToString("x04"));
break;
}
start = i + 1;
}
sb.Append(src, start, src.Length - start);
return sb.ToString();
}
}
I think the best answer in 2017 is to use the javax.json APIs. Use javax.json.JsonBuilderFactory to create your json objects, then write the objects out using javax.json.JsonWriterFactory. Very nice builder/writer combination.
I have string which contains alpahanumeric and special character.
I need to replace each and every special char with some string.
For eg,
Input string = "ja*va st&ri%n#&"
Expected o/p = "jaasteriskvaspacestandripercentagenatand"
= "asterisk"
& = "and"
% = "percentage"
# = "at"
thanks,
Unless you're absolutely desperate for performance, I'd use a very simple approach:
String result = input.replace("*", "asterisk")
.replace("%", "percentage")
.replace("#", "at"); // Add more to taste :)
(Note that there's a big difference between replace and replaceAll - the latter takes a regular expression. It's easy to get the wrong one and see radically different effects!)
An alternative would be something like:
public static String replaceSpecial(String input)
{
// Output will be at least as long as input
StringBuilder builder = new StringBuilder(input.length());
for (int i = 0; i < input.length(); i++)
{
char c = input.charAt(i);
switch (c)
{
case '*': builder.append("asterisk"); break;
case '%': builder.append("percentage"); break;
case '#': builder.append("at"); break;
default: builder.append(c); break;
}
}
return builder.toString();
Take a look at the following java.lang.String methods:
replace()
replaceAll()