Input Sanitizing to not break JSON syntax - java

So, in a nutshell, I'm trying to create a regex that I can use in a Java program that is about to submit a JSON object to my PHP server.
myString.replaceAll(myRegexString,"");
My problem is that I am absolutely no good with regex, and on top of that I need to escape the characters properly twice: once because the pattern is stored in a Java string, and once inside the regex itself. Good lordy.
What I came up with was this:
String myRegexString = "[\"',{}[]:;]"
The first backslash was to escape outer quotes to get a " in there. And then it struck me that {} and [] are also regex commands. Would I escape those as well? Like:
String myRegexString = "[\"',\{\}\[\]:;]"
Thanks in advance. In case it wasn't clear from the examples above, the only characters I really care about at the moment are:
" { } [ ] , and also ; : ' for general SQL injection protection.
UPDATE:
This is the final regex:
[\\Q\"',{}[\]:;\\E] for anyone else curious. Thanks Amit!

Why don't you use an actual JSON encoding API/framework? What you're doing is not sanitizing. What you're doing is corrupting the data. If my name is O'Reilly, I want it to be spelled O'Reilly, not OReilly. If I send a message containing [ or {, I want these to be in the messages. Use a framework or API that escapes those characters when needed rather than removing them blindly.
Googling for JSON Java will lead you to many APIs and frameworks.
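To make the difference between escaping and removing concrete, here is a minimal hand-rolled sketch of what a JSON library does to a string value. The class and method names (JsonEscapeSketch, escapeJson) are made up for illustration; in real code you would normally let a library such as org.json or Jackson do this for you.

```java
final class JsonEscapeSketch {

    // Hypothetical helper: escapes a Java string so it can be placed between
    // the double quotes of a JSON string literal. Nothing is removed.
    static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break; // quote becomes backslash + quote
                case '\\': sb.append("\\\\"); break; // backslash is doubled
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                case '\b': sb.append("\\b");  break;
                case '\f': sb.append("\\f");  break;
                default:
                    if (c < 0x20) {
                        // remaining control characters get a six-character hex escape
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c); // ' { } [ ] , : ; are legal inside a JSON string
                    }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = "O'Reilly said \"use [brackets] and {braces}\"";
        System.out.println("{\"name\":\"" + escapeJson(name) + "\"}");
    }
}
```

Only the double quote, the backslash and control characters need escaping inside a JSON string; the other characters from the question pass through untouched, so O'Reilly stays O'Reilly.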

Try something like
String myRegexString = "[\\Q\"',{}[]:;\\E]";
The characters between \Q and \E are now treated as literal characters.
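For comparison, here is a small sketch of the same character set written with explicit backslashes instead of \Q and \E. The class name and the sample input are made up for illustration. Inside a Java character class only [ and ] need escaping here; { and } are literal.

```java
final class StripCharsSketch {

    // Java string literal for the regex ["',{}\[\]:;]
    // \"   -> a double quote (escaped for the Java string only)
    // \\[  -> a literal [ inside the character class
    // \\]  -> a literal ] inside the character class
    static final String STRIP = "[\"',{}\\[\\]:;]";

    // replaceAll returns a new string; the original is not modified.
    static String strip(String input) {
        return input.replaceAll(STRIP, "");
    }

    public static void main(String[] args) {
        System.out.println(strip("O'Reilly: {a} [b]; \"c\", d"));
    }
}
```

Note that this removes the characters outright, which is exactly the data loss the other answer warns about.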

Related

JSON broken when double quotes comes inside the key/value

Sample data:
{"630":{"TotalLength":"33-3/8" - 36-3/4""},"631":{"Length":"34 37 7/8"}}
We are facing a double-quote issue in a JSON response. How can we replace the double quotes that appear inside a key or value with \"? Java is the development platform.
This answer is assuming that you are not in control of creating this JSON-like string. If you can control that part, then you should be escaping properly there itself.
In this case, since parsing systematically is not an option as it's not a valid JSON yet, all I could suggest is to go through the various strings and see if you can find a pattern on which you can apply some logic and escape all the "s which prevent the string from being a valid JSON.
Here is probably a way to start:
All of the "s that must be there for the string to be valid JSON are adjacent to one or more of the characters {, :, ,, and }, with or without whitespace between the " and the other JSON character.
So, scan the JSON-like string in Java and look at every ". If it sits next to one of the above characters (with or without whitespace in between), leave it as it is. If not, replace that " with \".
Note that this method may or may not work depending on the data in question. What I mean to convey is an approach you may find useful if there is absolutely no way for the string to be escaped during its creation, and if these strings follow a strict pattern with respect to the unescaped "s.
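A rough sketch of that heuristic, under my own assumptions: a " counts as structural if the nearest non-whitespace character before it is one of { : , or the nearest non-whitespace character after it is one of : , }. The class and method names are invented for illustration, arrays ([ and ]) are not considered, and quotes that are already escaped in the input would be escaped a second time.

```java
final class InnerQuoteEscaperSketch {

    // Adds a backslash before every double quote that does not look structural.
    static String escapeInnerQuotes(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"' && !isStructural(s, i)) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    // Heuristic only: checks the neighbours of the quote at index i,
    // skipping any whitespace between the quote and the neighbour.
    static boolean isStructural(String s, int i) {
        int p = i - 1;
        while (p >= 0 && Character.isWhitespace(s.charAt(p))) {
            p--;
        }
        int n = i + 1;
        while (n < s.length() && Character.isWhitespace(s.charAt(n))) {
            n++;
        }
        boolean opens = p >= 0 && "{:,".indexOf(s.charAt(p)) >= 0;          // opening quote
        boolean closes = n < s.length() && ":,}".indexOf(s.charAt(n)) >= 0; // closing quote
        return opens || closes;
    }

    public static void main(String[] args) {
        String broken = "{\"630\":{\"TotalLength\":\"33-3/8\" - 36-3/4\"\"},"
                + "\"631\":{\"Length\":\"34 37 7/8\"}}";
        System.out.println(escapeInnerQuotes(broken));
    }
}
```

On the sample data this escapes the two inch marks and leaves the other quotes alone. A value such as "a", "b" inside a single string would fool it, which is why the caveat above matters.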

How to Remove Special Character Except Comma

My Android EditText shows output like this:
["HOT","SMALL"]
I want the output to look like this:
HOT,SMALL
I want to remove the [] and "" but keep the comma. I have read this, but it does not work. I tried this, but it removes all special characters. Can anybody help with my problem? Any suggestion would be helpful. Thanks in advance.
There are a couple of ways I'd do this.
The first, quick and straightforward, is to just replace all the special characters with "", using a regex and String.replaceAll
myString.replaceAll("[\\\"\\[\\]]", "");
(Btw, I used http://rubular.com/ as a quick way to check my regex. Remember that the regex needs to be escaped for java - I used this tool to do that.)
The alternative is that you're actually looking at the String representation of a JSON object here, so convert the JSON string into a Java array of Strings using something like org.json, and then concatenate the strings together with a , delimiter.
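A minimal sketch of the first approach, with an invented class name. One thing that is easy to miss: replaceAll returns a new string, so the result has to be assigned or returned. The org.json route is not shown here because it needs the library on the classpath.

```java
final class BracketQuoteStripSketch {

    // Regex ["\[\]] : a character class holding the double quote, [ and ].
    // The comma is not in the class, so it survives.
    static String stripBracketsAndQuotes(String s) {
        return s.replaceAll("[\"\\[\\]]", "");
    }

    public static void main(String[] args) {
        String fromEditText = "[\"HOT\",\"SMALL\"]";
        System.out.println(stripBracketsAndQuotes(fromEditText)); // prints HOT,SMALL
    }
}
```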

Java String#contains() using String#matches() with escape character

I need a simple way to implement the contains function using matches. I believe this is my starting point:
xxx.matches("'.*yyy.*'");
But I need to make it a universal method and pre-process whatever I search for so that matches accepts it! This must be done using only the escape character '\'!
Imagine a string SEARCH_FOR that can contain some special characters that must be "regex escaped"...
String SEARCH_FOR = "*.\\";
xxx.matches("'.*" + SEARCH_FOR + ".*'");
Are there any catches? Special situations? Are there any other special characters that should be taken into account?
Are you looking for Pattern.quote(String) ?
This escapes special characters for you.
EDIT:
After reading the comments, I really hope you try Pattern.quote(yourString.toLowerCase()) as it sounds like you've been using Pattern.quote(yourString).toLowerCase(). If DataNucleus is applying the regex then there should be no problems with using the \Q and \E escape sequence.
Since you have really asked for it, ".\\".replaceAll("(\\.|\\$|\\+|\\*|\\\\)", "\\\\$1") outputs \.\\
This escapes every ., $, +, * and \. Note that the security of this is now entirely up to you. If you fail to escape something you needed to, or you escape it incorrectly, you will either allow people to use regex inside the search term when you weren't expecting it, or the search won't return the results you were expecting.
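Both options can be sketched side by side. The class and method names are invented, and the (?s) flag is my addition so that . also matches line breaks in the text being searched. The manual variant escapes only the five characters listed above, so it is not a complete regex escaper.

```java
import java.util.regex.Pattern;

final class ContainsViaMatchesSketch {

    // contains() expressed through matches(): the search term is quoted so
    // that none of its characters are interpreted as regex syntax.
    static boolean containsViaMatches(String text, String searchFor) {
        return text.matches("(?s).*" + Pattern.quote(searchFor) + ".*");
    }

    // Backslash-only alternative, limited to . $ + * and \ as in the answer.
    // The replacement "\\\\$1" means: one literal backslash, then group 1.
    static String escapeManually(String searchFor) {
        return searchFor.replaceAll("(\\.|\\$|\\+|\\*|\\\\)", "\\\\$1");
    }

    public static void main(String[] args) {
        String searchFor = "*.\\";
        System.out.println(containsViaMatches("price 3*.\\ok", searchFor));
        System.out.println(escapeManually(".\\"));
    }
}
```

The manual version leaves characters such as ( ) [ ] { } ? ^ | unescaped, so a term containing them would still be treated as regex syntax.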

How to build a regular expression to get the value between two single quotes and, if there is no single quote, extract between commas

The problem I face:
-I have an input string, a SQL statement, that I need to parse
-I need to extract the values to be inserted, based on the column names specified
-I can extract a value that is wrapped in two single quotes, but:
--?what about values that are not wrapped in single quotes? (like: integer or double)
--?what if the value itself already contains single quotes? (like: 'James''s dictionary')
Below is the sample input string:
INSERT INTO LJS1_DX (base, doc, key1, key2, no, sq, eq, ln, en, date, line)
VALUES ('GET','','#000210','',' 0',' 1','5',1,0,'20100706','Street''James''s dictionary')
The Java code I have below matches only values between two single quotes:
Pattern p = Pattern.compile("'.*?'");
String columnValues = "'GET0','','#000210','',' 0',' 1','5',1,0,'20100706','Street''James''s dictionary'";
Matcher m = p.matcher(columnValues); // get a matcher object
while (m.find()) {
    logger.trace(m.group());
}
I would appreciate it if anyone could provide a guideline or sample for this question.
Thank you!!
I agree with gnibbler that this is a job for a csv parser.
A regex that works on your example would be
'(?:''|[^'])*'|[^',\s]+
which looks challenging to debug and maintain, doesn't it?
Explanation:
' # First alternative: match an "opening" '
(?: # followed by either...
'' # two ' in a row (escaped ')
| # or...
[^'] # any character that is not a '
)* # zero or more times,
' # then match a "closing" '
| # or (second alternative):
[^',\s]+ # match any run of characters except ', comma or whitespace
It also works if there is whitespace around the values/commas (and will leave that out of the match).
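Here is a small sketch of how that pattern could be applied in Java, using the form from the explanation (with \s in the second alternative). The class and method names are made up, and the unquote helper is my addition to turn a quoted token back into its plain value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SqlValueTokenizerSketch {

    // Quoted value (with '' as an escaped quote) OR a bare run of characters.
    static final Pattern VALUE = Pattern.compile("'(?:''|[^'])*'|[^',\\s]+");

    // Returns every match in order, quotes still attached.
    static List<String> tokens(String columnValues) {
        List<String> result = new ArrayList<>();
        Matcher m = VALUE.matcher(columnValues);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    // Strips the surrounding quotes and collapses '' to ' for quoted tokens.
    static String unquote(String token) {
        if (token.length() >= 2 && token.startsWith("'") && token.endsWith("'")) {
            return token.substring(1, token.length() - 1).replace("''", "'");
        }
        return token;
    }

    public static void main(String[] args) {
        String values = "'GET','','#000210','',' 0',' 1','5',1,0,"
                + "'20100706','Street''James''s dictionary'";
        for (String t : tokens(values)) {
            System.out.println(t + " -> " + unquote(t));
        }
    }
}
```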
Regexes are not really suitable for this. You will always find cases that fail.
A CSV parser such as opencsv is probably a better option.
In general, when you need to parse complex languages, regexps are not the best tool: there's too much context to make sense of. So, if reading XML use an XML parser, if reading C code use a C language parser, and if reading SQL ...
There's a Java SQL parser here; I would use something like this.
For other languages it may be best to use a "YACC"-like parser. For example JACK
Instead, you can get all the values by taking the substring after the VALUES keyword. The column names can be obtained the same way. You then have two comma-separated strings that can be converted to arrays, one for names and one for values, and you can check which parameter has which value.
hope this helps.
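A rough sketch of that substring idea, with some assumptions of mine: the statement has exactly the shape of the sample (one column list, one VALUES list), the first VALUES word is the keyword, and the values are tokenized with the quote-aware regex from the other answer rather than a plain split on commas, since a comma inside a quoted value would break a plain split. All names are invented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InsertStatementSketch {

    private static final Pattern VALUES_KEYWORD =
            Pattern.compile("\\bVALUES\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern VALUE =
            Pattern.compile("'(?:''|[^'])*'|[^',\\s]+");

    // Maps each column name to its raw value token (quotes still attached).
    static Map<String, String> columnsToValues(String sql) {
        Matcher kw = VALUES_KEYWORD.matcher(sql);
        if (!kw.find()) {
            throw new IllegalArgumentException("no VALUES keyword found");
        }
        String head = sql.substring(0, kw.start()); // INSERT INTO t (a, b, ...)
        String tail = sql.substring(kw.end());      // ('x', 1, ...)

        String names = head.substring(head.indexOf('(') + 1, head.lastIndexOf(')'));
        String values = tail.substring(tail.indexOf('(') + 1, tail.lastIndexOf(')'));

        String[] columns = names.trim().split("\\s*,\\s*");
        List<String> tokens = new ArrayList<>();
        Matcher m = VALUE.matcher(values);
        while (m.find()) {
            tokens.add(m.group());
        }
        if (columns.length != tokens.size()) {
            throw new IllegalArgumentException("column/value count mismatch");
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < columns.length; i++) {
            result.put(columns[i], tokens.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        String sql = "INSERT INTO LJS1_DX (base, doc, key1, key2, no, sq, eq, ln, en, date, line)\n"
                + "VALUES ('GET','','#000210','',' 0',' 1','5',1,0,'20100706',"
                + "'Street''James''s dictionary')";
        columnsToValues(sql).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```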
I think Tim had the right idea; it just needs to be implemented more efficiently. Here's a much more efficient version:
'[^']*+(?:''[^']*+)*+'|[^',\s]++
It uses Friedl's "unrolled loop" technique to avoid excessive reliance on alternations that match one or two characters at a time (I think that's what did you in, Tim), plus possessive quantifiers throughout.
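As a quick sanity check, the two patterns can be run against the sample values to confirm they pick out the same tokens. This sketch says nothing about speed; it only compares results, and the class and method names are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnrolledLoopSketch {

    // Alternation-based version from the earlier answer.
    static final Pattern SIMPLE =
            Pattern.compile("'(?:''|[^'])*'|[^',\\s]+");

    // Unrolled version with possessive quantifiers (*+ and ++), which never
    // give back characters once they have matched them.
    static final Pattern UNROLLED =
            Pattern.compile("'[^']*+(?:''[^']*+)*+'|[^',\\s]++");

    static List<String> findAll(Pattern p, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String values = "'GET','','#000210','',' 0',' 1','5',1,0,"
                + "'20100706','Street''James''s dictionary'";
        System.out.println(findAll(SIMPLE, values).equals(findAll(UNROLLED, values)));
    }
}
```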
Regular expressions are not easy to use with this (but everything is possible).
I would suggest parsing it yourself, or use a library to do the parsing. By writing the parser yourself you are certain that it works exactly as you need it to.

Regular expression for splitting JSON text in lines after symbols

I am trying to use a regular expression to have this kind of string
{
"key1"
:
value1
,
"key2"
:
"value2"
,
"arrayKey"
:
[
{
"keyA"
:
valueA
,
"keyB"
:
"valueB"
,
"keyC"
:
[
0
,
1
,
2
]
}
]
}
from
JSONObject.toString()
that is one long line of text in my Android Java app
{"key1":"value1","key2":"value2","arrayKey":[{"keyA":"valueA","keyB":"valueB","keyC":[0,1,2]}]}
I found this regular expression for finding all commas.
/(,)(?=(?:[^"]|"[^"]*")*$)/
Now I need to know:
0- if this is reliable, that is, does what they say.
1- if this also works with commas inside double quotes.
2- if this takes into account escaped double-quotes.
3- if I have to take into account also single quotes, as this file is produced by my app but occasionally it could be manually edited by the user.
5- It has to be used with the multi-line flag to work with multi-line text.
6- It has to work with replaceAll().
The resulting regular expression will be used to replace each symbol with a two-character sequence made of the symbol itself plus a \n character.
The resulting text has to be still JSON text.
Subsequent replace actions will take place also for the other symbols
: [ ] { }
and other symbols that can be found in JSON files outside the alphanumeric sequences between quotes (I do not know if the mentioned symbols are the only ones).
It's not that simple, but yes, if you want to do it this way you need to match the characters ([, {, ", ', :) and replace each one with itself followed by a newline character.
For example:
[ should get replaced with [\n
The answer to your question is yes: it is reliable and good to implement, since a single line of code does everything. That's what regex is made for.
0- if this is reliable, that is, does what they say.
Let's break down the expression a little:
(,) is a capturing group that matches a single comma
(?=...) would mean a positive lookahead meaning the comma would need to be followed by a match of that group's content
(?:...)* would be a non-capturing group that can occur 0 to many times
[^"]|"[^"]*" would match either any character except a double quote ([^"]) or (|) a pair of double quotes with any character in between except other double quotes ("[^"]*")
As you can see especially the last part could make it unreliable if there are escaped double quotes in a text value, so the answer would be "this is reliable if the input is simple enough".
1- if this also works with commas inside double quotes.
If the double quote pairs are correctly identified any commas in between would be ignored.
2- if this takes into account escaped double-quotes.
Here's one of the major problems: escaped double quotes would need to be handled. This can get quite complex if you want to handle arbitrary cases, especially if the texts could contain commas as well.
3- if I have to take into account also single quotes, as this file is produced by my app but occasionally it could be manually edited by the user.
Single quotes aren't allowed by the JSON specification but many parsers support them because humans tend to use them anyway. Thus you might need to take them into account and that makes no. 2 even more complex because now there might be an unescaped double quote in a single quote text.
5- It has to be used with the multi-line flag to work with multi-line text.
I'm not entirely sure about that but adding the multi-line flag shouldn't hurt. You could add it to the expression itself though, i.e. by prepending (?m).
6- It has to work with replaceAll().
In its current form the regex would work with String#replaceAll() because it only matches the comma - the lookahead is used to determine a match but won't result in the wrong parts being replaced. The matches themselves might not be correct though, as described above.
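The points above can be tried out with a short sketch. The class name and the two sample inputs are made up; the pattern is the one from the question, without the surrounding slashes, and "$1\n" puts the captured comma back followed by a newline.

```java
final class CommaRegexSketch {

    // (,)(?=(?:[^"]|"[^"]*")*$) as a Java string literal.
    static final String COMMA_OUTSIDE_QUOTES = "(,)(?=(?:[^\"]|\"[^\"]*\")*$)";

    static String breakAfterCommas(String json) {
        return json.replaceAll(COMMA_OUTSIDE_QUOTES, "$1\n");
    }

    public static void main(String[] args) {
        // Simple input: the comma inside "x,y" is left alone.
        System.out.println(breakAfterCommas("{\"a\":\"x,y\",\"b\":1}"));

        // Escaped quote after a comma inside a string: the lookahead now sees
        // an even number of quotes and wrongly treats that comma as structural.
        System.out.println(breakAfterCommas("{\"a\":\"x,\\\"y\",\"b\":1}"));
    }
}
```

The second call inserts a raw line break inside the string value, so the output is no longer valid JSON. That is the "reliable if the input is simple enough" caveat in practice.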
That being said, you should note that JSON is not a regular language and only regular languages are a perfect fit for regular expressions.
Thus I'd recommend using a proper JSON parser (there are quite a lot out there) to parse the JSON into POJOs (might just be a bunch of generic JsonObject and JsonArray instances) and reformat that according to your needs.
Here's an example of how Jackson could be used to accomplish that: https://kodejava.org/how-to-pretty-print-json-string-using-jackson/
In fact, since you're already using JSONObject.toString() you probably don't need the parser itself but just a proper formatter (if you want/need to roll your own you could have a look at the org.json.JSONObject sources ).
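If you do roll your own, a character scanner that tracks whether it is inside a string is one way to do it. The following is only a minimal sketch with invented names: it assumes double-quoted strings with backslash escapes (no single quotes) and appends \n after each of { } [ ] , : found outside a string, as the question describes.

```java
final class JsonLineBreakerSketch {

    static String breakAfterSymbols(String json) {
        StringBuilder out = new StringBuilder(json.length() * 2);
        boolean inString = false; // currently between two unescaped quotes
        boolean escaped = false;  // previous character was a backslash inside a string
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            out.append(c);
            if (inString) {
                if (escaped) {
                    escaped = false;      // this character is consumed by the escape
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;     // closing quote
                }
            } else if (c == '"') {
                inString = true;          // opening quote
            } else if ("{}[],:".indexOf(c) >= 0) {
                out.append('\n');         // structural symbol outside a string
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(breakAfterSymbols(
                "{\"key1\":\"value1\",\"arrayKey\":[{\"keyC\":[0,1,2]}]}"));
    }
}
```

Because whitespace between JSON tokens is insignificant, the result is still JSON, and commas or escaped quotes inside string values are left alone.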
