I am trying to validate a JSON string using a regex. I found a working regex in another post: https://stackoverflow.com/a/3845829/7493427
It uses the DEFINE feature of regex, but I think the JRegex library does not support that feature. Is there a workaround for this?
I used java.util.regex first, then found out about the JRegex library, but that doesn't work either.
String regex = "(?(DEFINE)" +
    "(?<number> -? (?= [1-9]|0(?!\\d) ) \\d+ (\\.\\d+)? ([eE] [+-]? \\d+)? )" +
    "(?<boolean> true | false | null )" +
    "(?<string> \" ([^\"\\n\\r\\t\\\\\\\\]* | \\\\\\\\ [\"\\\\\\\\bfnrt\\/] | \\\\\\\\ u [0-9a-f]{4} )* \" )" +
    "(?<array> \\[ (?: (?&json) (?: , (?&json) )* )? \\s* \\] )" +
    "(?<pair> \\s* (?&string) \\s* : (?&json) )" +
    "(?<object> \\{ (?: (?&pair) (?: , (?&pair) )* )? \\s* \\} )" +
    "(?<json> \\s* (?: (?&number) | (?&boolean) | (?&string) | (?&array) | (?&object) ) \\s* )" +
    ")" +
    "\\A (?&json) \\Z";
String test = "{\"asd\" : \"asdasdasdasdasdasd\"}";
jregex.Pattern pattern = new jregex.Pattern(regex);
jregex.Matcher matcher = pattern.matcher(test);
if(matcher.find()) {
System.out.println(matcher.groups());
}
I expected a match as the test json is valid, but I get an exception.
Exception in thread "main" jregex.PatternSyntaxException: unknown group name in conditional expr.: DEFINE
    at jregex.Term.makeTree(Term.java:360)
    at jregex.Term.makeTree(Term.java:219)
    at jregex.Term.makeTree(Term.java:206)
    at jregex.Pattern.compile(Pattern.java:164)
    at jregex.Pattern.<init>(Pattern.java:150)
    at jregex.Pattern.<init>(Pattern.java:108)
    at com.cloak.utilities.regex.VariableValidationHelper.main(VariableValidationHelper.java:305)
You can use this rather simple Jackson setup:
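// Imports assume Jackson 2.x package names.
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.util.Map;

private static final ObjectMapper MAPPER = new ObjectMapper();

public static boolean isValidJson(String json) {
    try {
        // readValue throws if the input cannot be parsed into a Map.
        MAPPER.readValue(json, Map.class);
        return true;
    } catch (IOException e) {
        return false;
    }
}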
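ObjectMapper#readValue() throws a JsonProcessingException (a subclass of IOException) when the input is invalid. Because the target type is Map.class, only a top-level JSON object is accepted; MAPPER.readTree(json) accepts any JSON value.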
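If adding a JSON library is not an option, one possible workaround for the missing DEFINE support is to keep regexes for the flat tokens only and do the nesting in plain Java, since java.util.regex has no recursive or subroutine calls. The following is a minimal sketch of that idea, not code from the question or the answer; the class name JsonSyntaxCheck and all of its method names are made up for illustration, and it only checks syntax.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: regexes for leaf tokens, recursion in Java for nesting.
public final class JsonSyntaxCheck {
    private static final Pattern NUMBER =
            Pattern.compile("-?(?:0|[1-9]\\d*)(?:\\.\\d+)?(?:[eE][+-]?\\d+)?");
    private static final Pattern STRING = Pattern.compile(
            "\"(?:[^\"\\\\\\x00-\\x1F]++|\\\\[\"\\\\/bfnrt]|\\\\u[0-9a-fA-F]{4})*+\"");
    private static final Pattern LITERAL = Pattern.compile("true|false|null");

    private final String text;
    private int pos;

    private JsonSyntaxCheck(String text) { this.text = text; }

    /** Returns true if the whole input is exactly one JSON value. */
    public static boolean isValidJson(String text) {
        JsonSyntaxCheck c = new JsonSyntaxCheck(text);
        return c.value() && c.pos == text.length();
    }

    // value := ws (object | array | string | number | literal) ws
    private boolean value() {
        skipWhitespace();
        boolean ok = at('{') ? object()
                : at('[') ? array()
                : token(STRING) || token(NUMBER) || token(LITERAL);
        skipWhitespace();
        return ok;
    }

    private boolean object() {
        pos++;                                  // consume '{'
        skipWhitespace();
        if (consume('}')) return true;          // empty object
        do {
            skipWhitespace();
            if (!token(STRING)) return false;   // member name
            skipWhitespace();
            if (!consume(':')) return false;
            if (!value()) return false;
        } while (consume(','));
        return consume('}');
    }

    private boolean array() {
        pos++;                                  // consume '['
        skipWhitespace();
        if (consume(']')) return true;          // empty array
        do {
            if (!value()) return false;
        } while (consume(','));
        return consume(']');
    }

    private boolean at(char c) {
        return pos < text.length() && text.charAt(pos) == c;
    }

    private boolean consume(char c) {
        if (!at(c)) return false;
        pos++;
        return true;
    }

    private void skipWhitespace() {
        while (pos < text.length() && " \t\n\r".indexOf(text.charAt(pos)) >= 0) pos++;
    }

    // Tries to match the token pattern exactly at the current position.
    private boolean token(Pattern p) {
        Matcher m = p.matcher(text).region(pos, text.length());
        if (!m.lookingAt()) return false;
        pos = m.end();
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidJson("{\"asd\" : \"asdasdasdasdasdasd\"}")); // true
        System.out.println(isValidJson("{\"a\": [1, 2,]}"));                    // false
    }
}
```

Very deeply nested input recurses once per nesting level, so a production check would probably want a depth limit.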
I want to replace all Java-style comments (/* */) with the number of new lines for that comment. So far, I can only come up with something that replaces comments with an empty string
String.replaceAll("/\\*[\\s\\S]*?\\*/", "")
Is it possible to replace each match with as many new lines as it contains instead? If this is not possible with regex matching alone, what's the best way to do it?
For example,
/* This comment
has 2 new lines
contained within */
will be replaced with a string of just 2 new lines.
Since Java supports the \G construct, just do it all in one go.
Use a global regex replace function.
Find
"(?:\\/\\*(?=[\\S\\s]*?\\*\\/)|(?<!\\*\\/)(?!^)\\G)(?:(?!\\r?\\n|\\*\\/).)*((?:\\r?\\n)?)(?:\\*\\/)?"
Replace
"$1"
https://regex101.com/r/l1VraO/1
Expanded
(?:
/ \*
(?= [\S\s]*? \* / )
|
(?<! \* / )
(?! ^ )
\G
)
(?:
(?! \r? \n | \* / )
.
)*
( # (1 start)
(?: \r? \n )?
) # (1 end)
(?: \* / )?
==================================================
==================================================
If you ever need to care about comment delimiters that start inside
quoted strings, like this
String comment = "/* this is a comment*/"
here is an extended regex that parses the quoted string as well as the comment.
It is still a single regex, applied all at once in a global find / replace.
Find
"(\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\")|(?:\\/\\*(?=[\\S\\s]*?\\*\\/)|(?<!\")(?<!\\*\\/)(?!^)\\G)(?:(?!\\r?\\n|\\*\\/).)*((?:\\r?\\n)?)(?:\\*\\/)?"
Replace
"$1$2"
https://regex101.com/r/tUwuAI/1
Expanded
( # (1 start)
"
[^"\\]*
(?:
\\ [\S\s]
[^"\\]*
)*
"
) # (1 end)
|
(?:
/ \*
(?= [\S\s]*? \* / )
|
(?<! " )
(?<! \* / )
(?! ^ )
\G
)
(?:
(?! \r? \n | \* / )
.
)*
( # (2 start)
(?: \r? \n )?
) # (2 end)
(?: \* / )?
You can do it with a regex "replacement loop".
Most easily done in Java 9+:
String result = Pattern.compile("/\\*(?:[^*]++|\\*(?!/))*+\\*/").matcher(input)
.replaceAll(r -> r.group().replaceAll(".*", ""));
The main regex has been optimized for performance. The lambda has not been optimized.
For all Java versions:
Matcher m = Pattern.compile("/\\*(?:[^*]++|\\*(?!/))*+\\*/").matcher(input);
StringBuffer buf = new StringBuffer();
while (m.find())
m.appendReplacement(buf, m.group().replaceAll(".*", ""));
String result = m.appendTail(buf).toString();
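The inner replaceAll(".*", "") call can be avoided by copying only the line-break characters out of each matched comment. A minimal sketch under the assumption that only \n and \r count as line breaks (the regex dot also treats a few rarer characters as line terminators); the class and method names are made up for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommentNewlines {
    // Same block-comment regex as in the answer above.
    private static final Pattern BLOCK_COMMENT =
            Pattern.compile("/\\*(?:[^*]++|\\*(?!/))*+\\*/");

    /** Returns only the '\n' and '\r' characters of the given comment text. */
    public static String lineBreaksOnly(String comment) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < comment.length(); i++) {
            char c = comment.charAt(i);
            if (c == '\n' || c == '\r') {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Replaces every block comment with the line breaks it contained. */
    public static String stripComments(String input) {
        Matcher m = BLOCK_COMMENT.matcher(input);
        StringBuffer buf = new StringBuffer();
        while (m.find()) {
            // quoteReplacement is not strictly needed for line breaks; it just
            // keeps the replacement literal.
            m.appendReplacement(buf, Matcher.quoteReplacement(lineBreaksOnly(m.group())));
        }
        return m.appendTail(buf).toString();
    }

    public static void main(String[] args) {
        String input = "a\n/* x\ny\nz */\nb";
        // The three-line comment collapses to its two line breaks.
        System.out.println(stripComments(input).equals("a\n\n\n\nb")); // prints true
    }
}
```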
Test
final String input = "Line 1\n"
+ "/* Inline comment */\n"
+ "Line 3\n"
+ "/* One-line\n"
+ " comment */\n"
+ "Line 6\n"
+ "/* This\n"
+ " comment\n"
+ " has\n"
+ " 4\n"
+ " lines */\n"
+ "Line 12";
Matcher m = Pattern.compile("(?s)/\\*(?:[^*]++|\\*(?!/))*+\\*/").matcher(input);
String result = m.replaceAll(r -> r.group().replaceAll(".*", ""));
// Show input/result side-by-side
String[] inLines = input.split("\n", -1);
String[] resLines = result.split("\n", -1);
int lineCount = Math.max(inLines.length, resLines.length);
System.out.println("input                    |result");
System.out.println("-------------------------+-------------------------");
for (int i = 0; i < lineCount; i++) {
System.out.printf("%-25s|%s%n", (i < inLines.length ? inLines[i] : ""),
(i < resLines.length ? resLines[i] : ""));
}
Output
input                    |result
-------------------------+-------------------------
Line 1                   |Line 1
/* Inline comment */     |
Line 3                   |Line 3
/* One-line              |
 comment */              |
Line 6                   |Line 6
/* This                  |
 comment                 |
 has                     |
 4                       |
 lines */                |
Line 12                  |Line 12
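Maybe this expression,
\/\*.*?\*\/
in s (DOTALL) mode might be close to what you have in mind. Note that the test below substitutes a fixed "\n\n" for every comment, so it does not count the new lines in each one.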
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class re{
public static void main(String[] args){
final String regex = "\\/\\*.*?\\*\\/";
final String string = "/* This comment\n"
+ "has 2 new lines\n"
+ "contained within */\n\n"
+ "Some codes here 1\n\n"
+ "/* This comment\n"
+ "has 2 new lines\n"
+ "contained within \n"
+ "*/\n\n\n"
+ "Some codes here 2";
final String subst = "\n\n";
final Pattern pattern = Pattern.compile(regex, Pattern.DOTALL);
final Matcher matcher = pattern.matcher(string);
final String result = matcher.replaceAll(subst);
System.out.println(result);
}
}
Output
Some codes here 1
Some codes here 2
If you wish to explore/simplify/modify the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link how it would match against some sample inputs.
I want to be able to output both "==" and "=" as tokens.
For example, the input text file is:
biscuit==cookie apple=fruit+-()
The output:
biscuit
=
=
cookie
apple
=
fruit
+
-
(
)
What I want the output to be:
biscuit
==
cookie
apple
=
fruit
+
-
(
)
Here is my code:
Scanner s = null;
try {
s = new Scanner(new BufferedReader(new FileReader("input.txt")));
s.useDelimiter("\\s|(?<=\\p{Punct})|(?=\\p{Punct})");
while (s.hasNext()) {
String next = s.next();
System.out.println(next);
}
} finally {
if (s != null) {
s.close();
}
}
Thank you.
Edit: I want to be able to keep the current regex.
Just split the input string on the regex below.
String s = "biscuit==cookie apple=fruit";
String[] tok = s.split("\\s+|\\b(?==+)|(?<==)(?!=)");
System.out.println(Arrays.toString(tok));
Output:
[biscuit, ==, cookie, apple, =, fruit]
Explanation:
\\s+ Matches one or more whitespace characters.
| OR
\\b(?==+) Matches a word boundary only if it's followed by one or more = symbols.
| OR
(?<==) Lookbehind: the position must be preceded by a = symbol.
(?!=) Negative lookahead: and it must not be followed by a = symbol.
Update:
String s = "biscuit==cookie apple=fruit+-()";
String[] tok = s.split("\\s+|(?<!=)(?==+)|(?<==)(?!=)|(?=[+()-])");
System.out.println(Arrays.toString(tok));
Output:
[biscuit, ==, cookie, apple, =, fruit, +, -, (, )]
You might be able to qualify those punctuations with some additional assertions.
# "\\s|(?<===)|(?<=\\p{Punct})(?!(?<==)(?==))|(?=\\p{Punct})(?!(?<==)(?==))"
\s
| (?<= == )
| (?<= \p{Punct} )
(?!
(?<= = )
(?= = )
)
| (?= \p{Punct} )
(?!
(?<= = )
(?= = )
)
Info update
If some characters aren't covered in \p{Punct} just add them as a separate class within
the punctuation subexpressions.
For engines that don't do certain properties well inside classes, use this ->
# Raw: \s|(?<===)|(?<=\p{Punct}|[=+])(?!(?<==)(?==))|(?=\p{Punct}|[=+])(?!(?<==)(?==))
\s
| (?<= == )
| (?<= \p{Punct} | [=+] )
(?!
(?<= = )
(?= = )
)
| (?= \p{Punct} | [=+] )
(?!
(?<= = )
(?= = )
)
For engines that handle properties well inside classes, this is a better one ->
# Raw: \s|(?<===)|(?<=[\p{Punct}=+])(?!(?<==)(?==))|(?=[\p{Punct}=+])(?!(?<==)(?==))
\s
| (?<= == )
| (?<= [\p{Punct}=+] )
(?!
(?<= = )
(?= = )
)
| (?= [\p{Punct}=+] )
(?!
(?<= = )
(?= = )
)
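As a quick check, the first pattern in this answer can be tried with String.split on the sample line from the question. This is only a sketch; the class name is made up, and split is used here instead of Scanner just to keep the example short:

```java
import java.util.Arrays;

public class KeepDoubleEquals {
    // First pattern from the answer, written as a Java string literal.
    static final String DELIMITER =
            "\\s|(?<===)|(?<=\\p{Punct})(?!(?<==)(?==))|(?=\\p{Punct})(?!(?<==)(?==))";

    public static String[] tokenize(String line) {
        // Zero-width matches split without consuming characters; the
        // whitespace alternative consumes the separator itself.
        return line.split(DELIMITER);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("biscuit==cookie apple=fruit+-()")));
        // prints [biscuit, ==, cookie, apple, =, fruit, +, -, (, )]
    }
}
```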
In other words, you want to split on
one or more whitespace characters
a position that has = after it and a non-= before it (like foo|= where | represents this position)
a position that has = before it and a non-= after it (like =|foo where | represents this position)
As a delimiter:
s.useDelimiter("\\s+|(?<!=)(?==)|(?<==)(?!=)");
// ^^^^^ ^^^^^^^^^^^ ^^^^^^^^^^^
//cases: 1) 2) 3)
Since it looks like you are building a parser, I would suggest using a tool that lets you define a proper grammar, such as http://www.antlr.org/. If you must stick with regex, another improvement that makes the regex easier to build is to use Matcher#find instead of a Scanner delimiter. Your regex and code could then look like this:
String data = "biscuit==cookie apple=fruit+-()";
String regex = "<=|==|>=|[\\Q<>+-=()\\E]|[^\\Q<>+-=()\\E]+";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(data);
while (m.find())
System.out.println(m.group());
Output:
biscuit
==
cookie apple
=
fruit
+
-
(
)
You can make this regex more general by using
String regex = "<=|==|>=|\\p{Punct}|\\P{Punct}+";
// ^^^^^^^^^^ ^^^^^^^^^^^-- standard cases
// ^^ ^^ ^^------------------------- special cases
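Both regexes above let whitespace end up inside a token (the sample output contains "cookie apple"). If that is not wanted, one possible variation is to exclude whitespace from the catch-all alternative so that find() simply skips over it. A minimal sketch; the class and method names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindTokens {
    // Special cases first, then any single punctuation character, then a run
    // of characters that are neither punctuation nor whitespace.
    private static final Pattern TOKEN =
            Pattern.compile("<=|==|>=|\\p{Punct}|[^\\p{Punct}\\s]+");

    public static List<String> tokenize(String data) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(data);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("biscuit==cookie apple=fruit+-()"));
        // prints [biscuit, ==, cookie, apple, =, fruit, +, -, (, )]
    }
}
```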
This approach also requires reading the data from the file first and storing it in a single String, which you then parse. There are many ways to read text from a file; see for instance this question:
Reading a plain text file in Java
so you can use something like
String data = new String(Files.readAllBytes(Paths.get("input.txt")));
You can specify the encoding String should use when decoding the bytes with the constructor String(bytes, encoding). So you can write new String(bytes, "UTF-8") or, to avoid typos in the encoding name, use one of the constants in the StandardCharsets class, as in new String(bytes, StandardCharsets.UTF_8).
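Put together, reading the whole file with an explicit charset could look like the sketch below. The class name, the readUtf8 helper and the temporary file are only there to make the example self-contained:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadWholeFile {
    /** Reads the entire file and decodes it as UTF-8. */
    public static String readUtf8(Path file) {
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for input.txt.
        Path tmp = Files.createTempFile("input", ".txt");
        try {
            Files.write(tmp, "biscuit==cookie apple=fruit".getBytes(StandardCharsets.UTF_8));
            System.out.println(readUtf8(tmp)); // prints biscuit==cookie apple=fruit
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```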
(?===)|(?<===)|\s|(?<!=)(?==)|(?<==)(?!=)|(?=\p{P})|(?<=\p{P})|(?=\+)
You can try this. See the demo.
http://regex101.com/r/wQ1oW3/18
I have an instruction like:
db.insert( {
_id:3,
cost:{_0:11},
description:"This is a description.\nCool, isn\'t it?"
});
The Eclipse plugin I am using, called MonjaDB splits the instruction by newline and I get each line as a separate instruction, which is bad. I fixed it using ;(\r|\n)+ which now includes the entire instruction, however, when sanitizing the newlines between the parts of the JSON, it also sanitizes the \n and \r within string in the json itself.
How do I avoid removing \t, \r, \n from within JSON strings? which are, of course, delimited by "" or ''.
You need to arrange to ignore whitespace when it appears within quotes, as suggested by one of the commenters:
\s+ | ( " (?: [^"\\] | \\ . ) * " ) // White-space inserted for readability
This matches either Java whitespace or a double-quoted string, where a string consists of " followed by any number of characters that are neither a quote nor a backslash, or a backslash plus any character, and then a final ". This way, whitespace inside strings is never matched on its own.
Then replace each match with $1 when $1 is not null, and with the empty string otherwise.
Pattern clean = Pattern.compile(" \\s+ | ( \" (?: [^\"\\\\] | \\\\ . ) * \" ) ", Pattern.COMMENTS | Pattern.DOTALL);
StringBuffer sb = new StringBuffer();
Matcher m = clean.matcher( json );
while (m.find()) {
m.appendReplacement(sb, "" );
// Don't put m.group(1) in the appendReplacement because if it happens to contain $1 or $2 you'll get an error.
if ( m.group(1) != null )
sb.append( m.group(1) );
}
m.appendTail(sb);
String cleanJson = sb.toString();
This is totally off the top of my head but I'm pretty sure it's close to what you want.
Edit: I've just got access to a Java IDE and tried out my solution. I had made a couple of mistakes with my code including using \. instead of . in the Pattern. So I have fixed that up and run it on a variation of your sample:
db.insert( {
_id:3,
cost:{_0:11},
description:"This is a \"description\" with an embedded newline: \"\n\".\nCool, isn\'t it?"
});
The code:
String json = "db.insert( {\n" +
" _id:3,\n" +
" cost:{_0:11},\n" +
" description:\"This is a \\\"description\\\" with an embedded newline: \\\"\\n\\\".\\nCool, isn\\'t it?\"\n" +
"});";
// insert above code
System.out.println(cleanJson);
This produces:
db.insert({_id:3,cost:{_0:11},description:"This is a \"description\" with an embedded newline: \"\n\".\nCool, isn\'t it?"});
which is the same json expression with all whitespace removed outside quoted strings and whitespace and newlines retained inside quoted strings.
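The question also mentions strings delimited by ''. One way to cover them, sketched here as an assumption rather than as part of the tested answer, is to add a single-quoted alternative to the capturing group. The class and method names are made up, and an apostrophe outside any string would be taken as the start of a quoted string:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StripOuterWhitespace {
    // Alternative 1: whitespace to drop.
    // Alternative 2 (group 1): a double- or single-quoted string, backslash
    // escapes included, which is kept verbatim.
    private static final Pattern CLEAN = Pattern.compile(
            "\\s+|(\"(?:[^\"\\\\]|\\\\.)*\"|'(?:[^'\\\\]|\\\\.)*')",
            Pattern.DOTALL);

    public static String clean(String text) {
        Matcher m = CLEAN.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String quoted = m.group(1);
            // quoteReplacement keeps any $ or \ in the string literal.
            m.appendReplacement(sb, quoted == null ? "" : Matcher.quoteReplacement(quoted));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "db.insert( {\n  a : 'x  y',\n  b : \"p q\"\n});";
        System.out.println(clean(text)); // prints db.insert({a:'x  y',b:"p q"});
    }
}
```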
I'm trying to implement a parser for the example file listed below. I'd like to recognize quoted strings with '+' between them as a single token. So I created a jj file, but it doesn't match such strings. I was under the impression that JavaCC is supposed to match the longest possible match for each token spec. But that doesn't seem to be case for me.
What am I doing wrong here? Why isn't my <STRING> token matching the '+' even though it's specified in there? Why is whitespace not being ignored?
options {
TOKEN_FACTORY = "Token";
}
PARSER_BEGIN(Parser)
package com.example.parser;
public class Parser {
public static void main(String args[]) throws ParseException {
ParserTokenManager manager = new ParserTokenManager(new SimpleCharStream(Parser.class.getResourceAsStream("example")));
Token token = manager.getNextToken();
while (token != null && token.kind != ParserConstants.EOF) {
System.out.println(token.toString() + "[" + token.kind + "]");
token = manager.getNextToken();
}
Parser parser = new Parser(Parser.class.getResourceAsStream("example"));
parser.start();
}
}
PARSER_END(Parser)
// WHITE SPACE
<DEFAULT, IN_STRING_KEYWORD>
SKIP :
{
" " // <-- skipping spaces
| "\t"
| "\n"
| "\r"
| "\f"
}
// TOKENS
TOKEN :
{
< KEYWORD1 : "keyword1" > : IN_STRING_KEYWORD
}
<IN_STRING_KEYWORD>
TOKEN : {<STRING : <CONCAT_STRING> | <UNQUOTED_STRING> > : DEFAULT
| <#CONCAT_STRING : <QUOTED_STRING> ("+" <QUOTED_STRING>)+ >
// <-- CONCAT_STRING never matches "+" part when input is "'smth' +", because whitespace is not ignored!?
| <#QUOTED_STRING : <SINGLEQUOTED_STRING> | <DOUBLEQUOTED_STRING> >
| <#SINGLEQUOTED_STRING : "'" (~["'"])* "'" >
| <#DOUBLEQUOTED_STRING :
"\""
(
(~["\"", "\\"]) |
("\\" ["n", "t", "\"", "\\"])
)*
"\""
>
| <#UNQUOTED_STRING : (~[" ","\t", ";", "{", "}", "/", "*", "'", "\"", "\n", "\r"] | "/" ~["/", "*"] | "*" ~["/"])+ >
}
void start() :
{}
{
(<KEYWORD1><STRING>";")+ <EOF>
}
Here's an example file that should get parsed:
keyword1 "foo" + ' bar';
I'd like to match the argument of the first keyword1 as a single <STRING> token.
Current output:
keyword1[6]
Exception in thread "main" com.example.parser.TokenMgrError: Lexical error at line 1, column 15. Encountered: " " (32), after : "\"foo\""
at com.example.parser.ParserTokenManager.getNextToken(ParserTokenManager.java:616)
at com.example.parser.Parser.main(Parser.java:12)
I'm using JavaCC 5.0.
STRING is expanding to the longest sequence that can be matched, which is "foo" as the error indicates. The space after the closing double quote is not part of the definition of the private token CONCAT_STRING. Skip tokens do not apply within the definition of other tokens, so you must incorporate the space directly into the definition, on either side of the +.
As an aside, I recommend having a final token definition like so:
<each-state-in-which-the-empty-string-cannot-be-recognized>
TOKEN : {
< ILLEGAL : ~[] >
}
This prevents TokenMgrErrors from being thrown and makes debugging a bit easier.
I've been looking through the ANTLR v3 documentation (and my trusty copy of "The Definitive ANTLR reference"), and I can't seem to find a clean way to implement escape sequences in string literals (I'm currently using the Java target). I had hoped to be able to do something like:
fragment
ESCAPE_SEQUENCE
: '\\' '\'' { setText("'"); }
;
STRING
: '\'' (ESCAPE_SEQUENCE | ~('\'' | '\\'))* '\''
{
// strip the quotes from the resulting token
setText(getText().substring(1, getText().length() - 1));
}
;
For example, I would want the input token "'Foo\'s House'" to become the String "Foo's House".
Unfortunately, the setText(...) call in the ESCAPE_SEQUENCE fragment sets the text for the entire STRING token, which is obviously not what I want.
Is there a way to implement this grammar without adding a method to go back through the resulting string and manually replace escape sequences (e.g., with something like setText(escapeString(getText())) in the STRING rule)?
Here is how I accomplished this in the JSON parser I wrote.
STRING
@init{StringBuilder lBuf = new StringBuilder();}
:
'"'
( escaped=ESC {lBuf.append(getText());} |
normal=~('"'|'\\'|'\n'|'\r') {lBuf.appendCodePoint(normal);} )*
'"'
{setText(lBuf.toString());}
;
fragment
ESC
: '\\'
( 'n' {setText("\n");}
| 'r' {setText("\r");}
| 't' {setText("\t");}
| 'b' {setText("\b");}
| 'f' {setText("\f");}
| '"' {setText("\"");}
| '\'' {setText("\'");}
| '/' {setText("/");}
| '\\' {setText("\\");}
| ('u')+ i=HEX_DIGIT j=HEX_DIGIT k=HEX_DIGIT l=HEX_DIGIT
{setText(ParserUtil.hexToChar(i.getText(),j.getText(),
k.getText(),l.getText()));}
)
;
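ParserUtil.hexToChar is the answer author's own helper and is not shown. A minimal sketch of what such a helper might look like, assuming it receives the text of the four HEX_DIGIT tokens and returns the decoded character as a String (which is what setText expects); the real implementation may differ:

```java
// Hypothetical reconstruction; only the name and call shape come from the grammar above.
public final class ParserUtil {
    private ParserUtil() {}

    /** Combines four hex digits (most significant first) into one UTF-16 code unit. */
    public static String hexToChar(String a, String b, String c, String d) {
        int codeUnit = Integer.parseInt(a + b + c + d, 16);
        return String.valueOf((char) codeUnit);
    }

    public static void main(String[] args) {
        System.out.println(hexToChar("0", "0", "4", "1")); // prints A
    }
}
```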
For ANTLR4, with the Java target and a standard escaped-string grammar, I used a dedicated class, CharSupport, to translate the string. It is available in the ANTLR API:
STRING : '"'
( ESC
| ~('"'|'\\'|'\n'|'\r')
)*
'"' {
setText(
org.antlr.v4.misc.CharSupport.getStringFromGrammarStringLiteral(
getText()
)
);
}
;
As I saw in the V4 documentation and confirmed by experiment, @init is no longer supported in lexer rules!
Another (possibly more efficient) alternative is to use rule arguments:
STRING
@init { final StringBuilder buf = new StringBuilder(); }
:
'"'
(
ESCAPE[buf]
| i = ~( '\\' | '"' ) { buf.appendCodePoint(i); }
)*
'"'
{ setText(buf.toString()); };
fragment ESCAPE[StringBuilder buf] :
'\\'
( 't' { buf.append('\t'); }
| 'n' { buf.append('\n'); }
| 'r' { buf.append('\r'); }
| '"' { buf.append('\"'); }
| '\\' { buf.append('\\'); }
| 'u' a = HEX_DIGIT b = HEX_DIGIT c = HEX_DIGIT d = HEX_DIGIT { buf.append(ParserUtil.hexChar(a, b, c, d)); }
);
I needed to do just that, but my target was C and not Java. Here's how I did it, based on answer #1 (and its comment), in case anyone needs something similar:
QUOTE : '\'';
STR
@init{ pANTLR3_STRING unesc = GETTEXT()->factory->newRaw(GETTEXT()->factory); }
: QUOTE ( reg = ~('\\' | '\'') { unesc->addc(unesc, reg); }
| esc = ESCAPED { unesc->appendS(unesc, GETTEXT()); } )+ QUOTE { SETTEXT(unesc); };
fragment
ESCAPED : '\\'
( '\\' { SETTEXT(GETTEXT()->factory->newStr8(GETTEXT()->factory, (pANTLR3_UINT8)"\\")); }
| '\'' { SETTEXT(GETTEXT()->factory->newStr8(GETTEXT()->factory, (pANTLR3_UINT8)"\'")); }
)
;
HTH.