I've almost finished my first adventure with ANTLR, and it's been quite a trip. Unfortunately, almost only counts in horseshoes, hand grenades, and nuclear weapons, right?
Anyways, I'm trying to parse an input that looks like this:
; IF AGE IS LESS THAN 21, STILL RETURN TRUE FOR OVERSEAS LOCATION \r\n
SHOW "AGE REQUIREMENTS FAILED" FOR \r\n
IF AGE < 21 THEN \r\n
LOCATION = "OVERSEAS" \r\n
ENDIF \r\n
\r\n
; NEED SOMEONE WHO HAS WORKED FOR US FOR > 1 YEAR EXCEPT FOR CEO \r\n
SHOW "MINIMUM TIME REQUIREMENT NOT MET" FOR \r\n
IF STARTDATE > TODAY - 1 YEAR THEN \r\n
EMPLID=001 \r\n
ENDIF \r\n
Generally, if the test fails, the message is shown.
Anyways, a set can contain one or more SHOW rules. Processing a single SHOW rule works, but the input won't "split" when the input stream contains more than one SHOW rule.
Here are the relevant rules from the grammar:
showGroup returns [List<PolicyEvaluation> value]
@init {List<PolicyEvaluation> peList = new ArrayList<PolicyEvaluation>();}
: (expr1=show)* {peList.add($expr1.value);}
{
System.out.println("Entered policyGroup rule");
$value = peList;
}
;
// evaluate a single SHOW statement
show returns [PolicyEvaluation value]
: ('SHOW' expr1=STRING 'FOR')? expr2=ifStatement EOL*
{
System.out.println("Entered show rule");
Boolean expr2Value = (Boolean) $expr2.value;
PolicyEvaluation pe = new PolicyEvaluation();
if (expr1 == null) {
pe.setValue(expr2Value);
pe.setMessage(null);
} else {
if (expr2Value == false) {
pe.setValue(false);
pe.setMessage(expr1.getText());
} else {
pe.setValue(true);
pe.setMessage(null);
}
}
$value = pe;
}
;
// rules leading up to the show rule
// domain-specific grammar rules
STRING: '"' ID (' ' ID)* '"'
{
System.out.println("Entered STRING lexer rule");
// strip the quotes once we match this token
setText(getText().substring(1, getText().length()-1));
}
;
COMMENT: ';' (ID|' ')* EOL {$channel = HIDDEN;};
EOL: ('\r'|'\n'|'\r\n') {$channel = HIDDEN;};
SPACE: ' ' {$channel = HIDDEN;};
Maybe this is something simple. Any help is appreciated.
Jason
Try changing this: (expr1=show)* {peList.add($expr1.value);}
to this: (expr1=show {peList.add($expr1.value);})*
The action as it is will only fire after all show matches have completed, leaving you to operate on the last expr1.
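To see why the placement matters, here is a plain-Java sketch of roughly what the two forms amount to. It is only an analogy, not the code ANTLR actually generates, and the names (ActionPlacement, matches) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ActionPlacement {
    // (expr1=show)* {add(expr1)}: the action sits after the loop,
    // so it runs once and only sees the last thing that was matched.
    public static List<String> actionAfterLoop(List<String> matches) {
        List<String> peList = new ArrayList<>();
        String expr1 = null;
        for (String m : matches) {
            expr1 = m;            // each iteration overwrites the label
        }
        peList.add(expr1);        // runs exactly once
        return peList;
    }

    // (expr1=show {add(expr1)})*: the action sits inside the loop,
    // so it runs once per match.
    public static List<String> actionInsideLoop(List<String> matches) {
        List<String> peList = new ArrayList<>();
        for (String m : matches) {
            peList.add(m);
        }
        return peList;
    }

    public static void main(String[] args) {
        List<String> shows = Arrays.asList("show1", "show2");
        System.out.println(actionAfterLoop(shows));   // [show2]
        System.out.println(actionInsideLoop(shows));  // [show1, show2]
    }
}
```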
I am trying to figure out how to get values from the parser.
My input is 'play the who' and it should return a string with 'the who'.
Sample.g:
text returns [String value]
: speech = wordExp space name {$value = $speech.text;}
;
name returns [String value]
: SongArtist = WORD (space WORD)* {$value = $SongArtist.text;}
;
wordExp returns [String value]
: command = PLAY {$value = $command.text;} | command = SEARCH {$value = $command.text;}
;
PLAY : 'play';
SEARCH : 'search';
space : ' ';
WORD : ( 'a'..'z' | 'A'..'Z' )*;
WS
: ('\t' | '\r'| '\n') {$channel=HIDDEN;}
;
If I enter 'play the who', this tree comes up:
http://i.stack.imgur.com/ET61P.png
I created a Java file to catch the output. If I call parser.wordExp(), I expect to get 'the who', but it returns the object and this EOF failure (see the output below). parser.text() returns 'play'.
import org.antlr.runtime.*;
import a.b.c.SampleLexer;
import a.b.c.SampleParser;
public class Main {
public static void main(String[] args) throws Exception {
ANTLRStringStream in = new ANTLRStringStream("play the who");
SampleLexer lexer = new SampleLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
SampleParser parser = new SampleParser(tokens);
System.out.println(parser.text());
System.out.println(parser.wordExp());
}
}
The console returns this:
play
a.b.c.SampleParser$wordExp_return@1d0ca25a
line 1:12 no viable alternative at input '<EOF>'
How can I capture 'the who'? I don't understand why I can't get this string, since the interpreter creates the tree correctly.
First, in your grammar, speech only gets assigned the return value of parser rule wordExp. If you want to manipulate the return value of rule name as well, you can do this with an additional variable like the example below.
text returns [String value]
: a=wordExp space b=name {$value = $a.text+" "+$b.text;}
;
Second, invoking parser.text() parses the entire input. A second invocation (in your case parser.wordExp()) thus finds EOF. If you remove the second call the no viable alternative at input 'EOF' goes away.
There may be a better way to do this, but in the meantime this may help you out.
I want to be able to output both "==" and "=" as tokens.
For example, the input text file is:
biscuit==cookie apple=fruit+-()
The output:
biscuit
=
=
cookie
apple
=
fruit
+
-
(
)
What I want the output to be:
biscuit
==
cookie
apple
=
fruit
+
-
(
)
Here is my code:
Scanner s = null;
try {
s = new Scanner(new BufferedReader(new FileReader("input.txt")));
s.useDelimiter("\\s|(?<=\\p{Punct})|(?=\\p{Punct})");
while (s.hasNext()) {
String next = s.next();
System.out.println(next);
}
} finally {
if (s != null) {
s.close();
}
}
Thank you.
Edit: I want to be able to keep the current regex.
Just split the input string using the regex below.
String s = "biscuit==cookie apple=fruit";
String[] tok = s.split("\\s+|\\b(?==+)|(?<==)(?!=)");
System.out.println(Arrays.toString(tok));
Output:
[biscuit, ==, cookie, apple, =, fruit]
Explanation:
\\s+ Matches one or more space characters.
| OR
\\b(?==+) Matches a word boundary only if it's followed by a = symbol.
| OR
(?<==) Lookbehind: the position is preceded by a = symbol.
(?!=) And match the boundary only if it's not followed by a = symbol.
Update:
String s = "biscuit==cookie apple=fruit+-()";
String[] tok = s.split("\\s+|(?<!=)(?==+)|(?<==)(?!=)|(?=[+()-])");
System.out.println(Arrays.toString(tok));
Output:
[biscuit, ==, cookie, apple, =, fruit, +, -, (, )]
You might be able to qualify those punctuations with some additional assertions.
# "\\s|(?<===)|(?<=\\p{Punct})(?!(?<==)(?==))|(?=\\p{Punct})(?!(?<==)(?==))"
\s
| (?<= == )
| (?<= \p{Punct} )
(?!
(?<= = )
(?= = )
)
| (?= \p{Punct} )
(?!
(?<= = )
(?= = )
)
Info update
If some characters aren't covered in \p{Punct} just add them as a separate class within
the punctuation subexpressions.
For engines that don't do certain properties well inside classes, use this ->
# Raw: \s|(?<===)|(?<=\p{Punct}|[=+])(?!(?<==)(?==))|(?=\p{Punct}|[=+])(?!(?<==)(?==))
\s
| (?<= == )
| (?<= \p{Punct} | [=+] )
(?!
(?<= = )
(?= = )
)
| (?= \p{Punct} | [=+] )
(?!
(?<= = )
(?= = )
)
For engines that handle properties well inside classes, this is a better one ->
# Raw: \s|(?<===)|(?<=[\p{Punct}=+])(?!(?<==)(?==))|(?=[\p{Punct}=+])(?!(?<==)(?==))
\s
| (?<= == )
| (?<= [\p{Punct}=+] )
(?!
(?<= = )
(?= = )
)
| (?= [\p{Punct}=+] )
(?!
(?<= = )
(?= = )
)
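For what it's worth, the free-spacing layout can be fed to Java more or less as written by compiling it with Pattern.COMMENTS, which tells the engine to ignore whitespace inside the pattern. A minimal sketch, assuming the first variant above and using Pattern.split instead of a Scanner (the class and method names are made up):

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class CommentedSplit {
    // Same alternatives as the first layout above. Pattern.COMMENTS makes
    // the engine skip the spaces and line breaks inside the pattern.
    private static final Pattern DELIM = Pattern.compile(
          "   \\s                                        \n"
        + " | (?<= == )                                  \n"
        + " | (?<= \\p{Punct} ) (?! (?<= = ) (?= = ) )   \n"
        + " | (?=  \\p{Punct} ) (?! (?<= = ) (?= = ) )   \n",
        Pattern.COMMENTS);

    public static String[] tokens(String input) {
        return DELIM.split(input);
    }

    public static void main(String[] args) {
        // prints [biscuit, ==, cookie, apple, =, fruit, +, -, (, )]
        System.out.println(Arrays.toString(tokens("biscuit==cookie apple=fruit+-()")));
    }
}
```

Note that \s matches a single whitespace character, so with split two spaces in a row would produce an empty token; \s+ avoids that.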
In other words, you want to split on:
one or more whitespace characters
a place which has = after it and a non-= before it (like foo|= where | represents this place)
a place which has = before it and a non-= after it (like =|foo where | represents this place)
That is:
s.useDelimiter("\\s+|(?<!=)(?==)|(?<==)(?!=)");
//              ^^^^^ ^^^^^^^^^^^ ^^^^^^^^^^^
//cases:         1)       2)          3)
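Here is a small self-contained sketch of that delimiter in use. It reads from a String rather than from input.txt so it can be run as-is; the class and method names are just for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class EqualsDelimiter {
    // Collects the Scanner tokens produced by the three-case delimiter.
    public static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        try (Scanner s = new Scanner(input)) {
            // 1) whitespace  2) between non-= and =  3) between = and non-=
            s.useDelimiter("\\s+|(?<!=)(?==)|(?<==)(?!=)");
            while (s.hasNext()) {
                out.add(s.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("biscuit==cookie apple=fruit"));
    }
}
```

Since this delimiter only knows about whitespace and =, other punctuation such as + or ( stays attached to the neighbouring text.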
Since it looks like you are building a parser, I would suggest using a tool that lets you build a proper grammar, like http://www.antlr.org/. But if you must stick with regex, another improvement that makes the regex easier to write is to use Matcher#find instead of a Scanner delimiter. This way your regex and code could look like
String data = "biscuit==cookie apple=fruit+-()";
String regex = "<=|==|>=|[\\Q<>+-=()\\E]|[^\\Q<>+-=()\\E]+";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(data);
while (m.find())
System.out.println(m.group());
Output:
biscuit
==
cookie apple
=
fruit
+
-
(
)
You can make this regex more general by using
String regex = "<=|==|>=|\\p{Punct}|\\P{Punct}+";
//                       ^^^^^^^^^^ ^^^^^^^^^^^-- standard cases
//              ^^ ^^ ^^------------------------- special cases
Also, this approach requires reading the data from the file first and storing it in a single String, which you then parse. You can find many ways to read text from a file, for instance in this question:
Reading a plain text file in Java
so you can use something like
String data = new String(Files.readAllBytes(Paths.get("input.txt")));
You can specify the encoding that String should use when decoding the bytes from the file by using the constructor String(bytes, encoding). So you can write it as new String(bytes, "UTF-8") or, to avoid typos in the encoding name, use one of the constants in the StandardCharsets class, like new String(bytes, StandardCharsets.UTF_8).
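Putting the Matcher#find idea together, here is a minimal runnable sketch. It uses a small variation on the general regex: [^\p{Punct}\s]+ instead of \P{Punct}+, so that whitespace is skipped rather than swallowed into a token like "cookie apple". That variation and the class name are my own assumptions, not part of the regexes above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindTokenizer {
    // Two-character operators are listed first so "==" wins over "=".
    // The last alternative matches runs of characters that are neither
    // punctuation nor whitespace, so whitespace falls between matches.
    private static final Pattern TOKEN =
        Pattern.compile("<=|==|>=|\\p{Punct}|[^\\p{Punct}\\s]+");

    public static List<String> tokenize(String data) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(data);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [biscuit, ==, cookie, apple, =, fruit, +, -, (, )]
        System.out.println(tokenize("biscuit==cookie apple=fruit+-()"));
    }
}
```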
(?===)|(?<===)|\s|(?<!=)(?==)|(?<==)(?!=)|(?=\p{P})|(?<=\p{P})|(?=\+)
You can try this. See demo.
http://regex101.com/r/wQ1oW3/18
I have an instruction like:
db.insert( {
_id:3,
cost:{_0:11},
description:"This is a description.\nCool, isn\'t it?"
});
The Eclipse plugin I am using, called MonjaDB, splits the instruction by newline, so I get each line as a separate instruction, which is bad. I fixed it using ;(\r|\n)+ which now includes the entire instruction. However, when sanitizing the newlines between the parts of the JSON, it also sanitizes the \n and \r within strings in the JSON itself.
How do I avoid removing \t, \r, \n from within JSON strings, which are, of course, delimited by "" or ''?
You need to arrange to ignore whitespace when it appears within quotes. So, as suggested by one of the commenters:
\s+ | ( " (?: [^"\\] | \\ . ) * " ) // White-space inserted for readability
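This matches Java whitespace or a double-quoted string, where a string consists of " followed by any number of characters that are either a non-escape, non-quote character or an escape (backslash) plus any character, then a final ". This way, whitespace inside a string is consumed as part of the string instead of being matched on its own.
Then replace each match with $1 if $1 is not null.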
Pattern clean = Pattern.compile(" \\s+ | ( \" (?: [^\"\\\\] | \\\\ . ) * \" ) ", Pattern.COMMENTS | Pattern.DOTALL);
StringBuffer sb = new StringBuffer();
Matcher m = clean.matcher( json );
while (m.find()) {
m.appendReplacement(sb, "" );
// Don't put m.group(1) in the appendReplacement because if it happens to contain $1 or $2 you'll get an error.
if ( m.group(1) != null )
sb.append( m.group(1) );
}
m.appendTail(sb);
String cleanJson = sb.toString();
This is totally off the top of my head but I'm pretty sure it's close to what you want.
Edit: I've just got access to a Java IDE and tried out my solution. I had made a couple of mistakes with my code including using \. instead of . in the Pattern. So I have fixed that up and run it on a variation of your sample:
db.insert( {
_id:3,
cost:{_0:11},
description:"This is a \"description\" with an embedded newline: \"\n\".\nCool, isn\'t it?"
});
The code:
String json = "db.insert( {\n" +
" _id:3,\n" +
" cost:{_0:11},\n" +
" description:\"This is a \\\"description\\\" with an embedded newline: \\\"\\n\\\".\\nCool, isn\\'t it?\"\n" +
"});";
// insert above code
System.out.println(cleanJson);
This produces:
db.insert({_id:3,cost:{_0:11},description:"This is a \"description\" with an embedded newline: \"\n\".\nCool, isn\'t it?"});
which is the same json expression with all whitespace removed outside quoted strings and whitespace and newlines retained inside quoted strings.
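The question also mentions strings delimited by ''. Here is a sketch of one way to extend the same idea, assuming single-quoted strings follow the same backslash-escape rules as double-quoted ones. It also passes the quoted text through Matcher.quoteReplacement so it can go straight into appendReplacement, which is a different route from appending group 1 separately as above. The class name is made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteAwareCleaner {
    // Either a run of whitespace, or (group 1) a complete "..." or '...'
    // string in which a backslash escapes the following character.
    private static final Pattern CLEAN = Pattern.compile(
        "\\s+|(\"(?:[^\"\\\\]|\\\\.)*\"|'(?:[^'\\\\]|\\\\.)*')",
        Pattern.DOTALL);

    public static String clean(String input) {
        Matcher m = CLEAN.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String quoted = m.group(1);
            // Whitespace outside quotes is dropped; quoted strings are kept
            // verbatim (quoteReplacement guards against $ and \ in them).
            m.appendReplacement(sb,
                quoted == null ? "" : Matcher.quoteReplacement(quoted));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String json = "db.insert( {\n  name:'a  b',\n  note:\"c\\n d\"\n});";
        // prints db.insert({name:'a  b',note:"c\n d"});
        System.out.println(clean(json));
    }
}
```

An unquoted apostrophe outside any string would be taken as the start of a string, so this is only safe when every quote character in the input really delimits a string.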
I want to read a text file where the fields are separated by the delimiter | (pipe symbol).
But something unexpected happened.
Here is my code:
doScannerTest ("Y~2011~GT~Nepal~Ganesh~Tiwari~N", "~");
doScannerTest("Y|2011|GT|Nepal|Ganesh|Tiwari|N", "|");
private static void doScannerTest(String recordLine, String delim) {
java.util.Scanner lineScanner = new java.util.Scanner(recordLine);
lineScanner.useDelimiter(delim);
while (lineScanner.hasNext()) {
System.out.println(lineScanner.next());
}
}
The delimiter ~ works fine, but | prints all characters in recordLine.
Why is the record with delimiter | not working? I cannot change the framework code (which uses Scanner) to use String.split.
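The pipe character is a reserved regex character (it means alternation), so you will need to escape it. An unescaped | is an alternation between two empty patterns, which matches the empty string at every position, so the Scanner splits between every character.
For example, you need to use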
\\|
Putting that in your code gives the output below:
Y
2011
GT
Nepal
Ganesh
Tiwari
N
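If the delimiter arrives as a plain string and you would rather not escape it by hand, Pattern.quote turns any literal into a safe regex. A minimal sketch based on the doScannerTest method from the question (the class name and the List return value are my own additions):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class LiteralDelimiter {
    // Splits recordLine on delim, treating delim as literal text
    // rather than as a regular expression.
    public static List<String> fields(String recordLine, String delim) {
        List<String> out = new ArrayList<>();
        try (Scanner lineScanner = new Scanner(recordLine)) {
            lineScanner.useDelimiter(Pattern.quote(delim));
            while (lineScanner.hasNext()) {
                out.add(lineScanner.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [Y, 2011, GT, Nepal, Ganesh, Tiwari, N] twice
        System.out.println(fields("Y~2011~GT~Nepal~Ganesh~Tiwari~N", "~"));
        System.out.println(fields("Y|2011|GT|Nepal|Ganesh|Tiwari|N", "|"));
    }
}
```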
I have a grammar that uses the $ character at the start of many terminal rules, such as $video{, $audio{, $image{, $link{ and others that are like this.
However, I'd also like to match all the $ and { and } characters that don't match these rules too. For example, my grammar does not properly match $100 in the CHUNK rule, but adding the $ to the long list of acceptable characters in CHUNK causes the other production rules to break.
How can I change my grammar so that it's smart enough to distinguish normal $, { and } characters from my special production rules?
Basically, what I'd like to be able to do is say, "if the $ character doesn't have {, video, image, audio, link, etc. after it, then it should go to CHUNK".
grammar Text;
@header {
}
@lexer::members {
private boolean readLabel = false;
private boolean readUrl = false;
}
@members {
private int numberOfVideos = 0;
private int numberOfAudios = 0;
private StringBuilder builder = new StringBuilder();
public String getResult() {
return builder.toString();
}
}
text
: expression*
;
expression
: fillInTheBlank
{
builder.append($fillInTheBlank.value);
}
| image
{
builder.append($image.value);
}
| video
{
builder.append($video.value);
}
| audio
{
builder.append($audio.value);
}
| link
{
builder.append($link.value);
}
| everythingElse
{
builder.append($everythingElse.value);
}
;
fillInTheBlank returns [String value]
: BEGIN_INPUT LABEL END_COMMAND
{
$value = "<input type=\"text\" id=\"" +
$LABEL.text +
"\" name=\"" +
$LABEL.text +
"\" class=\"FillInTheBlankAnswer\" />";
}
;
image returns [String value]
: BEGIN_IMAGE URL END_COMMAND
{
$value = "<img src=\"" + $URL.text + "\" />";
}
;
video returns [String value]
: BEGIN_VIDEO URL END_COMMAND
{
numberOfVideos++;
StringBuilder b = new StringBuilder();
b.append("<div id=\"video1\">Loading the player ...</div>\r\n");
b.append("<script type=\"text/javascript\">\r\n");
b.append("\tjwplayer(\"video" + numberOfVideos + "\").setup({\r\n");
b.append("\t\tflashplayer: \"/trainingdividend/js/jwplayer/player.swf\", file: \"");
b.append($URL.text);
b.append("\"\r\n\t});\r\n");
b.append("</script>\r\n");
$value = b.toString();
}
;
audio returns [String value]
: BEGIN_AUDIO URL END_COMMAND
{
numberOfAudios++;
StringBuilder b = new StringBuilder();
b.append("<p id=\"audioplayer_");
b.append(numberOfAudios);
b.append("\">Alternative content</p>\r\n");
b.append("<script type=\"text/javascript\">\r\n");
b.append("\tAudioPlayer.embed(\"audioplayer_");
b.append(numberOfAudios);
b.append("\", {soundFile: \"");
b.append($URL.text);
b.append("\"});\r\n");
b.append("</script>\r\n");
$value = b.toString();
}
;
link returns [String value]
: BEGIN_LINK URL END_COMMAND
{
$value = "" + $URL.text + "";
}
;
everythingElse returns [String value]
: CHUNK
{
$value = $CHUNK.text;
}
;
BEGIN_INPUT
: '${'
{
readLabel = true;
}
;
BEGIN_IMAGE
: '$image{'
{
readUrl = true;
}
;
BEGIN_VIDEO
: '$video{'
{
readUrl = true;
}
;
BEGIN_AUDIO
: '$audio{'
{
readUrl = true;
}
;
BEGIN_LINK
: '$link{'
{
readUrl = true;
}
;
END_COMMAND
: { readLabel || readUrl }?=> '}'
{
readLabel = false;
readUrl = false;
}
;
URL
: { readUrl }?=> 'http://' ('a'..'z'|'A'..'Z'|'0'..'9'|'.'|'/'|'-'|'_'|'%'|'&'|'?'|':')+
;
LABEL
: { readLabel }?=> ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9')*
;
CHUNK
//: (~('${'|'$video{'|'$image{'|'$audio{'))+
: ('a'..'z'|'A'..'Z'|'0'..'9'|' '|'\t'|'\n'|'\r'|'-'|','|'.'|'?'|'\''|':'|'\"'|'>'|'<'|'/'|'_'|'='|';'|'('|')'|'&'|'!'|'#'|'%'|'*')+
;
You can't negate more than a single character. So, the following is invalid:
~('${')
But why not simply add '$', '{' and '}' to your CHUNK rule and remove the + at the end of the CHUNK rule (otherwise it would gobble up too much, possibly '$video{' further in the source, as you have noticed yourself already)?
Now a CHUNK token will always consist of a single character, but you could create a production rule to fix this:
chunk
: CHUNK+
;
and use chunk in your production rules instead of CHUNK (or use CHUNK+, of course).
Input like "{ } $foo $video{" would be tokenized as follows:
CHUNK {
CHUNK
CHUNK }
CHUNK
CHUNK $
CHUNK f
CHUNK o
CHUNK o
CHUNK
BEGIN_VIDEO $video{
EDIT
And if you let your parser output an AST, you can easily merge all the text that one or more CHUNKs match into a single AST, whose inner token is of type CHUNK, like this:
grammar Text;
options {
output=AST;
}
...
chunk
: CHUNK+ -> {new CommonTree(new CommonToken(CHUNK, $text))}
;
...
An alternative solution which doesn't generate that many single-character tokens would be to allow chunks to contain a $ sign only as the first character. That way your input data will get split up at the dollar signs only.
You can achieve this by introducing a fragment lexer rule (i.e., a rule that does not define a token itself but can be used in other token regular expressions):
fragment CHUNKBODY
: 'a'..'z'|'A'..'Z'|'0'..'9'|' '|'\t'|'\n'|'\r'|'-'|','|'.'|'?'|'\''|':'|'\"'|'>'|'<'|'/'|'_'|'='|';'|'('|')'|'&'|'!'|'#'|'%'|'*';
The CHUNK rule then looks like:
CHUNK
: { !readLabel && !readUrl }?=> (CHUNKBODY|'$')CHUNKBODY*
;
This seems to work for me.