Source code parsing and macro-like handling of similar constructions

Source code parsing and macro-like handling of similar constructions - java

TL;DR VERSION: Is there a parser generator that supports the following: when some rule is reduced (I assume LALR(1) parser), then reduction isn't performed, but parser backs off and replaces input with different code using values from this rule and parses that code. Repeat if needed. So if code is "i++" and rule is "expr POST_INCR", I can do more or less:
expr POST_INCR -> "(tmp = expr; expr = expr + 1; tmp)"
So basicaly code rewriting using macros?
LONG VERSION:
I wrote yet another simple interpreted language (in Java for simplicity). It works ok, but it raised some question. Introduction is pretty long, but simple and helps to shows my problem clearly (I think).
I have "while" loop. It is pretty simple, given:
WHILE LPARE boolean_expression RPAREN statement
I generate more or less the following:
new WhileNode(boolean_expression, statement);
This creates new node that, when visited later, generates code for my virtual machine. But I also have the following:
FOR LPAREN for_init_expr SEMICOLON boolean_expression SEMICOLON for_post_expr RPAREN statement
This is "for loop" known from Java or C. From aforementioned rule I create more or less the following:
new ListNode(
for_init_expr,
new WhileNode(
boolean_expression,
new ListNode(
statement,
new ListNode(for_post_expr, null))))
This is of course simple transformation, from:
for (for_init ; boolean_expression ; for_post_expr)
statement
to:
for_init
while (boolean_expression) {
statement
for_post_expr;
}
All is fine and dandy, but things get hairy for the following:
FOR LPAREN var_decl COLON expression RPAREN statement
This if well known and liked:
for (int x : new int[] { 1, 2 })
print(x);
I refrain from posting code that generates AST, since basic for loop was already a little bit long, and what we get here is ever worse. This construction is equal to:
int[] tmp = new int[] { 1, 2 };
for (int it = 0 ; it < tmp.length; it = it + 1) {
int x = tmp[it];
print(x);
}
And since I'm not using types, I simply assume that "expression" (so right side, after COLON) is something that I can iterate over (and arrays are not iterable), I call a function on a result of this "expression" which returns instance of Iterable. So, in fact, my rewritten code isn't as simple as one above, is more or less this:
Iterator it = makeIterable(new int[] { 1, 2 });
while (it.hasNext()) {
int x = it.next();
print(x);
}
It doesn't look THAT bad, but note that AST for this generates three function calls and while loop. To show you what mess it is, I post what I have now:
113 | FOR LPAREN var_decl_name.v PIPE simple_value_field_or_call.o RPAREN statement.s
114 {: Symbol sv = ext(_symbol_v, _symbol_o);
115 String autoVarName = generateAutoVariableName();
116 Node iter = new StatementEndNode(sv, "",
117 new BinNode(sv, CMD.SET, "=",
118 new VarDeclNode(sv, autoVarName),
119 new CallNode(sv, "()",
120 new BinNode(sv, CMD.DOT, ".",
121 new MkIterNode(sv, o),
122 new PushConstNode(sv, "iterator")))));
123 Node varinit = new StatementEndNode(sv, "",
124 new BinNode(sv, CMD.SET, "=",
125 v,
126 new PushConstNode(sv, "null")));
127 Node hasnext = new CallNode(sv, "()",
128 new BinNode(sv, CMD.DOT, ".",
129 new VarNode(sv, autoVarName),
130 new PushConstNode(sv, "hasNext")));
131 Node vargennext = new StatementEndNode(sv, "",
132 new BinNode(sv, CMD.SET, "=",
133 new VarNode(sv, v.name),
134 new CallNode(sv, "()",
135 new BinNode(sv, CMD.DOT, ".",
136 new VarNode(sv, autoVarName),
137 new PushConstNode(sv, "next")))));
138 return new ListNode(sv, "",
139 new ListNode(sv, "",
140 new ListNode(sv, "",
141 iter
142 ),
143 varinit
144 ),
145 new WhileNode(sv, "while",
146 hasnext,
147 new ListNode(sv, "",
148 new ListNode(sv, "",
149 vargennext
150 ),
151 s)));
To answer your questions: yes, I am ashamed of this code.
QUESTION: Is there are parser generator that let's me do something about it, namely given rule:
FOR LPAREN var_decl COLON expr RPAREN statement
tell parser to rewrite it as if it was something else. I imagine that this would require some kind of LISP's macro mechanism (which is easy in lisp due to basically lack of grammar whatsoever), maybe similar to this:
FOR LPAREN var_decl COLON expr RPAREN statement =
{ with [ x = generateAutoName(); ]
emit [ "Iterator $x = makeIterable($expr).iterator();"
"while (${x}.hasNext()) {"
"$var_decl = ${x}.next();"
"$statement"
"}"
]
}
I don't know if this is a well known problem or not, I simply don't even know what to look for - the most similar question that I found is this one: Any software for pattern-matching and -rewriting source code? but it isn't anywhere close to what I need, since it is supposed to work as a separate step and not during compilation, so it doesn't qualify.
Any help will be appreciated.

I think you are trying to bend the parser too much. You can simply build
the tree with macro in it, and then post-process the tree to replace the macros with whatever substitution you want.
You can do this by walking the resulting tree, detecting the macro nodes (or places where you want to do substitutions), and simply splicing in replacements with procedural tree hackery. Not pretty but workable. You should able to do this with the result of any parser generator/AST building machinery.
If you want a more structured approach, you could build your AST and then use source-to-source transformations to "rewrite" the macros to their content.
Out DMS Software Reengineering Toolkit can do this, you can read more details
about what the transforms look like.
Using the DMS approach, your concept:
expr POST_INCR -> "(tmp = expr; expr = expr + 1; tmp)"
requires that you parse the original text in the conventional
way with the grammar rule:
term = expr POST_INCR ;
You would give all these grammar rules to DMS and let it
parse the source and build your AST according to the grammar.
Then you apply a DMS rewrite to the resulting tree:
domain MyToyLanguage; -- tells DMS to use your grammar to process the rule below
rule replace_POST_INCR(e: expr): term->term
= "\e POST_INCR" -> " (tmp = \e; \e = \e + 1; tmp) ";
The quote marks here are "domain meta quotes" rather than string literal quotes.
The text outside the double-quotes is DMS rule-syntax. The text inside the quotes is syntax from your language (MyToyLangauge), and is parsed using the parser you provided, some special escapes for pattern variables like \e.
(You don't have to do anything to your grammar to get this pattern-parsing capability; DMS takes care of that).
By convention with DMS, we often name literal tokens like POST_INCR
with a quoted equivalent '++' in the lexer, rather than using such a name.
Instead of
#token POST_INCR "\+\+"
the lexer rule then looks like:
#token '++' "\+\+"
If you do that, then your grammar rule reads like:
term = expr '++' ;
and your rewrite rule now looks like:
rule replace_POST_INCR(e: expr): term->term
= "\e++" -> " (tmp = \e; \e = \e + 1; tmp) ";
Following this convention, the grammar (lexer and BNF)
is (IMHO) a lot more readable,
and the rewrite rules are more readable too, since they stay
extremely close to the actual language syntax.

Perhaps you are looking for something like ANTLR's tree-rewriting rules.
You could probably make your AST construction syntax more readable by defining some helper functions. To my eye, there is a lot of redundancy (why do you need both an enumeration and a character string for an operator?) but I'm not a Java programmer.
One approach you might take:
Start with your parser, which already produces an AST. Add a lexical syntax or two to handle template arguments and gensyms. Then write an AST walker which serializes the AST into the code (either Java or bytecode) needed to regenerate the AST. Using that, you can generate the macro templates functions using your own parser, which means that it will automatically stay in sync with any changes you might make to your AST.

Related

Treat invalid chars as a single token in ANTLR4 lexer

I'm using the JSON grammar from the antlr4 grammar repository to parse JSON files for an editor plugin. It works, but reports invalid chars one by one. The following snippet results in 18 lexer errors:
{
sometext-without-quotes : 42
}
I want to boil it down to 1-2 by treating consecutive, invalid single-char tokens of the same type as one bigger invalid token.
For a similar question, a custom lexer was suggested that glues "unknown" elements to larger tokens: In antlr4 lexer, How to have a rule that catches all remaining "words" as Unknown token?
I assume that this bypasses the usual lexer error reporting, which I would like to avoid, if possible. Isn't there a proper solution for that rather simple task? It seems to have worked by default in ANTLR3.

The answer is in the link you provided. I don't want to copy the original answer completely so I'll try and paraphrase a bit...
In antlr4 lexer, How to have a rule that catches all remaining "words" as Unknown token?
Add unknowns to the lexer that will match multiples of these...
unknowns : Unknown+ ;
...
Unknown : . ;
There was an edit made to this post to cater for the case where you were only using a lexer and not using a parser. If using a parser then you do not need to override the nextToken method because the error can be handled in the parser in a much cleaner way ie unknowns are just another token type as far as the lexer is concerned. The lexer passes these to the parser which can then handle the errors. If using a parser I'd normally recognize all tokens as individual tokens and then in the parser emit the errors ie group them or not. The reason for doing this is all error handling is done in one place ie it's not in the lexer and in the parser. It also makes the lexer simpler to write and test ie it must recognize all text and never fail on any utf8 input. Some people would likely do it differently but this has worked for me with hand written lexers in C. The parser is in charge of determining what's actually valid and how to error on it. One other benefit is that the lexer is fairly generic and can be reused.
For lexer only solution...
Check the answer at the link and look for this comment in the answer...
... but I only have a lexer, no parsers ...
The answer states that you override the nextToken method and goes into some detail on how to do that
#Override
public Token nextToken() {
and the important part in the code is this...
Token next = super.nextToken();
if(next.getType() != Unknown) {
return next;
}
The code that comes after this handles the case where you can match the bad tokens.

What you could do is use lexer modes. For this you'd had to split grammar to parser and lexer grammar. Let's start with lexer grammar:
JSONLexer.g4
/** Taken from "The Definitive ANTLR 4 Reference" by Terence Parr */
// Derived from http://json.org
lexer grammar JSONLexer;
STRING
: '"' (ESC | ~ ["\\])* '"'
;
fragment ESC
: '\\' (["\\/bfnrt] | UNICODE)
;
fragment UNICODE
: 'u' HEX HEX HEX HEX
;
fragment HEX
: [0-9a-fA-F]
;
NUMBER
: '-'? INT '.' [0-9] + EXP? | '-'? INT EXP | '-'? INT
;
fragment INT
: '0' | [1-9] [0-9]*
;
// no leading zeros
fragment EXP
: [Ee] [+\-]? INT
;
// \- since - means "range" inside [...]
TRUE : 'true';
FALSE : 'false';
NULL : 'null';
LCURL : '{';
RCURL : '}';
COL : ':';
COMA : ',';
LBRACK : '[';
RBRACK : ']';
WS
: [ \t\n\r] + -> skip
;
NON_VALID_STRING : . ->pushMode(MODE_ERR);
mode MODE_ERR;
WS1
: [ \t\n\r] + -> skip
;
COL1 : ':' ->popMode;
MY_ERR_TOKEN : ~[':']* ->type(NON_VALID_STRING);
Basically I have added some tokens used in the parser part (like LCURL, COL, COMA etc) and introduced NON_VALID_STRING token, which is basically the first character that's nothing that already is (should be) matched. Once this token is detected, I switch the lexer to MODE_ERR mode. In this mode I go back to default mode once : is detected (this can be changed and maybe refined, but server the purpose here :) ) or I say that everything else is MY_ERR_TOKEN to which I assign NON_VALID_STRING token type. Here is what ATNLRWorks says to this when I run interpret lexer option with your input:
So s is NON_VALID_STRING type and so is everything else until :. So, same type but two different tokens. If you want them not to be of the same type, simply omit the type call in the lexer grammar.
Here is the parser grammar now
JSONParser.g4
/** Taken from "The Definitive ANTLR 4 Reference" by Terence Parr */
// Derived from http://json.org
parser grammar JSONParser;
options {
tokenVocab=JSONLexer;
}
json
: object
| array
;
object
: LCURL pair (COMA pair)* RCURL
| LCURL RCURL
;
pair
: STRING COL value
;
array
: LBRACK value (COMA value)* RBRACK
| LBRACK RBRACK
;
value
: STRING
| NUMBER
| object
| array
| TRUE
| FALSE
| NULL
;
and if you run the test rig (I do it with ANTLRworks) you'll get a single error (see screenshot)
Also you could accumulate lexer errors by overriding the generated lexer class, but I understood in the question that this is not desired or I didn't understand that part :)

ANTLR4: context-sensitive spaces?

In a grammar I would like to implement texts without string delimiting xxx.
The idea is to define things like
a = xxx;
instead of
a ="xxx";
to simplify typewriting. Otherwise there should be variable definitions
and other kind of stuff as well.
As a first approach I experimented with this grammar:
grammar SpaceNoSpace;
prog: stat+;
stat:
'somethingelse' ';'
| typed description* content
;
typed:
'something' '-'
| 'anotherthing' '-'
;
description:
'someSortOfDetails' COLON ID HASH
| 'otherSortOfDetails' COLON ID HASH
;
content:
contenttext ';'
;
contenttext:
(~';')*
;
COLON: ':' ;
HASH: '#';
SEMI: ';';
SPACE: ' ';
ID: [a-zA-Z][a-zA-z0-9]+;
WS : [ \t\n\r]+ -> channel(HIDDEN);
ANY_CHAR : . ;
This works fine for input files like this:
something-someSortOfDetails: aVariableName#
this is the content of this;
anotherthing-someSortOfDetails: aVariableName#
here spaces are accepted as much as you like;
somethingelse;
But modifying the last line to
somethingelse ;
leads to a syntax error:
line 7:15 extraneous input ' ' expecting ';'
This probably reveals that the lexer rule
WS : [ \t\n\r]+ -> channel(HIDDEN);
is not applied, (but the SPACE rule???).
Otherwise, if I delete the SPACE lexer-rule, the space
in "somethingelse ;" is ignored (by lexer-rule WS), so that the parser rule
stat : somethingelse as a consequence is detected correctly.
But as a consequence of the deleted SPACE-rule the content text will be reduced to single in-between-spaces,
so "this here" will be reduced to "this here".
This is not a big problem, but nevertheless it is an
interesting question:
is it possible to implement context-sensitive WS or SPACE
lexer rules:
within the content parser-rule any space should be preserved,
in any other rule spaces should be ignored.
Is this possible to define such a context-sensitive lexer-rule behavior in ANTLR4?

Have you considered Lexer Modes? The section with mode(), pushMode(), popMode is probably interesting for you.
Yet I think that lexer modes are more a problem than a solution. Their purpose is to use (parser) context in the lexer. Consequently one should discard the paradigm of separating lexer and parser - and use a PEG-Parser instead.

Since the SPACE rule is before the WS rule, the lexer is returning a space token to the parser. The ' ' is not being being placed on the hidden channel.

Modifying a plain ANLTR file in background

I would like to modify a grammar file by adding some Java code programatically in background. What I mean is that consider you have a println statement that you want to add it in a grammar before ANTLR works (i.e. creates lexer and parser files).
I have this trivial code: {System.out.println("print");}
Here is the simple grammar that I want to add the above snippet in the 'prog' rule after 'expr':
Before:
grammar Expr;
prog: (expr NEWLINE)* ;
expr: expr ('*'|'/') expr
| INT
;
NEWLINE : [\r\n]+ ;
INT : [0-9]+ ;
After:
grammar Expr;
prog: (expr {System.out.println("print");} NEWLINE)* ;
expr: expr ('*'|'/') expr
| INT
;
NEWLINE : [\r\n]+ ;
INT : [0-9]+ ;
Again note that I want to do this in runtime so that the grammar does not show any Java code (the 'before' snippet).
Is it possible to make this real before ANLTR generates lexer and parser files? Is there any way to visit (like AST visitor for ANTLR) a simple grammar?

ANTLR 4 generates a listener interface and base class (empty implementation) by default. If you also specify the -visitor flag when generating your parser, it will create a visitor interface and base class. Either of these features may be used to execute code using the parse tree rather than embedding actions directly in the grammar file.

If the code is always in the same place, simply insert a function call that acts as a hook to include real code afterwards.
This way you don't have to modify the source or generate the lexer/parser again.
If you want to insert code at predefined points (like enter rule/leave rule), go with Sam's solution to insert them into a listener. In either case it should not be necessary to modify the grammar file.
grammar Expr;
prog: (expr {Hooks.programHook();} NEWLINE)* ;
expr: expr ('*'|'/') expr
| INT
;
NEWLINE : [\r\n]+ ;
INT : [0-9]+ ;
In a java file of your choice (I'm no Java programmer, so the real syntax may be different):
public class Hooks
{
public static void programHook()
{
System.out.println("print");
}
}

Regex for almost JSON but not quite

Hello all I'm trying to parse out a pretty well formed string into it's component pieces. The string is very JSON like but it's not JSON strictly speaking. They're formed like so:
createdAt=Fri Aug 24 09:48:51 EDT 2012, id=238996293417062401, text='Test Test', source="Region", entities=[foo, bar], user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}
With output just as chunks of text nothing special has to be done at this point.
createdAt=Fri Aug 24 09:48:51 EDT 2012
id=238996293417062401
text='Test Test'
source="Region"
entities=[foo, bar]
user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}
Using the following expression I am able to get most of the fields separated out
,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))(?=(?:[^']*'[^']*')*(?![^']*'))
Which will split on all the commas not in quotes of any type, but I can't seem to make the leap to where it splits on commas not in brackets or braces as well.

Because you want to handle nested parens/brackets, the "right" way to handle them is to tokenize them separately, and keep track of your nesting level. So instead of a single regex, you really need multiple regexes for your different token types.
This is Python, but converting to Java shouldn't be too hard.
# just comma
sep_re = re.compile(r',')
# open paren or open bracket
inc_re = re.compile(r'[[(]')
# close paren or close bracket
dec_re = re.compile(r'[)\]]')
# string literal
# (I was lazy with the escaping. Add other escape sequences, or find an
# "official" regex to use.)
chunk_re = re.compile(r'''"(?:[^"\\]|\\")*"|'(?:[^'\\]|\\')*[']''')
# This class could've been just a generator function, but I couldn;'t
# find a way to manage the state in the match function that wasn't
# awkward.
class tokenizer:
def __init__(self):
self.pos = 0
def _match(self, regex, s):
m = regex.match(s, self.pos)
if m:
self.pos += len(m.group(0))
self.token = m.group(0)
else:
self.token = ''
return self.token
def tokenize(self, s):
field = '' # the field we're working on
depth = 0 # how many parens/brackets deep we are
while self.pos < len(s):
if not depth and self._match(sep_re, s):
# In Java, change the "yields" to append to a List, and you'll
# have something roughly equivalent (but non-lazy).
yield field
field = ''
else:
if self._match(inc_re, s):
depth += 1
elif self._match(dec_re, s):
depth -= 1
elif self._match(chunk_re, s):
pass
else:
# everything else we just consume one character at a time
self.token = s[self.pos]
self.pos += 1
field += self.token
yield field
Usage:
>>> list(tokenizer().tokenize('foo=(3,(5+7),8),bar="hello,world",baz'))
['foo=(3,(5+7),8)', 'bar="hello,world"', 'baz']
This implementation takes a few shortcuts:
The string escapes are really lazy: it only supports \" in double quoted strings and \' in single-quoted strings. This is easy to fix.
It only keeps track of nesting level. It does not verify that parens are matched up with parens (rather than brackets). If you care about that you can change depth into some sort of stack and push/pop parens/brackets onto it.

Instead of splitting on the comma, you can use the following regular expression to match the chunks that you want.
(?:^| )(.+?)=(\{.+?\}|\[.+?\]|.+?)(?=,|$)
Python:
import re
text = "createdAt=Fri Aug 24 09:48:51 EDT 2012, id=238996293417062401, text='Test Test', source=\"Region\", entities=[foo, bar], user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}"
re.findall(r'(?:^| )(.+?)=(\{.+?\}|\[.+?\]|.+?)(?=,|$)', text)
>> [
('createdAt', 'Fri Aug 24 09:48:51 EDT 2012'),
('id', '238996293417062401'),
('text', "'Test Test'"),
('source', '"Region"'),
('entities', '[foo, bar]'),
('user', '{name=test, locations=[loc1,loc2], locations={comp1, comp2}}')
]
I've set up grouping so it will separate out the "key" and the "value". It will do the same in Java - See it working in Java here:
http://www.regexplanet.com/cookbook/ahJzfnJlZ2V4cGxhbmV0LWhyZHNyDgsSBlJlY2lwZRj0jzQM/index.html
Regular Expression explained:
(?:^| ) Non-capturing group that matches the beginning of a line, or a space
(.+?) Matches the "key" before the...
= equal sign
(\{.+?\}|\[.+?\]|.+?) Matches either a set of {characters}, [characters], or finally just characters
(?=,|$) Look ahead that matches either a , or the end of a line.

parsing a string by regular expression

I have a string of
"name"=>"3B Ae", "note"=>"Test fddd \"33 Ae\" FIXME", "is_on"=>"keke, baba"
and i want to parse it by a java program into segments of
name
3B Ae
note
Test fddd \"33 Ae\" FIXME
is_on
keke, baba
It is noted that the contents of the string, i.e. name, 3B Ae, are not fixed.
Any suggestion?

If you:
replace => with :
Wrap the full string with {}
The result will look like this, which is valid JSON. You can then use a JSON parser (GSON or Jackson, for example) to parse those values into a java object.
{
"name": "3B Ae",
"note": "Test fddd \"33 Ae\" FIXME",
"is_on": "keke, baba"
}
If you have control over the process that produces this string, I highly recommend that you use a standard format like JSON or XML that can be parsed more easily on the other end.

Because of the quoting rules, I'm not certain that a regular expression (even a PCRE with negative lookbehinds) can parse this consistently. What you probably want is to use a pushdown automaton, or some other parser capable of handling a context-free language.

If you can make sure your data (key or value) does not have a => or a , (or find some other delimiters that will not occur), the solution is pretty simple:
Split the string by , you get the key => value pairs
Split the key value => pairs by => you get what you want
if inputString holds
"name"=>"3B Ae", "note"=>"Test fddd \"33 Ae\" FIXME", "is_on"=>"keke baba"
(from a file for instance)
(I have changed the , to ; from between keke and baba)
String[] keyValuePairs = inputString.split(",");
for(String oneKeyValue : keyValuePairs)
{
String[] keyAndValue = oneKeyValue.split("=>");
}

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Source code parsing and macro-like handling of similar constructions - java

Related

Treat invalid chars as a single token in ANTLR4 lexer

ANTLR4: context-sensitive spaces?

Modifying a plain ANLTR file in background

Regex for almost JSON but not quite

parsing a string by regular expression

Categories

Resources