How to parse nested functions in ANTLR using Java

I am new to ANTLR. I have a list of functions, most of which are nested. Below are examples of the functions:
1. Function.add(Integer a,Integer b)
2. Function.concat(String a,String b)
3. Function.mul(Integer a,Integer b)
The input may contain nested calls, e.g.:
Function.concat(Function.substring(String,Integer,Integer),String)
Using ANTLR from a Java program, how can I define and validate whether the function names are correct and whether the parameter counts and datatypes are correct? The validation has to be recursive, since the functions can be deeply nested.
Validation test class:
public class FunctionValidate {
    public static void main(String[] args) {
        FunctionValidate fun = new FunctionValidate();
        fun.test("FUNCTION.concat(1,2)");
    }

    private String test(String source) {
        CodePointCharStream input = CharStreams.fromString(source);
        return compile(input);
    }

    private String compile(CharStream source) {
        MyFunctionsLexer lexer = new MyFunctionsLexer(source);
        CommonTokenStream tokenStream = new CommonTokenStream(lexer);
        MyFunctionsParser parser = new MyFunctionsParser(tokenStream);
        FunctionContext tree = parser.function();
        ArgumentContext tree1 = parser.argument();
        FunctionValidateVisitorImpl visitor = new FunctionValidateVisitorImpl();
        visitor.visitFunction(tree);
        visitor.visitArgument(tree1);
        return null;
    }
}
Visitor impl:
public class FunctionValidateVisitorImpl extends MyFunctionsParserBaseVisitor<String> {
    @Override
    public String visitFunction(MyFunctionsParser.FunctionContext ctx) {
        String function = ctx.getText();
        System.out.println("------>" + function);
        return null;
    }

    @Override
    public String visitArgument(MyFunctionsParser.ArgumentContext ctx) {
        String param = ctx.getText();
        System.out.println("------>" + param);
        return null;
    }
}
The statement System.out.println("------>" + param); is not printing the argument; it only prints ------>.

This task can be accomplished in two main steps:
1) Parse the given input and build an Abstract Syntax Tree (AST).
2) Traverse the tree and validate each function and each argument, one after another, using the Listener or Visitor pattern.
Fortunately, ANTLR provides tools for implementing both steps.
Here's a simple grammar I wrote based on your example. It does recursive parsing and builds the AST. You may want to extend its functionality to meet your needs.
Lexer:
lexer grammar MyFunctionsLexer;
FUNCTION: 'FUNCTION';
NAME: [A-Z]+;
DOT: '.';
COMMA: ',';
L_BRACKET: '(';
R_BRACKET: ')';
WS : [ \t\r\n]+ -> skip;
Parser:
parser grammar MyFunctionsParser;
options {
tokenVocab=MyFunctionsLexer;
}
function : FUNCTION '.' NAME '('(argument (',' argument)*)')';
argument: (NAME | function);
An important thing to notice here: the parser does not distinguish between functions, arguments, argument counts, etc. that are valid from your point of view and ones that are not.
So a function like Function.whatever(InvalidArg) is also a valid construction from the parser's point of view. To further validate the input and test whether it meets your requirements (a predefined list of functions and their arguments), you have to traverse the tree using a Listener or a Visitor (I think a Visitor fits perfectly here), as sketched below.
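For illustration, here is a minimal sketch of such a validating visitor. The SIGNATURES table, the error messages, and the choice of checking only argument counts are my assumptions, not part of the grammar above; type checking could be added the same way:

import java.util.Map;

public class FunctionValidateVisitorImpl extends MyFunctionsParserBaseVisitor<String> {

    // Hypothetical signature table: function name -> expected argument count.
    private static final Map<String, Integer> SIGNATURES = Map.of(
            "ADD", 2,
            "CONCAT", 2,
            "MUL", 2,
            "SUBSTRING", 3);

    @Override
    public String visitFunction(MyFunctionsParser.FunctionContext ctx) {
        String name = ctx.NAME().getText();
        Integer expected = SIGNATURES.get(name);
        if (expected == null) {
            System.err.println("Unknown function: " + name);
        } else if (ctx.argument().size() != expected) {
            System.err.println(name + " expects " + expected
                    + " arguments but got " + ctx.argument().size());
        }
        // Recurse so that nested function calls are validated too.
        return visitChildren(ctx);
    }
}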
To get a better understanding of what this is, I'd recommend reading this and this. But if you want to go deeper into the subject, you should definitely look at the Dragon Book, which covers the topic exhaustively.

Related

Java - Abstract Syntax Tree with grammar

I am building a simple grammar parser with regex. It works, but now I want to add an Abstract Syntax Tree, and I still don't understand how to set it up. I have included the parser.
The parser takes a string and tokenizes it with the lexer.
The tokens include a value and a type.
Any idea how to set up nodes to build an AST?
public class Parser {
    lexer lex;
    Hashtable<String, Integer> data = new Hashtable<String, Integer>();

    public Parser(String str) {
        ArrayList<Token> token = new ArrayList<Token>();
        String[] strpatt = {
            "[0-9]*\\.[0-9]+",        // 0
            "[a-zA-Z_][a-zA-Z0-9_]*", // 1
            "[0-9]+",                 // 2
            "\\+",                    // 3
            "\\-",                    // 4
            "\\*",                    // 5
            "\\/",                    // 6
            "\\=",                    // 7
            "\\)",                    // 8
            "\\("                     // 9
        };
        lex = new lexer(strpatt, "[\\ \t\r\n]+");
        lex.set_data(str);
    }

    public int peek() {
        //System.out.println(lex.peek().type);
        return lex.peek().type;
    }

    public boolean peek(String[] regex) {
        return lex.peek(regex);
    }

    public void set_data(String s) {
        lex.set_data(s);
    }

    public Token scan() {
        return lex.scan();
    }

    public int goal() {
        int ret = 0;
        while (peek() != -1) {
            ret = expr();
        }
        return ret;
    }
}
Currently, you are simply evaluating as you parse:
ret = ret * term()
The easiest way to think of an AST is that it is just a different kind of evaluation result. Instead of producing a numeric result from numeric sub-computations, as above, you produce a description of the computation from descriptions of the sub-computations. The description is represented as a small structure which contains the essential information:
ret = BuildProductNode(ret, term());
Or, perhaps
ret = BuildBinaryNode(Product, ret, term());
It's a tree because the Node objects which are being passed around refer to other Node objects without there ever being a cycle or a node with two different parents.
Clearly there are a lot of details missing from the above, particularly the precise nature of the Node object. But it's a rough outline.
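For instance, a minimal sketch of what those Node types could look like (the class names here are illustrative, assuming a grammar of binary arithmetic expressions like the one above):

abstract class Node { }

class NumberNode extends Node {
    final double value; // a literal operand
    NumberNode(double value) { this.value = value; }
}

class BinaryNode extends Node {
    final String op;        // e.g. "+" or "*"
    final Node left, right; // descriptions of the sub-computations
    BinaryNode(String op, Node left, Node right) {
        this.op = op;
        this.left = left;
        this.right = right;
    }
}

With types like these, expr() and term() return Node instead of int, and evaluation becomes a separate walk over the finished tree.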

JJTree Token Manager Declarations

Hi everyone, I have the following code in my .jjt file for my abstract syntax tree. It keeps track of where the nodes are created within the file that is passed to it, but I cannot access this variable from my semantic checker class.
The code is below, and any help would be appreciated! I've tried everything and I'm losing hope at this stage.
This is the integer in the .jjt file I'd like to access:
TOKEN_MGR_DECLS :
{
static int commentNesting = 0;
public static int linenumber = 0;
}
SKIP : /*STRUCTURES AND CHARACTERS TO SCAPE*/
{
" "
| "\t"
| "\n" {linenumber++;}
| "\r"
| "\f"
}
An example of one of my nodes
void VariableDeclaration() #VariableDeclaration : {Token t; String id; String type;}
{
t = <VARIABLE> id = Identifier() <COLON> type = Type()
}
My semantic checker class
public class SemanticCheckVisitor implements "My jjt file visitor" {
    public Object visit(VariableDeclaration node, Object data) {
        node.childrenAccept(this, data);
        return data;
    }
}
How would it be possible to get the line number at which this node was declared?
Thanks everyone.
You can see an example of this in the Teaching Machine's Java parser, which is here.
First you need to modify your SimpleNode type to include a field for the line number. In the TM I added a declaration
private SourceCoords myCoords ;
where SourceCoords is a type that includes not only the line number, but also information about which file the line was in. You can just use an int field. Also, in SimpleNode, you need to declare some methods like these:
public void setCoords( SourceCoords toSet ) { myCoords = toSet ; }
public SourceCoords getCoords() { return myCoords ; }
You might want to declare them in the Node interface too.
Over in your .jjt file, use the option
NODE_SCOPE_HOOK=true;
And declare two methods in your parser class
void jjtreeOpenNodeScope(Node n) {
    ((SimpleNode) n).setCoords(new SourceCoords(file, getToken(1).beginLine));
}

void jjtreeCloseNodeScope(Node n) {
}
Hmm. I probably should have declared the methods in Node to avoid that ugly cast.
One more thing: you are keeping count of the lines yourself. It's better to get the line number from the token, as I did. Your counter will generally be one token ahead; and when the parser looks ahead, it could be several tokens ahead.
If the token manager isn't counting lines correctly, then use your own count, but communicate it to the parser through an extra field added to the Token class.
Generally it's a bad idea to compute anything in the token manager and then use it in the parser unless it's information you store in the tokens.
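Putting it together on the visitor side, a sketch of reading the stored position back (this assumes you took the plain-int route, giving SimpleNode a line field with setLine/getLine instead of SourceCoords; the visitor interface name is whatever JJTree generated for you):

public class SemanticCheckVisitor implements MyJJTreeVisitor { // generated visitor interface (name is illustrative)
    public Object visit(VariableDeclaration node, Object data) {
        int line = node.getLine(); // hypothetical accessor inherited from SimpleNode
        System.out.println("VariableDeclaration at line " + line);
        node.childrenAccept(this, data);
        return data;
    }
}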

Sonarqube: How to get the expression string when writing custom java rules?

The target class is:
class Example {
    public void m() {
        System.out.println("Hello" + 1);
    }
}
I want to get the full string of the MethodInvocation, "System.out.println("Hello" + 1)", for some regex check. How should I write this?
public class Rule extends BaseTreeVisitor implements JavaFileScanner {
    @Override
    public void visitMethodInvocation(MethodInvocationTree tree) {
        // get the string of the MethodInvocation
        // some regex check
        super.visitMethodInvocation(tree);
    }
}
I have written code inspection rules using Eclipse JDT and IDEA PSI, whose expression tree nodes have these attributes, and I wonder why Sonar's only has the first and last token instead.
Thanks!
An old question, but I have a solution.
This works for any sort of tree.
@Override
public void visitMethodInvocation(MethodInvocationTree tree) {
    int firstLine = tree.firstToken().line();
    int lastLine = tree.lastToken().line();
    String rawText = getRelevantLines(firstLine, lastLine);
    // do your thing here with rawText
}

private String getRelevantLines(int startLine, int endLine) {
    StringBuilder builder = new StringBuilder();
    // Token lines are 1-based while the file-lines list is 0-based,
    // hence the -1 on the start index.
    context.getFileLines().subList(startLine - 1, endLine).forEach(builder::append);
    return builder.toString();
}
If you want to refine further, you can also use firstToken().column(), or perhaps use the method name in your regex.
If you want more lines/a bigger scope, just use the parent of that tree: tree.parent().
This will also handle cases where the expression/params/etc span multiple lines.
There might be a better way... but I don't know of any other way. May update if I figure out something better.
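For completeness, the regex check itself might then look roughly like this (the pattern, message, and the context field are placeholders; custom sonar-java rules typically keep the JavaFileScannerContext from scanFile):

private static final java.util.regex.Pattern FORBIDDEN =
        java.util.regex.Pattern.compile("System\\.out\\.println"); // illustrative pattern

@Override
public void visitMethodInvocation(MethodInvocationTree tree) {
    String rawText = getRelevantLines(tree.firstToken().line(), tree.lastToken().line());
    if (FORBIDDEN.matcher(rawText).find()) {
        context.reportIssue(this, tree, "Avoid System.out.println");
    }
    super.visitMethodInvocation(tree);
}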

Matching OR expression using Grappa (Java PEG Parser)

I'm new to PEG parsing and trying to write a simple parser for an expression like "term1 OR term2 anotherterm", ideally producing an AST that would look something like:
OR
-----------|---------
| |
"term1" "term2 anotherterm"
I'm currently using Grappa (https://github.com/fge/grappa), but it doesn't match even the more basic expression "term1 OR term2". This is what I have:
package grappa;
import com.github.fge.grappa.annotations.Label;
import com.github.fge.grappa.parsers.BaseParser;
import com.github.fge.grappa.rules.Rule;
public class ExprParser extends BaseParser<Object> {
    @Label("expr")
    Rule expr() {
        return sequence(terms(), wsp(), string("OR"), wsp(), terms(), push(match()));
    }

    @Label("terms")
    Rule terms() {
        return sequence(whiteSpaces(),
            join(term()).using(wsp()).min(0),
            whiteSpaces());
    }

    @Label("term")
    Rule term() {
        return sequence(oneOrMore(character()), push(match()));
    }

    Rule character() {
        return anyOf(
            "0123456789" +
            "abcdefghijklmnopqrstuvwxyz" +
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ" +
            "-_");
    }

    @Label("whiteSpaces")
    Rule whiteSpaces() {
        return join(zeroOrMore(wsp())).using(sequence(optional(cr()), lf())).min(0);
    }
}
Can anyone point me in the right direction?
(author of grappa here...)
OK, so, what you seem to want is in fact a parse tree.
Very recently, an extension to grappa (2.0.x+) has been developed which can answer your needs: https://github.com/ChrisBrenton/grappa-parsetree.
Grappa, by default, only "blindly" matches text and has a stack at its disposal, so you could have, for instance:
public Rule oneOrOneOrEtc()
{
    return join(one(), push(match())).using(or()).min(1);
}
But then all of your matches would be on the stack... Not very practical, but still usable in some situations (see, for instance, sonar-sslr-grappa).
In your case you want this package. You can do this with it:
// define your root node
public final class Root
    extends ParseNode
{
    public Root(final String match, final List<ParseNode> children)
    {
        super(match, children);
    }
}

// define your parse node
public final class Alternative
    extends ParseNode
{
    public Alternative(final String match, final List<ParseNode> children)
    {
        super(match, children);
    }
}
That is the minimal implementation. And then your parser can look like this:
@GenerateNode(Alternative.class)
public Rule alternative() // or whatever
{
    return // whatever an alternative is
}

@GenerateNode(Root.class)
public Rule root()
{
    return join(alternative())
        .using(or())
        .min(1);
}
What happens here is that, since the root node is matched before the alternative, if, say, you have a string:
a or b or c or d
then the root node will match the "whole sequence", and it will have four alternatives matching a, b, c, and d respectively.
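For reference, actually running the parser usually looks something like this in grappa 2.x (a sketch; exact package and class names may vary between versions, and it assumes the root() rule above lives in the asker's ExprParser):

import com.github.fge.grappa.Grappa;
import com.github.fge.grappa.run.ListeningParseRunner;
import com.github.fge.grappa.run.ParsingResult;

public final class Main {
    public static void main(final String... args) {
        final ExprParser parser = Grappa.createParser(ExprParser.class);
        final ListeningParseRunner<Object> runner =
                new ListeningParseRunner<>(parser.root());
        final ParsingResult<Object> result = runner.run("a or b or c or d");
        System.out.println(result.isSuccess()); // true if the input matched
    }
}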
Full credits here go to Christopher Brenton for coming up with this idea in the first place!

ANTLR4: create more meaningful/consistent type names

By default, token.getType() returns an int, which is pretty useless to code against without loading and parsing the generated *.tokens file.
How do ANTLR users usually go about making consistent use of the token types? What I mean by consistent is that if you change the grammar, the token numbers are very likely to change.
Do you typically create a Utility class that loads the *.tokens file and parses it?
My sample Search.tokens file:
LOCATION=8
TIME=5
AGE=3
WS=1
COMPARATIVE=9
GENDER=4
PHRASE=2
A sample token stream:
(token.getType(), token.getText())
9 [MegaBlocks vs Legos], -1 [<EOF>]
Currently I'm doing something like:
public class TokenMapper {
    private HashMap<Integer, String> tokens;

    public TokenMapper(String file) {
        tokens = new HashMap<Integer, String>();
        parse(file);
    }

    private void parse(String file) {
        // trivial code that maps the Integer typeId to the String name
    }

    public Integer type(String type) {
        for (Map.Entry<Integer, String> entry : tokens.entrySet()) {
            if (entry.getValue().equals(type)) {
                return entry.getKey();
            }
        }
        return null;
    }

    public String type(Integer type) {
        return tokens.get(type);
    }
}
Then I can always refer to my tokens by names such as LOCATION or GENDER and don't have to worry about the Integer values that tend to change.
When you generate your lexer and/or parser, the generated class will contain constants for each token type declared in the grammar as well as the ones imported via a tokens file.
For example, if you have the following grammar:
lexer grammar SearchLexer;
options { tokenVocab = Search; }
...
Then the generated SearchLexer.java class will contain constants (public static final int) for LOCATION and GENDER because they were imported due to the tokenVocab option.
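So you can compare token types against those constants directly, and ANTLR 4's Vocabulary gives you the reverse (type-to-name) mapping without ever touching the .tokens file. A small sketch, assuming the generated SearchLexer from the grammar above:

import org.antlr.v4.runtime.Token;

public final class TokenNames {
    // Compare against the generated constants instead of magic numbers.
    static boolean isLocation(Token token) {
        return token.getType() == SearchLexer.LOCATION;
    }

    // Reverse mapping (type -> symbolic name), no .tokens parsing needed.
    static String nameOf(Token token) {
        return SearchLexer.VOCABULARY.getSymbolicName(token.getType()); // e.g. "LOCATION"
    }
}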
