I am using CocoR to generate a Java-like scanner/parser.
I'm having trouble creating an EBNF expression to match a code block.
I'm assuming a code block is surrounded by two well-known tokens:
<& and &>
example:
public method(int a, int b) <&
various code
&>
If I define a nonterminal symbol
codeblock = "<&" {ANY} "&>"
and the code between the two symbols contains a '<' character, the generated compiler does not handle it and reports a syntax error.
Any hints?
Edit:
COMPILER JavaLike
CHARACTERS
nonZeroDigit = "123456789".
digit = '0' + nonZeroDigit .
letter = 'A' .. 'Z' + 'a' .. 'z' + '_' + '$'.
TOKENS
ident = letter { letter | digit }.
PRODUCTIONS
JavaLike = {ClassDeclaration}.
ClassDeclaration ="class" ident ["extends" ident] "{" {VarDeclaration} {MethodDeclaration }"}" .
MethodDeclaration ="public" Type ident "("ParamList")" CodeBlock.
CodeBlock = "<&" {ANY} "&>".
I have omitted some productions for the sake of simplicity.
This is my actual implementation of the grammar. The main bug is that it fails if the code in the block contains one of the symbols '>' or '&'.
Nick, late to the party here ...
There are a number of ways to do this:
Define tokens for <& and &> so the lexer knows about them.
You may be able to use a COMMENTS directive:
COMMENTS FROM "<&" TO "&>" (quoted, as Coco expects).
Or hack NextToken() in your scanner.frame file. Do something like this (pseudo-code):
if (Peek() == CODE_START)
{
while (NextToken() != CODE_END)
{
// eat tokens
}
}
Or you can override the Read() method in the Buffer and eat the characters at the lowest level.
HTH
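To illustrate the low-level idea (eat everything up to the closing marker without tokenizing it), here is a minimal Java sketch. It is not Coco/R code: the class name CodeBlockSkipper and the method extractBlock are made up for illustration, and it works on a plain String rather than on Coco/R's Buffer.

```java
class CodeBlockSkipper {
    static final String CODE_START = "<&";
    static final String CODE_END = "&>";

    // Returns the raw text between the first CODE_START at or after 'from'
    // and the next CODE_END, or null if no complete block is found.
    // The body is never tokenized, so '<', '>' and '&' inside it are harmless.
    static String extractBlock(String source, int from) {
        int start = source.indexOf(CODE_START, from);
        if (start < 0) {
            return null;
        }
        int bodyStart = start + CODE_START.length();
        int end = source.indexOf(CODE_END, bodyStart);
        if (end < 0) {
            return null;
        }
        return source.substring(bodyStart, end);
    }

    public static void main(String[] args) {
        String src = "public method(int a, int b) <& if (a < b && b > 0) a = b; &>";
        System.out.println(extractBlock(src, 0));
    }
}
```

Note that this simple scan stops at the first &> it sees, so the body itself cannot contain that character pair (for example inside a string literal).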
You can expand the ANY term to include <&, &>, and another nonterminal (call it ANY_WITHIN_BLOCK say).
Then you just use
ANY = "<&" | {ANY_WITHIN_BLOCK} | "&>"
codeblock = "<&" {ANY_WITHIN_BLOCK} "&>"
And then the meaning of {ANY} is unchanged if you really need it later.
Okay, I didn't know anything about CocoR and gave you a useless answer, so let's try again.
As I started to say later in the comments, I feel that the real issue is that your grammar might be too loose and not sufficiently well specified.
When I wrote the CFG for the one language I've tried to create, I ended up using a sort of "meet-in-the-middle" approach: I wrote the top-level structure AND the immediate low-level combinations of tokens first, and then worked to make them meet in the mid-level (at about the level of conditionals and control flow, I guess).
You said this language is a bit like Java, so let me just show you the first lines I would write as a first draft to describe its grammar (in pseudocode, sorry. Actually it's like yacc/bison. And here, I'm using your brackets instead of Java's):
/* High-level stuff */
program: classes
classes: main-class inner-classes
inner-classes: inner-classes inner-class
| /* empty */
main-class: class-modifier "class" identifier class-block
inner-class: "class" identifier class-block
class-block: "<&" class-decls "&>"
class-decls: field-decl
| method
method: method-signature method-block
method-block: "<&" statements "&>"
statements: statements statement
| /* empty */
class-modifier: "public"
| "private"
identifier: /* well, you know */
And at the same time as you do all that, figure out your immediate token combinations, like for example defining "number" as a float or an int and then creating rules for adding/subtracting/etc. them.
I don't know what your approach is so far, but you definitely want to make sure you carefully specify everything and use new rules when you want a specific structure. Don't get ridiculous with creating one-to-one rules, but never be afraid to create a new rule if it helps you organize your thoughts better.
I have a problem with an ANTLR4 grammar in Java.
I would like to have a lexer rule that is able to match all of the following inputs:
Only letters
Letters and numbers
Only numbers
My code looks like this:
parser rule:
new_string: NEW_STRING+;
lexer rule:
NEW_DIGIT: [0-9]+;
STRING_CHAR : ~[;\r\n"'];
NEW_STRING: (NEW_DIGIT+ | STRING_CHAR+ | STRING_CHAR+ NEW_DIGIT+);
I know there must be an obvious solution, but I have been trying to find one, and I can't seem to figure out a way.
Thank you in advance!
Since the first two lexer rules are not fragments, they can (and will) be matched if the input contains just digits, or just characters from ~[;\r\n"'] (when several rules can match an equally long sequence of input, the first lexer rule wins).
In fact, STRING_CHAR can match anything that NEW_STRING can, so the latter will never be used.
You need to:
make sure STRING_CHAR does not match digits
make NEW_DIGIT and STRING_CHAR fragments
check the repetition operators: almost everything is allowed to repeat in your lexer, which doesn't make sense at first glance (but you need to adjust that to your requirements, which we do not know)
Like this:
fragment NEW_DIGIT: [0-9];
fragment STRING_CHAR : ~[;\r\n"'0-9];
NEW_STRING: (NEW_DIGIT+ | STRING_CHAR+ (NEW_DIGIT+)?);
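If it helps to sanity-check which inputs the corrected NEW_STRING rule accepts as a single token, the same shape can be written as a plain Java regex. This is only an approximation for experimenting (ANTLR tokenizes a stream, it does not match whole strings), and the class name NewStringCheck is hypothetical.

```java
import java.util.regex.Pattern;

class NewStringCheck {
    // Mirrors: NEW_DIGIT+ | STRING_CHAR+ (NEW_DIGIT+)?
    // where STRING_CHAR excludes ; CR LF " ' and digits.
    static final Pattern NEW_STRING =
            Pattern.compile("[0-9]+|[^;\\r\\n\"'0-9]+[0-9]*");

    static boolean isNewString(String input) {
        return NEW_STRING.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"abc", "abc123", "123", "12ab", "a1b"}) {
            System.out.println(s + " -> " + isNewString(s));
        }
    }
}
```

Inputs such as 12ab or a1b are not a single NEW_STRING under this rule; the parser rule new_string: NEW_STRING+ is what lets several such tokens follow each other.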
I'm trying to extract variables from code statements and "if" conditions. I have a regex for that, but mymatcher.find() doesn't find any matches.
I don't know what is wrong.
Here is my code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class test {
public static void main(String[] args) {
String test="x=y+z/n-10+my5th_integer+201";
Pattern mypattern = Pattern.compile("^[a-zA-Z_$][a-zA-Z_$0-9]*$");
Matcher mymatcher = mypattern.matcher(test);
while (mymatcher.find()) {
String find = mymatcher.group(1) ;
System.out.println("variable:" + find);
}
}
}
You need to remove the ^ and $ anchors, which assert positions at the start and end of the string respectively, and use mymatcher.group(0) instead of mymatcher.group(1) because you do not have any capturing groups in your regex:
String test="x=y+z/n-10+my5th_integer+201";
Pattern mypattern = Pattern.compile("[a-zA-Z_$][a-zA-Z_$0-9]*");
Matcher mymatcher = mypattern.matcher(test);
while (mymatcher.find()) {
String find = mymatcher.group(0) ;
System.out.println("variable:" + find);
}
See IDEONE demo, the results are:
variable:x
variable:y
variable:z
variable:n
variable:my5th_integer
Usually processing source code with just a regex simply fails.
If all you want to do is pick out identifiers (we discuss variables further below) you have some chance with regular expressions (after all, this is how lexers are built).
But you probably need a much more sophisticated version than what you have, even with corrections as suggested by other authors.
A first problem is that if you allow arbitrary statements, they often have keywords that look like identifiers. In your specific example, "if" looks like an identifier. So your matcher either has to recognize identifier-like substrings and subtract away known keywords, or the regex itself must express the idea that an identifier has a basic shape but cannot look like a specific list of keywords. (The latter is called a subtractive regex, and isn't found in most regex engines. It looks something like:
[a-zA-Z_$][a-zA-Z_$0-9]* - (if | else | class | ... )
Our DMS lexer generator [see my bio] has subtractive regex because this is extremely useful in language-lexing).
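With java.util.regex, the first option (match identifier-shaped text, then subtract keywords) can be sketched as follows. The keyword list here is deliberately tiny and incomplete, and the class name IdentifierFilter is made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IdentifierFilter {
    // Assumption: a short sample of keywords, not the full Java list.
    static final Set<String> KEYWORDS =
            new HashSet<>(Arrays.asList("if", "else", "class", "while", "for", "return"));

    static final Pattern IDENT = Pattern.compile("[a-zA-Z_$][a-zA-Z_$0-9]*");

    // Collects identifier-shaped matches and drops the ones that are keywords.
    static List<String> identifiers(String code) {
        List<String> found = new ArrayList<>();
        Matcher m = IDENT.matcher(code);
        while (m.find()) {
            if (!KEYWORDS.contains(m.group())) {
                found.add(m.group());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [x, y, z, x, z, y]
        System.out.println(identifiers("if (x > y) z = x; else z = y;"));
    }
}
```

This only handles keywords that are always keywords; it does nothing for the context-dependent cases.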
This gets more complex if the "keywords" are not always keywords, that is,
they can be keywords only in certain contexts. The Java "keyword" enum is just that: if you use it in a type context, it is a keyword; otherwise it is an identifier; C# is similar. Now the only way to know
if a purported identifier is a keyword is to actually parse the code (which is how you detect the context that controls its keyword-ness).
Next, identifiers in Java allow a variety of Unicode characters (Latin-1, Russian, Chinese, ...). A regexp to recognize this, accounting for all the characters, is a lot bigger than the simple "A-Z" style you propose.
For Java, you need to defend against string literals containing what appear to be variable names. Consider the (funny-looking but valid) statement:
a = "x=y+z/n-10+my5th_integer+201";
There is only one identifier here. A similar problem occurs with comments
that contain content that looks like statements:
/* Tricky:
a = "x=y+z/n-10+my5th_integer+201";
*/
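A common regex-level defence, sketched below, is to put alternatives for string literals, character literals and comments ahead of the identifier alternative, so that such text is consumed and discarded before it can be mistaken for identifiers. This is a rough approximation and not a Java lexer: the class name LiteralAwareScan is hypothetical, and text blocks, Unicode escapes and keywords are all ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralAwareScan {
    static final String STRING = "\"(?:\\\\.|[^\"\\\\])*\"";
    static final String CHAR = "'(?:\\\\.|[^'\\\\])*'";
    static final String BLOCK_COMMENT = "/\\*.*?\\*/";
    static final String LINE_COMMENT = "//[^\\n]*";
    static final String IDENT = "([a-zA-Z_$][a-zA-Z_$0-9]*)";

    // Alternatives are tried left to right, so literals and comments win
    // whenever they start at the current position. Only IDENT has a group.
    static final Pattern SCAN = Pattern.compile(
            STRING + "|" + CHAR + "|" + BLOCK_COMMENT + "|" + LINE_COMMENT + "|" + IDENT,
            Pattern.DOTALL);

    static List<String> identifiers(String code) {
        List<String> found = new ArrayList<>();
        Matcher m = SCAN.matcher(code);
        while (m.find()) {
            if (m.group(1) != null) {
                found.add(m.group(1));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [a]
        System.out.println(identifiers("a = \"x=y+z/n-10+my5th_integer+201\";"));
    }
}
```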
For Java, you need to worry about Unicode escapes, too. Consider this valid Java statement:
\u0061 = \u0062; // means "a=b;"
or nastier:
a\u006bc = 1; // means "akc=1;" not "abc=1;"!
Pushing this, without Unicode character decoding, you might not even
notice a string. The following is a variant of the above:
a = \u0022x=y+z/n-10+my5th_integer+201";
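A sketch of the decoding step that would have to run before any identifier matching is shown below. It is simplified on purpose: it treats every backslash followed by one or more u characters and four hex digits as an escape, and it ignores the Java rule that a backslash which is itself escaped does not start a Unicode escape. The class name UnicodeEscapes is made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeEscapes {
    // A backslash, one or more 'u', then exactly four hex digits.
    static final Pattern ESCAPE = Pattern.compile("\\\\u+([0-9a-fA-F]{4})");

    // Replaces each escape with the character it denotes.
    static String decode(String source) {
        Matcher m = ESCAPE.matcher(source);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '$' and backslash in the result
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints a = b;
        System.out.println(decode("\\u0061 = \\u0062;"));
    }
}
```

Only after this pass does the last example above visibly start a string literal.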
To extract identifiers correctly, you need to build (or use) the equivalent of a full Java lexer, not just a simple regex match.
If you can live with being right only most of the time, you can try your regex. Usually regex-applied-to-source-code-parsing ends badly, partly because of the above problems (e.g., oversimplification).
You are lucky in that you are trying to do this for Java. If you had to do this for C#, a very similar language, you'd have to handle interpolated strings, which allow expressions inside strings. The expressions themselves can contain strings... it's turtles all the way down. Consider the C# (version 6) statement:
a = $"x+{y*$"z=${c /* p=q */}"[2]}*q" + b;
This contains the identifiers a, b, c and y. Every other "identifier" is actually just a string or comment character. PHP has similar interpolated strings.
To extract identifiers from this, you need something that understands the nesting of string elements. Lexers usually don't do recursion (our DMS lexers handle this, for precisely this reason), so to process this correctly you usually need a parser, or at least something that tracks nesting.
You have one other issue: do you want to extract just variable names?
What if the identifier represents a method, type, class or package?
You can't figure this out without having a full parser and full Java name and type resolution, and you have to do this in the context in which the statement is found. You'd be amazed how much code it takes to do this right.
So, if your goals are simpleminded and you don't care if it handles these complications, you can get by with a simple regex to pick out things
that look like identifiers.
If you want to do it well (e.g., use this in some production code), the single regex will be a total disaster. You'll spend your life explaining to users what they cannot type, and that never works.
Summary: because of all the complications, processing source code with just a regex usually fails. People keep re-learning this lesson. It is one of the key reasons that lexer generators are widely used in language processing tools.
So, I have text like this
String s = "The if-then-else statement provides a secondary path of execution when an \"if\" clause evaluates to false. You could use an if-then-else statement in the applyBrakes method to take some action if the brakes are applied when the bicycle is not in motion. In this case, the action is to simply print an error message stating that the bicycle has already stopped.";
I need to split this string into sentences but keep the punctuation mark at the end of each sentence, so I can't just use something like this:
s.split("[\\.|!|\\?|:] ");
Because if I use it I receive this:
The if-then statement is the most basic of all the control flow statements
It tells your program to execute a certain section of code only if a particular test evaluates to true
For example, the Bicycle class could allow the brakes to decrease the bicycle's speed only if the bicycle is already in motion
One possible implementation of the applyBrakes method could be as follows:
And I'm losing my punctuation mark at the end, so how can I do it?
First of all, your regex [\\.|!|\\?|:] represents . or | or ! or | or ? or | or : because you used a character class [...]. You probably wanted to use (\\.|!|\\?|:) or, probably better, [.!?:] (I am not sure why you want : here, but it is your choice).
The next thing is that if you want to split on a space and make sure that a . or ! or ? or : character comes before it, without consuming that preceding character, use the look-behind mechanism, like
split("(?<=[.!?:])\\s")
But the best approach would be to use a proper tool for splitting sentences, which is BreakIterator. You can find an example of its usage in this question: Split string into sentences based on periods
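For reference, a minimal sketch of the BreakIterator approach could look like this. The class name SentenceSplitter and the choice of Locale.US are assumptions for the example; the sentence boundaries come from java.text.BreakIterator, so the punctuation stays attached to each sentence.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSplitter {
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        // Each pair (start, end) of consecutive boundaries delimits one sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String s : sentences("Hello world. How are you? Fine!")) {
            System.out.println(s);
        }
    }
}
```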
You can simply alternate a whitespace with the end of input in your pattern:
//                                          | your original punctuation class,
//                                          | no need for "|" between items
//                                          | (that would include "|"
//                                          | as a delimiter)
//                                          | nor escapes, now that I think of it
//                                          |     | look ahead for:
//                                          |     |  either whitespace
//                                          |     |      | or end
System.out.println(Arrays.toString(s.split("[.!?:](?=\\s|$)")));
That'll include the last chunk, and print (line breaks added for clarity):
[The if-then-else statement provides a secondary path of execution when an "if" clause evaluates to false,
You could use an if-then-else statement in the applyBrakes method to take some action if the brakes are applied when the bicycle is not in motion,
In this case, the action is to simply print an error message stating that the bicycle has already stopped]
I'm trying to write a grammar in ANTLR, and the rules for recognizing IDs and int literals are written as follows:
ID : Letter(Letter|Digit|'_')*;
TOK_INTLIT : [0-9]+ ;
//this is not the complete grammar btw
and when the input is :
void main(){
int 2a;
}
the problem is, the lexer is recognizing 2 as an int literal and a as an ID, which is completely logical based on the grammar I've written, but I don't want 2a to be recognized this way, instead I want an error to be displayed since identifiers cannot begin with something other than a letter... I'm really new to this compiler course... what should be done here?
It's at least interesting that in C and C++, 2n is an invalid number, not an invalid identifier. That's because the C lexer (or, to be more precise, the preprocessor) is required by the standard to interpret any sequence of digits and letters starting with a digit as a "preprocessor number". Later on, an attempt is made to reinterpret the preprocessor number (if it is still part of the preprocessed code) as one of the many possible numeric syntaxes. 2n isn't, so an error will be generated at that point.
Preprocessor numbers are more complicated than that, but that should be enough of a hint for you to come up with a simple solution for your problem.
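One way to read that hint, sketched in Java rather than ANTLR: let the lexer grab any run of letters, digits and underscores that starts with a digit as one token, and only afterwards decide whether it is a valid integer literal. The class and method names below are hypothetical, and this is a simplification of C's preprocessor numbers, not their full definition.

```java
import java.util.regex.Pattern;

class TokenClassifier {
    // A token starting with a digit swallows all following letters, digits and '_'.
    static final Pattern DIGIT_LED = Pattern.compile("[0-9][0-9A-Za-z_]*");
    static final Pattern INT_LITERAL = Pattern.compile("[0-9]+");
    static final Pattern ID = Pattern.compile("[A-Za-z][A-Za-z0-9_]*");

    static String classify(String token) {
        if (DIGIT_LED.matcher(token).matches()) {
            // Second step: reinterpret the digit-led token as a real literal, or reject it.
            return INT_LITERAL.matcher(token).matches() ? "INTLIT" : "ERROR";
        }
        return ID.matcher(token).matches() ? "ID" : "ERROR";
    }

    public static void main(String[] args) {
        System.out.println(classify("2a")); // prints ERROR
        System.out.println(classify("25")); // prints INTLIT
        System.out.println(classify("a2")); // prints ID
    }
}
```

Because 2a is now one (invalid) token instead of an int literal followed by an ID, the error can be reported at the point where the token is classified.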
Good evening, Stack Overflow.
I'd like to develop an interpreter for expressions based on a pretty simple context-free grammar:
Grammar
Basically, the language is constituted by 2 base statements
( SET var 25 ) // Output: var = 25
( GET ( MUL var 5 ) ) // Output: 125
( SET var2 ( MUL 30 5 ) ) //Output: var2 = 150
Now, I'm pretty sure about what I should do in order to interpret a statement: 1) lexical analysis to turn a statement into a sequence of tokens, 2) syntax analysis to get a symbol table (a HashMap with the variables and their values) and a syntax tree (to perform the GET statements), and 3) an inorder visit of the tree to get the results I want.
I'd like some advice on the parsing method to read the source file. Considering the parser should ignore any whitespace, tabulation or newline, is it possible to use a Java Pattern to get a general statement I want to analyze? Is there a good way to read a statement weirdly formatted (and possibly more complex) like this
(
SET var
25
)
without confusing the parser with the open and closed parentheses?
For example
Scanner scan; //scanner reading the source file
String pattern = "..."; //ideal pattern I've found to represent an expression
while(scan.hasNext(pattern))
Interpreter.computeStatement(scan.next(pattern));
would it be a viable option for this problem?
Solution proposed by Ira Baxter:
Your title is extremely confused. You appear to want to parse what are commonly called "S-expressions" in the LISP world; this takes a (simple but) context-free grammar. You cannot parse such expressions with regexps. Time to learn about real parsers.
Maybe this will help: stackoverflow.com/a/2336769/120163
In the end, I understood thanks to Ira Baxter that this context-free grammar can't be parsed with RegExp, and I used the concepts of S-expressions to build the interpreter, whose source code you can find here. If you have any questions about it (mainly because the comments aren't translated into English, even though I think the code is pretty clear), just message me or comment here.
Basically what I do is:
Read the input character by character and tokenize it (e.g. '(' becomes OPEN_PAR, "SET" becomes STATEMENT_SET, and a single letter like 'b' becomes a VARIABLE)
Then I use the resulting token list for syntactic analysis, which checks the patterns occurring in the token list against the grammar
If there's an expression inside the statement, I check recursively for any expression nested inside it, throwing an exception and skipping to the next correct statement if needed
After analysing each statement, I evaluate it as required by the specification
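The steps above can be sketched in a few dozen lines of Java. This is not the linked source code, just a minimal illustration under assumptions: only SET, GET and MUL are supported (the ones that appear in the examples, since the full grammar is not shown here), values are integers, and the class name MiniInterpreter is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MiniInterpreter {
    private final Map<String, Integer> variables = new HashMap<>();
    private List<String> tokens;
    private int pos;

    // Step 1: tokenize. Padding the parentheses makes whitespace layout irrelevant.
    static List<String> tokenize(String source) {
        String padded = source.replace("(", " ( ").replace(")", " ) ").trim();
        if (padded.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(padded.split("\\s+")));
    }

    // Steps 2 and 4: check the statement shape and evaluate it.
    String run(String statement) {
        tokens = tokenize(statement);
        pos = 0;
        expect("(");
        String op = next();
        String result;
        if (op.equals("SET")) {
            String name = next();
            int value = expression();
            variables.put(name, value);
            result = name + " = " + value;
        } else if (op.equals("GET")) {
            result = String.valueOf(expression());
        } else {
            throw new IllegalArgumentException("Unknown statement: " + op);
        }
        expect(")");
        return result;
    }

    // Step 3: expressions may nest, so they are evaluated recursively.
    private int expression() {
        String token = next();
        if (token.equals("(")) {
            String op = next();
            if (!op.equals("MUL")) {
                throw new IllegalArgumentException("Unknown operator: " + op);
            }
            int left = expression();
            int right = expression();
            expect(")");
            return left * right;
        }
        if (token.matches("-?[0-9]+")) {
            return Integer.parseInt(token);
        }
        Integer value = variables.get(token);
        if (value == null) {
            throw new IllegalArgumentException("Undefined variable: " + token);
        }
        return value;
    }

    private String next() {
        if (pos >= tokens.size()) {
            throw new IllegalArgumentException("Unexpected end of input");
        }
        return tokens.get(pos++);
    }

    private void expect(String expected) {
        String token = next();
        if (!token.equals(expected)) {
            throw new IllegalArgumentException("Expected " + expected + " but got " + token);
        }
    }
}
```

Calling run("( SET var 25 )") and then run("( GET ( MUL var 5 ) )") on the same instance returns "var = 25" and "125", and a statement spread over several lines is handled the same way because the tokenizer discards all whitespace.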