I am writing a lexer for Java in flex.
The java spec says:
"The longest possible translation is used at each step, even if the result does not ultimately make a correct program while another lexical translation would. There is one exception: if lexical translation occurs in a type context (§4.11) and the input stream has two or more consecutive > characters that are followed by a non-> character, then each > character must be translated to the token for the numerical comparison operator >."
So how can I distinguish between the right shift operator and something like the >> in <List<List>>?
The original Java generics proposal (JSR-14) required modifying the Java grammar for parameterized types so that it would accept >> and >>> in contexts where multiple close angle brackets were possible. (I couldn't find a useful authoritative link for JSR-14 but Gilad Bracha's GJ specification is still available on his website; the grammar modifications are shown in section 2.3.)
These modifications were never formally incorporated in any Java standard as far as I know; eventually, JLS8 incorporated the change to the description of lexical analysis which you quote in your question. (See JDK-8021600, which also reproduces the convoluted grammar which was originally proposed.)
The grammar modifications proposed by Bracha et al. will work, but you might find that they make incorporating other grammar changes more complicated. (I haven't really looked at this in any depth, so it might not actually be a problem for the current Java Language Specification. But it still might be an issue for future editions.)
While contextual lexical analysis does allow the simpler grammar actually used in the JLS, it certainly creates difficulties for lexical analysis. One possible approach is to abandon lexical analysis altogether by using a scannerless parser; this will certainly work but you won't be able to accomplish that within the Bison/Flex model. Also, you might find that some of the modifications needed to support scannerless parsing also require non-trivial changes to the published grammar.
Another possibility is to use lexical feedback from the parser, by incorporating mid-rule actions (MRAs) which turn a "type context" flag on and off when type contexts are entered and exited. (There is a complete list of type contexts in §4.11 which can be used to find the appropriate locations in the grammar.) If you try this, please be aware that the execution of MRAs is not fully synchronised with lexical analysis because the parser generally requires a lookahead token to decide whether or not to reduce the MRA. You often need to put the MRA one symbol earlier in the grammar than you might think, so that it actually takes effect by the time it is needed.
Another possibility might be to never recognise >> and >>> as tokens. Instead, the lexer could return two different > tokens: one used when the immediately following character is another >:
/* A > immediately followed by another > (trailing context). */
>/>     { return CONJUNCTIVE_GT; }
/* A > followed by anything else. */
>       { return INDEPENDENT_GT; }
/* These two don't need to be changed: flex still prefers the longer match. */
>>=     { return SHIFT_ASSIGN; }
>>>=    { return LONG_SHIFT_ASSIGN; }
Then you can modify your grammar to recognise >> and >>> operators, while allowing either form of > as a close angle bracket:
shift_op      : CONJUNCTIVE_GT INDEPENDENT_GT ;
long_shift_op : CONJUNCTIVE_GT CONJUNCTIVE_GT INDEPENDENT_GT ;
close_angle   : CONJUNCTIVE_GT | INDEPENDENT_GT ;
gt_op         : INDEPENDENT_GT ;  /* This unit production is not really necessary */
That should work (although I haven't tried it), but it doesn't play well with the Bison/Yacc operator precedence mechanism, because you cannot declare precedence for a non-terminal. So you'd need to use an expression grammar with explicit operator precedence rules, rather than an ambiguous grammar augmented with precedence declarations.
I have to give the user the option to enter a mathematical formula in a text field and then save it in the DB as a String. That is easy enough, but I also need to retrieve it and use it to do calculations.
For example, assume I allow someone to specify the formula of employee salary calculation which I must save in String format in the DB.
GROSS_PAY = BASIC_SALARY - NO_PAY + TOTAL_OT + ALLOWANCE_TOTAL
Assume that terms such as GROSS_PAY and BASIC_SALARY are known to us, and we can make out what they evaluate to. The real issue is that we can't predict which combinations of such terms (e.g. GROSS_PAY) and other mathematical operators the user may choose to enter (not just +, -, ×, and /, but also the radical sign, indicating roots, and powers, etc.). So how do we interpret this formula in string format once we have retrieved it from the DB, so we can do calculations based on the composition of the formula?
Building an expression evaluator is actually fairly easy.
See my SO answer on how to write a parser. With a BNF for exactly the range of expression operators and operands you want, you can follow this process to build a parser for exactly those expressions, directly in Java.
The answer links to a second answer that discusses how to evaluate the expression as you parse it.
So, you read the string from the database, collect the set of possible variables that can occur in the expression, and then parse/evaluate the string. If you don't know the variables in advance (seems like you must), you can parse the expression twice, the first time just to get the variable names.
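To make that process concrete, here is a minimal recursive descent parser/evaluator sketch in Java, for just +, -, *, /, parentheses, and named variables; the grammar and all class and method names are illustrative, not taken from the linked answers:

import java.util.Map;

public class FormulaEvaluator {
    private final String src;
    private final Map<String, Double> vars;
    private int pos;

    FormulaEvaluator(String src, Map<String, Double> vars) {
        this.src = src;
        this.vars = vars;
    }

    // expression := term (('+' | '-') term)*
    double expression() {
        double value = term();
        while (true) {
            if (eat('+')) value += term();
            else if (eat('-')) value -= term();
            else return value;
        }
    }

    // term := factor (('*' | '/') factor)*
    double term() {
        double value = factor();
        while (true) {
            if (eat('*')) value *= factor();
            else if (eat('/')) value /= factor();
            else return value;
        }
    }

    // factor := NUMBER | NAME | '(' expression ')'
    double factor() {
        if (eat('(')) {
            double value = expression();
            eat(')');
            return value;
        }
        skipSpaces();
        int start = pos;
        while (pos < src.length() && (Character.isLetterOrDigit(src.charAt(pos))
                || src.charAt(pos) == '_' || src.charAt(pos) == '.')) pos++;
        String token = src.substring(start, pos);
        if (token.isEmpty()) throw new IllegalArgumentException("bad input at " + pos);
        // A missing variable surfaces as a NullPointerException; real code should check.
        return Character.isDigit(token.charAt(0)) ? Double.parseDouble(token) : vars.get(token);
    }

    private boolean eat(char c) {
        skipSpaces();
        if (pos < src.length() && src.charAt(pos) == c) { pos++; return true; }
        return false;
    }

    private void skipSpaces() {
        while (pos < src.length() && src.charAt(pos) == ' ') pos++;
    }

    public static void main(String[] args) {
        Map<String, Double> vars = Map.of("BASIC_SALARY", 50000.0, "NO_PAY", 1200.0,
                "TOTAL_OT", 3000.0, "ALLOWANCE_TOTAL", 2500.0);
        System.out.println(new FormulaEvaluator(
                "BASIC_SALARY - NO_PAY + TOTAL_OT + ALLOWANCE_TOTAL", vars).expression()); // 54300.0
    }
}

Adding powers or roots is then just one more precedence level (or a function-call production) in the grammar.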
As noted in "Evaluating a math expression given in string form", there is a JavaScript engine available in Java which can evaluate a String expression with its operators.
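For example, a minimal sketch using the standard javax.script API (note that the built-in Nashorn JavaScript engine was deprecated in JDK 11 and removed in JDK 15, so on newer JDKs you would need to add an engine such as GraalJS as a dependency):

import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

public class ScriptEval {
    public static void main(String[] args) throws ScriptException {
        // Returns null if no JavaScript engine is available on this JDK.
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("JavaScript");
        // Bind the known variables before evaluating the stored formula.
        engine.put("BASIC_SALARY", 50000);
        engine.put("NO_PAY", 1200);
        engine.put("TOTAL_OT", 3000);
        engine.put("ALLOWANCE_TOTAL", 2500);
        Object result = engine.eval("BASIC_SALARY - NO_PAY + TOTAL_OT + ALLOWANCE_TOTAL");
        System.out.println(result); // 54300
    }
}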
Hope this helps.
You could build a string representation of a class that effectively wraps your expression and compile it using the system JavaCompiler (note that it requires a file system). You can evaluate strings directly using JavaScript or Groovy. In each case, you need to figure out a way to bind variables. One approach would be to use a regex to find and replace known variable names with a call to a binding function:
getValue("BASIC_SALARY") - getValue("NO_PAY") + getValue("TOTAL_OT") + getValue("ALLOWANCE_TOTAL")
or
getBASIC_SALARY() - getNO_PAY() + getTOTAL_OT() + getALLOWANCE_TOTAL()
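A minimal sketch of that find-and-replace step, producing the getValue(...) form above; the variable-name pattern and the class name are my own assumptions:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FormulaRewriter {
    // Assumed convention: variables are upper-case identifiers like BASIC_SALARY.
    private static final Pattern VAR = Pattern.compile("[A-Z][A-Z0-9_]*");

    static String rewrite(String formula) {
        Matcher m = VAR.matcher(formula);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(sb,
                    Matcher.quoteReplacement("getValue(\"" + m.group() + "\")"));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("BASIC_SALARY - NO_PAY + TOTAL_OT + ALLOWANCE_TOTAL"));
        // getValue("BASIC_SALARY") - getValue("NO_PAY") + getValue("TOTAL_OT") + getValue("ALLOWANCE_TOTAL")
    }
}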
This approach, however, exposes you to all kinds of injection-type security bugs, so it would not be appropriate where security is required. The approach is also weak when it comes to error diagnostics: how will you tell the user why their expression is broken?
An alternative is to use something like ANTLR to generate a parser in java. It's not too hard and there are a lot of examples. This approach will provide both security (users can't inject malicious code because it won't parse) and diagnostics.
This is the code sample which I want to parse. I want getSaveablePaymentMethodsSmartList() as a token when I override the function in the parserBaseListener.java file created by ANTLR.
/** #suppress */
public any function getSaveablePaymentMethodsSmartList() {
    if(!structKeyExists(variables, "saveablePaymentMethodsSmartList")) {
        variables.saveablePaymentMethodsSmartList = getService("paymentService").getPaymentMethodSmartList();
        variables.saveablePaymentMethodsSmartList.addFilter('activeFlag', 1);
        variables.saveablePaymentMethodsSmartList.addFilter('allowSaveFlag', 1);
        variables.saveablePaymentMethodsSmartList.addInFilter('paymentMethodType', 'creditCard,giftCard,external,termPayment');
        if(len(setting('accountEligiblePaymentMethods'))) {
            variables.saveablePaymentMethodsSmartList.addInFilter('paymentMethodID', setting('accountEligiblePaymentMethods'));
        }
    }
    return variables.saveablePaymentMethodsSmartList;
}
I already have the grammar that parses function declarations, but I need a new rule that can associate doctype comments with a function declaration and give the function name as a separate token if there is a doctype comment associated with it.
Grammar looks like this:
functionDeclaration
: accessType? typeSpec? FUNCTION identifier
LEFTPAREN parameterList? RIGHTPAREN
functionAttribute* body=compoundStatement
;
You want grammar rules that:
return X if something "far away" in the source is an A,
return Y if something far away is a B (or ...).
In general, this is context dependency. It is not handled well by context-free grammars, which are what ANTLR's BNF rules describe. In essence, what you think you want to do is encode history of what the parser has seen long ago, to influence what is being produced now. Generally that is hard.
The usual solution to something like this is to not address it in the grammar at all. Instead:
have the grammar rules produce an X regardless of what is far away,
build a tree as you parse (ANTLR does this for you); this captures not only X but everything about the parsed entity, including tokens for A that are far away
walk over the tree, interpreting a found X as Y if the tree contains the A (usually far away in the tree).
For your specific case of docstring-influences-function-name, you can probably get away with encoding that far-away history.
You need (IMHO, ugly) grammar rules that look something like this:
functionDeclaration: documented_function | undocumented_function ;
documented_function: docstring accessType? typeSpec? FUNCTION
documented_function_identifier rest_of_function ;
undocumented_function: accessType? typeSpec? FUNCTION
identifier rest_of_function ;
rest_of_function: // avoids duplication, not pretty
LEFTPAREN parameterList? RIGHTPAREN
functionAttribute* body=compoundStatement ;
You have to recognize the docstring as an explicit token that can be "seen" by the parser, which means modifying your lexer to turn docstring comments (normally skipped, like whitespace) into tokens. [This is the first ugly thing.] Then, having seen such a docstring, the lexer has to switch to a lexical mode that will pick up identifier-like text and produce documented_function_identifier, and then switch back to normal mode. [This is the second ugly thing.] What you are doing is literally implementing a context dependency.
The reason you can accomplish this in spite of my remarks about context dependency is that A is not very far away; it is within a few tokens of X.
So, you could do it this way. I would not do this; you are trying to make the parser do too much. Stick to the "usual solution". (You'll have a different problem: your A is a comment/whitespace, and probably isn't stored in the tree by ANTLR. You'll have to solve that; I'm not an ANTLR expert.)
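For what it's worth, a rough sketch of that "usual solution" with ANTLR4: keep comments on the hidden channel (a lexer rule along the lines of DOC_COMMENT : '/**' .*? '*/' -> channel(HIDDEN) ;) and ask the token stream for them while walking the tree. FunctionsParser and FunctionsBaseListener below are hypothetical stand-ins for whatever ANTLR generated from your grammar; only the token-stream calls are actual ANTLR runtime API:

import java.util.List;
import org.antlr.v4.runtime.BufferedTokenStream;
import org.antlr.v4.runtime.Token;

public class DocumentedFunctionListener extends FunctionsBaseListener {
    private final BufferedTokenStream tokens;

    public DocumentedFunctionListener(BufferedTokenStream tokens) {
        this.tokens = tokens;
    }

    @Override
    public void enterFunctionDeclaration(FunctionsParser.FunctionDeclarationContext ctx) {
        // Hidden-channel tokens are not in the tree, but the token stream can
        // still hand us the ones immediately before this declaration.
        List<Token> hidden = tokens.getHiddenTokensToLeft(
                ctx.getStart().getTokenIndex(), Token.HIDDEN_CHANNEL);
        if (hidden != null && !hidden.isEmpty()
                && hidden.get(hidden.size() - 1).getText().startsWith("/**")) {
            String name = ctx.identifier().getText();
            // name is the identifier of a documented function; handle it here.
        }
    }
}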
I had this question on a homework assignment (don't worry, already done):
[Using your favorite imperative language, give an example of
each of ...] An error that the compiler can neither catch nor easily generate code to
catch (this should be a violation of the language definition, not just a
program bug)
From "Programming Language Pragmatics" (3rd ed) Michael L. Scott
My answer: call main from main, passing in the same arguments (in C and Java), inspired by this. But I personally felt like that would just be a semantic error.
To me, this question is asking how to produce an error that is neither syntactic nor semantic, and frankly, I can't really think of a situation where it wouldn't fall into either.
Would it be code that is susceptible to exploitation, like buffer overflows (and maybe other exploits I've never heard about)? Some sort of pitfall arising from the structure of the language (I don't know, maybe lazy evaluation/weak type checking)? I'd like a simple example in Java/C++/C, but other examples are welcome.
Undefined behaviour springs to mind. A statement invoking UB is neither syntactically nor semantically incorrect, but rather the result of the code cannot be predicted and is considered erroneous.
An example of this would be (from the Wikipedia page) an attempt to modify a string-constant:
char * str = "Hello world!";
str[0] = 'h'; // undefined-behaviour here
Not all UB statements are so easily identified, though. Consider, for example, the possibility of signed-integer overflow in this case, if the user enters a number that is too big:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    // get number from user
    char input[100];
    fgets(input, sizeof input, stdin);
    int number = strtol(input, NULL, 10);
    // print its square: possible signed-integer overflow if number * number > INT_MAX
    printf("%i^2 = %i\n", number, number * number);
    return 0;
}
Here there may not necessarily be signed-integer overflow. And it is impossible to detect it at compile- or link-time since it involves user-input.
Statements invoking undefined behavior [1] are semantically as well as syntactically correct, but they make programs behave erratically.
a[i++] = i; // Syntax (symbolic representation) and semantics (meaning) are both correct. But this invokes UB.
Another example is using a pointer without initializing it.
Logical errors are also neither semantic nor syntactic.
[1] Undefined behavior: anything at all can happen; the Standard imposes no requirements. The program may fail to compile, or it may execute incorrectly (either crashing or silently generating incorrect results), or it may fortuitously do exactly what the programmer intended.
Here's an example for C++. Suppose we have a function:
int incsum(int &a, int &b) {
return ++a + ++b;
}
Then the following code has undefined behavior because it modifies an object twice with no intervening sequence point:
int i = 0;
incsum(i, i);
If the call to incsum is in a different translation unit (TU) from the definition of the function, then it's impossible to catch the error at compile time, because neither bit of code is inherently wrong on its own. It could be detected at link time by a sufficiently intelligent linker.
You can generate as many examples as you like of this kind, where code in one TU has behavior that's conditionally undefined for certain input values passed by another TU. I went for one that's slightly obscure, you could just as easily use an invalid pointer dereference or a signed integer arithmetic overflow.
You can argue how easy it is to generate code to catch this -- I wouldn't say it's very easy, but a compiler could notice that ++a + ++b is invalid if a and b alias the same object, and add the equivalent of assert (&a != &b); at that line. So detection code can be generated by local analysis.
I'm developing a software to generate a Turing Machine from a regular expression.
[ EDIT: To clarify, the OP wants to take a regular expression as input, and programmatically generate a Turing Machine to perform the same task. OP is seeking to perform the task of creating a TM from a regular expression, not using a regular expression. ]
First I'll explain a bit what I have done and then what is my specific problem:
I've modeled the regular expression as follows:
RegularExpression (interface): the classes below implement this interface
Simple (e.g. "aaa", "bbb", "abcde"): this is a leaf class; it does not have any subexpressions
ComplexWithoutOr (e.g. "a(ab)*", "(a(ab)c(b))*"): this class contains a list of RegularExpression
ComplexWithOr (e.g. "a(a|b)", "(a((ab)|c(b)))"): this class contains an Or operation, which contains a list of RegularExpression. It represents the "a|b" part of the first example and the "(ab)|c(b)" of the second one
Variable (e.g. "awcw", where w ∈ {a,b}*): this is not yet implemented, but the idea is to model it as a leaf class with some logic different from Simple. It represents the "w" part of the examples
It is important that you understand and agree with the model above. If you have questions, make a comment before continuing to read...
When it comes to TM generation, I have different levels of complexity:
Simple: this type of expression is already working. It generates a new state for each letter and moves right. If, in any state, the letter read is not the expected one, it starts a "rollback circuit" that finishes with the TM head in the initial position and stops in a non-final state.
ComplexWithoutOr: here is where my problem comes in. The algorithm generates a TM for each subexpression and concatenates them. This works for some simple cases, but I have problems with the rollback mechanism.
Here is an example that does not work with my algorithm:
"(ab)*abac": this is a ComplexWithoutOr expression that contains a ComplexWithoutOr subexpression "(ab)*" (which has a Simple expression, "ab", inside) and a Simple expression "abac"
My algorithm first generates a TM1 for "ab". This TM1 is used by the TM2 for "(ab)*": if TM1 succeeds, it enters TM1 again; otherwise TM1 rolls back and TM2 finishes successfully. In other words, TM2 cannot fail.
Then it generates a TM3 for "abac". The output of TM2 is the input of TM3. The output of TM3 is the result of the execution.
Now, let's suppose the input string is "abac". As you can see, it matches the regular expression. So let's see what happens when the TM is executed.
TM1 executes correctly the first time ("ab"). TM1 fails the second time ("ac") and rolls back, putting the TM head at the 3rd position ("a"). TM2 finishes successfully and the input is forwarded to TM3. TM3 fails, because "ac" != "abac". So the TM does not recognize "abac".
Do you understand the problem? Do you know any solution for this?
I'm using Java to develop it, but the language is not important; I'd like to discuss the algorithm.
It is not entirely clear to me what exactly you are trying to implement. It looks like you want to make a Turing Machine (or any FSM in general) that accepts only those strings that are also accepted by the regular expression. In effect, you want to convert a regular expression to a FSM.
Actually that is exactly what a real regex matcher does under the hood. I think this series of articles by Russ Cox covers a lot of what you want to do.
Michael Sipser, in Introduction to the Theory of Computation, proves in chapter 1 that regular expressions are equivalent to finite automata in their descriptive power. Part of the proof involves constructing a nondeterministic finite automaton (NDFA) that recognizes the language described by a specific regular expression. I'm not about to copy half that chapter, which would be quite hard due to the notation used, so I suggest you borrow or purchase the book (or perhaps a Google search using these terms will turn up a similar proof) and use that proof as the basis for your algorithm.
As Turing machines can simulate an NDFA, I assume an algorithm to produce an NDFA is good enough.
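To make the construction concrete, here is a minimal Java sketch (not the asker's algorithm) of Thompson's construction plus the standard set-of-states NFA simulation, for a tiny regex language with literals, parentheses, | and *; all names are illustrative:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

public class ThompsonNfa {
    // An NFA state: a character state (symbol != '\0', edge out1) or an
    // epsilon state (symbol == '\0', edges out1/out2).
    static final class State {
        final char symbol; State out1, out2;
        State(char symbol) { this.symbol = symbol; }
    }
    // A fragment: one start state and one accept state with no outgoing edges yet.
    static final class Frag {
        final State start, accept;
        Frag(State start, State accept) { this.start = start; this.accept = accept; }
    }

    private final String re; private int pos;
    private ThompsonNfa(String re) { this.re = re; }
    private boolean more() { return pos < re.length(); }

    // alternation := concat ('|' concat)*
    private Frag alternation() {
        Frag f = concat();
        while (more() && re.charAt(pos) == '|') {
            pos++;
            Frag g = concat();
            State start = new State('\0'), accept = new State('\0');
            start.out1 = f.start; start.out2 = g.start;      // branch to either side
            f.accept.out1 = accept; g.accept.out1 = accept;  // and join afterwards
            f = new Frag(start, accept);
        }
        return f;
    }
    // concat := repetition+
    private Frag concat() {
        Frag f = repetition();
        while (more() && re.charAt(pos) != '|' && re.charAt(pos) != ')') {
            Frag g = repetition();
            f.accept.out1 = g.start;  // epsilon edge gluing the fragments
            f = new Frag(f.start, g.accept);
        }
        return f;
    }
    // repetition := atom '*'?
    private Frag repetition() {
        Frag f = atom();
        if (more() && re.charAt(pos) == '*') {
            pos++;
            State start = new State('\0'), accept = new State('\0');
            start.out1 = f.start; start.out2 = accept;        // enter the loop, or skip it
            f.accept.out1 = f.start; f.accept.out2 = accept;  // loop again, or leave
            f = new Frag(start, accept);
        }
        return f;
    }
    // atom := '(' alternation ')' | literal
    private Frag atom() {
        if (re.charAt(pos) == '(') {
            pos++;
            Frag f = alternation();
            pos++; // consume ')'
            return f;
        }
        State s = new State(re.charAt(pos++)), accept = new State('\0');
        s.out1 = accept;
        return new Frag(s, accept);
    }

    // Simulate the NFA: track the whole set of reachable states at once.
    static boolean matches(String regex, String input) {
        Frag nfa = new ThompsonNfa(regex).alternation();
        Set<State> current = closure(Set.of(nfa.start));
        for (char c : input.toCharArray()) {
            Set<State> next = new HashSet<>();
            for (State s : current) {
                if (s.symbol == c) next.add(s.out1);
            }
            current = closure(next);
        }
        return current.contains(nfa.accept);
    }
    // Epsilon-closure: everything reachable through epsilon edges.
    private static Set<State> closure(Set<State> states) {
        Set<State> result = new HashSet<>(states);
        Deque<State> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            State s = work.pop();
            if (s.symbol != '\0') continue; // character states have no epsilon edges
            for (State t : new State[] { s.out1, s.out2 }) {
                if (t != null && result.add(t)) work.push(t);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(matches("(ab)*abac", "abac"));   // true
        System.out.println(matches("(ab)*abac", "ababac")); // true
        System.out.println(matches("(ab)*abac", "abab"));   // false
    }
}

Note how the simulation never commits to looping or leaving a star: it carries both possibilities forward in the state set, which is exactly why the greedy-commit/rollback problem from the question disappears.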
In the Chomsky hierarchy, a regex is Type-3, whereas a TM is Type-0. This means that a TM can recognize any language a regex can describe, but not vice versa.
I have spent the last 5 days trying to understand how the unification algorithm works in Prolog.
Now I want to implement such an algorithm in Java.
I thought maybe the best way is to manipulate the string and decompose its parts using some data structure, such as a stack.
To make it clear:
Suppose the user's input is:
a(X,c(d,X)) = a(2,c(d,Y)).
I already take it as one string and split it into two strings (Expression 1 and 2).
Now, how can I know whether the next char(s) form a variable, a constant, etc.?
I could do it with nested ifs, but that doesn't seem like a good solution to me.
I tried to use inheritance, but the problem remains: how can I know the type of the chars being read?
First you need to parse the inputs and build expression trees. Then apply Milner's unification algorithm (or some other unification algorithm) to figure out the mapping of variables to constants and expressions.
A really good description of Milner's algorithm may be found in the Dragon Book: "Compilers: Principles, Techniques, and Tools" by Aho, Sethi and Ullman. (Milner's algorithm can also cope with unification of cyclic graphs, and the Dragon Book presents it as a way to do type inference.) By the sounds of it, you could benefit from learning a bit about parsing... which is also covered by the Dragon Book.
EDIT: Other answers have suggested using a parser generator; e.g. ANTLR. That's good advice, but (judging from your example) your grammar is so simple that you could also get by with using StringTokenizer and a hand-written recursive descent parser. In fact, if you've got the time (and inclination) it is worth implementing the parser both ways as a learning exercise.
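To make the hand-written option concrete, here is a minimal sketch of such a recursive descent parser for terms like a(X,c(d,X)), using the Prolog convention that an identifier starting with an upper-case letter (or underscore) is a variable; all class and method names are illustrative:

import java.util.ArrayList;
import java.util.List;

interface Term {}

class Atom implements Term {
    final String name;
    Atom(String name) { this.name = name; }
    public String toString() { return name; }
}

class Variable implements Term {
    final String name;
    Variable(String name) { this.name = name; }
    public String toString() { return name; }
}

class CompoundTerm implements Term {
    final String functor;
    final List<Term> args;
    CompoundTerm(String functor, List<Term> args) { this.functor = functor; this.args = args; }
    public String toString() { return functor + args; }
}

public class TermParser {
    private final String src;
    private int pos;

    TermParser(String src) { this.src = src; }

    // term := name [ '(' term (',' term)* ')' ]
    Term parseTerm() {
        String name = readName();
        if (!eat('(')) {
            // Prolog convention: initial upper-case letter or '_' means variable.
            char first = name.charAt(0);
            return (Character.isUpperCase(first) || first == '_')
                    ? new Variable(name) : new Atom(name);
        }
        List<Term> args = new ArrayList<>();
        args.add(parseTerm());
        while (eat(',')) args.add(parseTerm());
        eat(')');
        return new CompoundTerm(name, args);
    }

    private String readName() {
        skipSpaces();
        int start = pos;
        while (pos < src.length()
                && (Character.isLetterOrDigit(src.charAt(pos)) || src.charAt(pos) == '_')) pos++;
        return src.substring(start, pos);
    }

    private boolean eat(char c) {
        skipSpaces();
        if (pos < src.length() && src.charAt(pos) == c) { pos++; return true; }
        return false;
    }

    private void skipSpaces() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }

    public static void main(String[] args) {
        System.out.println(new TermParser("a(X, c(d, X))").parseTerm()); // a[X, c[d, X]]
    }
}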
It sounds like this problem is more to do with parsing than unification specifically. Using something like ANTLR might help in terms of turning the original string into some kind of tree structure.
(It's not quite clear what you mean by "do it by nested if", but if you mean that you're doing something like trying to read an expression, and recursing when meeting each "(", then that's actually one of the right ways to do it -- this is at heart what the code that ANTLR generates for you will do.)
If you are more interested in the mechanics of unifying things than you are in parsing, then one perfectly good way to do this is to construct the internal representation in code directly, and put off the parsing aspect for now. This can get a bit annoying during development, as your Prolog-style statements are now a rather verbose set of Java statements, but it lets you focus on one problem at a time, which is usually helpful.
(If you structure things this way, this should make it straightforward to insert a proper parser later, that will produce the same sort of tree as you have until then been constructing by hand. This will let you attack the two problems separately in a reasonably neat fashion.)
Before you get to do the semantics of the language, you have to convert the text into a form that's easy to operate on. This process is called parsing and the semantic representation is called an abstract syntax tree (AST).
A simple recursive descent parser for Prolog might be hand-written, but it's more common to use a parser toolkit such as Rats! or ANTLR.
In an AST for Prolog, you might have a class for Term, where CompoundTerm, Variable, and Atom are all Terms. Polymorphism allows the arguments to a compound term to be any Term.
Your unification algorithm then becomes unifying the name of any compound term, and recursively unifying the value of each argument of corresponding compound terms.
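A minimal sketch of that recursive unification step, reusing the Term, Atom, Variable, and CompoundTerm classes from the parser sketch earlier in this section; note that, like Prolog's default unification, it omits the occurs check:

import java.util.HashMap;
import java.util.Map;

public class Unifier {
    // The substitution maps variable names to the terms they are bound to.
    static boolean unify(Term a, Term b, Map<String, Term> subst) {
        a = resolve(a, subst);
        b = resolve(b, subst);
        if (a instanceof Variable) {
            String name = ((Variable) a).name;
            // Avoid binding a variable to itself (X = X succeeds with no binding).
            if (!(b instanceof Variable && ((Variable) b).name.equals(name))) {
                subst.put(name, b);
            }
            return true;
        }
        if (b instanceof Variable) { subst.put(((Variable) b).name, a); return true; }
        if (a instanceof Atom && b instanceof Atom) {
            return ((Atom) a).name.equals(((Atom) b).name);
        }
        if (a instanceof CompoundTerm && b instanceof CompoundTerm) {
            CompoundTerm x = (CompoundTerm) a, y = (CompoundTerm) b;
            if (!x.functor.equals(y.functor) || x.args.size() != y.args.size()) return false;
            for (int i = 0; i < x.args.size(); i++) {
                if (!unify(x.args.get(i), y.args.get(i), subst)) return false;
            }
            return true;
        }
        return false;
    }

    // Follow bindings until reaching an unbound variable or a non-variable term.
    static Term resolve(Term t, Map<String, Term> subst) {
        while (t instanceof Variable && subst.containsKey(((Variable) t).name)) {
            t = subst.get(((Variable) t).name);
        }
        return t;
    }

    public static void main(String[] args) {
        // The example from the question: a(X,c(d,X)) = a(2,c(d,Y)).
        Term left = new TermParser("a(X, c(d, X))").parseTerm();
        Term right = new TermParser("a(2, c(d, Y))").parseTerm();
        Map<String, Term> subst = new HashMap<>();
        System.out.println(unify(left, right, subst)); // true
        System.out.println(subst);                     // {X=2, Y=2} (order may vary)
    }
}

For the question's example, this binds X to 2 and then, because X already resolves to 2 when its second occurrence is compared against Y, binds Y to 2 as well.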