Iterating over tokens in HIDDEN channel - java

I am currently working on creating an IDE for MobTalkerScript (MTS), a custom, very Lua-like scripting language which provides me with an ANTLR4 lexer. Since the MTS grammar puts comments into the HIDDEN_CHANNEL channel, I need to tell the lexer to actually read from that channel. This is how I tried to do that:
Mts3Lexer lexer = new Mts3Lexer(new ANTLRInputStream("<replace this with the input>"));
lexer.setTokenFactory(new CommonTokenFactory(false));
lexer.setChannel(Token.HIDDEN_CHANNEL);
Token token = lexer.emit();
int type = token.getType();
do {
    switch (type) {
        case Mts3Lexer.LINE_COMMENT:
        case Mts3Lexer.COMMENT:
            System.out.println("token " + token.getText() + " is a comment");
            break;
        default:
            System.out.println("token " + token.getText() + " is not a comment");
    }
} while ((token = lexer.nextToken()) != null && (type = token.getType()) != Token.EOF);
Now, if I use this code on the following input, nothing but token ... is not a comment gets printed to the console.
function foo()
    -- this should be a single-line comment
    something = "blah"
    --[[ this should
    be a multi-line
    comment ]]--
end
The tokens containing the comments never show up, though. So I searched for the source of this problem and found the following method in the ANTLR4 Lexer class:
/** Return a token from this source; i.e., match a token on the char
 *  stream.
 */
@Override
public Token nextToken() {
    if (_input == null) {
        throw new IllegalStateException("nextToken requires a non-null input stream.");
    }

    // Mark start location in char stream so unbuffered streams are
    // guaranteed at least have text of current token
    int tokenStartMarker = _input.mark();
    try {
        outer:
        while (true) {
            if (_hitEOF) {
                emitEOF();
                return _token;
            }

            _token = null;
            _channel = Token.DEFAULT_CHANNEL;
            _tokenStartCharIndex = _input.index();
            _tokenStartCharPositionInLine = getInterpreter().getCharPositionInLine();
            _tokenStartLine = getInterpreter().getLine();
            _text = null;
            do {
                _type = Token.INVALID_TYPE;
//              System.out.println("nextToken line "+tokenStartLine+" at "+((char)input.LA(1))+
//                                 " in mode "+mode+
//                                 " at index "+input.index());
                int ttype;
                try {
                    ttype = getInterpreter().match(_input, _mode);
                }
                catch (LexerNoViableAltException e) {
                    notifyListeners(e);     // report error
                    recover(e);
                    ttype = SKIP;
                }
                if ( _input.LA(1)==IntStream.EOF ) {
                    _hitEOF = true;
                }
                if ( _type == Token.INVALID_TYPE ) _type = ttype;
                if ( _type == SKIP ) {
                    continue outer;
                }
            } while ( _type == MORE );
            if ( _token == null ) emit();
            return _token;
        }
    }
    finally {
        // make sure we release marker after match or
        // unbuffered char stream will keep buffering
        _input.release(tokenStartMarker);
    }
}
The line that caught my eye was the following.
_channel = Token.DEFAULT_CHANNEL;
I don't know much about ANTLR, but apparently this line resets the lexer to the DEFAULT_CHANNEL channel before every token is matched.
Is the way I tried to read from the HIDDEN_CHANNEL channel right, or can't I use nextToken() with the hidden channel?

I found out why the lexer didn't give me any tokens containing the comments - I seem to have missed that the grammar file skips comments instead of putting them into the hidden channel. Contacted the author, changed the grammar file and now it works.
Note to myself: pay more attention to what you read.
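
For reference, once the grammar routes comments to the hidden channel (-> channel(HIDDEN) instead of -> skip), a minimal sketch for collecting them with a buffered token stream could look like this; the lexer class and token-type names are assumed from the question:

Mts3Lexer lexer = new Mts3Lexer(new ANTLRInputStream("<replace this with the input>"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill(); // buffer every token, including those on non-default channels
for (Token t : tokens.getTokens()) {
    if (t.getChannel() == Token.HIDDEN_CHANNEL) {
        System.out.println("token " + t.getText() + " is a comment");
    }
}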

For Go (golang) this snippet works for me:
import (
    "github.com/antlr/antlr4/runtime/Go/antlr"
)

type antlrparser interface {
    GetParser() antlr.Parser
}

func fullText(prc antlr.ParserRuleContext) string {
    p := prc.(antlrparser).GetParser()
    ts := p.GetTokenStream()
    tx := ts.GetTextFromTokens(prc.GetStart(), prc.GetStop())
    return tx
}
Just pass your ctx.GetSomething() into fullText. Of course, as shown above, whitespace has to go to the hidden channel in the *.g4 file:
WS: [ \t\r\n] -> channel(HIDDEN);

How to efficiently check if read line from Buffered reader contains a string from an enum list

I am a computer science university student working on my first 'big' project outside of class. I'm attempting to read through large text files (2,000 - 3,000 lines of text) line by line with a BufferedReader. When a keyword from a list of enums is located, I want to send the current line from the BufferedReader to the appropriate method so it can be handled appropriately.
I have a solution, but I have a feeling in my gut that there is a much better way to handle this situation. Any suggestions or feedback would be greatly appreciated.
Current Solution
I am looping through the list of enums, then checking whether the current enum's toString return value appears in the current line from the BufferedReader, using the String.contains method.
If the enum is located, it is used in a switch statement for the appropriate method call. (I have 13 total cases; I just wanted to keep the code sample short.)
try (BufferedReader reader = new BufferedReader(new FileReader(inputFile.getAbsoluteFile()))) {
    while ((currentLine = reader.readLine()) != null) {
        for (GameFileKeys gameKey : GameFileKeys.values()) {
            if (currentLine.contains(gameKey.toString())) {
                switch (gameKey) {
                    case SEAT -> seatAndPlayerAssignment(currentTableArr, currentLine);
                    case ANTE -> playerJoinLate(currentLine);
                }
            }
        }
    }
}
Previous Solution
Originally, I had a nasty list of if statements checking whether the current line contained one of the keywords and then handling it appropriately. Clearly that is far from optimal, but my gut tells me that my current solution is also less than optimal.
try (BufferedReader reader = new BufferedReader(new FileReader(inputFile.getAbsoluteFile()))) {
    while ((currentLine = reader.readLine()) != null) {
        if (currentLine.contains(GameFileKeys.SEAT.toString())) {
            seatAndPlayerAssignment(currentTableArr, currentLine);
        }
        else if (currentLine.contains(GameFileKeys.ANTE.toString())) {
            playerJoinLate(currentLine);
        }
    }
}
Enum Class
In case you need this, or have any general feedback on how I'm implementing my enums:
public enum GameFileKeys {
    ANTE("posts ante"),
    SEAT("Seat ");

    private final String gameKey;

    GameFileKeys(String str) {
        this.gameKey = str;
    }

    @Override
    public String toString() {
        return gameKey;
    }
}
I cannot improve over the core of your code: the looping on values() of the enum, performing a String#contains for each enum object’s string, and using a switch. I can make a few minor suggestions.
I suggest you not override the toString method on your enum. The Object#toString method is generally best used only for debugging and logging, not logic or presentation.
Your string passed to constructor of the enum is likely similar to the idea of a display name commonly seen in such enums. The formal enum name (all caps) is used internally within Java, while the display name is used for display to the user or exchanged with external systems. See the Month and DayOfWeek enums as examples offering a getDisplayName method.
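For example, with the built-in java.time.Month:

// Formal enum name versus localized display name.
String formal = Month.JANUARY.name();                                        // "JANUARY"
String display = Month.JANUARY.getDisplayName( TextStyle.FULL , Locale.US ); // "January"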
Also, an enum should be named in the singular. This avoids confusion with any collections of the enum’s objects.
By the way, looks like you have a stray SPACE in your second enum's argument.
At first I thought it would help to have a list of all the display names, and a map of display name to enum object. However, in the end neither is needed for your purpose. I kept those as they might prove interesting.
public enum GameFileKey
{
    ANTE( "posts ante" ),
    SEAT( "Seat" );

    private String displayName = null;

    private static final List < String > allDisplayNames =
            Arrays.stream( GameFileKey.values() ).map( GameFileKey :: getDisplayName ).toList();
    private static final Map < String, GameFileKey > mapOfDisplayNameToGameFileKey =
            Arrays.stream( GameFileKey.values() )
                  .collect( Collectors.toUnmodifiableMap( GameFileKey :: getDisplayName , Function.identity() ) );

    GameFileKey ( String str ) { this.displayName = str; }

    public String getDisplayName ( ) { return this.displayName; }

    public static GameFileKey forDisplayName ( final String displayName )
    {
        return
                Objects.requireNonNull(
                        GameFileKey.mapOfDisplayNameToGameFileKey.get( displayName ) ,
                        "None of the " + GameFileKey.class.getCanonicalName() + " enum objects has a display name of: " + displayName + ". Message # 4dcefee2-4aa2-48cf-bf66-9a4bde02ac37." );
    }

    public static List < String > allDisplayNames ( ) { return GameFileKey.allDisplayNames; }
}
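A quick usage example against this reworked enum:

GameFileKey key = GameFileKey.forDisplayName( "posts ante" );  // GameFileKey.ANTE
List < String > names = GameFileKey.allDisplayNames();         // ["posts ante", "Seat"]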
You can use a stream of the lines of your file being processed. Just FYI, not necessarily better than your code.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class Demo
{
    public static void main ( String[] args )
    {
        Demo app = new Demo();
        app.demo();
    }

    private void demo ( )
    {
        try
        {
            Path path = Demo.getFilePathToRead();
            Stream < String > lines = Files.lines( path );
            lines.forEach(
                    line -> {
                        for ( GameFileKey gameKey : GameFileKey.values() )
                        {
                            if ( line.contains( gameKey.getDisplayName() ) )
                            {
                                switch ( gameKey )
                                {
                                    case SEAT -> this.seatAndPlayerAssignment( line );
                                    case ANTE -> this.playerJoinLate( line );
                                }
                            }
                        }
                    }
            );
        }
        catch ( IOException e )
        {
            throw new RuntimeException( e );
        }
    }

    private void playerJoinLate ( String line )
    {
        System.out.println( "line = " + line );
    }

    private void seatAndPlayerAssignment ( String line )
    {
        System.out.println( "line = " + line );
    }

    public static Path getFilePathToRead ( ) throws IOException
    {
        Path tempFile = Files.createTempFile( "bogus" , ".txt" );
        Files.write( tempFile , "apple\nSeat\norange\nposts ante\n".getBytes() );
        return tempFile;
    }
}
When run:
line = Seat
line = posts ante

Checking which parameter is missing from file content

I have a TransferReader class which reads a file containing transfer data from one bank account to another, using the following form:
SenderAccountID,ReceiverAccountID,Amount,TransferDate
"473728292,474728298,1500.00,2019-10-17 12:34:12" (unmodified string)
Suppose that the file has been modified before being read so that one of the above-mentioned parameters is missing, and I want to check which of them is missing.
"474728298,1500.00,2019-10-17 12:34:12" (modified string)
I am using a BufferedReader to read each line, and then splitting each line into a String[] using String.split(",") with the comma as the delimiter.
As already realized, because the Sender Account ID and Receiver Account ID sit right next to one another within a record, there is no real way of knowing which ID might be missing unless a delimiter remains in its place indicating a null value. There are, however, mechanisms available to determine that it is indeed one of the two that is missing; deciding which one will require user scrutiny, and even then that may not be good enough. The other record column fields, like Amount and Transfer Date, can be easily validated, or if missing can be flagged within a specific File Data Status Log.
Below is some code that will read a data file (named Data.csv) and log potential data line (record) errors into a List Interface object, which is iterated through and displayed within the console window when the read is complete. There are also some small helper methods. Here is the code:
private void checkDataFile(String filePath) {
    String ls = System.lineSeparator();
    List<String> validationFailures = new ArrayList<>();
    StringBuilder sb = new StringBuilder();

    // 'Try With Resources' used here to auto-close reader.
    try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
        String line;
        int lineCount = 0;
        // Read the file line-by-line.
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            lineCount++;
            if (lineCount == 1 || line.equals("")) {
                continue;
            }
            sb.delete(0, sb.capacity()); // Clear the StringBuilder object

            // Start the Status Log
            sb.append("File Line Number: ").append(lineCount)
              .append(" (\"").append(line).append("\")").append(ls);

            // Split line into an Array based on a comma delimiter
            // regardless of the delimiter's spacing situation.
            String[] lineParts = line.split("\\s{0,},\\s{0,}");

            /* Validate each file line. Log any line that fails
               any validation for any record column data into a
               List Interface object named: validationFailures   */
            // Are there 4 Columns of data in each line...
            if (lineParts.length < 4) {
                sb.append("\t- Invalid Column Count!").append(ls);
                // Which column is missing...
                // *** You may need to add more conditions to suit your needs. ***
                if (checkAccountIDs(lineParts[0]) && lineParts.length >= 2 && !checkAccountIDs(lineParts[1])) {
                    sb.append("\t- Either the 'Sender Account ID' or the "
                            + "'ReceiverAccountID' is missing!").append(ls);
                }
                else if (lineParts.length >= 3 && !checkAmount(lineParts[2])) {
                    sb.append("\t- The 'Amount' value is missing!").append(ls);
                }
                else if (lineParts.length < 4) {
                    sb.append("\t- The 'Transfer Date' is missing!").append(ls);
                }
            }
            else {
                // Is SenderAccountID data valid...
                if (!checkAccountIDs(lineParts[0])) {
                    sb.append("\t- Invalid Sender Account ID in column 1! (")
                      .append(lineParts[0].equals("") ? "Null" : lineParts[0]).append(")");
                    if (lineParts[0].length() < 9) {
                        sb.append(" <-- Not Enough Or No Digits!").append(ls);
                    }
                    else if (lineParts[0].length() > 9) {
                        sb.append(" <-- Too Many Digits!").append(ls);
                    }
                    else {
                        sb.append(" <-- Not All Digits!").append(ls);
                    }
                }
                // Is ReceiverAccountID data valid...
                if (!checkAccountIDs(lineParts[1])) {
                    sb.append("\t- Invalid Receiver Account ID in column 2! (")
                      .append(lineParts[1].equals("") ? "Null" : lineParts[1]).append(")");
                    if (lineParts[1].length() < 9) {
                        sb.append(" <-- Not Enough Or No Digits!").append(ls);
                    }
                    else if (lineParts[1].length() > 9) {
                        sb.append(" <-- Too Many Digits!").append(ls);
                    }
                    else {
                        sb.append(" <-- Not All Digits!").append(ls);
                    }
                }
                // Is Amount data valid...
                if (!checkAmount(lineParts[2])) {
                    sb.append("\t- Invalid Amount Value in column 3! (")
                      .append(lineParts[2].equals("") ? "Null" : lineParts[2]).append(")").append(ls);
                }
                // Is TransferDate data valid...
                if (!checkTransferDate(lineParts[3], "yyyy-MM-dd HH:mm:ss")) {
                    sb.append("\t- Invalid Transfer Date Timestamp in column 4! (")
                      .append(lineParts[3].equals("") ? "Null" : lineParts[3]).append(")").append(ls);
                }
            }
            if (!sb.toString().equals("")) {
                validationFailures.add(sb.toString());
            }
        }
    }
    catch (FileNotFoundException ex) {
        System.err.println(ex.getMessage());
    }
    catch (IOException ex) {
        System.err.println(ex.getMessage());
    }

    // Display the Log...
    String timeStamp = new SimpleDateFormat("yyyy/MM/dd - hh:mm:ssa")
            .format(new Timestamp(System.currentTimeMillis()));
    String dispTitle = "File Data Status at " + timeStamp.toLowerCase()
            + " <:-:> (" + filePath + "):";
    System.out.println(dispTitle + ls + String.join("",
            Collections.nCopies(dispTitle.length(), "=")) + ls);
    if (validationFailures.size() > 0) {
        for (String str : validationFailures) {
            if (str.split(ls).length > 1) {
                System.out.println(str);
                System.out.println(String.join("", Collections.nCopies(80, "-")) + ls);
            }
        }
    }
    else {
        System.out.println("No Issues Detected!" + ls);
    }
}
private boolean checkAccountIDs(String accountID) {
    return (accountID.matches("\\d+") && accountID.length() == 9);
}

private boolean checkAmount(String amount) {
    return amount.matches("-?\\d+(\\.\\d+)?");
}

private boolean checkTransferDate(String transferDate, String format) {
    return isValidDateString(transferDate, format);
}

private boolean isValidDateString(String dateToValidate, String dateFormat) {
    if (dateToValidate == null || dateToValidate.equals("")) {
        return false;
    }
    SimpleDateFormat sdf = new SimpleDateFormat(dateFormat);
    sdf.setLenient(false);
    try {
        // If not valid, it will throw a ParseException
        Date date = sdf.parse(dateToValidate);
        return true;
    }
    catch (ParseException e) {
        return false;
    }
}
I'm not exactly sure what your particular application process will ultimately entail, but if other processes are accessing the file and making modifications to it, then it may be wise to utilize a locking mechanism to lock the file during your process and unlock it when done. This, however, will most likely require you to utilize a different reading algorithm, since locking a file must be done through a writable channel. The FileChannel and FileLock classes from the java.nio package could assist you here; there are examples of how to utilize these classes within the StackOverflow forum, and a rough sketch follows below.
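A minimal sketch of that locking approach (the file name is assumed; adjust the open options and charset to your situation):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class LockedRead {
    public static void main(String[] args) throws IOException {
        // An exclusive FileLock requires a writable channel, hence READ and WRITE.
        try (FileChannel channel = FileChannel.open(Paths.get("Data.csv"),
                StandardOpenOption.READ, StandardOpenOption.WRITE);
             FileLock lock = channel.lock()) {
            BufferedReader reader = new BufferedReader(
                    Channels.newReader(channel, StandardCharsets.UTF_8.name()));
            String line;
            while ((line = reader.readLine()) != null) {
                // Validate each line here, as in checkDataFile(...)
            }
        } // The lock is released when the channel closes.
    }
}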

How to implement the visitor pattern for nested function

I am a newbie to ANTLR and I want to implement the below using ANTLR4. I have the following functions:
1. FUNCTION.add(Integer a,Integer b)
2. FUNCTION.concat(String a,String b)
3. FUNCTION.mul(Integer a,Integer b)
And I am storing the functions metadata like this.
Map<String,String> map=new HashMap<>();
map.put("FUNCTION.add","Integer:Integer,Integer");
map.put("FUNCTION.concat","String:String,String");
map.put("FUNCTION.mul","Integer:Integer,Integer");
Where Integer:Integer,Integer means that the first Integer is the return type and the input params the function will accept are Integer,Integer.
If the input is something like this:
FUNCTION.concat(Function.substring(String,Integer,Integer),String)
or
FUNCTION.concat(Function.substring("test",1,1),String)
Using the visitor implementation, I want to check whether the input is valid or not against the functions metadata stored in the map.
Below is the lexer and parser that I'm using:
Lexer MyFunctionsLexer.g4:
lexer grammar MyFunctionsLexer;
FUNCTION: 'FUNCTION';
NAME: [A-Za-z0-9]+;
DOT: '.';
COMMA: ',';
L_BRACKET: '(';
R_BRACKET: ')';
Parser MyFunctionsParser.g4:
parser grammar MyFunctionsParser;
options {
tokenVocab=MyFunctionsLexer;
}
function : FUNCTION '.' NAME '('(function | argument (',' argument)*)')';
argument: (NAME | function);
WS : [ \t\r\n]+ -> skip;
I am using Antlr4.
Below is the implementation I'm using as per the suggested answer.
Visitor Implementation:
public class FunctionValidateVisitorImpl extends MyFunctionsParserBaseVisitor<String> {

    Map<String, String> map = new HashMap<String, String>();

    public FunctionValidateVisitorImpl() {
        map.put("FUNCTION.add", "Integer:Integer,Integer");
        map.put("FUNCTION.concat", "String:String,String");
        map.put("FUNCTION.mul", "Integer:Integer,Integer");
        map.put("FUNCTION.substring", "String:String,Integer,Integer");
    }

    @Override
    public String visitFunctions(@NotNull MyFunctionsParser.FunctionsContext ctx) {
        System.out.println("entered the visitFunctions::");
        for (int i = 0; i < ctx.getChildCount(); ++i) {
            ParseTree c = ctx.getChild(i);
            if (c.getText().equals("<EOF>"))
                continue;
            String top_level_result = visit(ctx.getChild(i));
            System.out.println(top_level_result);
            if (top_level_result == null) {
                System.out.println("Failed semantic analysis: " + ctx.getChild(i).getText());
            }
        }
        return null;
    }

    @Override
    public String visitFunction(MyFunctionsParser.FunctionContext ctx) {
        // Get function name and expected type information.
        String name = ctx.getChild(2).getText();
        String type = map.get("FUNCTION." + name);
        if (type == null) {
            return null; // not declared in function table.
        }
        String result_type = type.split(":")[0];
        String args_types = type.split(":")[1];
        String[] expected_arg_type = args_types.split(",");
        int j = 4;
        ParseTree a = ctx.getChild(j);
        if (a instanceof MyFunctionsParser.FunctionContext) {
            String v = visit(a);
            if (!result_type.equals(v)) {
                return null; // Handle type mismatch.
            }
        } else {
            for (int i = j; i < ctx.getChildCount(); i += 2) {
                ParseTree parameter = ctx.getChild(i);
                String v = visit(parameter);
                if (!expected_arg_type[(i - j) / 2].equals(v)) {
                    return null; // Handle type mismatch.
                }
            }
        }
        return result_type;
    }

    @Override
    public String visitArgument(ArgumentContext ctx) {
        ParseTree c = ctx.getChild(0);
        if (c instanceof TerminalNodeImpl) {
            // Unclear what this is supposed to parse:
            // Mutate "1" to "Integer"?
            // Mutate "Integer" to "String"?
            // Or what?
            return c.getText();
        }
        else
            return visit(c);
    }
}
Test class:
public class FunctionValidate {
    public static void main(String[] args) {
        String input = "FUNCTION.concat(FUNCTION.substring(String,Integer,Integer),String)";
        ANTLRInputStream str = new ANTLRInputStream(input);
        MyFunctionsLexer lexer = new MyFunctionsLexer(str);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        MyFunctionsParser parser = new MyFunctionsParser(tokens);
        parser.removeErrorListeners(); // remove ConsoleErrorListener
        parser.addErrorListener(new VerboseListener()); // add ours
        FunctionsContext tree = parser.functions();
        FunctionValidateVisitorImpl visitor = new FunctionValidateVisitorImpl();
        visitor.visit(tree);
    }
}
Lexer:
lexer grammar MyFunctionsLexer;
FUNCTION: 'FUNCTION';
NAME: [A-Za-z0-9]+;
DOT: '.';
COMMA: ',';
L_BRACKET: '(';
R_BRACKET: ')';
WS : [ \t\r\n]+ -> skip;
Parser:
parser grammar MyFunctionsParser;
options { tokenVocab=MyFunctionsLexer; }
functions : function* EOF;
function : FUNCTION '.' NAME '(' (function | argument (',' argument)*) ')';
argument: (NAME | function);
Verbose Listener:
public class VerboseListener extends BaseErrorListener {
    @Override
    public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line, int charPositionInLine, String msg, RecognitionException e) {
        List<String> stack = ((Parser) recognizer).getRuleInvocationStack();
        Collections.reverse(stack);
        throw new FunctionInvalidException("line " + line + ":" + charPositionInLine + " at " + offendingSymbol + ": " + msg);
    }
}
Output:
It is not entering the visitor implementation, as it never prints the System.out.println("entered the visitFunctions::"); statement.
Below is a solution in C#. This should give you an idea of how to proceed. You should be able to easily translate the code to Java.
For ease, I implemented the code using my extension AntlrVSIX for Visual Studio 2019 with NET Core C#. It makes life easier using a full IDE that supports the building of split lexer/parser grammars, debugging, and a plug-in that is suited for editing Antlr grammars.
There are several things to note with your grammar. First, your parser grammar isn't accepted by Antlr 4.7.2: the production "WS : [ \t\r\n]+ -> skip;" is a lexer rule, so it can't go in a parser grammar. It has to go into the lexer grammar (or you define a combined grammar). Second, I personally wouldn't define lexer symbols like DOT and then use the RHS of the lexer symbol directly in the parser grammar, e.g., '.'. It's confusing, and I'm pretty sure there isn't an IDE or editor that would know how to go to the definition "DOT: '.';" in the lexer grammar if you positioned your cursor on the '.' in the parser grammar. I never understood why it's allowed in Antlr, but c'est la vie. I would instead use the lexer symbols you define (see the sketch below). Third, I would consider augmenting the parser grammar in the usual way with EOF, e.g., "functions : function* EOF;". But this is entirely up to you.
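For illustration only (the solution below keeps the original literals), the parser grammar rewritten to use the declared token symbols and an EOF-anchored start rule would read:

parser grammar MyFunctionsParser;
options { tokenVocab=MyFunctionsLexer; }
functions : function* EOF;
function : FUNCTION DOT NAME L_BRACKET (function | argument (COMMA argument)*) R_BRACKET;
argument : NAME | function;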
Now, on the problem statement, your example input contains an inconsistency. In the first case, "substring(String,Integer,Integer)", the input is a meta-like description of substring(). In the second case, "substring(\"test\",1,1)", you are parsing code. The first case parses with your grammar; the second does not--there is no string-literal lexer rule defined in your lexer grammar. It's unclear what you really want to parse.
Overall, I defined the visitor code over strings, i.e., each method returns a string representing the output type of the function or argument, e.g., "Integer" or "String", or null if there was an error (or you could throw an exception for static semantic errors). Then, using Visit() on each child of the parse tree node, check the resulting string against what is expected, and handle mismatches as you like.
One other thing to note: you can solve this problem via a visitor or a listener class. The visitor class is useful for purely synthesized attributes. In this example solution, I return a string that represents the type of the function or arg up the associated parse tree, checking the value for each important child. The listener class is useful for L-attributed grammars--i.e., where you are passing attributes in a DFS-oriented manner, left to right at each node in the tree. For this example, you could use the listener class and only override the Exit() functions, but you would then need a Map/Dictionary to map a "context" into an attribute (string). A Java sketch of that listener approach appears at the end of this answer.
lexer grammar MyFunctionsLexer;
FUNCTION: 'FUNCTION';
NAME: [A-Za-z0-9]+;
DOT: '.';
COMMA: ',';
L_BRACKET: '(';
R_BRACKET: ')';
WS : [ \t\r\n]+ -> skip;
parser grammar MyFunctionsParser;
options { tokenVocab=MyFunctionsLexer; }
functions : function* EOF;
function : FUNCTION '.' NAME '(' (function | argument (',' argument)*) ')';
argument: (NAME | function);
using Antlr4.Runtime;

namespace AntlrConsole2
{
    public class Program
    {
        static void Main(string[] args)
        {
            var input = @"FUNCTION.concat(FUNCTION.substring(String,Integer,Integer),String)";
            var str = new AntlrInputStream(input);
            var lexer = new MyFunctionsLexer(str);
            var tokens = new CommonTokenStream(lexer);
            var parser = new MyFunctionsParser(tokens);
            var listener = new ErrorListener<IToken>();
            parser.AddErrorListener(listener);
            var tree = parser.functions();
            if (listener.had_error)
            {
                System.Console.WriteLine("error in parse.");
            }
            else
            {
                System.Console.WriteLine("parse completed.");
            }
            var visitor = new Validate();
            visitor.Visit(tree);
        }
    }
}

namespace AntlrConsole2
{
    using System;
    using Antlr4.Runtime.Misc;
    using System.Collections.Generic;

    class Validate : MyFunctionsParserBaseVisitor<string>
    {
        Dictionary<String, String> map = new Dictionary<String, String>();

        public Validate()
        {
            map.Add("FUNCTION.add", "Integer:Integer,Integer");
            map.Add("FUNCTION.concat", "String:String,String");
            map.Add("FUNCTION.mul", "Integer:Integer,Integer");
            map.Add("FUNCTION.substring", "String:String,Integer,Integer");
        }

        public override string VisitFunctions([NotNull] MyFunctionsParser.FunctionsContext context)
        {
            for (int i = 0; i < context.ChildCount; ++i)
            {
                var c = context.GetChild(i);
                if (c.GetText() == "<EOF>")
                    continue;
                var top_level_result = Visit(context.GetChild(i));
                if (top_level_result == null)
                {
                    System.Console.WriteLine("Failed semantic analysis: "
                        + context.GetChild(i).GetText());
                }
            }
            return null;
        }

        public override string VisitFunction(MyFunctionsParser.FunctionContext context)
        {
            // Get function name and expected type information.
            var name = context.GetChild(2).GetText();
            map.TryGetValue("FUNCTION." + name, out string type);
            if (type == null)
            {
                return null; // not declared in function table.
            }
            string result_type = type.Split(":")[0];
            string args_types = type.Split(":")[1];
            string[] expected_arg_type = args_types.Split(",");
            const int j = 4;
            var a = context.GetChild(j);
            if (a is MyFunctionsParser.FunctionContext)
            {
                var v = Visit(a);
                if (v != result_type)
                {
                    return null; // Handle type mismatch.
                }
            } else {
                for (int i = j; i < context.ChildCount; i += 2)
                {
                    var parameter = context.GetChild(i);
                    var v = Visit(parameter);
                    if (v != expected_arg_type[(i - j) / 2])
                    {
                        return null; // Handle type mismatch.
                    }
                }
            }
            return result_type;
        }

        public override string VisitArgument([NotNull] MyFunctionsParser.ArgumentContext context)
        {
            var c = context.GetChild(0);
            if (c is Antlr4.Runtime.Tree.TerminalNodeImpl)
            {
                // Unclear what this is supposed to parse:
                // Mutate "1" to "Integer"?
                // Mutate "Integer" to "String"?
                // Or what?
                return c.GetText();
            }
            else
                return Visit(c);
        }
    }
}
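As a companion to the C# visitor above, here is a hedged Java sketch of the listener alternative mentioned earlier, using ANTLR's ParseTreeProperty as the context-to-attribute map; the class names assume the Java classes ANTLR generates from the grammars in this answer:

import java.util.HashMap;
import java.util.Map;
import org.antlr.v4.runtime.tree.ParseTreeProperty;

public class TypeCheckListener extends MyFunctionsParserBaseListener {
    private final Map<String, String> map = new HashMap<>();
    private final ParseTreeProperty<String> types = new ParseTreeProperty<>();

    public TypeCheckListener() {
        map.put("FUNCTION.add", "Integer:Integer,Integer");
        map.put("FUNCTION.concat", "String:String,String");
        map.put("FUNCTION.mul", "Integer:Integer,Integer");
        map.put("FUNCTION.substring", "String:String,Integer,Integer");
    }

    @Override
    public void exitArgument(MyFunctionsParser.ArgumentContext ctx) {
        // An argument is either a NAME ("Integer", "String", ...) or a nested
        // function whose result type was already recorded by exitFunction.
        types.put(ctx, ctx.function() != null ? types.get(ctx.function()) : ctx.getText());
    }

    @Override
    public void exitFunction(MyFunctionsParser.FunctionContext ctx) {
        String type = map.get("FUNCTION." + ctx.NAME().getText());
        if (type == null) {
            types.put(ctx, null); // not declared in function table
            return;
        }
        String resultType = type.split(":")[0];
        String[] expected = type.split(":")[1].split(",");
        // Compare each argument's synthesized type against the declared one.
        // (The grammar's bare nested-function alternative is left unchecked here for brevity.)
        for (int i = 0; i < ctx.argument().size(); i++) {
            if (i >= expected.length || !expected[i].equals(types.get(ctx.argument(i)))) {
                types.put(ctx, null); // arity or type mismatch
                return;
            }
        }
        types.put(ctx, resultType);
    }
}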

How can I detect if a user enters a string which does not follow my ANTLR grammar rules?

I am making a Computer Algebra System which will take an algebraic expression and simplify or differentiate it.
As you can see from the following code, the user input is taken, but if it is a string which does not conform to my grammar rules, the error
line 1:6 mismatched input '<EOF>' expecting {'(', INT, VAR}
occurs and the program continues running.
How would I catch the error and stop the program from running? Thank you in advance for any help.
Controller class:
public static void main(String[] args) throws IOException {
    String userInput = "x*x*x+";
    getAST(userInput);
}

public static AST getAST(String userInput) {
    ParseTree tree = null;
    ExpressionLexer lexer = null;
    ANTLRInputStream input = new ANTLRInputStream(userInput);
    try {
        lexer = new ExpressionLexer(input);
    } catch (Exception e) {
        System.out.println("Incorrect grammar");
    }
    System.out.println("Lexer created");
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    System.out.println("Tokens created");
    ExpressionParser parser = new ExpressionParser(tokens);
    System.out.println("Tokens parsed");
    tree = parser.expr();
    System.out.println("Tree created");
    System.out.println(tree.toStringTree(parser)); // print LISP-style tree
    Trees.inspect(tree, parser);
    ParseTreeWalker walker = new ParseTreeWalker();
    ExpressionListener listener = new buildAST();
    walker.walk(listener, tree);
    listener.printAST();
    listener.extractExpression();
    return new AST();
}
}
My Grammar:
grammar Expression;

@header {
    package exprs;
}

@members {
    // This method makes the parser stop running if it encounters
    // invalid input and throw a RuntimeException.
    public void reportErrorsAsExceptions() {
        //removeErrorListeners();
        addErrorListener(new ExceptionThrowingErrorListener());
    }

    private static class ExceptionThrowingErrorListener extends BaseErrorListener {
        @Override
        public void syntaxError(Recognizer<?, ?> recognizer,
                Object offendingSymbol, int line, int charPositionInLine,
                String msg, RecognitionException e) {
            throw new RuntimeException(msg);
        }
    }
}

@rulecatch {
    // ANTLR does not generate its normal rule try/catch
    catch(RecognitionException e) {
        throw e;
    }
}

expr : left=expr op=('*'|'/'|'^') right=expr
     | left=expr op=('+'|'-') right=expr
     | '(' expr ')'
     | atom
     ;

atom : INT|VAR;

INT : ('0'..'9')+ ;
VAR : ('a' .. 'z') | ('A' .. 'Z') | '_';
WS : [ \t\r\n]+ -> skip ;
A typical parse run with ANTLR4 consists of 2 stages:
A "quick'n dirty" run with SLL prediction mode that bails out on the first found syntax error.
A normal run using the LL prediction mode which tries to recover from parser errors. This second step only needs to be executed if there was an error in the first step.
The first step is kinda loose parse run which doesn't resolve certain ambiquities and hence can report an error which doesn't really exist (when resolved in LL mode). But the first step is faster and delivers so a quicker result for syntactically correct input. This (JS) code shows the setup:
this.parser.removeErrorListeners();
this.parser.addErrorListener(this.errorListener);
this.parser.errorHandler = new BailErrorStrategy();
this.parser.interpreter.setPredictionMode(PredictionMode.SLL);
try {
    this.tree = this.parser.grammarSpec();
} catch (e) {
    if (e instanceof ParseCancellationException) {
        this.tokenStream.seek(0);
        this.parser.reset();
        this.parser.errorHandler = new DefaultErrorStrategy();
        this.parser.interpreter.setPredictionMode(PredictionMode.LL);
        this.tree = this.parser.grammarSpec();
    } else {
        throw e;
    }
}
In order to avoid any recovery attempt for syntax errors in the first step, you also have to set the BailErrorStrategy. This strategy simply throws a ParseCancellationException in case of a syntax error (similar to what you do in your code). You could add your own handling in the catch clause to ask the user for correct input and re-run the parse step. A Java version of the same setup is sketched below.
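Since the question is in Java, here is a hedged translation of the same two-stage setup for the Expression grammar above (PredictionMode lives in org.antlr.v4.runtime.atn, ParseCancellationException in org.antlr.v4.runtime.misc):

static ParseTree parseTwoStage(String userInput) {
    ExpressionLexer lexer = new ExpressionLexer(new ANTLRInputStream(userInput));
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    ExpressionParser parser = new ExpressionParser(tokens);
    parser.removeErrorListeners();
    parser.setErrorHandler(new BailErrorStrategy());       // throw instead of recovering
    parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
    try {
        return parser.expr();                              // stage 1: fast SLL attempt
    } catch (ParseCancellationException e) {
        tokens.seek(0);                                    // rewind the token stream
        parser.reset();
        parser.setErrorHandler(new DefaultErrorStrategy());
        parser.getInterpreter().setPredictionMode(PredictionMode.LL);
        return parser.expr();                              // stage 2: errors here are real
    }
}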

Lucene multi word tokens with delimiter

I am just starting with Lucene, so it's probably a beginner's question. We are trying to implement a semantic search on digital books and already have a concept generator, so for example the concepts I generate for a new article could be:
|Green Beans | Spring Onions | Cooking |
I am using Lucene to create an index on the books/articles using only the extracted concepts (stored in a temporary document for that purpose). Now the standard analyzer is creating single word tokens: Green, Beans, Spring, Onions, Cooking, which of course is not the same.
My question: is there an analyzer that is able to detect delimiters around tokens (|| in our example), or an analyzer that is able to detect multi-word constructs?
I'm afraid we'll have to create our own analyzer, but I don't quite know where to start for that one.
Creating an analyzer is pretty easy. An analyzer is just a tokenizer optionally followed by token filters. In your case, you'd have to create your own tokenizer. Fortunately, you have a convenient base class for this: CharTokenizer.
You implement the isTokenChar method and make sure it returns false on the | character and true on any other character. Everything else will be considered part of a token.
Once you have the tokenizer, the analyzer should be straightforward, just look at the source code of any existing analyzer and do likewise.
Oh, and if you can have spaces between your | chars, just add a TrimFilter to the analyzer.
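A minimal sketch of that analyzer (hedged: exact package locations and the TrimFilter constructor vary between Lucene versions; recent versions pass the code point to isTokenChar):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.miscellaneous.TrimFilter;
import org.apache.lucene.analysis.util.CharTokenizer;

public final class ConceptAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Everything except '|' is part of a token, so "Green Beans" survives whole.
        Tokenizer source = new CharTokenizer() {
            @Override
            protected boolean isTokenChar(int c) {
                return c != '|';
            }
        };
        // Strip the spaces that surround each concept between the pipes.
        TokenStream result = new TrimFilter(source);
        return new TokenStreamComponents(source, result);
    }
}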
I came across this question because I am doing something with my Lucene mechanisms which creates data structures to do with sequencing, in effect "hijacking" the Lucene classes. Otherwise I can't imagine why people would want knowledge of the separators ("delimiters") between tokens, but as it was quite tricky I thought I'd put it here for the benefit of anyone who might need it.
You have to rewrite your own versions of StandardTokenizer and StandardTokenizerImpl. These are both final classes so you can't extend them.
SeparatorDeliveringTokeniserImpl (tweaked from source of StandardTokenizerImpl):
3 new fields:
private int startSepPos = 0;
private int endSepPos = 0;
private String originalBufferAsString;
Tweak these methods:
public final void getText(CharTermAttribute t) {
    t.copyBuffer(zzBuffer, zzStartRead, zzMarkedPos - zzStartRead);
    if( originalBufferAsString == null ){
        originalBufferAsString = new String( zzBuffer, 0, zzBuffer.length );
    }
    // startSepPos == -1 is a "flag condition": it means that this token is the last one and it won't be followed by a sep
    if( startSepPos != -1 ){
        // if the flag is NOT set, record the start pos of the next sep...
        startSepPos = zzMarkedPos;
    }
}
public final void yyreset(java.io.Reader reader) {
    zzReader = reader;
    zzAtBOL = true;
    zzAtEOF = false;
    zzEOFDone = false;
    zzEndRead = zzStartRead = 0;
    zzCurrentPos = zzMarkedPos = 0;
    zzFinalHighSurrogate = 0;
    yyline = yychar = yycolumn = 0;
    zzLexicalState = YYINITIAL;
    if (zzBuffer.length > ZZ_BUFFERSIZE)
        zzBuffer = new char[ZZ_BUFFERSIZE];
    // reset fields responsible for delivering separator...
    originalBufferAsString = null;
    startSepPos = 0;
    endSepPos = 0;
}
(inside getNextToken:)
if ((zzAttributes & 1) == 1) {
    zzAction = zzState;
    zzMarkedPosL = zzCurrentPosL;
    if ((zzAttributes & 8) == 8) {
        // every occurrence of a separator char leads here...
        endSepPos = zzCurrentPosL;
        break zzForAction;
    }
}
And make a new method:
String getPrecedingSeparator() {
    String sep = null;
    if( originalBufferAsString == null ){
        sep = new String( zzBuffer, 0, endSepPos );
    }
    else if( startSepPos == -1 || endSepPos <= startSepPos ){
        sep = "";
    }
    else {
        sep = originalBufferAsString.substring( startSepPos, endSepPos );
    }
    if( zzMarkedPos < startSepPos ){
        // ... then this is a sign that the next token will be the last one... and will NOT have a trailing separator
        // so set a "flag condition" for next time this method is called
        startSepPos = -1;
    }
    return sep;
}
SeparatorDeliveringTokeniser (tweaked from source of StandardTokenizer):
Add this:
private String separator;

String getSeparator(){
    // normally this delivers a preceding separator... but after incrementToken returns false, if there is a trailing
    // separator, it then delivers that...
    return separator;
}
(inside incrementToken:)
while(true) {
    int tokenType = scanner.getNextToken();
    // added NB this gives you the separator which PRECEDES the token
    // which you are about to get from scanner.getText( ... )
    separator = scanner.getPrecedingSeparator();
    if (tokenType == SeparatorDeliveringTokeniserImpl.YYEOF) {
        // NB at this point sep is equal to the trailing separator...
        return false;
    }
    ...
Usage:
In my FilteringTokenFilter subclass, called TokenAndSeparatorExamineFilter, the methods accept and end look like this:
@Override
public boolean accept() throws IOException {
    String sep = ((SeparatorDeliveringTokeniser) input).getSeparator();
    // a preceding separator can only be an empty String if we are currently
    // dealing with the first token and if the sequence starts with a token
    if (!sep.isEmpty()) {
        // ... do something with the preceding separator
    }
    // then get the token...
    String token = getTerm();
    // ... do something with the token
    // my filter does no filtering! Every token is accepted...:
    return true;
}

@Override
public void end() throws IOException {
    // deals with trailing separator at the end of a sequence of tokens and separators
    // (if there is one, i.e. if it doesn't end with a token)
    String sep = ((SeparatorDeliveringTokeniser) input).getSeparator();
    // NB will be an empty String if there is no trailing separator
    if (!sep.isEmpty()) {
        // ... do something with this trailing separator
    }
}
