How do I parse a String in Java with a specified grammar?
Let's say I have this eBNF grammar:
object = "O:", natural_number, ":", value, ":", natural_number, ":{", { element }, "}";
value = '"' , character , { character } , '"';
element = string | boolean | array | empty_element, ";" ;
empty_element = "N" ;
string = "s:", natural_number, ":", value ;
boolean = "b:", ( "0" | "1" ) ;
array = "a:" ;
(and so on; I won't specify it in full here)
How do I let Java handle parsing such a String into a usable tree?
Use ANTLR: express the grammar in ANTLR's EBNF-like syntax and let it generate the lexer and parser, rather than writing a parser by hand.
ANTLR 3 reports an error when it encounters the pound character ("£") of the French language, which is the equivalent of the English hash character "#", even though the Unicode values of the three special characters @, #, and $ are specified in the lexer/parser rule.
FYI: The Unicode value of Pound char (of the French language) = The Unicode value of Hash char (of ENGLISH language).
The lexer/parser rules:
grammar SimpleCalc;
options
{
k = 8;
language = Java;
//filter = true;
}
tokens {
PLUS = '+' ;
MINUS = '-' ;
MULT = '*' ;
DIV = '/' ;
}
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
expr : n1=NUMBER ( exp = ( PLUS | MINUS ) n2=NUMBER )*
{
if ($exp.text.equals("+"))
System.out.println("Plus Result = " + $n1.text + $n2.text);
else
System.out.println("Minus Result = " + $n1.text + $n2.text);
}
;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
NUMBER : (DIGIT)+ ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
fragment DIGIT : '0'..'9' | '£' | ('\u0040' | '\u0023' | '\u0024');
The text file is also read as UTF-8:
public static void main(String[] args) throws Exception
{
try
{
args = new String[1];
args[0] = new String("antlr_test.txt");
SimpleCalcLexer lex = new SimpleCalcLexer(new ANTLRFileStream(args[0], "UTF-8"));
CommonTokenStream tokens = new CommonTokenStream(lex);
SimpleCalcParser parser = new SimpleCalcParser(tokens);
parser.expr();
//System.out.println(tokens);
}
catch (Exception e)
{
e.printStackTrace();
}
}
The input file has only one line:
£3 + 4£
the error is:
antlr_test.txt line 1:1 no viable alternative at character '£'
antlr_test.txt line 1:7 no viable alternative at character '£'
What is wrong with my approach? Did I miss something?
I cannot reproduce what you describe. When I test your grammar without modifications, I get a NumberFormatException, which is expected, because Integer.parseInt("£3") cannot succeed.
When I change your embedded code into this:
{
if ($exp.text.equals("+"))
System.out.println("Result = " + (Integer.parseInt($n1.text.replaceAll("\\D", "")) + Integer.parseInt($n2.text.replaceAll("\\D", ""))));
else
System.out.println("Result = " + (Integer.parseInt($n1.text.replaceAll("\\D", "")) - Integer.parseInt($n2.text.replaceAll("\\D", ""))));
}
and regenerate lexer and parser classes (something you might not have done) and rerun the driver code, I get the following output:
Result = 7
EDIT
Perhaps the pound sign in the grammar is the issue? What if you try:
fragment DIGIT : '0'..'9' | '\u00A3' | ('\u0040' | '\u0023' | '\u0024');
instead of:
fragment DIGIT : '0'..'9' | '£' | ('\u0040' | '\u0023' | '\u0024');
?
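It may also be worth checking the assumption that the pound sign and the hash sign share a Unicode value. As far as I know they do not: the pound sign is U+00A3, while U+0040, U+0023 and U+0024 are @, # and $. A minimal sketch that prints the code points involved (the class and method names are made up for illustration):

```java
// Illustrative only: class and method names are not from the original post.
class PoundVsHash {
    // Formats a character as a U+XXXX code point string.
    static String codePoint(char c) {
        return String.format("U+%04X", (int) c);
    }

    public static void main(String[] args) {
        // Escapes are used so the result does not depend on the source file encoding.
        System.out.println("pound  : " + codePoint('\u00A3'));
        System.out.println("at     : " + codePoint('\u0040'));
        System.out.println("hash   : " + codePoint('\u0023'));
        System.out.println("dollar : " + codePoint('\u0024'));
    }
}
```

If that holds, the three escapes in the DIGIT rule cover @, # and $ only, so the pound sign needs its own alternative.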
I have the following parser rule
study: 'study' '(' ( assign* | ( assign (',' assign)*) ) ')' NEWLINE;
assign: ID '=' (INT | DATA );
INT : [0-9]+ ;
DATA : '"' ID '"' | '"' INT '"';
ID : [a-zA-Z]+ ;
My problem now is how to retrieve the variables defined in the study inside the enterStudy method:
@Override
public void enterStudy(StudyParser.StudyContext ctx) {
// get the declared variables
// study(hello = "hello",world = "world")
// study(hello = "hello",world = "world",name = "name")
System.out.println("enterStudy");
}
Add the following snippet to your grammar:
@members {
public final java.util.List<java.util.Map.Entry<String, String>> parameters = new java.util.ArrayList<>();
}
Modify your assign rule:
assign: name=ID '=' value=(INT | DATA ) {
parameters.add(new java.util.AbstractMap.SimpleImmutableEntry<>($name.text, $value.text));
};
Now you can use the parameters field of your StudyParser instance to access the required information:
StudyParser parser = ...;
parser.study();
System.out.println(parser.parameters);
Also, please note that your grammar is probably slightly wrong: the assign* alternative accepts assignments with no comma between them, so it allows input such as study(x=1y=2).
MSH|^~\&|RAD|MCH|SOARCLIN|MCH|201309281506||ORU^R01|RMS|P|2.4
PID|0001|_MISSING_|059805^a~059805^a~059805^a||RENNER^KATHRYN^
In a string like the above, I need to replace a field based on the count of | (pipe) characters before it.
For example, in the MSH line I want to replace the field "MCH" that follows the 3rd | (pipe) with "ABC":
input : MSH|^~\&|RAD|MCH|SOARCLIN|MCH|201309281506||ORU^R01|RMS|P|2.4
output : MSH|^~\&|RAD|MCH|SOARCLIN|ABC|201309281506||ORU^R01|RMS|P|2.4
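// theString is assumed to be a field holding the line to modify
String repSection( String del, int count, String rep ){
// limit -1 keeps trailing empty fields, which split would otherwise drop
String[] toks = theString.split( Pattern.quote( del ), -1 );
toks[count] = rep;
theString = String.join( del, toks );
return theString;
}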
Call:
String result = repSection( "|", 3, "ABC" );
It depends on counting alone; it doesn't matter what is there between the 3rd and 4th pipe char.
I prefer this to some fancy and difficult to maintain regex.
s = s.replaceAll( "^((?:[^|]*\\|){3})[^|]*", "$1ABC" );
Again, this doesn't care what is between 3rd and 4th pipe symbol.
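Here is a self-contained sketch of both approaches. It is a variant, not the exact code above: the line is passed in as a parameter instead of being held in a field, and the class and method names are only illustrative. The index is zero-based, so 3 addresses the field after the third pipe.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative names; the input line is passed in rather than kept in a field.
class PipeFieldReplace {
    // Replaces the field at zero-based index 'count' in a 'del'-separated string.
    static String repSection(String s, String del, int count, String rep) {
        // limit -1 keeps trailing empty fields, which split() would otherwise drop
        String[] toks = s.split(Pattern.quote(del), -1);
        toks[count] = rep;
        return String.join(del, toks);
    }

    // Regex variant: keep 'count' pipe-terminated fields, replace the one after them.
    static String repSectionRegex(String s, int count, String rep) {
        String regex = "^((?:[^|]*\\|){" + count + "})[^|]*";
        // Group 1 already ends with a pipe, so the replacement adds none.
        return s.replaceFirst(regex, "$1" + Matcher.quoteReplacement(rep));
    }

    public static void main(String[] args) {
        String msh = "MSH|^~\\&|RAD|MCH|SOARCLIN|MCH|201309281506||ORU^R01|RMS|P|2.4";
        System.out.println(repSection(msh, "|", 3, "ABC"));
        System.out.println(repSectionRegex(msh, 3, "ABC"));
    }
}
```

Note that the sample output in the question changes the second "MCH", which sits after the fifth pipe; with this sketch that would be index 5 rather than 3.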
I'm trying to use a regex to split a string into fields, but unfortunately it's not working 100% and skips some parts that should be split. Here is the part of the program that processes the string:
void parser(String s) {
String REG1 = "(',\\d)|(',')|(\\d,')|(\\d,\\d)";
Pattern p1 = Pattern.compile(REG1);
Matcher m1 = p1.matcher(s);
while (m1.find() ) {
System.out.println(counter + ": "+s.substring(end, m1.end()-1)+" "+end+ " "+m1.end());
end =m1.end();
counter++;
}
}
The string is:
s= 3101,'12HQ18U0109','11YX27X0041','XX21','SHV7-P Hig, Hig','','GW1','MON','E','A','ASEXPORT-1',1,101,0,'0','1500','V','','',0,'mb-master1'
The problem is that it doesn't split at ,1, or ,0,
The rules for parsing are: a string is enclosed by ,' ', for example ,'ASEXPORT-1',
An int is enclosed only by , ,
expected output =
3101 | 12HQ18U0109 | 11YX27X0041 | XX21 | SHV7-P Hig, Hig| |GW1 |MON |E | A| ASEXPORT-1| 1 |101 |0 | 0 |1500 | V| | | 0 |mb-master1
Altogether 21 elements.
You can split it with this regex
,(?=([^']*'[^']*')*[^']*$)
It splits at a , only if there is an even number of ' characters ahead of it, i.e. only when the comma is outside a quoted section.
So for
3101,'12HQ18,U0109','11YX27X0041'
output would be
3101
'12HQ18,U0109'
'11YX27X0041'
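As a runnable sketch of that split (the class and method names are made up for illustration), applied to the short example above:

```java
import java.util.Arrays;

// Illustrative names; only the regex itself comes from the answer.
class QuoteAwareSplit {
    // Splits on commas that are followed by an even number of single quotes,
    // i.e. commas that sit outside any '...' section.
    static String[] splitFields(String line) {
        return line.split(",(?=([^']*'[^']*')*[^']*$)");
    }

    public static void main(String[] args) {
        String line = "3101,'12HQ18,U0109','11YX27X0041'";
        // The comma inside the first quoted field is not a split point.
        System.out.println(Arrays.toString(splitFields(line)));
    }
}
```

The quotes stay on each field with this approach, so they have to be stripped in a second step if they are not wanted.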
Note
It won't work for nested strings like 'hello 'h,i'world'. If there are any such cases, you should use the following regex instead:
(?<='),(?=')|(?<=\d),(?=\d|')|(?<=\d|'),(?=\d)
If you also (for some bizarre reason) need to know each match's start and end index in the original string (like you have it in your sample output), you can use the following pattern:
String regex = "('[^']*'|\\d+)";
which would match an unquoted integer or a single-quoted string.
You can optionally remove the leading and trailing ' using a "second-pass" on the matching substring:
match = match.replaceAll("\\A'|'\\Z", "");
which replaces a leading and trailing ' with nothing.
The code could look like this:
Pattern pat = Pattern.compile("('[^']*'|\\d+)");
Matcher m = pat.matcher(str);
int counter = 0, start = 0;
while (m.find()) {
String match = m.group(1);
int end = start + match.length();
match = match.replaceAll("\\A'|'\\Z", ""); // <-- comment out for NOT replacing
// leading and trailing quotes
System.out.format("%d: %s [%d - %d]%n", ++counter, match, start, end);
start = end + 1; // <-- the "+1" is to account for the ',' separator
}
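As a possible alternative to tracking the offsets by hand, Matcher reports them itself through start() and end(), which avoids the assumption that fields are always separated by exactly one character. A minimal sketch with made-up class and method names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative names; the pattern is the one from the answer, without the group.
class FieldOffsets {
    // Returns one "n: value [start - end]" entry per field, using the offsets
    // reported by the Matcher (end is exclusive).
    static List<String> describe(String str) {
        Matcher m = Pattern.compile("'[^']*'|\\d+").matcher(str);
        List<String> out = new ArrayList<>();
        int counter = 0;
        while (m.find()) {
            // strip one leading and one trailing quote, if present
            String value = m.group().replaceAll("\\A'|'\\Z", "");
            out.add(String.format("%d: %s [%d - %d]", ++counter, value, m.start(), m.end()));
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : describe("3101,'a, b',7")) {
            System.out.println(line);
        }
    }
}
```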
I'm trying to implement a parser for the example file listed below. I'd like to recognize quoted strings with '+' between them as a single token. So I created a .jj file, but it doesn't match such strings. I was under the impression that JavaCC is supposed to find the longest possible match for each token spec, but that doesn't seem to be the case for me.
What am I doing wrong here? Why isn't my <STRING> token matching the '+' even though it's specified in there? Why is whitespace not being ignored?
options {
TOKEN_FACTORY = "Token";
}
PARSER_BEGIN(Parser)
package com.example.parser;
public class Parser {
public static void main(String args[]) throws ParseException {
ParserTokenManager manager = new ParserTokenManager(new SimpleCharStream(Parser.class.getResourceAsStream("example")));
Token token = manager.getNextToken();
while (token != null && token.kind != ParserConstants.EOF) {
System.out.println(token.toString() + "[" + token.kind + "]");
token = manager.getNextToken();
}
Parser parser = new Parser(Parser.class.getResourceAsStream("example"));
parser.start();
}
}
PARSER_END(Parser)
// WHITE SPACE
<DEFAULT, IN_STRING_KEYWORD>
SKIP :
{
" " // <-- skipping spaces
| "\t"
| "\n"
| "\r"
| "\f"
}
// TOKENS
TOKEN :
{
< KEYWORD1 : "keyword1" > : IN_STRING_KEYWORD
}
<IN_STRING_KEYWORD>
TOKEN : {<STRING : <CONCAT_STRING> | <UNQUOTED_STRING> > : DEFAULT
| <#CONCAT_STRING : <QUOTED_STRING> ("+" <QUOTED_STRING>)+ >
// <-- CONCAT_STRING never matches "+" part when input is "'smth' +", because whitespace is not ignored!?
| <#QUOTED_STRING : <SINGLEQUOTED_STRING> | <DOUBLEQUOTED_STRING> >
| <#SINGLEQUOTED_STRING : "'" (~["'"])* "'" >
| <#DOUBLEQUOTED_STRING :
"\""
(
(~["\"", "\\"]) |
("\\" ["n", "t", "\"", "\\"])
)*
"\""
>
| <#UNQUOTED_STRING : (~[" ","\t", ";", "{", "}", "/", "*", "'", "\"", "\n", "\r"] | "/" ~["/", "*"] | "*" ~["/"])+ >
}
void start() :
{}
{
(<KEYWORD1><STRING>";")+ <EOF>
}
Here's an example file that should get parsed:
keyword1 "foo" + ' bar';
I'd like to match the argument of the first keyword1 as a single <STRING> token.
Current output:
keyword1[6]
Exception in thread "main" com.example.parser.TokenMgrError: Lexical error at line 1, column 15. Encountered: " " (32), after : "\"foo\""
at com.example.parser.ParserTokenManager.getNextToken(ParserTokenManager.java:616)
at com.example.parser.Parser.main(Parser.java:12)
I'm using JavaCC 5.0.
STRING is expanding to the longest sequence that can be matched, which is "foo" as the error indicates. The space after the closing double quote is not part of the definition of the private token CONCAT_STRING. Skip tokens do not apply within the definition of other tokens, so you must incorporate the space directly into the definition, on either side of the +.
As an aside, I recommend having a final token definition like so:
<each-state-in-which-the-empty-string-cannot-be-recognized>
TOKEN : {
< ILLEGAL : ~[] >
}
This prevents TokenMgrErrors from being thrown and makes debugging a bit easier.