How can I Ignore certain text in ANTLR4?

First of all, thank you in advance for your answer; this problem is killing me.
My first question is: how can I ignore certain text?
I want to ignore certain text in my document. I have the following text:
I want to ignore the text enclosed by the rectangle. When the lexer finds the word "demandante", it should stop ignoring.
I used this grammar:
grammar A;
documento:((acciondemandante acciondemandado) | (acciondemandado acciondemandante));
acciondemandante: PALABRASDEMANDA informacionentidad+;
acciondemandado: PALABRASDEMANDADO informacionentidad+;
informacionentidad: nombres distancia? identificacion;
nombres: nombrenormal|nombremayuscula;
nombrenormal: WORDCAPITALIZE WORDCAPITALIZE+;
nombremayuscula: WORDUPPER WORDUPPER+;
distancia: WORDLOWER;
identificacion: tipo indicador? INT+;
tipo: cedula | NIT;
cedula: CEDULA | LCASE_LETTER LCASE_LETTER | UCASE_LETTER UCASE_LETTER;
indicador: WORDCAPITALIZE | WORDLOWER;
CEDULA: 'cedula' | 'cc' | 'CC';
NIT: 'NIT' | 'nit';
PALABRASDEMANDADO: 'demandados' | 'demandado';
PALABRASDEMANDA: 'demandante' | 'demandantes';
WORDUPPER: UCASE_LETTER UCASE_LETTER+;
WORDLOWER: LCASE_LETTER LCASE_LETTER+;
WORDCAPITALIZE: UCASE_LETTER LCASE_LETTER+;
LCASE_LETTER: 'a'..'z' | 'ñ' | 'á' | 'é' | 'í' | 'ó' | 'ú';
UCASE_LETTER: 'A'..'Z' | 'Ñ' | 'Á' | 'É' | 'Í' | 'Ó' | 'Ú';
INT: DIGIT+;
DIGIT: '0'..'9';
SPECIAL_CHAR: '.' -> skip;
WS : [ \t\r\n]+ -> skip;
//ANY: ~[ ]+;
I tried a trick: skipping whitespace with WS : [ \t\r\n]+ -> skip; and then ignoring everything that is not whitespace with ANY: ~[ ]+; but it does not work, because the lexer never recognizes the ANY token.
This is what I would like my grammar to read:
bullshit bullshit demandado Julian Solarte c.c 120109321 bullshit bullshit
My second problem is that I get the "mismatched input ''" error. To resolve it I added the rule "SKIPEND: EOF ->skip;", but it does not work.
Thank you so much.

My approach to this problem would take two steps:
1. Find the keyword in the input stream (here, demandado).
2. Let a parser parse from this position, without forcing an EOF for the input in the grammar. It will go as far as possible and ignore everything it doesn't understand after the part it did understand.
This will make your grammar much simpler and you will get a parse tree only for the relevant input.
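The first step can be done in plain Java before ANTLR is involved at all. The following is only a minimal sketch: the class and method names (KeywordLocator, firstKeywordOffset, relevantPart) are made up for illustration, and the keyword list is copied from the PALABRASDEMANDA and PALABRASDEMANDADO rules in the question.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KeywordLocator {
    // Keywords taken from the PALABRASDEMANDA / PALABRASDEMANDADO lexer rules in the question.
    private static final Pattern KEYWORD =
            Pattern.compile("\\b(?:demandantes?|demandados?)\\b");

    /** Returns the offset of the first keyword, or empty if the text has none. */
    static OptionalInt firstKeywordOffset(String text) {
        Matcher m = KEYWORD.matcher(text);
        return m.find() ? OptionalInt.of(m.start()) : OptionalInt.empty();
    }

    /** Returns the text from the first keyword onwards, or "" if no keyword occurs. */
    static String relevantPart(String text) {
        OptionalInt start = firstKeywordOffset(text);
        return start.isPresent() ? text.substring(start.getAsInt()) : "";
    }

    public static void main(String[] args) {
        String input = "bullshit bullshit demandado Julian Solarte c.c 120109321 bullshit bullshit";
        // The printed string is what would be handed to the ANTLR lexer and parser.
        System.out.println(relevantPart(input));
    }
}
```

The returned string would then be given to the lexer (in ANTLR 4.7 and later, for example, through CharStreams.fromString), using a start rule that does not end in EOF so that the trailing text is simply left unparsed.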

Related

antlr4 - any text and keywords

I am trying to parse the following:
SELECT name-of-key[random text]
This is part of a larger grammar which I am trying to construct. I left the rest out for clarity.
I came up with the following rules:
select : 'select' NAME '[' anything ']'
;
anything : (ANYTHING | NAME)+
;
NAME : ('a'..'z' | 'A'..'Z' | '0'..'9' | '-' | '_')+
;
ANYTHING : (~(']' | '['))+
;
WHITESPACE : ('\t' | ' ' | '\r' | '\n')+ -> skip
;
This doesn't seem to work. For example, input SELECT a[hello world!] gives the following error:
line 1:0 mismatched input 'SELECT a' expecting 'SELECT'
This goes wrong because the input SELECT a is recognized by ANYTHING, instead of select. How do I fix that? I feel that I am missing some concept(s) here, but it is difficult to get started.

Maybe the concept you are missing is rule priority.
[1] Lexer rules matching the longest possible string have priority.
As you mentioned, the ANYTHING token rule above matches "select a", which is longer than what the (implicit) token rule 'select' matches, hence its precedence. Non-greedy behaviour is indicated by a question mark.
ANYTHING : (~(']' | '['))+?
Just making the ANYTHING rule non-greedy doesn't completely solve your problem though, because after properly matching 'select', the lexer will produce an ANYTHING token for the space, because ...
[2] Lexer rules appearing first have priority.
Swapping the order of the lexer rules WHITESPACE and ANYTHING fixes this. The grammar below should parse your example.
select : 'select' NAME '[' anything ']'
;
anything : (ANYTHING | NAME)+
;
NAME : ('a'..'z' | 'A'..'Z' | '0'..'9' | '-' | '_')+
;
WHITESPACE : ('\t' | ' ' | '\r' | '\n')+ -> skip
;
ANYTHING : (~(']' | '['))+?
;
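To see the two priority rules in action without running ANTLR, here is a rough simulation in plain Java. It is not how ANTLR is implemented; it only imitates "longest match wins, ties go to the rule declared first". The class name LexerPrioritySketch and the regular expressions standing in for the token rules are assumptions made for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LexerPrioritySketch {
    // Rules in declaration order; the regexes approximate the greedy token rules from the question.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("SELECT", Pattern.compile("select"));
        RULES.put("NAME", Pattern.compile("[a-zA-Z0-9_-]+"));
        RULES.put("ANYTHING", Pattern.compile("[^\\[\\]]+"));
    }

    /** Returns "RULE:text" for the token a longest-match lexer would emit at offset pos. */
    static String nextToken(String input, int pos) {
        String bestRule = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input).region(pos, input.length());
            // Only a strictly longer match replaces the current best, so on a tie
            // the rule that was declared earlier is kept.
            if (m.lookingAt() && m.end() - pos > bestLength) {
                bestRule = rule.getKey();
                bestLength = m.end() - pos;
            }
        }
        if (bestRule == null) {
            throw new IllegalArgumentException("no rule matches at offset " + pos);
        }
        return bestRule + ":" + input.substring(pos, pos + bestLength);
    }
}
```

With the greedy ANYTHING rule, nextToken("select a[hello]", 0) picks ANYTHING for the text "select a", which is the behaviour described in the question. For "select[x]" all three rules match six characters, so the first declared rule, SELECT, wins.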
I personally avoid implicit token rules, especially if your grammar is complex, precisely because of token rule priority. I would thus write this.
SELECT : 'select' ;
L_BRACKET : '[';
R_BRACKET : ']';
NAME : ('a'..'z' | 'A'..'Z' | '0'..'9' | '-' | '_')+ ;
WHITESPACE : ('\t' | ' ' | '\r' | '\n')+ -> skip ;
ANY : . ;
select : SELECT NAME L_BRACKET anything R_BRACKET ;
anything : (~R_BRACKET)+ ;
Also note that the space in "hello world" will be swallowed by the WHITESPACE rule. To properly manage this, you need ANTLR island grammars.
Hope this helps!

Sub string detection performance?

I need to match a substring, and I wonder which of these is faster when it comes to regex matching?
if ( str.matches(".*hello.*") ) {
...
}
Pattern p = Pattern.compile( ".*hello.*" );
Matcher m = p.matcher( str );
if ( m.find() ) {
...
}
And if I don't need a regex, should I use 'contains'?
if ( str.contains("hello") ) {
...
}
Thanks.

Although matches() and using a Matcher give identical results (matches() uses a Matcher in its implementation), using a Matcher can be faster if you cache and reuse the compiled Pattern. I did some rough testing and it improved performance (in my case) by 400%. The improvement depends on the regex, but there will always be some improvement.
Although I haven't tested it, I would expect contains() to outperform any regex approach, because the algorithm is far simpler and you don't need regex for this situation.
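As a minimal sketch of the three options side by side (the class and method names are invented for illustration, and no timings are claimed here), assuming single-line input so that ".*" can span the whole string:

```java
import java.util.regex.Pattern;

final class SubstringChecks {
    // Compiled once and reused; String.matches compiles its regex again on every call.
    private static final Pattern HELLO = Pattern.compile("hello");

    static boolean viaStringMatches(String s) {
        // matches() must cover the whole input, hence the leading and trailing ".*".
        return s.matches(".*hello.*");
    }

    static boolean viaCachedPattern(String s) {
        // find() searches anywhere in the input, so no ".*" padding is needed.
        return HELLO.matcher(s).find();
    }

    static boolean viaContains(String s) {
        // Plain substring search, no regex engine involved.
        return s.contains("hello");
    }
}
```

All three return the same answer for single-line input; they differ only in how much work is done per call.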
Here are the results of 6 ways to test for a String containing a substring, with the target ("http") located at various places within a standard 60 character input:
|------------------------------------------------------------|
| Code tested with "http" in the input | µsec | µsec | µsec |
| at the following positions: | start| mid|absent|
|------------------------------------------------------------|
| input.startsWith("http") | 6 | 6 | 6 |
|------------------------------------------------------------|
| input.contains("http") | 2 | 22 | 49 |
|------------------------------------------------------------|
| Pattern p = Pattern.compile("^http.*")| | | |
| p.matcher(input).find() | 90 | 88 | 86 |
|------------------------------------------------------------|
| Pattern p = Pattern.compile("http.*") | | | |
| p.matcher(input).find() | 84 | 145 | 181 |
|------------------------------------------------------------|
| input.matches("^http.*") | 745 | 346 | 340 |
|------------------------------------------------------------|
| input.matches("http.*") | 1663 | 1229 | 1034 |
|------------------------------------------------------------|
The two-line options are where a static pattern was compiled then reused.

They are more or less equivalent if you use m.matches() in the second code snippet. The documentation for String.matches() specifies this:
An invocation of this method of the form str.matches(regex) yields exactly the same result as the expression Pattern.matches(regex, str)
This in turn specifies:
An invocation of this convenience method of the form
Pattern.matches(regex, input);
behaves in exactly the same way as the expression
Pattern.compile(regex).matcher(input).matches()
If a pattern is to be used multiple times, compiling it once and
reusing it will be more efficient than invoking this method each time.
So calling String.matches(String) in itself will not bring performance benefits, but storing a pattern (e.g. as a constant) and reusing it does.
If you use find(), then matches() could be more efficient when the terms don't match early, since find() may keep looking. But find() and matches() don't perform the same function, so comparing their performance is moot.
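The difference in function is easy to show. In this small sketch (class and method names are made up for illustration) the same compiled pattern gives different answers depending on which Matcher method is called:

```java
import java.util.regex.Pattern;

final class FindVersusMatches {
    private static final Pattern HELLO = Pattern.compile("hello");

    /** True only when the entire input is a match for the pattern. */
    static boolean wholeInputMatches(String s) {
        return HELLO.matcher(s).matches();
    }

    /** True when some substring of the input matches the pattern. */
    static boolean containsMatch(String s) {
        return HELLO.matcher(s).find();
    }
}
```

For the input "say hello", containsMatch returns true while wholeInputMatches returns false, which is why the pattern in the question needs ".*" on both sides when used with matches().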

antlr 4.2.2 output to console warning (157)

I downloaded the latest release of ANTLR, 4.2.2 (antlr-4.2.2-complete.jar).
When I use it to generate a parser from the grammar file Java.g4, it prints some warnings like:
"Java.g4:525:16: rule 'expression' contains an 'assoc' terminal option in an unrecognized location"
The files were generated but didn't compile.
The previous version works fine.
What's wrong?

The <assoc> option should now be moved to the left of the leading "expression" of the alternative.
It must always be placed immediately to the right of the | that starts the alternative.
See: https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Left-recursive+rules
...
| expression '&&' expression
| expression '||' expression
| expression '?' expression ':' expression
|<assoc=right> expression
( '='
| '+='
| '-='
| '*='
| '/='
| '&='
| '|='
| '^='
| '>>='
| '>>>='
| '<<='
| '%='
)
expression

Regex expression to capture hyphenated word between lines, and non hyphenated words

I am trying to write a regular expression in Java that matches words and hyphenated words. So far I have:
Pattern p1 = Pattern.compile("\\w+(?:-\\w+)",Pattern.CASE_INSENSITIVE);
Pattern p2 = Pattern.compile("[a-zA-Z0-9]+",Pattern.CASE_INSENSITIVE);
Pattern p3 = Pattern.compile("(?<=\\s)[\\w]+-$",Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
This is my test case:
Programs
Dsfasdf. Programs Programs Dsfasdf. Dsfasdf. as is wow woah! woah. woah? okay.
he said, "hi." aasdfa. wsdfalsdjf. go-to go-
to
asdfasdf.. , : ; " ' ( ) ? ! - / \ # # $ % & ^ ~ ` * [ ] { } + _ 123
Any help would be awesome.
My expected result would be to match all the words, i.e.
Programs Dsfasdf Programs Programs Dsfasdf Dsfasdf
as is wow woah woah woah okay he said hi aasdfa
wsdfalsdjf go-to go-to asdfasdf
The part I'm struggling with is matching words that are split across lines as one word,
i.e.
go-
to

\p{L}+(?:-\n?\p{L}+)*
Reading the pattern from left to right:
\p{L}   Match any letter (upper/lower case)
+       Match one or more of previous (any letter, upper/lower case)
(?:     Start first non-capture group
-       Literal '-'
\n?     Match a single new-line (optional because of `?`)
\p{L}   Match any letter (upper/lower case)
+       Match one or more of previous (any letter, upper/lower case)
)       End first non-capture group
*       Previous can repeat 0 or more times (group of literal '-', optional new-line and one or more of any letter (upper/lower case))
Is this OK?
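Here is a small sketch of how that pattern could be used from Java. The class name HyphenatedWords and the method words are invented for this example, and it assumes line breaks are a plain "\n" (for "\r\n" input the pattern would need \r?\n instead). Because the match for a split word still contains the newline, it is removed afterwards so the word comes out as "go-to".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HyphenatedWords {
    // The regex from the answer: letters, then any number of
    // ('-', optional newline, letters) groups.
    private static final Pattern WORD = Pattern.compile("\\p{L}+(?:-\\n?\\p{L}+)*");

    /** Extracts words, rejoining any word that was hyphenated across a line break. */
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            // Drop the newline matched inside a split word so "go-\nto" becomes "go-to".
            result.add(m.group().replace("\n", ""));
        }
        return result;
    }
}
```

Digits and punctuation are not in the letter category, so tokens such as 123 or the lone "-" in the test case are skipped.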

I would go with this regex:
\p{L}+(?:\-\p{L}+)*
This regex also matches words such as "fiancé" and "À-la-carte", and other words containing special characters from the "letter" category. \p{L} matches a single code point in the category "letter".

Ant path style patterns

What are the rules for Ant path style patterns?
The Ant site itself is surprisingly uninformative.

Ant-style path patterns matching in spring-framework:
The mapping matches URLs using the following rules:
? matches one character
* matches zero or more characters
** matches zero or more 'directories' in a path
{spring:[a-z]+} matches the regexp [a-z]+ as a path variable named "spring"
Some examples:
com/t?st.jsp - matches com/test.jsp but also com/tast.jsp or com/txst.jsp
com/*.jsp - matches all .jsp files in the com directory
com/**/test.jsp - matches all test.jsp files underneath the com path
org/springframework/**/*.jsp - matches all .jsp files underneath the org/springframework path
org/**/servlet/bla.jsp - matches org/springframework/servlet/bla.jsp but also org/springframework/testing/servlet/bla.jsp and org/servlet/bla.jsp
com/{filename:\\w+}.jsp will match com/test.jsp and assign the value test to the filename variable
http://docs.spring.io/spring/docs/current/javadoc-api/org/springframework/util/AntPathMatcher.html
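To make the wildcard rules concrete, here is a minimal, self-contained sketch of such a matcher. It is not Spring's AntPathMatcher and makes no claim to match its behaviour in every corner case: the class AntStyleSketch and its methods are hypothetical, only ?, * and ** are handled, and {name:regex} path variables are left out.

```java
import java.util.regex.Pattern;

final class AntStyleSketch {
    /** Matches a '/'-separated path against a pattern that uses ?, * and **. */
    static boolean matches(String pattern, String path) {
        return matchSegments(pattern.split("/"), 0, path.split("/"), 0);
    }

    private static boolean matchSegments(String[] pat, int i, String[] segs, int j) {
        if (i == pat.length) {
            return j == segs.length; // pattern used up: path must be used up too
        }
        if (pat[i].equals("**")) {
            // "**" may absorb zero or more whole directories.
            for (int k = j; k <= segs.length; k++) {
                if (matchSegments(pat, i + 1, segs, k)) {
                    return true;
                }
            }
            return false;
        }
        return j < segs.length
                && segmentMatches(pat[i], segs[j])
                && matchSegments(pat, i + 1, segs, j + 1);
    }

    private static boolean segmentMatches(String pat, String seg) {
        // Translate one path segment of the pattern into a regex.
        StringBuilder regex = new StringBuilder();
        for (char c : pat.toCharArray()) {
            if (c == '?') {
                regex.append("[^/]");      // exactly one character
            } else if (c == '*') {
                regex.append("[^/]*");     // zero or more characters
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.matches(regex.toString(), seg);
    }
}
```

With this sketch, matches("org/**/servlet/bla.jsp", "org/servlet/bla.jsp") is true because ** absorbs zero directories, while matches("com/*.jsp", "com/sub/test.jsp") is false because * never crosses a '/'.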

I suppose you mean how to use path patterns.
If it is about whether to use slashes or backslashes: these will be translated to the path separators of the platform in use at execution time.

The most upvoted answer, by user11153, reformatted as tables for readability.
The mapping matches URLs using the following rules:
+-----------------+---------------------------------------------------------+
| Wildcard | Description |
+-----------------+---------------------------------------------------------+
| ? | Matches exactly one character. |
| * | Matches zero or more characters. |
| ** | Matches zero or more 'directories' in a path |
| {spring:[a-z]+} | Matches regExp [a-z]+ as a path variable named "spring" |
+-----------------+---------------------------------------------------------+
Some examples:
+------------------------------+--------------------------------------------------------+
| Example | Matches: |
+------------------------------+--------------------------------------------------------+
| com/t?st.jsp | com/test.jsp but also com/tast.jsp or com/txst.jsp |
| com/*.jsp | All .jsp files in the com directory |
| com/**/test.jsp | All test.jsp files underneath the com path |
| org/springframework/**/*.jsp | All .jsp files underneath the org/springframework path |
| org/**/servlet/bla.jsp | org/springframework/servlet/bla.jsp |
| also: | org/springframework/testing/servlet/bla.jsp |
| also: | org/servlet/bla.jsp |
| com/{filename:\\w+}.jsp | com/test.jsp & assign value test to filename variable |
+------------------------------+--------------------------------------------------------+

ANT Style Pattern Matcher
Wildcards
The utility uses three different wildcards.
+----------+-----------------------------------+
| Wildcard | Description |
+----------+-----------------------------------+
| * | Matches zero or more characters. |
| ? | Matches exactly one character. |
| ** | Matches zero or more directories. |
+----------+-----------------------------------+

As user11153 mentioned, Spring's AntPathMatcher implements and documents the basics of Ant-style path pattern matching.
In addition, Java 7's NIO APIs added some built-in support for basic pattern matching via FileSystem.getPathMatcher.
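A short sketch of the NIO route, using the "glob" syntax of FileSystem.getPathMatcher. The wrapper class GlobMatching and its method are made up for this example. Note that glob syntax is similar to, but not the same as, Ant syntax: in a glob, ** matches characters across directory boundaries rather than whole directories.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

final class GlobMatching {
    /** Tests a relative path against a glob pattern using the default file system. */
    static boolean globMatches(String glob, String path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(path));
    }
}
```

For example, globMatches("com/*.jsp", "com/test.jsp") is true, while globMatches("com/*.jsp", "com/sub/test.jsp") is false because a single * does not cross a directory boundary.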
