I am trying to search for words that appear within tilde (~) sign borders.
e.g. ~albert~ is a ~good~ boy.
I know that this is possible by using ~.+?~, and it already works for me. But there are special cases where I need to match a nested tilde sentence.
e.g. ~The ~spectacle~~ was ~broken~
In the example above, I have to capture 'The spectacle', 'spectacle', and 'broken' separately. These will be translated either word-by-word or with an accompanying article (An, The, whatever). The reason is that in my system:
1) 'The spectacle' requires a separate translation in specific cases.
2) 'Spectacle' also needs a translation in specific cases.
3) IF a translation exists for 'The spectacle', we will use that, ELSE
we will use the translation for 'spectacle' on its own.
Another example to explain this is:
~The ~spectacle~~ was ~broken~, but that was not the same ~spectacle~
that was given to ~me~.
In the example above, I will have translation for:
1) 'The spectacle' (because a translation case exists for 'The spectacle'; otherwise I would've only translated 'spectacle' on its own)
2) 'broken'
3) 'spectacle'
4) 'me'
I am having trouble putting together an expression that makes sure this is captured. The one that I have managed to get working so far is ~.+?~. But I know that with some form of lookahead or lookbehind I can get this working. Could anyone help me with this?
The most important aspect of this is regression-proofing, which will ensure that the existing stuff doesn't break. If I manage to get it right, I will post it.
N.B. If it helps, currently I will only have instances where one level of nesting requires decomposition, so ~The ~spectacle~~ will be the deepest level (until I need more!!!!!)
I wrote something like this a while ago; I haven't tested it much, though:
(~(?(?=.*?~~.*?~).*?~.*?~.*?~|[^~]+?~))
or
(~(?(?=.*?~[A-Za-z]*?~.*?~).*?~.*?~.*?~|[^~]+?~))
RegEx101
Another alternative
(~(?:.*?~.*?~){0,2}.*?~)
^^ change {0,2} to the maximum nesting depth
whichever works best.
To support more nesting, add a few extra sets of .*?~ in the two places where several already appear.
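If you are applying one of these from Java, note that java.util.regex does not support the conditional (?(...)...) construct used in the first two patterns, so this sketch (illustrative only) applies the plain alternative with Matcher.find():

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TildeGroups {
    public static void main(String[] args) {
        String input = "~The ~spectacle~~ was ~broken~";
        // {0,2} is the maximum nesting depth, as described above
        Pattern p = Pattern.compile("(~(?:.*?~.*?~){0,2}.*?~)");
        Matcher m = p.matcher(input);
        while (m.find()) {
            System.out.println(m.group(1));
        }
    }
}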
The main problem
If we allow unlimited nesting, how would we know where each group begins and ends? A clumsy diagram:
~This text could be nested ~ so could this~ and this~ this ~Also this~
| | |_________| | |
| |_______________________________| |
|____________________________________________________________________|
or:
~This text could be nested ~ so could this~ and this~ this ~Also this~
| | | | |_________|
| |______________| |
|___________________________________________________|
The compiler would have no idea which to choose
For your sentence
~The ~spectacle~~ was ~broken~, but that was not the same ~spectacle~ that was given to ~me~.
| | ||_____| | | |
| | |_____________| | |
| |____________________________________________________| |
|___________________________________________________________________|
or:
~The ~spectacle~~ was ~broken~, but that was not the same ~spectacle~ that was given to ~me~.
| |_________|| |______| |_________| |__|
|_______________|
What should I do?
Use different opening and closing characters (as @tbraun suggested) so the compiler knows where to start and end:
{This text can be {properly {nested}} without problems} because {the compiler {can {see {the}}} start and end points} easily. Or use a compiler:
Note: I don't do Java much so some code might be incorrect
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.commons.lang3.StringUtils;

String[] chars = myString.split("");
int depth = 0;
int lastIndex = 0;
List<String> results = new ArrayList<String>();
for (int i = 0; i < chars.length; i += 1) {
    if (chars[i].equals("{")) {
        depth += 1;
        if (depth == 1) {
            lastIndex = i;
        }
    }
    if (chars[i].equals("}")) {
        depth -= 1;
        if (depth == 0) {
            // collect everything from the opening brace to this closing brace
            results.add(StringUtils.join(Arrays.copyOfRange(chars, lastIndex, i + 1), ""));
        }
        if (depth < 0) {
            // balancing problem: handle the error here
        }
    }
}
This uses StringUtils from Apache Commons Lang.
You'll need something that differentiates the start and finish delimiters, e.g. { and }.
Then you can use the pattern \{[^{]*?\} to exclude {:
{The {spectacle}} was {broken}
First iteration
{spectacle}
{broken}
Second iteration
{The spectacle}
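A rough Java sketch of that iterative idea (each pass captures the innermost {...} groups, then strips their braces so the enclosing groups become innermost on the next pass; names are illustrative):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedBraces {
    public static void main(String[] args) {
        String current = "{The {spectacle}} was {broken}";
        // \{[^{]*?\} only matches groups that contain no further '{'
        Pattern innermost = Pattern.compile("\\{[^{]*?\\}");
        List<String> results = new ArrayList<>();
        Matcher m = innermost.matcher(current);
        while (m.find()) {
            StringBuffer sb = new StringBuffer();
            do {
                results.add(m.group());
                // unwrap the braces so the enclosing group becomes innermost next pass
                String inner = m.group().substring(1, m.group().length() - 1);
                m.appendReplacement(sb, Matcher.quoteReplacement(inner));
            } while (m.find());
            m.appendTail(sb);
            current = sb.toString();
            m = innermost.matcher(current);
        }
        System.out.println(results); // [{spectacle}, {broken}, {The spectacle}]
    }
}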
Related
I want to parse some data from an AppleSoft Basic script.
I chose ANTLR and downloaded this grammar: jvmBasic
I'm trying to extract function name without parameters:
return parser.prog().line(0).amprstmt(0).statement().getText();
but it returns PRINT"HELLO", i.e. the full expression except the line number.
Here is the string I want to parse:
10 PRINT "Hello!"
I think this really depends on your ANTLR program's implementation, but if you are using a tree walker/listener you probably want to target the rules for the specific tokens, not the entire statement rule, which is circular and encompasses many types of statement:
//each line can have one to many amprstmt's
line
: (linenumber ((amprstmt (COLON amprstmt?)*) | (COMMENT | REM)))
;
amprstmt
: (amperoper? statement) //encounters a statement here
| (COMMENT | REM)
;
//statements can be made of 1 to many sub statements
statement
: (CLS | LOAD | SAVE | TRACE | NOTRACE | FLASH | INVERSE | GR | NORMAL | SHLOAD | CLEAR | RUN | STOP | TEXT | HOME | HGR | HGR2)
| prstmt
| printstmt1 //the print rule
//MANY MANY OTHER RULES HERE TOO LONG TO PASTE........
;
//the example rule that occurs when the token's "print" is encountered
printstmt1
: (PRINT | QUESTION) printlist?
;
printlist
: expression (COMMA | SEMICOLON)? printlist*
;
As you can see from the BNF-type grammar here, the statement rule includes the rule for a print statement as well as every other type of statement, so it encompasses PRINT and "Hello!" and that is the text getText() returns when the rule is encountered; in your case everything but the line number, since linenumber is a rule outside of the statement rule.
If you want to target these specific rules and handle what happens when they are encountered, you most likely want to add functionality to each of the methods ANTLR generates for each rule by extending the jvmBasicListener class, as shown here.
example:
-jvmBasicListener.java
-extended to jvmBasicCustomListener.java
@Override
public void enterPrintstmt1(jvmBasicParser.Printstmt1Context ctx) {
    System.out.println(ctx.getText());
}
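To have that method actually run, the custom listener has to be walked over the parse tree. A minimal sketch, assuming jvmBasicCustomListener extends the generated jvmBasicBaseListener:

import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

public class WalkDemo {
    public static void main(String[] args) {
        jvmBasicLexer lexer = new jvmBasicLexer(new ANTLRInputStream("10 PRINT \"Hello!\""));
        jvmBasicParser parser = new jvmBasicParser(new CommonTokenStream(lexer));
        ParseTree tree = parser.prog();
        // fires enterPrintstmt1 (and the other enter/exit callbacks) while walking the tree
        ParseTreeWalker.DEFAULT.walk(new jvmBasicCustomListener(), tree);
    }
}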
However, if all this is set up and you just want to return a string value using the single line you have, then accessing the methods at a lower level by addressing the child nodes of statement may work (amprstmt -> statement -> printstmt1 -> value):
return parser.prog().line(0).amprstmt(0).statement().printstmt1().getText();
Just to narrow my answer slightly, the rules that specifically address your input "10 PRINT "HELLO"" would be:
linenumber (contains Number), statement -> printstmt1 and statement -> datastmt -> datum (contains STRINGLITERAL)
So, as shown above, the linenumber rule exists on its own and the other two rules that define your text are children of statement, which explains why you get everything except the line number when taking the statement rule's text.
Addressing each of these and using getText() rather than an encompassing rule such as statement may give you the result you are looking for.
I will update to address your question, since the answer may be slightly longer. The easiest way, in my opinion, to handle specific rules without generating a listener or visitor is to implement actions within your grammar file rules, like this:
printstmt1
: (PRINT | QUESTION) printlist? {System.out.println("Print"); /* your Java code */}
;
This simply lets you address each rule and perform whatever Java action you wish to carry out. You can then generate the parser code with something like:
java -jar antlr-4.5.3-complete.jar jvmBasic.g4 -visitor
After this you can run your code however you wish; here is an example:
import JVM1.jvmBasicBaseVisitor;
import JVM1.jvmBasicLexer;
import JVM1.jvmBasicParser;
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;

public class Jvm extends jvmBasicBaseVisitor<Object> {
    public static void main(String[] args) {
        jvmBasicLexer lexer = new jvmBasicLexer(new ANTLRInputStream("10 PRINT \"Hello!\""));
        jvmBasicParser parser = new jvmBasicParser(new CommonTokenStream(lexer));
        ParseTree tree = parser.prog(); // the embedded grammar action prints "Print" during this parse
    }
}
The output for this example would then be just :
Print
You could also incorporate whatever Java methods you like within the grammar to address each rule encountered, and either develop your own classes and methods to handle it or directly print out a result.
Update
Just to address the latest question now :
parser.line().linenumber().getText() - For line Number, as line is not part of a statement
parser.prog().line(0).amprstmt(0).statement().printstmt1().PRINT().getText() - For PRINT as it is isolated in printstmt1, however does not include CLR in the rule
parser.prog().line(0).amprstmt(0).statement().printstmt1().printlist().expression().getText() - To get the value "hello" as it is part of an expression contained within the printstmt1 rule.
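Putting those together, a hedged sketch that parses once and then navigates the generated context accessors (it assumes the same lexer/parser setup as in the earlier example; the exact text returned depends on the grammar's token definitions):

// jvmBasicParser parser = new jvmBasicParser(new CommonTokenStream(
//         new jvmBasicLexer(new ANTLRInputStream("10 PRINT \"Hello!\""))));
jvmBasicParser.ProgContext prog = parser.prog();   // parse once, then navigate the tree

String lineNumber = prog.line(0).linenumber().getText();            // the line number, e.g. "10"
String keyword = prog.line(0).amprstmt(0).statement()
        .printstmt1().PRINT().getText();                            // the PRINT keyword
String argument = prog.line(0).amprstmt(0).statement()
        .printstmt1().printlist().expression().getText();           // the printed expression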
:) Good luck
In my Xtext DSL, I want to be able to use three different kinds of text terminals. They are all used for adding comments on top of arrows drawn in a UML diagram:
terminal WORD:
Actor -> Actor: WORD
terminal SL_STRINGS:
Actor -> Actor: A sequence of words on a single line
terminal ML_STRINGS:
Actor -> Actor: A series of words on
multiple
lines
My initial approach was to use the ID terminal from the org.eclipse.xtext.common.Terminals as my WORD terminal, and then just have SL_STRINGS be (WORD)*, and ML_STRINGS be (NEWLINE? WORD)*, but this creates a lot of problems with ambiguity between the rules.
How would I go about structuring this in a good way?
More information about the project (and as this is my first time working with Xtext, please bear with me):
I am trying to implement a DSL to be used together with the Eclipse plugin for PlantUML (http://plantuml.sourceforge.net/sequence.html), mainly for syntax checking and colorization. Currently my grammar works as follows:
Model:
(diagrams+=Diagram)*;
Diagram:
'#startuml' NEWLINE (instructions+=(Instruction))* '#enduml' NEWLINE*
;
An instruction can be lots of things:
Instruction:
((name1=ID SEQUENCE name2=ID (':' ID)?)
| Definition (Color)?
| AutoNumber
| Title
| Legend
| Newpage
| AltElse
| GroupingMessages
| Note
| Divider
| Reference
| Delay
| Space
| Hidefootbox
| Lifeline
| ParticipantCreation
| Box)? NEWLINE
;
Example of rules that need different kinds of text terminals:
Group:
'group' TEXT
;
Reference:
'ref over' ID (',' ID)* ((':' SL_TEXT)|((ML_TEXT) NEWLINE 'end ref'))
;
For Group, the text can only be on one line, while for Reference, the text can be on two lines if there is no ":" following the rule call.
Currently my terminals look like this:
terminal NEWLINE : ('\r'? '\n');
// Multiline comment begins with /', and ends with '/
terminal ML_COMMENT : '/\'' -> '\'/';
// Singleline comment begins with ', and continues until end of line.
terminal SL_COMMENT : '\'' !('\n'|'\r')* ('\r'? '\n')?;
// INT is a sequence of numbers 0-9.
terminal INT returns ecore::EInt: ('0'..'9')+;
terminal WS : (' '|'\t')+;
terminal ANY_OTHER: .;
And on top of this I want to add three new terminals that take care of the text.
You should implement a data type rule in order to achieve the desired behavior.
Sebastian wrote an excellent blog post on this topic which can be found here: http://zarnekow.blogspot.de/2012/11/xtext-corner-6-data-types-terminals-why.html
Here is a minimal example of a grammar:
grammar org.xtext.example.mydsl.MyDsl with org.eclipse.xtext.common.Terminals
generate myDsl "http://www.xtext.org/example/mydsl/MyDsl"
Model:
greetings+=Greeting*;
Greeting:
'Example' ':' comment=Comment;
Comment:
(ID ('\r'? '\n')?)+
;
That will allow you to write something like this:
Example: A series of words
Example: A series of words on
multiple lines
You then may want to implement your own value converter in order to fine-tune the conversion to a String.
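A hedged sketch of what such a value converter might look like, using Xtext's declarative value converter service (the class name and the whitespace normalization are illustrative, not a definitive implementation):

import org.eclipse.xtext.conversion.IValueConverter;
import org.eclipse.xtext.conversion.ValueConverter;
import org.eclipse.xtext.conversion.ValueConverterException;
import org.eclipse.xtext.conversion.impl.AbstractDeclarativeValueConverterService;
import org.eclipse.xtext.nodemodel.INode;

public class MyDslValueConverters extends AbstractDeclarativeValueConverterService {

    @ValueConverter(rule = "Comment")
    public IValueConverter<String> Comment() {
        return new IValueConverter<String>() {
            @Override
            public String toValue(String string, INode node) throws ValueConverterException {
                // collapse line breaks and repeated whitespace into single spaces
                return string == null ? null : string.replaceAll("\\s+", " ").trim();
            }

            @Override
            public String toString(String value) throws ValueConverterException {
                return value;
            }
        };
    }
}

To have Xtext pick it up, bind the class in the language's runtime module (typically by overriding bindIValueConverterService()).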
Let me know if that helps!
I need a valid regexp for emails separated by " " (a space) that end with @a.com or @b.com.
for example:
valid email string: "email1@a.com email2@b.com email3@a.com"
invalid email string: "email1@a.com email2@b.com email3@c.com"
I don't necessarily think a regexp is an extensible and maintainable solution here. I would rather (see the sketch after this list):
split the list on whitespace (perhaps on whitespace preceded by a .com/.org etc.)
extract the domain name after the @
compare this against a whitelist (or blacklist)
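A minimal Java sketch of that split-and-whitelist approach, assuming space-separated input and a hypothetical whitelist of allowed domains:

import java.util.Arrays;
import java.util.List;

public class DomainWhitelistCheck {
    private static final List<String> WHITELIST = Arrays.asList("a.com", "b.com");

    static boolean allAllowed(String emailList) {
        for (String email : emailList.trim().split("\\s+")) {
            int at = email.indexOf('@');
            // reject anything without an @ or with a domain outside the whitelist
            if (at < 0 || !WHITELIST.contains(email.substring(at + 1))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allAllowed("email1@a.com email2@b.com email3@a.com")); // true
        System.out.println(allAllowed("email1@a.com email2@b.com email3@c.com")); // false
    }
}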
I like regexps a lot, but I don't always think they're the solution. See here for a discussion on this, and note the below!
Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems.
You can try this expression:
^(( |^)[^ @]+@[ab]\.com)+$
//^    ^     ^^   ^
//|    |     ||   +------ The mandatory .com
//|    |     |+---------- Either a or b
//|    |     +----------- An @ sign
//|    +----------------- Anything but space or @, repeated at least once
//+---------------------- Preceded by a space or the beginning of the line
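For reference, a quick Java check of the expression above against the two sample strings from the question (just a sketch; matches() tests the whole input):

import java.util.regex.Pattern;

public class EmailListCheck {
    private static final Pattern VALID = Pattern.compile("^(( |^)[^ @]+@[ab]\\.com)+$");

    public static void main(String[] args) {
        System.out.println(VALID.matcher("email1@a.com email2@b.com email3@a.com").matches()); // true
        System.out.println(VALID.matcher("email1@a.com email2@b.com email3@c.com").matches()); // false
    }
}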
Try this:
^(.*@(a|b)\.com(|\s))+$
Try entering an invalid string like one ending in "c.com" and see that it is rejected too.
Regexpal is a nice easy tool to start working on making regex for whatever problem you are trying to solve!
(email[1-3]@[ab].com )*email[1-3]@[ab].com ?
(replace [1-3] and [ab] with whatever really suits you).
[A-Za-z0-9_.-]+@[ab]\.com( [A-Za-z0-9_.-]+@[ab]\.com)*
You can change the [A-Za-z0-9_.-]+ part if you want to be more restrictive.
Hey, I have a quick question. I am using ANTLRWorks to create an interpreter in Java from a grammar. I was going to write it out by hand but then realized I didn't have to because of ANTLRWorks. I am getting this error, though:
T.g:9:23: label ID conflicts with token with same name
Is ANTLRWorks the way to go when creating an interpreter from a grammar? And do y'all see any error in my code?
I am trying to make ID one letter from a-z and not case sensitive, and to have white space in between every lexeme. THANK YOU
grammar T;
programs : ID WS compound_statement;
statement:
if_statement|assignment_statement|while_statement|print_statement|compound_statement;
compound_statement: 'begin' statement_list 'end';
statement_list: statement|statement WS statement_list;
if_statement: 'if' '(' boolean_expression ')' 'then' statement 'else' statement;
while_statement: 'while' boolean_expression 'do' statement;
assignment_statement: ID = arithmetic_expression;
print_statement: 'print' ID;
boolean_expression: operand relative_op operand;
operand : ID |INT;
relative_op: '<'|'<='|'>'|'>='|'=='|'/=';
arithmetic_expression: operand|operand WS arithmetic_op WS operand;
arithmetic_op: '+'|'-'|'*'|'/';
ID : ('a'..'z'|'A'..'Z'|'_').
;
INT : '0'..'9'+
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
and here is the grammar
<program> → program id <compound_statement>
<statement> → <if_statement> | <assignment_statement> | <while_statement> |
<print_statement> | <compound_statement>
<compound_statement> → begin <statement_list> end
<statement_list> → <statement> | <statement> ; <statement_list>
<if_statement> → if <boolean_expression> then <statement> else <statement>
<while_statement> → while <boolean_expression> do <statement>
<assignment_statement> -> id := <arithmetic_expression>
<print_statement> → print id
<boolean_expression> → <operand> <relative_op> <operand>
<operand> → id | constant
<relative_op> → < | <= | > | >= | = | /=
<arithmetic_expression> → <operand> | <operand> <arithmetic_op> <operand>
<arithmetic_op> → + | - | * | /
Is ANTLRWorks the way to go when creating an interpreter from a grammar?
No.
ANTLRWorks can only be used to write your grammar and possibly test to see if it input properly (through its debugger or interpreter). It cannot be used to create an interpreter for the language you've written the grammar for. ANTLRWorks is just a fancy text-editor, nothing more.
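For what it's worth, here is a minimal sketch of how the generated parser is driven from your own Java code with the ANTLR 3 runtime (TLexer and TParser are the classes ANTLR generates from T.g; the sample input is only illustrative, and whether it parses cleanly also depends on the grammar fixes discussed below):

import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;

public class TestT {
    public static void main(String[] args) throws RecognitionException {
        TLexer lexer = new TLexer(new ANTLRStringStream("a begin print b end"));
        TParser parser = new TParser(new CommonTokenStream(lexer));
        // invoke the start rule; your interpreter logic goes on top of the resulting parse
        parser.programs();
    }
}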
And do y'all see any error in my code?
As indicated by Treebranch: you didn't have quotes around the = sign in:
assignment_statement: ID = arithmetic_expression;
making ANTLR "think" you wanted to assign the label ID to the parser rule arithmetic_expression, which is illegal: you can't have a label-name that is also the name of a rule (ID, in your case).
Some possible issues in your code:
I think you want your ID rule to use + so that it can be of length 1 or more, like so:
ID : ('a'..'z'|'A'..'Z'|'_')+
;
It also looks like you are missing quotes around your = sign:
assignment_statement: ID '=' arithmetic_expression;
EDIT
Regarding your left recursion issue: ANTLR is very powerful because of the regex functionality. While an EBNF (like the one you have presented) may be limited in the way it can express things, ANTLR can be used to express certain grammar rules in a much simpler way. For instance, if you want to have a statement_list in your compound_statement, just use your statement rule with closure (*). Like so:
compound_statement: 'begin' statement* 'end';
Suddenly, you can remove unnecessary rules like statement_list.
Are there any good free parsing programs out there in Python or Java?
I have been using a lot of textfiles recently and they are all different. I have been spending a lot of time writing code to parse these textfiles. I was wondering if there is some program that could get all the names of a person out of a textfile or parse the file based on a keyword.
Pyparsing is a good Python add-on module for plain text. It is easy to get something going quickly, but it has enough supporting components to do some pretty elaborate parsing work. See http://pyparsing.wikispaces.com, and check out the Examples page. (Plus it is very liberally licensed, so there are no restrictions or runtime encumbrances.)
ANTLR is pretty popular and even has an IDE to help you develop / test your grammars.
Take a look at JavaCC.
From the JavaCC FAQ:
JavaCC stands for "the Java Compiler Compiler"; it is a parser generator and lexical analyzer generator. JavaCC will read a description of a language and generate code, written in Java, that will read and analyze that language. JavaCC is particularly useful when you have to write code to deal with an input language that has a complex structure.
I think you are looking for something like Apache Lucene.
Check this: http://lucene.apache.org/java/docs/index.html
Lepl is a general-purpose, recursive descent parser for Python that I maintain.
It's similar to pyparsing, in that both are parsers that you write directly in Python. Here's an example that parses and evaluates an arithmetic expression:
>>> from lepl import *
>>> from operator import add, sub, mul, truediv
>>> # ast nodes
... class Op(List):
... def __float__(self):
... return self._op(float(self[0]), float(self[1]))
...
>>> class Add(Op): _op = add
...
>>> class Sub(Op): _op = sub
...
>>> class Mul(Op): _op = mul
...
>>> class Div(Op): _op = truediv
...
>>> # tokens
>>> value = Token(UnsignedFloat())
>>> symbol = Token('[^0-9a-zA-Z \t\r\n]')
>>> number = Optional(symbol('-')) + value >> float
>>> group2, group3 = Delayed(), Delayed()
>>> # first layer, most tightly grouped, is parens and numbers
... parens = ~symbol('(') & group3 & ~symbol(')')
>>> group1 = parens | number
>>> # second layer, next most tightly grouped, is multiplication
... mul_ = group1 & ~symbol('*') & group2 > Mul
>>> div_ = group1 & ~symbol('/') & group2 > Div
>>> group2 += mul_ | div_ | group1
>>> # third layer, least tightly grouped, is addition
... add_ = group2 & ~symbol('+') & group3 > Add
>>> sub_ = group2 & ~symbol('-') & group3 > Sub
>>> group3 += add_ | sub_ | group2
>>> ast = group3.parse('1+2*(3-4)+5/6+7')[0]
>>> print(ast)
Add
+- 1.0
`- Add
+- Mul
| +- 2.0
| `- Sub
| +- 3.0
| `- 4.0
`- Add
+- Div
| +- 5.0
| `- 6.0
`- 7.0
>>> float(ast)
6.833333333333333
>>> 1+2*(3-4)+5/6+7
6.833333333333333
The main advantages of Lepl over pyparsing are that it's slightly more powerful (it can compile itself to regular expressions in places for speed, handle left-recursive grammars, and use trampolining to avoid running out of stack space). The main disadvantages are that it's younger than pyparsing, so it doesn't have the same number of users or as large and supportive a community.
If the text has a known format, a grammar parser might be your best bet.
Gold Parser is open source and has both Java and Python support, among others.
It depends on what you need to parse.
If your problem is tied to a particular domain, then one good approach is to create a domain-specific language and parse it in Groovy.