I need a valid regexp for emails separated by " " that end with #a.com or #b.com
for example:
valid email string: "email1#a.com email2#b.com email3#a.com"
invalid email string: "email1#a.com email2#b.com email3#c.com"
I don't necessarily think a regexp is an extensible and maintainable solution here. I would rather:
split the list on whitespace (perhaps on whitespace preceded by a .com/.org etc.)
extract the domain name post-#
compare this vs. a whitelist (or blacklist)
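The three steps above can be sketched in Java roughly as follows. This is only a minimal illustration: the class name DomainWhitelist, the method allAllowed and the hard-coded whitelist are assumptions made for the example, and # is used as the separator because that is what the sample strings in the question use.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class DomainWhitelist {
    // Hypothetical whitelist; in real code this would probably come from configuration.
    private static final Set<String> ALLOWED = new HashSet<>(Arrays.asList("a.com", "b.com"));

    // Returns true only if every whitespace-separated entry ends in an allowed domain.
    static boolean allAllowed(String emails) {
        String trimmed = emails.trim();
        if (trimmed.isEmpty()) {
            return false;
        }
        for (String email : trimmed.split("\\s+")) {   // step 1: split on whitespace
            int sep = email.lastIndexOf('#');          // step 2: locate the domain part
            if (sep <= 0 || !ALLOWED.contains(email.substring(sep + 1))) {
                return false;                          // step 3: compare against the whitelist
            }
        }
        return true;
    }
}
```

Adding a new domain then means adding one entry to the set rather than editing a pattern.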
I like regexps a lot, but I don't always think they're the solution. See here for a discussion on this, and note the below!
Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems.
You can try this expression:
^(( |^)[^ #]+#[ab]\.com)+$
//  ^    ^   ^ ^    ^
//  |    |   | |    +- The mandatory .com
//  |    |   | +------ Either a or b
//  |    |   +-------- A # sign
//  |    +------------ Anything but space or # repeated at least once
//  +----------------- Preceded by a space or the beginning of line
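For reference, a small Java sketch of how this expression could be applied. The class and method names are made up for the example; the only change to the pattern itself is the doubled backslash required by a Java string literal.

```java
import java.util.regex.Pattern;

class EmailListRegex {
    // The expression from the answer above, escaped for a Java string literal.
    private static final Pattern LIST = Pattern.compile("^(( |^)[^ #]+#[ab]\\.com)+$");

    // True when the whole string is a space-separated list of addresses at a.com or b.com.
    static boolean isValid(String emails) {
        return LIST.matcher(emails).matches();
    }
}
```

With the two strings from the question, isValid accepts the first and rejects the second.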
Try this:
^(.#(a|b).com(|\s)$
Permalink - try entering an invalid string like "c.com" and see that it works too
Regexpal is a nice easy tool to start working on making regex for whatever problem you are trying to solve!
(email[1-3]\#[ab].com )*email[1-3]\#[ab].com ?
(replace [1-3] and [ab] with whatever really suits you).
[A-Za-z0-9_.-]+#[ab]\.com( [A-Za-z0-9_.-]+#[ab]\.com)*
You can change the [A-Za-z0-9_.-]+ part if you want to be more restrictive.
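If the local part or the list of domains is likely to change, the same pattern can be assembled from its pieces. The following is only a sketch under that assumption; EmailListPattern and forDomains are hypothetical names, and the domains are quoted with Pattern.quote so that the dot is matched literally.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class EmailListPattern {
    // Builds "<local>#(?:d1|d2)( <local>#(?:d1|d2))*" from a local-part regex and literal domains.
    static Pattern forDomains(String localPart, List<String> domains) {
        String alternatives = domains.stream()
                .map(Pattern::quote)                  // treat each domain as literal text
                .collect(Collectors.joining("|"));
        String one = localPart + "#(?:" + alternatives + ")";
        return Pattern.compile(one + "( " + one + ")*");
    }

    public static void main(String[] args) {
        Pattern p = forDomains("[A-Za-z0-9_.-]+", Arrays.asList("a.com", "b.com"));
        System.out.println(p.matcher("email1#a.com email2#b.com").matches());
    }
}
```

Making the local part more restrictive is then a matter of passing a different first argument.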
I want to parse some data from an AppleSoft BASIC script.
I chose ANTLR and downloaded this grammar: jvmBasic
I'm trying to extract the function name without its parameters:
return parser.prog().line(0).amprstmt(0).statement().getText();
but it returns PRINT"HELLO", i.e. the full expression except the line number.
Here is the string I want to parse:
10 PRINT "Hello!"
I think this really depends on how your ANTLR program is implemented, but if you are using a tree walker/listener you probably want to target the rule for the specific tokens, not the entire "statement" rule, which is circular and encompasses many types of statement:
//each line can have one to many amprstmt's
line
: (linenumber ((amprstmt (COLON amprstmt?)*) | (COMMENT | REM)))
;
amprstmt
: (amperoper? statement) //encounters a statement here
| (COMMENT | REM)
;
//statements can be made of 1 to many sub statements
statement
: (CLS | LOAD | SAVE | TRACE | NOTRACE | FLASH | INVERSE | GR | NORMAL | SHLOAD | CLEAR | RUN | STOP | TEXT | HOME | HGR | HGR2)
| prstmt
| printstmt1 //the print rule
//MANY MANY OTHER RULES HERE TOO LONG TO PASTE........
;
//the example rule that occurs when the token's "print" is encountered
printstmt1
: (PRINT | QUESTION) printlist?
;
printlist
: expression (COMMA | SEMICOLON)? printlist*
;
As you can see from this BNF-style grammar, the statement rule includes the rule for a print statement as well as every other type of statement. Calling getText() on it therefore returns the text of the whole statement (PRINT and the string), that is, everything except the line number, which is matched by linenumber, a rule outside of the statement rule.
If you want to handle what happens when these specific rules are encountered, you most likely want to add functionality to the methods ANTLR generates for each rule by extending the jvmBasicListener class as shown here
example:
-jvmBasicListener.java
-extended to jvmBasicCustomListener.java
void enterPrintstmt1(jvmBasicParser.Printstmt1Context ctx) {
    System.out.println(ctx.getText());
}
However, if all this is set up and you just want to return a string value from the single line you have, then accessing the child nodes of statement at a lower level may work (amprstmt->statement->printstmt1->value):
return parser.prog().line(0).amprstmt(0).statement().printstmt1().getText();
To narrow my answer slightly, the rules that specifically address your input "10 PRINT "HELLO" " would be:
linenumber (contains Number), statement->printstmt1 and statement->datastmt->datum (contains STRINGLITERAL)
So, as shown above, the linenumber rule exists on its own and the other 2 rules that define your text are children of statement, which explains why you get everything except the line number when reading the statement rule's text.
Addressing each of these and using getText() on them, rather than on an encompassing rule such as statement, may give you the result you are looking for.
I will update to address your question, since the answer may be slightly longer. In my opinion the easiest way to handle specific rules, rather than generating a listener or visitor, is to implement actions within the rules of your grammar file like this:
printstmt1
: (PRINT | QUESTION) printlist? { System.out.println("Print"); /* your Java code */ }
;
This allows you to address each rule and perform whichever Java action you wish to carry out. You can then generate the parser from your grammar with something like:
java -jar antlr-4.5.3-complete.jar jvmBasic.g4 -visitor
After this you can simply run your code however you wish, here is an example:
import JVM1.jvmBasicLexer;
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;
public class Jvm extends jvmBasicBaseVisitor<Object> {
    public static void main(String[] args) {
        jvmBasicLexer lexer = new jvmBasicLexer(new ANTLRInputStream("10 PRINT \"Hello!\""));
        jvmBasicParser parser = new jvmBasicParser(new CommonTokenStream(lexer));
        ParseTree tree = parser.prog();
    }
}
The output for this example would then be just :
Print
You could also incorporate whatever Java methods you like within the grammar to address each rule encountered, and either develop your own classes and methods to handle it or directly print out a result.
Update
Just to address the latest question now :
parser.line().linenumber().getText() - For line Number, as line is not part of a statement
parser.prog().line(0).amprstmt(0).statement().printstmt1().PRINT().getText() - For PRINT as it is isolated in printstmt1, however does not include CLR in the rule
parser.prog().line(0).amprstmt(0).statement().printstmt1().printlist().expression().getText() - To get the value "hello" as it is part of an expression contained within the printstmt1 rule.
:) Good luck
I was using the below regex to substitute file names
Regex -> .*\/([A-Z0-9_]{1,9})_(O).*.cmd
Substitution -> $1
The file names were like below:
File Name                            | Substituted Name
-------------------------------------|------------------
/V3/OGM_REC_Offline_Level0_4D.cmd    | OGM_REC
/V2/PIE_PROD_Online_Level1_6D.cmd    | PIE_PROD
/V3/BR2_OnDemand.cmd                 | BR2
/opt/STING_Online_Inc0_1W.cmd        | STING
Then the files changed and I modified the regex
Regex -> .*\/([A-Z0-9_]{1,9})(_O|Full).*.cmd
Substitution -> $1
Additional new file names
File Name                | Substituted Name
-------------------------|------------------
/opt/RSU10Full.cmd       | RSU10
/V4/REZ40_1Full.cmd      | REZ40_1
Now it seems that new files are arriving with the name formats below
/app/OMGIT_FullOnDemand_4W.cmd
/admin/FOC_STG_Full_6D.cmd
I've modified the regex again, but it doesn't work
Regex -> .*\/([A-Z0-9_]{1,9})(_O|Full|_Full).*.cmd
Substitution -> $1
I suggest using a version with a lazy limiting quantifier {1,9}? and optional _:
.*/([A-Z0-9_]{1,9}?)(_O|_?Full).*[.]cmd
This way, we match as few characters with [A-Z0-9_]{1,9}? as possible to return a valid captured subtext, and _?Full part can hold the optional underscore.
See the regex demo
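To check the suggestion against the file names from the question, here is a small Java sketch. The class name CmdPrefix and the method prefixOf are invented for the example; the pattern is the one given above, unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CmdPrefix {
    // Lazy {1,9}? keeps the capture as short as possible; _?Full allows an optional underscore.
    private static final Pattern CMD =
            Pattern.compile(".*/([A-Z0-9_]{1,9}?)(_O|_?Full).*[.]cmd");

    // Returns the captured prefix, or null when the path does not match at all.
    static String prefixOf(String path) {
        Matcher m = CMD.matcher(path);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(prefixOf("/app/OMGIT_FullOnDemand_4W.cmd")); // OMGIT
        System.out.println(prefixOf("/admin/FOC_STG_Full_6D.cmd"));     // FOC_STG
    }
}
```

The older names still resolve as before, for example /V4/REZ40_1Full.cmd gives REZ40_1 and /V3/OGM_REC_Offline_Level0_4D.cmd gives OGM_REC.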
I've noticed that the unnecessary tail always starts with: an (optional) _, an uppercase letter, then a lowercase letter.
So a universal solution is:
.*\/([^a-z]*?)[_]?[A-Z][a-z].*
In my XText DSL, I want to be able to use three different kinds of text terminals. They are all used for adding comments on top of arrows drawn in a UML diagram:
terminal WORD:
Actor -> Actor: WORD
terminal SL_STRINGS:
Actor -> Actor: A sequence of words on a single line
terminal ML_STRINGS:
Actor -> Actor: A series of words on
multiple
lines
My initial approach was to use the ID terminal from the org.eclipse.xtext.common.Terminals as my WORD terminal, and then just have SL_STRINGS be (WORD)*, and ML_STRINGS be (NEWLINE? WORD)*, but this creates a lot of problems with ambiguity between the rules.
How would I go about structuring this in a good way?
More information about the project. (And as this is the first time working with XText, please bear with me):
I am trying to implement a DSL to be used together with the Eclipse Plugin for PlantUML http://plantuml.sourceforge.net/sequence.html mainly for Syntax Checking and Colorization. Currently my grammar works as such:
Model:
(diagrams+=Diagram)*;
Diagram:
'#startuml' NEWLINE (instructions+=(Instruction))* '#enduml' NEWLINE*
;
An instruction can be lots of things:
Instruction:
((name1=ID SEQUENCE name2=ID (':' ID)?)
| Definition (Color)?
| AutoNumber
| Title
| Legend
| Newpage
| AltElse
| GroupingMessages
| Note
| Divider
| Reference
| Delay
| Space
| Hidefootbox
| Lifeline
| ParticipantCreation
| Box)? NEWLINE
;
Example of rules that need different kinds of text terminals:
Group:
'group' TEXT
;
Reference:
'ref over' ID (',' ID)* ((':' SL_TEXT)|((ML_TEXT) NEWLINE 'end ref'))
;
For Group, the text can only be on one line, while for Reference, the text can be on two lines if there is no ":" following the rule call.
Currently my terminals look like this:
terminal NEWLINE : ('\r'? '\n');
// Multiline comment begins with /', and ends with '/
terminal ML_COMMENT : '/\'' -> '\'/';
// Singleline comment begins with ', and continues until end of line.
terminal SL_COMMENT : '\'' !('\n'|'\r')* ('\r'? '\n')?;
// INT is a sequence of numbers 0-9.
terminal INT returns ecore::EInt: ('0'..'9')+;
terminal WS : (' '|'\t')+;
terminal ANY_OTHER: .;
On top of this, I want to add three new terminals that take care of the text.
You should implement a data type rule in order to achieve the desired behavior.
Sebastian wrote an excellent blog post on this topic which can be found here: http://zarnekow.blogspot.de/2012/11/xtext-corner-6-data-types-terminals-why.html
Here is a minimal example of a grammar:
grammar org.xtext.example.mydsl.MyDsl with org.eclipse.xtext.common.Terminals
generate myDsl "http://www.xtext.org/example/mydsl/MyDsl"
Model:
greetings+=Greeting*;
Greeting:
'Example' ':' comment=Comment;
Comment:
(ID ('\r'? '\n')?)+
;
That will allow you to write something like this:
Example: A series of words
Example: A series of words on
multiple lines
You then may want to implement your own value converter in order to fine-tune the conversion to a String.
Let me know if that helps!
I am trying to search for words that appear within tilde (~) sign borders.
e.g. ~albert~ is a ~good~ boy.
I know that this is possible by using ~.+?~, and it already works for me. But there are special cases where I need to match a nested tilde sentence.
e.g. ~The ~spectacle~~ was ~broken~
In the example above, I have to capture 'The spectacle', 'spectacle', and 'broken' separately. These will be translated either word-by-word or with an accompanying article (An, The, whatever). The reason is that in my system:
1) 'The spectacle' requires a separate translation in specific cases.
2) 'spectacle' also needs translation in specific cases.
3) IF a translation exists for 'The spectacle', we will use that, ELSE
we will use
Another example to explain this is:
~The ~spectacle~~ was ~broken~, but that was not the same ~spectacle~
that was given to ~me~.
In the example above, I will have translation for:
1) 'The spectacle' (because the translation case exists for 'The spectacle', otherwise I would've only translated spectacle on its own)
2) 'broken'
3) 'spectacle'
4) me
I am having trouble putting together a regular expression that captures all of this. The one that I have managed to get working so far is '~.+?~'. But I know that with some form of lookahead or lookbehind, I can get this working. Could anyone help me with this?
The most important aspect of this is regression-proofing, which will ensure that the existing stuff doesn't break. If I manage to get it right, I will post it.
N.B. If it helps, currently I will only have instances where one level of nesting requires decomposition, so ~The ~spectacle~~ will be the deepest level (until I need more!).
I wrote something like this a while ago, I haven't tested it much though:
(~(?(?=.*?~~.*?~).*?~.*?~.*?~|[^~]+?~))
or
(~(?(?=.*?~[A-Za-z]*?~.*?~).*?~.*?~.*?~|[^~]+?~))
RegEx101
Another alternative
(~(?:.*?~.*?~){0,2}.*?~)
                 ^ change to max depth
Use whichever works best.
To support more levels, add a few extra sets of .*?~ in the two places where you see a bunch.
The main problem
If we allow unlimited nesting, how would we know where each group begins and ends? A clumsy diagram:
~This text could be nested ~ so could this~ and this~ this ~Also this~
|                          |              |_________|      |         |
|                          |_______________________________|         |
|____________________________________________________________________|
or:
~This text could be nested ~ so could this~ and this~ this ~Also this~
|                          |              |         |      |_________|
|                          |______________|         |
|___________________________________________________|
The compiler would have no idea which to choose
For your sentence
~The ~spectacle~~ was ~broken~, but that was not the same ~spectacle~ that was given to ~me~.
|    |         ||_____|      |                            |         |
|    |         |_____________|                            |         |
|    |____________________________________________________|         |
|___________________________________________________________________|
or:
~The ~spectacle~~ was ~broken~, but that was not the same ~spectacle~ that was given to ~me~.
|    |_________||     |______|                            |_________|                   |__|
|_______________|
What should I do?
Use distinct opening and closing characters (as @tbraun suggested) so the compiler knows where to start and end:
{This text can be {properly {nested}} without problems} because {the compiler {can {see {the}}} start and end points} easily.
Or use a compiler:
Note: I don't do Java much so some code might be incorrect
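import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.commons.lang3.StringUtils;

public class BraceGroups {
    // Collects every top-level {...} group in myString, nested groups included in their parent.
    static List<String> topLevelGroups(String myString) {
        // split("") yields one single-character string per character (Java 8+).
        String[] chars = myString.split("");
        int depth = 0;
        int lastIndex = 0; // index of the '{' that opened the current top-level group
        List<String> results = new ArrayList<String>();
        for (int i = 0; i < chars.length; i += 1) {
            if (chars[i].equals("{")) {
                depth += 1;
                if (depth == 1) {
                    lastIndex = i;
                }
            }
            if (chars[i].equals("}")) {
                depth -= 1;
                if (depth == 0) {
                    results.add(StringUtils.join(Arrays.copyOfRange(chars, lastIndex, i + 1), ""));
                }
                if (depth < 0) {
                    // Balancing problem: a '}' without a matching '{'
                    throw new IllegalArgumentException("Unbalanced '}' at index " + i);
                }
            }
        }
        return results;
    }
}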
This uses StringUtils
You'll need something to differentiate start/finish patterns. I.e. {}
Then you can use the pattern \{[^{]*?\} to exclude {:
{The {spectacle}} was {broken}
First iteration
{spectacle}
{broken}
Second iteration
{The spectacle}
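The iterations above can be sketched in Java as a loop that collects the innermost groups, strips their braces, and repeats until nothing matches. This is only one possible reading of the answer: the class InnermostFirst and the method iterations are hypothetical names, and a capture group has been added to the pattern so the text between the braces can be reused.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InnermostFirst {
    // The pattern from the answer, with a capture group around the content.
    private static final Pattern INNERMOST = Pattern.compile("\\{([^{]*?)\\}");

    // Each inner list holds the groups found in one pass, innermost groups first.
    static List<List<String>> iterations(String text) {
        List<List<String>> passes = new ArrayList<>();
        while (true) {
            Matcher m = INNERMOST.matcher(text);
            List<String> found = new ArrayList<>();
            StringBuffer stripped = new StringBuffer();
            while (m.find()) {
                found.add(m.group(1));
                // Replace "{content}" by "content" so the enclosing group becomes innermost.
                m.appendReplacement(stripped, Matcher.quoteReplacement(m.group(1)));
            }
            if (found.isEmpty()) {
                return passes;
            }
            m.appendTail(stripped);
            passes.add(found);
            text = stripped.toString();
        }
    }

    public static void main(String[] args) {
        // Prints [[spectacle, broken], [The spectacle]]
        System.out.println(iterations("{The {spectacle}} was {broken}"));
    }
}
```

Because each pass removes one layer of braces, the loop stops on its own once no braces are left and does not need a fixed maximum depth.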
I am totally new to regex and I want to validate that a string contains only valid combinations of the logical operators ( ! , & , ( , ) , | ). For example, if & and | appear next to each other the string should be invalid, as AND and OR should not come together. Likewise, other invalid combinations are &|, |&, (), !&, &! etc.
For example, with the strings below:
1. (ABC)&(DFG)|!(ZXC) - pass because all operators are correctly combined
2. !(ABC|DKJ)&VBN - pass
3. !(ADF&(!&(BER|CTY))|DGH) - fail, as !& are combined
4. !(ABC&DKJ)&|VBN - fail, as & | are combined
I know there are several ways to do this, for instance using String's contains method to check and reject the input if it does not pass validation. But I am looking for a solution using regex in Java.
Just to avoid matching invalid operator combos you can use negative lookahead regex like this:
^(?!.*?(&\\||\\|&|\\(\\)|!&|&!))
Use it with MULTILINE option like this for multiline inputs:
Pattern p = Pattern.compile( "(?m)^(?!.*?(&[!|]|[!(|]&|\\(\\)))" );
RegEx Demo
For using it with a string input you can do:
boolean value = input.matches( "(?!.*?(&[!|]|[!(|]&|\\(\\))).+" );
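As a quick check against the four sample strings from the question, here is a minimal Java sketch. The class name OperatorCheck and the method hasValidOperators are made up for the example; it uses the longer alternation from the first pattern in this answer (&|, |&, (), !&, &!) followed by .+ so that String.matches can consume the whole input.

```java
class OperatorCheck {
    // The negative lookahead rejects the input if any forbidden operator pair occurs anywhere.
    static boolean hasValidOperators(String input) {
        return input.matches("(?!.*?(&\\||\\|&|\\(\\)|!&|&!)).+");
    }

    public static void main(String[] args) {
        System.out.println(hasValidOperators("(ABC)&(DFG)|!(ZXC)"));       // true
        System.out.println(hasValidOperators("!(ABC|DKJ)&VBN"));           // true
        System.out.println(hasValidOperators("!(ADF&(!&(BER|CTY))|DGH)")); // false
        System.out.println(hasValidOperators("!(ABC&DKJ)&|VBN"));          // false
    }
}
```

Note that this only rules out the listed operator pairs; it does not check that parentheses are balanced.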