Regex for emoticon - java

have this regex:
(:?^|\s)+(;\))+
Im tryng to capture all ocurrences of ;) if it appears alone (between spaces) or at start of line.
Valid examples
;)
;)
;) ;) -> Should be 2 groups of ;)
Dont allow
a;)a
a;)
;)a
Current regex just captures first group for ;) ;) case because second ;) is expecting for a space but its used by first group..

You can match ;) using lookarounds:
(?<=\s|^);\)(?=\s|$)
RegEx Demo
(?<=\s|^) is look behind that asserts line start or a whitespace is at previous position
(?=\s|$) is look ahead that matches line end or a whitespace is at next position
In Java:
Pattern p = Pattern.compile("(?<=\\s|^);\\)(?=\\s|$)");

I'd like to suggest a more concise lookaround based solution:
String rx = "(?x)(?<!\\S) ;\\) (?!\\S)";
See the regex demo
Explanation:
(?x) - COMMENTS modifier ensuring that all pattern whitespace is ignored, and embedded comments starting with # are ignored until the end of a line (so that we can better see the parts of the pattern)
(?<!\\S) - a negative lookbehind failing the match if a ;) is preceded with a non-whitespace char
;\\) - literal ;)
(?!\\S) - a negative lookahead that fails the match if there is a non-whitespace char after the ;).
See the Java demo with a replaceAll to show it finds only those ;) you need:
String s = ";)\n ;) \n;) ;) -> Should be 2 groups of ; )\n\nDont allow\na;)a\na;) \n;)a";
System.out.println(s.replaceAll("(?x)(?<!\\S) ;\\) (?!\\S)", "<found>$0</found>"));
If you want to further enhance the pattern and you do not feel comfortable with the COMMENTS modifier, remove it. Use "(?<!\\S);\\)(?!\\S)" then.

Related

RegEx matching $1 tokens [duplicate]

Imagine you are trying to pattern match "stackoverflow".
You want the following:
this is stackoverflow and it rocks [MATCH]
stackoverflow is the best [MATCH]
i love stackoverflow [MATCH]
typostackoverflow rules [NO MATCH]
i love stackoverflowtypo [NO MATCH]
I know how to parse out stackoverflow if it has spaces on both sites using:
/\s(stackoverflow)\s/
Same with if its at the start or end of a string:
/^(stackoverflow)\s/
/\s(stackoverflow)$/
But how do you specify "space or end of string" and "space or start of string" using a regular expression?
You can use any of the following:
\b #A word break and will work for both spaces and end of lines.
(^|\s) #the | means or. () is a capturing group.
/\b(stackoverflow)\b/
Also, if you don't want to include the space in your match, you can use lookbehind/aheads.
(?<=\s|^) #to look behind the match
(stackoverflow) #the string you want. () optional
(?=\s|$) #to look ahead.
(^|\s) would match space or start of string and ($|\s) for space or end of string. Together it's:
(^|\s)stackoverflow($|\s)
Here's what I would use:
(?<!\S)stackoverflow(?!\S)
In other words, match "stackoverflow" if it's not preceded by a non-whitespace character and not followed by a non-whitespace character.
This is neater (IMO) than the "space-or-anchor" approach, and it doesn't assume the string starts and ends with word characters like the \b approach does.
\b matches at word boundaries (without actually matching any characters), so the following should do what you want:
\bstackoverflow\b

How to make Regex lookahead to match one and two digit numbers?

Let's say for example that I have a string reading "1this12string". I would like to use String#split with regex using lookahead that will give me ["1this", "12string"].
My current statement is (?=\d), which works very well for single digit numbers. I am having trouble modifying this statement to include both 1 and 2 digit numbers.
Add a look behind so you don't split within numbers:
(?<!\d)(?=\d)
See live demo.
If you really want to use Regex Lookahead, try this:
(\d{1,2}[^\d]*)(?=\d|\b)
Regex Demo
Note that this assume every string split must have 1 or 2 digits at the front. In case this is not the case, please let us know so that we can further enhance it.
Regex Logics
\d{1,2} to match 1 or 2 digits at the front
[^\d]* to match non-digit characters following the first 1 or 2 digit(s)
Enclose the the above 2 segments in parenthesis () so as to make it a capturing group for extraction of matched text.
(?=\d to fulfill your requirement to use Regex Lookahead
|\b to allow the matching text to be at the end of a text (just before a word boundary)
I think you can also achieve your task with a simpler regex, without using the relatively more sophisticated feature like Regex Lookahead. For example:
\d{1,2}[^\d]*
You can see in the Regex Demo that this works equally well for your sample input. Anyway, in case your requirement is anything more than this, please let us know to fine-tune it.
Use
String[] splits = string.split("(?<=\\D)(?=\\d)");
See regex proof
Explanation
--------------------------------------------------------------------------------
(?<= look behind to see if there is:
--------------------------------------------------------------------------------
\D non-digits (all but 0-9)
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
) end of look-ahead

Java Regex how to get all the matching occurences of a pattern [duplicate]

What are these two terms in an understandable way?
Greedy will consume as much as possible. From http://www.regular-expressions.info/repeat.html we see the example of trying to match HTML tags with <.+>. Suppose you have the following:
<em>Hello World</em>
You may think that <.+> (. means any non newline character and + means one or more) would only match the <em> and the </em>, when in reality it will be very greedy, and go from the first < to the last >. This means it will match <em>Hello World</em> instead of what you wanted.
Making it lazy (<.+?>) will prevent this. By adding the ? after the +, we tell it to repeat as few times as possible, so the first > it comes across, is where we want to stop the matching.
I'd encourage you to download RegExr, a great tool that will help you explore Regular Expressions - I use it all the time.
'Greedy' means match longest possible string.
'Lazy' means match shortest possible string.
For example, the greedy h.+l matches 'hell' in 'hello' but the lazy h.+?l matches 'hel'.
Greedy quantifier
Lazy quantifier
Description
*
*?
Star Quantifier: 0 or more
+
+?
Plus Quantifier: 1 or more
?
??
Optional Quantifier: 0 or 1
{n}
{n}?
Quantifier: exactly n
{n,}
{n,}?
Quantifier: n or more
{n,m}
{n,m}?
Quantifier: between n and m
Add a ? to a quantifier to make it ungreedy i.e lazy.
Example:
test string : stackoverflow
greedy reg expression : s.*o output: stackoverflow
lazy reg expression : s.*?o output: stackoverflow
Greedy means your expression will match as large a group as possible, lazy means it will match the smallest group possible. For this string:
abcdefghijklmc
and this expression:
a.*c
A greedy match will match the whole string, and a lazy match will match just the first abc.
As far as I know, most regex engine is greedy by default. Add a question mark at the end of quantifier will enable lazy match.
As #Andre S mentioned in comment.
Greedy: Keep searching until condition is not satisfied.
Lazy: Stop searching once condition is satisfied.
Refer to the example below for what is greedy and what is lazy.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
public static void main(String args[]){
String money = "100000000999";
String greedyRegex = "100(0*)";
Pattern pattern = Pattern.compile(greedyRegex);
Matcher matcher = pattern.matcher(money);
while(matcher.find()){
System.out.println("I'm greedy and I want " + matcher.group() + " dollars. This is the most I can get.");
}
String lazyRegex = "100(0*?)";
pattern = Pattern.compile(lazyRegex);
matcher = pattern.matcher(money);
while(matcher.find()){
System.out.println("I'm too lazy to get so much money, only " + matcher.group() + " dollars is enough for me");
}
}
}
The result is:
I'm greedy and I want 100000000 dollars. This is the most I can get.
I'm too lazy to get so much money, only 100 dollars is enough for me
Taken From www.regular-expressions.info
Greediness: Greedy quantifiers first tries to repeat the token as many times
as possible, and gradually gives up matches as the engine backtracks to find
an overall match.
Laziness: Lazy quantifier first repeats the token as few times as required, and
gradually expands the match as the engine backtracks through the regex to
find an overall match.
From Regular expression
The standard quantifiers in regular
expressions are greedy, meaning they
match as much as they can, only giving
back as necessary to match the
remainder of the regex.
By using a lazy quantifier, the
expression tries the minimal match
first.
Greedy matching. The default behavior of regular expressions is to be greedy. That means it tries to extract as much as possible until it conforms to a pattern even when a smaller part would have been syntactically sufficient.
Example:
import re
text = "<body>Regex Greedy Matching Example </body>"
re.findall('<.*>', text)
#> ['<body>Regex Greedy Matching Example </body>']
Instead of matching till the first occurrence of ‘>’, it extracted the whole string. This is the default greedy or ‘take it all’ behavior of regex.
Lazy matching, on the other hand, ‘takes as little as possible’. This can be effected by adding a ? at the end of the pattern.
Example:
re.findall('<.*?>', text)
#> ['<body>', '</body>']
If you want only the first match to be retrieved, use the search method instead.
re.search('<.*?>', text).group()
#> '<body>'
Source: Python Regex Examples
Greedy Quantifiers are like the IRS
They’ll take as much as they can. e.g. matches with this regex: .*
$50,000
Bye-bye bank balance.
See here for an example: Greedy-example
Non-greedy quantifiers - they take as little as they can
Ask for a tax refund: the IRS sudden becomes non-greedy - and return as little as possible: i.e. they use this quantifier:
(.{2,5}?)([0-9]*) against this input: $50,000
The first group is non-needy and only matches $5 – so I get a $5 refund against the $50,000 input.
See here: Non-greedy-example.
Why do we need greedy vs non-greedy?
It becomes important if you are trying to match certain parts of an expression. Sometimes you don't want to match everything - as little as possible. Sometimes you want to match as much as possible. Nothing more to it.
You can play around with the examples in the links posted above.
(Analogy used to help you remember).
Greedy means it will consume your pattern until there are none of them left and it can look no further.
Lazy will stop as soon as it will encounter the first pattern you requested.
One common example that I often encounter is \s*-\s*? of a regex ([0-9]{2}\s*-\s*?[0-9]{7})
The first \s* is classified as greedy because of * and will look as many white spaces as possible after the digits are encountered and then look for a dash character "-". Where as the second \s*? is lazy because of the present of *? which means that it will look the first white space character and stop right there.
Best shown by example. String. 192.168.1.1 and a greedy regex \b.+\b
You might think this would give you the 1st octet but is actually matches against the whole string. Why? Because the.+ is greedy and a greedy match matches every character in 192.168.1.1 until it reaches the end of the string. This is the important bit! Now it starts to backtrack one character at a time until it finds a match for the 3rd token (\b).
If the string a 4GB text file and 192.168.1.1 was at the start you could easily see how this backtracking would cause an issue.
To make a regex non greedy (lazy) put a question mark after your greedy search e.g
*?
??
+?
What happens now is token 2 (+?) finds a match, regex moves along a character and then tries the next token (\b) rather than token 2 (+?). So it creeps along gingerly.
To give extra clarification on Laziness, here is one example which is maybe not intuitive on first look but explains idea of "gradually expands the match" from Suganthan Madhavan Pillai answer.
input -> some.email#domain.com#
regex -> ^.*?#$
Regex for this input will have a match. At first glance somebody could say LAZY match(".*?#") will stop at first # after which it will check that input string ends("$"). Following this logic someone would conclude there is no match because input string doesn't end after first #.
But as you can see this is not the case, regex will go forward even though we are using non-greedy(lazy mode) search until it hits second # and have a MINIMAL match.
try to understand the following behavior:
var input = "0014.2";
Regex r1 = new Regex("\\d+.{0,1}\\d+");
Regex r2 = new Regex("\\d*.{0,1}\\d*");
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // "0014.2"
input = " 0014.2";
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // " 0014"
input = " 0014.2";
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // ""

Word that matches ^.*(?=.*\\d)(?=.*[a-zA-Z])(?=.*[!##$%^&]).*$

I am totally confused right now.
What is a word that matches: ^.*(?=.*\\d)(?=.*[a-zA-Z])(?=.*[!##$%^&]).*$
I tried at Regex 101 this 1Test#!. However that does not work.
I really appreciate your input!
What happens is that your regex seems to be in Java-flavor (Note the \\d)
that is why you have to convert it to work with regex101 which does not work with jave (only works with php, phyton, javascript)
see converted regex:
^.*(?=.*\d)(?=.*[a-zA-Z])(?=.*[!##$%^&]).*$
which will match your string 1Test#!. Demo here: http://regex101.com/r/gE3iQ9
You just want something that matches that regex?
Here:
a1a!
This pattern matches
\dTest#!
if u want a pattern which matches 1Test#! try this pattern
^.(?=.\d)(?=.[a-zA-Z])(?=.[!##$%^&]).*$
Your java string ^.*(?=.*\\d)(?=.*[a-zA-Z])(?=.*[!##$%^&]).*$ encodes the regexp expression ^.*(?=.*\d)(?=.*[a-zA-Z])(?=.*[!##$%^&]).*$.
This is because the \ is an escape sequence.
The latter matches the string you specified.
If your original string was a regexp, rather than a java string, it would match strings such as \dTest#!
Also you should consider removing the first .*, doing so would make the regexp more efficient. The reason is that regexp's by default are greedy. So it will start by matching the whole string to the initial .*, the lookahead will then fail. The regexp will backtrack, matchine the first .* to all but the last character, and will fail all but one of the loohaheads. This will proceed until it hits a point where the different lookaheads succeed. Dropping the first .*, putting the lookahead immidiately after the start of string anchor, will avoid this problem, and in this case the set of strings matched will be the same.

java regex to exclude specific strings from a larger one

I have been banging my head against this for some time now:
I want to capture all [a-z]+[0-9]? character sequences excluding strings such as sin|cos|tan etc.
So having done my regex homework the following regex should work:
(?:(?!(sin|cos|tan)))\b[a-z]+[0-9]?
As you see I am using negative lookahead along with alternation - the \b after the non-capturing group closing parenthesis is critical to avoid matching the in of sin etc. The regex makes sense and as a matter of fact I have tried it with RegexBuddy and Java as the target implementation and get the wanted result but it doesn't work using Java Matcher and Pattern objects!
Any thoughts?
cheers
The \b is in the wrong place. It would be looking for a word boundary that didn't have sin/cos/tan before it. But a boundary just after any of those would have a letter at the end, so it would have to be an end-of-word boundary, which is can't be if the next character is a-z.
Also, the negative lookahead would (if it worked) exclude strings like cost, which I'm not sure you want if you're just filtering out keywords.
I suggest:
\b(?!sin\b|cos\b|tan\b)[a-z]+[0-9]?\b
Or, more simply, you could just match \b[a-z]+[0-9]?\b and filter out the strings in the keyword list afterwards. You don't always have to do everything in regex.
So you want [a-z]+[0-9]? (a sequence of at least one letter, optionally followed by a digit), unless that letter sequence resembles one of sin cos tan?
\b(?!(sin|cos|tan)(?=\d|\b))[a-z]+\d?\b
results:
cos - no match
cosy - full match
cos1 - no match
cosy1 - full match
bla9 - full match
bla99 - no match
i forgot to escape the \b for java so \b should be \\b and it now works.
cheers

Categories