Porting Twemoji regex to extract Unicode emojis in Java


I'm trying to identify the same emojis in a String for extraction that Twemoji would, using Java. A straight-up port isn't working for a great many emojis. I think I've identified the issue, so I'll give an example below:
Suppose we have the emoji 🪔 (code units \ud83e\ude94). In JavaScript regex, this is captured by \ud83e[\ude94-\ude99], which first matches \ud83e and then finds the following \ude94 within the range given inside the brackets. The same expression in Java regex, however, fails to match at all. If I modify the Java pattern to [\ud83e[\ude94-\ude99]], then according to an online engine the second half is captured, but not the first.
My working theory is that Java encounters the brackets and treats everything inside as a single code point and, when combined with the code unit outside, thinks it's looking for two code points instead of one. Is there an easy way to fix this, or to change the regex pattern to work around it? The obvious fix would be to use something like [\ud83e\ude94-\ud83e\ude99], but the actual regex pattern is quite lengthy. I wonder if there might be an easy encoding fix somewhere here as well.
Toy sample below:
public static void main(String[] args) {
    String emojiPattern = "\ud83e[\ude94-\ude99]";
    String raw = "\ud83e\ude94";
    Pattern pattern = Pattern.compile(emojiPattern);
    Matcher matcher = pattern.matcher(raw);
    System.out.println(matcher.matches());
}

If you're trying to match a single specific codepoint, don't mess with surrogate pairs; refer to it by number:
String emojiPattern = "\\x{1FA94}";
or by name:
String emojiPattern = "\\N{DIYA LAMP}";
If you want to match any codepoint in the block U+1FA94 is in, use the name of the block in a property atom:
String emojiPattern = "\\p{blk=Symbols and Pictographs Extended-A}";
If you switch out any of these three regular expressions your example program will print 'true'.
The problem you're running into is that a UTF-16 surrogate pair is a single code point, and the RE engine matches code points, not code units; you can't match just the low or high half. For example, the pattern "\ud83e" on its own will fail to match too (when used with Matcher#find instead of Matcher#matches, of course). It's all or none.
To do that kind of ranged matching on the surrogate halves themselves, you can step outside regular expressions and look at the code units directly. Something like
char[] codeUnits = raw.toCharArray();
for (int i = 0; i < codeUnits.length - 1; i++) {
    if (codeUnits[i] == 0xD83E &&
            (codeUnits[i + 1] >= 0xDE94 && codeUnits[i + 1] <= 0xDE99)) {
        System.out.println("match");
    }
}
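A Java character class can also hold a range of whole code points, which may be enough to port the Twemoji ranges without a manual loop. The following is a minimal sketch, assuming the \x{h...h} syntax available in Java 7 and later; the class and method names (EmojiRangeDemo, extract) are illustrative and not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmojiRangeDemo {
    // Range of whole code points U+1FA94..U+1FA99, intended as the Java
    // counterpart of the JavaScript pattern \ud83e[\ude94-\ude99].
    static final Pattern LAMP_RANGE = Pattern.compile("[\\x{1FA94}-\\x{1FA99}]");

    // Collects every substring of the input that falls inside the range.
    static List<String> extract(String raw) {
        List<String> found = new ArrayList<>();
        Matcher m = LAMP_RANGE.matcher(raw);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String raw = "a\ud83e\ude94b\ud83e\ude99c";
        // Prints how many code points from the range were found.
        System.out.println(extract(raw).size());
    }
}
```

Each JavaScript fragment of the form high[lowStart-lowEnd] would be rewritten as one code point range, so the conversion is mechanical even for a long pattern.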

Related

Regex for finding between 1 and 3 character in a string

I am trying to write a regex which should return true if [A-Za-z] occurs between 1 and 3 times, but I am not able to do this
public static void main(String[] args) {
    String regex = "(?:([A-Za-z]*){3}).*";
    String regex1 = "(?=((([A-Za-z]){1}){1,3})).*";
    Pattern pattern = Pattern.compile(regex);
    System.out.println(pattern.matcher("AD1CDD").find());
}
Note: for consecutive 3 characters I am able to write it, but what I want to achieve is the occurrence should be between 1 and 3 only for the entire string. If there are 4 characters, it should return false. I have used look-ahead to achieve this

If I understand your question correctly, you want to check if
1 to 3 characters of the range [a-zA-Z] are in the string
Any other character can occur arbitrarily often?
First of all, just counting the characters and not using a regular expression is more efficient, as this is not a regular language problem, but a trivial counting problem. There is nothing wrong with using a for loop for this problem (except that interpreters such as Python and R can be fairly slow).
Nevertheless, you can (ab-) use extended regular expressions:
^([^A-Za-z]*[A-Za-z]){1,3}[^A-Za-z]*$
This is fairly straightforward, once you also model the "other" characters. And that is what you should do to define a pattern: model all accepted strings (i.e. the entire "language"), not only those characters you want to find.
Alternatively, you can "findAll" matches of ([A-Za-z]), and look at the length of the result. This may be more convenient if you also need the actual characters.
The for loop would look something like this:
public static boolean containsOneToThreeAlphabetic(String str) {
    int matched = 0;
    for (int i = 0; i < str.length(); i++) {
        char c = str.charAt(i);
        if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')) matched++;
    }
    return matched >= 1 && matched <= 3;
}
This is straightforward, readable, extensible, and efficient (in compiled languages). You can also add an if (matched>=4) return false; (or break) to stop early.
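The two regex-based options mentioned above (the whole-string pattern, and finding every [A-Za-z] and counting the hits) could be sketched in Java roughly as follows. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LetterCountDemo {
    // Whole-string pattern from the answer: one to three letters, each
    // optionally preceded by non-letters, followed by trailing non-letters.
    static final Pattern ONE_TO_THREE =
            Pattern.compile("^([^A-Za-z]*[A-Za-z]){1,3}[^A-Za-z]*$");

    // Single-letter pattern used for the "find all and count" variant.
    static final Pattern LETTER = Pattern.compile("[A-Za-z]");

    static boolean byWholeStringPattern(String s) {
        return ONE_TO_THREE.matcher(s).matches();
    }

    static boolean byCounting(String s) {
        Matcher m = LETTER.matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
            if (count > 3) {
                return false; // stop early once a fourth letter shows up
            }
        }
        return count >= 1;
    }

    public static void main(String[] args) {
        System.out.println(byWholeStringPattern("AD1CDD"));
        System.out.println(byCounting("A1B2"));
    }
}
```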

Please, stop playing with regex, you'll complicate not only your own life, but the life of the people, who have to handle your code in the future. Choose a simpler approach, find all [A-Za-z]+ strings, put them into the list, then check every string, if the length is within 1 and 3 or beyond that.

Regex
/([A-Za-z])(?=(?:.*\1){3})/s
Looking for a char and for 3 repetitions of it. So if it matches there are 4 or more equal chars present.

Regex in Java: match groups until first symbol occurrence

My string looks like this:
"Chitkara DK, Rawat DJY, Talley N. The epidemiology of childhood recurrent abdominal pain in Western countries: a systematic review. Am J Gastroenterol. 2005;100(8):1868-75. DOI."
What I want is to get letters in uppercase (as separate words only) until first dot, to get: DK DJY N. But not other characters after, like J DOI.
Here's my pattern for the Java Pattern class:
\\b[A-Z]{1,3}\\b
Is there a general option in regex to stop matching after certain character?

You can make use of continuous matching with \G and extract your desired matches from the first capturing group:
(?:\\G|^)[^.]+?\\b([A-Z]{1,3})\\b
You need to use the MULTILINE flag to use this in a multiline context. If your content is always a single line you may drop the |^ from your pattern.
See https://regex101.com/r/JXIu21/3
Note that regex101 uses a PCRE pattern, but all features used are also available in Java regex.
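As a rough Java sketch of the \G approach, assuming single-line input (so the |^ alternative is dropped, as suggested above); the class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContinuousMatchDemo {
    // \G anchors each match at the end of the previous one (or at the start
    // of the input for the first match); [^.]+? cannot cross a dot, so the
    // chain of matches stops at the first dot.
    static final Pattern ABBREVIATION =
            Pattern.compile("\\G[^.]+?\\b([A-Z]{1,3})\\b");

    static List<String> abbreviationsBeforeFirstDot(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = ABBREVIATION.matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String citation = "Chitkara DK, Rawat DJY, Talley N. The epidemiology of "
                + "childhood recurrent abdominal pain. Am J Gastroenterol. DOI.";
        System.out.println(abbreviationsBeforeFirstDot(citation)); // [DK, DJY, N]
    }
}
```

Because [^.]+? needs at least one character before the captured word, an abbreviation at the very start of the input would be skipped by this pattern.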

Sebastian Proske's answer is great, but it's often easier (and more readable) to split complex parsing tasks into separate steps. We can split your goal into two separate steps and thereby create a much simpler and more clearly-correct solution, using your original pattern.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static final Pattern UPPER_CASE_ABBV_PATTERN =
        Pattern.compile("\\b[A-Z]{1,3}\\b");

public static List<String> getAbbreviationsInFirstSentence(String input) {
    // isolate the first sentence, since that's all we care about
    String firstSentence = input.split("\\.")[0];
    // then look for matches in the first sentence
    Matcher m = UPPER_CASE_ABBV_PATTERN.matcher(firstSentence);
    List<String> results = new ArrayList<>();
    while (m.find()) {
        results.add(m.group());
    }
    return results;
}

Java regular expression for number starts with code

I am not a Java developer but I am interfacing with a Java system.
Please help me with a regular expression that would detect all numbers starting with 25678 or 25677.
For example in rails would be:
^(25677|25678)
Sample inputs are 256776582036 and 256782405036

^(25678|25677)
or
^2567[78]
If you use ^(25678|25677)[0-9]*$ it guarantees that the remaining characters are all digits and nothing else.
That should do the trick for you: it looks for either prefix and then any number of digits after it.

In Java the regex would be the same, assuming that the number takes up the entire line. You could further simplify it to
^2567[78]
If you need to match a number anywhere in the string, use \b anchor (double the backslash if you are making a string literal in Java code).
\b2567[78]
"How about if there is a possibility of a + at the beginning of a number?"
Add an optional +, like this [+]? or like this \+? (again, double the backslash for inclusion in a string literal).
Note that it is important to know what Java API is used with the regular expression, because some APIs will require the regex to cover the entire string in order to declare it a match.

Try something like:
String number = ...;
if (number.matches("^2567[78].*$")) {
    // yes, it starts with your number
}
The regex ^2567[78].*$ means:
The number starts with 2567, followed by either 7 or 8, and then followed by any characters.
If you need only digits after, say, 25677, then the regex should be ^2567[78]\\d*$, which means the matching prefix at the beginning is followed by zero or more digits.

The regex syntax of Java is pretty close to that of Rails, especially for something this simple. The trick is in using the correct API calls. If you need to do more than one search, it's worthwhile to compile the pattern once and reuse it. Something like this should work (here strings stands for whatever collection of inputs you have):
Pattern p = Pattern.compile("^2567[78]");
for (String s : strings) {
    if (p.matcher(s).find()) {
        // string starts with 25677 or 25678
    } else {
        // string starts with something else
    }
}
If it's a one-shot deal, then you can simplify all this by changing the pattern to cover the entire string:
if (someString.matches("2567[78].*")) {
    // string starts with 25677 or 25678
}
The matches() method tests whether the entire string matches the pattern; hence the leading ^ anchor is unnecessary but the trailing .* is needed.
If you need to account for an optional leading + (as you indicated in a comment to another answer), just include \\+? at the start of the pattern (or after the ^ if that's used). The escape is needed because a bare + is a quantifier and would make the pattern invalid.
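To make the difference between find() and matches() concrete, here is a small sketch that also handles the optional leading +. The class and method names are illustrative, and the digits-only variant is an assumption about what "number" should mean for this input.

```java
import java.util.regex.Pattern;

final class PrefixCheckDemo {
    // Compiled once and reused; ^ anchors the search at the start of the input.
    static final Pattern PREFIX = Pattern.compile("^\\+?2567[78]");

    // find() only needs the pattern to match somewhere, so the anchor matters.
    static boolean startsWithCode(String s) {
        return PREFIX.matcher(s).find();
    }

    // String.matches() must cover the whole string, so the pattern spells out
    // the rest of the input (digits only in this variant).
    static boolean isNumberWithCode(String s) {
        return s.matches("\\+?2567[78]\\d*");
    }

    public static void main(String[] args) {
        System.out.println(startsWithCode("256776582036"));
        System.out.println(isNumberWithCode("+256782405036"));
    }
}
```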

pattern matching in java using regular expression

I am looking for a pattern to match this "LA5#10.232.140.133#Po6" and one more "LA5#10.232.140.133#Port-channel7" expression in Java using regular expression.
Like we have \d{1,3}.\d{1,3}.\d{1,3}.\d{1,3} for IP address validation.
Can we have the pattern like below? Please suggest--
[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]#\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}#Po\d[1-9]
[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]#\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}#Port-channel\d[1-9]
Thanks in advance.
==============================
In my program i have,
import java.util.regex.*;
class ptternmatch {
    public static void main(String [] args) {
        Pattern p = Pattern.compile("\\w\\w\\w#\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}#*");
        Matcher m = p.matcher("LA5#10.232.140.133#Port-channel7");
        boolean b = false;
        System.out.println("Pattern is " + m.pattern());
        while(b = m.find()) {
            System.out.println(m.start() + " " + m.group());
        }
    }
}
But I am getting a compilation error with the pattern: Invalid escape sequence
The sequence will be like this: a 3-character word of digits and letters#ipaddress#some text.

Well, if you want to validate the IP address, then you need something a little bit more involved than \d{1,3}. Also, keep in mind that for Java string literals, you need to escape the \ with \\ so you end up with a single backslash in the actual regex to escape a character such as a period (.).
Assuming the LA5# bit is static and that you're fine with either Po or Port-channel followed by a digit on the end, then you probably need a regex along these lines:
LA5#(((2((5[0-5])|([0-4][0-9])))|(1[0-9]{2})|([1-9][0-9]?)\\.){3}(2(5[0-5]|[0-4][0-9]))|(1[0-9]{2})|([1-9][0-9]?)#Po(rt-channel)?[1-9]
(Bracketing may be wonky, my apologies)
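Building the pattern from a named octet fragment can keep the bracketing manageable. The sketch below is one possible reading of the requirements rather than a verified fix of the regex above: it assumes a three-character alphanumeric prefix, octets from 0 to 255, and a suffix of Po or Port-channel followed by one or more digits. The class, constant, and method names are invented for the example.

```java
import java.util.regex.Pattern;

final class InterfaceIdDemo {
    // One IPv4 octet: 250-255, 200-249, 100-199, or 0-99 without a leading zero.
    static final String OCTET = "(?:25[0-5]|2[0-4]\\d|1\\d{2}|[1-9]?\\d)";

    // Assumed layout: <3 alphanumerics>#<IPv4 address>#Po<digits> or
    // #Port-channel<digits>.
    static final Pattern ID = Pattern.compile(
            "[A-Za-z0-9]{3}#" + OCTET + "(?:\\." + OCTET + "){3}"
                    + "#Po(?:rt-channel)?\\d+");

    static boolean isValid(String s) {
        return ID.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("LA5#10.232.140.133#Po6"));
        System.out.println(isValid("LA5#10.232.140.133#Port-channel7"));
    }
}
```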

You can do something like matcher.find() and, if it is true, use the groups to capture the information. Take a look at the tutorial here:
http://download.oracle.com/javase/tutorial/essential/regex/
You would need to wrap the necessary parts in parentheses, e.g. (\d{1,3}). If you wrap all 4, you will have 4 groups to access.
Also, take a look at this tutorial
http://www.javaworld.com/javaworld/jw-07-2001/jw-0713-regex.html?page=3
It's a very good tutorial, I think this one would explain most of your questions.
To match the second of your strings:
LA5#10.232.140.133#Port-channel7
you can use something like:
\w{2}\d#\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}#[a-zA-Z\-]+\d
This depends on what you want to do, so the regex might change.

Capturing <thisPartOnly> and (thisPartOnly) with the same group

Let's say we have the following input:
<amy>
(bob)
<carol)
(dean>
We also have the following regex:
<(\w+)>|\((\w+)\)
Now we get two matches (as seen on rubular.com):
<amy> is a match, \1 captures amy, \2 fails
(bob) is a match, \2 captures bob, \1 fails
This regex does most of what we want, which are:
It matches the open and close brackets properly (i.e. no mixing)
It captures the part we're interested in
However, it does have a few drawbacks:
The capturing pattern (i.e. the "main" part) is repeated
It's only \w+ in this case, but generally speaking this can be quite complex,
If it involves backreferences, then they must be renumbered for each alternate!
Repetition makes maintenance a nightmare! (what if it changes?)
The groups are essentially duplicated
Depending on which alternate matches, we must query different groups
It's only \1 or \2 in this case, but generally the "main" part can have capturing groups of their own!
Not only is this inconvenient, but there may be situations where this is not feasible (e.g. when we're using a custom regex framework that is limited to querying only one group)
The situation quickly worsens if we also want to match {...}, [...], etc.
So the question is obvious: how can we do this without repeating the "main" pattern?
Note: for the most part I'm interested in java.util.regex flavor, but other flavors are welcomed.
Appendix
There's nothing new in this section; it only illustrates the problem mentioned above with an example.
Let's take the above example to the next step: we now want to match these:
<amy=amy>
(bob=bob)
[carol=carol]
But not these:
<amy=amy) # non-matching bracket
<amy=bob> # left hand side not equal to right hand side
Using the alternate technique, we have the following that works (as seen on rubular.com):
<((\w+)=\2)>|\(((\w+)=\4)\)|\[((\w+)=\6)\]
As explained above:
The main pattern can't simply be repeated; backreferences must be renumbered
Repetition also means maintenance nightmare if it ever changes
Depending on which alternate matches, we must query either \1 \2, \3 \4, or \5 \6

You can use a lookahead to "lock in" the group number before doing the real match.
String s = "<amy=amy>(bob=bob)[carol=carol]";
Pattern p = Pattern.compile(
    "(?=[<(\\[]((\\w+)=\\2))(?:<\\1>|\\(\\1\\)|\\[\\1\\])");
Matcher m = p.matcher(s);
while(m.find())
{
    System.out.printf("found %s in %s%n", m.group(2), m.group());
}
output:
found amy in <amy=amy>
found bob in (bob=bob)
found carol in [carol=carol]
It's still ugly as hell, but you don't have to recalculate all the group numbers every time you make a change. For example, to add support for curly brackets, it's just:
"(?=[<(\\[{]((\\w+)=\\2))(?:<\\1>|\\(\\1\\)|\\[\\1\\]|\\{\\1\\})"

In preg (Perl Regex library), this will match your example, and \3 will catch the insides:
((<)|\()(\w+)(?(2)>|\))
It will not work in JS, though - you did not specify the dialect...
It depends on the conditional operator (?(2)...|...) which basically says if 2 is a non-null capture, then match before the pipe, else match after the pipe. In this form, pipe is not alternation ("or").
UPDATE Sorry, I completely missed the Java bit :) Anyway, apparently Java does not support the conditional construct; and I have no idea how else I'd go about it :(
Also, for your Appendix (even though it's the wrong dialect):
(?:(<)|(\()|\[)(\w+)=\3(?(1)>|(?(2)\)|]))
The name is again in \3 (I got rid of the first capturing paren, but I had to add another one for one extra opening paren check)

The only solution that I was able to come up with is inspired by technique of capturing an empty string on different alternates; backreferencing to these groups later can serve as pseudo-conditionals.
Thus, this pattern works for the second example (as seen on rubular.com):
                  __main__
                 /        \
(?:<()|\(()|\[())((\w+)=\5)(\1>|\2\)|\3\])
\_______________/          \_____________/
    \1   \2   \3
So essentially for each opening bracket, we assign a group that captures an empty string. Then when we try to match the closing bracket, we see which group was successful, and match the corresponding closing bracket.
The "main" part does not have to be repeated, but in Java, backreferences may have to be renumbered. This won't be a problem in flavors that support named groups.
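In java.util.regex a backreference to a group that did not take part in the match fails, which is what makes the empty groups behave as pseudo-conditionals. A minimal sketch of the pattern above in Java follows; the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmptyGroupDemo {
    // Groups 1-3 are empty captures recording which opening bracket matched;
    // group 4 is the "main" part and group 5 is the name inside it.
    static final Pattern BRACKETED = Pattern.compile(
            "(?:<()|\\(()|\\[())((\\w+)=\\5)(\\1>|\\2\\)|\\3\\])");

    // Returns the name from every well-formed bracketed pair in the input.
    static List<String> names(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = BRACKETED.matcher(input);
        while (m.find()) {
            found.add(m.group(5));
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "<amy=amy> (bob=bob) [carol=carol] <amy=amy) <amy=bob>";
        System.out.println(names(input)); // [amy, bob, carol]
    }
}
```

The mismatched <amy=amy) is rejected because only group 1 participated, so \2 and \3 cannot match, and <amy=bob> is rejected by the \5 backreference.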

Maybe this example in Perl will interest you:
$str = q/<amy=amy> (bob=bob) [carol=carol] <amy=amy) <amy=bob>/;
$re = qr/(?:<((\w+)=\2)>|\(((\w+)=\4)\)|\[((\w+)=\6)\])+/;
@list = ($str =~ /$re/g);
for (@list) {
    say $i++, " = ", $_;
}
I just surround your regex by (?:regex)+

When you get things like this, using a single regex is a silly restriction, and I simply don't agree with your "maintenance nightmare" to using more than one - repeating a similar-but-different expression several times is likely to be more maintainable (well, less unmaintainable), and maybe even better performance too, than a single overly-complex regex.
But anyway, there's no repetition if you just use variables to compose your regex.
Here's some pseudo-code:
Brackets = "<>,(),[]"
CoreRegex = "(\w+)=\1"
loop CurBracket in Brackets.split(',')
{
    Input.match( Regex.quote(CurBracket.left(1)) & CoreRegex & Regex.quote(CurBracket.right(1)) )
}
(P.S. That's just to give the general idea. I'd probably use already-escaped arrays for the bracket sets in an actual implementation.)
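In Java, the same idea might look like the sketch below, which builds one small pattern per bracket pair from a shared core. Pattern.quote is the standard way to escape the literal brackets; the class, constant, and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BracketComposeDemo {
    // The "main" part is written once; group 1 is the name and \1 repeats it.
    static final String CORE = "(\\w+)=\\1";

    // Each row holds an opening bracket and its matching closing bracket.
    static final String[][] BRACKETS = { {"<", ">"}, {"(", ")"}, {"[", "]"} };

    // Runs one composed pattern per bracket pair and collects the names.
    static List<String> names(String input) {
        List<String> found = new ArrayList<>();
        for (String[] pair : BRACKETS) {
            Pattern p = Pattern.compile(
                    Pattern.quote(pair[0]) + CORE + Pattern.quote(pair[1]));
            Matcher m = p.matcher(input);
            while (m.find()) {
                found.add(m.group(1));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "<amy=amy> (bob=bob) [carol=carol] <amy=amy) <amy=bob>";
        System.out.println(names(input)); // [amy, bob, carol]
    }
}
```

Results come back grouped by bracket type rather than in input order; if order matters, the match start offsets could be recorded and sorted.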

Assuming there is no easy way to manually write this regular expression, why not leave it to the computer?
You could have a function, maybe like below (I am using C# syntax here, as I am a bit more familiar with regexes here than in Java, but it should not be too difficult to adapt it to Java).
Note that I left the function AdaptBackreferences() more or less unimplemented as an exercise to the reader. It should just adapt the backreference numbering.
struct BracketPair { public string Open; public string Close; };

static string[] MatchTextInBrackets(string text, string innerPattern, BracketPair[] bracketPairs) {
    StringBuilder sb = new StringBuilder();
    // count the number of capturing parentheses of innerPattern here:
    int numberOfInnerCapturingParentheses = Regex.Match("", innerPattern).Groups.Count - 1;
    bool firstTime = true;
    foreach (BracketPair pair in bracketPairs) {
        // apply logic to change backreference numbering:
        string adaptedInnerPattern = AdaptBackreferences(innerPattern);
        if (firstTime) { firstTime = false; } else { sb.Append('|'); }
        sb.Append(pair.Open).Append("(").Append(adaptedInnerPattern).Append(")").Append(pair.Close);
    }
    string myPattern = sb.ToString();
    MatchCollection matches = Regex.Matches(text, myPattern);
    string[] result = new string[matches.Count];
    for (int i = 0; i < matches.Count; i++) {
        StringBuilder mb = new StringBuilder();
        for (int j = 0; j < bracketPairs.Length; j++) {
            // append them all together, assuming all except one are empty
            mb.Append(matches[i].Groups[1 + j * (numberOfInnerCapturingParentheses + 1)]);
        }
        result[i] = mb.ToString();
    }
    return result;
}

static string AdaptBackreferences(string pattern) { return pattern; } // to be written
