Finding longest regex match in Java? - java

I have this:
import java.util.regex.*;
String regex = "(?<m1>(hello|universe))|(?<m2>(hello world))";
String s = "hello world";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(s);
while(matcher.find()) {
MatchResult matchResult = matcher.toMatchResult();
String substring = s.substring(matchResult.start(), matchResult.end());
System.out.println(substring);
}
The above only prints hello whereas I want it to print hello world.
One way to fix this is to re-order the groups in String regex = "(?<m2>(hello world))|(?<m1>(hello|universe))" but I don't have control over the regex I get in my case...
So what is the best way to find the longest match? An obvious way would be to check all possible substrings of s as mentioned here (Efficiently finding all overlapping matches for a regular expression) by length and pick the first but that is O(n^2). Can we do better?
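For reference, the brute-force baseline mentioned above could look roughly like this. It is only a sketch: the class and method names are made up for illustration, and it tries every region from longest to shortest with matches(), so it costs O(n^2) match attempts on top of whatever each attempt costs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BruteForceLongestMatch {
    // Hypothetical helper: returns the longest non-empty substring of s that
    // the whole regex matches, or null if there is none.
    static String longestMatch(String regex, String s) {
        Matcher matcher = Pattern.compile(regex).matcher(s);
        // Try candidate lengths from longest to shortest.
        for (int len = s.length(); len > 0; len--) {
            for (int start = 0; start + len <= s.length(); start++) {
                // matches() must consume the entire region [start, start + len).
                matcher.region(start, start + len);
                if (matcher.matches()) {
                    return matcher.group();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String regex = "(?<m1>(hello|universe))|(?<m2>(hello world))";
        System.out.println(longestMatch(regex, "hello world")); // hello world
    }
}
```

Because the first hit at a given length is returned, ties go to the leftmost candidate.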

Here is a way of doing it using matcher regions, but with a single loop over the string index:
public static String findLongestMatch(String regex, String s) {
Pattern pattern = Pattern.compile("(" + regex + ")$");
Matcher matcher = pattern.matcher(s);
String longest = null;
int longestLength = -1;
for (int i = s.length(); i > longestLength; i--) {
matcher.region(0, i);
if (matcher.find() && longestLength < matcher.end() - matcher.start()) {
longest = matcher.group();
longestLength = longest.length();
}
}
return longest;
}
I'm forcing the pattern to match until the region's end, and then I move the region's end from the rightmost string index towards the left. For each region's end tried, Java will match the leftmost starting substring that finishes at that region's end, i.e. the longest substring that ends at that place. Finally, it's just a matter of keeping track of the longest match found so far.
As a matter of optimization, and since I start from the longer regions towards the shorter ones, I stop the loop as soon as all regions that would come after are already shorter than the length of longest substring already found.
An advantage of this approach is that it can deal with arbitrary regular expressions and no specific pattern structure is required:
findLongestMatch("(?<m1>(hello|universe))|(?<m2>(hello world))", "hello world")
==> "hello world"
findLongestMatch("hello( universe)?", "hello world")
==> "hello"
findLongestMatch("hello( world)?", "hello world")
==> "hello world"
findLongestMatch("\\w+|\\d+", "12345 abc")
==> "12345"
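The trick above depends on $ matching at the end of the region rather than only at the end of the whole input. That is the default behaviour (anchoring bounds are enabled on a fresh Matcher). A small sketch to see the difference, with a made-up helper name:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegionAnchorDemo {
    // Hypothetical helper: reports whether regex is found inside s[0, end)
    // with anchoring bounds switched on or off.
    static boolean findsInRegion(String regex, String s, int end, boolean anchoringBounds) {
        Matcher m = Pattern.compile(regex).matcher(s);
        m.region(0, end);
        m.useAnchoringBounds(anchoringBounds);
        return m.find();
    }

    public static void main(String[] args) {
        String s = "hello world";
        // With anchoring bounds, $ matches at the region end (index 5).
        System.out.println(findsInRegion("(hello)$", s, 5, true));  // true
        // Without them, $ only matches at the end of the full input.
        System.out.println(findsInRegion("(hello)$", s, 5, false)); // false
    }
}
```

So if a Matcher has had useAnchoringBounds(false) applied elsewhere, the region-based approach would need to turn it back on.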

If you are dealing with just this specific pattern:
There are one or more named groups at the top level, connected by |.
The regex for each group is wrapped in superfluous parentheses.
Inside those parentheses are one or more literals connected by |.
Literals never contain |, ( or ).
Then it is possible to write a solution by extracting the literals, sorting them by length (longest first) and returning the first one that occurs in the input:
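import java.util.*;
import java.util.regex.*;
// matches one top-level named group and captures the literals inside its inner parentheses
private static final Pattern g = Pattern.compile("\\(\\?\\<[^>]+\\>\\(([^)]+)\\)\\)");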
public static final String findLongestMatch(String s, Pattern p) {
Matcher m = g.matcher(p.pattern());
List<String> literals = new ArrayList<>();
while (m.find())
Collections.addAll(literals, m.group(1).split("\\|"));
Collections.sort(literals, new Comparator<String>() {
public int compare(String a, String b) {
return Integer.compare(b.length(), a.length());
}
});
for (Iterator<String> itr = literals.iterator(); itr.hasNext();) {
String literal = itr.next();
if (s.indexOf(literal) >= 0)
return literal;
}
return null;
}
Test:
System.out.println(findLongestMatch(
"hello world",
Pattern.compile("(?<m1>(hello|universe))|(?<m2>(hello world))")
));
// output: hello world
System.out.println(findLongestMatch(
"hello universe",
Pattern.compile("(?<m1>(hello|universe))|(?<m2>(hello world))")
));
// output: universe

Just add $ (end of string) before the | separator.
The engine then checks whether the input ends at that point. If it does, that alternative matches; otherwise that part of the regex is skipped.
The code below gives what you want:
import java.util.regex.*;
public class RegTest{
public static void main(String[] arg){
String regex = "(?<m1>(hello|universe))$|(?<m2>(hello world))";
String s = "hello world";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(s);
while(matcher.find()) {
MatchResult matchResult = matcher.toMatchResult();
String substring = s.substring(matchResult.start(), matchResult.end());
System.out.println(substring);
}
}
}
Likewise, the code below skips hello and hello world, and matches hello world there.
Note the use of $:
import java.util.regex.*;
public class RegTest{
public static void main(String[] arg){
String regex = "(?<m1>(hello|universe))$|(?<m2>(hello world))$|(?<m3>(hello world there))";
String s = "hello world there";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(s);
while(matcher.find()) {
MatchResult matchResult = matcher.toMatchResult();
String substring = s.substring(matchResult.start(), matchResult.end());
System.out.println(substring);
}
}
}
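One thing to watch with the $ approach, as far as I can tell: an alternative that ends in $ can only match at the very end of the input, so a shorter literal that appears earlier is no longer found at all. A quick sketch to check this (the helper name is made up):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DollarAnchorCheck {
    // Hypothetical helper: returns the first match of regex in s, or null.
    static String firstMatch(String regex, String s) {
        Matcher m = Pattern.compile(regex).matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String regex = "(?<m1>(hello|universe))$|(?<m2>(hello world))";
        System.out.println(firstMatch(regex, "hello world")); // hello world
        System.out.println(firstMatch(regex, "say hello"));   // hello
        // "hello" is present but not at the end, and "hello world" is absent:
        System.out.println(firstMatch(regex, "hello there")); // null
    }
}
```

Whether that matters depends on the inputs; it is fine if the text to match always runs to the end of the string.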

If the structure of the regex is always the same, this should work:
String regex = "(?<m1>(hello|universe))|(?<m2>(hello world))";
String s = "hello world";
//split the regex into the different groups
String[] allParts = regex.split("\\|\\(\\?\\<");
for (int i=1; i<allParts.length; i++) {
allParts[i] = "(?<" + allParts[i];
}
//find the longest string
int longestSize = -1;
String longestString = null;
for (int i=0; i<allParts.length; i++) {
Pattern pattern = Pattern.compile(allParts[i]);
Matcher matcher = pattern.matcher(s);
while(matcher.find()) {
MatchResult matchResult = matcher.toMatchResult();
String substring = s.substring(matchResult.start(), matchResult.end());
if (substring.length() > longestSize) {
longestSize = substring.length();
longestString = substring;
}
}
}
System.out.println("Longest: " + longestString);

Related

Java - Regex split decimal, minus, and math operation

I need to split the expression so that h[0] is the first number ("-12.0"), h[1] is the operation symbol (+) and h[2] is the second number (-15.3), but I don't know how to make it work:
String a = "12.0+-15.3";
String[] h = a.split("(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)");
Could somebody help me?
You may use a regex that matches * or / anywhere in the string, and - or + only when they come after a digit. In a math expression, a binary + or - follows a word character, so you can simply require a word boundary on the left: [/*]|\b[-+].
See the regex demo.
Then just split and keep the matches:
public static final Pattern regex = Pattern.compile("[/*]|\\b[-+]");
public static List<String> split(String s, Pattern pattern) {
Matcher m = pattern.matcher(s);
List<String> ret = new ArrayList<String>();
int start = 0;
while (m.find()) {
ret.add(s.substring(start, m.start()));
ret.add(m.group());
start = m.end();
}
// append whatever follows the last operator
if (start < s.length()) {
ret.add(s.substring(start));
}
return ret;
}
Usage example:
String s = "12.0+-15.3*-45.7/+67.9";
List<String> res = split(s, regex);
System.out.println(res);
// => [12.0, +, -15.3, *, -45.7, /, +67.9]
See the Java demo
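An alternative sketch, in case it reads more naturally: instead of splitting on the operators, match the tokens directly. The pattern and names below are my own assumptions, not part of the answer above. A sign is treated as part of a number only when it is not preceded by a digit, and any character that is neither a number nor an operator (spaces, for instance) is silently skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ExpressionTokenizer {
    // Either an optionally signed decimal number that does not directly
    // follow a digit, or a single operator character.
    private static final Pattern TOKEN =
            Pattern.compile("(?<!\\d)[-+]?\\d+(?:\\.\\d+)?|[-+*/]");

    // Hypothetical helper: collects all tokens in order of appearance.
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("12.0+-15.3*-45.7/+67.9"));
        // [12.0, +, -15.3, *, -45.7, /, +67.9]
    }
}
```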

find overlapping regex pattern

I'm using regex to find a pattern
I need to find all matches in this way :
input :"word1_word2_word3_..."
result: "word1_word2", "word2_word3", "word3_word4", ...
It can be done using (?=) positive lookahead.
Regex: (?=(?:_|^)([^_]+_[^_]+))
Java code:
String text = "word1_word2_word3_word4_word5_word6_word7";
String regex = "(?=(?:_|^)([^_]+_[^_]+))";
Matcher matcher = Pattern.compile(regex).matcher(text);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
Output:
word1_word2
word2_word3
word3_word4
...
Code demo
You can do it without regex, using split:
String input = "word1_word2_word3_word4";
String[] words = input.split("_");
List<String> outputs = new LinkedList<>();
for (int i = 0; i < words.length - 1; i++) {
String first = words[i];
String second = words[i + 1];
outputs.add(first + "_" + second);
}
for (String output : outputs) {
System.out.println(output);
}
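The lookahead version also extends to windows of more than two words. A minimal sketch, assuming "_" stays the separator; the method name and the parameter n are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OverlappingWindows {
    // Hypothetical helper: returns every run of n consecutive
    // "_"-separated words, including overlapping runs.
    static List<String> windows(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        // Same idea as above: a zero-width lookahead that captures
        // one word followed by (n - 1) more "_word" pieces.
        String regex = "(?=(?:_|^)([^_]+(?:_[^_]+){" + (n - 1) + "}))";
        Matcher m = Pattern.compile(regex).matcher(text);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(windows("word1_word2_word3_word4", 2));
        // [word1_word2, word2_word3, word3_word4]
        System.out.println(windows("word1_word2_word3_word4", 3));
        // [word1_word2_word3, word2_word3_word4]
    }
}
```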

Remove zeros from beginning

I am trying to remove leading zeros from a value using a regex (with a non-capturing group). Does anyone have an idea?
Matcher matcher = Pattern.compile("(?:[0]+)?(\\S+)").matcher("00100");//.group(0));
//Matcher matcher = pattern.matcher(mydata);
if(matcher.matches()) {
System.out.println("value "+matcher.group(0));
}
str.replaceAll("^0+(?!$)", "")
If you want to remove leading zeroes, you can just use replaceAll:
String input = "00100";
input = input.replaceAll("^0+([^0].*)$", "$1");
Regex101
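The two replaceAll patterns above agree on "00100" but differ on all-zero input, which may matter depending on what you want "000" to become. A small comparison sketch (the method names are made up):

```java
final class LeadingZeros {
    // Strips leading zeros but always leaves the last character in place,
    // so an all-zero string collapses to "0".
    static String stripKeepLastZero(String s) {
        return s.replaceAll("^0+(?!$)", "");
    }

    // Strips leading zeros only when a non-zero character follows,
    // so an all-zero string is returned unchanged.
    static String stripBeforeNonZero(String s) {
        return s.replaceAll("^0+([^0].*)$", "$1");
    }

    public static void main(String[] args) {
        System.out.println(stripKeepLastZero("00100"));  // 100
        System.out.println(stripBeforeNonZero("00100")); // 100
        System.out.println(stripKeepLastZero("000"));    // 0
        System.out.println(stripBeforeNonZero("000"));   // 000
    }
}
```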
I found a solution to my own question:
public static void main(String[] args) {
System.out.println(extractValuesFromRegex("(?:0+|)(\\d+)", "00123")); // 123
System.out.println(extractValuesFromRegex("(?:0+|)(\\d+)", "123")); // 123
System.out.println(extractValuesFromRegex("(\\d+)", "00123")); // 00123
}
public static final String extractValuesFromRegex(String regex, String input) {
String extractevalue = input;
Matcher matcher = Pattern.compile(regex).matcher(input);
if (matcher.matches()) {
extractevalue = matcher.group(1);
}
return extractevalue;
}

How can I split string using Regular Expression(Pattern)?

I want to split my String (e.g. "20150101") using Regular Expression.
For example I need these values: "2015","01","01"
String pattern = "(....)(..)(..)";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(inputString);//inputString:"20150101"
Now, after a successful m.matches() or m.find(), you can use m.group(x) to get the parts of the string. For example:
m.group(1) is the first four characters ("2015" in your question).
Bit hard to say without more details, but try:
(\d{4})(\d{2})(\d{2})
Your Matcher's three captured group references will then have the values you want.
Just combining the two answers; both are valid.
First
public static void main(String[] args) {
String input = "20150101";
String pattern = "(....)(..)(..)";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(input);
m.find();
for(int i=1;i<=m.groupCount();i++){
String token = m.group( i );
System.out.println(token);
}
}
Second
public static void main(String[] args) {
String input = "20150101";
String pattern = "(\\d{4})(\\d{2})(\\d{2})";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(input);
m.find();
for(int i=1;i<=m.groupCount();i++){
String token = m.group( i );
System.out.println(token);
}
}
private static final Pattern DATE_PATTERN = Pattern.compile("^([\\d]{4})([\\d]{2})([\\d]{2})$");
public static Optional<String[]> split(String str) {
final Matcher matcher = DATE_PATTERN.matcher(str);
if (matcher.find()) {
final String[] array = new String[3];
array[0] = matcher.group(1);
array[1] = matcher.group(2);
array[2] = matcher.group(3);
return Optional.of(array);
}
return Optional.empty();
}
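If numbered groups feel fragile, named groups are another option. This is only a sketch: the group names year, month and day and the class name are my own choice, and it uses matches() so the whole input has to be exactly eight digits.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DateParts {
    // Group names are illustrative; any valid group name would do.
    private static final Pattern DATE =
            Pattern.compile("(?<year>\\d{4})(?<month>\\d{2})(?<day>\\d{2})");

    // Hypothetical helper: returns {year, month, day} or empty if s is not yyyyMMdd-shaped.
    static Optional<String[]> split(String s) {
        Matcher m = DATE.matcher(s);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new String[] {
                m.group("year"), m.group("month"), m.group("day")
        });
    }

    public static void main(String[] args) {
        split("20150101").ifPresent(p ->
                System.out.println(String.join(",", p))); // 2015,01,01
    }
}
```

Note that this only checks the shape of the string; it does not validate that the month or day is in range.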

How to limit the number of replacements with Matcher?

Matcher only has two replace methods:
replaceFirst() and replaceAll().
I want to limit the replacement to 3 times. How can I do that?
Example:
String content="aaaaaaaaaa";
The result I want is: "bbbaaaaaaa"
My code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class T1 {
public static void main(String[] args) {
String content="aaaaaaaaaa";
Pattern pattern = Pattern.compile("a");
Matcher m=pattern.matcher(content);
if(m.find()){
String result=m.replaceFirst("b");
System.out.println(result);
}
}
}
thanks :)
On appendReplacement/Tail
You'd have to use appendReplacement and appendTail explicitly. Unfortunately you have to use StringBuffer to do this, at least before Java 9, which added StringBuilder overloads of both methods. Here's a snippet (see also in ideone.com):
String content="aaaaaaaaaa";
Pattern pattern = Pattern.compile("a");
Matcher m = pattern.matcher(content);
StringBuffer sb = new StringBuffer();
final int N = 3;
for (int i = 0; i < N; i++) {
if (m.find()) {
m.appendReplacement(sb, "b");
} else {
break;
}
}
m.appendTail(sb);
System.out.println(sb); // bbbaaaaaaa
See also
StringBuilder and StringBuffer in Java
StringBuffer is synchronized and therefore slower than StringBuilder
BugID 5066679: Matcher should make more use of Appendable
If granted, this request for enhancement would allow Matcher to append to any Appendable
Another example: N times uppercase replacement
Here's another example that shows how appendReplacement/Tail can give you more control over replacement than replaceFirst/replaceAll:
// replaces up to N times with uppercase of matched text
static String replaceUppercase(int N, Matcher m) {
StringBuffer sb = new StringBuffer();
for (int i = 0; i < N; i++) {
if (m.find()) {
m.appendReplacement(
sb,
Matcher.quoteReplacement(m.group().toUpperCase())
);
} else {
break;
}
}
m.appendTail(sb);
return sb.toString();
}
Then we can have (see also on ideone.com):
Pattern p = Pattern.compile("<[^>]*>");
Matcher m = p.matcher("<a> b c <ddd> e <ff> g <$$$> i <jjj>");
System.out.println(replaceUppercase(4, m));
// <A> b c <DDD> e <FF> g <$$$> i <jjj>
//  1        2      3       4
The pattern <[^>]*> is just a simple example pattern that matches "<tags like this>".
Note that Matcher.quoteReplacement is necessary in this particular case, or else appending "<$$$>" as replacement would trigger IllegalArgumentException about an illegal group reference (because $ unescaped in replacement string is a backreference sigil).
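The two snippets above can be folded into one helper that takes the replacement as a function. This is a sketch under my own naming (replaceFirstN and its parameters are not a JDK API). It always quotes the computed replacement, so $ and \ in the result are treated literally. It sticks to StringBuffer so that it also compiles on older Java versions.

```java
import java.util.function.Function;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LimitedReplace {
    // Hypothetical helper: replaces at most n matches of p in input,
    // computing each replacement from the match itself.
    static String replaceFirstN(Pattern p, CharSequence input, int n,
                                Function<MatchResult, String> replacer) {
        Matcher m = p.matcher(input);
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < n && m.find(); i++) {
            // quoteReplacement keeps "$" and "\" in the result literal.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacer.apply(m)));
        }
        // Copy everything after the last replaced match unchanged.
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceFirstN(Pattern.compile("a"), "aaaaaaaaaa", 3, r -> "b"));
        // bbbaaaaaaa
    }
}
```

Because of the quoting, group references such as $1 in the returned string are not expanded; read the group from the MatchResult inside the function instead.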
On replaceFirst and replaceAll
Attached is the java.util.regex.Matcher code for replaceFirst and replaceAll (version 1.64 06/04/07). Note that it's done using essentially the same appendReplacement/Tail logic:
// Excerpt from @(#)Matcher.java 1.64 06/04/07
public String replaceFirst(String replacement) {
if (replacement == null)
throw new NullPointerException("replacement");
StringBuffer sb = new StringBuffer();
reset(); // !!!!
if (find())
appendReplacement(sb, replacement);
appendTail(sb);
return sb.toString();
}
public String replaceAll(String replacement) {
reset(); // !!!!
boolean result = find();
if (result) {
StringBuffer sb = new StringBuffer();
do {
appendReplacement(sb, replacement);
result = find();
} while (result);
appendTail(sb);
return sb.toString();
}
return text.toString();
}
Note that the Matcher is reset() prior to any replaceFirst/All. Thus, simply calling replaceFirst 3 times would always get you the same result (see also on ideone.com):
String content="aaaaaaaaaa";
Pattern pattern = Pattern.compile("a");
Matcher m = pattern.matcher(content);
String result;
result = m.replaceFirst("b"); // once!
result = m.replaceFirst("b"); // twice!
result = m.replaceFirst("b"); // one more for "good" measure!
System.out.println(result);
// baaaaaaaaa
// i.e. THIS DOES NOT WORK!!!
See also
java.util.regex.Matcher source code, OpenJDK version
I think you can use StringUtils from Apache Commons Lang. Note that it replaces literal substrings, not regex matches:
org.apache.commons.lang3.StringUtils.replace(content,"a","b",3);
