Extract substring where startString and endString are same - java

I want to extract the sentence or word that lies between a start string and an end string when the two are the same. For example:
String originalString = "this is an example to extract sentence between is";
Here the start string and the end string are both "is".
So the final output should be: an example to extract sentence between
I tried the code below, but it returned only "is":
String originalString = "this is an example to extract sentence between is";
String startEndString = "is";
int startIndex = originalString.indexOf(startEndString);
int endIndex = originalString.indexOf(startEndString, startIndex + startEndString.length());
String substring = originalString.substring(startIndex, endIndex);
System.out.println(substring);
I also checked the substring methods of org.apache.commons.lang.StringUtils, but could not find one that performs this kind of extraction. Is there a Java 8 API or StringUtils method that already does this?

You can obtain the desired result using regex:
String originalString = "this is an example to extract sentence between is";
Pattern p = Pattern.compile( "(?<=(?:\\bis\\b))(.*)(?=(?:\\bis\\b))" );
Matcher m = p.matcher( originalString );
if ( m.find() ) {
System.out.println(m.group(1));
}
(?<=(?:\\bis\\b)) is a positive lookbehind containing a non-capturing group. The word you're looking for is placed between two word-boundary anchors (\b), which ensure that only whole words match (so the "is" inside 'this' is skipped). (?=(?:\\bis\\b)) is a positive lookahead. Because (.*) is greedy, group 1 captures everything between the first and the last whole-word "is", including the surrounding spaces; call trim() on the result if you don't want them.
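The same lookaround idea can be wrapped in a small helper that works for any marker word. This is only a sketch: the class name BetweenMarkers and the method between are hypothetical, and it assumes the marker starts and ends with a word character (otherwise \b will not behave as intended). Pattern.quote keeps any regex metacharacters in the marker literal.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BetweenMarkers {

    // Hypothetical helper (not part of the answer above): returns the trimmed text
    // between the first and the last whole-word occurrence of `marker`.
    static Optional<String> between(String text, String marker) {
        // Assumption: marker begins and ends with a word character, so \b applies.
        String word = "\\b" + Pattern.quote(marker) + "\\b";
        Pattern p = Pattern.compile("(?<=" + word + ")(.*)(?=" + word + ")");
        Matcher m = p.matcher(text);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String s = "this is an example to extract sentence between is";
        // prints: an example to extract sentence between
        System.out.println(between(s, "is").orElse("<no match>"));
    }
}
```

Returning an Optional makes the "marker occurs fewer than two times" case explicit instead of silently printing nothing.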

Related

How to get a sub string from string using word starts with some text

I am trying to read a word from a string based on its starting text. For example:
String str = " Ads : 234983 ABCD2987423 availabe";
In the above example, the value changes every time, so I want to read the word that starts with the text "ABCD", i.e. ABCD2987423.
Note: Sometimes extra text is added to the string, so a fixed index won't work in this example.
You could use String#replaceAll for a one-liner using regex:
String str = " Ads : 234983 ABCD2987423 available";
String text = str.replaceAll(".*\\b(ABCD\\w*)\\b.*", "$1");
System.out.println(text);
This prints:
ABCD2987423
In case you plan to use this frequently, you could use a precompiled Pattern:
public final class AbcdExtractor implements Function<String, String> {
private final Pattern pattern = Pattern.compile("(?<abcd>ABCD\\d+)");
public String apply(String str) {
Matcher matcher = pattern.matcher(str);
return matcher.find() ? matcher.group("abcd") : null;
}
}
Then your code could look like this:
AbcdExtractor extractor = new AbcdExtractor();
System.out.println(extractor.apply("Ads : 234983 ABCD2987423 availabe")); // ABCD2987423
System.out.println(extractor.apply("This is another string with ABCD2987423 and with suffix")); // ABCD2987423
System.out.println(extractor.apply("This is string without data")); // null
Demo: https://regex101.com
If the word starting with "ABCD" must be preceded by a whitespace char or the start of the string, and followed by a whitespace char or the end of the string, you could use lookarounds.
(?<!\S)ABCD\w*(?!\S)
(?<!\S) Negative lookbehind, assert that what is directly on the left is not a non-whitespace char
ABCD Match literally
\w* Match a word char 0 or more times
(?!\S) Negative lookahead, assert that what is directly on the right is not a non-whitespace char
For example
String regex = "(?<!\\S)ABCD\\w*(?!\\S)";
String string = "Ads : 234983 ABCD2987423 availabe\n"
+ "Ads : 234983 #ABCD2987423 availabe\n"
+ "Ads : 234983 ABCD2987423# availabe\n"
+ "Ads : 234983 #ABCD2987423# availabe";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println(matcher.group(0));
}
Only the first line contains a match, so the output is:
ABCD2987423
An alternative solution. Using streams but no regexes.
Split the string into an array of words, then pick the first word that starts with ABCD.
String[] words = " Ads : 234983 ABCD2987423 available".split(" ");
String foundWord = Stream.of(words).filter(w -> w.startsWith("ABCD")).findFirst().get();
Note: If no word in the input starts with "ABCD", findFirst returns an empty Optional, and calling get() on it throws NoSuchElementException.
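To avoid that exception, the Optional can be handed back to the caller instead of being unwrapped with get(). A minimal sketch, assuming words are separated by whitespace; the class and method names are made up for illustration:

```java
import java.util.Arrays;
import java.util.Optional;

public class FirstAbcdWord {

    // Hypothetical helper: first whitespace-separated word that starts with `prefix`.
    static Optional<String> firstWordStartingWith(String text, String prefix) {
        return Arrays.stream(text.trim().split("\\s+")) // split on runs of whitespace
                .filter(w -> w.startsWith(prefix))
                .findFirst();                           // empty Optional if nothing matches
    }

    public static void main(String[] args) {
        String str = " Ads : 234983 ABCD2987423 available";
        // prints: ABCD2987423
        System.out.println(firstWordStartingWith(str, "ABCD").orElse("<none>"));
        // prints: <none>
        System.out.println(firstWordStartingWith("This is string without data", "ABCD").orElse("<none>"));
    }
}
```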

Regex replace space and word to toFirstUpper of word

I was trying to use regex to change the following string
String input = "Creation of book orders";
to
String output = "CreationOfBookOrders";
I tried the following, expecting it to replace each space and the word character after it with just that character:
input.replaceAll("\\s\\w", "(\\w)");
input.replaceAll("\\s\\w", "\\w");
but this replaces the space and the character with the literal character 'w' instead.
I am not in a position to use WordUtils, StringUtils or similar utility classes. Otherwise I could have applied WordUtils.capitalize (or a similar method) and replaced all spaces with the empty string.
How else (preferably using regex) can I get the above output from the input?
I don't think you can do that with String.replaceAll. The only modifications that you can make in the replacement string are to interpolate groups matched by the regex.
The javadoc for Matcher.replaceAll explains how the replacement string is handled.
You will need to use a loop. Here's a simple version:
StringBuilder sb = new StringBuilder(input);
Pattern pattern = Pattern.compile("\\s\\w");
// Match against the builder itself so the indexes stay in sync with the edits
Matcher matcher = pattern.matcher(sb);
int pos = 0;
while (matcher.find(pos)) {
String replacement = matcher.group().substring(1).toUpperCase();
pos = matcher.start();
sb.replace(pos, pos + 2, replacement);
pos += 1;
}
String output = sb.toString();
(This could be done more efficiently, but it is complicated.)
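One way to avoid the manual index bookkeeping is the standard Matcher.appendReplacement / appendTail pair, which lets the replacement be computed per match. The sketch below is an alternative to the loop above, not the answerer's code: the class and method names are hypothetical, and the pattern \s+(\w) is an assumption that also collapses runs of several whitespace characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CamelJoin {

    // Whitespace followed by one word character; the character is captured in group 1.
    private static final Pattern SPACE_THEN_WORD_CHAR = Pattern.compile("\\s+(\\w)");

    // Hypothetical helper: removes the whitespace and upper-cases the character after it.
    static String joinCapitalized(String input) {
        Matcher m = SPACE_THEN_WORD_CHAR.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // quoteReplacement guards against '$' or '\' being treated specially
            m.appendReplacement(sb, Matcher.quoteReplacement(m.group(1).toUpperCase()));
        }
        m.appendTail(sb); // copy whatever follows the last match
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: CreationOfBookOrders
        System.out.println(joinCapitalized("Creation of book orders"));
    }
}
```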

Get the last index of a letter followed by numeric

I'm trying to parse a URL and I'd like to test for the last index of a couple characters followed by a numeric value.
Example
used-cell-phone-albany-m3359_l12201
I'm trying to determine if the last "-m" is followed by a numeric value.
So something like this, "used-cell-phone-albany-m3359_l12201".contains("m" followed by numeric)
I'm assuming it needs to be done with regular expressions, but I'm not for sure.
You could use a pattern like [a-z]\\d, which finds any digit that directly follows a letter a-z; you can add other characters to the character class if you wish...
Pattern pattern = Pattern.compile("[a-z]\\d", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher("used-cell-phone-albany-m3359_l12201");
while (matcher.find()) {
int startIndex = matcher.start();
int endIndex = matcher.end();
String match = matcher.group();
System.out.println(startIndex + "-" + endIndex + " = " + match);
}
The problem is that your test String actually contains two matches, m3 and l1.
The above example will display
23-25 = m3
29-31 = l1
Updated with feedback
If you can guarantee the marker (i.e. -m), then it becomes a lot simpler...
Pattern pattern = Pattern.compile("-m\\d", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher("used-cell-phone-albany-m3359_l12201");
if (matcher.find()) {
int startIndex = matcher.start();
int endIndex = matcher.end();
String match = matcher.group();
System.out.println(startIndex + "-" + endIndex + " = " + match);
}
In Java, convert the URL to a String if necessary and then run
URLString.matches("^.*m[0-9]+$").
Only if that returns true does the URL end with "m" followed by a number. That can be refined with a more precise ending pattern. This regex tests the pattern at the end of the string because $ in a regex matches the end of the string; "[0-9]+" matches a sequence of one or more digits; "^" matches the beginning of the string; and ".*" matches zero or more arbitrary characters (anything except line terminators), including white space, letters, numbers and punctuation marks.
To determine whether the last "m" is followed by a digit, use
URLString.matches("^.*m[0-9][^m]*$")
Here ".*" greedily matches everything up to an "m" that is followed by a digit, and "[^m]*" ensures that no further "m" appears after it. (A reluctant ".+?" would instead stop at the first "m" followed by a digit.)
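If a regex feels heavy for this check, the same question ("is the last marker followed by a digit?") can be answered with lastIndexOf. This is a sketch with a hypothetical class and method name; it assumes the marker is the literal "-m" from the question.

```java
public class LastMarkerCheck {

    // Hypothetical helper: true if the last occurrence of `marker` in `s`
    // is immediately followed by a digit.
    static boolean lastMarkerFollowedByDigit(String s, String marker) {
        int idx = s.lastIndexOf(marker);   // -1 if the marker never occurs
        int next = idx + marker.length();  // index of the character right after the marker
        // Character.isDigit also accepts non-ASCII Unicode digits
        return idx >= 0 && next < s.length() && Character.isDigit(s.charAt(next));
    }

    public static void main(String[] args) {
        // prints: true
        System.out.println(lastMarkerFollowedByDigit("used-cell-phone-albany-m3359_l12201", "-m"));
        // prints: false (the last "-m" is followed by 'x')
        System.out.println(lastMarkerFollowedByDigit("used-cell-m12-phone-mx", "-m"));
    }
}
```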

Extract every complete word that contains a certain substring

I'm trying to write a function that extracts each word from a sentence that contains a certain substring e.g. Looking for 'Po' in 'Porky Pork Chop' will return Porky Pork.
I've tested my regex on regexpal but the Java code doesn't seem to work. What am I doing wrong?
private static String foo()
{
String searchTerm = "Pizza";
String text = "Cheese Pizza";
String sPattern = "(?i)\b("+searchTerm+"(.+?)?)\b";
Pattern pattern = Pattern.compile ( sPattern );
Matcher matcher = pattern.matcher ( text );
if(matcher.find ())
{
String result = "-";
for(int i=0;i < matcher.groupCount ();i++)
{
result+= matcher.group ( i ) + " ";
}
return result.trim ();
}else
{
System.out.println("No Luck");
}
}
In Java, to pass the \b word boundary to the regex engine you need to write it as \\b in a string literal. A bare \b in a String literal represents a backspace character.
Judging by your example, you want to return all words that contain your substring. To do this, don't use for(int i=0;i < matcher.groupCount ();i++) but while(matcher.find()), since the group loop iterates over the groups of a single match, not over all matches.
In case your search term can contain special characters, you should probably use Pattern.quote(searchTerm).
In your code you are trying to find "Pizza" in "Cheese Pizza", so I assume that you also want to find words that are identical to the searched substring. Although your regex will work fine for that, you can change its last part, (.+?)?, to \\w* and also add \\w* at the start if the substring should also be matched in the middle of a word (not only at its start).
So your code can look like this:
private static String foo() {
String searchTerm = "Pizza";
String text = "Cheese Pizza, Other Pizzas";
String sPattern = "(?i)\\b\\w*" + Pattern.quote(searchTerm) + "\\w*\\b";
StringBuilder result = new StringBuilder("-").append(searchTerm).append(": ");
Pattern pattern = Pattern.compile(sPattern);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
result.append(matcher.group()).append(' ');
}
return result.toString().trim();
}
While the regex approach is certainly a valid method, I find it easier to think through when you split the words up by whitespace. This can be done with String's split method.
public List<String> doIt(final String inputString, final String term) {
final List<String> output = new ArrayList<String>();
final String[] parts = inputString.split("\\s+");
for(final String part : parts) {
// >= 0 so that a term at the very start of a word (index 0) also counts
if(part.indexOf(term) >= 0) {
output.add(part);
}
}
return output;
}
Of course, it is worth noting that this effectively makes two passes through your input String: the first to find the whitespace characters to split on, and the second to look through each split word for your substring.
If one pass is necessary though, the regex path is better.
I find nicholas.hauschild's answer to be the best.
However if you really wanted to use regex, you could do it as such:
String searchTerm = "Pizza";
String text = "Cheese Pizza";
Pattern pattern = Pattern.compile("\\b" + Pattern.quote(searchTerm)
+ "\\b", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
System.out.println(matcher.group());
}
Output:
Pizza
The pattern should have been
String sPattern = "(?i)\\b("+searchTerm+"(?:.+?)?)\\b";
You want to capture the whole string (Pizza). The ?: ensures you don't capture a part of the string twice.
Try this pattern:
String searchTerm = "Po";
String text = "Porky Pork Chop oPod zzz llPo";
// Longest alternative first, so that a word like "oPod" is matched in full
Pattern p = Pattern.compile("\\p{Alpha}+" + searchTerm + "\\p{Alpha}+|\\p{Alpha}+" + searchTerm + "|" + searchTerm + "\\p{Alpha}+");
Matcher m = p.matcher(text);
while(m.find()) {
System.out.println(">> " + m.group());
}
OK, here is a pattern in raw form (not as a Java string literal; you must double the backslashes yourself):
(?i)\b[a-z]*po[a-z]*\b
And that's all.

Why isn't this lookahead assertion working in Java?

I come from a Perl background and am used to doing something like the following to match leading digits in a string and perform an in-place increment by one:
my $string = '0_Beginning';
$string =~ s|^(\d+)(?=_.*)|$1+1|e;
print $string; # '1_Beginning'
With my limited knowledge of Java, things aren't so succinct:
String string = "0_Beginning";
Pattern p = Pattern.compile( "^(\\d+)(?=_.*)" );
String digit = string.replaceFirst( p.toString(), "$1" ); // To get the digit
Integer oneMore = Integer.parseInt( digit ) + 1; // Evaluate ++digit
string.replaceFirst( p.toString(), oneMore.toString() ); //
The regex doesn't match here... but it did in Perl.
What am I doing wrong here?
Actually it matches. You can find out by printing
System.out.println(p.matcher(string).find());
The issue is with line
String digit = string.replaceFirst( p.toString(), "$1" );
which is effectively a no-op, because it replaces the match (which consists only of the first group, since the lookahead is not part of the match) with the content of the first group.
You can get the desired result (namely the digit) via the following code
Matcher m = p.matcher(string);
String digit = m.find() ? m.group(1) : "";
Note: you should check the result of m.find() in any case. If nothing matches, you must not call parseInt, because it would throw an error. Thus the full code looks something like this:
Pattern p = Pattern.compile("^(\\d+)(?=_.*)");
String string = "0_Beginning";
Matcher m = p.matcher(string);
if (m.find()) {
String digit = m.group(1);
Integer oneMore = Integer.parseInt(digit) + 1;
string = m.replaceAll(oneMore.toString());
System.out.println(string);
} else {
System.out.println("No match");
}
Let's see what you are doing here.
String string = "0_Beginning";
Pattern p = Pattern.compile( "^(\\d+)(?=_.*)" );
You declare and initialize String and pattern objects.
String digit = string.replaceFirst( p.toString(), "$1" ); // To get the digit
(You are converting the pattern back into a string, and replaceFirst creates a new Pattern from this. Is this intentional?)
As Howard says, this replaces the first match of the pattern in the string with the contents of the first group, and the match here is just 0, which is also the first group. Thus digit is equal to string, ...
Integer oneMore = Integer.parseInt( digit ) + 1; // Evaluate ++digit
... and your parsing fails here.
string.replaceFirst( p.toString(), oneMore.toString() ); //
This would work if you assigned the result back to string (but it converts the pattern to a string and back to a pattern again).
Here is how I would do this:
String string = "0_Beginning";
Pattern p = Pattern.compile( "^(\\d+)(?=_.*)" );
Matcher m = p.matcher(string);
StringBuffer result = new StringBuffer();
while(m.find()) {
int number = Integer.parseInt(m.group());
m.appendReplacement(result, String.valueOf(number + 1));
}
m.appendTail(result);
return result.toString(); // 1_Beginning
(Of course, for your regex the loop will only execute once, since the regex is anchored.)
Edit: To clarify my statement about string.replaceFirst:
This method does not return a pattern, but uses one internally. From the documentation:
Replaces the first substring of this string that matches the given regular expression with the given replacement.
An invocation of this method of the form str.replaceFirst(regex, repl) yields exactly the same result as the expression
Pattern.compile(regex).matcher(str).replaceFirst(repl)
Here we see that a new pattern is compiled from the first argument.
This also shows us another way to do what you wanted to do:
String string = "0_Beginning";
Pattern p = Pattern.compile( "^(\\d+)(?=_.*)" );
Matcher m = p.matcher(string);
if(m.find()) {
String digit = m.group();
int oneMore = Integer.parseInt( digit ) + 1;
return m.replaceFirst(String.valueOf(oneMore));
}
This only compiles the pattern once, instead of thrice like in your original program - but still does the matching twice (once for find, once for replaceFirst), instead of once like in my program.
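For completeness, here is a self-contained sketch that matches only once and rebuilds the string from the match position. It is not taken from either answer: the class and method names are hypothetical, the lookahead is shortened to (?=_), and inputs without leading digits are returned unchanged by assumption. Note that leading zeros are not preserved ("007_x" becomes "8_x") and that very long digit runs would overflow Long.parseLong.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LeadingNumber {

    // Leading digits that are directly followed by an underscore.
    private static final Pattern LEADING_DIGITS = Pattern.compile("^(\\d+)(?=_)");

    // Hypothetical helper: increments the leading number, or returns s unchanged.
    static String increment(String s) {
        Matcher m = LEADING_DIGITS.matcher(s);
        if (!m.find()) {
            return s; // assumption: no leading "<digits>_" means nothing to do
        }
        long next = Long.parseLong(m.group(1)) + 1;
        // m.end() is the index just past the digits; the lookahead consumed nothing
        return next + s.substring(m.end());
    }

    public static void main(String[] args) {
        System.out.println(increment("0_Beginning")); // prints: 1_Beginning
        System.out.println(increment("9_Middle"));    // prints: 10_Middle
        System.out.println(increment("Beginning"));   // prints: Beginning
    }
}
```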
