How to get indexOf multiple delimiters? - java

I am looking for an elegant way to find the first appearance of one of a set of delimiters.
For example, let's assume my delimiter set is composed of {";",")","/"}.
If my String is
"aaa/bbb;ccc)"
I would like to get the result 3 (the index of the "/", since it is the first to appear).
If my String is
"aa;bbbb/"
I would like to get the result 2 (the index of the ";", since it is the first to appear).
and so on.
If the String does not contain any delimiter, I would like to return -1.
I know I can do it by first finding the index of each delimiter, then calculating the minimum of the indices, disregarding the -1's. This code becomes very cumbersome. I am looking for a shorter and more generic way.

With a regex, it could be done like this:
String s = "aa;bbbb/";
Matcher m = Pattern.compile("[;/)]").matcher(s); // [;/)] matches a semicolon, a forward slash, or a closing parenthesis.
if(m.find()) // find() stops at the first match; we use `if` rather than a `while` loop, so only that first match is examined.
{
System.out.println(m.start()); // print the index where the match starts.
}
else
{
System.out.println("-1"); // else print -1
}

Check each character of the input string against the list of delimiters. As soon as one is found, print its index.
You can also use a Set to store the delimiters.
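The scan described above might be sketched as follows. This is only an illustration: the class name FirstDelimiterIndex and the method name indexOfAny are made up for this example, it handles single-character delimiters only, and Set.of assumes Java 9 or later.

```java
import java.util.Set;

final class FirstDelimiterIndex {
    // Hypothetical helper: returns the index of the first character of `text`
    // that is contained in `delimiters`, or -1 if no such character exists.
    static int indexOfAny(String text, Set<Character> delimiters) {
        for (int i = 0; i < text.length(); i++) {
            if (delimiters.contains(text.charAt(i))) {
                return i; // first delimiter found, stop scanning
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Set<Character> delimiters = Set.of(';', ')', '/');
        System.out.println(indexOfAny("aaa/bbb;ccc)", delimiters)); // 3
        System.out.println(indexOfAny("aa;bbbb/", delimiters));     // 2
        System.out.println(indexOfAny("abc", delimiters));          // -1
    }
}
```

The string is traversed once and each lookup in the Set is constant time on average, so the cost does not grow with the number of delimiters.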

The program below gives the result. It is done using a regex.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class FindIndexUsingRegex {
/**
* @param args
*/
public static void main(String[] args) {
// TODO Auto-generated method stub
findMatches("aaa/bbb;ccc\\)",";|,|\\)|/");
}
public static void findMatches(String source, String regex) {
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(source);
while (matcher.find()) {
System.out.print("First index: " + matcher.start()+"\n");
System.out.print("Last index: " + matcher.end()+"\n");
System.out.println("Delimiter: " + matcher.group()+"\n");
break;
}
}
}
Output:
First index: 3
Last index: 4
Delimiter: /

Related

Find the longest section of repeating characters

Imagine a string like this:
#*****~~~~~~**************~~~~~~~~***************************#
I am looking for an elegant way to find the indices of the longest contiguous section that contains a specific character. Let's assume we are searching for the * character; then I expect the method to return the start and end index of the last long section of *.
I am looking for the elegant way, I know I could just bruteforce this by checking something like
indexOf(*)
lastIndexOf(*)
//Check if in between the indices is something else if so, remember length start from new
//substring and repeat until lastIndex reached
//Return saved indices
This brute-force approach is ugly. Is there a more elegant way of doing this? I thought about regular expression groups and comparing their lengths. But how do I get the indices with that?
Regex-based solution
If you don't want to hardcode a specific character like * and would rather "find the longest section of repeating characters", as the title of the question states, then the proper regular expression for a section of repeated characters would be:
"(.)\\1*"
Here (.) is a group that consists of a single character, and \\1 is a backreference that refers to that group. * is a greedy quantifier, which means that the preceding backreference can be repeated zero or more times.
As a result, "(.)\\1*" captures a sequence of consecutive identical characters.
To use it, we need to compile the regex into a Pattern. This action has a cost, hence if the regex is used multiple times it is wise to declare a constant:
public static final Pattern REPEATED_CHARACTER_SECTION =
Pattern.compile("(.)\\1*");
Using features of modern Java, the longest sequence that matches the above pattern can be found with a single statement.
Since Java 9 we have the method Matcher.results(), which returns a stream of MatchResult objects, each describing one match.
MatchResult.start() and MatchResult.end() give access to the start and end indices of the match. To extract the matched text itself, we invoke MatchResult.group().
This is how an implementation might look:
public static void printLongestRepeatedSection(String str) {
String longestSection = REPEATED_CHARACTER_SECTION.matcher(str).results() // Stream<MatchResult>
.map(MatchResult::group) // Stream<String>
.max(Comparator.comparingInt(String::length)) // find the longest string in the stream
.orElse(""); // or orElseThrow() if you don't want to allow an empty string to be received as an input
System.out.println("Longest section:\t" + longestSection);
}
A variant that keeps the MatchResult, so that the start and end indices are available as well:
public static void printLongestRepeatedSection(String str) {
MatchResult longestSection = REPEATED_CHARACTER_SECTION.matcher(str).results() // Stream<MatchResult>
.max(Comparator.comparingInt(m -> m.group().length())) // find the longest match in the stream
.orElseThrow(); // throws an exception if an empty string was received as input
System.out.println("Section start: " + longestSection.start());
System.out.println("Section end: " + longestSection.end());
System.out.println("Longest section: " + longestSection.group());
}
Output:
Section start: 34
Section end: 61
Longest section: ***************************
Links:
Official tutorials on Lambda expressions and Stream API provided by Oracle
A quick tutorial on Regular expressions
Simple and Performant Iterative solution
You can do it without regular expressions by manually iterating over the indices of the given string and checking if the previous character matches the current one.
You just need to maintain a couple of variables denoting the start and the end of the longest previously encountered section, and a variable to store the starting index of the section that is being currently examined.
That's how it might be implemented:
public static void printLongestRepeatedSection(String str) {
if (str.isEmpty()) throw new IllegalArgumentException();
int maxStart = 0;
int maxEnd = 1;
int curStart = 0;
for (int i = 1; i < str.length(); i++) {
if (str.charAt(i) != str.charAt(i - 1)) { // current and previous characters are not equal
if (maxEnd - maxStart < i - curStart) { // current repeated section is longer than the maximum section discovered previously
maxStart = curStart;
maxEnd = i;
}
curStart = i;
}
}
if (str.length() - curStart > maxEnd - maxStart) { // checking the very last section
maxStart = curStart;
maxEnd = str.length();
}
System.out.println("Section start: " + maxStart);
System.out.println("Section end: " + maxEnd);
System.out.println("Section: " + str.substring(maxStart, maxEnd));
}
main()
public static void main(String[] args) {
String source = "#*****~~~~~~**************~~~~~~~~***************************#";
printLongestRepeatedSection(source);
}
Output:
Section start: 34
Section end: 61
Section: ***************************
Use methods of class Matcher.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Solution {
public static void main(String args[]) {
String str = "#*****~~~~~~**************~~~~~~~~***************************#";
Pattern pattern = Pattern.compile("\\*+");
Matcher matcher = pattern.matcher(str);
int max = 0;
while (matcher.find()) {
int length = matcher.end() - matcher.start();
if (length > max) {
max = length;
}
}
System.out.println(max);
}
}
The regular expression searches for occurrences of one or more asterisk (*) characters.
Method end returns the index of the first character after the last character matched and method start returns the index of the first character matched. Hence the length is simply the value returned by method end minus the value returned by method start.
Each subsequent call to method find starts searching from the end of the previous match.
The only thing left is to get the longest string of asterisks.
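Since the snippet above prints only the length, a variant that also remembers where the longest run starts and ends might look like the sketch below. The class name LongestAsteriskRun and the method name longestRun are invented for this illustration; the end index is exclusive, as with Matcher.end().

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LongestAsteriskRun {
    // Hypothetical helper: returns {start, end} of the longest run of '*'
    // (end exclusive), or {-1, -1} if the string contains no asterisk.
    static int[] longestRun(String str) {
        Matcher matcher = Pattern.compile("\\*+").matcher(str);
        int bestStart = -1;
        int bestEnd = -1;
        while (matcher.find()) {
            // A strict comparison keeps the first run when two runs tie in length.
            if (matcher.end() - matcher.start() > bestEnd - bestStart) {
                bestStart = matcher.start();
                bestEnd = matcher.end();
            }
        }
        return new int[] {bestStart, bestEnd};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(longestRun("#**~~***#"))); // [5, 8]
        System.out.println(Arrays.toString(longestRun("abc")));       // [-1, -1]
    }
}
```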
A regex-based solution can use some features of the Stream API to get an array with the indexes of the longest sequence of a given character:
- Pattern.quote should be used to safely wrap the input character for the search within a regular expression
- the Stream<MatchResult> returned by Matcher::results provides the necessary information about the start and end of each match
- Stream::max selects the longest matching sequence
- Optional::map and Optional::orElseGet help convert the match into the desired array of indexes
public static int[] getIndexes(char c, String str) {
return Pattern.compile(Pattern.quote(Character.toString(c)) + "+")
.matcher(str)
.results()
.max(Comparator.comparing(mr -> mr.end() - mr.start()))
.map(mr -> new int[]{mr.start(), mr.end()})
.orElseGet(() -> new int[]{-1, -1});
}
// Test:
System.out.println(Arrays.toString(getIndexes('*', "#*****~~~~~~**************~~~~~~~~***************************#")));
// -> [34, 61]

Split String using multiple delimiters in one step

My question is about splitting a string first on one criterion and then splitting the remaining part of the string on another criterion. I want to split the email address below into 3 parts in Java:
String email = "blah.blah_blah#mail.com";
// After splitting I want 3 separate strings (can be an array or accessed via an Iterable)
string1.equals("blah.blah_blah");
string2.equals("mail");
string3.equals("com");
I know I can first split it in two on # and then split the second string on ., but is there any way of doing this in one step? I don't mind using either the String#split method or a regex with Pattern and Matcher.
Use this regex in your split:
#|[.](?!.*[#.])
It will split at a # or at the very last . after the # (the one before "com"). Tested on Regex101.
Use it like this:
String[] emailParts = email.split("#|[.](?!.*[#.])");
Then emailParts will be an array of the 3 strings that you want, in order.
As a bonus, if you want it to split at every dot after the # (including the ones between subdomains), then remove the . from the character class at the end of the regex. It will become #|[.](?!.*#)
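To see the difference between the two variants, a small sketch along these lines could be used. The class and method names are made up for this illustration, and the second input with a subdomain is borrowed from another answer on this page.

```java
import java.util.Arrays;

final class EmailSplitDemo {
    // Splits at '#' and at the last dot only (a dot not followed by any '#' or '.').
    static String[] splitAtLastDot(String email) {
        return email.split("#|[.](?!.*[#.])");
    }

    // Splits at '#' and at every dot that has no '#' after it.
    static String[] splitAtEveryDotAfterHash(String email) {
        return email.split("#|[.](?!.*#)");
    }

    public static void main(String[] args) {
        String email = "blah.blah_blah#mail.mail2.com";
        System.out.println(Arrays.toString(splitAtLastDot(email)));
        // [blah.blah_blah, mail.mail2, com]
        System.out.println(Arrays.toString(splitAtEveryDotAfterHash(email)));
        // [blah.blah_blah, mail, mail2, com]
    }
}
```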
You can use this regex:
([^#]*)#([^#]*)\.([^#\.]*)
Here is the example Java code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class JavaRegex
{
public static void main(String args[])
{
// String to be scanned to find the pattern.
String line = "blah.blah_blah#mail.mail2.com";
String pattern = "([^#]*)#([^#]*)\\.([^#\\.]*)";
// Create a Pattern object
Pattern r = Pattern.compile(pattern);
// Now create matcher object.
Matcher m = r.matcher(line);
if (m.find())
{
System.out.println("Found value: " + m.group(1));
System.out.println("Found value: " + m.group(2));
System.out.println("Found value: " + m.group(3));
} else
{
System.out.println("NO MATCH");
}
}
}
Thanks to Pshemo for pointing out that look-aheads were unnecessary.
You seem to want to split on
- #
or
- any dot that is after # (in other words has # somewhere before it).
If that is the case, you can use email.split("#|(?<=#.{0,1000})[.]"); which returns a String[] array containing the separated tokens.
I used .{0,1000} instead of .* because in Java a look-behind needs an obvious maximum length, which rules out the * quantifier. But assuming that # and . will not be separated by more than 1000 characters, we can use {0,1000} instead.
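A minimal sketch of that look-behind split, wrapped in a hypothetical class SplitAfterHashDemo so it can be run directly. The input with a subdomain is an assumption added to show that every dot after the # is a split point while dots before it are left alone.

```java
import java.util.Arrays;

final class SplitAfterHashDemo {
    // Splits at '#' and at any dot that has a '#' at most 1000 characters before it.
    static String[] splitParts(String email) {
        return email.split("#|(?<=#.{0,1000})[.]");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitParts("blah.blah_blah#mail.com")));
        // [blah.blah_blah, mail, com]
        System.out.println(Arrays.toString(splitParts("blah.blah_blah#mail.mail2.com")));
        // [blah.blah_blah, mail, mail2, com]
    }
}
```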
String str = "blah.blah_blah#mail.com";
String delimiter = "#";
// Split into the part before the # and the host part after it.
String[] tempMailSplitted = str.split(delimiter);
System.out.println(tempMailSplitted[1]); // mail.com
// split() takes a regex, so the dot must be escaped to be matched literally.
String hostMailDelimiter = "\\.";
String[] tempHostSplitted = tempMailSplitted[1].split(hostMailDelimiter);
You can also do it with a single regex, if you prefer.

Multiple matches with delimiter

This is my regex:
([+-]*)(\\d+)\\s*([a-zA-Z]+)
group no.1 = sign
group no.2 = multiplier
group no.3 = time unit
The thing is, I would like to match the given input, but it can be "chained". So my input should be valid if and only if the whole pattern repeats without anything between the occurrences except whitespace (that is, only one match, or multiple matches next to each other with optional whitespace between them).
valid examples:
1day
+1day
-1 day
+1day-1month
+1day +1month
+1day +1month
invalid examples:
###+1day+1month
+1day###+1month
+1day+1month###
###+1day+1month###
###+1day+1month###
In my case I can use the matcher.find() method. This would do the trick, but it would also accept input like +1day###+1month, which is not valid for me.
Any ideas? This can be solved with multiple if conditions and multiple checks of the start and end indexes, but I'm searching for an elegant solution.
EDIT
The regex suggested in the comments below, ^\s*(([+-]*)(\d+)\s*([a-zA-Z]+)\s*)+$, partially does the trick, but if I use it in the code below it returns a different result than the one I'm looking for.
The problem is that I cannot use (*my regex*)+ because it will match the whole thing.
A solution could be to match the whole input with ^\s*(([+-]*)(\d+)\s*([a-zA-Z]+)\s*)+$ and then use ([+-]*)(\\d+)\\s*([a-zA-Z]+) with matcher.find() and matcher.group(i) to extract each match and its groups. But I was looking for a more elegant solution.
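The two-step idea (validate the whole input first, then extract each term with find()) might be sketched like this. The class name ChainedDurationParser and the method parse are hypothetical, and the sketch assumes [+-]? (at most one sign) rather than [+-]*, which is a deliberate deviation from the regex in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ChainedDurationParser {
    // One term: optional sign, multiplier, time unit (assumption: at most one sign).
    private static final String TERM = "([+-]?)(\\d+)\\s*([a-zA-Z]+)";
    // The whole input: one or more terms separated only by optional whitespace.
    private static final Pattern WHOLE = Pattern.compile("\\s*(?:" + TERM + "\\s*)+");
    private static final Pattern ONE = Pattern.compile(TERM);

    // Returns one {sign, multiplier, unit} array per term,
    // or an empty list if the input as a whole is not valid.
    static List<String[]> parse(String input) {
        List<String[]> terms = new ArrayList<>();
        if (!WHOLE.matcher(input).matches()) {
            return terms; // step 1 failed: something other than terms and whitespace
        }
        Matcher m = ONE.matcher(input);
        while (m.find()) { // step 2: pull out each term and its groups
            terms.add(new String[] {m.group(1), m.group(2), m.group(3)});
        }
        return terms;
    }

    public static void main(String[] args) {
        for (String[] t : parse("+1day -1month")) {
            System.out.println(String.join(" | ", t));
        }
        System.out.println(parse("+1day###+1month").size()); // 0
    }
}
```

Because matches() must cover the entire input, no explicit ^ and $ anchors are needed in the validation pattern.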
This should work for you:
^\s*(([+-]*)(\d+)\s*([a-zA-Z]+)\s*)+$
First, by adding the beginning and ending anchors (^ and $), the pattern will not allow invalid characters to occur anywhere before or after the match.
Next, I included optional whitespace before and after the repeated pattern (\s*).
Finally, the entire pattern is enclosed in a repeater so that it can occur multiple times in a row ((...)+).
On a side note, I'd also recommend changing [+-]* to [+-]? so that the sign can occur at most once.
You could use ^ and $ for that, to match the start and end of the string:
^\s*(?:([+-]?)(\d+)\s*([a-z]+)\s*)+$
https://regex101.com/r/lM7dZ9/2
See the Unit Tests for your examples. Basically, you just need to allow the pattern to repeat and force that nothing besides whitespace occurs in between the matches.
Combined with line start/end matching and you're done.
You can use String.matches or Matcher.matches in Java to match the entire region.
Java Example:
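import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;
import java.util.regex.Pattern;
import org.junit.Test; // assuming JUnit 4
public class RegTest {
public static final Pattern PATTERN = Pattern.compile(
"(\\s*([+-]?)(\\d+)\\s*([a-zA-Z]+)\\s*)+");
@Test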
public void testDays() throws Exception {
assertTrue(valid("1 day"));
assertTrue(valid("-1 day"));
assertTrue(valid("+1day-1month"));
assertTrue(valid("+1day -1month"));
assertTrue(valid(" +1day +1month "));
assertFalse(valid("+1day###+1month"));
assertFalse(valid(""));
assertFalse(valid("++1day-1month"));
}
private static boolean valid(String s) {
return PATTERN.matcher(s).matches();
}
}
You can proceed like this:
String p = "\\G\\s*(?:([-+]?)(\\d+)\\s*([a-z]+)|\\z)";
Pattern RegexCompile = Pattern.compile(p, Pattern.CASE_INSENSITIVE);
String s = "+1day 1month";
ArrayList<HashMap<String, String>> results = new ArrayList<HashMap<String, String>>();
Matcher m = RegexCompile.matcher(s);
boolean validFormat = false;
while( m.find() ) {
if (m.group(1) == null) {
// if the capture group 1 (or 2 or 3) is null, it means that the second
// branch of the pattern has succeeded (the \z branch) and that the end
// of the string has been reached.
validFormat = true;
} else {
// otherwise, this is not the end of the string and the match result is
// stored temporarily in the ArrayList 'results'
HashMap<String, String> result = new HashMap<String, String>();
result.put("sign", m.group(1));
result.put("multiplier", m.group(2));
result.put("time_unit", m.group(3));
results.add(result);
}
}
if (validFormat) {
for (HashMap<String, String> item : results) {
System.out.println("sign: " + item.get("sign")
+ "\nmultiplier: " + item.get("multiplier")
+ "\ntime_unit: " + item.get("time_unit") + "\n");
}
} else {
results.clear();
System.out.println("Invalid Format");
}
The \G anchor matches at the start of the string or at the position right after the previous match. In this pattern, it ensures that all matches are contiguous. If the end of the string is reached, that proves the string is valid from start to end.

Regular Expressions - Find hexadecimal numbers' matches excluding the 0 of the next hexadecimal number

Goal - I need to retrieve all hexadecimal numbers from my input String.
Example inputs and matches --
1- Input = "0x480x8600x89dfh0x89BABCE" (The "" are not included in the input).
should produce following matches:
0x48 ( as opposed to 0x480)
0x860
0x89df
0x89BABCE
I have tried this Pattern:
"0[xX][\\da-fA-F]+"
But it results in the following matches:
0x480
0x89df
0x89BABCE
2- Input = "0x0x8600x89dfh0x89BABCE" (The "" are not included in the input).
Should produce following matches:
0x860
0x89df
0x89BABCE
Is such a regex possible?
I know that I can first split my input using the String.split("0[xX]"), and then for each String I can write logic to retrieve the first valid match, if there is one.
But I want to know if I can achieve the desired result using just a Pattern and a Matcher.
Here's my current code.
package toBeDeleted;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexTest {
public static void main(String[] args) {
String pattern = "0[xX][\\da-fA-F]+";
String input = "0x480x8600x89dfh0x89BABCE";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(input);
while (m.find()) {
System.out.println("Starting at : " + m.start()
+ ", Ending at : " + m.end()
+ ", element matched : " + m.group());
}
}
}
Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems.
-- Jamie Zawinski
If you just use .split("0x"), and tack 0x back on each (non-empty) result, you'll be done.
You can use a lookahead to check that the next character is not an "x"
String pattern = "0[xX]([1-9a-fA-F]|0(?![xX]))+";
This doesn't produce 0x0 as a match for the second example, though. However, you did state that matches should exclude the "0" preceding the next hex number, so I'm not really sure why that would be matched.
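To check the lookahead pattern against both example inputs from the question, a small harness along these lines could be used. The class name HexNumberFinder and the method findHexNumbers are made up for this sketch, and the capturing group is written as a non-capturing one since its content is not needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HexNumberFinder {
    // A hex digit is either non-zero, or a zero that is NOT followed by x/X
    // (such a zero belongs to the prefix of the next hex number).
    private static final Pattern HEX =
            Pattern.compile("0[xX](?:[1-9a-fA-F]|0(?![xX]))+");

    static List<String> findHexNumbers(String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = HEX.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findHexNumbers("0x480x8600x89dfh0x89BABCE"));
        // [0x48, 0x860, 0x89df, 0x89BABCE]
        System.out.println(findHexNumbers("0x0x8600x89dfh0x89BABCE"));
        // [0x860, 0x89df, 0x89BABCE]
    }
}
```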

Reusing the consumed characters in pattern matching in java?

Consider the following pattern:
aba
And the following source string:
abababbbaba
01234567890 //Index Positions
Using the Pattern and Matcher classes from the java.util.regex package finds this pattern only two times, since the regex engine does not reconsider characters that were already consumed.
What if I want to reuse some of the already consumed characters? That is, I want 3 matches here: one at position 0, one at 2 (which was previously skipped), and one at 8.
How do I do it?
I think you can use the indexOf() for something like that.
String str = "abababbbaba";
String substr = "aba";
int location = 0;
while ((location = str.indexOf(substr, location)) >= 0)
{
System.out.println(location);
location++;
}
Prints:
0, 2 and 8
You can use a lookahead for that. In the pattern below, group(1) holds the character that is consumed and group(2) holds the next two characters, which the lookahead captures without consuming them. Concatenated, they give every window of length 3 in the sentence you are searching in.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Question8968432 {
public static void main(String args[]) {
final String needle = "aba";
final String sentence = "abababbbaba";
final Matcher m = Pattern.compile("(.)(?=(..))").matcher(sentence);
while (m.find()) {
final String match = m.group(1) + m.group(2);
final String hint = String.format("%s[%s]%s",
sentence.substring(0, m.start()), match,
sentence.substring(m.start() + match.length()));
if (match.equals(needle)) {
System.out.printf("Found %s starting at %d: %s\n",
match, m.start(), hint);
}
}
}
}
Output:
Found aba starting at 0: [aba]babbbaba
Found aba starting at 2: ab[aba]bbbaba
Found aba starting at 8: abababbb[aba]
You can skip the final String hint part, this is just to show you what it matches and where.
If you can change the regexp, then you can simply use something like:
a(?=ba)
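A short sketch of how that changed pattern could be used to collect the overlapping positions. The class name OverlappingMatches and the method startsOfAba are invented for this illustration. Only the leading a is consumed; the ba part is merely looked at, so it stays available for the next match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OverlappingMatches {
    // Returns the start index of every (possibly overlapping) occurrence of "aba".
    static List<Integer> startsOfAba(String text) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile("a(?=ba)").matcher(text);
        while (m.find()) {
            starts.add(m.start()); // the match itself is just "a"; "ba" is not consumed
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(startsOfAba("abababbbaba")); // [0, 2, 8]
    }
}
```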
