Java regular expression find() matching a zero-length string - java

The Javadoc for java.util.regex.Matcher.find() says:
This method starts at the beginning of this matcher's region, or, if a
previous invocation of the method was successful and the matcher has
not since been reset, at the first character not matched by the
previous match.
My own experiments suggest that if the regex matches a zero length string, then the method starts one character beyond the end of the previous (zero-length) match. For example, given an input string "abcabc" and a regex "a|b?", successive matches will occur at positions (0, 1, 2, 3, 4, 5, 6). Following a successful match at position 6, there is an unsuccessful match at position 6; and further calls also return false with position remaining at 6.
The documentation suggests that after finding a zero-length match at position 2 (character "c"), the next call on find will start at "the first character not matched", which is still at position 2.
The documentation also suggests that after an unsuccessful match at position 6, the next call should start at position 0, which doesn't appear to be the case.
Is the documentation simply wrong?
Is there a more precise rule somewhere, for example one that covers the behaviour at position 6?
My reason for asking is that I am trying to write an emulation of the method in C#, and I'm having trouble reproducing the precise behaviour by trial and error.

This is a special case handled by Matcher.find():
public boolean find() {
    int nextSearchIndex = last;
    if (nextSearchIndex == first)
        nextSearchIndex++;
    // If next search starts before region, start it at region
    if (nextSearchIndex < from)
        nextSearchIndex = from;
    // If next search starts beyond region then it fails
    if (nextSearchIndex > to) {
        for (int i = 0; i < groups.length; i++)
            groups[i] = -1;
        return false;
    }
    return search(nextSearchIndex);
}
If the last match was a zero-length match (indicated by first == last) then it starts the next search at last + 1.
If you think about it this makes sense. Without this, if the Matcher once finds a zero-length match it would never proceed over this match.
So yes, the documentation is incomplete: it neither mentions the special case for zero length matches nor that it only ever passes once over the input string.
But IMHO both special cases are implied by "Attempts to find the next subsequence of the input sequence that matches the pattern."
After a zero length match the next possible subsequence must start one position after the previous match, and the next subsequence can never start at a position lower than the current position (i.e. start over from the beginning).
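The behaviour described above can be checked with a small program. This is only a minimal sketch; the class name FindSpans and the helper spans are made up for illustration. It records the start and end index of every successful find() for the pattern and input from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FindSpans {
    // Hypothetical helper: returns "start-end" for each successful find().
    static List<String> spans(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.start() + "-" + m.end());
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [0-1, 1-2, 2-2, 3-4, 4-5, 5-5, 6-6]
        System.out.println(spans("a|b?", "abcabc"));
    }
}
```

The zero-length matches (2-2, 5-5 and 6-6) are each followed by a search that starts one position further on, and the loop ends once that position lies beyond the end of the input.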

You can find the OpenJDK implementation of the find method here.
public boolean find() {
    int nextSearchIndex = last;
    if (nextSearchIndex == first)
        nextSearchIndex++;
    // If next search starts before region, start it at region
    if (nextSearchIndex < from)
        nextSearchIndex = from;
    // If next search starts beyond region then it fails
    if (nextSearchIndex > to) {
        for (int i = 0; i < groups.length; i++)
            groups[i] = -1;
        return false;
    }
    return search(nextSearchIndex);
}
first and last store the range of the previous successful match, and from and to store the range of the string to be matched.
/**
* The range within the sequence that is to be matched. Anchors
* will match at these "hard" boundaries. Changing the region
* changes these values.
*/
int from, to;
/**
* The range of string that last matched the pattern. If the last
* match failed then first is -1; last initially holds 0 then it
* holds the index of the end of the last match (which is where the
* next search starts).
*/
int first = -1, last = 0;
The behaviours that are relevant to your question are the first and third if statements.
As you can see, the first if checks whether the previous match was a zero-length match and, if it was, moves nextSearchIndex one char forward.
The third if checks if nextSearchIndex goes out of range of the string. This happens after you found all 7 matches of "a|b?" in "abcabc".
From what I can see, the documentation doesn't suggest that "after an unsuccessful match at position 6, the next call should start at position 0". At best, the documentation doesn't say anything about what happens after an unsuccessful match.
It could also be argued that "a previous invocation of the method" refers to an invocation of the method at any previous time, not just the immediately prior invocation, in which case what happens after an unsuccessful match is very well-defined:
if there were no previous successful matches at all, start at the beginning
if there was a previous successful match, and it doesn't matter how many failed matches you have since that match, start at the next character not matched by that match.
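For an emulation in another language, the start-index rule can be isolated as a pure function. The following is a sketch based only on the find() source quoted above; the class name FindStart and the method nextSearchIndex are invented here, and returning -1 to mean "fail without searching" is a convention of this sketch, not of the JDK.

```java
final class FindStart {
    /**
     * Computes where the next find() would start searching.
     * first/last: bounds of the previous match (first is -1 if there was none
     * or the last search failed; last is initially 0).
     * from/to: bounds of the matcher's region.
     * Returns -1 when the search would start beyond the region.
     */
    static int nextSearchIndex(int first, int last, int from, int to) {
        int next = last;
        if (next == first) {
            next++;          // previous match was zero-length: step over it
        }
        if (next < from) {
            next = from;     // never start before the region
        }
        return next > to ? -1 : next;
    }
}
```

With the region 0..6 of "abcabc", a zero-length match at 2 (first = 2, last = 2) gives a next start of 3, and a zero-length match at 6 gives -1, which corresponds to find() returning false.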

Related

how to use oracle instr function in java

if instr(lv_my_name, ' ', -2) != length(lv_my_name) - 2 or length(lv_my_name) = 2 then
lv_my_name:= lv_my_name;
else
lv_my_name:= trim(substr(lv_my_name, 0, length(lv_my_name)-1));
end if;
if I were to write the same logic in Java, how do I do it?
INSTR(string, substring [, position [, occurrence] ] )
The INSTR functions search string for substring. [...] If a substring that is equal to substring is found, then the function returns an integer indicating the position of the first character of this substring. If no such substring is found, then the function returns zero.
position is an nonzero integer indicating the character of string where Oracle Database begins the search—that is, the position of the first character of the first substring to compare with substring. If position is negative, then Oracle counts backward from the end of string and then searches backward from the resulting position.
occurrence is an integer indicating which occurrence of substring in string Oracle should search for. The value of occurrence must be positive. If occurrence is greater than 1, then the database does not return on the first match but continues comparing consecutive substrings of string, as described above, until match number occurrence has been found.
Compare that to:
String.indexOf(String str, int fromIndex)
Returns the index of the first occurrence of the specified substring, starting at the specified index, or -1 if there is no such occurrence.
String.lastIndexOf(String str, int fromIndex)
Returns the index of the last occurrence of the specified substring, searching backward from the specified index, or -1 if there is no such occurrence.
So, since your code doesn't use the 4th parameter and the 3rd parameter is negative, INSTR and lastIndexOf are mostly equivalent.
Remember that Java indexes are 0-based, and Oracle positions are 1-based.
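Putting the two index conventions together, the PL/SQL fragment from the question could be translated roughly as follows. This is a sketch under stated assumptions: the class name NameFix and the method adjustName are invented for illustration, Oracle's treatment of empty strings as NULL is not reproduced, and Java's trim() strips all characters up to U+0020 whereas Oracle's TRIM strips only blanks by default.

```java
final class NameFix {
    // Hypothetical translation of the PL/SQL block from the question.
    static String adjustName(String name) {
        int len = name.length();
        // INSTR(name, ' ', -2): start at the second-to-last character and
        // search backward. In 0-based Java terms that start index is len - 2.
        // Adding 1 converts the result to Oracle's 1-based position,
        // and turns "not found" (-1) into 0, as INSTR would return.
        int pos = name.lastIndexOf(' ', len - 2) + 1;
        if (pos != len - 2 || len == 2) {
            return name;
        }
        // SUBSTR(name, 0, len - 1): Oracle treats position 0 as 1,
        // so this keeps everything except the last character.
        return name.substring(0, len - 1).trim();
    }
}
```

For example, adjustName("John X1") drops the final character because the last space sits exactly two characters before the end, while adjustName("John Smith") is returned unchanged.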

Explain the Code and why Subtract the substring

public boolean frontAgain(String str)
{
    int len = str.length();
    if(len >= 2)
        return str.substring(0, 2).equals(str.substring(len-2, len));
    else
        return false;
}
Can someone please explain the second substring call, using an example word and a step-by-step analysis? The program checks whether the first two letters match the last two letters. For example, the word was "edited".
str.substring(len-2, len) asks for the last two letters of a string.
To get the last two letters, you need the beginning value of the substring to be the length (5) minus 2 characters, which gives you 3. This is because indexes in Java start at 0. For example, the positions for the characters in the string "horse" are 01234 (i.e. "h" is at index 0, "o" is at index 1 etc.), and the length is 5.
The second parameter of String.substring is the ending index, which is exclusive. This means it is the first character position that is not part of the substring you want. In this case, it is the length, because that is one higher than the index of the last character in the string.
If you put all that together, you get the following:
String str = "horse";
int length = str.length(); // 5
String lastTwoChars = str.substring(length-2, length); // from position 3 up to (but not including) 5
System.out.println(lastTwoChars); // would show you "se"
The documentation for String.substring.

split a string in java into equal length substrings while maintaining word boundaries

How to split a string into equal parts of maximum character length while maintaining word boundaries?
Say, for example, if I want to split a string "hello world" into equal substrings of maximum 7 characters it should return me
"hello "
and
"world"
But my current implementation returns
"hello w"
and
"orld "
I am using the following code taken from Split string to equal length substrings in Java to split the input string into equal parts
public static List<String> splitEqually(String text, int size) {
    // Give the list the right capacity to start with. You could use an array
    // instead if you wanted.
    List<String> ret = new ArrayList<String>((text.length() + size - 1) / size);
    for (int start = 0; start < text.length(); start += size) {
        ret.add(text.substring(start, Math.min(text.length(), start + size)));
    }
    return ret;
}
Will it be possible to maintain word boundaries while splitting the string into substring?
To be more specific I need the string splitting algorithm to take into account the word boundary provided by spaces and not solely rely on character length while splitting the string although that also needs to be taken into account but more like a max range of characters rather than a hardcoded length of characters.
If I understand your problem correctly, then this code should do what you need (but it assumes that maxLenght is greater than or equal to the length of the longest word):
String data = "Hello there, my name is not importnant right now."
        + " I am just simple sentecne used to test few things.";
int maxLenght = 10;
Pattern p = Pattern.compile("\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)", Pattern.DOTALL);
Matcher m = p.matcher(data);
while (m.find())
    System.out.println(m.group(1));
Output:
Hello
there, my
name is
not
importnant
right now.
I am just
simple
sentecne
used to
test few
things.
Short (or not) explanation of the "\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)" regex:
(Remember that in Java \ is special not only in regex but also in String literals, so to use a predefined character class like \d we have to write it as "\\d", escaping the \ in the string literal as well.)
\G - an anchor representing the end of the previous match or, if there is no match yet (when we have just started searching), the beginning of the string (the same as ^)
\s* - zero or more whitespace characters (\s represents whitespace, * is the "zero-or-more" quantifier)
(.{1,"+maxLenght+"}) - let's split it into parts (at runtime maxLenght will hold some numeric value like 10, so the regex engine will see it as .{1,10})
. represents any character (by default it matches any character except line separators like \n or \r, but thanks to the Pattern.DOTALL flag it can now match any character; you can drop this flag if you want each line of the input to be split separately, since its start will be printed on a new line anyway)
{1,10} - a quantifier that lets the preceding element appear 1 to 10 times (by default it tries to match as many repetitions as possible)
.{1,10} - so, based on the above, it simply represents "1 to 10 of any character"
( ) - parentheses create groups, structures that let us capture specific parts of the match (here the parentheses come after \\s* because we only want the part after the whitespace)
(?=\\s|$) - a look-ahead that makes sure the text matched by .{1,10} is followed by:
whitespace (\\s)
OR (written as |)
the end of the string ($).
So thanks to .{1,10} we can match up to 10 characters, but with (?=\\s|$) after it we require that the last character matched by .{1,10} is not part of an unfinished word (there must be whitespace or the end of the string after it).
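The regex described above can be wrapped in a reusable method. This is a sketch only: the class name WordWrap and the method wrap are made up for illustration, and it inherits the limitation that matching stops at the first word longer than the limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordWrap {
    // Hypothetical helper: splits text into chunks of at most maxLength
    // characters, breaking only where whitespace (or the end) follows.
    static List<String> wrap(String text, int maxLength) {
        Pattern p = Pattern.compile(
                "\\G\\s*(.{1," + maxLength + "})(?=\\s|$)", Pattern.DOTALL);
        Matcher m = p.matcher(text);
        List<String> chunks = new ArrayList<>();
        while (m.find()) {
            chunks.add(m.group(1)); // group 1 excludes the leading whitespace
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Prints [hello, world]
        System.out.println(wrap("hello world", 7));
    }
}
```

Note that the separating whitespace is dropped rather than kept at the end of the first chunk, so the result for the question's example is "hello" and "world", not "hello " and "world".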
Non-regex solution, just in case someone is more comfortable (?) not using regular expressions:
private String justify(String s, int limit) {
    StringBuilder justifiedText = new StringBuilder();
    StringBuilder justifiedLine = new StringBuilder();
    String[] words = s.split(" ");
    for (int i = 0; i < words.length; i++) {
        justifiedLine.append(words[i]).append(" ");
        if (i+1 == words.length || justifiedLine.length() + words[i+1].length() > limit) {
            justifiedLine.deleteCharAt(justifiedLine.length() - 1);
            justifiedText.append(justifiedLine.toString()).append(System.lineSeparator());
            justifiedLine = new StringBuilder();
        }
    }
    return justifiedText.toString();
}
Test:
String text = "Long sentence with spaces, and punctuation too. And supercalifragilisticexpialidocious words. No carriage returns, tho -- since it would seem weird to count the words in a new line as part of the previous paragraph's length.";
System.out.println(justify(text, 15));
Output:
Long sentence
with spaces,
and punctuation
too. And
supercalifragilisticexpialidocious
words. No
carriage
returns, tho --
since it would
seem weird to
count the words
in a new line
as part of the
previous
paragraph's
length.
It takes into account words that are longer than the set limit, so it doesn't skip them (unlike the regex version, which just stops processing when it finds supercalifragilisticexpialidocious).
PS: The comment about all input words being expected to be shorter than the set limit, was made after I came up with this solution ;)

Length of specific substring

I check if my string begins with a number using
if (Regex.IsMatch(myString, @"\d+")) ...
If this condition holds I want to get the length of this "numeric" substring that my string begins with.
I can find the length checking if every next character is a digit beginning from the first one and increasing some counter. Is there any better way to do this?
Well, instead of using IsMatch, you should find the match:
// Presumably you'll be using the same regular expression every time, so
// we might as well just create it once...
private static readonly Regex Digits = new Regex(@"\d+");
...
Match match = Digits.Match(text);
if (match.Success)
{
    string value = match.Value;
    // Take the length or whatever
}
Note that this doesn't check that the digits occur at the start of the string. You could do that using @"^\d+" which will anchor the match to the beginning. Or you could check that match.Index was 0 if you wanted...
To check whether the string begins with a number, you need to use the pattern ^\d+.
string pattern = @"^\d+";
MatchCollection mc = Regex.Matches(myString, pattern);
if (mc.Count > 0)
{
    Console.WriteLine(mc[0].Value.Length);
}
Your regex checks whether your string contains a sequence of one or more digits. If you want to check that it starts with one, you need to anchor it at the beginning:
Match m = Regex.Match(myString, @"^\d+");
if (m.Success)
{
    int length = m.Length;
}
As an alternative to a regular expression, you can use extension methods:
int cnt = myString.TakeWhile(Char.IsDigit).Count();
If there are no digits in the beginning of the string you will naturally get a zero count. Otherwise you have the number of digits.
Instead of just checking IsMatch, get the match so you can get info about it, like the length:
var match = Regex.Match(myString, @"^\d+");
if (match.Success)
{
    int count = match.Length;
}
Also, I added a ^ to the beginning of your pattern to limit it to the beginning of the string.
If you break out your code a bit more, you can take advantage of Regex.Match:
var length = 0;
var myString = "123432nonNumeric";
var match = Regex.Match(myString, @"\d+");
if (match.Success)
{
    length = match.Value.Length;
}

Java using Matcher to fail when the immediate sequence is not matchable

Matcher.find finds the next subsequence, starting at a given index, which is compliant with the regex.
How can I make it so that it fails if the next character sequence is not compliant?
Ex:
String input = "123456text123";
Matcher mat1 = Pattern.compile("\\d+").matcher(input);
mat1.find();
System.out.println(mat1.group()); //123456
mat1.find(mat1.end());
System.out.println(mat1.group()); //123
I want to know if there's a way to make the second find fail, since the next sequence does not match the mat1 pattern.
I want to be able to 'compose' matchers, in such a way that they MUST always be found in sequence.
Is it possible at all?
You can check that the previous mat1.end() equals the next mat1.start().
// end() is exclusive, so a contiguous next match must start exactly at lastEnd
int lastEnd = 0;
while (mat1.find()) {
    // Was there any junk between the last two matches?
    if (mat1.start() != lastEnd) {
        System.out.println("Fail.");
        break;
    }
    System.out.println(mat1.group());
    lastEnd = mat1.end();
}
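A different way to get the same effect, not taken from the answer above, is to restrict the matcher's region to start at the current position and call lookingAt(), which only succeeds if the match begins at the start of the region. In this sketch the class name SequentialMatch and the helper matchAt are hypothetical; region, lookingAt and end are standard java.util.regex.Matcher methods.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SequentialMatch {
    /**
     * Tries to match p starting exactly at pos.
     * Returns the (exclusive) end index of the match, or -1 if the text
     * at pos does not match. Indices are relative to the whole input.
     */
    static int matchAt(Pattern p, CharSequence input, int pos) {
        Matcher m = p.matcher(input).region(pos, input.length());
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        String input = "123456text123";
        Pattern digits = Pattern.compile("\\d+");
        int end = matchAt(digits, input, 0);          // 6
        System.out.println(end);
        System.out.println(matchAt(digits, input, end)); // -1: "text" follows
    }
}
```

Composing patterns in sequence then amounts to feeding each returned end index in as the next pos and stopping as soon as -1 comes back.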
