Java using Matcher to fail when the immediate sequence is not matchable - java

Matcher.find finds the next subsequence, starting at a given index, which is compliant with the regex.
How can I make it so that it fails if the next character sequence is not compliant?
Ex:
String input = "123456text123";
Matcher mat1 = Pattern.compile("\\d+").matcher(input);
mat1.find();
System.out.println(mat1.group()); //123456
mat1.find(mat1.end());
System.out.println(mat1.group()); //123
I want to know if there's a way to make the second find fail, since the next sequence does not match the mat1 pattern.
I want to be able to 'compose' matchers, in such a way that they MUST always be found in sequence.
Is it possible at all?

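You can check that each match starts exactly where the previous one ended, i.e. that the previous mat1.end() equals the next mat1.start(). Note that end() already returns the index just past the match, so no +1 is needed.
int lastEnd = 0; // the first match must start at index 0
while (mat1.find()) {
// Was there any junk between the last two matches?
if (mat1.start() != lastEnd) {
System.out.println("Fail.");
break;
}
System.out.println(mat1.group());
lastEnd = mat1.end();
}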
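An alternative that avoids comparing indices by hand is to restrict the matcher to a region starting at the current position and call lookingAt(), which only succeeds if the match begins at the start of the region. The following is a minimal sketch; the class name SequentialMatch and the helper matchAt are made up for illustration and are not part of the original question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SequentialMatch {
    // Hypothetical helper: returns the end index of the match if `pattern`
    // matches starting exactly at `from`, or -1 if it does not.
    static int matchAt(Pattern pattern, CharSequence input, int from) {
        Matcher m = pattern.matcher(input);
        m.region(from, input.length()); // limit matching to [from, end of input)
        return m.lookingAt() ? m.end() : -1; // lookingAt() is anchored at the region start
    }

    public static void main(String[] args) {
        String input = "123456text123";
        Pattern digits = Pattern.compile("\\d+");
        Pattern letters = Pattern.compile("[a-z]+");

        int pos = matchAt(digits, input, 0);
        System.out.println(pos);                           // 6
        System.out.println(matchAt(digits, input, pos));   // -1: "text" is not digits
        System.out.println(matchAt(letters, input, pos));  // 10
    }
}
```

Chaining calls like this, each one starting at the end index returned by the previous one, gives the "must be found in sequence" behaviour: the first -1 means the sequence is broken.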

Related

Counting comma and any text in java String

I'm trying to write a function to count specific Strings.
The Strings to count look like the following:
first any character except comma at least once -
the comma -
any character, at least once
example string:
test, test, test,
should count to 3
I've tried do that by doing the following:
int countSubstrings = 0;
final Pattern pattern = Pattern.compile("[^,]*,.+");
final Matcher matcher = pattern.matcher(commaString);
while (matcher.find()) {
countSubstrings++;
}
Though my solution doesn't work. It always ends up counting to one and no further.
Try this pattern instead: [^,]+
As you can see in the API, find() will give you the next subsequence that matches the pattern. So this will find your sequences of "non-commas" one after the other.
Your regex, especially the .+ part, will match any char sequence of at least length 1, so the first match swallows the rest of the input. You want the match to be reluctant/lazy, so add a ?: [^,]*,.+?
Note that .+? will still match a comma that directly follows a comma, so you might want to replace .+? with [^,]+ instead (since commas can't match, laziness is not needed here).
Besides that an easier solution might be to split the string and get the length of the array (or loop and check the elements if you don't want to allow for empty strings):
countSubstrings = commaString.split(",").length;
Edit:
Since you added an example that clarifies your expectations, you need to adjust your regex. You seem to want to count the number of strings followed by a comma so your regex can be simplified to [^,]+,. This matches any char sequence consisting of non-comma chars which is followed by a comma.
Note that this wouldn't match multiple commas or text at the end of the input, e.g. test,,test would result in a count of 1. If you have that requirement you need to adjust your regex.
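To make the difference between the patterns concrete, here is a small sketch that counts matches for the original regex and for the simplified one. The class name CommaCount and the helper countMatches are illustrative names only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommaCount {
    // Counts how many times `regex` is found in `input` using repeated find().
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Greedy .+ consumes everything after the first comma: one match only.
        System.out.println(countMatches("[^,]*,.+", "test, test, test,")); // 1
        // One or more non-commas followed by a comma.
        System.out.println(countMatches("[^,]+,", "test, test, test,"));   // 3
        // Empty field and trailing text are not counted.
        System.out.println(countMatches("[^,]+,", "test,,test"));          // 1
    }
}
```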
Good answers have already been given. Something like this should also work. Beware: it's not clean and probably not the fastest way to do this, but it is quite readable. :)
public int countComma(String lots_of_words) {
int count = 0;
for (int x = 0; x < lots_of_words.length(); x++) {
if (lots_of_words.charAt(x) == ',') {
count++;
}
}
return count;
}
Or even better:
public int countChar(String lots_of_words, char the_chosen_char) {
int count = 0;
for (int x = 0; x < lots_of_words.length(); x++) {
if (lots_of_words.charAt(x) == the_chosen_char) {
count++;
}
}
return count;
}

Java String- How to get a part of package name in android?

It's basically about getting the string value between two characters. SO has many questions related to this, like:
How to get a part of a string in java?
How to get a string between two characters?
Extract string between two strings in java
and more.
But I found it quite confusing when dealing with multiple dots in the string and getting the value between two particular dots.
I have the package name:
au.com.newline.myact
I need to get the value between "com." and the next "dot(.)". In this case "newline". I tried
Pattern pattern = Pattern.compile("com.(.*).");
Matcher matcher = pattern.matcher(beforeTask);
while (matcher.find()) {
int ct = matcher.group();
I tried using substrings and IndexOf also. But couldn't get the intended answer. Because the package name in android varies by different number of dots and characters, I cannot use fixed index. Please suggest any idea.
As you probably know (based on the .* part of your regex), the dot . is a special character in regular expressions that represents any character (except line separators). So to make a dot represent only a literal dot you need to escape it. To do so you can place \ before it, or place it inside a character class: [.].
Also to get only part from parenthesis (.*) you need to select it with proper group index which in your case is 1.
So try with
String beforeTask = "au.com.newline.myact";
Pattern pattern = Pattern.compile("com[.](.*)[.]");
Matcher matcher = pattern.matcher(beforeTask);
while (matcher.find()) {
String ct = matcher.group(1);//remember that regex finds Strings, not int
System.out.println(ct);
}
Output: newline
If you want to get only one element before next . then you need to change greedy behaviour of * quantifier in .* to reluctant by adding ? after it like
Pattern pattern = Pattern.compile("com[.](.*?)[.]");
// ^
Another approach is instead of .* accepting only non-dot characters. They can be represented by negated character class: [^.]*
Pattern pattern = Pattern.compile("com[.]([^.]*)[.]");
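Putting the negated character class variant into a complete program might look like the sketch below. The class name PackageSegment and the method segmentAfterCom are hypothetical, and returning an Optional for the no-match case is a design choice of this example, not something the original answer prescribes.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PackageSegment {
    // "com.", then a run of non-dot characters (captured), then the next dot.
    private static final Pattern AFTER_COM = Pattern.compile("com[.]([^.]*)[.]");

    // Returns the segment between "com." and the following dot, if there is one.
    static Optional<String> segmentAfterCom(String packageName) {
        Matcher m = AFTER_COM.matcher(packageName);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(segmentAfterCom("au.com.newline.myact").orElse("<none>"));          // newline
        System.out.println(segmentAfterCom("au.com.newline.myact.modelact").orElse("<none>")); // newline
        System.out.println(segmentAfterCom("org.example.app").orElse("<none>"));               // <none>
    }
}
```

Because [^.]* cannot cross a dot, the result is the same no matter how many further segments follow.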
If you don't want to use regex you can simply use indexOf method to locate positions of com. and next . after it. Then you can simply substring what you want.
String beforeTask = "au.com.newline.myact.modelact";
int start = beforeTask.indexOf("com.") + 4; // +4 since we also want to skip 'com.' part
int end = beforeTask.indexOf(".", start); //find next `.` after start index
String result = beforeTask.substring(start, end);
System.out.println(result);
You can use reflection to get the name of any class. For example:
If I have a class Runner in com.some.package, I can run
Runner.class.getName() // string is "com.some.package.Runner"
to get the fully qualified name of the class, which includes the package name. (toString() would prefix the result with "class ".)
To get something after 'com' you can use Runner.class.getName().split("\\.") (the dot must be escaped because split takes a regex) and then iterate over the returned array with a boolean flag.
All you have to do is split the strings by "." and then iterate through them until you find one that equals "com". The next string in the array will be what you want.
So your code would look something like:
String[] parts = packageName.split("\\.");
int i = 0;
for (String part : parts) {
if (part.equals("com")) {
break;
}
++i;
}
String result = parts[i + 1]; // assumes "com" is present and is not the last part
private String getStringAfterComDot(String packageName) {
String strArr[] = packageName.split("\\.");
for(int i=0; i<strArr.length-1; i++){ // stop one short so strArr[i+1] is always valid
if(strArr[i].equals("com"))
return strArr[i+1];
}
return "";
}
I have done heaps of projects involving website scraping, and I
just had to create my own functions/utils to get the job done. Regex might
be overkill sometimes if you just want to extract a substring from
a given string like the one you have. Below is the function I normally
use for this kind of task.
private String GetValueFromText(String sText, String sBefore, String sAfter)
{
String sRetValue = "";
int nPos = sText.indexOf(sBefore);
if ( nPos > -1 )
{
int nLast = sText.indexOf(sAfter,nPos+sBefore.length()+1);
if ( nLast > -1)
{
sRetValue = sText.substring(nPos+sBefore.length(),nLast);
}
}
return sRetValue;
}
To use it just do the following:
String sValue = GetValueFromText("au.com.newline.myact", ".com.", ".");

Regex to match only letters and numbers

Can you help with this code?
It seems easy, but always fails.
@Test
public void normalizeString(){
StringBuilder ret = new StringBuilder();
//Matcher matches = Pattern.compile( "([A-Z0-9])" ).matcher("P-12345678-P");
Matcher matches = Pattern.compile( "([\\w])" ).matcher("P-12345678-P");
for (int i = 1; i < matches.groupCount(); i++)
ret.append(matches.group(i));
assertEquals("P12345678P", ret.toString());
}
Constructing a Matcher does not automatically perform any matching. That's in part because Matcher supports two distinct matching behaviors, differing in whether the match is implicitly anchored to the beginning of the Matcher's region. It appears that you could achieve your desired result like so:
@Test
public void normalizeString(){
StringBuilder ret = new StringBuilder();
Matcher matches = Pattern.compile( "[A-Z0-9]+" ).matcher("P-12345678-P");
while (matches.find()) {
ret.append(matches.group());
}
assertEquals("P12345678P", ret.toString());
}
Note in particular the invocation of Matcher.find(), which was a key omission from your version. Also, the nullary Matcher.group() returns the substring matched by the last find().
Furthermore, although your use of Matcher.groupCount() isn't exactly wrong, it does lead me to suspect that you have the wrong idea about what it does. In particular, in your code it will always return 1: it inquires about the pattern, not about matches to it.
First of all you don't need to add any group because entire match can be always accessed by group 0, so instead of
(regex) and group(1)
you can use
regex and group(0)
The next thing is that \\w is already a character class, so you don't need to surround it with another [ ]; that would be like [[a-z]], which is the same as [a-z].
Now in your
for (int i = 1; i < matches.groupCount(); i++)
ret.append(matches.group(i));
you will iterate over all groups starting from 1, but you will exclude the last group: groups are indexed from 1 to n, so i<n will not include n. You would need to use i <= matches.groupCount() instead.
Also, it looks like you are confusing two things. This loop will not find all matches of the regex in the input. Such a loop is used to iterate over the groups of the regex after a match has been found.
So if the regex were something like (\w(\w))c and your match were abc, then
for (int i = 1; i <= matches.groupCount(); i++)
System.out.println(matches.group(i));
would print
ab
b
because
first group contains two characters (\w(\w)) before c
second group is the one inside first one, right after first character.
But to print them you actually would need to first let regex engine iterate over your input and find() match, or check if entire input matches() regex, otherwise you would get IllegalStateException because regex engine can't know from which match you want to get your groups (there can be many matches of regex in input).
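The points above can be checked with a short sketch. The class name GroupDemo and both helper methods are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Collects groups 1..groupCount() of the first match; empty list if nothing matches.
    static List<String> groupsOfFirstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> groups = new ArrayList<>();
        if (m.find()) { // a match must be found before group(i) may be called
            for (int i = 1; i <= m.groupCount(); i++) {
                groups.add(m.group(i));
            }
        }
        return groups;
    }

    // Shows that calling group(1) before find()/matches() throws IllegalStateException.
    static boolean groupBeforeFindThrows(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        try {
            m.group(1);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(groupsOfFirstMatch("(\\w(\\w))c", "abc"));    // [ab, b]
        System.out.println(groupBeforeFindThrows("(\\w(\\w))c", "abc")); // true
    }
}
```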
So what you may want to use is something like
StringBuilder ret = new StringBuilder();
Matcher matches = Pattern.compile( "[A-Z0-9]" ).matcher("P-12345678-P");
while (matches.find()){//find next match
ret.append(matches.group(0));
}
assertEquals("P12345678P", ret.toString());
Another (and probably simpler) solution would be to remove all the characters you don't want from your input. So you could just use replaceAll with a negated character class [^...], like
String input = "P-12345678-P";
String result = input.replaceAll("[^A-Z0-9]+", "");
which will produce new string in which all characters which are not A-Z0-9 will be removed (replaced with "").

Java split a CSV ignoring HTML characters

I need to split a string by semicolon ignoring the semicolons that may come as HTML characters.
For instance, given the string:
id=com.google.android;keywords=Android;Operating System;Phone;versions=Gingerbread;ICS;JB
I need to split it into:
id = com.google.android
keywords=Android;Operating System;Phone
versions=Gingerbread;ICS;JB
Any idea how to do this?
A regex like (?<!&#?[0-9a-zA-Z]+); would probably do it. This would prevent matching a semicolon that terminates an entity reference or character reference, though it also catches a few cases that are not technically either by the specs (e.g. it wouldn't match the semicolon at the end of &#foo; or &123;).
(?<!...) is a "negative lookbehind", so you can read this regex as matching a semicolon that is not preceded by a substring that matches &#?[0-9a-zA-Z]+ (i.e. ampersand, optional hash, and one or more alphanumerics). However lookbehinds must have an upper bound on the number of characters they can match, which + doesn't, so you'll have to use a bounded repetition count, like {1,5} rather than the unbounded +. The upper bound needs to be at least as long as the longest entity reference you might see, and if your data might contain arbitrary entity references then you'll have to use something like the length of the string as the upper bound.
String[] keyValuePairs = theString.split(
"(?<!&#?[0-9a-zA-Z]{1," + theString.length() + "});");
If you can specify a smaller bound then that would probably be more efficient.
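As a sketch of the bounded lookbehind on plain Java (not Android), the example below uses a fixed upper bound of 8 characters. Both the bound and the sample input are assumptions: the input writes the embedded semicolons as the character reference &#59; purely for illustration, and the class name EntityAwareSplit is hypothetical.

```java
import java.util.Arrays;

public class EntityAwareSplit {
    // A semicolon NOT preceded by "&", an optional "#", and 1 to 8 alphanumerics.
    // Assumption: no entity or character reference name is longer than 8 characters.
    static final String SEPARATOR = "(?<!&#?[0-9a-zA-Z]{1,8});";

    static String[] splitPairs(String s) {
        return s.split(SEPARATOR);
    }

    public static void main(String[] args) {
        // Hypothetical input: semicolons inside values are encoded as &#59;
        String input = "id=com.google.android;keywords=Android&#59;Operating System&#59;Phone;"
                + "versions=Gingerbread&#59;ICS&#59;JB";
        for (String pair : splitPairs(input)) {
            System.out.println(pair);
        }
        // Prints three lines, one each for id=..., keywords=... and versions=...
    }
}
```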
Edit: Android apparently doesn't like this lookbehind, even with bounded repetition, so you probably won't be able to use a single regex with String.split to do what you're after, you'll have to do the looping yourself, e.g.
Pattern p = Pattern.compile("(?:&#?[0-9a-zA-Z]+)?;");
Matcher m = p.matcher(theString);
List<String> splits = new ArrayList<String>();
int lastEltStart = 0;
while(m.find()) {
if(m.end() - m.start() > 1) {
// this match was an entity/character reference so don't split here
continue;
}
if(m.start() > lastEltStart) {
// non-empty part
splits.add(theString.substring(lastEltStart, m.start()));
}
lastEltStart = m.end();
}
if(lastEltStart < theString.length()) {
// non-empty final part
splits.add(theString.substring(lastEltStart));
}
Since the HTML entities have only two or three digits between the '&#' and ';' I used the following regex:
(?<!&#\d{2,3});

Length of specific substring

I check whether my string begins with a number using
if (Regex.IsMatch(myString, @"\d+")) ...
If this condition holds I want to get the length of this "numeric" substring that my string begins with.
I can find the length checking if every next character is a digit beginning from the first one and increasing some counter. Is there any better way to do this?
Well instead of using IsMatch, you should find the match:
// Presumably you'll be using the same regular expression every time, so
// we might as well just create it once...
private static readonly Regex Digits = new Regex(@"\d+");
...
Match match = Digits.Match(text);
if (match.Success)
{
string value = match.Value;
// Take the length or whatever
}
Note that this doesn't check that the digits occur at the start of the string. You could do that using @"^\d+", which will anchor the match to the beginning. Or you could check that match.Index is 0 if you wanted...
To check whether the string begins with a number, you need to use the pattern ^\d+.
string pattern = @"^\d+";
MatchCollection mc = Regex.Matches(myString, pattern);
if(mc.Count > 0)
{
Console.WriteLine(mc[0].Value.Length);
}
Your regex checks if your string contains a sequence of one or more numbers. If you want to check that it starts with it you need to anchor it at the beginning:
Match m = Regex.Match(myString, @"^\d+");
if (m.Success)
{
int length = m.Length;
}
As an alternative to a regular expression, you can use extension methods:
int cnt = myString.TakeWhile(Char.IsDigit).Count();
If there are no digits in the beginning of the string you will naturally get a zero count. Otherwise you have the number of digits.
Instead of just checking IsMatch, get the match so you can get info about it, like the length:
var match = Regex.Match(myString, @"^\d+");
if (match.Success)
{
int count = match.Length;
}
Also, I added a ^ to the beginning of your pattern to limit it to the beginning of the string.
If you break out your code a bit more, you can take advantage of Regex.Match:
var length = 0;
var myString = "123432nonNumeric";
var match = Regex.Match(myString, @"\d+");
if(match.Success)
{
length = match.Value.Length;
}
