why pattern/matcher find one match twice - java

I want to find < a > tags in a StringBuilder (result) and insert a word (INSERTED-WORD/) before their href attribute.
code:
Pattern pattern = Pattern.compile("<a [a-zA-Z0-9=\":.;\\s&%_#/\\\\()\\-']*href=['\"]");
Matcher matcher = pattern.matcher(result);
while (matcher.find()) {
int index2 = result.indexOf(matcher.group(0))+ matcher.group(0).length();
result.insert(index2, "INSERTED-WORD/");
}
But some of the tags are found twice (or more), and INSERTED-WORD/ is inserted into their href attribute twice or more.
For example, I want to find this tag:
< a class="link" href="www.example.com">link< /a>
and then change it to
< a class="link" href="INSERTED-WORD/www.example.com">link< /a>
but this code changes it to
< a class="link" href="INSERTED-WORD/INSERTED-WORD/INSERTED-WORD/www.example.com">link< /a>
How can I fix it?

The behavior you see is caused by the use of indexOf. When the same text is matched more than once, indexOf searches for that matched string from the start of result and always returns the index of its first occurrence, so every insertion lands in the same tag.
This is not the only problem with your code. You also modify result while the matcher is using it. Java's Matcher was not designed to deal with that and will not work correctly: an obvious problem is that it will think result is shorter than it actually is, and there might be other problems.
The following will fix your code:
Pattern pattern = Pattern.compile("<a [a-zA-Z0-9=\":.;\\s&%_#/\\\\()\\-']*href=['\"]");
Matcher matcher = pattern.matcher(result.toString()); // Create new String instead of using result
int found = 0;
while (matcher.find()) {
int index2 = matcher.end();
result.insert(index2 + found++ * "INSERTED-WORD/".length(), "INSERTED-WORD/");
}
I will leave it to you to figure out why found is required; run the code without it and see what happens.
Notes
This is not a good way to solve your problem. anubhava offered a much simpler solution in his comment: result = new StringBuilder(result.toString().replaceAll("<a [^>]*?href=\"(?!INSERTED-WORD/)", "$0INSERTED-WORD/"));
The recommended way to parse HTML is with an HTML parser; https://jsoup.org/ is a good one.
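As a rough sketch of the replaceAll approach from the note above: the class and method names below are made up for illustration, and the sample input is the link from the question repeated twice. The negative lookahead (?!INSERTED-WORD/) is what keeps a second pass from adding the prefix again. Note that this pattern only handles double-quoted href values.

```java
import java.util.regex.Pattern;

final class HrefPrefixDemo {
    // Pattern taken from the comment quoted above; the names here are hypothetical.
    private static final Pattern HREF =
            Pattern.compile("<a [^>]*?href=\"(?!INSERTED-WORD/)");

    static String addPrefix(String html) {
        // $0 is the whole match, so the prefix lands right after href="
        return HREF.matcher(html).replaceAll("$0INSERTED-WORD/");
    }

    public static void main(String[] args) {
        String link = "<a class=\"link\" href=\"www.example.com\">link</a>";
        StringBuilder result = new StringBuilder(link + link);
        // Work on an immutable String copy, then wrap the output again.
        result = new StringBuilder(addPrefix(result.toString()));
        System.out.println(result);
    }
}
```

Because the replacement is computed on a String copy in a single pass, there is no index bookkeeping and no matcher reading a buffer that is being modified.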

Related

Pattern Matching inside brackets with % symbol

I am a newbie to Java and have been trying to pattern match some data inside a TD tag and brackets with a percentage symbol, but for the life of me cannot get it to work.
I am sure it is very simple. I just want to extract the number before the % symbol in here:
<td>0 items (0%)</td>
I have tried quite a number of suggestions but none seem to work.
linecache = readercache.readLine();
System.out.println(linecache);
Pattern patterncf1 = Pattern.compile("\\((.*?)\\)");
tried
Pattern patterncf1 = Pattern.compile("<td>\\d+ \\w+ \\((\\d+)?%\\)</td>");
tried
Pattern patterncf1 = Pattern.compile("<td>\\((\\d+)?%\\)</td>");
tried
Pattern patterncf1 = Pattern.compile("\\((\\d+)?%\\)");
but am always getting
<td>0 items (0%)</td>
Exception in thread "Thread-0" java.lang.IllegalStateException: No match found
I also tried the suggestion below but still erroring out and I would assume that this is the right group in this case.
linecache = readercache.readLine();
System.out.println(linecache);
String pattern = "\\d+(?=%)";
Pattern patterncf1 = Pattern.compile(pattern)
Matcher matchercf1 = patterncf1.matcher(linecache);
String passedvalue = matchercf1.group(1);
System.out.println(passedvalue);
This part in a different section of code works fine.
Pattern patternmb1 = Pattern.compile("<td>(.+?) GB</td>");
Matcher matchermb1 = patternmb1.matcher(line);
if (matchermb1.find()) {
String passedvalue = matchermb1.group(1);
String[] tmpStr = passedvalue.split("\\.") ;
String withoutDecStr = tmpStr[0];
Float passedvalue2 = Float.valueOf(withoutDecStr);
System.out.println("MIU: " + passedvalue2);
JVMinusearray.add(passedvalue2);
I would appreciate if someone could offer some advice please.
Thanks
You can use the following:
Pattern pattern = Pattern.compile("<td>.*\\((\\d+)%\\)</td>");
Matcher matcher = pattern.matcher("<td>0 items (2000%)</td>");
if(matcher.matches()) {
System.out.println(matcher.group(1));
}
This prints the number that precedes the %.
If you only want to extract the digits before the %, the following will match them:
(\\d+(?=%))
Edit:
From your comment, I understand that the problem is identifying the correct group to pick. The value you want is in group 1 of this regex, so the pattern needs a capturing group, and find() has to succeed before group(1) can be read.
linecache = readercache.readLine();
System.out.println(linecache);
String pattern = "(\\d+(?=%))"; // the parentheses create group 1
Pattern patterncf1 = Pattern.compile(pattern);
Matcher matchercf1 = patterncf1.matcher(linecache);
if (matchercf1.find()) { // group(1) is only valid after a successful find()
String passedvalue = matchercf1.group(1);
System.out.println(passedvalue);
}
Thanks for your help. It seems to work with a static string of text but not from the reading in of the data from the html file, so I will take this offline and see what's going on, but both suggestions have worked fine.
Thank you for your time. I appreciate it.
Regards,
Paul

Look for a certain String inside another and count how many times it appears

I am trying to search for a String inside a file content which I got into a String.
I've tried to use Pattern and Matcher, which worked for this case:
Pattern p = Pattern.compile("(</machine>)");
Matcher m = p.matcher(text);
while(m.find()) //if the text "(</machine>)" was found, enter
{
Counter++;
}
return Counter;
Then, I tried to use the same code to find how many tags I have:
Pattern tagsP = Pattern.compile("(</");
Matcher tagsM = tagsP.matcher(text);
while(tagsM.find()) //if the text "(</" was found, enter
{
CounterTags++;
}
return CounterTags;
which in this case, the return value was always 0.
Try the code below, which counts with split instead of a Matcher loop:
String actualString = "hello hi how(</machine>) are you doing. Again hi (</machine>) friend (</machine>) hope you are (</machine>)doing good.";
// actualString is what you get from the file content
String toMatch = Pattern.quote("(</machine>)"); // convert to a regex literal
int count = actualString.split(toMatch, -1).length - 1; // splitting on toMatch gives one more piece than there are matches
System.out.println(count);
Output: 4
You can use the Apache commons-lang utility library, which has a countMatches function for exactly this:
int count = StringUtils.countMatches(text, "substring");
This function is also null-safe.
I recommend exploring the Apache Commons libraries; they provide a lot of useful common utility methods.
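If you want to stay with Pattern and Matcher, one thing worth checking is that "(</" is not a valid regular expression as written, because the unescaped ( opens a group that is never closed. Why the original method returned 0 cannot be told from the code shown, so take this only as a sketch of counting a literal string with Pattern.quote. The class and method names are hypothetical and the sample text is made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class LiteralCounter {
    // Counts non-overlapping occurrences of a literal string (hypothetical helper).
    static int countLiteral(String text, String literal) {
        // Pattern.quote makes every character in the literal match itself.
        Matcher m = Pattern.compile(Pattern.quote(literal)).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Returns true if the string is rejected as a regular expression.
    static boolean isInvalidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String text = "<machine>(</machine>)<x>(</x>)"; // made-up sample
        System.out.println(isInvalidRegex("(</"));     // prints true
        System.out.println(countLiteral(text, "(</")); // prints 2
    }
}
```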

Discard the leading and trailing series of a character, but retain the same character otherwise

I have to process a string with the following rules:
It may or may not start with a series of '.
It may or may not end with a series of '.
Whatever is enclosed between the above should be extracted. However, the enclosed string also may or may not contain a series of '.
For example, I can get following strings as input:
''''aa''''
''''aa
aa''''
''''aa''bb''cc''''
For the above examples, I would like to extract the following from them (respectively):
aa
aa
aa
aa''bb''cc
I tried the following code in Java:
Pattern p = Pattern.compile("[^']+(.+'*.+)[^']*");
Matcher m = p.matcher("''''aa''bb''cc''''");
while (m.find()) {
int count = m.groupCount();
System.out.println("count = " + count);
for (int i = 0; i <= count; i++) {
System.out.println("-> " + m.group(i));
}
}
But I get the following output:
count = 1
-> aa''bb''cc''''
-> ''bb''cc''''
Any pointers?
EDIT: Never mind, I was using a * at the end of my regex, instead of +. Doing this change gives me the desired output. But I would still welcome any improvements for the regex.
This one works for me.
String str = "''''aa''bb''cc''''";
Pattern p = Pattern.compile("^'*(.*?)'*$");
Matcher m = p.matcher(str);
if (m.find()) {
System.out.println(m.group(1));
}
Have a look at the boundary matchers of Java's Pattern class (http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html). Especially $ (end of a line) might be interesting. I also recommend the following Eclipse plugin for regex testing: http://sourceforge.net/projects/quickrex/. It lets you see exactly what the match and the groups of your regex will be for a given test string.
E.g. try the following pattern: [^']+(.+'*.+)+[^'$]
I'm not that good in Java, so I hope the regex is sufficient. For your examples, it works well.
s/^'*(.+?)'*$/$1/gm
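To check the anchored pattern ^'*(.*?)'*$ from the answer above against all four inputs in the question, a small sketch like the following can help. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuoteTrimmer {
    // Leading quotes are consumed greedily, the lazy group stops as soon as
    // the rest of the input is nothing but quotes.
    private static final Pattern TRIM = Pattern.compile("^'*(.*?)'*$");

    static String strip(String input) {
        Matcher m = TRIM.matcher(input);
        // Fall back to the input unchanged if nothing matches.
        return m.find() ? m.group(1) : input;
    }

    public static void main(String[] args) {
        String[] inputs = {"''''aa''''", "''''aa", "aa''''", "''''aa''bb''cc''''"};
        for (String s : inputs) {
            System.out.println(strip(s));
        }
    }
}
```

The quotes in the middle of the last input survive because the lazy group keeps growing until '*$ can match everything that is left.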

Java Find word in a String

I need to find a word in a HTML source code. Also I need to count occurrence. I am trying to use regular expression. But it says 0 match found.
I am using a regular expression as I thought it's the best way. If there is a better way, please let me know.
I need to find the occurrence of the word "hsw.ads" in HTML source code.
I have taken following steps.
int count = 0;
{
Pattern p = Pattern.compile(".*(hsw.ads).*");
Matcher m = p.matcher(SourceCode);
while(m.find())count++;
}
But the count is 0;
Please let me know your solutions.
Thank you.
Help Seeker
You are not matching any "expression", so probably a simple string search would be better. commons-lang has StringUtils.countMatches(source, "yourword").
If you don't want to include commons-lang, you can write that manually. Simply call source.indexOf("yourword", x) repeatedly, each time supplying a greater value of x (which is the offset), until it returns -1.
You should try this.
private int getWordCount(String word,String source){
int count = 0;
{
Pattern p = Pattern.compile(Pattern.quote(word)); // quote so the word is matched literally
Matcher m = p.matcher(source);
while(m.find()) count++;
}
return count;
}
Pass the word (Not pattern) you want to search in a string.
To find a string in Java you can use the String method indexOf, which gives you the index of the first character of the string you searched for. To find all occurrences and count them you can do this (there might be a faster way but this should work). I would still recommend using the StringUtils.countMatches method.
String temp = string; // copy of the string to search
int count = 0;
String a = "hsw.ads";
int i = 0;
while (temp.indexOf(a, i) != -1) {
count++;
i = temp.indexOf(a, i) + a.length(); // continue right after this match
}
StringUtils.countMatches(SourceCode, "hsw.ads") ought to work, however sticking with the approach you have above (which is valid), I'd recommend a few things:
1. As John Haager mentioned, removing the opening/closing .* will help, because you're looking for that exact substring
2. You want to escape the '.' because you're searching for a literal '.' and not a wildcard
3. I would make this Pattern a constant and re-use it rather than re-creating it each time.
That said, I'd still suggest using the approaches above, but I thought I'd just point out your current approach isn't conceptually flawed; just a few implementation details missing.
Your code and regular expression are valid, but you don't need to include the .* at the beginning and the end of your regex. For example:
String t = "hsw.ads hsw.ads hsw.ads";
int count = 0;
Matcher m = Pattern.compile("hsw\\.ads").matcher(t);
while (m.find()){ count++; }
In this case, count is 3. And another thing, if you're going to use a regex, if you REALLY want to specifically look for a '.' period between hsw and ads, you need to escape it.
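The difference those details make can be seen in a small sketch. The sample text is made up, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DotEscapeDemo {
    // Counts how many times find() succeeds for the given regex.
    static int count(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "hsw.ads hswXads hsw-ads"; // made-up sample
        System.out.println(count("hsw.ads", text));       // prints 3: '.' matches any character
        System.out.println(count("hsw\\.ads", text));     // prints 1: only the literal dot
        System.out.println(count(".*(hsw.ads).*", text)); // prints 1: the first match swallows the whole line
    }
}
```

With the surrounding .* the first find() consumes the entire line, so the count can never exceed one per line no matter how often the word appears.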

Java regex matches only if matching is checked

Inside of a class I have a pattern private Pattern lossWer = Pattern.compile("^\\d+ \\d+ (\\d+).*"). One of the functions looks like this:
public double[] getWer(){
double[] wer = new double[someStrings.size()];
Matcher m;
for(int i = 0; i < wer.length; i++){
m = lossWer.matcher(someStrings.get(i));
wer[i] = Double.parseDouble(m.group(1));
}
return wer;
}
Calling this fails with java.lang.IllegalStateException: No match found. When I change it to this, though, it works:
public double[] getWer(){
double[] wer = new double[someStrings.size()];
Matcher m;
for(int i = 0; i < wer.length; i++){
m = lossWer.matcher(someStrings.get(i));
if(!m.matches())
;
wer[i] = Double.parseDouble(m.group(1));
}
return wer;
}
Of course my application doesn't just use a blank semi-colon for that line, but I'm illustrating that the line here does nothing but allows the program to proceed without error. Why are lines matched without errors in the second example but not in the first?
You can't use group() without first calling either find() or matches(). That's how regexes work. First you create a pattern, then a matcher, then you either find() instances of the regex or check if it matches().
The Matcher documentation says this about the IllegalStateException:
The explicit state of a matcher is initially undefined; attempting to
query any part of it before a successful match will cause an
IllegalStateException to be thrown. The explicit state of a matcher is
recomputed by every match operation.
This combined with Ryan's answer should give you what you need.
Until you call m.matches() you haven't tested the regex so there are no groups.
That line tests the string against the regex. If there is no match it does nothing; either way you then go on to read group(1), and since there was a match with a group, it works.
It would be best to change:
if(!m.matches())
;
wer[i] = Double.parseDouble(m.group(1));
To:
if(m.matches())
wer[i] = Double.parseDouble(m.group(1));
Or use !m.matches() to return an error or something. Your choice :)
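A minimal sketch of the "return an error" option, assuming the strings arrive as a List<String>. The class name, the parameter, and the choice of IllegalArgumentException are illustrative and not taken from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WerParser {
    // Same regex as the lossWer pattern in the question.
    private static final Pattern LOSS_WER = Pattern.compile("^\\d+ \\d+ (\\d+).*");

    static double[] getWer(List<String> lines) {
        double[] wer = new double[lines.size()];
        for (int i = 0; i < wer.length; i++) {
            Matcher m = LOSS_WER.matcher(lines.get(i));
            // matches() performs the match; group(1) is only defined if it succeeded.
            if (!m.matches()) {
                throw new IllegalArgumentException("Unexpected line: " + lines.get(i));
            }
            wer[i] = Double.parseDouble(m.group(1));
        }
        return wer;
    }

    public static void main(String[] args) {
        double[] wer = getWer(Arrays.asList("1 2 30 rest", "4 5 6"));
        System.out.println(wer[0] + " " + wer[1]); // prints 30.0 6.0
    }
}
```

Failing loudly on a line that does not match avoids silently leaving a 0.0 in the array, which is what the if(m.matches()) variant would do.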
