Finding whole word only in Java string search - java

I'm running into the problem of finding a searched pattern within a larger pattern in my Java program. For example, I'll try and find all for loops, but will stumble upon formula. Most of the suggestions I've found talk about using regular expression searches like
String regex = "\\b"+keyword+"\\b";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(searchString);
or some variant of this. The issue I'm running into is that I'm crawling through code, not a book-like text where there are spaces on either side of every word. For example, this will miss for(, which I would like to find. Is there another clever way to find whole words only?
Edit: Thanks for the suggestions. How about cases in which there the keyword starts on the first entry of the string? For example,
class Vec {
public:
...
};
where I'm searching for class (or alternatively public). The patterns suggested by Thanga, Austin Lee, npinti, and Kai Iskratsch do not work in this case. Any ideas?

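In your case, the relevant piece is \b, which is a zero-width word-boundary assertion: it matches at a position between a word character (a letter, digit or underscore) and a non-word character, and at the beginning or end of the string when a word character sits there. If an opening bracket next to the keyword still seems to be missed, you can allow for it explicitly.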
The easiest way to fix this would be to replace "\\b"+keyword+"\\b" with "[\\b(]"+keyword+"[\\b)]".
In regex syntax, the square brackets denote a set of which the regex engine will attempt to match any character it contains.
As per this previous SO question, it would seem that \b and [\b] are not the same. Whilst \b represents a word boundary, [\b] represents a backspace character. To fix this, simply replace "\\b"+keyword+"\\b" with "(\\b|\\()"+keyword+"(\\b|\\))".
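As a sanity check, the sketch below runs the plain \b version against the cases from the question. Because \b is a zero-width assertion between a word character and a non-word character (or the edge of the string), it appears to handle both for( and a keyword at the very start of the input. The class and method names are made up for illustration, and Pattern.quote is an added assumption in case the keyword ever contains regex metacharacters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordSearch {
    // Counts whole-word occurrences of keyword in text.
    static int countWholeWord(String keyword, String text) {
        // Pattern.quote is an assumption: it keeps metacharacters in the keyword literal.
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b");
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String code = "class Vec {\npublic:\n  formula = 1;\n  for(int i = 0; i < n; ++i) {}\n};";
        System.out.println(countWholeWord("for", code));    // "for(" counts, "formula" does not
        System.out.println(countWholeWord("class", code));  // keyword at index 0
        System.out.println(countWholeWord("public", code)); // keyword followed by ':'
    }
}
```

Note that \b only helps when the keyword itself starts and ends with a word character, which is the case for for, class and public.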

The regex should allow zero or more characters on either side of the keyword. The following change will fix the issue:
String regex = ".*("+keyword+").*";

You could modify your regex to require non-word characters on either side of the keyword, for example
"[^\\w]+" + keyword + "[^\\w]" using the Pattern class in Java.
For your reference:
https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

Basically you will have to adapt your regex to all the possible patterns it can find. But considering you're actually dealing with code, you are better off building a parser/tokenizer for that language, or using one that already exists. Then all you have to do is run through the tokens to find the ones you want.
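To make the tokenizer idea a bit more concrete, here is a very rough sketch. It is not a real lexer for any language: it ignores string literals and comments, and the token rules (identifier, number-like run, or any single non-space character) are assumptions chosen only for illustration. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeywordTokens {
    // Token = identifier/keyword, a run starting with a digit, or one non-space character.
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z_]\\w*|\\d\\w*|\\S");

    // Returns the start offsets of every token that is exactly equal to keyword.
    static List<Integer> findKeyword(String keyword, String source) {
        List<Integer> offsets = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            if (m.group().equals(keyword)) {
                offsets.add(m.start());
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        String source = "formula = 1; for(i=0;;) {}";
        System.out.println(findKeyword("for", source));
    }
}
```

Because whole tokens are compared with equals, formula can never be mistaken for for, and no boundary handling is needed in the keyword test itself.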

Related

Regex extract string in java

I'm trying to extract a substring from a String using a regex in Java
Pattern pattern = Pattern.compile("((.|\\n)*).{4}InsurerId>\\S*.{5}InsurerId>((.|\\n)*)");
Matcher matcher = pattern.matcher(abc);
I'm trying to extract the value between
<_1:InsurerId>F2021633_V1</_1:InsurerId>
I'm not sure where I am going wrong, but I don't get any output for
if (matcher.find())
{
System.out.println(matcher.group(1));
}
You can use:
Pattern pattern = Pattern.compile("<([^:]+:InsurerId)>([^<]*)</\\1>");
Matcher matcher = pattern.matcher(abc);
if (matcher.find()) {
System.out.println(matcher.group(2));
}
You may want to use the totally awesome page http://regex101.com/ to test your regular expressions. As you can see at https://regex101.com/r/rV8uM3/1, you only have empty capturing groups, but let me explain to you what you did. :D
((.|\n)*) This matches any character, or a new line, any number of times. It is capturing, so your first matching group will always be everything before <_1:InsurerId>, or an empty string. You could write .* instead, but note that in Java the dot only matches line breaks when the DOTALL flag ((?s)) is set. You can even leave it out, as it isn't actually part of the String you want to match - using anything here will actually be a problem if you have multiple InsurerIds in your file and want to get them all.
.{4}InsurerId> This matches "InsurerId>" with any four characters in front of it and is exactly what you want. As the first character is probably always an opening angle bracket (and you don't want stuff like "<ExampleInsurerId>"), I'd suggest using <.{3}InsurerId> instead. This still could have some problems (<Test id="<" xInsurerId>), so if you know exactly that it's "_<a digit>:", why not use <_\d:InsurerId>?
\S* matches everything except for whitespaces - probably not the best idea as XML and similar files can be written to not contain any space at all. You want to have everything to the next tag, so use [^<]* - this matches everything except for an opening angle bracket. You also want to get this value later, so you have to use a capturing group: ([^<]*)
.{5}InsurerId> The same thing here: use <\/.{3}InsurerId> or <\/_\d:InsurerId> (forward slashes are actually characters interpreted by other RegEx implementations, so I suggest escaping them)
((.|\n)*) Again the same thing, just leave it away
The resulting Regular Expression would then be the following:
<_\d:InsurerId>([^<]*)<\/_\d:InsurerId>
And as you can see at https://regex101.com/r/mU6zZ3/1 - you have exactly one match, and it's even "F2021633_V1" :D
For Java, you have to escape the backslashes, so the resulting code would look like this:
Pattern pattern = Pattern.compile("<_\\d:InsurerId>([^<]*)<\\/_\\d:InsurerId>");
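For completeness, a minimal sketch of how that pattern might be used to collect every InsurerId value from an input, looping with find() instead of matching the whole string. The class name, method name and the second sample element are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InsurerIdExtractor {
    // Same expression as above; group 1 is the text between the tags.
    private static final Pattern INSURER_ID =
            Pattern.compile("<_\\d:InsurerId>([^<]*)</_\\d:InsurerId>");

    static List<String> extract(String xml) {
        List<String> ids = new ArrayList<>();
        Matcher matcher = INSURER_ID.matcher(xml);
        while (matcher.find()) {        // find() scans forward, one element per call
            ids.add(matcher.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        String xml = "<_1:InsurerId>F2021633_V1</_1:InsurerId>\n"
                   + "<_2:InsurerId>X9</_2:InsurerId>";
        System.out.println(extract(xml)); // [F2021633_V1, X9]
    }
}
```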
If you are using Java 7 or above, you can use named groups to make the regex a little more readable (also note the \k backreference, which makes the closing tag match the opening tag):
Pattern pattern = Pattern.compile("(?:<(?<InsurancePrefix>.+)InsurerId>)(?<id>[A-Z0-9_]+)</\\k<InsurancePrefix>InsurerId>");
Matcher matcher = pattern.matcher("<_1:InsurerId>F2021633_V1</_1:InsurerId>");
if (matcher.matches()) {
System.out.println(matcher.group("id"));
}
With the backreference, matches() fails, for example, on this text
<_1:InsurerId>F2021633_V1</_2:InsurerId>
which is correct
Javadoc has a good explanation: https://docs.oracle.com/javase/8/docs/api/
You might also consider using a different tool (an XML parser) instead of a regex, since other people have to support your code and a complex regex is usually difficult to understand.
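A rough sketch of the XML-parser route, using the DOM parser that ships with the JDK. It assumes the input is a complete, well-formed document in which the _1 prefix is declared; the root element and the namespace URI urn:example:insurer in the sample are invented for illustration, as are the class and method names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class InsurerIdDom {
    // Returns the text of the first InsurerId element in any namespace, or null.
    static String firstInsurerId(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed so the local name can be matched
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagNameNS("*", "InsurerId");
            return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root xmlns:_1=\"urn:example:insurer\">"
                   + "<_1:InsurerId>F2021633_V1</_1:InsurerId></root>";
        System.out.println(firstInsurerId(xml));
    }
}
```

Matching on the local name means the prefix can change from _1 to anything else without touching the code.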

get the last portion of the link using java regex

I have an ArrayList of links. All links have the same format: abc.([a-z]*)/\\d{4}/
List<String > links= new ArrayList<>();
links.add("abc.com/2012/aa");
links.add("abc.com/2014/dddd");
links.add("abc.in/2012/aa");
I need to get the last portion of every link, i.e. the part after the domain name. The domain name can be anything (.com, .in, .edu etc).
/2012/aa
/2014/dddd
/2012/aa
This is the output I want. How can I get this using regex?
Thanks
Some people, when confronted with a problem, think “I know, I'll use
regular expressions.” Now they have two problems.
(see here for background)
Why use regex? Perhaps a simpler solution is to use String.split("/"), which gives you an array of substrings of the original string, split by /. See this question for more info.
Note that String.split() does in fact take a regex to determine the boundaries upon which to split. However you don't need a regex in this case and a simple character specification is sufficient.
Try the regex below and use a capturing group (the part in parentheses) to pick out the result.
\.[a-zA-Z]{2,3}(/.*)
Pattern description :
dot followed by two or three letters followed by forward slash then any characters
Sample code:
Pattern pattern = Pattern.compile("\\.[a-zA-Z]{2,3}(/.*)");
Matcher matcher = pattern.matcher("abc.com/2012/aa");
if (matcher.find()) {
System.out.println(matcher.group(1));
}
output:
/2012/aa
Note:
You can make it more precise by using \\.[a-zA-Z]{2,3}(/\\d{4}/.*) if there are always 4 digits in the pattern.
String result = s.replaceAll("^[^/]*","");
s would be the string in your list.
Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.
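Why not just use the URI class? Note that the sample links have no scheme, so getPath() would return the whole string unless you add one first:
output = new URI("http://" + link).getPath()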
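A small sketch of the URI approach for the links in the question. Since those links carry no scheme, java.net.URI treats something like abc.com/2012/aa as a relative path, so the example prepends http:// before parsing; that prefix is an assumption about the data, and the class and method names are hypothetical.

```java
import java.net.URI;

class LinkPath {
    static String pathOf(String link) {
        // Assumption: links without "://" are host-first and can be treated as http.
        String withScheme = link.contains("://") ? link : "http://" + link;
        return URI.create(withScheme).getPath();
    }

    public static void main(String[] args) {
        System.out.println(pathOf("abc.com/2012/aa"));   // /2012/aa
        System.out.println(pathOf("abc.in/2014/dddd"));  // /2014/dddd
    }
}
```

URI.create throws an unchecked IllegalArgumentException on malformed input, whereas new URI(...) throws the checked URISyntaxException; which one fits better depends on how the caller wants to handle bad links.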
Try this one and use the second capturing group
(.*?)(/.*)
Use foreach loop to iterate over list.
Use substring and indexOf('/').
FOR EXAMPLE
String s="abc.com/2014/dddd";
System.out.println(s.substring(s.indexOf('/')));
OUTPUT
/2014/dddd
Or you can go for split method.
System.out.println("/" + s.split("/", 2)[1]); // OUTPUT: /2014/dddd (split drops the leading /, so it is added back)
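Putting the loop and the substring call together, a sketch over the list from the question might look like this. The names are hypothetical, and returning an empty string for a link that has no slash at all is an assumption, since the question does not say what should happen there.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LinkTails {
    static List<String> tails(List<String> links) {
        List<String> result = new ArrayList<>();
        for (String link : links) {
            int slash = link.indexOf('/');
            // Assumption: a link without any slash has no path portion.
            result.add(slash >= 0 ? link.substring(slash) : "");
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList("abc.com/2012/aa", "abc.com/2014/dddd", "abc.in/2012/aa");
        for (String tail : tails(links)) {
            System.out.println(tail);
        }
    }
}
```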

Regex which matches a string containing at least the specified characters

I have a huge dictionary which I'm trying to look through using a regex. What I would like to do is to find all the words in the dictionary which contain at least one occurrence of each character I provide, in no particular order.
Right now I can find words which only contain the specified characters but like I said that is not exactly what I want.
Example:
I want at least one occurrence of each of the following characters {b, a, d}
astring.matches(regex)
I would expect words like:
badder,
baddest,
baffled
Notice they all contain at least one occurrence of each character, but in no particular order, and other characters are present in the strings.
Anyone know how to do this? Other suggestions are also welcome!
You need a series of look-aheads:
^(?=.*b)(?=.*a)(?=.*d).*
which is a pain to construct. However, you can ease the pain by using regex to build it:
String regex = "^" + "bad".replaceAll(".", "(?=.*$0)") + ".*";
If you use it repeatedly with String.matches(), you would be better off with the following code, because every call to String.matches() compiles the regex again (there is no caching):
// do this once
Pattern pattern = Pattern.compile(regex);
// reuse the pattern many times
if (pattern.matcher(input).matches())
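Here is one way the pieces above could be combined into a runnable sketch: build the look-ahead pattern once from the required characters, then reuse it to filter a word list. The class and method names are made up, and Pattern.quote around each character is an added precaution for characters that happen to be regex metacharacters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class ContainsAllChars {
    // Builds ^(?=.*c1)(?=.*c2)....* for the characters in required.
    static Pattern build(String required) {
        StringBuilder regex = new StringBuilder("^");
        for (char c : required.toCharArray()) {
            regex.append("(?=.*").append(Pattern.quote(String.valueOf(c))).append(")");
        }
        return Pattern.compile(regex.append(".*").toString());
    }

    static List<String> filter(List<String> words, String required) {
        Pattern pattern = build(required); // compiled once, reused for every word
        List<String> hits = new ArrayList<>();
        for (String word : words) {
            if (pattern.matcher(word).matches()) {
                hits.add(word);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("badder", "baddest", "baffled", "bat");
        System.out.println(filter(words, "bad")); // [badder, baddest, baffled]
    }
}
```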
You can use a lookahead to do this if it's available
(?=.*b)(?=.*a)(?=.*d)
However this is quite inefficient. Any reason you can't use multiple String.indexOf checks?
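The indexOf alternative could be sketched as follows; the names are hypothetical. It simply checks each required character in turn and stops at the first one that is missing.

```java
class IndexOfCheck {
    // True if word contains every character of required at least once, in any order.
    static boolean containsAll(String word, String required) {
        for (int i = 0; i < required.length(); i++) {
            if (word.indexOf(required.charAt(i)) < 0) {
                return false; // this character never occurs in word
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(containsAll("baffled", "bad")); // true
        System.out.println(containsAll("bat", "bad"));     // false
    }
}
```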

How to bound +/* for a regex group?

Say I have the regex:
(CC|NP)*
As such it creates problems in lookbehind regexes in Java. How should I write it to avoid those problems?
I thought of re-writing it as:
(CC|NP){1,9}
Testing on regexr it seems like the upperbound is ignored completely.
In Java those {} quantifiers seem to work only on non-group regex elements, as in:
\w+\[\S{1,9}\]
Sorry, lookbehind patterns usually have restrictions on the subpattern. See, for example, "Why doesn't finite repetition in lookbehind work in some flavors?", or search for "lookbehind pattern restrictions" on the web.
You may try to write down all fixed length variants of the look behind pattern as alternating pattern. But this might be many...
You may also simulate lookbehind by normally matching the inner pattern and match and group your actual target: (?:CC|NP)*(.*)
I'm not sure where you perceive the problem. Quantifiers act on groups just like any other entity.
So, \w+\[\S{1,9}\] could have been written \w+\[(\S){1,9}\] with the same result.
As far as your example on regexr, nothing is broken there. It matches what it's supposed to.
(PUN|CC|NP){1,3} will greedily try to match any of the alternations (in left-to-right priority). There will be no breaks in what it will match. It matches 1-3 consecutive occurrences of PUN or CC or NP.
The sample string you provided had a space between CC's, so since a space does not exist in the regex, it is not matched. The only thing that is matching is a single CC.
If you want to account for a space, it can be added to the grouping like this:
(?:(?:PUN|CC|NP)\s*){1,3}
If you want to only allow spaces between the alternations, it can be done like this:
(?:PUN|CC|NP)(?:\s*(?:PUN|CC|NP)){0,2}
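To see the difference between these variants in Java, a small sketch like the one below can help. The helper name and the sample tag string are made up; the three patterns are the ones discussed above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupQuantifiers {
    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String tags = "CC CC NP PUN";
        // No whitespace allowed in the group, so the match stops at the first space.
        System.out.println("[" + firstMatch("(CC|NP){1,9}", tags) + "]");
        // Trailing whitespace is consumed inside each repetition.
        System.out.println("[" + firstMatch("(?:(?:PUN|CC|NP)\\s*){1,3}", tags) + "]");
        // Whitespace is only allowed between items, so nothing trails the match.
        System.out.println("[" + firstMatch("(?:PUN|CC|NP)(?:\\s*(?:PUN|CC|NP)){0,2}", tags) + "]");
    }
}
```

The brackets in the output make it easy to spot whether a trailing space ended up inside the match.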

Regular expression not extracting the exact pattern

I am working in Java to read a string of over 100000 characters.
I have a list of keywords, that I search the string for, and if the string is present I call a function which does some internal processing.
The kind of keyword I have is "face", for example - I wish to get all the patterns where I have matches for "faces" not "facebook". I can accept a space character behind the face in the string so if in a string I have a match like " face" or " faces" or "face " or " faces" i can accept that too. However I can not accept "duckface" or "duckface " etc.
I have written the regex
Pattern p = Pattern.compile("\\s+"+keyword+"s\\s+|\\s+");
where keyword is taken from my list of keywords, but I am not getting the desired results. Can you read my description and suggest what the issue might be and how I can fix it?
Also if a pointer to a really good regex for Java page is shared I would appreciate that as well.
Thank you, contributors.
Edit
The reason I know it is not working is I have used the following code:
Pattern p = Pattern.compile("\\s+"+keyword+"s\\s+|\\s+");
Matcher m = p.matcher(myInputDataSting);
if(m.find())
{
System.out.println("Its a Match: "+m.group());
}
This returns a blank string...
If keyword is "face", then your current regex is
\s+faces\s+|\s+
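which matches either one or more whitespace characters, followed by faces, followed by one or more whitespace characters, or one or more whitespace characters. (The pipe | has very low precedence.) That second alternative is satisfied by the first run of whitespace in your input, which is why find() reports a blank match.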
What you really want is
\bfaces?\b
which matches a word boundary, followed by face, optionally followed by s, followed by a word boundary.
So, you can write:
Pattern p = Pattern.compile("\\b"+keyword+"s?\\b");
(though obviously this will only work for words like face that form their plurals by simply adding s).
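A short sketch to check that pattern against the cases from the question. The class name, method name and sample sentence are invented, and Pattern.quote is an added assumption in case a keyword contains regex metacharacters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeywordPlural {
    // Finds keyword or keyword + "s" as whole words.
    static List<String> findAll(String keyword, String text) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(keyword) + "s?\\b");
        Matcher m = p.matcher(text);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "a face, two faces, facebook and duckface (face)";
        System.out.println(findAll("face", text)); // [face, faces, face]
    }
}
```

facebook is rejected because there is no boundary between face and b, and duckface is rejected because there is no boundary between k and f.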
You can find a comprehensive listing of Java's regular-expression support at http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html, but it's not much of a tutorial. For that, I'd recommend just Googling "regular expression tutorial", and finding one that suits you. (It doesn't have to be Java-specific: most of the tutorials you'll find are for flavors of regular-expression that are very similar to Java's.)
You should use
Pattern p = Pattern.compile("\\b"+keyword+"s?\\b");
where keyword is not plural. \\b means that the keyword must appear as a complete word in the searched string. s? means that the keyword may be followed by an s.
If you are not familiar enough with regular expressions, I recommend reading http://docs.oracle.com/javase/tutorial/essential/regex/index.html, because it has examples and explanations.
