Search substring in a string using regex - java

I'm trying to search for a set of words, contained in an ArrayList (terms_1pers), inside a string. Since the precondition is that there should be no letters directly before or after the search word, I thought of using a regular expression.
I just don't know what I'm doing wrong with the matches method. In the code below, if no match is found, the record is written to an external file.
String url = csvRecord.get("url");
String text = csvRecord.get("review");
String var = null;
for(String term : terms_1pers)
{
if(!text.matches("[^a-z]"+term+"[^a-z]"))
{
var="true";
}
}
if(!var.equals("true"))
{
bw.write(url+";"+text+"\n");
}

In order to find regex matches, you should use the regex classes Pattern and Matcher.
String term = "term";
ArrayList<String> a = new ArrayList<String>();
a.add("123term456"); //true
a.add("A123Term5"); //false
a.add("term456"); //true
a.add("123term"); //true
Pattern p = Pattern.compile("^[^A-Za-z]*(" + term + ")[^A-Za-z]*$");
for(String text : a) {
Matcher m = p.matcher(text);
if (m.find()) {
System.out.println("Found: " + m.group(1) );
//group(0) is the whole match; the term is inside the first pair of parentheses, so it is group(1)
}
else System.out.println("No match for: " + term);
}
In the example above, we create an instance of Pattern (https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html) and use it to find matches in the text you are matching against.
Note that I adjusted the regex a bit. The character class in this code excludes all letters A-Z and their lowercase versions from the parts before and after the term. It also allows for situations where there are no characters at all before or after the term; if you need at least one character there, use + instead of *. I also used ^ and $ to anchor the pattern to the start and end of the text, so the whole text must consist of only these three parts. If this doesn't fit your use case, you may need to adjust.
To demonstrate using this with a variety of different terms:
ArrayList<String> terms = new ArrayList<String>();
terms.add("term");
terms.add("the book is on the table");
terms.add("1981 was the best year ever!");
ArrayList<String> a = new ArrayList<String>();
a.add("123term456");
a.add("A123Term5");
a.add("the book is on the table456");
a.add("1##!231981 was the best year ever!9#");
for (String term: terms) {
Pattern p = Pattern.compile("^[^A-Za-z]*(" + term + ")[^A-Za-z]*$");
for(String text : a) {
Matcher m = p.matcher(text);
if (m.find()) {
System.out.println("Found: " + m.group(1) + " in " + text);
//group(0) is the whole match; the term is inside the first pair of parentheses, so it is group(1)
}
else System.out.println("No match for: " + term + " in " + text);
}
}
Output for this is:
Found: term in 123term456
No match for: term in A123Term5
No match for: term in the book is on the table456....
In response to the question about making String term case insensitive, here's a way to build the regex string by using java.lang.Character to generate both the upper-case and lower-case option for each letter.
String term = "This iS the teRm.";
String matchText = "123This is the term.";
StringBuilder str = new StringBuilder();
str.append("^[^A-Za-z]*(");
for (int i = 0; i < term.length(); i++) {
char c = term.charAt(i);
if (Character.isLetter(c))
str.append("(" + Character.toLowerCase(c) + "|" + Character.toUpperCase(c) + ")");
else str.append(c);
}
str.append(")[^A-Za-z]*$");
System.out.println(str.toString());
Pattern p = Pattern.compile(str.toString());
Matcher m = p.matcher(matchText);
if (m.find()) System.out.println("Found!");
else System.out.println("Not Found!");
This code prints two lines. The first is the regex string that is compiled into the Pattern: "^[^A-Za-z]*((t|T)(h|H)(i|I)(s|S) (i|I)(s|S) (t|T)(h|H)(e|E) (t|T)(e|E)(r|R)(m|M).)[^A-Za-z]*$". This adjusted regex matches the letters of the term regardless of case. Note that the final . is not escaped, so it matches any character, not only a literal period. The second line is "Found!" because the mixed-case term is found within matchText.
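As an alternative to building the (t|T) alternations by hand, the same check can probably be written with the Pattern.CASE_INSENSITIVE flag. The following is a minimal sketch, not the original answer's method: the class name CaseInsensitiveTerm and the helper containsTerm are made up for illustration, and Pattern.quote is added on the assumption that the term should be treated as literal text.

```java
import java.util.regex.Pattern;

class CaseInsensitiveTerm {
    // Hypothetical helper: true if text is the term (ignoring case)
    // with only non-letter characters before and after it.
    static boolean containsTerm(String text, String term) {
        Pattern p = Pattern.compile(
                "^[^A-Za-z]*(" + Pattern.quote(term) + ")[^A-Za-z]*$",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String term = "This iS the teRm.";
        System.out.println(containsTerm("123This is the term.", term));  // true
        System.out.println(containsTerm("A123This is the term.", term)); // false: a letter precedes the term
    }
}
```

Because the term is quoted, the trailing period here only matches a literal period.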

There are several things to note:
matches requires a full string match, so [^a-z]term[^a-z] will only match a string like ":term.". You need to use .find() to find partial matches.
If you pass a literal string to a regex, you need to Pattern.quote it; otherwise, if it contains special characters, it will not be matched correctly.
To check that a word is not adjacent to some pattern, including at the start or end of the string, you should either use alternations with anchors (like (?:^|[^a-z]) or (?:$|[^a-z])) or lookarounds, (?<![a-z]) and (?![a-z]).
To match any letter just use \p{Alpha} or, if you plan to match any Unicode letter, \p{L}.
It is more logical to declare var as a Boolean.
Fixed code:
String url = csvRecord.get("url");
String text = csvRecord.get("review");
Boolean var = false;
for(String term : terms_1pers)
{
Matcher m = Pattern.compile("(?<!\\p{L})" + Pattern.quote(term) + "(?!\\p{L})").matcher(text);
// If the search must be case insensitive use
// Matcher m = Pattern.compile("(?i)(?<!\\p{L})" + Pattern.quote(term) + "(?!\\p{L})").matcher(text);
if(!m.find())
{
var = true;
}
}
if (!var) {
bw.write(url+";"+text+"\n");
}
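To try the lookaround idea in isolation, here is a small self-contained sketch. The class name WholeTermSearch, the helper containsWholeTerm and the sample strings are assumptions made for illustration; only the regex construction is taken from the fixed code above.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class WholeTermSearch {
    // True if term occurs in text with no Unicode letter directly before or after it.
    static boolean containsWholeTerm(String text, String term) {
        Pattern p = Pattern.compile("(?<!\\p{L})" + Pattern.quote(term) + "(?!\\p{L})");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        // Made-up sample inputs.
        for (String text : Arrays.asList("I love it", "Iris", "what if I")) {
            System.out.println(text + " -> " + containsWholeTerm(text, "I"));
        }
        // Pattern.quote keeps regex metacharacters in the term literal.
        System.out.println(containsWholeTerm("a+b = c", "a+b"));
    }
}
```

The lookbehind and lookahead do not consume characters, so the term is also found at the very start or end of the text.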

You did not consider the case where the text before and after the term may itself contain letters, so adding .* at the front and end should solve your problem.
for(String term : terms_1pers)
{
if( text.matches(".*[^a-zA-Z]+" + term + "[^a-zA-Z]+.*") )
{
var="true";
break; //exit the loop
}
}
if(!"true".equals(var)) //null-safe: var is still null when no term matched
{
bw.write(url+";"+text+"\n");
}

Related

Java Regex expression not working

I have a problem with a regex that is not working. I don't know what I am doing wrong. My code:
String test = "timetable:xxxxxtimetable:; timetable: fullihhghtO;";
Pattern p = Pattern.compile("\\btimetable:(.*);");
//also tried "timetable:(.*);" and "(\\btimetable:)(.*)(;)"
Matcher m = p.matcher(test);
while(m.find()) {
System.out.println("S:" + m.start() + ", E:" + m.end());
System.out.println("x: "+ test.substring(m.start(), m.end()));
}
Expected result:
(1) "timetable:xxxxxtimetable:"
(2) "timetable: fullihhghtO"
Thanks for any help.
A non-capturing group could be handy in our case:
String test = "timetable:xxxxxtimetable:; timetable: fullihhghtO;";
Pattern p = Pattern.compile("(?:\\btimetable:(.*?);)+"); // <-- here
Matcher m = p.matcher(test);
int i = 1;
while (m.find()) {
System.out.println(i + ") "+ m.group(1));
i++;
}
OUTPUT
1) xxxxxtimetable:
2) fullihhghtO
Regex explained:
(?:\\btimetable:(.*?);)+ By using the non-capturing group (?:...) we consume "timetable:" and the closing ; without capturing them, while the capturing group (.*?), which is group 1, captures what we want (everything between \btimetable: and ;). Pay special attention to the non-greedy quantifier .*?, which consumes the minimum possible number of characters up to the next ;. Without this lazy form, the regex uses the default "greedy" mode and consumes all the characters up to the last ; in the string!
Now, all of that is relevant if you want to capture only the unique part. If you want to capture the whole thing:
1) timetable:xxxxxtimetable:;
2) timetable: fullihhghtO;
It can be done easily by modifying the line with the regex to:
Pattern p = Pattern.compile("\\b(timetable:.*?;)+");
which is even simpler: only one capturing group (see that we still have to use the non-greedy mode!).
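The difference between the greedy and the lazy quantifier can be checked directly. This is a minimal sketch; the class name GreedyVsLazy and the helper allMatches are hypothetical, and the patterns are simplified versions of the ones discussed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Collects every full match of regex in input.
    static List<String> allMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String test = "timetable:xxxxxtimetable:; timetable: fullihhghtO;";
        // Greedy: .* runs to the last ';', so the whole string is one match.
        System.out.println(allMatches("\\btimetable:.*;", test));
        // Lazy: .*? stops at the first ';', so there are two matches.
        System.out.println(allMatches("\\btimetable:.*?;", test));
    }
}
```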
You don't need to use regex; a simple split would do it:
public static void main(String[] args) throws IOException {
String test = "timetable:xxxxxtimetable:; timetable: fullihhghtO;";
String[] array = test.split(";");
String str1 = array[0].trim();
String str2 = array[1].trim();
System.out.println(str1 + "\n" + str2); //timetable:xxxxxtimetable:
//timetable: fullihhghtO
}

Regular Expression - Match String Pattern

I want to print out the position of the second occurrence of zip in text, or -1 if it does not occur at least twice.
public class UdaciousSecondOccurence {
String text = "all zip files are zipped";
String text1 = "all zip files are compressed";
String REGEX = "zip{2}"; // atleast two occurences
protected void matchPattern1(){
Pattern p = Pattern.compile(REGEX);
Matcher m = p.matcher(text);
while(m.find()){
System.out.println("start index p" +m.start());
System.out.println("end index p" +m.end());
// System.out.println("Found a " + m.group() + ".");
}
output for matchPattern1()
start index p18
end index p22
But it does not print anything for text1 (I have used a similar method for the second text).
text1 does not match the regex zip{2}, therefore the while loop never iterates because there are no matches.
The expression is attempting to match the literal zipp, which is contained in text but not text1.
If you want to match the second occurrence, I would recommend using a capture group: .*zip.*?(zip). Note that because the leading .* is greedy, this captures the last occurrence when the text contains more than two; zip.*?(zip) captures the second.
Example
String text = "all zip files are zip";
String text1 = "all zip files are compressed";
String REGEX = ".*zip.*?(zip)";
Pattern p = Pattern.compile(REGEX);
Matcher m = p.matcher(text);
if(m.find()){
System.out.println("start index p" + m.start(1));
System.out.println("end index p" + m.end(1));
}else{
System.out.println("Match not found");
}
Use the code below; it may work for you:
public class UdaciousSecondOccurence {
String text = "all zip files are zipped";
String text1 = "all zip files are compressed";
String REGEX = "zip{2}"; // atleast two occurences
protected void matchPattern1(){
Pattern p = Pattern.compile(REGEX);
Matcher m = p.matcher(text);
if(m.find()){
System.out.println("start index p" +m.start());
System.out.println("end index p" +m.end());
// System.out.println("Found a " + m.group() + ".");
}else{
System.out.println("-1");
}
}
public static void main(String[] args) {
UdaciousSecondOccurence uso = new UdaciousSecondOccurence();
uso.matchPattern1();
}
}
If it must match twice, rather than using a while loop I would code it like this using regex "zip" (once, not twice):
if (m.find() && m.find()) {
// found twice, Matcher at 2nd match
} else {
// not found twice
}
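The find-twice idea can be wrapped into a small method that returns the start index of the second match, or -1. This is only a sketch: the class name SecondOccurrence and the method name are made up for illustration, and Pattern.quote is added on the assumption that the search word is literal text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SecondOccurrence {
    // Start index of the second non-overlapping occurrence of word in text, or -1.
    static int secondOccurrence(String text, String word) {
        Matcher m = Pattern.compile(Pattern.quote(word)).matcher(text);
        // After two successful find() calls the matcher is positioned on the second match.
        return (m.find() && m.find()) ? m.start() : -1;
    }

    public static void main(String[] args) {
        System.out.println(secondOccurrence("all zip files are zipped", "zip"));     // 18
        System.out.println(secondOccurrence("all zip files are compressed", "zip")); // -1
    }
}
```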
p.s. text1 doesn't have two zips
zip{2} matches the string zipp: the {2} applies only to the element immediately preceding it, 'p'.
That is not what you want.
You probably just want to use zip as your regex, and leave the counting of occurrences to the code around it.
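To see what the quantifier binds to, compare zip{2} with the grouped form (zip){2}. A short sketch with a hypothetical helper found; note that even the grouped form only matches two adjacent copies, which is why counting in the surrounding code is the better fit here.

```java
import java.util.regex.Pattern;

class QuantifierScope {
    // True if regex matches anywhere in input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("zip{2}", "zipped"));    // true:  {2} repeats only the 'p'
        System.out.println(found("zip{2}", "zip zip"));   // false: there is no "zipp"
        System.out.println(found("(zip){2}", "zipzip"));  // true:  the group is repeated
        System.out.println(found("(zip){2}", "zip zip")); // false: the copies must be adjacent
    }
}
```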
Why don't you just use String.indexOf twice?
String text = "all zip files are zipped";
String text1 = "all zip files are compressed";
int firstOccurrence = text.indexOf("zip");
int secondOccurrence = text.indexOf("zip", firstOccurrence + 1);
System.out.println(secondOccurrence);
firstOccurrence = text1.indexOf("zip");
secondOccurrence = text1.indexOf("zip", firstOccurrence + 1);
System.out.println(secondOccurrence);
Output
18
-1
The second time, the statements inside while(m.find()) are never executed, because find() cannot find any match.
You need to match one or two p characters. Try the regex zip{1,2}, which matches zip or zipp:
String REGEX = "zip{1,2}";
There could be two reasons:
1st: text1 doesn't contain two occurrences of 'zip'.
2nd: You need to add code that prints '-1' when no match is found, e.g. if m.find() returns true then print the index,
else print -1.

Java Regex finding operators

I'm trying to use regex to get numbers and operators from a string containing an expression. It finds the numbers but it doesn't find the operators. After every match (number or operator) at the beginning of the string, it truncates the expression in order to find the next one.
String expression = "23*12+11";
Pattern intPattern;
Pattern opPattern;
Matcher intMatch;
Matcher opMatch;
intPattern = Pattern.compile("^\\d+");
intMatch = intPattern.matcher(expression);
opPattern = Pattern.compile("^[-+*/()]+");
opMatch = opPattern.matcher(expression);
while ( ! expression.isEmpty()) {
System.out.println("New expression: " + expression);
if (intMatch.find()) {
String inputInt = intMatch.group();
System.out.println(inputInt);
System.out.println("Found at index: " + intMatch.start());
expression = expression.substring(intMatch.end());
intMatch = intPattern.matcher(expression);
System.out.println("Truncated expression: " + expression);
} else if (opMatch.find()) {
String nextOp = opMatch.group();
System.out.println(nextOp);
System.out.println("Found at index: " + opMatch.start());
System.out.println("End index: " + opMatch.end());
expression = expression.substring(opMatch.end());
opMatch = opPattern.matcher(expression);
System.out.println("Truncated expression: " + expression);
} else {
System.out.println("Last item: " + expression);
break;
}
}
The output is
New expression: 23*12+11
23
Found at index: 0
Truncated expression: *12+11
New expression: *12+11
Last item: *12+11
As far as I have been able to investigate there is no need to escape the special characters *, + since they are inside a character class. What's the problem here?
First, your debugging output is confusing, because it's exactly the same in both branches. Add something to distinguish them, such as an a and b prefix:
System.out.println("a.Found at index: " + intMatch.start());
Your problem is that you're not resetting both matchers to the updated string. At the end of both branches in your if-else (or just once, after the entire if-else block), you need to do this:
intMatch = intPattern.matcher(expression);
opMatch = opPattern.matcher(expression);
One last thing: since you're creating a new matcher over and over again via Pattern.matcher(s), you might want to consider creating the matcher only once, with a dummy string, at the top of your code:
//"": Unused string so matcher object can be reused
intMatch = Pattern.compile(...).matcher("");
and then resetting it in each loop iteration
intMatch.reset(expression);
You can implement the reusable Matchers like this:
//"": Unused to-search strings, so the matcher objects can be reused.
Matcher intMatch = Pattern.compile("^\\d+").matcher("");
Matcher opMatch = Pattern.compile("^[-+*/()]+").matcher("");
String expression = "23*12+11";
while ( ! expression.isEmpty()) {
System.out.println("New expression: " + expression);
intMatch.reset(expression);
opMatch.reset(expression);
if(intMatch.find()) {
...
The
Pattern *Pattern = ...
lines can be removed from the top, and the
*Match = *Pattern.matcher(expression)
lines can be removed from both if-else branches.
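Put together, the reusable-matcher version might look like the sketch below. It collects the tokens in a list instead of printing debug output; the class name ExpressionTokenizer and the method tokenize are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExpressionTokenizer {
    static List<String> tokenize(String expression) {
        // "": placeholder input so the matcher objects are created once and reused.
        Matcher intMatch = Pattern.compile("^\\d+").matcher("");
        Matcher opMatch = Pattern.compile("^[-+*/()]+").matcher("");
        List<String> tokens = new ArrayList<>();
        while (!expression.isEmpty()) {
            // Point BOTH matchers at the current (truncated) expression.
            intMatch.reset(expression);
            opMatch.reset(expression);
            if (intMatch.find()) {
                tokens.add(intMatch.group());
                expression = expression.substring(intMatch.end());
            } else if (opMatch.find()) {
                tokens.add(opMatch.group());
                expression = expression.substring(opMatch.end());
            } else {
                // Neither pattern matches at the start: keep the rest and stop.
                tokens.add(expression);
                break;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("23*12+11")); // [23, *, 12, +, 11]
    }
}
```

Because the operator pattern ends in +, consecutive operators such as "*(" come back as a single token; drop the + if each operator should be its own token.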
Your main problem is that when you find an int or an operator, you reassign only intMatch or only opMatch. So after an int is found, the operator matcher still tries to match against the old version of expression. You need to place these lines in both of your positive cases:
intMatch = intPattern.matcher(expression);
opMatch = opPattern.matcher(expression);
But maybe, instead of using two Patterns and repeatedly truncating expression, you could use one regex that finds ints or operators and places them in different groups? I mean something like this:
String expression = "23*12+11";
Pattern p = Pattern.compile("(\\d+)|([-+*/()]+)");
Matcher m = p.matcher(expression);
while (m.find()){
if (m.group(1)==null){//group 1 is null so match must come from group 2
System.out.println("operator found: "+m.group(2));
}else{
System.out.println("integer found: "+m.group(1));
}
}
Also, if you don't need to handle integers and operators separately, you can just split at the positions before and after operators using look-around mechanisms:
String expression = "23*12+11";
for (String s : expression.split("(?<=[-+*/()])|(?=[-+*/()])"))
System.out.println(s);
Output:
23
*
12
+
11
Try this one.
Note: you have missed the modulus operator %.
String expression = "2/3*1%(2+11)";
Pattern pt = Pattern.compile("[-+*/()%]");
Matcher mt = pt.matcher(expression);
int lastStart = 0;
while (mt.find()) {
if (lastStart != mt.start()) {
System.out.println("number:" + expression.substring(lastStart, mt.start()));
}
lastStart = mt.start() + 1;
System.out.println("operator:" + mt.group());
}
if (lastStart != expression.length()) {
System.out.println("number:" + expression.substring(lastStart));
}
output
number:2
operator:/
number:3
operator:*
number:1
operator:%
operator:(
number:2
operator:+
number:11
operator:)

Extract every complete word that contains a certain substring

I'm trying to write a function that extracts each word from a sentence that contains a certain substring e.g. Looking for 'Po' in 'Porky Pork Chop' will return Porky Pork.
I've tested my regex on regexpal but the Java code doesn't seem to work. What am I doing wrong?
private static String foo()
{
String searchTerm = "Pizza";
String text = "Cheese Pizza";
String sPattern = "(?i)\b("+searchTerm+"(.+?)?)\b";
Pattern pattern = Pattern.compile ( sPattern );
Matcher matcher = pattern.matcher ( text );
if(matcher.find ())
{
String result = "-";
for(int i=0;i < matcher.groupCount ();i++)
{
result+= matcher.group ( i ) + " ";
}
return result.trim ();
}else
{
System.out.println("No Luck");
}
}
In Java, to pass the \b word boundary to the regex engine you need to write it as \\b in a string literal; a plain \b in a string literal is the backspace character.
Judging by your example, you want to return all words that contain your substring. To do this, don't use for(int i=0;i < matcher.groupCount ();i++) but while(matcher.find()), since the group loop iterates over the groups of a single match, not over all matches.
In case your search term can contain special characters, you should probably use Pattern.quote(searchTerm).
In your code you are trying to find "Pizza" in "Cheese Pizza", so I assume that you also want to find words that are identical to the searched substring. Although your regex will work fine for that, you can change the last part, (.+?)?), to \\w* and also add \\w* at the start if the substring should also be matched in the middle of a word (not only at the start).
So your code can look like this:
private static String foo() {
String searchTerm = "Pizza";
String text = "Cheese Pizza, Other Pizzas";
String sPattern = "(?i)\\b\\w*" + Pattern.quote(searchTerm) + "\\w*\\b";
StringBuilder result = new StringBuilder("-").append(searchTerm).append(": ");
Pattern pattern = Pattern.compile(sPattern);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
result.append(matcher.group()).append(' ');
}
return result.toString().trim();
}
While the regex approach is certainly a valid method, I find it easier to think through when you split the words up by whitespace. This can be done with String's split method.
public List<String> doIt(final String inputString, final String term) {
final List<String> output = new ArrayList<String>();
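final String[] parts = inputString.split("\\s+"); // the parameter is named inputString
for(final String part : parts) {
if(part.indexOf(term) >= 0) { // indexOf returns 0 for a match at the start of the word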
output.add(part);
}
}
return output;
}
Of course it is worth noting that doing this will effectively make two passes through your input String: the first pass to find the whitespace characters to split on, and the second pass to look through each split word for your substring.
If one pass is necessary though, the regex path is better.
I find nicholas.hauschild's answer to be the best.
However if you really wanted to use regex, you could do it as such:
String searchTerm = "Pizza";
String text = "Cheese Pizza";
Pattern pattern = Pattern.compile("\\b" + Pattern.quote(searchTerm)
+ "\\b", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
System.out.println(matcher.group());
}
Output:
Pizza
The pattern should have been
String sPattern = "(?i)\\b("+searchTerm+"(?:.+?)?)\\b";
You want to capture the whole (pizza) string. The ?: ensures you don't capture a part of the string twice.
Try this pattern:
String searchTerm = "Po";
String text = "Porky Pork Chop oPod zzz llPo";
// The alternative with letters on both sides comes first, so "oPod" is not cut short to "oPo".
Pattern p = Pattern.compile("\\p{Alpha}+" + searchTerm + "\\p{Alpha}+|\\p{Alpha}+" + searchTerm + "|" + searchTerm + "\\p{Alpha}+");
Matcher m = p.matcher(text);
while(m.find()) {
System.out.println(">> " + m.group());
}
OK, here is a pattern in raw style (not Java style; you must double the backslashes yourself):
(?i)\b[a-z]*po[a-z]*\b
And that's all.
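For reference, the raw pattern above could be written in Java roughly as follows. This is a sketch under assumptions: the class name WordsContaining and the method wordsContaining are hypothetical, and the fixed "po" is replaced by a quoted parameter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordsContaining {
    // Returns every word (run of ASCII letters) in text that contains part, ignoring case.
    static List<String> wordsContaining(String text, String part) {
        // Each regex backslash is doubled inside the Java string literal.
        Pattern p = Pattern.compile("(?i)\\b[a-z]*" + Pattern.quote(part) + "[a-z]*\\b");
        Matcher m = p.matcher(text);
        List<String> words = new ArrayList<>();
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(wordsContaining("Porky Pork Chop", "po")); // [Porky, Pork]
    }
}
```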

Java regex skipping matches

I have some text; I want to extract pairs of words that are not separated by punctuation. This is the code:
//n-grams
Pattern p = Pattern.compile("[a-z]+");
if (n == 2) {
p = Pattern.compile("[a-z]+ [a-z]+");
}
if (n == 3) {
p = Pattern.compile("[a-z]+ [a-z]+ [a-z]+");
}
Matcher m = p.matcher(text.toLowerCase());
ArrayList<String> result = new ArrayList<String>();
while (m.find()) {
String temporary = m.group();
System.out.println(temporary);
result.add(temporary);
}
The problem is that it skips some matches. For example, for n = 3, "My name is James" must match "my name is" and "name is james", but instead it matches just the first. Is there a way to solve this?
You can capture it using groups in lookahead
(?=(\b[a-z]+\b \b[a-z]+\b \b[a-z]+\b))
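The lookahead itself consumes no characters, so overlapping matches are possible; the group is captured once per match. (In a Java string literal, write each \b as \\b.) In your case, iterating with find() and reading group(1) gives:
Match 1, group(1) -> my name is
Match 2, group(1) -> name is james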
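In Java the lookahead approach might look like the sketch below. The class name OverlappingNGrams and the method nGrams are made up for illustration, and building the pattern from n (assumed to be at least 1) is a generalization that is not in the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlappingNGrams {
    // Returns all runs of n lowercase words separated by single spaces, overlaps included.
    static List<String> nGrams(String text, int n) {
        // The lookahead matches at a position without consuming it, so the next
        // find() can start inside the words that were just captured.
        Pattern p = Pattern.compile("(?=(\\b[a-z]+(?: [a-z]+){" + (n - 1) + "}\\b))");
        Matcher m = p.matcher(text.toLowerCase());
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(nGrams("My name is James", 3)); // [my name is, name is james]
    }
}
```

The \b at both ends keeps a match from starting or ending in the middle of a word.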
In a regular expression search, the pattern is applied to the string from left to right, and once a source character has been used in a match, it can't be reused.
For example, the regex "121" will match "31212142121" only twice, as "_121____121".
I tend to use the argument to the find() method of Matcher:
Matcher m = p.matcher(text);
int position = 0;
while (m.find(position)) {
String temporary = m.group();
position = m.start();
System.out.println(position + ":" + temporary);
position++;
}
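So after each iteration, it searches again starting one character after the previous match's start index. Note that with a pattern such as [a-z]+ [a-z]+ [a-z]+ this can also return matches that begin in the middle of a word (for example "y name is"), unless the pattern starts with \b.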
Hope that helped!
