Java Matcher: need to know if subsequence equals whole sequence

I need to figure out a way to determine when the thing being matched isn't a subsequence but is the whole sequence, e.g. "this", not "is".
while (in.hasNextLine()) {
count++;
String patternInLine = in.nextLine().toString();
m = p.matcher(patternInLine);
if (m.find() && searchPattern.equals(m.group())) {
System.out.println("matches group: " + m.group());
m.toString();
System.out.println(patternInLine);
foundLinePattern.get(file).add(patternInLine);
}
}
in.close();
}

Use m.matches() instead of m.find().
find looks for any substring that matches your regex.
matches tries to match only the whole string against the regex. It will not look for substrings.
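To make the difference concrete, here is a minimal sketch. The class name FindVsMatches and both helper methods are made up for illustration; the search pattern is assumed to be a plain word with no regex metacharacters.

```java
import java.util.regex.Pattern;

class FindVsMatches {
    // True if the regex matches somewhere inside the input (substring match).
    static boolean foundAnywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // True only if the regex matches the input from its first to its last character.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(foundAnywhere("is", "this"));  // true: "is" occurs inside "this"
        System.out.println(matchesWhole("is", "this"));   // false: "is" is not the whole string
        System.out.println(matchesWhole("this", "this")); // true
    }
}
```

With matches(), the extra searchPattern.equals(m.group()) comparison from the question should no longer be needed when the pattern is a literal word.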

Related

Search substring in a string using regex

I'm trying to search a string for a set of words contained in an ArrayList (terms_1pers). Since the precondition is that there should be no letters directly before or after the search word, I thought of using a regular expression.
I just don't know what I'm doing wrong with the matches method. In the code below, if no match is found, the record is written to an external file.
String url = csvRecord.get("url");
String text = csvRecord.get("review");
String var = null;
for(String term : terms_1pers)
{
if(!text.matches("[^a-z]"+term+"[^a-z]"))
{
var="true";
}
}
if(!var.equals("true"))
{
bw.write(url+";"+text+"\n");
}
In order to find regex matches, you should use the regex classes Pattern and Matcher.
String term = "term";
ArrayList<String> a = new ArrayList<String>();
a.add("123term456"); //true
a.add("A123Term5"); //false
a.add("term456"); //true
a.add("123term"); //true
Pattern p = Pattern.compile("^[^A-Za-z]*(" + term + ")[^A-Za-z]*$");
for(String text : a) {
Matcher m = p.matcher(text);
if (m.find()) {
System.out.println("Found: " + m.group(1) );
//group(0) is the whole match; the term sits in the first capturing group, so read group(1)
}
else System.out.println("No match for: " + term);
}
In the example above, we create an instance of a Pattern (https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html) to find matches in the text you are matching against.
Note that I adjusted the regex a bit. The character class in this code excludes all letters A-Z and their lowercase versions from the parts before and after the term. It also allows for situations where there are no characters at all before or after the term. If you need to have something there, use + instead of *. I also anchored the regex with ^ and $ so that the whole text must consist of only these three parts: the non-letter prefix, the term, and the non-letter suffix. If this doesn't fit your use case, you may need to adjust.
To demonstrate using this with a variety of different terms:
ArrayList<String> terms = new ArrayList<String>();
terms.add("term");
terms.add("the book is on the table");
terms.add("1981 was the best year ever!");
ArrayList<String> a = new ArrayList<String>();
a.add("123term456");
a.add("A123Term5");
a.add("the book is on the table456");
a.add("1##!231981 was the best year ever!9#");
for (String term: terms) {
Pattern p = Pattern.compile("^[^A-Za-z]*(" + term + ")[^A-Za-z]*$");
for(String text : a) {
Matcher m = p.matcher(text);
if (m.find()) {
System.out.println("Found: " + m.group(1) + " in " + text);
//group(0) is the whole match; the term sits in the first capturing group, so read group(1)
}
else System.out.println("No match for: " + term + " in " + text);
}
}
Output for this is:
Found: term in 123term456
No match for: term in A123Term5
No match for: term in the book is on the table456....
In response to the question about making String term case insensitive, here's a way to build the regex string by using java.lang.Character to generate upper- and lower-case options for each letter.
String term = "This iS the teRm.";
String matchText = "123This is the term.";
StringBuilder str = new StringBuilder();
str.append("^[^A-Za-z]*(");
for (int i = 0; i < term.length(); i++) {
char c = term.charAt(i);
if (Character.isLetter(c))
str.append("(" + Character.toLowerCase(c) + "|" + Character.toUpperCase(c) + ")");
else str.append(c);
}
str.append(")[^A-Za-z]*$");
System.out.println(str.toString());
Pattern p = Pattern.compile(str.toString());
Matcher m = p.matcher(matchText);
if (m.find()) System.out.println("Found!");
else System.out.println("Not Found!");
This code outputs two lines, the first line is the regex string that's being compiled in the Pattern. "^[^A-Za-z]*((t|T)(h|H)(i|I)(s|S) (i|I)(s|S) (t|T)(h|H)(e|E) (t|T)(e|E)(r|R)(m|M).)[^A-Za-z]*$" This adjusted regex allows for letters in the term to be matched regardless of case. The second output line is "Found!" because the mixed case term is found within matchText.
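As an alternative to expanding every letter by hand, the regex engine can handle case itself. The sketch below is a variation on the answer's code, not a replacement for it: the class and method names are hypothetical, and it additionally wraps the term in Pattern.quote, so a "." in the term matches only a literal dot (in the hand-built regex above it matches any character).

```java
import java.util.regex.Pattern;

class CaseInsensitiveTerm {
    // Same "only non-letters before and after the term" shape as above,
    // but case is ignored via the CASE_INSENSITIVE flag.
    static boolean containsTerm(String term, String text) {
        Pattern p = Pattern.compile(
                "^[^A-Za-z]*(" + Pattern.quote(term) + ")[^A-Za-z]*$",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsTerm("This iS the teRm.", "123This is the term.")); // true
        System.out.println(containsTerm("term", "A123Term5"));                         // false: a letter precedes the term
    }
}
```

Note that CASE_INSENSITIVE covers only US-ASCII letters by default; adding Pattern.UNICODE_CASE extends it to other scripts.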
There are several things to note:
matches requires a full-string match, so [^a-z]term[^a-z] will only match a string like :term. (exactly one non-letter on each side of the term). You need to use .find() to find partial matches.
If you pass a literal string to a regex, you need to Pattern.quote it; otherwise, if it contains special characters, it will not be matched as written.
To check if a word has some pattern before or after or at the start/end, you should either use alternations with anchors (like (?:^|[^a-z]) or (?:$|[^a-z])) or lookarounds, (?<![a-z]) and (?![a-z]).
To match any ASCII letter just use \p{Alpha} or, if you plan to match any Unicode letter, \p{L}.
It is more logical to declare the var variable as a boolean.
Fixed code:
String url = csvRecord.get("url");
String text = csvRecord.get("review");
Boolean var = false;
for(String term : terms_1pers)
{
Matcher m = Pattern.compile("(?<!\\p{L})" + Pattern.quote(term) + "(?!\\p{L})").matcher(text);
// If the search must be case insensitive use
// Matcher m = Pattern.compile("(?i)(?<!\\p{L})" + Pattern.quote(term) + "(?!\\p{L})").matcher(text);
if(!m.find())
{
var = true;
}
}
if (!var) {
bw.write(url+";"+text+"\n");
}
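To see why Pattern.quote matters in the fixed code, here is a small sketch. The class name, the hasWord helper, and its quote switch are hypothetical and exist only to compare the two behaviours side by side; the term "i.e" is an assumed example of a search word that contains a regex metacharacter.

```java
import java.util.regex.Pattern;

class QuoteTerm {
    // Looks for term in text with no letter directly before or after it.
    // When quote is false the term is interpreted as a regex, which is the bug being illustrated.
    static boolean hasWord(String term, String text, boolean quote) {
        String body = quote ? Pattern.quote(term) : term;
        return Pattern.compile("(?<!\\p{L})" + body + "(?!\\p{L})").matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(hasWord("i.e", "ice cream", false));    // true: the unquoted "." matches "c"
        System.out.println(hasWord("i.e", "ice cream", true));     // false: a literal "i.e" is not present
        System.out.println(hasWord("i.e", "see i.e below", true)); // true
    }
}
```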
You did not consider the case where the text before and after the term may itself contain letters,
so adding .* at the front and end should solve your problem.
for(String term : terms_1pers)
{
if( text.matches(".*[^a-zA-Z]+" + term + "[^a-zA-Z]+.*") )
{
var="true";
break; //exit the loop
}
}
if(!var.equals("true"))
{
bw.write(url+";"+text+"\n");
}

Java Pattern / Matcher not finding word break

I am having trouble with Java Pattern and Matcher. I've included a very simplified example of what I'm trying to do.
I had expected the pattern ".\b" to find the last character of the first word (or "4" in the example), but as I step through the code, m.find() always returns false. What am I missing here?
Why does the following Java code always print out "Not Found"?
Pattern p = Pattern.compile(".\b");
Matcher m = p.matcher("102939384 is a word");
int ixEndWord = 0;
if (m.find()) {
ixEndWord = m.end();
System.out.println("Found: " + ixEndWord);
} else {
System.out.println("Not Found");
}
You need to escape the backslash in the regex string: ".\\b"
Basically, in a Java String literal the backslash has to be escaped. So "\\" becomes the single character '\'.
So the String literal ".\\b" becomes the three-character string .\b, which is what the Pattern receives.
To expand upon AntonH's comment, whenever you want the "\" character to appear in a regular expression, you have to escape it so that it actually appears in the string you are passing in.
As is, ".\b" is a string made of a dot . followed by the special backspace character represented by \b, whereas ".\\b" is the regex .\b.
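The difference between the two literals can be checked directly. This is a minimal sketch using the input from the question; the class name and the endOfFirstMatch helper are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackslashEscape {
    // Returns the end index of the first match, or -1 if the regex does not match.
    static int endOfFirstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.end() : -1;
    }

    public static void main(String[] args) {
        String input = "102939384 is a word";
        System.out.println(".\b".length());                 // 2: a dot and a backspace character
        System.out.println(".\\b".length());                // 3: a dot, a backslash and the letter b
        System.out.println(endOfFirstMatch(".\b", input));  // -1: the input contains no backspace
        System.out.println(endOfFirstMatch(".\\b", input)); // 9: just after the "4" that ends the first word
    }
}
```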

Java: String.contains matches exact word

In Java
String term = "search engines";
String subterm_1 = "engine";
String subterm_2 = "engines";
If I do term.contains(subterm_1) it returns true. I don't want that. I want the subterm to exactly match one of the words in term
Therefore something like term.contains(subterm_1) returns false and term.contains(subterm_2) returns true
\b Matches a word boundary where a word character is [a-zA-Z0-9_].
This should work for you, and you could easily reuse this method.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class testMatcher {
public static void main(String[] args){
String source1="search engines";
String source2="search engine";
String subterm_1 = "engines";
String subterm_2 = "engine";
System.out.println(isContain(source1,subterm_1));
System.out.println(isContain(source2,subterm_1));
System.out.println(isContain(source1,subterm_2));
System.out.println(isContain(source2,subterm_2));
}
private static boolean isContain(String source, String subItem){
String pattern = "\\b"+Pattern.quote(subItem)+"\\b"; // quote so regex metacharacters in subItem are matched literally
Pattern p=Pattern.compile(pattern);
Matcher m=p.matcher(source);
return m.find();
}
}
Output:
true
false
false
true
I would suggest using word boundaries. If you compile a pattern like \bengines\b, your regular expression will only match on complete words.
Here is an explanation of word boundaries, as well as some examples.
http://www.regular-expressions.info/wordboundaries.html
Also, here is the Java API documentation for Pattern, which covers word boundaries
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Here is an example using your requirements above
Pattern p = Pattern.compile("\\bengines\\b");
Matcher m = p.matcher("search engines");
System.out.println("matches: " + m.find());
p = Pattern.compile("\\bengine\\b");
m = p.matcher("search engines");
System.out.println("matches: " + m.find());
and here is the output:
matches: true
matches: false
If the words are always separated by spaces, this is one way to go:
String string = "search engines";
String[] parts = string.split(" ");
for(int i = 0; i < parts.length; i++) {
if(parts[i].equals("engine")) {
//do whatever you want
}
}
I want the subterm to exactly match one of the words in term
Then you can't use contains(). You could split the term into words and check equality (with or without case sensitivity).
boolean hasTerm = false;
for (String word : term.split("\\s+")) {
if (word.equals("engine")) {
hasTerm = true;
break;
}
}
Use indexOf instead, and then check whether the character at position
index + length of the subterm is a space or the end of the string (and likewise that the match starts at index 0 or right after a space).
I am sure there is a regex way as well.
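The indexOf check could be sketched as follows. This assumes words are separated by single spaces only; the class and method names are hypothetical. The loop keeps searching after a rejected hit, so a partial occurrence earlier in the string does not hide a whole-word occurrence later on.

```java
class WholeWordIndexOf {
    // True if word occurs in text with a space or a string boundary on both sides.
    static boolean containsWord(String text, String word) {
        if (word.isEmpty()) {
            return false;
        }
        int from = 0;
        while (true) {
            int idx = text.indexOf(word, from);
            if (idx < 0) {
                return false; // no further occurrence
            }
            int end = idx + word.length(); // index of the character right after the hit
            boolean startOk = idx == 0 || text.charAt(idx - 1) == ' ';
            boolean endOk = end == text.length() || text.charAt(end) == ' ';
            if (startOk && endOk) {
                return true;
            }
            from = idx + 1; // try the next occurrence
        }
    }

    public static void main(String[] args) {
        System.out.println(containsWord("search engines", "engine"));  // false
        System.out.println(containsWord("search engines", "engines")); // true
    }
}
```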
Since the contains method only checks whether that sequence of characters exists anywhere in the string, it will also return true for partial words; you will have to use a regex for this check.
If the words are always separated by spaces it is easier: you can use the \s regex to match the whitespace around them.
Here is a good tutorial: http://www.vogella.com/tutorials/JavaRegularExpressions/article.html
One approach could be to split the string by spaces, convert it to a list, and then use the contains method to check for exact matches, like so:
String[] results = term.split("\\s+");
Boolean matchFound = Arrays.asList(results).contains(subterm_1);

delete all text after found word java

I need to cut the tail off a string in some cases. I have done this with indexOf and substring, but it slowed my code down. I have thought about regular expressions, but these tails only have similar beginnings; the tail is not a "stable" word.
For example I have such string
aaaaa bbb cc (bb) (r-1hh)
and I need a result
aaaaa bbb cc (bb)
but there also could be such string
aaaaa bbb cc (bb) (r3-34fff)
or
aaaaa bbb cc (bb) [tagBB- na]
So, the question is - could I use regex to find an index of tail ?
The other question: do indexOf or substring use regex in Java?
How to find regex match position:
Pattern p = Pattern.compile("i.*t");
String s = "my input string";
Matcher m = p.matcher(s);
if (m.find()) {
System.out.println("match begins at " + m.start()); // 3
System.out.println("match ends at " + m.end()); // 11
} else {
System.out.println("no match found");
}
But you can remove trailing text this way:
String res = s.replaceFirst("^(.* input).*", "$1");
System.out.println("'" + res + "'");
Or use an exact match without escaping each special char this way:
String res = s.replaceFirst("^(.* " + Pattern.quote("^something$wierd^") + ").*", "$1");
System.out.println("'" + res + "'");
You may write a regex which contains anything but ) and ends on ), so you avoid matching anything after the first ).
You could use $ to match the end of the string and then find a common pattern for your tail. Is it always going to be an alphanumeric/dash/space character situated between [] or ()? Then that's your pattern.
Then just substring everything between the beginning of your initial string and the beginning of the substring you found using the pattern for the tail.
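Putting those suggestions together, a sketch might look like this. It assumes the tail is always the last (...) or [...] group at the very end of the string and that the group contains no nested brackets; the class name, the constant and the stripTail method are hypothetical. Under that assumption a string whose only bracketed group is at the end would lose it, so the pattern would need tightening for inputs without a tail.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TailStripper {
    // Optional whitespace, then one (...) or [...] group with no brackets inside,
    // anchored to the end of the string. Compiled once and reused.
    private static final Pattern TAIL =
            Pattern.compile("\\s*[(\\[][^()\\[\\]]*[)\\]]$");

    // Returns s without its trailing bracketed group, or s unchanged if there is none.
    static String stripTail(String s) {
        Matcher m = TAIL.matcher(s);
        return m.find() ? s.substring(0, m.start()) : s;
    }

    public static void main(String[] args) {
        System.out.println(stripTail("aaaaa bbb cc (bb) (r-1hh)"));
        System.out.println(stripTail("aaaaa bbb cc (bb) (r3-34fff)"));
        System.out.println(stripTail("aaaaa bbb cc (bb) [tagBB- na]"));
    }
}
```

Compiling the Pattern once, rather than calling String.replaceFirst on every input, avoids recompiling the regex each time, which may matter if speed is the concern.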
You asked:
Can regex be used to find the index of the String?
You can use a Pattern and Matcher to achieve this.
Just noticed someone else has commented this so I won't give an example.
Do the String methods IndexOf or Substring use regex in Java?
No, String.indexOf and String.substring in Java work directly on the characters and do not use regex. You can see the Javadoc or source for more detail on this.
You can achieve this with plain Java fairly easily; this example may be similar to your existing implementation:
public String truncate(String str, String tail) {
int lengthOfTail = tail.length();
int indexOfTail = str.indexOf(tail);
return str.substring(0, indexOfTail + lengthOfTail);
}
(error handling omitted for clarity)

regular expression for file name

I have files in the format C:\Temp\myfile_124.txt
I need a regular expression which will give me just the number "124" that is whatever is there after the underscore and before the extension.
I tried a number of ways, latest is
(.+[0-9]{18,})(_[0-9]+)?\\.txt$
I am not getting the desired output. Can someone tell me what is wrong?
Matcher matcher = FILE_NAME_PATTERN.matcher(filename);
if (matcher.matches() && matcher.groupCount() == 2) {
try {
String index = matcher.group(2);
if (index != null) {
return Integer.parseInt(index.substring(1));
}
}
catch (NumberFormatException e) {
}
The first part, [0-9]{18,}, requires at least 18 digits, which you don't have.
Usually with regex it's a good idea to make the expression as simple as possible. I suggest trying
_([0-9]+)?\\.txt$
Note: you have to call find() (or matches()) before reading a group; otherwise group() throws an IllegalStateException with the message "No match found".
This example
String s = "C:\\Temp\\myfile_124.txt";
Pattern p = Pattern.compile("_(\\d+)\\.txt$");
Matcher matcher = p.matcher(s);
if (matcher.find())
for (int i = 0; i <= matcher.groupCount(); i++)
System.out.println(i + ": " + matcher.group(i));
prints
0: _124.txt
1: 124
This may work for you: (?:.*_)(\d+)\.txt
The result is in capturing group 1.
This one uses positive lookahead and will only match the number: \d+(?=\.txt)
.*_([0-9]+)\.[a-zA-Z0-9]+
The group 1 will contain the desired output.
You can do this (in Java, leave the group's parentheses unescaped, since \( and \) match literal parentheses):
^.*_([^.]*)\..*$
.*_([0-9]+)\.txt
This should work too. Of course, in a Java string literal you have to double each backslash.
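For reference, here is that last expression written as a Java string literal with the backslash doubled. This is only a sketch: the class name, the constant and the numberOf method are hypothetical, and returning -1 for a non-matching name is an arbitrary choice made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FileNumber {
    // The regex .*_([0-9]+)\.txt, with the backslash doubled for the Java literal.
    private static final Pattern NUMBER_BEFORE_EXT = Pattern.compile(".*_([0-9]+)\\.txt");

    // Returns the number between the last underscore and ".txt", or -1 if the name does not match.
    // Integer.parseInt would throw NumberFormatException for a digit run too long for an int.
    static int numberOf(String filename) {
        Matcher m = NUMBER_BEFORE_EXT.matcher(filename);
        return m.matches() ? Integer.parseInt(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        System.out.println(numberOf("C:\\Temp\\myfile_124.txt")); // 124
        System.out.println(numberOf("C:\\Temp\\myfile.txt"));     // -1
    }
}
```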
