Why this code:
String keyword = "pattern";
String text = "sometextpatternsometext";
String patternStr = "^.*" + keyword + ".*$"; //
Pattern pattern = Pattern.compile(patternStr, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
int start = matcher.start();
int end = matcher.end();
System.out.println("start = " + start + ", end = " + end);
}
start = 0, end = 23
doesn't work properly?
But, this code:
String keyword = "pattern";
String text = "sometext pattern sometext";
String patternStr = "\\b" + keyword + "\\b"; //
Pattern pattern = Pattern.compile(patternStr, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
int start = matcher.start();
int end = matcher.end();
System.out.println("start = " + start + ", end = " + end);
}
start = 9, end = 16
works fine.
It does work. Your pattern
^.*pattern.*$
says to match:
start at the beginning
accept any number of characters
followed by the string pattern
followed by any number of characters
until the end of the string
The result is the entire input string. If you wanted to find only the word pattern, then the regex would be just the word by itself, or as you found, bracketed with word-boundary metacharacters.
It is not that the first example didn't work, it is that you inadvertently asked it to match more than you meant.
The .* expressions expand to contain all the characters before "pattern" and all the characters after pattern, so the whole expression matches the whole line.
With your second example, \b is a zero-width word boundary: it consumes no characters and only requires that "pattern" is not adjacent to another word character. The match is therefore just pattern itself (start = 9, end = 16), with no surrounding spaces.
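The difference between the patterns discussed above can be checked directly. This is only a small sketch; the class and method names (KeywordSpan, firstSpan, describe) are made up for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeywordSpan {
    // Hypothetical helper: returns {start, end} of the first match, or null if there is none.
    static int[] firstSpan(String regex, String text) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(text);
        return m.find() ? new int[] { m.start(), m.end() } : null;
    }

    // Formats the span the same way the question's println does.
    static String describe(String regex, String text) {
        int[] span = firstSpan(regex, text);
        return span == null ? "no match" : "start = " + span[0] + ", end = " + span[1];
    }

    public static void main(String[] args) {
        String glued = "sometextpatternsometext";
        String spaced = "sometext pattern sometext";
        System.out.println(describe("^.*pattern.*$", glued));   // start = 0, end = 23
        System.out.println(describe("pattern", glued));         // start = 8, end = 15
        System.out.println(describe("\\bpattern\\b", spaced));  // start = 9, end = 16
        System.out.println(describe("\\bpattern\\b", glued));   // no match
    }
}
```

The last call finds nothing because in the glued text "pattern" is surrounded by word characters, so there is no word boundary on either side.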
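The issue is in your regex: "^.*" + keyword + ".*$"
The expression .* is greedy: it first consumes as many characters as it can (the whole string) and then backtracks until the keyword can match. Combined with the trailing .*$, the overall match always spans the entire string.
You can make the first quantifier reluctant (lazy) by adding a question mark after .*:
"^.*?" + keyword + ".*$"
This time .*? matches as few characters as possible before your keyword. Because the pattern is still anchored with ^ and $, start() and end() still cover the whole string. To get the position of the keyword alone, drop the ^.* and .*$ parts, or wrap the keyword in a capturing group and use start(1) and end(1).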
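To see what greedy and reluctant quantifiers actually change, it helps to capture the leading .* in a group. The sketch below uses a made-up input with two occurrences of the keyword; the class name and helper are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // Hypothetical helper: returns "start-end | group 1" for the first match.
    static String inspect(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.find()) {
            return "no match";
        }
        return m.start() + "-" + m.end() + " | " + m.group(1);
    }

    public static void main(String[] args) {
        String text = "xpatternypatternz";
        // Greedy: group 1 runs up to the LAST occurrence of the keyword.
        System.out.println(inspect("^(.*)pattern.*$", text));   // 0-17 | xpatterny
        // Reluctant: group 1 stops before the FIRST occurrence.
        System.out.println(inspect("^(.*?)pattern.*$", text));  // 0-17 | x
    }
}
```

In both cases the overall match is 0-17, the whole string; only the split between the parts differs.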
I have a problem with a regex that is not working. I don't know what I am doing wrong. My code:
String test = "timetable:xxxxxtimetable:; timetable: fullihhghtO;";
Pattern p = Pattern.compile("\\btimetable:(.*);");
//also tried "timetable:(.*);" and "(\\btimetable:)(.*)(;)"
Matcher m = p.matcher(test);
while(m.find()) {
System.out.println("S:" + m.start() + ", E:" + m.end());
System.out.println("x: "+ test.substring(m.start(), m.end()));
}
Expected result:
(1) "timetable:xxxxxtimetable:"
(2) "timetable: fullihhghtO"
Thanks for any help.
A non-capturing group could be handy in our case:
String test = "timetable:xxxxxtimetable:; timetable: fullihhghtO;";
Pattern p = Pattern.compile("(?:\\btimetable:(.*?);)+"); // <-- here
Matcher m = p.matcher(test);
int i = 1;
while (m.find()) {
System.out.println(i + ") "+ m.group(1));
i++;
}
OUTPUT
1) xxxxxtimetable:
2) fullihhghtO
Regex explained:
(?:\\btimetable:(.*?);)+ The non-capturing group (?:\\btimetable:...) consumes "timetable:" without capturing it, and the only capturing group, (.*?), captures what we want (everything between \btimetable: and ;). Pay special attention to the non-greedy term .*?, which consumes the minimum possible number of characters up to the next ;. Without this lazy form, the regex would use the default greedy mode and consume everything up to the last ; in the string!
Now, all that is relevant if you wanted to catch only the unique part, but if you wanted to catch the whole thing:
1) timetable:xxxxxtimetable:;
2) timetable: fullihhghtO;
It can be done easily by modifying the line with the regex to:
Pattern p = Pattern.compile("\\b(timetable:.*?;)+");
which is even simpler: only one capturing group (see that we still have to use the non-greedy mode!).
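The effect of the lazy quantifier on this input can be verified with a short sketch. The helper name findAll is made up for illustration; it simply collects the text of every match:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyTimetable {
    // Hypothetical helper: collects the full text of every match.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String test = "timetable:xxxxxtimetable:; timetable: fullihhghtO;";
        // Greedy: a single match that runs to the last ';' in the string.
        System.out.println(findAll("\\btimetable:.*;", test).size());  // 1
        // Lazy: each match stops at the nearest ';'.
        for (String s : findAll("\\btimetable:.*?;", test)) {
            System.out.println(s);
        }
        // timetable:xxxxxtimetable:;
        // timetable: fullihhghtO;
    }
}
```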
You don't need to use regex, a simple split would do it :
public static void main(String[] args) throws IOException {
String test = "timetable:xxxxxtimetable:; timetable: fullihhghtO;";
String[] array = test.split(";");
String str1 = array[0].trim();
String str2 = array[1].trim();
System.out.println(str1 + "\n" + str2); //timetable:xxxxxtimetable:
//timetable: fullihhghtO
}
I'm trying to parse a URL and I'd like to test for the last index of a couple characters followed by a numeric value.
Example
used-cell-phone-albany-m3359_l12201
I'm trying to determine if the last "-m" is followed by a numeric value.
So something like this, "used-cell-phone-albany-m3359_l12201".contains("m" followed by numeric)
I'm assuming it needs to be done with regular expressions, but I'm not sure.
You could use a pattern like [a-z]\\d, which finds any letter a-z immediately followed by a digit. You can list other characters in the character class if you wish...
Pattern pattern = Pattern.compile("[a-z]\\d", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher("used-cell-phone-albany-m3359_l12201");
while (matcher.find()) {
int startIndex = matcher.start();
int endIndex = matcher.end();
String match = matcher.group();
System.out.println(startIndex + "-" + endIndex + " = " + match);
}
The problem is, your test String actually contains two matches m3 and l1
The above example will display
23-25 = m3
29-31 = l1
Updated with feedback
If you can guarantee the marker (i.e. -m), then it becomes a lot simpler...
Pattern pattern = Pattern.compile("-m\\d", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher("used-cell-phone-albany-m3359_l12201");
if (matcher.find()) {
int startIndex = matcher.start();
int endIndex = matcher.end();
String match = matcher.group();
System.out.println(startIndex + "-" + endIndex + " = " + match);
}
In Java, convert the URL to a String if necessary and then run
urlString.matches("^.*m[0-9]+$").
This returns true only if the URL ends with "m" followed by a number. That can be refined with a more precise ending pattern. The regex tests the end of the string because $ matches the end of the string; "[0-9]+" matches a sequence of one or more digits; "^" matches the beginning of the string; and ".*" matches zero or more arbitrary characters (anything except line terminators), including white space, letters, digits and punctuation marks. Since String.matches() already requires the whole string to match, the ^ and $ anchors are optional here.
To determine whether the last "m" is followed by a digit, use
urlString.matches("^.*m[0-9][^m]*$")
Here ".*" greedily matches all characters up to an "m", and "[^m]*" ensures that no further "m" follows, so the "m" being tested is the very last one.
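If the marker is "-m" rather than a bare "m", one possible way to test only its last occurrence is a negative lookahead. The following is a sketch under that assumption; the class and method names are hypothetical:

```java
import java.util.regex.Pattern;

public class LastMarker {
    // "-m", then a negative lookahead asserting that no later "-m" exists, then one digit.
    private static final Pattern LAST_M_DIGIT = Pattern.compile("-m(?!.*-m)\\d");

    // True only if the LAST "-m" in the input is immediately followed by a digit.
    static boolean lastMarkerHasDigit(String url) {
        return LAST_M_DIGIT.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(lastMarkerHasDigit("used-cell-phone-albany-m3359_l12201")); // true
        System.out.println(lastMarkerHasDigit("used-m12-phone-mabc"));                 // false
        System.out.println(lastMarkerHasDigit("no-marker-here"));                      // false
    }
}
```

The same check could be written without regex using lastIndexOf("-m") and Character.isDigit; the regex form is shown only because the question asked about regular expressions.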
I'm trying to write a function that extracts each word from a sentence that contains a certain substring e.g. Looking for 'Po' in 'Porky Pork Chop' will return Porky Pork.
I've tested my regex on regexpal but the Java code doesn't seem to work. What am I doing wrong?
private static String foo()
{
String searchTerm = "Pizza";
String text = "Cheese Pizza";
String sPattern = "(?i)\b("+searchTerm+"(.+?)?)\b";
Pattern pattern = Pattern.compile ( sPattern );
Matcher matcher = pattern.matcher ( text );
if(matcher.find ())
{
String result = "-";
for(int i=0;i < matcher.groupCount ();i++)
{
result+= matcher.group ( i ) + " ";
}
return result.trim ();
}else
{
System.out.println("No Luck");
}
}
In Java, to pass the \b word boundary to the regex engine you need to write it as \\b in a string literal. A plain \b in a string literal is the backspace character.
Judging by your example you want to return all words that contain your substring. To do this, don't use for(int i=0;i < matcher.groupCount ();i++) but while(matcher.find()), since the group count covers the groups of a single match, not all matches.
If your search term can contain regex special characters, you should probably wrap it in Pattern.quote(searchTerm).
In your code you are trying to find "Pizza" in "Cheese Pizza", so I assume that you also want to find words that are equal to the searched substring. Although your regex will work fine for that, you can change its last part, (.+?)?, to \\w* and also add \\w* at the start if the substring should also be matched in the middle of a word (not only at the start).
So your code can look like
private static String foo() {
String searchTerm = "Pizza";
String text = "Cheese Pizza, Other Pizzas";
String sPattern = "(?i)\\b\\w*" + Pattern.quote(searchTerm) + "\\w*\\b";
StringBuilder result = new StringBuilder("-").append(searchTerm).append(": ");
Pattern pattern = Pattern.compile(sPattern);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
result.append(matcher.group()).append(' ');
}
return result.toString().trim();
}
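The backspace point from this answer is easy to confirm. A minimal sketch (class and helper names are made up):

```java
import java.util.regex.Pattern;

public class BackspaceVsBoundary {
    // Hypothetical helper: true if the regex matches anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println((int) "\b".charAt(0));                  // 8, the backspace character
        System.out.println("\\b".length());                        // 2: a backslash and the letter b
        // The regex engine receives literal backspace characters, which are not in the text.
        System.out.println(found("\bPizza\b", "Cheese Pizza"));    // false
        // The regex engine receives \b, the word-boundary assertion.
        System.out.println(found("\\bPizza\\b", "Cheese Pizza"));  // true
    }
}
```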
While the regex approach is certainly a valid method, I find it easier to think through when you split the words up by whitespace. This can be done with String's split method.
public List<String> doIt(final String inputString, final String term) {
final List<String> output = new ArrayList<String>();
final String[] parts = inputString.split("\\s+");
for(final String part : parts) {
if(part.indexOf(term) >= 0) { // >= 0 so that a term at the very start of a word also counts
output.add(part);
}
}
return output;
}
Of course it is worth noting that doing this effectively makes two passes through your input String: the first pass to find the whitespace characters to split on, and the second pass to look through each split word for your substring.
If one pass is necessary though, the regex path is better.
I find nicholas.hauschild's answer to be the best.
However if you really wanted to use regex, you could do it as such:
String searchTerm = "Pizza";
String text = "Cheese Pizza";
Pattern pattern = Pattern.compile("\\b" + Pattern.quote(searchTerm)
+ "\\b", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
System.out.println(matcher.group());
}
Output:
Pizza
The pattern should have been
String sPattern = "(?i)\\b("+searchTerm+"(?:.+?)?)\\b";
You want to capture the whole (pizza) string. ?: ensures you don't capture a part of the string twice.
Try this pattern:
String searchTerm = "Po";
String text = "Porky Pork Chop oPod zzz llPo";
// longest alternative first, so that "oPod" is not cut short to "oPo"
Pattern p = Pattern.compile("\\p{Alpha}+" + searchTerm + "\\p{Alpha}+|\\p{Alpha}+" + searchTerm + "|" + searchTerm + "\\p{Alpha}+");
Matcher m = p.matcher(text);
while(m.find()) {
System.out.println(">> " + m.group());
}
OK, here is the pattern in raw form (not as a Java string literal; you must double the backslashes yourself):
(?i)\b[a-z]*po[a-z]*\b
And that's all.
I am a new to Java. I want to search for a string in text file. Suppose the file contains:
Hi, I am learning Java.
I am using this below pattern to search through every exact word.
Pattern p = Pattern.compile("\\b" + searchString + "\\b", Pattern.CASE_INSENSITIVE);
It works fine, but it doesn't find "java." How can I find both forms, i.e. with word boundaries and with a "." at the end of the string? Does anyone have any ideas on how I can solve this problem?
You should preprocess your search string to turn the dot . into an escaped regex dot: \\.. Note that a single dot is a metacharacter in regular expressions and means any character. For example, you can replace every dot in your String with \\.
If you don't want to do all that work, just pass java\\. as your search string.
Code example:
public static void main(String[] args) {
String fileContent = "Hi i am learning java.";
String searchString = "java";
Pattern p = Pattern.compile(searchString);
Matcher m = p.matcher(fileContent );
while(m.find()) {
System.out.println(m.start() + " " + m.group());
}
}
It would print: 17 java
public static void main(String[] args) {
String fileContent = "Hi i am learning java.";
String searchString = "java\\.";
Pattern p = Pattern.compile(searchString);
Matcher m = p.matcher(fileContent );
while(m.find()) {
System.out.println(m.start() + " " + m.group());
}
}
It would print: 17 java. (note the dot in the end)
EDIT: As a very basic solution, since the only problem you have is with the dot, you can replace all the dots in your string with \\.
public static void main(String[] args) {
String fileContent = "Hi i am learning java.";
String searchString = "java.";
//this will do the trick even if the "searchString" doesn't contain a dot inside
searchString = searchString.replaceAll("\\.", "\\\\."); // the replacement needs a doubled backslash to emit a literal one
Pattern p = Pattern.compile(searchString);
Matcher m = p.matcher(fileContent );
while(m.find()) {
System.out.println(m.start() + " " + m.group());
}
}
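Instead of escaping dots by hand, the standard Pattern.quote method can escape the whole search string so that every character is taken literally. A minimal sketch, with a made-up helper name:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LiteralSearch {
    // Hypothetical helper: index of the first literal occurrence of needle, or -1.
    static int indexOfLiteral(String needle, String haystack) {
        Matcher m = Pattern.compile(Pattern.quote(needle)).matcher(haystack);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOfLiteral("java.", "Hi i am learning java."));       // 17
        System.out.println(indexOfLiteral("java.", "Hi i am learning javascript"));  // -1
        // Without quoting, the dot matches any character, so "javas" is found:
        System.out.println(Pattern.compile("java.").matcher("Hi i am learning javascript").find()); // true
    }
}
```

This also covers other metacharacters such as +, ( or [ that a user-supplied search string might contain.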
"\\b" + searchstring + "(?:\\.|\\b)"
If you want to stipulate that the dot must be followed by a non-word character or the end of the string, you could add a positive look-ahead
"\\b" + searchstring + "(?:\\.(?=\\W|$)|\\b)"
Pattern p = Pattern.compile(".*\\W*" + searchWord + "\\W*.*", Pattern.CASE_INSENSITIVE);
To spell it out, the above says "find me a bit of text that starts with 0 or more characters, followed by 0 or more non-word characters (\W*), followed by the search word, followed by 0 or more non-word characters, followed by anything else". Note that \W* can also match nothing, so unlike \b it does not require a word boundary around the search word.
This caters for situations where the search word is at the beginning of the file, at the very end, or between punctuation, e.g. "hi,I am learning,java.".
Hope this helps...
I am totally lost when it comes to regular expressions.
I get generated strings like:
Your number is (123,456,789)
How can I filter out 123,456,789?
You can use this regex for extracting the number including the commas
\(([\d,]*)\)
The first captured group will have your match. Code will look like this
String subjectString = "Your number is (123,456,789)";
Pattern regex = Pattern.compile("\\(([\\d,]*)\\)");
Matcher regexMatcher = regex.matcher(subjectString);
if (regexMatcher.find()) {
String resultString = regexMatcher.group(1);
System.out.println(resultString);
}
Explanation of the regex
"\\(" + // Match the character “(” literally
"(" + // Match the regular expression below and capture its match into backreference number 1
"[\\d,]" + // Match a single character present in the list below
// A single digit 0..9
// The character “,”
"*" + // Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
")" +
"\\)" // Match the character “)” literally
This will get you started http://www.regular-expressions.info/reference.html
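If the individual numbers are needed rather than the comma-separated text, the captured group can be split afterwards. The sketch below uses a slightly stricter pattern than the one above (an assumption of this example, so that empty pieces never reach Integer.parseInt); the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParenNumbers {
    // Stricter variant (assumption of this sketch): at least one run of digits,
    // with further runs separated by single commas.
    private static final Pattern GROUPED = Pattern.compile("\\((\\d+(?:,\\d+)*)\\)");

    // Returns the numbers inside the first parenthesised group, or an empty list.
    static List<Integer> extract(String input) {
        List<Integer> numbers = new ArrayList<>();
        Matcher m = GROUPED.matcher(input);
        if (m.find()) {
            for (String part : m.group(1).split(",")) {
                numbers.add(Integer.parseInt(part));
            }
        }
        return numbers;
    }

    public static void main(String[] args) {
        System.out.println(extract("Your number is (123,456,789)")); // [123, 456, 789]
        System.out.println(extract("Your number is missing"));       // []
    }
}
```

Integer.parseInt throws NumberFormatException for digit runs that do not fit in an int, so very long numbers would need Long or BigInteger instead.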
String str="Your number is (123,456,789)";
str = str.replaceAll(".*\\((.*)\\).*","$1");
or you can make the replacement a bit faster by doing:
str = str.replaceAll(".*\\(([\\d,]*)\\).*","$1");
try
"\\(([^)]+)\\)"
or
int start = text.indexOf('(')+1;
int end = text.indexOf(')', start);
String num = text.substring(start, end);
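The indexOf variant above assumes both parentheses are present; if either is missing, substring throws. A guarded sketch (the helper name between is made up, and returning null for a missing parenthesis is just one possible choice):

```java
public class BetweenParens {
    // Hypothetical helper: text between the first '(' and the next ')',
    // or null if either parenthesis is missing.
    static String between(String text) {
        int open = text.indexOf('(');
        if (open < 0) {
            return null;
        }
        int close = text.indexOf(')', open + 1);
        if (close < 0) {
            return null;
        }
        return text.substring(open + 1, close);
    }

    public static void main(String[] args) {
        System.out.println(between("Your number is (123,456,789)")); // 123,456,789
        System.out.println(between("Your number is 123"));           // null
    }
}
```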
private void showHowToUseRegex()
{
final Pattern MY_PATTERN = Pattern.compile("Your number is \\((\\d+),(\\d+),(\\d+)\\)");
final Matcher m = MY_PATTERN.matcher("Your number is (123,456,789)");
if (m.matches()) {
Log.d("xxx", "0:" + m.group(0));
Log.d("xxx", "1:" + m.group(1));
Log.d("xxx", "2:" + m.group(2));
Log.d("xxx", "3:" + m.group(3));
}
}
You'll see that group 0 is the whole string, and the next 3 groups are your numbers.
String str = "Your number is (123,456,789)";
str = str.substring(16, str.length() - 1);