What's wrong with this regex? - java

I need to match Twitter-Hashtags within an Android-App, but my code doesn't seem to do what it's supposed to.
What I came up with is:
ArrayList<String> tags = new ArrayList<String>(0);
Pattern p = Pattern.compile("\b#[a-z]+", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(tweet); // tweet contains the tweet as a String
while(m.find()){
tags.add(m.group());
}
The variable tweet contains a regular tweet including hashtags - but find() doesn't trigger. So I guess my regular expression is wrong.

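Your regex fails because of the \b word boundary anchor. This anchor only matches between a word character (letter, digit or underscore) and a non-word character, or between a word character and the start or end of the string. Since # is not a word character, putting \b directly in front of it causes the regex to fail unless there is a word character right before the #. Your regex would match a hashtag in foobarfoo#hashtag blahblahblah but not in foobarfoo #hashtag blahblahblah. On top of that, "\b" with a single backslash inside a Java string literal is a backspace character, not the regex anchor.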
Use #\w+ instead, and remember, inside a string, you need to double the backslashes:
Pattern p = Pattern.compile("#\\w+");
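To see the difference for yourself, here is a minimal sketch that runs both patterns over the sample text from this answer. The class name HashtagBoundaryDemo and the helper findAll are made up for illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original question.
class HashtagBoundaryDemo {
    // Collects every match of the given regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String tweet = "foobarfoo #hashtag blahblahblah";
        // \b needs a word character on one side; here '#' follows a space, so nothing matches.
        System.out.println(findAll("\\b#\\w+", tweet));
        // Without the anchor the hashtag is found.
        System.out.println(findAll("#\\w+", tweet));
    }
}
```

Note that \w also accepts digits and underscores, which is usually what you want for hashtags.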

Your pattern should be "#(\\w+)" if you are trying to just match the hash tag. Using this and the tweet "retweet pizza to #pizzahut", doing m.group() would give "#pizzahut" and m.group(1) would give "pizzahut".
Edit: Note, the html display is messing with the backslashes for escape, you'll need to have two for the w in your string literal in Java.
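As a rough sketch of the group(1) idea, assuming you want the tag names without the leading #, something like this should work. The class name HashtagGroups and the method tagNames are only illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for illustration only.
class HashtagGroups {
    // Group 0 is the whole match ("#pizzahut"), group 1 is the part in parentheses ("pizzahut").
    private static final Pattern TAG = Pattern.compile("#(\\w+)");

    // Returns the tag names without the leading '#'.
    static List<String> tagNames(String tweet) {
        List<String> names = new ArrayList<>();
        Matcher m = TAG.matcher(tweet);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(tagNames("retweet pizza to #pizzahut"));
    }
}
```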

Related

Match starting and ending character using Java Matcher class

I want to get words from string that starts with # and end with space. I've tried using this Pattern.compile("#\\s*(\\w+)") but it doesn't include characters like ' or :.
I want the solution with only Pattern Matching method.
We can try matching using the pattern (?<=\\s|^)#\\S+, which would match any word starting with #, followed by any number of non whitespace characters.
String line = "Here is a #hashtag and here is #another has tag.";
String pattern = "(?<=\\s|^)#\\S+";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(line);
while (m.find()) {
System.out.println(m.group(0));
}
#hashtag
#another
Note: The above solution has an edge case: it also pulls in any punctuation that appears directly after a hashtag. If you don't want this, the regex can be changed to match only an explicit set of characters, e.g. letters and numbers. But maybe this is not a concern for you.
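To illustrate the edge case, here is a small sketch comparing the pattern above with a stricter variant that uses \w+ instead of \S+. The stricter pattern is just one possible choice and not something the question asked for; the class and field names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison class, for illustration only.
class HashtagPunctuation {
    // Pattern from the answer: everything up to the next whitespace.
    static final Pattern LOOSE = Pattern.compile("(?<=\\s|^)#\\S+");
    // Assumed stricter variant: only letters, digits and underscores.
    static final Pattern STRICT = Pattern.compile("(?<=\\s|^)#\\w+");

    static List<String> findAll(Pattern p, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String line = "Is this a #hashtag, or #not?";
        System.out.println(findAll(LOOSE, line));  // keeps the trailing ',' and '?'
        System.out.println(findAll(STRICT, line)); // stops before the punctuation
    }
}
```

If characters like ' or : must stay part of the word, the loose pattern is the one to keep.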
The opposite of \s is \S, so you can use a regex like this:
#\s*(\S+)
Or for Java:
Pattern.compile("#\\s*(\\S+)")
It will capture anything that is not a white space.
If you want to stop only at the space character, and not at any other whitespace such as tabs or newlines, change the \S to [^ ].
The ^ inside the brackets negates the character class, so [^ ] matches any character except a space.
Pattern.compile("#\\s*([^ ]+)")
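A quick sketch of how the two variants differ, assuming an input that contains a tab inside the "word". The class name CaptureUntilSpace and the helper firstCapture are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class CaptureUntilSpace {
    // Returns group 1 of the first match, or null if the regex does not match.
    static String firstCapture(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "#it's\tnow here";
        // \S+ stops at the tab because a tab is whitespace.
        System.out.println(firstCapture("#\\s*(\\S+)", input));
        // [^ ]+ only stops at the space, so the tab is part of the capture.
        System.out.println(firstCapture("#\\s*([^ ]+)", input));
    }
}
```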

Java Regex for multiline text

I need to match a string against a regex in Java. The string is multiline and therefore contains multiple \n, like the following:
String text = "abcde\n"
+ "fghij\n"
+ "klmno\n";
String regex = "\\S*";
System.out.println(text.matches(regex));
I only want to match whether the text contains at least a non-whitespace character. The output is false. I have also tried \\S*(\n)* for the regex, which also returns false.
In the real program, both the text and regex are not hard-coded. What is the right regex to check if a multiline string contains any non-whitespace character?
The problem is not directly related to the text spanning multiple lines. It is that String.matches() has to match the whole string, not just a part of it. The newline characters count as whitespace, so \\S* can never cover the entire text.
If you want to check for at least one non-whitespace character, use:
"\\s*\\S[\\s\\S]*"
Which means
Zero or more whitespace characters at the start of the string
One non-whitespace character
Zero or more other characters (whitespace or non-whitespace) up to the end of the string
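Here is a minimal sketch that wraps this regex in a method, so you can try it against a few inputs. The class name NonWhitespaceCheck and the method hasNonWhitespace are assumptions made for the example.

```java
// Hypothetical helper class, for illustration only.
class NonWhitespaceCheck {
    // matches() must cover the whole string, hence the leading \s* and trailing [\s\S]*.
    static boolean hasNonWhitespace(String text) {
        return text.matches("\\s*\\S[\\s\\S]*");
    }

    public static void main(String[] args) {
        String text = "abcde\n"
                + "fghij\n"
                + "klmno\n";
        System.out.println(text.matches("\\S*"));     // false: the newlines are whitespace
        System.out.println(hasNonWhitespace(text));    // true
        System.out.println(hasNonWhitespace(" \n\t")); // false: whitespace only
    }
}
```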
If you just want to check whether there is at least one non white space character in the string, you can just trim the text and check the size without involving regex at all.
String text = "abcde\n"
+ "fghij\n"
+ "klmno\n";
if (!text.trim().isEmpty()){
//your logic here
}
If you really want to use regex, you can use a simple regex like below.
String text = "abcde\n"
+ "fghij\n"
+ "klmno\n";
String regex = ".*\\S+.*";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(text);
if (matcher.find()){
// your logic here
}
Using String.matches()
!text.matches("\\s*")
This checks whether the input text consists solely of whitespace characters (this includes newlines) and inverts the result with !.
Using Matcher.find()
Pattern regexp = Pattern.compile("\\S");
regexp.matcher(text).find()
Will search for the first non-whitespace character, which is more efficient as it will stop on the first match and also uses a pre-compiled pattern.
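Both variants could be put side by side like this. It is only a sketch: the class name BlankCheck and the method names viaFind and viaMatches are made up, and the pre-compiled pattern is kept in a static final field so it is compiled once.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class BlankCheck {
    // Compiled once and reused for every call.
    private static final Pattern NON_WHITESPACE = Pattern.compile("\\S");

    // True as soon as one non-whitespace character is found.
    static boolean viaFind(String text) {
        return NON_WHITESPACE.matcher(text).find();
    }

    // True unless the whole string consists of whitespace; recompiles the regex on each call.
    static boolean viaMatches(String text) {
        return !text.matches("\\s*");
    }

    public static void main(String[] args) {
        String text = "abcde\n" + "fghij\n" + "klmno\n";
        System.out.println(viaFind(text) + " " + viaMatches(text));
        System.out.println(viaFind("\n \t\n") + " " + viaMatches("\n \t\n"));
    }
}
```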

detect $character java regular expression

I have to find a word like ${test} in a text file and replace it based on some criteria. In a regular expression, '$' has a special meaning: it anchors the match at the end of the line.
What is the regular expression to detect something like ${\w+}?
You can try using this regex:
"\\$\\{\\w+\\}"
and the method String#replaceAll(String regex, String replacement):
String s = "abc ${test}def"; // for example
s = s.replaceAll("\\$\\{\\w+\\}", "STACKOVERFLOW");
[^}]* rather than \w+ ?
You might want to consider using [^}]* rather than \w+. The former matches any chars that are not a closing brace, so it would allow test-123, which the second would reject. Of course that may just be what you want.
Let's assume this is the raw regex (see what matches in the demo):
\$\{[^}]*\}
In Java, we need to further escape the backslashes, yielding \\$\\{[^}]*\\}
Likewise \$\{\w+\} would have to be used as \\$\\{\\w+\\}
Replacing the Matches in Java
String resultString = subjectString.replaceAll("\\$\\{[^}]*\\}", "Your Replacement");
Iterating through the matches in Java
Pattern regex = Pattern.compile("\\$\\{[^}]*\\}");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
// the current match is regexMatcher.group()
}
Explanation
\$ matches the literal $
\{ matches an opening brace
[^}]* matches any chars that are not a closing brace
\} a closing brace
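Since the question mentions replacing "based on some criteria", here is one possible sketch that looks each placeholder name up in a map and leaves unknown names untouched. The map lookup is an assumed criterion, and the class name PlaceholderReplacer and method expand are invented. Matcher.appendReplacement, Matcher.appendTail and Matcher.quoteReplacement are standard JDK methods.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class PlaceholderReplacer {
    // Group 1 captures the name between the braces.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]*)\\}");

    // Replaces ${name} with values.get(name); names missing from the map are kept as they are.
    static String expand(String input, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            String replacement = (value != null) ? value : m.group();
            // quoteReplacement stops '$' and '\' in the replacement from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = new HashMap<>();
        values.put("test", "STACKOVERFLOW");
        System.out.println(expand("abc ${test}def ${other}", values));
    }
}
```

StringBuffer is used here because the StringBuilder overload of appendReplacement only exists since Java 9.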

How to escape characters in a regular expression

When I use the following code I've got an error:
Matcher matcher = pattern.matcher("/Date\(\d+\)/");
The error is :
invalid escape sequence (valid ones are \b \t \n \f \r \" \' \\ )
I have also tried to change the value in the brackets to ('/Date\(\d+\)/'); without any success.
How can I avoid this error?
You need to double-escape your \ character, like this: \\.
Otherwise your String is interpreted as if you were trying to escape (.
Same with the other round bracket and the d.
In fact it seems you are trying to initialize a Pattern here, while pattern.matcher references a text you want your Pattern to match.
Finally, note that in a Pattern, escaped characters require a double escape, as such:
\\(\\d+\\)
Also, as Rohit says, Patterns in Java do not need to be surrounded by forward slashes (/).
In fact if you initialize a Pattern like that, it will interpret your Pattern as starting and ending with literal forward slashes.
Here's a small example of what you probably want to do:
// your input text
String myText = "Date(123)";
// your Pattern initialization
Pattern p = Pattern.compile("Date\\(\\d+\\)");
// your matcher initialization
Matcher m = p.matcher(myText);
// printing the output of the match...
System.out.println(m.find());
Output:
true
Your regex is correct by itself, but in Java, the backslash character itself needs to be escaped.
Thus, this regex:
/Date\(\d+\)/
Must turn into this:
/Date\\(\\d+\\)/
One backslash is for escaping the parenthesis or d. The other one is for escaping the backslash itself.
The error message you are getting arises because Java thinks you're trying to use \( as a single escape character, like \n, or any of the other examples. However, \( is not a valid escape sequence, and so Java complains.
In addition, the logic of your code is probably incorrect. The argument to matcher should be the text to search (for example, "/Date(234)/Date(6578)/"), whereas the variable pattern should contain the pattern itself. Try this:
String textToMatch = "/Date(234)/Date(6578)/";
Pattern pattern = Pattern.compile("/Date\\(\\d+\\)/");
Matcher matcher = pattern.matcher(textToMatch);
Finally, the regex character class \d means "one single digit." If you are instead trying to match the literal text \d, you would have to write \\\\d in the Java string, so that the backslash itself is escaped for the regex as well. However, in that case your regex would be a constant, and you could use textToMatch.indexOf or textToMatch.contains more easily.
To escape regex in java, you can also use Pattern.quote()
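A short sketch of how Pattern.quote() could be used here, both for a fully literal search and mixed with a regular regex part. The class name QuoteDemo, the method containsLiteral and the field DATE are made up for the example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class QuoteDemo {
    // Literal pieces are quoted, only \d+ is treated as regex syntax.
    static final Pattern DATE =
            Pattern.compile(Pattern.quote("/Date(") + "\\d+" + Pattern.quote(")/"));

    // Searches for the literal text, whatever regex metacharacters it contains.
    static boolean containsLiteral(String text, String literal) {
        return Pattern.compile(Pattern.quote(literal)).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsLiteral("/Date(123)/", "Date(123)")); // parentheses taken literally
        System.out.println(containsLiteral("Date123", "Date(123)"));     // no literal parentheses in the text
        System.out.println(DATE.matcher("/Date(234)/").matches());
    }
}
```

Without the quoting, the parentheses in "Date(123)" would be read as a capturing group, so the pattern would match Date123 and not Date(123).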

String class regular expression difficulty

I want to get the first word of a string containing an alphanumeric field.
E.g.
the string can be 'abc123abc' or 'abc-123abc'
I just want the first 'abc'.
Is there any way to get it without a for loop? (I want to do this using regex, but I don't know much about regular expressions.)
Actually, the string pattern is like
[A-Za-z]{2,5}[-]{0,1}[0-9]{1,15}[A-Za-z]{0,15}
My aim is to get the first word
Wrap the part of the expression that you would like to capture in a capturing group, and then use group(1) of the matcher to access it:
([A-Za-z]{2,5})-?[0-9]{1,15}[A-Za-z]{0,15}
The first group will capture everything up to the optional dash:
Pattern p = Pattern.compile("([A-Za-z]{2,5})-?[0-9]{1,15}[A-Za-z]{0,15}");
Matcher m = p.matcher("abc123abc");
if (m.find()) {
System.out.println(m.group(1));
}
The above prints abc (link to ideone).
Try this:
System.out.println("abc-123abc".split("[-\\d]+")[0]);
output
abc
^[A-Za-z]+
will match ASCII letters at the start of the string. Is that what you need?
You can get the matched text for ^[A-Za-z]{2,5}. This matches the leading run of two to five letters.
String word = "abc-123abc".replaceFirst("[^a-zA-Z].*$", "");
This removes everything from the first non-letter character onward. You can also use replaceFirst with a capturing group:
String word = "abc-123abc".replaceFirst("^([a-zA-Z]+).*$", "$1");
String.replaceFirst()
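If you need this in more than one place, the anchored pattern can be wrapped in a small helper. This is only a sketch: the class name FirstWord and the method firstWord are invented, and returning an empty string when there are no leading letters is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class FirstWord {
    // Leading run of ASCII letters, anchored at the start of the string.
    private static final Pattern LEADING_LETTERS = Pattern.compile("^[A-Za-z]+");

    // Returns the leading letters, or an empty string if the input does not start with a letter.
    static String firstWord(String s) {
        Matcher m = LEADING_LETTERS.matcher(s);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(firstWord("abc123abc"));
        System.out.println(firstWord("abc-123abc"));
    }
}
```

One difference to keep in mind: replaceFirst with the capturing-group pattern returns the input unchanged when nothing matches (for example for "123"), while this helper returns an empty string.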
