I'm trying to match Strings that contain the word "#SP" (sans quotes, case insensitive) in Java. However, I'm finding using Regexes very difficult!
Strings I need to match:
"This is a sample #sp string",
"#SP string text...",
"String text #Sp"
Strings I do not want to match:
"Anything with #Spider",
"#Spin #Spoon #SPORK"
Here's what I have so far: http://ideone.com/B7hHkR. Could someone guide me through building my regexp?
I've also tried: "\\w*\\s*#sp\\w*\\s*" to no avail.
Edit: Here's the code from IDEone:
java.util.regex.Pattern p =
java.util.regex.Pattern.compile("\\b#SP\\b",
java.util.regex.Pattern.CASE_INSENSITIVE);
java.util.regex.Matcher m = p.matcher("s #SP s");
if (m.find()) {
System.out.println("Match!");
}
(edit: positive lookbehind not needed, only matching is done, not replacement)
You are yet another victim of Java's misnamed regex matching methods.
.matches() unfortunately tries to match the whole input, which contradicts the usual meaning of "regex matching" (a regex can match anywhere in the input). The method you need to use is .find().
This is a braindead API, and unfortunately Java is not the only language having such misguided method names. Python also pleads guilty.
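To make the difference concrete, here is a minimal sketch comparing the two methods on the input from the question. The class and helper names are made up for this illustration.

```java
import java.util.regex.Pattern;

final class MatchesVsFind {
    // Hypothetical helper: true only if the regex covers the ENTIRE input.
    static boolean wholeInputMatches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // Hypothetical helper: true if the regex matches anywhere in the input.
    static boolean foundAnywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "s #SP s";
        System.out.println(wholeInputMatches("#SP", input));     // false
        System.out.println(foundAnywhere("#SP", input));         // true
        // matches() only succeeds once the regex itself spans the whole input.
        System.out.println(wholeInputMatches(".*#SP.*", input)); // true
    }
}
```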
Also, \\b only matches at a word boundary, and # is not a word character, so there is no boundary between a space (or the start of the input) and #. You need to use an alternation that detects either the beginning of the input or a whitespace character.
Your code would need to look like this (non fully qualified classes):
Pattern p = Pattern.compile("(^|\\s)#SP\\b", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher("s #SP s");
if (m.find()) {
System.out.println("Match!");
}
You're doing fine, but the \b in front of the # is misleading. \b is a word boundary, but # is already not a word character (i.e. it isn't in the set [0-9A-Za-z_]). Therefore, the space before the # isn't considered a word boundary. Change to:
java.util.regex.Pattern p =
java.util.regex.Pattern.compile("(^|\\s)#SP\\b",
java.util.regex.Pattern.CASE_INSENSITIVE);
The (^|\s) means: match either ^ OR \s, where ^ means the beginning of your string (e.g. "#SP String"), and \s means a whitespace character.
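As a quick check, the sketch below runs the suggested pattern against the sample strings from the question and contrasts it with the original pattern that had \b in front of the #. The class and method names are assumptions made for this example.

```java
import java.util.regex.Pattern;

final class SpTagMatcher {
    // Pattern suggested above: start of input or whitespace, then #SP, then a word boundary.
    private static final Pattern SP_TAG =
            Pattern.compile("(^|\\s)#SP\\b", Pattern.CASE_INSENSITIVE);

    // Pattern from the question, kept only for comparison.
    private static final Pattern WITH_LEADING_BOUNDARY =
            Pattern.compile("\\b#SP\\b", Pattern.CASE_INSENSITIVE);

    static boolean containsSpTag(String s) {
        return SP_TAG.matcher(s).find();
    }

    static boolean containsWithLeadingBoundary(String s) {
        return WITH_LEADING_BOUNDARY.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "This is a sample #sp string", "#SP string text...", "String text #Sp",
            "Anything with #Spider", "#Spin #Spoon #SPORK"
        };
        for (String s : samples) {
            System.out.println(containsSpTag(s) + "  " + s);
        }
        // The leading \b needs a word character right before the '#'.
        System.out.println(containsWithLeadingBoundary("s #SP s")); // false
        System.out.println(containsWithLeadingBoundary("a#SP"));    // true
    }
}
```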
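The regular expression "\\w*\\s*#sp\\w*\\s*" will match 0 or more word characters, followed by 0 or more whitespace characters, followed by #sp, followed by 0 or more word characters, followed by 0 or more whitespace characters. My suggestion is not to use \s* to delimit words in your expression; use \b after the tag instead. A \b in front of the # only matches when the preceding character is a word character, so anchor the front with (^|\s):
"(^|\\s)#sp\\b"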
Related
I am having some Java Pattern problems. This is my pattern:
"^[\\p{L}\\p{Digit}~._-]+$"
It matches letters, digits and some special characters; basically anything that wouldn't scramble a URL.
What I would like is to find the first character in the input that does not match this pattern. Basically, the user sends a text as input and I have to validate it and throw an exception if I find an illegal character.
I tried negating this pattern, but it wouldn't compile properly. Also find() didn't help out much.
A legal input would be hello while ?hello should not be, and my exception should point out that ? is not proper.
I would prefer a suggestion using Java's Matcher, Pattern or something else from java.util.regex. It's not a necessity, but checking each character in the string individually is not a solution.
Edit: I came up with a better regex to match unreserved URI characters
Try this :
^[\\p{L}\\p{Digit}.'-.'_]*([^\\p{L}\\p{Digit}.'-.'_]).*$
The first non-matching character is captured in group 1.
I tried a few cases here: http://fiddle.re/gkkzm61
Explanation:
I negated your pattern, which gives this:
[^\\p{L}\\p{Digit}.'-.'_]     [^...] means every character except
 ^                      ^     the ones listed inside.
 |  your pattern inside |
The pattern has 3 parts:
^[\\p{L}\\p{Digit}.'-.'_]*
Consumes the input from the first character until it meets a non-matching character.
([^\\p{L}\\p{Digit}.'-.'_])
The non-matching character (the negated class) inside a capturing group.
.*$
Any characters up to the end of the string.
Hope it helps.
EDIT:
The correct regex should be:
^[\\p{L}\\p{Digit}~._-]*([^\\p{L}\\p{Digit}~._-]).*$
It is the same method; I only changed the contents of the first and second parts.
I tried it and it seems to work.
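For illustration, here is one way the capturing-group regex could be wired into a validation method. This is only a sketch: the class name, method names and exception message are invented for the example, and Pattern.DOTALL is an added assumption so that line breaks after the illegal character do not prevent a match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FirstIllegalChar {
    // Group 1 captures the first character outside the allowed set.
    private static final Pattern FIRST_ILLEGAL = Pattern.compile(
            "^[\\p{L}\\p{Digit}~._-]*([^\\p{L}\\p{Digit}~._-]).*$", Pattern.DOTALL);

    /** Returns the first disallowed character, or null if every character is allowed. */
    static String firstIllegal(String input) {
        Matcher m = FIRST_ILLEGAL.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    /** Hypothetical validation entry point: throws if an illegal character is present. */
    static void validate(String input) {
        String bad = firstIllegal(input);
        if (bad != null) {
            throw new IllegalArgumentException("Illegal character: '" + bad + "'");
        }
    }

    public static void main(String[] args) {
        System.out.println(firstIllegal("hello"));  // null
        System.out.println(firstIllegal("?hello")); // ?
        System.out.println(firstIllegal("hel?lo")); // ?
    }
}
```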
The "^[\\p{L}\\p{Digit}.'-.'_]+$" pattern matches any string consisting of 1+ characters from the character class. Note that the doubled ' and . are suspicious: you might be unaware that '-. creates a range that matches any of the characters '()*+,-. (apostrophe through dot). If that is not on purpose, I think you meant to use .'_-.
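The range effect is easy to verify with a small sketch. The helper name inClass is made up for this example; it simply tests one character against a character class.

```java
final class CharClassRange {
    // Hypothetical helper: does the single character c match the given character class?
    static boolean inClass(String charClass, char c) {
        return String.valueOf(c).matches(charClass);
    }

    public static void main(String[] args) {
        // '-. is a range from the apostrophe to the dot, so '*' and '+' fall inside it.
        System.out.println(inClass("['-.]", '*')); // true
        System.out.println(inClass("['-.]", '+')); // true
        // With the hyphen placed last it is a literal hyphen, not a range operator.
        System.out.println(inClass("[.'-]", '*')); // false
        System.out.println(inClass("[.'-]", '-')); // true
    }
}
```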
To check whether a string starts with a character other than those defined in the character class, you can negate the character class and check only the first character in the string:
if (str.matches("[^\\p{L}\\p{Digit}.'_-].*")) {
/* String starts with the disallowed character */
}
I also think you can shorten the regex to "(?U)[^\\w.'-].*". At any rate, \\p{Digit} can be replaced with \\d.
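Try this to find the first invalid character:
// Inside [...] only a leading ^ negates; any further ^ would be a literal caret,
// and '-. would form a range, so the hyphen is placed last here.
Pattern negPattern = Pattern.compile(".*?([^\\p{L}\\p{Digit}.'_-]).*");
Matcher matcher = negPattern.matcher("hel?lo");
if (matcher.matches())
{
System.out.println("'" + matcher.group(1).charAt(0) + "'");
}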
I am currently learning how to write regular expressions in Java by trying to match simple Hashtag pattern. The Hashtags obey the following conditions:
It starts with a hashtag: #
It has to contain at least 1 letter: [a-zA-Z]
It can contain any of the characters from the class [a-zA-Z0-9_]
It cannot be preceded by a character of the class [a-zA-Z0-9_]
Based on this, I thought that the correct regular expression is:
PATTERN = "(?<![a-zA-Z0-9_])#(?=.*[a-zA-Z])[a-zA-Z0-9_]+"
Here I'm using a lookahead (?=.*[a-zA-Z]) to make sure Condition 2 holds and using a lookbehind (?<![a-zA-Z0-9_]) to make sure Condition 4 holds. I'm less certain about ending with a +.
This works on simple test cases but fails on complicated ones such as:
String text = "####THIS_IS_A_HASHTAG; ;#This_1_2...#12_and_this but not #123 or #this# #or#that";
where it does not match #THIS_IS_A_HASHTAG, #This_1_2 and #12_and_this
Could someone explain what I'm doing wrong?
This lookahead:
(?=.*[a-zA-Z])
may produce wrong results when the input looks like this:
####12345...#12_and_this
It gives you 2 matches, #12345 and #12_and_this, whereas by your rules only the 2nd should be a valid match.
To fix this you can use this regex:
(?<![a-zA-Z0-9_])#(?=[0-9_]*[a-zA-Z])[a-zA-Z0-9_]+
Here the lookahead (?=[0-9_]*[a-zA-Z]) asserts that a letter follows the #, with only digits or underscores allowed in between.
Here is a regex demo for you
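The fixed pattern can also be exercised directly in Java. The following is a minimal sketch; the class and method names are assumptions for the example, and the test string is the one from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HashtagFinder {
    // Lookbehind: no word character before '#'.
    // Lookahead: a letter must follow, with only digits/underscores before it.
    private static final Pattern HASHTAG = Pattern.compile(
            "(?<![a-zA-Z0-9_])#(?=[0-9_]*[a-zA-Z])[a-zA-Z0-9_]+");

    static List<String> findHashtags(String text) {
        List<String> tags = new ArrayList<>();
        Matcher m = HASHTAG.matcher(text);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        String text = "####THIS_IS_A_HASHTAG; ;#This_1_2...#12_and_this "
                + "but not #123 or #this# #or#that";
        // Prints [#THIS_IS_A_HASHTAG, #This_1_2, #12_and_this, #this, #or]
        System.out.println(findHashtags(text));
        // Prints [#12_and_this]: #12345 is rejected because no letter follows the digits.
        System.out.println(findHashtags("####12345...#12_and_this"));
    }
}
```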
How about this?
(example here)
String text = "####THIS_IS_A_HASHTAG;;;#This_1_2...#12_and_this ";
String regex = "#[A-Za-z0-9_]+";
Matcher m = Pattern.compile(regex).matcher(text);
while (m.find()) {
System.out.println(m.group());
}
It looks like it meets your criteria as stated:
#THIS_IS_A_HASHTAG
#This_1_2
#12_and_this
I have a series of strings in which I am searching for a particular combination of characters. I am looking for a digit, followed by the letter m or M, followed by a digit, then followed by the letter f or F.
An example string is "Class (4) 1m5f Good"; the part I want to extract from it is 1m5f.
Here is the code I have, which doesn't work.
Pattern distancePattern = Pattern.compile("\\^[0-9]{1}[m|M]{1}[0-9]{1}[f|F]{1}$\\");
Matcher distanceMatcher = distancePattern.matcher(raceDetails.toString());
while (distanceMatcher.find()) {
String word= distanceMatcher.group(0);
System.out.println(word);
}
Can anyone suggest what I am doing wrong?
The ^ and $ characters at the start and end of your regex are anchors - they're limiting you to strings that only consist of the pattern you're looking for. The first step is to remove those.
You can then either use word boundaries (\b) to limit the pattern you're looking for to be an entire word, like this:
Pattern distancePattern = Pattern.compile("\\b\\d[mM]\\d[fF]\\b");
...or, if you don't mind your pattern appearing in the middle of a word, e.g., "Class (4) a1m5f Good", you can drop the word boundaries:
Pattern distancePattern = Pattern.compile("\\d[mM]\\d[fF]");
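To see how the two variants differ, here is a small sketch that runs both patterns over the example strings. The class and method names are invented for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DistanceFinder {
    private static final Pattern WHOLE_WORD = Pattern.compile("\\b\\d[mM]\\d[fF]\\b");
    private static final Pattern ANYWHERE = Pattern.compile("\\d[mM]\\d[fF]");

    // Returns the first match of p in s, or null if there is none.
    private static String firstMatch(Pattern p, String s) {
        Matcher m = p.matcher(s);
        return m.find() ? m.group() : null;
    }

    static String wholeWord(String s) {
        return firstMatch(WHOLE_WORD, s);
    }

    static String anywhere(String s) {
        return firstMatch(ANYWHERE, s);
    }

    public static void main(String[] args) {
        System.out.println(wholeWord("Class (4) 1m5f Good"));  // 1m5f
        System.out.println(wholeWord("Class (4) a1m5f Good")); // null
        System.out.println(anywhere("Class (4) a1m5f Good"));  // 1m5f
    }
}
```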
Quick notes:
You don't really need the {1}s everywhere - the default assumption is that a character or character class occurs once.
You can replace the [0-9] character class with \d (it means the same thing).
Both links are to regular-expressions.info, a great resource for learning about regexes that I highly recommend you check out :)
I'd use word boundaries \b:
\b\d[mM]\d[fF]\b
For Java, the backslashes must be escaped:
\\b\\d[mM]\\d[fF]\\b
{1} is superfluous
[m|M] means m or | or M
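A short sketch shows the consequence of the stray | inside the character class: a literal pipe is accepted where a letter was intended. The class and method names are assumptions for this example.

```java
final class PipeInCharClass {
    // Character classes as written in the question (anchors and {1} dropped).
    static boolean matchesWithPipe(String s) {
        return s.matches("[0-9][m|M][0-9][f|F]");
    }

    // Character classes without the stray '|'.
    static boolean matchesWithoutPipe(String s) {
        return s.matches("\\d[mM]\\d[fF]");
    }

    public static void main(String[] args) {
        System.out.println(matchesWithPipe("1|5f"));    // true: '|' is a member of [m|M]
        System.out.println(matchesWithoutPipe("1|5f")); // false
        System.out.println(matchesWithoutPipe("1m5f")); // true
    }
}
```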
For the requirement of a digit, followed by the letter m or M, followed by a digit, then followed by the letter f or F, the regex can be simplified to:
Pattern distancePattern = Pattern.compile("(?i)\\dm\\df");
Where:
(?i) - For ignore case
\\d - For digits [0-9]
I want to remove all the leading and trailing punctuation in a string. How can I do this?
Basically, I want to preserve punctuation in between words, and I need to remove all leading and trailing punctuation.
., #, _, &, /, - are allowed if surrounded by letters or digits
' is allowed if preceded by a letter or digit
I tried
Pattern p = Pattern.compile("(^\\p{Punct})|(\\p{Punct}$)");
Matcher m = p.matcher(term);
boolean a = m.find();
if(a)
term=term.replaceAll("(^\\p{Punct})", "");
but it didn't work!!
OK. So basically you want to find some pattern in your string and act if the pattern is matched.
Doing this the naive way would be tedious. The naive solution could involve something like
while (myString.startsWith(".") || myString.startsWith(",") || myString.startsWith(";") /* || ... */)
    myString = myString.substring(1);
If you wanted to do a slightly more complex task, it might even be impossible to do it the way I mentioned.
That's why we use regular expressions. It's a "language" with which you can define a pattern, and the computer can then tell whether a string matches that pattern. To learn about regular expressions, just type it into Google. One of the first links: http://www.codeproject.com/Articles/9099/The-30-Minute-Regex-Tutorial
As for your problem, you could try this:
myString = myString.replaceFirst("^[^a-zA-Z]+", "");
The meaning of the regex:
The first ^ means that what comes next has to be at the start of the string.
The [] define a set of characters. In this case, those are characters that are NOT (the second ^) letters (a-zA-Z).
The + sign means that the thing before it must occur one or more times.
You can use a similar regex to remove trailing characters:
myString = myString.replaceAll("[^a-zA-Z]+$", "");
The $ means "at the end of the string".
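Put together, the two replacements might look like the sketch below. Strings in Java are immutable, so the result of each call has to be captured. The class and method names are made up for the example, and note that this character class strips leading and trailing digits as well, since it keeps only ASCII letters at the edges.

```java
final class TrimNonLetters {
    // Removes everything before the first ASCII letter and after the last one.
    static String trim(String s) {
        String withoutLeading = s.replaceFirst("^[^a-zA-Z]+", "");
        return withoutLeading.replaceFirst("[^a-zA-Z]+$", "");
    }

    public static void main(String[] args) {
        System.out.println(trim("...hello, world!!")); // hello, world
        System.out.println(trim("123abc456"));         // abc
    }
}
```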
You could use a regular expression:
private static final Pattern PATTERN =
Pattern.compile("^\\p{Punct}*(.*?)\\p{Punct}*$");
public static String trimPunctuation(String s) {
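Matcher m = PATTERN.matcher(s);
// Return the input unchanged if nothing matches (e.g. multi-line input),
// because group(1) throws when find() has failed.
return m.find() ? m.group(1) : s;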
}
The boundary matchers ^ and $ ensure the whole input is matched.
A dot . matches any single character.
A star * means "match the preceding thing zero or more times".
The parentheses () define a capturing group whose value is retrieved by calling Matcher.group(1).
The ? in (.*?) means you want the match to be non-greedy, otherwise the trailing punctuation would be included in the group.
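The effect of the non-greedy quantifier can be seen by comparing it with a greedy group in an otherwise identical pattern. This is a small sketch; the class name and the lazy/greedy helper names are assumptions for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LazyVsGreedy {
    private static final Pattern LAZY = Pattern.compile("^\\p{Punct}*(.*?)\\p{Punct}*$");
    private static final Pattern GREEDY = Pattern.compile("^\\p{Punct}*(.*)\\p{Punct}*$");

    // Returns group 1 of the first match, or the input unchanged if nothing matches.
    private static String group1(Pattern p, String s) {
        Matcher m = p.matcher(s);
        return m.find() ? m.group(1) : s;
    }

    static String lazy(String s) {
        return group1(LAZY, s);
    }

    static String greedy(String s) {
        return group1(GREEDY, s);
    }

    public static void main(String[] args) {
        System.out.println(lazy("..hello!!"));   // hello
        System.out.println(greedy("..hello!!")); // hello!!  (the group swallows the trailing punctuation)
        System.out.println(lazy("(a.b)"));       // a.b      (inner punctuation is preserved)
    }
}
```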
Use this tutorial on patterns. You have to create a regex that matches a string starting with a letter or digit and ending with a letter or digit, and then call inputString.matches("regex").
How can I create a regular expression to search for strings with a given pattern? For example, I want to find all strings that match the pattern '*index.tx?'. This should find strings such as index.txt, mainindex.txt and somethingindex.txp.
Pattern pattern = Pattern.compile("*.html");
Matcher m = pattern.matcher("input.html");
This code is obviously not working.
You need to learn regular expression syntax. It is not the same as using wildcards. Try this:
Pattern pattern = Pattern.compile("^.*index\\.tx.$");
There is a lot of information about regular expressions here. You may find the program RegexBuddy useful while you are learning regular expressions.
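As a sanity check, the suggested regex can be tried against the file names from the question. This is a minimal sketch with an invented class and method name.

```java
import java.util.regex.Pattern;

final class IndexFileMatcher {
    // Regex equivalent of the wildcard *index.tx? : anything, "index.tx", then exactly one character.
    private static final Pattern INDEX_FILE = Pattern.compile("^.*index\\.tx.$");

    static boolean isIndexFile(String name) {
        return INDEX_FILE.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIndexFile("index.txt"));          // true
        System.out.println(isIndexFile("mainindex.txt"));      // true
        System.out.println(isIndexFile("somethingindex.txp")); // true
        System.out.println(isIndexFile("index.tx"));           // false: the final . needs one character
        System.out.println(isIndexFile("indexatxt"));          // false: \\. only matches a literal dot
    }
}
```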
The code you posted does not work because:
dot . is a special regex character. It means one instance of any character.
* means any number of occurrences of the preceding character.
therefore, .* means any number of occurrences of any character.
so you would need something like
Pattern pattern = Pattern.compile(".*\\.html.*");
The \\ is needed because we want a literal dot, and the dot is otherwise a special regex character.
This means: match a string that starts with any number of arbitrary characters, followed by a dot, followed by html, followed by anything.
* matches zero or more occurrences of the preceding token, so if you want to match zero or more of any character, use .* instead (. matches any char).
Modified regex should look something like this:
Pattern pattern = Pattern.compile("^.*\\.html$");
^ matches the start of the string
.* matches zero or more of any char
\\. matches the dot char (if not escaped it would match any char)
$ matches the end of the string
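The points above can be checked with a short sketch: the wildcard-style pattern from the question is rejected by the regex compiler because * has nothing to repeat, while the two corrected patterns differ in what they allow after ".html". The class and method names are assumptions for the example.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class HtmlSuffix {
    // Reports whether the string compiles as a Java regex at all.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Anchored version: the name must end with ".html".
    static boolean endsWithHtml(String name) {
        return name.matches("^.*\\.html$");
    }

    // Unanchored version: ".html" may be followed by anything.
    static boolean containsDotHtml(String name) {
        return name.matches(".*\\.html.*");
    }

    public static void main(String[] args) {
        System.out.println(isValidRegex("*.html"));            // false: dangling '*'
        System.out.println(endsWithHtml("input.html"));        // true
        System.out.println(endsWithHtml("input.html.bak"));    // false
        System.out.println(containsDotHtml("input.html.bak")); // true
    }
}
```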