Find string with special char using regex - java

I need to scroll a List and removing all strings that contains some special char. Using RegEx I'm able to remove all string that start with these special chars but, how can I find if this special char is in the middle of the string?
For instance:
Pattern.matches("[()<>/;\\*%$].*", "(123)")
returns true and I can remove this string
but it doesn't works with this kind of string: 12(3).
Is it correct to use \* to find the occurrence of "*" char into the string?
Thanks for the help!
Andrea

You are yet another victim of Java's ill-named .matches() which tries and match the whole input and contradicts the very definition of regex matching.
What you want is matching one character among ()<>/;\\*%$. With Java, you need to create a Pattern, a Matcher from this Pattern and use .find() on this matcher:
final Pattern p = pattern.compile("[()<>/;\\*%$]");
final Matcher m = p.matcher(yourinput);
if (m.find()) // match, proceed

Try the following:
!Pattern.matches("^[^()<>/;\\*%$]*$", "(123)")
This uses a negated character class to ensure that all the characters in the string are not any of the characters in the class.
You then obviously negate the expression since you are testing for a string that does not match.
Is it correct to use \* to find the occurrence of "*" char into the string?
Yes.

Pattern.matches() tries to match the whole input. So since your regex says that the input has to start with a "special" char, 12(3) doesn't match.

Related

Regex first character not matching

I am having some Java Pattern problems. This is my pattern:
"^[\\p{L}\\p{Digit}~._-]+$"
It matches any letter of the US-ASCII, numerals, some special characters, basically anything that wouldn't scramble an URL.
What I would like is to find the first letter in a word that does not match this pattern. Basically the user sends a text as an input and I have to validate it and to throw an exception if I find an illegal character.
I tried negating this pattern, but it wouldn't compile properly. Also find() didn't help out much.
A legal input would be hello while ?hello should not be, and my exception should point out that ? is not proper.
I would prefer a suggestion using Java's Matcher, Pattern or something using util.regex. Its not a necessity, but checking each character in the string individually is not a solution.
Edit: I came up with a better regex to match unreserved URI characters
Try this :
^[\\p{L}\\p{Digit}.'-.'_]*([^\\p{L}\\p{Digit}.'-.'_]).*$
The first character non matching is the group n°1
I made a few try here : http://fiddle.re/gkkzm61
Explanation :
I negate your pattern, so i built this :
[^\\p{L}\\p{Digit}.'-.'_] [^...] means every character except for
^ ^ the following ones.
| your pattern inside |
The pattern has 3 parts :
^[\\p{L}\\p{Digit}.'-.'_]*
Checks the regex from the first character until he meets a non matching character
([^\\p{L}\\p{Digit}.'-.'_])
The non-matching character (negation) inside a capturing group
.*$
Any character until the end of the string.
Hope it helps you
EDIT :
The correct regex shoud be :
^[\\p{L}\\p{Digit}~._-]*([^\\p{L}\\p{Digit}~._-]).*$
It is the same method, i only change the contents of the first and second part.
I tried and it seems to work.
The "^[\\p{L}\\p{Digit}.'-.'_]+$" pattern matches any string containing 1+ characters defined inside the character class. Note that double ' and . are suspicious and you might be unaware of the fact that '-. creates a range and matches '()*+,-.. If it is not on purpose, I think you meant to use .'_-.
To check if a string starts with a character other than the one defined in the character class, you can negated the character class, and check the first character in the string only:
if (str.matches("[^\\p{L}\\p{Digit}.'_-].*")) {
/* String starts with the disallowed character */
}
I also think you can shorten the regex to "(?U)[^\\w.'-].*". At any rate, \\p{Digit} can be replaced with \\d.
Try out this one to find the first non valid char:
Pattern negPattern = Pattern.compile(".*?([^\\p{L}^\\p{Digit}^.^'-.'^_]+).*");
Matcher matcher = negPattern.matcher("hel?lo");
if (matcher.matches())
{
System.out.println("'" + matcher.group(1).charAt(0) + "'");
}

How to find the exact word using a regex in Java?

Consider the following code snippet:
String input = "Print this";
System.out.println(input.matches("\\bthis\\b"));
Output
false
What could be possibly wrong with this approach? If it is wrong, then what is the right solution to find the exact word match?
PS: I have found a variety of similar questions here but none of them provide the solution I am looking for.
Thanks in advance.
When you use the matches() method, it is trying to match the entire input. In your example, the input "Print this" doesn't match the pattern because the word "Print" isn't matched.
So you need to add something to the regex to match the initial part of the string, e.g.
.*\\bthis\\b
And if you want to allow extra text at the end of the line too:
.*\\bthis\\b.*
Alternatively, use a Matcher object and use Matcher.find() to find matches within the input string:
Pattern p = Pattern.compile("\\bthis\\b");
Matcher m = p.matcher("Print this");
m.find();
System.out.println(m.group());
Output:
this
If you want to find multiple matches in a line, you can call find() and group() repeatedly to extract them all.
Full example method for matcher:
public static String REGEX_FIND_WORD="(?i).*?\\b%s\\b.*?";
public static boolean containsWord(String text, String word) {
String regex=String.format(REGEX_FIND_WORD, Pattern.quote(word));
return text.matches(regex);
}
Explain:
(?i) - ignorecase
.*? - allow (optionally) any characters before
\b - word boundary
%s - variable to be changed by String.format (quoted to avoid regex
errors)
\b - word boundary
.*? - allow (optionally) any characters after
For a good explanation, see: http://www.regular-expressions.info/java.html
myString.matches("regex") returns true or false depending whether the
string can be matched entirely by the regular expression. It is
important to remember that String.matches() only returns true if the
entire string can be matched. In other words: "regex" is applied as if
you had written "^regex$" with start and end of string anchors. This
is different from most other regex libraries, where the "quick match
test" method returns true if the regex can be matched anywhere in the
string. If myString is abc then myString.matches("bc") returns false.
bc matches abc, but ^bc$ (which is really being used here) does not.
This writes "true":
String input = "Print this";
System.out.println(input.matches(".*\\bthis\\b"));
You may use groups to find the exact word. Regex API specifies groups by parentheses. For example:
A(B(C))D
This statement consists of three groups, which are indexed from 0.
0th group - ABCD
1st group - BC
2nd group - C
So if you need to find some specific word, you may use two methods in Matcher class such as: find() to find statement specified by regex, and then get a String object specified by its group number:
String statement = "Hello, my beautiful world";
Pattern pattern = Pattern.compile("Hello, my (\\w+).*");
Matcher m = pattern.matcher(statement);
m.find();
System.out.println(m.group(1));
The above code result will be "beautiful"
Is your searchString going to be regular expression? if not simply use String.contains(CharSequence s)
System.out.println(input.matches(".*\\bthis$"));
Also works. Here the .* matches anything before the space and then this is matched to be word in the end.

regex to find substring between special characters

I am running into this problem in Java.
I have data strings that contain entities enclosed between & and ; For e.g.
&Text.ABC;, &Links.InsertSomething;
These entities can be anything from the ini file we have.
I need to find these string in the input string and remove them. There can be none, one or more occurrences of these entities in the input string.
I am trying to use regex to pattern match and failing.
Can anyone suggest the regex for this problem?
Thanks!
Here is the regex:
"&[A-Za-z]+(\\.[A-Za-z]+)*;"
It starts by matching the character &, followed by one or more letters (both uppercase and lower case) ([A-Za-z]+). Then it matches a dot followed by one or more letters (\\.[A-Za-z]+). There can be any number of this, including zero. Finally, it matches the ; character.
You can use this regex in java like this:
Pattern p = Pattern.compile("&[A-Za-z]+(\\.[A-Za-z]+)*;"); // java.util.regex.Pattern
String subject = "foo &Bar; baz\n";
String result = p.matcher(subject).replaceAll("");
Or just
"foo &Bar; baz\n".replaceAll("&[A-Za-z]+(\\.[A-Za-z]+)*;", "");
If you want to remove whitespaces after the matched tokens, you can use this re:
"&[A-Za-z]+(\\.[A-Za-z]+)*;\\s*" // the "\\s*" matches any number of whitespace
And there is a nice online regular expression tester which uses the java regexp library.
http://www.regexplanet.com/simple/index.html
You can try:
input=input.replaceAll("&[^.]+\\.[^;]+;(,\\s*&[^.]+\\.[^;]+;)*","");
See it

regular expressions using java.util.regex API- java

How can I create a regular expression to search strings with a given pattern? For example I want to search all strings that match pattern '*index.tx?'. Now this should find strings with values index.txt,mainindex.txt and somethingindex.txp.
Pattern pattern = Pattern.compile("*.html");
Matcher m = pattern.matcher("input.html");
This code is obviously not working.
You need to learn regular expression syntax. It is not the same as using wildcards. Try this:
Pattern pattern = Pattern.compile("^.*index\\.tx.$");
There is a lot of information about regular expressions here. You may find the program RegexBuddy useful while you are learning regular expressions.
The code you posted does not work because:
dot . is a special regex character. It means one instance of any character.
* means any number of occurrences of the preceding character.
therefore, .* means any number of occurrences of any character.
so you would need something like
Pattern pattern = Pattern.compile(".*\\.html.*");
the reason for the \\ is because we want to insert dot, although it is a special regex sign.
this means: match a string in which at first there are any number of wild characters, followed by a dot, followed by html, followed by anything.
* matches zero or more occurrences of the preceding token, so if you want to match zero or more of any character, use .* instead (. matches any char).
Modified regex should look something like this:
Pattern pattern = Pattern.compile("^.*\\.html$");
^ matches the start of the string
.* matches zero or more of any char
\\. matches the dot char (if not escaped it would match any char)
$ matches the end of the string

Problem replacing words using [^a-zA-Z] regex

Just could not get this one and googling did not help much either..
First something that I know: Given a string and a regex, how to replace all the occurrences of strings that matches this regular expression by a replacement string ? Use the replaceAll() method in the String class.
Now something that I am unable to do. The regex I have in my code now is [^a-zA-Z] and I know for sure that this regex is definitely going to have a range. Only some more characters might be added to the list. What I need as output in the code below is Worksheet+blah but what I get using replaceAll() is Worksheet++++blah
String homeworkTitle = "Worksheet%#5_blah";
String unwantedCharactersRegex = "[^a-zA-Z]";
String replacementString = "+";
homeworkTitle = homeworkTitle.replaceAll(unwantedCharactersRegex,replacementString);
System.out.println(homeworkTitle);
What is the way to achieve the output that I wish for? Are there any Java methods that I am missing here?
[^a-zA-Z]+
Will do it nicely.
You just need a greedy quantifier in order to match as many non-alphabetical characters you can, and replace the all match by one '+' (a - by default - greedy quantifier)
Note: [^a-zA-Z]+? would make the '+' quantifier lazy, and would have give you the same result than [^a-zA-Z], since it would only have matched only one non-alphabetical character at a time.
String unwantedCharactersRegex = "[^a-zA-Z]"
This matches a single non-letter. So each single non-letter is replaced by a +. You need to say "one or more", so try
String unwantedCharactersRegex = "[^a-zA-Z]+"

Categories