Unescaped "." still matches when used in a negation group - java

I made, what I believed to be, an error in a regular expression in Java recently but when I test my code I don't get the error I expect.
The expression I created was meant to replace a password in a string that I received from another source. The pattern I used went along the lines of: "password: [^\\s.]*", the idea being that it would match the word "password" the colon, a space, then any characters except for a space or a full-stop (period). I would then replace the instance with "password: XXXXXX" and therefore mask it.
The obvious error should be that I have forgotten to escape the full-stop. In other words, the proper expression should have been "password: [^\\s\\.]*". Thing is, if I don't escape the full-stop the code still works!
Here's some sample code:
import java.util.regex.*;
public class SimpleRegexTest {
public static void main(String[] args) {
Pattern simplePattern = Pattern.compile("password: [^\\s.]*");
Matcher simpleMatcher = simplePattern.matcher("password: newpass. Enjoy.");
String maskedString = simpleMatcher.replaceAll("password: XXXXXX");
System.out.println(maskedString);
}
}
When I run the above code I get the following output:
password: XXXXXX. Enjoy.
Is this a special case, or have I completely missed something?
(edit: changed to "escape the full-stop")
Michael Borgwardt: I couldn't think of another term to describe what I was doing apart from "negation group", sorry for the ambiguity.
Aviator: In this case, no, a space won't be in the password. I didn't make the rules ;-).
(edit: doubled up the slashes in the non-code text so it displays properly, added the ^ which was in the code, but not the text :-/)
Sundar: Fixed the double slashes, SO seems to have its own escape characters.

A period ('.' character) does not need to be escaped inside a character class [] in a regular expression.
From the API:
Note that a different set of metacharacters are in effect inside a character class than outside a character class. For instance, the regular expression . loses its special meaning inside a character class, while the expression - becomes a range forming metacharacter.
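To illustrate the point from the API note, here is a minimal sketch that runs the question's pattern with and without the escaped dot. The class name DotInCharClassDemo and the helper mask are made up for this example; only the two patterns and the sample input come from the question.

```java
import java.util.regex.Pattern;

public class DotInCharClassDemo {
    // Hypothetical helper: masks the password using the given pattern source.
    static String mask(String regex, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll("password: XXXXXX");
    }

    public static void main(String[] args) {
        String input = "password: newpass. Enjoy.";
        // Unescaped dot inside the class: treated as a literal full stop.
        System.out.println(mask("password: [^\\s.]*", input));   // password: XXXXXX. Enjoy.
        // Escaped dot inside the class: same meaning, same output.
        System.out.println(mask("password: [^\\s\\.]*", input)); // password: XXXXXX. Enjoy.
        // Outside a class, an unescaped dot matches any character...
        System.out.println("axb".matches("a.b"));   // true
        // ...while inside a class it only matches a literal dot.
        System.out.println("axb".matches("a[.]b")); // false
    }
}
```

Escaping the dot inside the brackets is harmless, so either spelling can be kept; it just isn't required.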

It looks like you got the negation operator mixed up for regex ranges.
In particular, my understanding is that you used the snippet [\s.]* to mean "any characters except for a space or a full-stop (period)." This would in fact be expressed as [^ .]*, using the caret to negate the characters in the set.
I don't know if this was just a typo in your post or what was actually in your code, but the regex as it stands in your question will match the word "password", a colon, a space, then any sequence of backslash characters, "s" characters or periods.

Related

Using regex to only match those Strings which use escape character correctly (according to Java syntax)?

take these strings for example:
"hello world\n" (correct - regex should match this)
"I'm happy \ here" (this is incorrect as the escape character is not
used correctly - regex should not match this one)
I've tried searching on google but didn't find anything helpful.
I want this one to be used in a parser which only parses string literals from a java code file.
Here is the regex I used:
"\\\"(\\[tbnrf\'\"\\])*[a-zA-Z0-9\\`\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)\\_\\-\\+\\=\\|\\{\\[\\}\\]\\;\\:\\'\\/\\?\\>\\.\\<\\,]\\\""
what am I doing wrong?
I guess you gave us the regex in Java String literal form, like
String regex = "\\\"(\\[tbnrf\'\"\\])*[a-zA-Z0-9\\`\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)\\_\\-\\+\\=\\|\\{\\[\\}\\]\\;\\:\\'\\/\\?\\>\\.\\<\\,]\\\"";
Unpacking that from Java's String escaping syntax gives the raw regex:
\"(\[tbnrf'"\])*[a-zA-Z0-9\`\~\!\#\#\$\%\^\&\*\(\)\_\-\+\=\|\{\[\}\]\;\:\'\/\?\>\.\<\,]\"
That consists of:
\" matching a double-quote character (Java String literal begins here). Escaping the double quotes with backslash isn't necessary: " on its own is ok as well.
(\[tbnrf'"\])*: a group, repeated 0...n times. I guess you want that to match against the various Java backslash escapes, but that should read (\\[tbnrf'"\\])* with a double backslash in front and inside the character class. And maybe you want to cover the Java octal escapes as well (see the language specification), giving (\\[tbnrf01234567'"\\])*
[a-zA-Z0-9\`\~\!\#\#\$\%\^\&\*\(\)\_\-\+\=\|\{\[\}\]\;\:\'\/\?\>\.\<\,]: a character class matching one character from a selected list of alphabetic and punctuation characters. I'd replace that with [^"\\], meaning anything but double quote or backslash.
\" matching a double-quote character (string literal ends here). Once again, no need to escape the double quote.
Besides the individual elements, the overall structure of the regex probably isn't what you want: You allow only strings beginning with any number of backslash escapes, followed by exactly one non-escape character, and this enclosed in a pair of double quotes.
The overall structure should instead be "(backslash_escape|simple_character)*"
So, the complete regex would be:
"(\\[tbnrf01234567'"\\]|[^"\\])*"
or, expressed in a Java literal:
String regex = "\"(\\\\[tbnrf01234567'\"\\\\]|[^\"\\\\])*\"";
And, although this is shorter than your original attempt, I'd still not call it readable and opt for a different implementation, not using regular expressions.
P.S. Although I did some testing with my regex, I'm not at all sure that it covers all relevant cases correctly.
P.P.S. There are the \uxxxx escapes, not yet covered by the regex.
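As a rough check of the regex proposed in this answer, the following sketch runs it against the two sample strings from the question. The class name StringLiteralRegexDemo and the method isValidLiteral are invented for the example, and it shares the limitations already mentioned (no \uxxxx escapes, octal escapes limited to a single digit).

```java
import java.util.regex.Pattern;

public class StringLiteralRegexDemo {
    // Raw regex: "(\\[tbnrf01234567'"\\]|[^"\\])*"
    static final Pattern LITERAL =
        Pattern.compile("\"(\\\\[tbnrf01234567'\"\\\\]|[^\"\\\\])*\"");

    // Hypothetical helper: true if the source text is one well-formed literal.
    static boolean isValidLiteral(String sourceText) {
        return LITERAL.matcher(sourceText).matches();
    }

    public static void main(String[] args) {
        // Source text: "hello world\n" (quote, text, backslash, n, quote)
        System.out.println(isValidLiteral("\"hello world\\n\""));    // true
        // Source text: "I'm happy \ here" (backslash followed by a space)
        System.out.println(isValidLiteral("\"I'm happy \\ here\"")); // false
    }
}
```

Note that the test inputs are themselves Java literals, so every backslash in the text under test appears doubled in the demo source.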

String pattern that prevents the first character being a whitespace

I'm having a bit of difficulty figuring out a pattern that will allow anything to be entered, as long as the first character isn't a whitespace.
I've tried
String pattern = "[^\\s][a-zA-Z0-9\\W ]+";
and "([a-zA-Z0-9\\W]+)|(([a-zA-Z0-9\\W]+\\s[a-zA-Z0-9\\W]+)+)" as well as several other variants, with no success. Any help would be greatly appreciated.
I'm using Java btw
Does this work?
^[^\s].*
The first caret denotes start of line, and the second negation.
Most regular expression matching defaults to searching anywhere in the string for the pattern. Since you are concerned specifically with the beginning of the string, you should prefix the entire regex with '^' to anchor the match to the beginning of the input.
String pattern = "^[^\\s][a-zA-Z0-9\\W ]+";
It can be a bit confusing since ^ has a very different meaning when it appears inside square brackets. Inside the brackets, as you know, it signals matching the complement of (ie all characters except) the set of characters listed in the brackets. Outside, it is simply an anchor for the beginning of the string.
In this non-bracketed use, it is the opposite of $ which anchors a match at the end of a string, eg /end$/ will match "friend" but not "ending" - you can read more about anchors at this URL: http://www.regular-expressions.info/anchors.html
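A small sketch of the two anchors in Java, using Matcher.find() so that the anchors (and not the matching method) decide where the match may sit. The class and method names are placeholders for illustration.

```java
import java.util.regex.Pattern;

public class AnchorDemo {
    // "end" must sit at the end of the input.
    static boolean endsWithEnd(String s) {
        return Pattern.compile("end$").matcher(s).find();
    }

    // The first character must exist and must not be whitespace.
    static boolean startsWithNonWhitespace(String s) {
        return Pattern.compile("^[^\\s]").matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(endsWithEnd("friend"));                  // true
        System.out.println(endsWithEnd("ending"));                  // false
        System.out.println(startsWithNonWhitespace("hello world")); // true
        System.out.println(startsWithNonWhitespace(" hello"));      // false
    }
}
```

With String.matches() or Matcher.matches() the whole input has to match anyway, so the leading ^ is redundant there; it matters for find().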
Since you don't care about the rest, you can just use String.charAt(int) with Character.isSpaceChar(char), or String.codePointAt(int) with Character.isSpaceChar(int).
The second method is the correct way to deal with Unicode strings and code points in the astral planes, while the first method is broken for those, but usable when your input only has characters from the Basic Multilingual Plane (BMP).
Code for the second method:
boolean startWithSpace = Character.isSpaceChar(input.codePointAt(0));
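Character.isSpaceChar checks for space characters according to Unicode (the space, line and paragraph separator categories). Not to be confused with Character.isWhitespace, which checks for whitespace according to Java's own definition: that one includes tab and newline, which isSpaceChar does not, and excludes non-breaking spaces.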
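Because the two Character methods disagree on some characters, a cautious check can consult both. This is only a sketch: the class FirstCharCheck and its method are hypothetical, and treating an empty string as "does not start with whitespace" is an assumption you may want to change.

```java
public class FirstCharCheck {
    // Hypothetical helper: true if the first code point is whitespace by
    // either Java's definition or Unicode's space-separator categories.
    static boolean startsWithWhitespace(String input) {
        if (input.isEmpty()) {
            return false; // assumption: nothing to reject in an empty string
        }
        int first = input.codePointAt(0);
        return Character.isWhitespace(first) || Character.isSpaceChar(first);
    }

    public static void main(String[] args) {
        System.out.println(Character.isSpaceChar('\t'));       // false
        System.out.println(Character.isWhitespace('\t'));      // true
        System.out.println(Character.isSpaceChar('\u00A0'));   // true  (no-break space)
        System.out.println(Character.isWhitespace('\u00A0'));  // false
        System.out.println(startsWithWhitespace("\tindented")); // true
        System.out.println(startsWithWhitespace("plain"));      // false
    }
}
```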

Validate string has no illegal characters

I'm trying to validate a string that only allows letters, numbers and these characters:
!"#$%&'()*+,-./:;<=>?#[\]^_`{|}~
I tried doing this but it's not working: it allows me to enter characters not in the regex. I'm still pretty new to Java, and something similar was working in JavaScript, but I can't figure out what's going on here. I think it's running as: if it can't find any of the characters mentioned, then it will return four.
Pattern allowedCharacters = Pattern.compile("[A-Za-z0-9!\"#$%&'()*+,.\\/:;<=>?#[\\]^_`{|}~-]+$]");
if (!allowedCharacters.matcher(pw).find()){
return 4;
}
Any help is appreciated. Thanks
EDIT:
I also tried:
if (pw.matches("^[A-Za-z0-9!\"#$%&'()*+,.\\/:;<=>?#[\\]^_`{|}~-]+$]")){
return 4;
}
and
if (!pw.matches("[A-Za-z0-9!\"#$%&'()*+,.\\/:;<=>?#[\\]^_`{|}~-]+$]")){
return 4;
}
matcher.find() checks whether the string contains a substring that matches the regex, so with
!matcher.find() you are checking that there is no match of the regex anywhere in the tested string.
Consider using matcher.matches() to check whether the entire string is matched by the regex. In this case you will have to add a quantifier like *, + or {n,m} to the character class to decide on the password's length. Otherwise it will only accept single-character passwords.
Here is a demo of what your code could look like:
// here you place quantifier
// ↓
if (pw.matches("[A-Za-z0-9!\"#$%&'()*+,.\\/:;<=>?#[\\]^_`{|}~-]+$]+")){
System.out.println("password contains only valid characters");
} else {
System.out.println("invalid characters in password");
}
Update:
In your regex you are not escaping [, which makes [\]^_`{|}~-] a separate character class that is added to the outer character class. This character class will not include \ or [. If you are really interested in accepting only alphanumeric characters and !"#$%&'()*+,-./:;<=>?#[]^_`{|}~ then consider using
"[\\w\\Q!\"#$%&'()*+,-./:;<=>?#[\\]^_`{|}~\\E]+"
as regex.
\\w represents [a-zA-Z0-9_] (note that it includes the underscore),
and \Q and \E delimit a quoted section, which is a mechanism for escaping metacharacters that works even inside a character class.
It's because you're using find() and not matches(). That said, I'd try the opposite: doing find() on [^<legal chars>] (note the caret) to match any illegal character. It's faster because it stops as soon as it hits something illegal. Also, start with the simple legal characters, then move up from there. Regular expressions can get hard to read, and adding one char at a time that has special meaning is easier than adding them all at once.
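A minimal sketch of that "look for one illegal character" approach. The class and method names are invented, and the allowed set is my reading of the list in the question (letters, digits and the listed punctuation), with [, \ and ] escaped explicitly instead of relying on \Q...\E.

```java
import java.util.regex.Pattern;

public class IllegalCharCheck {
    // Matches any single character that is NOT a letter, a digit,
    // or one of the listed punctuation marks.
    static final Pattern ILLEGAL =
        Pattern.compile("[^A-Za-z0-9!\"#$%&'()*+,./:;<=>?\\[\\\\\\]^_`{|}~-]");

    // Hypothetical helper: true as soon as one illegal character is found.
    static boolean hasIllegalChar(String pw) {
        return ILLEGAL.matcher(pw).find();
    }

    public static void main(String[] args) {
        System.out.println(hasIllegalChar("abc123!?")); // false
        System.out.println(hasIllegalChar("a[b]\\c"));  // false (text is a[b]\c)
        System.out.println(hasIllegalChar("abc def"));  // true  (space)
        System.out.println(hasIllegalChar("tab\t"));    // true  (tab)
    }
}
```

One caveat with this design: an empty string contains no illegal character, so a minimum length has to be checked separately.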
Using other answers from this question, I found this to work for me. Nothing needs to be escaped between the \Q and \E. They do that for you.
Pattern whitelist = Pattern.compile("^[\\w\\s\\Q!\"#$%&'()*+,-.\\/:;<=>?#[]^_`{|}~\\E]+$");
if (!whitelist.matcher(pw).matches()) {
// error
}

Java - Unknown characters passing as [a-zA-z0-9]*?

I'm no expert in regex but I need to parse some input I have no control over, and make sure I filter away any strings that don't have A-z and/or 0-9.
When I run this,
Pattern p = Pattern.compile("^[a-zA-Z0-9]*$"); //fixed typo
if(!p.matcher(gottenData).matches())
System.out.println(someData); //someData contains gottenData
certain spaces + an unknown symbol somehow slip through the filter (gottenData is the red rectangle):
In case you're wondering, it DOES also display Text, it's not all like that.
For now, I don't mind the [?] as long as it also contains some string along with it.
Please help.
[EDIT] As far as I can tell from the (very large) input, the [?]'s are either white spaces or nothing at all; maybe there's some sort of encoding issue, also perhaps something to do with #text nodes (input is xml)
The * quantifier matches "zero or more", which means it will also match a string that does not contain any of the characters in your class, namely the empty string. Try the + quantifier, which means "one or more": ^[a-zA-Z0-9]+$ will match strings made up of alphanumeric characters only. ^.*[a-zA-Z0-9]+.*$ will match any string containing one or more alphanumeric characters, although the leading .* will make it much slower. If you use Matcher.lookingAt() instead of Matcher.matches(), it will not require a full string match (only a match starting at the beginning) and you can use the regex [a-zA-Z0-9]+.
You have an error in your regex: instead of [a-zA-z0-9]* it should be [a-zA-Z0-9]*.
You don't need ^ and $ around the regex.
Matcher.matches() always matches the complete string.
String gottenData = "a ";
Pattern p = Pattern.compile("[a-zA-z0-9]*");
if (!p.matcher(gottenData).matches())
System.out.println("doesn't match.");
this prints "doesn't match."
The correct answer is a combination of the above answers. First, I imagine your intended character match is [a-zA-Z0-9]. Note that A-z isn't as bad as you might think: it includes all characters in the ASCII range between A and z, which is the letters plus a few extras (specifically [, \, ], ^, _ and `).
A second potential problem, as Martin mentioned, is that you may need to put in the start and end anchors if you want the string to consist only of letters and numbers.
Finally, you use the * quantifier, which means 0 or more, so the pattern can match 0 characters: matches() returns true for an empty string, and an unanchored find() would succeed on any input. What you need is the + quantifier. So the pattern you are most likely looking for is:
^[a-zA-Z0-9]+$
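The differences discussed in these answers (* versus +, A-z versus A-Z, matches() versus find()) can be checked with a short sketch like this one. The class name and the two helper methods are made up for the example.

```java
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Whole-input match, as Matcher.matches() does.
    static boolean full(String regex, String s) {
        return Pattern.compile(regex).matcher(s).matches();
    }

    // Match anywhere in the input, as Matcher.find() does.
    static boolean partial(String regex, String s) {
        return Pattern.compile(regex).matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(full("[a-zA-Z0-9]*", ""));       // true:  * accepts the empty string
        System.out.println(full("[a-zA-Z0-9]+", ""));       // false: + needs at least one character
        System.out.println(full("[a-zA-Z0-9]*", "a "));     // false: matches() covers the whole input
        System.out.println(partial("[a-zA-Z0-9]*", "???")); // true:  find() is satisfied by an empty match
        System.out.println(full("[a-zA-z0-9]+", "a_b"));    // true:  A-z also covers [ \ ] ^ _ `
        System.out.println(full("[a-zA-Z0-9]+", "a_b"));    // false
    }
}
```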
You have to change the regexp to "^[a-zA-Z0-9]*$" to ensure that you are matching the entire string
Looks like it should be "a-zA-Z0-9", not "a-zA-z0-9", try correcting that...
Did anyone consider adding a space to the regex: [a-zA-Z0-9 ]*? This should match any normal text with letters, numbers and spaces. If you want quotes and other special chars, add them to the regex too.
You can quickly test your regex at http://www.regexplanet.com/simple/
You can check whether the input value contains only letters and digits by using the regex ^[a-zA-Z0-9]*$
If the value contains only letters and digits, it shows "match", e.g. riz99, riz99z;
otherwise it shows "not match", e.g. 99z., riz99.z, riz99.9
Example code:
if(e.target.value.match('^[a-zA-Z0-9]*$')){
console.log('match')
}
else{
console.log('not match')
}

How to replace a special character with single slash

I have a question about strings in Java. Let's say, I have a string like so:
String str = "The . startup trace ?state is info?";
As the string contains special characters like "?", I need each one to be replaced with "\?" as per my requirement. How do I replace special characters with "\"? I tried the following:
str.replace("?","\?");
But it gives a compilation error. Then I tried the following:
str.replace("?","\\?");
When I do this it replaces the special characters with "\\". But when I print the string, it prints with single slash. I thought it is taking single slash only but when I debugged I found that the variable is taking "\\".
Can anyone suggest how to replace the special characters with single slash ("\")?
On escape sequences
A declaration like:
String s = "\\";
defines a string containing a single backslash. That is, s.length() == 1.
This is because \ is a Java escape character for String and char literals. Here are some other examples:
"\n" is a String of length 1 containing the newline character
"\t" is a String of length 1 containing the tab character
"\"" is a String of length 1 containing the double quote character
"\/" contains an invalid escape sequence, and therefore is not a valid String literal
it causes compilation error
Naturally you can combine escape sequences with normal unescaped characters in a String literal:
System.out.println("\"Hey\\\nHow\tare you?");
The above prints (tab spacing may vary):
"Hey\
How are you?
References
JLS 3.10.6 Escape Sequences for Character and String Literals
See also
Is the char literal '\"' the same as '"' ?(backslash-doublequote vs only-doublequote)
Back to the problem
Your problem definition is very vague, but the following snippet works as it should:
System.out.println("How are you? Really??? Awesome!".replace("?", "\\?"));
The above snippet replaces ? with \?, and thus prints:
How are you\? Really\?\?\? Awesome!
If instead you want to replace a char with another char, then there's also an overload for that:
System.out.println("How are you? Really??? Awesome!".replace('?', '\\'));
The above snippet replaces ? with \, and thus prints:
How are you\ Really\\\ Awesome!
String API links
replace(CharSequence target, CharSequence replacement)
Replaces each substring of this string that matches the literal target sequence with the specified literal replacement sequence.
replace(char oldChar, char newChar)
Returns a new string resulting from replacing all occurrences of oldChar in this string with newChar.
On how regex complicates things
If you're using replaceAll or any other regex-based method, then things become somewhat more complicated. It can be greatly simplified if you understand some basic rules.
Regex patterns in Java are given as String values
Metacharacters (such as ? and .) have special meanings, and may need to be escaped by preceding with a backslash to be matched literally
The backslash is also a special character in replacement String values
The above factors can lead to the need for numerous backslashes in patterns and replacement strings in Java source code.
It doesn't look like you need regex for this problem, but here's a simple example to show what it can do:
System.out.println(
"Who you gonna call? GHOSTBUSTERS!!!"
.replaceAll("[?!]+", "<$0>")
);
The above prints:
Who you gonna call<?> GHOSTBUSTERS<!!!>
The pattern [?!]+ matches one-or-more (+) of any characters in the character class [...] definition (which contains a ? and ! in this case). The replacement string <$0> essentially puts the entire match $0 within angled brackets.
Related questions
Having trouble with Splitting text. - discusses common mistakes like split(".") and split("|")
Regular expressions references
regular-expressions.info
Character class and Repetition with Star and Plus
java.util.regex.Pattern and Matcher
In case you want to replace ? with \?, there are 2 possibilities: replace and replaceAll (for regular expressions):
str.replace("?", "\\?")
str.replaceAll("\\?","\\\\?");
The result is "The . startup trace \?state is info\?"
If you want to replace ? with \, just remove the ? character from the second argument.
But when I print the string, it prints
with single slash.
Good. That's exactly what you want, isn't it?
There are two simple rules:
A backslash inside a String literal has to be specified as two to satisfy the compiler, i.e. "\\". Otherwise it is taken as a special-character escape.
A backslash in a regular expression has to be specified as two to satisfy regex, otherwise it is taken as a regex escape. Because of (1) this means you have to write 2x2=4 of them: "\\\\" (and because of the forum software I actually had to write 8!).
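The two rules can be seen side by side in a small sketch. The class and method names are hypothetical; the third variant uses Pattern.quote and Matcher.quoteReplacement, which let the library add the regex-level escaping so only the compiler-level doubling remains.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BackslashReplaceDemo {
    // Literal replacement: only the compiler-level doubling is needed.
    static String viaReplace(String s) {
        return s.replace("?", "\\?");
    }

    // Regex replacement: pattern and replacement both need regex-level escaping.
    static String viaReplaceAll(String s) {
        return s.replaceAll("\\?", "\\\\?");
    }

    // Regex replacement with the escaping delegated to the library.
    static String viaQuoting(String s) {
        return s.replaceAll(Pattern.quote("?"), Matcher.quoteReplacement("\\?"));
    }

    public static void main(String[] args) {
        String str = "The . startup trace ?state is info?";
        System.out.println("\\".length());      // 1
        System.out.println(viaReplace(str));    // The . startup trace \?state is info\?
        System.out.println(viaReplaceAll(str)); // The . startup trace \?state is info\?
        System.out.println(viaQuoting(str));    // The . startup trace \?state is info\?
    }
}
```

All three produce the same text; a debugger that displays the value as a Java literal will show each backslash doubled, which is only a display convention.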
String str="\\";
str=str.replace(str,"\\\\");
System.out.println("New String="+str);
Output: New String=\\
In Java source, "\\" denotes a single backslash (\). So the above code replaces the single backslash with two backslashes (\\).
