Java - Unknown characters passing as [a-zA-z0-9]*?


I'm no expert in regex, but I need to parse some input I have no control over and filter out any strings that don't consist of A-Z, a-z and/or 0-9.
When I run this,
Pattern p = Pattern.compile("^[a-zA-Z0-9]*$"); //fixed typo
if(!p.matcher(gottenData).matches())
System.out.println(someData); //someData contains gottenData
certain spaces plus an unknown symbol somehow slip through the filter (gottenData is the red rectangle):
In case you're wondering, it DOES also display text; it's not all like that.
For now, I don't mind the [?] as long as the string also contains some text along with it.
Please help.
[EDIT] As far as I can tell from the (very large) input, the [?]'s are either whitespace or nothing at all; maybe there's some sort of encoding issue, or perhaps something to do with #text nodes (the input is XML).

The * quantifier means "zero or more", so the pattern also matches a string that contains none of the characters in your class, namely the empty string. Try the + quantifier, which means "one or more": ^[a-zA-Z0-9]+$ matches strings made up of alphanumeric characters only. ^.*[a-zA-Z0-9]+.*$ matches any string containing at least one alphanumeric character, although the leading .* makes it slower. If you use Matcher.find() instead of Matcher.matches(), no full-string match is required and you can simply use the regex [a-zA-Z0-9]+. (Matcher.lookingAt() also drops the full-string requirement, but it only matches at the start of the input.)
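To make the difference between the two quantifiers concrete, here is a minimal sketch. The class and method names are made up for illustration; only Pattern, Matcher.matches() and Matcher.find() come from the standard library.

```java
import java.util.regex.Pattern;

final class QuantifierDemo {
    private static final Pattern ZERO_OR_MORE = Pattern.compile("[a-zA-Z0-9]*");
    private static final Pattern ONE_OR_MORE = Pattern.compile("[a-zA-Z0-9]+");

    /** True if the whole string is alphanumeric, or if it is empty. */
    static boolean allAlnumOrEmpty(String s) {
        return ZERO_OR_MORE.matcher(s).matches();
    }

    /** True only if the whole string is one or more alphanumeric characters. */
    static boolean allAlnum(String s) {
        return ONE_OR_MORE.matcher(s).matches();
    }

    /** True if the string contains at least one alphanumeric character anywhere. */
    static boolean containsAlnum(String s) {
        return ONE_OR_MORE.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(allAlnumOrEmpty(""));   // true: * accepts zero characters
        System.out.println(allAlnum(""));          // false: + needs at least one
        System.out.println(allAlnum("abc123"));    // true
        System.out.println(allAlnum(" a "));       // false: spaces are not in the class
        System.out.println(containsAlnum(" a "));  // true: find() looks anywhere
    }
}
```

If whitespace-only strings are what slips through, containsAlnum is probably the check to start from.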

You have an error in your regex: instead of [a-zA-z0-9]* it should be [a-zA-Z0-9]*.
You don't need ^ and $ around the regex, because Matcher.matches() always matches against the complete string.
String gottenData = "a ";
Pattern p = Pattern.compile("[a-zA-Z0-9]*");
if (!p.matcher(gottenData).matches())
    System.out.println("doesn't match.");
This prints "doesn't match."

The correct answer is a combination of the answers above. First, I imagine your intended character class is [a-zA-Z0-9]. Note that A-z isn't as bad as you might think: it includes all characters in the ASCII range between A and z, which is the letters plus a few extras (specifically [, \, ], ^, _ and `).
A second potential problem, as Martin mentioned, is that you may need the start and end anchors if you want the string to consist only of letters and digits. (With matches() this is already implied; it matters if you switch to find().)
Finally, you use the * quantifier, which means "zero or more", so the pattern can match zero characters and matches() returns true for an empty string. What you need is the + quantifier. So the pattern you are most likely looking for is:
^[a-zA-Z0-9]+$
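A short sketch of what the A-z typo actually lets through. The class and method names are hypothetical; the six punctuation characters are the ones that sit between 'Z' and 'a' in ASCII.

```java
import java.util.regex.Pattern;

final class RangeTypoDemo {
    // The class with the typo (A-z) and the intended class (A-Z).
    private static final Pattern TYPO = Pattern.compile("[a-zA-z0-9]+");
    private static final Pattern FIXED = Pattern.compile("[a-zA-Z0-9]+");

    static boolean passesTypo(String s) {
        return TYPO.matcher(s).matches();
    }

    static boolean passesFixed(String s) {
        return FIXED.matcher(s).matches();
    }

    public static void main(String[] args) {
        // The six characters [ \ ] ^ _ ` written as a Java string literal.
        String extras = "[\\]^_`";
        System.out.println(passesTypo(extras));   // true: A-z covers them
        System.out.println(passesFixed(extras));  // false
        System.out.println(passesFixed("riz99")); // true
    }
}
```

Note that neither class accepts spaces, so the typo alone would not explain whitespace slipping through.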

You have to change the regexp to "^[a-zA-Z0-9]*$" to ensure that you are matching the entire string

Looks like it should be "a-zA-Z0-9", not "a-zA-z0-9", try correcting that...

Did anyone consider adding a space to the character class, as in [a-zA-Z0-9 ]*? This should match any normal text made of letters, digits and spaces. If you want quotes and other special characters, add them to the class too.
You can quickly test your regex at http://www.regexplanet.com/simple/

You can check whether the input value consists only of letters and digits with the regex ^[a-zA-Z0-9]*$
If the value contains only letters and digits, it matches, e.g. riz99, riz99z
Otherwise it does not match, e.g. 99z., riz99.z, riz99.9
Example code (JavaScript):
if (e.target.value.match('^[a-zA-Z0-9]*$')) {
  console.log('match')
} else {
  console.log('not match')
}

Related

Java Regex with "Joker" characters

I'm trying to write a regex that validates an input field.
What I call "joker" characters are '?' and '*'.
Here is my Java regex:
"^$|[^\\*\\s]{2,}|[^\\*\\s]{2,}[\\*\\?]|[^\\*\\s]{2,}[\\?]{1,}[^\\s\\*]*[\\*]{0,1}"
What I'm trying to match is:
Minimum 2 alphanumeric characters (other than '?' and '*')
The '*' can appear only once, and only at the end of the string
The '?' can appear multiple times
No whitespace at all
So for example :
abcd = OK
?bcd = OK
ab?? = OK
ab*= OK
ab?* = OK
??cd = OK
*ab = NOT OK
??? = NOT OK
ab cd = NOT OK
" abcd" = NOT OK (space at the beginning)
I've made the regex a bit complicated and I'm lost. Can you help me?

^(?:\?*[a-zA-Z\d]\?*){2,}\*?$
Explanation:
The regex asserts that this pattern must appear two or more times:
\?*[a-zA-Z\d]\?*
This matches one character from the class [a-zA-Z\d] with zero or more question marks on either side of it.
Then the regex matches \*?, which means zero or one asterisk, at the end of the string.
Here is an alternative regex that is faster, as revo suggested in the comments:
^(?:\?*[a-zA-Z\d]){2}[a-zA-Z\d?]*\*?$
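As a quick check, the first regex from this answer can be run against the examples listed in the question. This is only a sketch; the class and method names are invented for the example.

```java
import java.util.regex.Pattern;

final class JokerPatternDemo {
    // The first regex from the answer, written as a Java string literal.
    private static final Pattern JOKER =
            Pattern.compile("^(?:\\?*[a-zA-Z\\d]\\?*){2,}\\*?$");

    static boolean isValid(String input) {
        return JOKER.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"abcd", "?bcd", "ab??", "ab*", "ab?*", "??cd",
                            "*ab", "???", "ab cd", " abcd"};
        for (String s : samples) {
            // Prints each sample with OK / NOT OK, mirroring the question's list.
            System.out.println("\"" + s + "\" = " + (isValid(s) ? "OK" : "NOT OK"));
        }
    }
}
```

The first six samples should come out as OK and the last four as NOT OK, matching the list in the question.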

Here you go:
^\?*\w{2,}\?*\*?(?<!\s)$
Described and demonstrated at Regex101.
^ is the start of the String
\?* indicates any number of initial ? characters (must be escaped)
\w{2,} at least 2 word characters (letters, digits or underscore)
\?* continues with any number of ? characters
\*? and optionally one last * character
(?<!\s) asserts that the character just before the end of the String is not whitespace (a negative look-behind); since none of the preceding tokens can match whitespace, the String cannot contain any
$ is the end of the String

Another way to solve this problem is the look-ahead mechanism (?=subregex). It is zero-length (after executing subregex it resets the regex cursor to where it was before), so it lets the regex engine run multiple tests on the same text via the construct
(?=condition1)
(?=condition2)
(?=...)
conditionN
Note: the last condition (conditionN) is not placed in (?=...), so that the regex engine moves the cursor past the tested part (it "consumes" it) and can go on to test whatever follows. For that to work, conditionN must match precisely the section we want to consume (the earlier conditions don't have that limitation; they can match substrings of any length, for example just the first few characters).
So now we need to think about what our conditions are.
We want to match only alphanumeric characters, ? and *, where * can appear (optionally) only at the end. We can write it as ^[a-zA-Z0-9?]*[*]?$. This also rules out whitespace characters, because we didn't include them among the accepted characters.
The second requirement is "Minimum 2 alpha-numeric characters". It can be written as .*?[a-zA-Z0-9].*?[a-zA-Z0-9] or (?:.*?[a-zA-Z0-9]){2,} (if we like shorter regexes). Since that condition doesn't test the whole text but only some part of it, we can place it in a look-ahead.
These conditions seem to cover everything we wanted, so we can combine them into a regex like:
^(?=(?:.*?[a-zA-Z0-9]){2,})[a-zA-Z0-9?]*[*]?$
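The decomposition above can be sketched in Java by testing the two conditions separately and then together. The class and method names are assumptions made for this example; the patterns are the ones from the answer.

```java
import java.util.regex.Pattern;

final class LookaheadDemo {
    // Condition 1: only alphanumerics and '?', with an optional '*' at the end.
    private static final Pattern ALLOWED = Pattern.compile("[a-zA-Z0-9?]*[*]?");
    // Condition 2: at least two alphanumerics somewhere (a prefix test is enough).
    private static final Pattern TWO_ALNUM = Pattern.compile("(?:.*?[a-zA-Z0-9]){2,}");
    // Both conditions in one regex, with condition 2 moved into a look-ahead.
    private static final Pattern COMBINED =
            Pattern.compile("^(?=(?:.*?[a-zA-Z0-9]){2,})[a-zA-Z0-9?]*[*]?$");

    static boolean onlyAllowedChars(String s) {
        return ALLOWED.matcher(s).matches();
    }

    static boolean hasTwoAlnum(String s) {
        // lookingAt() anchors at the start but does not require a full match,
        // which mirrors how the look-ahead behaves right after ^.
        return TWO_ALNUM.matcher(s).lookingAt();
    }

    static boolean combined(String s) {
        return COMBINED.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a?b", "ab?*", "???", "*ab", "ab cd"}) {
            boolean separate = onlyAllowedChars(s) && hasTwoAlnum(s);
            System.out.println("\"" + s + "\": separate=" + separate
                    + ", combined=" + combined(s));
        }
    }
}
```

For each sample the two columns should agree, which is the point of the construct: the look-ahead only adds a test, it does not consume anything.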

How to remove duplicate characters in a string using regex?

I need to replace the duplicate characters in a string. I tried using
outputString = str.replaceAll("(.)(?=.*\\1)", "");
This replaces the duplicate characters but the position of the characters changes as shown below.
input
haih
output
aih
But I need the output to be hai. That is, the order in which the characters first appear in the string should not change. Given below are the expected outputs for some inputs.
input
aaaassssddddd
output
asd
input
cdddddggggeeccc
output
cdge
How can this be achieved?

It seems like your code is leaving the last character, so how about this?
outputString = new StringBuilder(str).reverse().toString();
// outputString is now hiah
outputString = outputString.replaceAll("(.)(?=.*\\1)", "");
// outputString is now iah
outputString = new StringBuilder(outputString).reverse().toString();
// outputString is now hai

Overview
It's possible with Oracle's implementation, but I wouldn't recommend this answer, for several reasons:
It relies on a bug in the implementation, which interprets *, + or {n,} as {0, 0x7FFFFFFF}, {1, 0x7FFFFFFF} and {n, 0x7FFFFFFF} respectively, and therefore allows the look-behind to contain such quantifiers. Since it relies on a bug, there is no guarantee that it will work the same way in the future.
It is an unmaintainable mess. Anyone with basic Java knowledge can read normal code, but the regex in this answer can be understood at a glance only by people who know the ins and outs of the regex implementation.
Therefore, this answer is for educational purposes rather than something to be used in production code.
Solution
Here is the one-liner replaceAll regex solution:
String output = input.replaceAll("(.)(?=(.*))(?<=(?=\\1.*?\\1\\2$).+)","")
Printing out the regex:
(.)(?=(.*))(?<=(?=\1.*?\1\2$).+)
What we want to do is look behind to see whether the same character has appeared before. The capturing group (.) at the beginning captures the current character, and the look-behind group checks whether that character has appeared before. So far, so good.
However, since the backreference \1 doesn't have an obvious length, it can't appear in the look-behind directly.
This is where we make use of the bug to look behind all the way to the beginning of the string, then use a look-ahead inside the look-behind to include the backreference, as you can see in (?<=(?=...).+).
This is not the end of the problem, though. While the non-assertion pattern .+ inside the look-behind can't advance past the position after the character in (.), the look-ahead inside it can. As a simple test:
"haaaaaaaaa".replaceAll("h(?<=(?=(.*)).*)","$1")
> "aaaaaaaaaaaaaaaaaa"
To make sure that the search doesn't spill beyond the current character, I capture the rest of the string in a look-ahead (?=(.*)) and use it to "mark" the current position in (?=\\1.*?\\1\\2$).
Can this be done in one replacement without using look-behind?
I think it is impossible. We need to distinguish the first appearance of a character from subsequent appearances of the same character. While we can do this for one fixed character (e.g. a), the problem requires us to do so for every character in the string.
For your information, this removes all subsequent appearances of a fixed character (h is used here):
.replaceAll("^([^h]*h[^h]*)|(?!^)\\Gh+([^h]*)","$1$2")
To do this for multiple characters, we must keep track of whether each character has appeared before, across matches and for all characters. The regex above shows the across-matches part, but the other condition makes this practically impossible.
We obviously can't do this in a single match, since subsequent occurrences can be non-contiguous and arbitrary in number.
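For comparison, the "normal code" route that the overview recommends over the regex could look roughly like this. It is a sketch, not part of the original answer: the class and method names are made up, and it works on UTF-16 char units, so characters outside the Basic Multilingual Plane are not handled.

```java
import java.util.HashSet;
import java.util.Set;

final class DedupeDemo {
    /** Keeps the first occurrence of every character and drops later repeats. */
    static String removeDuplicates(String input) {
        Set<Character> seen = new HashSet<>();
        StringBuilder out = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            // Set.add returns false when the character was already seen.
            if (seen.add(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeDuplicates("haih"));            // hai
        System.out.println(removeDuplicates("aaaassssddddd"));   // asd
        System.out.println(removeDuplicates("cdddddggggeeccc")); // cdge
    }
}
```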

Validate string has no illegal characters

I'm trying to validate a string that only allows letters, numbers and these characters:
!"#$%&'()*+,-./:;<=>?#[\]^_`{|}~
I tried the code below, but it's not working: it lets me enter characters that are not in the regex. I'm still pretty new to Java, and something similar was working in JavaScript, but I can't figure out what's going on here. I think it only returns 4 when it can't find any of the listed characters.
Pattern allowedCharacters = Pattern.compile("[A-Za-z0-9!\"#$%&'()*+,.\\/:;<=>?#[\\]^_`{|}~-]+$]");
if (!allowedCharacters.matcher(pw).find()){
return 4;
}
Any help is appreciated. Thanks
EDIT:
I also tried:
if (pw.matches("^[A-Za-z0-9!\"#$%&'()*+,.\\/:;<=>?#[\\]^_`{|}~-]+$]")){
return 4;
}
and
if (!pw.matches("[A-Za-z0-9!\"#$%&'()*+,.\\/:;<=>?#[\\]^_`{|}~-]+$]")){
return 4;
}

matcher.find() checks whether the string contains a substring that matches the regex, so with
!matcher.find() you are checking that there is no match of the regex anywhere in the tested string.
Consider using matcher.matches() to check whether the entire string is matched by the regex. In that case you will have to add a quantifier like *, + or {n,m} to the character class to decide the password's length. Otherwise it will only accept single-character passwords.
Here is a demo of how your code could look:
// here you place quantifier
// ↓
if (pw.matches("[A-Za-z0-9!\"#$%&'()*+,.\\/:;<=>?#[\\]^_`{|}~-]+$]+")){
System.out.println("password contains only valid characters");
} else {
System.out.println("invalid characters in password");
}
Update:
In your regex you are not escaping [, which makes [\]^_`{|}~-] a separate character class that is added to the outer character class. This inner class does not include \ or [. If you really want to accept only alphanumeric characters and !"#$%&'()*+,-./:;<=>?#[]^_`{|}~ then consider using
"[\\w\\Q!\"#$%&'()*+,-./:;<=>?#[\\]^_`{|}~\\E]+"
as the regex.
\\w represents [a-zA-Z0-9_]
and \Q and \E delimit a quote, which is a mechanism for escaping metacharacters, even inside a character class.

It's because you're using find() and not matches(). That said, I'd try the opposite: do a find() on [^<legal chars>] (note the caret) to match any illegal character. It's faster because it stops as soon as it hits something illegal. Also, start with the simple legal characters, then build up from there. Regular expressions can get hard to read, and adding one special character at a time is easier than adding them all at once.
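A minimal sketch of that inverted check follows. The class, constant and method names are hypothetical. The punctuation list is copied from the question as printed, and Pattern.quote is used so that nothing in it has to be escaped by hand (it wraps the text in \Q...\E, which Java also honours inside a character class).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IllegalCharDemo {
    // Punctuation from the question, as a Java string literal.
    private static final String PUNCT = "!\"#$%&'()*+,-./:;<=>?#[\\]^_`{|}~";

    // Negated class: matches any single character that is NOT legal.
    private static final Pattern ILLEGAL =
            Pattern.compile("[^A-Za-z0-9" + Pattern.quote(PUNCT) + "]");

    /** Index of the first illegal character, or -1 if there is none. */
    static int firstIllegalIndex(String pw) {
        Matcher m = ILLEGAL.matcher(pw);
        return m.find() ? m.start() : -1;
    }

    /** True if no illegal character occurs. An empty string also passes this check. */
    static boolean isLegal(String pw) {
        return firstIllegalIndex(pw) < 0;
    }

    public static void main(String[] args) {
        System.out.println(isLegal("abc123!"));          // true
        System.out.println(isLegal("pass word"));        // false: space is not listed
        System.out.println(firstIllegalIndex("ab cd"));  // 2
    }
}
```

Because an empty password contains no illegal character, a length check would still be needed alongside this.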

Using other answers from this question, I found this to work for me. Nothing needs to be escaped between the \Q and \E. They do that for you.
Pattern whitelist = Pattern.compile("^[\\w\\s\\Q!\"#$%&'()*+,-.\\/:;<=>?#[]^_`{|}~\\E]+$");
if (!whitelist.matcher(pw).matches()) {
// error
}

Formulating a regex with a single dot

I am trying to formulate a regex for the following scenario:
The string to match: mName87.com
So the string may consist of any number of alphanumeric characters, but can contain only a single dot, anywhere in the string.
I formulated this regex: [a-zA-Z0-9.], but it matches even with multiple dots (.)
What am I doing wrong here?

The regex you provided matches only a single character of the string you're trying to validate. There are a few things to take care of in your scenario:
You want to match over the whole string, so your regex must start with ^ (beginning of the string) and end with $ (end of the string).
Then you want to accept any number of alphanumeric characters; this is done with [a-zA-Z0-9]+, where the + means one or more characters.
Then match the dot: \. (you must escape it here)
Finally, accept more characters again.
All together, the regex is:
^[a-zA-Z0-9]+\.[a-zA-Z0-9]+$
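Here is a small sketch of that regex in Java, with the backslash doubled for the string literal. The class and method names are assumptions for illustration. Note that this version requires at least one character on each side of the dot.

```java
import java.util.regex.Pattern;

final class SingleDotDemo {
    private static final Pattern SINGLE_DOT =
            Pattern.compile("^[a-zA-Z0-9]+\\.[a-zA-Z0-9]+$");

    static boolean hasExactlyOneDot(String s) {
        return SINGLE_DOT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasExactlyOneDot("mName87.com")); // true
        System.out.println(hasExactlyOneDot("a.b.c"));       // false: two dots
        System.out.println(hasExactlyOneDot("nodot"));       // false: no dot
        System.out.println(hasExactlyOneDot(".com"));        // false: nothing before the dot
    }
}
```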

You can use this regex:
\\w*\\.\\w*

Try with:
^([a-zA-Z0-9]+\.)+[a-zA-Z]$

use this regular expression ^[a-zA-Z0-9]*\.[a-zA-Z0-9.]*$

EDITED:
Try
([a-zA-Z0-9]+\.[a-zA-Z0-9]+)|(\.[a-zA-Z0-9]+)|([a-zA-Z0-9]+\.)
That is: [two words with the dot in the middle] OR [a word that starts with a dot] OR [a word that ends with a dot]

Regex (Java) to remove all characters up to but not including (a number or a letter a-f followed by a number)

I need help constructing the regular expression to remove all characters up to but not including (a number or a letter a-f followed by a number) in Java:
Here's what I came up with (doesn't work):
string.replaceFirst(".+?(\\d|[a-f]\\d)","");
That line of code replaces the entire string with an empty string.
.+? is every character up to \\d a digit OR [a-f]\\d any of the letters a-f followed by a digit.
This doesn't work, however, can I have some help?
Thanks
EDIT: changed replace with replaceFirst

First off, replace() acts on literals, not regexes. You should use replaceFirst or replaceAll depending on what you want. Your regex problem is that you're including the suffix as part of the string to replace. You can give this a try:
input.replaceFirst(".+?(\\d|[a-f]\\d)","$1")
Here I just include the suffix in the replacement string as well. The more correct approach is to make that a zero-width assertion so that it doesn't get included in the region to replace. You can use a positive lookahead:
input.replaceFirst(".+?(?=(\\d|[a-f]\\d))", "")

The other answers given here have the problem that if the string starts with a-f followed by a number, or just a number, they will actually match and replace the first character. Not sure if that's a relevant scenario. This more convoluted pattern should work though:
"([^a-f\\d]|([a-f](?!\\d)))+"
(that is, everything that's not a digit or a-f, or a-f not followed by a digit).

I'd suggest something along the lines of
string.replaceFirst(".*?(?=(\\d|[a-f]\\d))", "");

s = s.replaceFirst(".*?(?=[a-f]?\\d)", "");
Using .*? instead of .+? ensures that the first character gets checked by the lookahead, solving the problem @johusman mentioned. And while your (\\d|[a-f]\\d) isn't causing a problem, [a-f]?\\d is both more efficient and more readable.
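The difference between .*? and .+? is easiest to see on an input that already starts with the wanted suffix. The sketch below uses made-up class and method names and sample strings chosen for illustration.

```java
final class StripPrefixDemo {
    /** Removes everything before the first digit, or a-f letter followed by a digit. */
    static String stripPrefix(String s) {
        return s.replaceFirst(".*?(?=[a-f]?\\d)", "");
    }

    /** Same idea with .+?, which always consumes at least one character. */
    static String stripPrefixWithPlus(String s) {
        return s.replaceFirst(".+?(?=[a-f]?\\d)", "");
    }

    public static void main(String[] args) {
        System.out.println(stripPrefix("xyz-a5rest"));   // a5rest
        System.out.println(stripPrefix("b7"));           // b7 (nothing to remove)
        System.out.println(stripPrefixWithPlus("b7"));   // 7 (the b is lost)
        System.out.println(stripPrefix("no match"));     // unchanged: no digit present
    }
}
```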
