How does [] make a difference in Java regex?

I have a regex for validating UTF-8 characters:
String regex = "[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{S}\\p{C}]*";
I wanted to add a length (range) check too, so I modified it to
String regex = "[[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{S}\\p{C}]*]";
String rangeRegex = regex + "{0,30}";
Notice that it is the same regex; I just wrapped it in [ ].
Now I can validate with the range by using rangeRegex, but regex on its own no longer validates UTF-8 chars.
My question is: how is [] affecting the regex? If I remove the outer [] from the regex, it validates UTF-8 chars but not with the range. If I add [], it validates with the range but not without it!
Sample test code:
public class Test {
static String regex = "[[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{S}\\p{C}]*]" ;
public static void main(String[] args) {
String userId = null;
//testUserId(userId);
userId = "";
testUserId(userId);
userId = "æÆbBcCćĆčČçďĎǳǲdzsDzs";
testUserId(userId);
userId = "test123";
testUserId(userId);
userId = "abcxyzsd";
testUserId(userId);
String zip = "i«♣│axy";
testZip(zip);
zip = "331fsdfsdfasdfasd02c3";
testZip(zip);
zip = "331";
testZip(zip);
}
/**
* without range check
* @param userId
*/
static void testUserId(String userId){
boolean pass = true;
if ( !stringValidator(userId, regex)) {
pass = false;
}
System.out.println(pass);
}
/**
* with a range check
* @param zip
*/
static void testZip(String zip){
boolean pass = true;
String regex1 = regex + "{0,10}";
if (StringUtils.isNotBlank(zip) && !stringValidator(zip, regex1)) {
pass = false;
}
System.out.println(pass);
}
static boolean stringValidator(String str, String regex) {
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(str);
return matcher.matches();
}
}

The explanations given are rather wrong for Java regex.
In Java, unescaped paired square brackets inside a character class are not treated as literal [ and ] characters. They have a special meaning in Java character classes:
[a-d[m-p]] a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]] d, e, or f (intersection)
[a-z&&[^bc]] a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]] a through z, and not m through p: [a-lq-z] (subtraction)
So, when you wrap your pattern in an extra [...], you get a union of the inner character class with a literal * character: the result matches exactly one character that is either in [\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{S}\\p{C}] or a literal *.
Also, [[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{S}\\p{C}]*] is equal to [\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{S}\\p{C}*], because the * symbol inside a character class stops being a special character (a quantifier) and becomes a literal asterisk.
If you use [[]], the engine will throw an exception: Unclosed character class near index 3
For example:
System.out.println("abc[]".replaceAll("[[abc]]", "")); // => []
System.out.println("abc[]".replaceAll("[[]]", "")); // => error
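As a rough illustration of the union behaviour described above, the following sketch compares a nested class with its flattened form. The class name NestedClassDemo and the shortened [a-c] class are made up for this example; they only stand in for the longer \p{...} class from the question.

```java
import java.util.regex.Pattern;

class NestedClassDemo {
    // Hypothetical, shortened stand-ins for the patterns in the question.
    static final String NESTED = "[[a-c]*]"; // union of [a-c] and a literal *
    static final String FLAT = "[a-c*]";     // the same set written without nesting

    static boolean matchesNested(String s) {
        return Pattern.matches(NESTED, s);
    }

    static boolean matchesFlat(String s) {
        return Pattern.matches(FLAT, s);
    }

    public static void main(String[] args) {
        // Both patterns describe exactly one character, so "ab" and "" fail.
        for (String s : new String[] {"a", "*", "ab", ""}) {
            System.out.println("'" + s + "' nested=" + matchesNested(s)
                    + " flat=" + matchesFlat(s));
        }
    }
}
```

Both patterns accept exactly one character, which is consistent with the empty string and the longer inputs failing in the question's testUserId calls.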
Whenever you need to check the length of a string with regex, you need anchors and a limiting quantifier. Anchors are automatically added when a regex is used with Matcher#matches method:
The matches method attempts to match the entire input sequence against the pattern.
Example code:
String regex = "[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{S}\\p{C}]";
String new_regex = regex + "{0,30}";
System.out.println("Some string".matches(new_regex)); // => true
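A minimal sketch of how the two checks from the question could share one character class, with the quantifier attached outside the brackets. The class name LengthCheck and the method names are assumptions made for illustration, not part of the original code.

```java
import java.util.regex.Pattern;

class LengthCheck {
    // Character class from the answer, without any quantifier.
    private static final String CHAR_CLASS =
            "[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{S}\\p{C}]";

    // Hypothetical helper: any length, the quantifier * goes after the class.
    static boolean isValid(String s) {
        return Pattern.matches(CHAR_CLASS + "*", s);
    }

    // Hypothetical helper: at most maxLen characters, using {0,maxLen}.
    static boolean isValid(String s, int maxLen) {
        return Pattern.matches(CHAR_CLASS + "{0," + maxLen + "}", s);
    }

    public static void main(String[] args) {
        System.out.println(isValid(""));
        System.out.println(isValid("test123"));
        System.out.println(isValid("331", 10));
        System.out.println(isValid("331fsdfsdfasdfasd02c3", 10));
    }
}
```

Pattern.matches, like Matcher#matches, has to match the entire input, so no explicit ^ and $ anchors are needed here.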
UPDATE
Here is commented code of yours:
String userId = "";
testUserId(userId); // false - Correct as we test an empty string with an at-least-one-char regex
userId = "æÆbBcCćĆčČçďĎǳǲdzsDzs";
testUserId(userId); // false - Correct as we only match 1 character string, others fail
userId = "test123";
testUserId(userId); // false - see above
userId = "abcxyzsd";
testUserId(userId); // false - see above
String zip = "i«♣│axy";
testZip(zip); // true - OK, 7-symbol string matches against [...]{0,10} regex
zip = "331fsdfsdfasdfasd02c3";
testZip(zip); // false - OK, 21-symbol string does not match a regex that requires only 0 to 10 characters
zip = "331";
testZip(zip); // true - OK, 3-symbol string matches against [...]{0,10} regex

* means 0 or more, so it is equivalent to {0,}, i.e. you can replace the * with {0,30} and it should do what you want:
[\p{L}\p{M}\p{N}\p{P}\p{Z}\p{S}\p{C}]{0,30}
[] creates a character class. In many regex flavors [[]] would mean "a character class containing just [, followed by a literal ]", since the first ] closes the class. Java instead treats a [ inside a class as the start of a nested class (a union), so wrapping the whole pattern in an extra [] doesn't do what you want either way.
Also, correct me if I'm wrong, but the character list you are building covers pretty much everything, so you could go with .{0,30} for much the same effect.
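One caveat with the . shortcut: by default . does not match line terminators in Java, while \p{C} does include control characters such as \n. The sketch below shows the difference; the class name DotVsClassDemo and the sample input are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

class DotVsClassDemo {
    static final Pattern DOT = Pattern.compile(".{0,30}");
    // DOTALL makes . match line terminators as well.
    static final Pattern DOT_ALL = Pattern.compile(".{0,30}", Pattern.DOTALL);
    static final Pattern CATEGORIES =
            Pattern.compile("[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{S}\\p{C}]{0,30}");

    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String twoLines = "line1\nline2"; // contains a line terminator
        System.out.println(matches(DOT, twoLines));
        System.out.println(matches(DOT_ALL, twoLines));
        System.out.println(matches(CATEGORIES, twoLines));
    }
}
```

So if multi-line input is possible, the . version probably needs Pattern.DOTALL (or the inline (?s) flag) to behave like the category-based class.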

Related

Replacing all regex matches with masking characters in Java

Java 11 here. I have a huge String that will contain 0+ instances of the following "fizz token":
the substring "fizz"
followed by any integer 0+
followed by an equals sign ("=")
followed by another string of any kind, a.k.a. the "fizz value"
terminated by the first whitespace (including tabs, newlines, etc.)
So some examples of a valid fizz token:
fizz0=fj49jc49fj59
fizz39=f44kk5k59
fizz101023=jjj
Some examples of invalid fizz tokens:
fizz=9d94dj49j4 <-- missing an integer after "fizz" and before "="
fizz2= <-- missing a fizz value after "="
I am trying to write a Java method that will:
Find all instances of matching fizz tokens inside my huge input String
Obtain each fizz token's value
Replace each character of the token value with an upper-case X ("X")
So for example:
| Fizz Token | Token Value | Final Result |
|--------------------|--------------|--------------------|
| fizz0=fj49jc49fj59 | fj49jc49fj59 | fizz0=XXXXXXXXXXXX |
| fizz39=f44kk5k59 | f44kk5k59 | fizz39=XXXXXXXXX |
| fizz101023=jjj | jjj | fizz101023=XXX |
I need the method to do this replacement with the token values for all fizz tokens found in the input string, hence:
String input = "Some initial text fizz0=fj49jc49fj59 then some more fizz101023=jjj";
String masked = mask(input);
// Outputs: Some initial text fizz0=XXXXXXXXXXXX then some more fizz101023=XXX
System.out.println(masked);
My best attempt thus far is a massive WIP:
public class Masker {
private Pattern fizzTokenPattern = Pattern.compile("fizz{d*}=*");
public String mask(String input) {
Matcher matcher = fizzTokenPattern.matcher(input);
int numMatches = matcher.groupCount();
for (int i = 0; i < numMatches; i++) {
// how to get the token value from the group?
String tokenValue = matcher.group(i); // ex: fj49jc49fj59
// how to replace each character with an X?
// ex: fj49jc49fj59 ==> XXXXXXXXXXXX
String masked = tokenValue.replaceAll("*", "X");
// how to grab the original (matched) token and replace it with the new
// 'masked' string?
String entireTokenWithValue = input.substring(matcher.group(i));
}
}
}
I feel like I'm in the ballpark but missing some core concepts. Anybody have any ideas?

According to requirements
the substring "fizz"
followed by any integer 0+
followed by an equals sign ("=")
followed by another string of any kind, a.k.a. the "fizz value"
terminated by the first whitespace (included tabs, newlines, etc.)
a regex which fulfills them can look like
fizz
\d+
=
\S+ - one or more of any NON-whitespace characters (this covers the last two requirements),
which gives us "fizz\\d+=\\S+".
But since you want to modify only part of that match and reuse the rest, we can wrap those parts in groups, like "(fizz\\d+=)(\\S+)". This way our replacement will need to
put back what was found in "(fizz\\d+=)"
modify what was found in "(\\S+)"
This modification is simply X repeated n times, where n is the length of what was found in group "(\\S+)".
In other words, your code can look like
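import java.util.regex.Pattern;
// Matcher.replaceAll(Function) needs Java 9+, String.repeat needs Java 11+
class Masker {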
private static Pattern p = Pattern.compile("(fizz\\d+=)(\\S+)");
public static String mask(String input) {
return p.matcher(input)
.replaceAll(match -> match.group(1)+"X".repeat(match.group(2).length()));
}
//DEMO
public static void main(String[] args) throws Exception {
String input = "Some initial text fizz0=fj49jc49fj59 then some more fizz101023=jjj";
String masked = Masker.mask(input);
System.out.println(input);
System.out.println(masked);
}
}
Output:
Some initial text fizz0=fj49jc49fj59 then some more fizz101023=jjj
Some initial text fizz0=XXXXXXXXXXXX then some more fizz101023=XXX
Version 2 - with named-groups so more readable/easier to maintain
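import java.util.regex.Matcher;
import java.util.regex.Pattern;
// Matcher.appendReplacement(StringBuilder, ...) needs Java 9+
class Masker {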
private static Pattern p = Pattern.compile("(?<token>fizz\\d+=)(?<value>\\S+)");
public static String mask(String input) {
StringBuilder sb = new StringBuilder();
Matcher m = p.matcher(input);
while(m.find()){
String token = m.group("token");
String value = m.group("value");
String maskedValue = "X".repeat(value.length());
m.appendReplacement(sb, token+maskedValue);
}
m.appendTail(sb);
return sb.toString();
}
//DEMO
public static void main(String[] args) throws Exception {
String input = "Some initial text fizz0=fj49jc49fj59 then some more fizz101023=jjj";
String masked = Masker.mask(input);
System.out.println(input);
System.out.println(masked);
}
}
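To double-check the pattern against the invalid tokens listed in the question, here is a small sketch that reuses the same regex. The class name MaskerEdgeCases and the sample inputs are assumptions made for this example.

```java
import java.util.regex.Pattern;

class MaskerEdgeCases {
    // Same pattern as in the answer: group 1 is the token, group 2 the value.
    private static final Pattern TOKEN = Pattern.compile("(fizz\\d+=)(\\S+)");

    static String mask(String input) {
        return TOKEN.matcher(input)
                .replaceAll(m -> m.group(1) + "X".repeat(m.group(2).length()));
    }

    public static void main(String[] args) {
        // No integer after "fizz": \d+ cannot match, so the text is untouched.
        System.out.println(mask("fizz=9d94dj49j4"));
        // No value after "=": \S+ needs at least one non-whitespace character.
        System.out.println(mask("fizz2= next"));
        // A valid token terminated by a tab.
        System.out.println(mask("a fizz39=f44kk5k59\tb"));
    }
}
```

The string returned by the lambda is treated as a replacement string, so if the prefix or the masking character could ever contain $ or \, wrapping the result in Matcher.quoteReplacement would be a safer choice.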

Java regex convert string to valid json string

I have a pretty long string that looks something like
{abc:\"def\", ghi:\"jkl\"}
I want to convert this to a valid json string like
{\"abc\":\"def\", \"ghi\":\"jkl\"}
I started looking at the replaceAll(String regex, String replacement) method on the String object, but I'm struggling to find the correct regex for it.
Can someone please help me with this?

In this particular case the regex should look for a word that is preceded by {, a space, or a comma and is not followed by ". The preceding character is checked with a lookbehind so that it is not consumed and replaced along with the key:
String str = "{abc:\"def\", ghi:\"jkl\"}";
String regex = "(?<=[{ ,])(\\w+)(?!\")";
System.out.println(str.replaceAll(regex, "\"$1\""));
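A slightly stricter variant, sketched under the assumption that keys are plain word characters and that values never contain an unquoted word followed by a colon: it quotes a word only when a colon follows it. The class and method names are hypothetical.

```java
class KeyQuoter {
    // Quote a bare word that follows {, a comma, or whitespace
    // and is itself followed by optional whitespace and a colon.
    static String quoteKeys(String s) {
        return s.replaceAll("(?<=[{,\\s])(\\w+)(?=\\s*:)", "\"$1\"");
    }

    public static void main(String[] args) {
        String str = "{abc:\"def\", ghi:\"jkl\"}";
        System.out.println(quoteKeys(str));
    }
}
```

Because both the leading and the trailing context are lookarounds, only the key itself is replaced. For anything beyond simple flat input, a lenient JSON parser is likely a better fit than a regex.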

I have to make an assumption that the "key" and "value" consist of only
"word characters" (\w) and there are no spaces in them.
Here is my program. Please also see the comments in-line:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexJson {
public static void main(String[] args) {
/*
* Note that the input string, when expressed in a Java program, need escape
* for backslash (\) and double quote ("). If you read directly
* from a file then these escapes are not needed
*/
String input = "{abc:\\\"def\\\", ghi:\\\"jkl\\\"}";
// regex for one key-value pair. Eg: abc:\"def\"
String keyValueRegex = "(?<key>\\w+):(?<value>\\\\\\\"\\w+\\\\\\\")";
// regex for a list of key-value pair, separated by a comma (,) and a space ( )
String pairsRegex = "(?<pairs>(,*\\s*"+keyValueRegex+")+)";
// regex include the open and closing braces ({})
String regex = "\\{"+pairsRegex+"\\}";
StringBuilder sb = new StringBuilder();
sb.append("{");
Pattern p1 = Pattern.compile(regex);
Matcher m1 = p1.matcher(input);
while (m1.find()) {
String pairs = m1.group("pairs");
Pattern p2 = Pattern.compile(keyValueRegex);
Matcher m2 = p2.matcher(pairs);
String comma = ""; // first time special
while (m2.find()) {
String key = m2.group("key");
String value = m2.group("value");
sb.append(String.format(comma + "\\\"%s\\\":%s", key, value));
comma = ", "; // second time and onwards
}
}
sb.append("}");
System.out.println("input is: " + input);
System.out.println(sb.toString());
}
}
The output of this program is:
input is: {abc:\"def\", ghi:\"jkl\"}
{\"abc\":\"def\", \"ghi\":\"jkl\"}

java regex match any integer or double then replace non number/decimal characters

I am trying to match a string to any integer or double then, if it does not match, I want to remove all invalid characters to make the string a valid integer or double (or empty string). So far, this is what I have but it will print 15- which is not valid
String anchorGuyField = "15-";
if(!anchorGuyField.matches("-?\\d+(.\\d+)?")){ //match integer or double
anchorGuyField = anchorGuyField.replaceAll("[^-?\\d+(.\\d+)?]", ""); //attempt to replace invalid chars... failing here
}

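You can use Pattern and Matcher to check whether the string is suitable for conversion to int or double and, if it is not, to extract the first number it contains:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Match{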
public static void main(String[] args){
String anchorGuyField = "asdasda-15.56757-asdasd";
if(!anchorGuyField.matches("(-?\\d+(\\.\\d+)?)")){ //match integer or double
Pattern pattern = Pattern.compile("(-?\\d+(\\.\\d+)?)");
Matcher matcher = pattern.matcher(anchorGuyField);
if(matcher.find()){
anchorGuyField = anchorGuyField.substring(matcher.start(),matcher.end());
}
}
System.out.println(anchorGuyField);
}
}
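With
anchorGuyField = anchorGuyField.replaceAll("[^-?\\d+(.\\d+)?]", "");
the square brackets turn the whole expression into a negated character class, so it removes every character that is not a digit or one of - ? + ( . ). Every character of 15- is in that set, so nothing is removed and you still get 15-.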

The negated class matches only characters that are not listed in it. 15- only contains digits and a minus sign, both of which are listed, hence nothing matches the second regex. Maybe you could use something other than a single regex to filter out characters.
Check whether the first character is a minus sign or a digit and remove it otherwise, then remove all non-digit characters from the rest.
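One possible reading of that suggestion, as a sketch rather than a definitive implementation: keep a leading minus sign, drop everything that is not a digit or a dot, and fall back to digits only if the remainder is still not a valid number. The class name, the method name, and the fallback policy are all assumptions.

```java
import java.util.regex.Pattern;

class NumberSanitizer {
    private static final Pattern NUMBER = Pattern.compile("-?\\d+(\\.\\d+)?");

    static String sanitize(String s) {
        if (NUMBER.matcher(s).matches()) {
            return s; // already a valid integer or double
        }
        // Assumed policy: a minus sign only counts as the very first character.
        String sign = s.startsWith("-") ? "-" : "";
        String body = s.replaceAll("[^0-9.]", "");
        if (!NUMBER.matcher(body).matches()) {
            // Malformed decimal such as "1.2.3" or ".5": keep the digits only.
            body = body.replace(".", "");
        }
        return body.isEmpty() ? "" : sign + body;
    }

    public static void main(String[] args) {
        System.out.println(sanitize("15-"));
        System.out.println(sanitize("-15.5x"));
        System.out.println(sanitize("abc"));
    }
}
```

The result is always either an empty string or something the original integer-or-double regex accepts; whether dropping stray dots is acceptable depends on the use case.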

How to remove invalid characters from a string?

I have no idea how to remove invalid characters from a string in Java. I'm trying to remove all the characters that are not numbers, letters, or ( ) [ ] . How can I do this?
Thanks

String foo = "this is a thing with & in it";
foo = foo.replaceAll("[^A-Za-z0-9()\\[\\]]", "");
Javadocs are your friend. Regular expressions are also your friend.
Edit:
That being said, this is only for the Latin alphabet; you can adjust accordingly. \\w can be used in place of A-Za-z0-9 to denote a "word" character if that works for your case, though it also includes _.
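If letters and digits outside the Latin alphabet should be kept as well, one way to adjust the class is to use Unicode categories. This is only a sketch; the class name AllowedCharsFilter and the method name are made up for the example.

```java
class AllowedCharsFilter {
    // Keep Unicode letters (\p{L}), Unicode digits and other numbers (\p{N}),
    // parentheses and square brackets; remove everything else.
    static String keepAllowed(String s) {
        return s.replaceAll("[^\\p{L}\\p{N}()\\[\\]]", "");
    }

    public static void main(String[] args) {
        System.out.println(keepAllowed("this is a thing with & in it"));
        System.out.println(keepAllowed("h\u00e9llo w\u00f6rld (1)[2]&"));
    }
}
```

Note that spaces are removed too, since they are not in the allowed set; add a space inside the brackets if they should survive.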

Using Guava, and almost certainly more efficient (and more readable) than regexes:
CharMatcher desired = CharMatcher.JAVA_DIGIT
.or(CharMatcher.JAVA_LETTER)
.or(CharMatcher.anyOf("()[]"))
.precomputed(); // optional, may improve performance, YMMV
return desired.retainFrom(string);

Try this:
String s = "123abc&^%[]()";
s = s.replaceAll("[^A-Za-z0-9()\\[\\]]", "");
System.out.println(s);
The above will remove characters "&^%" in the sample string, leaving in s only "123abc[]()".

public static void main(String[] args) {
String c = "hjdg$h&jk8^i0ssh6+/?:().,+-#";
System.out.println(c);
Pattern pt = Pattern.compile("[^a-zA-Z0-9/?:().,'+/-]");
Matcher match = pt.matcher(c);
if (!match.matches()) {
c = c.replaceAll(pt.pattern(), "");
}
System.out.println(c);
}

Use this code:
String s = "Test[]";
s = s.replace("[", "");
s = s.replace("]", "");

myString.replaceAll("[^\\w\\[\\]\\(\\)]", "");
The replaceAll method takes a regex as its first parameter and replaces all matches in the string. This regex matches every character that is not a digit, letter, or underscore (\\w) and is not one of the brackets or parentheses you need (\\[\\]\\(\\)).

You can remove special character sequences from your String/URL or any request parameters you get from the user side:
public static String removeSpecialCharacters(String inputString){
final String[] metaCharacters = {"../","\\..","\\~","~/","~"};
String outputString="";
for (int i = 0 ; i < metaCharacters.length ; i++){
if(inputString.contains(metaCharacters[i])){
outputString = inputString.replace(metaCharacters[i],"");
inputString = outputString;
}else{
outputString = inputString;
}
}
return outputString;
}

You can specify the range of characters to keep/remove based on the order of characters in the ASCII table. The regex can use actual characters or character hex codes:
// Example - remove characters outside of the range of "space to tilde".
// 1) using characters
someString.replaceAll("[^ -~]", "");
// 2) using hex codes for "space" and "tilde"
someString.replaceAll("[^\\u0020-\\u007E]", "");

Explain working Regex expression

I found this code that breaks out CSV fields, including ones wrapped in double quotes,
but I don't really understand the pattern matching in the regex.
If someone could give me a step-by-step explanation of how this expression evaluates a pattern, it would be appreciated.
"([^\"]*)"|(?<=,|^)([^,]*)(?:,|$)
Thanks
====
Old posting
This is working well for me - either it matches on "two quotes and whatever is between them", or "something between the start of the line or a comma and the end of the line or a comma". Iterating through the matches gets me all the fields, even if they are empty. For instance,
the quick, "brown, fox jumps", over, "the",,"lazy dog" breaks down into
the quick "brown, fox jumps" over "the" "lazy dog"
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class CSVParser {
/*
* This Pattern will match on either quoted text or text between commas, including
* whitespace, and accounting for beginning and end of line.
*/
private final Pattern csvPattern = Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?:,|$)");
private ArrayList<String> allMatches = null;
private Matcher matcher = null;
private String match = null;
private int size;
public CSVParser() {
allMatches = new ArrayList<String>();
matcher = null;
match = null;
}
public String[] parse(String csvLine) {
matcher = csvPattern.matcher(csvLine);
allMatches.clear();
String match;
while (matcher.find()) {
match = matcher.group(1);
if (match!=null) {
allMatches.add(match);
}
else {
allMatches.add(matcher.group(2));
}
}
size = allMatches.size();
if (size > 0) {
return allMatches.toArray(new String[size]);
}
else {
return new String[0];
}
}
public static void main(String[] args) {
String lineinput = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";
CSVParser myCSV = new CSVParser();
System.out.println("Testing CSVParser with: \n " + lineinput);
for (String s : myCSV.parse(lineinput)) {
System.out.println(s);
}
}
}

I will try to give you hints and the vocabulary you need to find very good explanations on regular-expressions.info
"([^\"]*)"|(?<=,|^)([^,]*)(?:,|$)
() is a group
* is a quantifier
If there is a ? right after the opening parenthesis then it's a special group; here (?<=,|^) is a lookbehind assertion.
Square brackets declare a character class e.g. [^\"]. This one is a special one, because of the ^ at the start. It is a negated character class.
| denotes an alternation, i.e. an OR operator.
(?:,|$) is a non capturing group
$ is a special character in regex, it is an anchor (which matches the end of the string)

"([^\"]*)"|(?<=,|^)([^,]*)(?:,|$)
() capture group
(?:) non-capture group
[] character class: matches any one character listed within the brackets ([^...] matches any one character not listed)
\ escape character; here \" is the Java string escape needed to put a " inside the string literal
(?<=) positive lookbehind (checks that the contained pattern matches immediately before the current position)
| either-or operator (matches either side of the pipe)
^ beginning-of-line operator
* zero or more of the preceding element
$ or \z end-of-line operator
For future reference, please bookmark a good regex reference; it can explain each part quite well.
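The pieces listed above can also be written into the pattern itself. The sketch below uses Pattern.COMMENTS, which ignores whitespace in the pattern and treats # to end of line as a comment; the class name CsvRegexExplained and the fields helper are assumptions made for this example, not part of the original parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CsvRegexExplained {
    static final Pattern CSV = Pattern.compile(
          "\"([^\"]*)\"  # alternative 1: a quoted field, group 1 = text between the quotes\n"
        + "|             # OR\n"
        + "(?<=,|^)      # alternative 2: only right after a comma or at the start of the input\n"
        + "([^,]*)       # group 2 = zero or more characters that are not a comma\n"
        + "(?:,|$)       # then a comma or the end of the input (matched, not captured)\n",
        Pattern.COMMENTS);

    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = CSV.matcher(line);
        while (m.find()) {
            // Exactly one of the two groups participates in each match.
            out.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";
        for (String field : fields(line)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

Printing each field in brackets makes the empty field between the two adjacent commas visible.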
