Replacing all regex matches with masking characters in Java


Java 11 here. I have a huge String that will contain 0+ instances of the following "fizz token":
the substring "fizz"
followed by any integer 0+
followed by an equals sign ("=")
followed by another string of any kind, a.k.a. the "fizz value"
terminated by the first whitespace (including tabs, newlines, etc.)
So some examples of a valid fizz token:
fizz0=fj49jc49fj59
fizz39=f44kk5k59
fizz101023=jjj
Some examples of invalid fizz tokens:
fizz=9d94dj49j4 <-- missing an integer after "fizz" and before "="
fizz2= <-- missing a fizz value after "="
I am trying to write a Java method that will:
Find all instances of matching fizz tokens inside my huge input String
Obtain each fizz token's value
Replace each character of the token value with an upper-case X ("X")
So for example:
| Fizz Token | Token Value | Final Result |
|--------------------|--------------|--------------------|
| fizz0=fj49jc49fj59 | fj49jc49fj59 | fizz0=XXXXXXXXXXXX |
| fizz39=f44kk5k59 | f44kk5k59 | fizz39=XXXXXXXXX |
| fizz101023=jjj | jjj | fizz101023=XXX |
I need the method to do this replacement on the token values of all fizz tokens found in the input string, hence:
String input = "Some initial text fizz0=fj49jc49fj59 then some more fizz101023=jjj";
String masked = mask(input);
// Outputs: Some initial text fizz0=XXXXXXXXXXXX then some more fizz101023=XXX
System.out.println(masked);
My best attempt thus far is a massive WIP:
public class Masker {
private Pattern fizzTokenPattern = Pattern.compile("fizz{d*}=*");
public String mask(String input) {
Matcher matcher = fizzTokenPattern.matcher(input);
int numMatches = matcher.groupCount();
for (int i = 0; i < numMatches; i++) {
// how to get the token value from the group?
String tokenValue = matcher.group(i); // ex: fj49jc49fj59
// how to replace each character with an X?
// ex: fj49jc49fj59 ==> XXXXXXXXXXXX
String masked = tokenValue.replaceAll("*", "X");
// how to grab the original (matched) token and replace it with the new
// 'masked' string?
String entireTokenWithValue = input.substring(matcher.group(i));
}
}
}
I feel like I'm in the ballpark but missing some core concepts. Anybody have any ideas?

According to the requirements
the substring "fizz"
followed by any integer 0+
followed by an equals sign ("=")
followed by another string of any kind, a.k.a. the "fizz value"
terminated by the first whitespace (including tabs, newlines, etc.)
a regex that fulfills them can be built from these parts:
fizz - the literal substring "fizz"
\d+ - one or more digits
= - the equals sign
\S+ - one or more non-whitespace characters (this covers the last two requirements)
which gives us "fizz\\d+=\\S+".
But since you want to modify only part of each match and reuse the rest, we can wrap those parts in groups: "(fizz\\d+=)(\\S+)". This way the replacement needs to
put back what was matched by "(fizz\\d+=)"
modify what was matched by "(\\S+)"
The modification is simply X repeated n times, where n is the length of what was matched by group "(\\S+)".
In other words, your code can look like
import java.util.regex.Pattern;
class Masker {
    // Group 1: "fizz<digits>=", group 2: the value up to the first whitespace
    private static final Pattern p = Pattern.compile("(fizz\\d+=)(\\S+)");
    public static String mask(String input) {
        // Keep group 1 and replace each character of group 2 with X
        return p.matcher(input)
                .replaceAll(match -> match.group(1) + "X".repeat(match.group(2).length()));
    }
    // DEMO
    public static void main(String[] args) throws Exception {
        String input = "Some initial text fizz0=fj49jc49fj59 then some more fizz101023=jjj";
        String masked = Masker.mask(input);
        System.out.println(input);
        System.out.println(masked);
    }
}
Output:
Some initial text fizz0=fj49jc49fj59 then some more fizz101023=jjj
Some initial text fizz0=XXXXXXXXXXXX then some more fizz101023=XXX
Version 2 - with named-groups so more readable/easier to maintain
import java.util.regex.Matcher;
import java.util.regex.Pattern;
class Masker {
    private static final Pattern p = Pattern.compile("(?<token>fizz\\d+=)(?<value>\\S+)");
    public static String mask(String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = p.matcher(input);
        while (m.find()) {
            String token = m.group("token");
            String value = m.group("value");
            String maskedValue = "X".repeat(value.length());
            // Appends the text before this match, then the token with its masked value
            m.appendReplacement(sb, token + maskedValue);
        }
        // Appends whatever follows the last match
        m.appendTail(sb);
        return sb.toString();
    }
    // DEMO
    public static void main(String[] args) throws Exception {
        String input = "Some initial text fizz0=fj49jc49fj59 then some more fizz101023=jjj";
        String masked = Masker.mask(input);
        System.out.println(input);
        System.out.println(masked);
    }
}
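As a quick sanity check against the valid and invalid examples from the question, here is a minimal sketch that runs the same "(fizz\\d+=)(\\S+)" pattern over them. The class name FizzMaskCheck is made up for illustration, and Java 11+ is assumed (for String.repeat and the lambda overload of Matcher.replaceAll).

```java
import java.util.regex.Pattern;

class FizzMaskCheck {
    // Same pattern as in the answer: group 1 is "fizz<digits>=", group 2 is the value.
    private static final Pattern FIZZ = Pattern.compile("(fizz\\d+=)(\\S+)");

    static String mask(String input) {
        return FIZZ.matcher(input)
                .replaceAll(m -> m.group(1) + "X".repeat(m.group(2).length()));
    }

    public static void main(String[] args) {
        // Valid token: the value is masked.
        System.out.println(mask("a fizz0=fj49jc49fj59 b"));
        // Invalid: no integer between "fizz" and "=", so the text is left alone.
        System.out.println(mask("fizz=9d94dj49j4"));
        // Invalid: no value after "=", so the text is left alone.
        System.out.println(mask("fizz2= next"));
        // Tabs and newlines also terminate a value.
        System.out.println(mask("fizz7=a\tfizz8=bb\nend"));
    }
}
```

The two invalid tokens pass through unchanged because \d+ and \S+ each require at least one character.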

Related

Split String by | and numbers

Let's imagine I have the following strings:
String one = "123|abc|123abc";
String two = "123|ab12c|abc|456|abc|def";
String three = "123|1abc|1abc1|456|abc|wer";
String four = "123|abc|def|456|ghi|jkl|789|mno|pqr";
If I do a split on them I expect the following output:
one = ["123|abc|123abc"];
two = ["123|ab12c|abc", "456|abc|def"];
three = ["123|1abc|1abc1", "456|abc|wer"];
four = ["123|abc|def", "456|ghi|jkl", "789|mno|pqr"];
The string has the following structure:
It starts with 1 or more digits, followed by any number of groups made of a | and some characters.
When a | is followed by only digits, those digits start a new value.
More examples:
In - 123456|xxxxxx|zzzzzzz|xa2314|xzxczxc|1234|qwerty
Out - ["123456|xxxxxx|zzzzzzz|xa2314|xzxczxc", "1234|qwerty"]
Tried multiple variations of the following but does not work:
value.split( "\\|\\d+|\\d+" )

You may split on \|(?=\d+(?:\||$)):
List<String> nums = Arrays.asList(
    "123|abc|123abc",
    "123|ab12c|abc|456|abc|def",
    "123|1abc|1abc1|456|abc|wer",
    "123|abc|def|456|ghi|jkl|789|mno|pqr"
);
for (String num : nums) {
    // Split on a pipe only if it is followed by digits and then another pipe or the end
    String[] parts = num.split("\\|(?=\\d+(?:\\||$))");
    System.out.println(num + " => " + Arrays.toString(parts));
}
This prints:
123|abc|123abc => [123|abc|123abc]
123|ab12c|abc|456|abc|def => [123|ab12c|abc, 456|abc|def]
123|1abc|1abc1|456|abc|wer => [123|1abc|1abc1, 456|abc|wer]
123|abc|def|456|ghi|jkl|789|mno|pqr => [123|abc|def, 456|ghi|jkl, 789|mno|pqr]
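The same lookahead split can be tried on the extra example from the question. This is only a sketch: the class and method names (PipeSplitCheck, splitRecords) are invented here and are not part of the original answer.

```java
import java.util.Arrays;

class PipeSplitCheck {
    // Split on a pipe only when the segment after it consists solely of digits,
    // i.e. digits followed by another pipe or by the end of the input.
    static String[] splitRecords(String value) {
        return value.split("\\|(?=\\d+(?:\\||$))");
    }

    public static void main(String[] args) {
        String in = "123456|xxxxxx|zzzzzzz|xa2314|xzxczxc|1234|qwerty";
        System.out.println(Arrays.toString(splitRecords(in)));
    }
}
```

The segment xa2314 does not trigger a split because the lookahead requires the digits to start right after the pipe.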

Instead of splitting, you can match the parts in the string:
\b\d+(?:\|(?!\d+(?:$|\|))[^|\r\n]+)*
\b A word boundary
\d+ Match 1+ digits
(?: Non capture group
\|(?!\d+(?:$|\|)) Match | and assert not only digits till either the next pipe or the end of the string
[^|\r\n]+ Match 1+ chars other than a pipe or a newline
)* Close the non capture group and optionally repeat (use + to repeat one or more times to match at least one pipe char)
String regex = "\\b\\d+(?:\\|(?!\\d+(?:$|\\|))[^|\\r\\n]+)+";
String string = "123|abc|def|456|ghi|jkl|789|mno|pqr";
Pattern pattern = Pattern.compile(regex);
Matcher m = pattern.matcher(string);
List<String> matches = new ArrayList<String>();
// Collect every match instead of splitting
while (m.find()) {
    matches.add(m.group());
}
for (String s : matches) {
    System.out.println(s);
}
Output
123|abc|def
456|ghi|jkl
789|mno|pqr
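For comparison, here is a small sketch of the variant with the * quantifier from the explanation, which also accepts a value that has no pipe at all. The names PipeMatchCheck and records are hypothetical and only serve this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PipeMatchCheck {
    // Digits, then zero or more "|segment" parts where the segment is not digits-only.
    private static final Pattern RECORD =
            Pattern.compile("\\b\\d+(?:\\|(?!\\d+(?:$|\\|))[^|\\r\\n]+)*");

    static List<String> records(String value) {
        List<String> out = new ArrayList<>();
        Matcher m = RECORD.matcher(value);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(records("123|ab12c|abc|456|abc|def"));
        System.out.println(records("123|abc|123abc"));
        System.out.println(records("42"));
    }
}
```

With + in place of * the last input would produce no match, since at least one pipe-separated segment would then be required.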

Is there a regex expression suitable for splitting two one digit numbers separated by a whitespace?

I'm prompting the user for input using the Java Scanner. The input needs to be two single digits from 0-2 separated by a whitespace (e.g. "1 2").
When I try to use \s to split "1 2" I get an ArrayIndexOutOfBoundsException,
whereas when I split "1-2" with \- it works perfectly fine.
I'm completely new to regex and would really appreciate some help :)
My code:
public void x() {
int n = -1;
Scanner scanner = new Scanner(System.in);
System.out.println("Pick your coordinates. X goes first. Eg. 1 1");
String input = scanner.nextLine();
for (int i = 0; i <= 2; i++) {
// input = input.replaceAll("\\s", "-").toLowerCase();
parts = input.split("\\d{1}\\s\\d{1}");
String x = parts[0];
String y = parts[1];
Pattern pattern = Pattern.compile(String.valueOf(i));
Matcher matcher = pattern.matcher(x);
boolean matchFound = matcher.find();
if (matchFound) {
break;
} else {
n = i;
System.out.println("match N");
}
}
System.out.println("t");
}

In IntelliJ IDEA, if you hover the mouse over split you can see the documentation of the method:
public String[] split(String regex)
Splits this string around matches of the given regular expression.
This method works as if by invoking the two-argument split method with
the given expression and a limit argument of zero. Trailing empty
strings are therefore not included in the resulting array.
The string "boo:and:foo", for example, yields the following results
with these expressions:
+-------+-------------------------+
| Regex | Result |
+-------+-------------------------+
| : | { "boo", "and", "foo" } |
| o | { "b", "", ":and:f" } |
+-------+-------------------------+
Parameters:
regex - the delimiting regular expression
Returns:
the array of strings computed by splitting this string around matches
of the given regular expression
Therefore, you only have to pass in \\s to split by any space character.
Try
parts = input.split("\\s");

I believe you're looking for something like this:
\d{1}\s\d{1}
\d{1} = Match any digit, 1 time
\s = Followed by Any whitespace

According to documentation:
split(String regex)
Splits this string around matches of the given regular expression.
So, you are trying to split string in the wrong way.
If you want to validate that pattern then you can do this by the following approach:
input.matches("\\d{1}\\s\\d{1}")
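To see why the original call fails, and how validating and splitting fit together, here is a minimal sketch. The class and method names (CoordinateInputCheck, parse) are made up, and the range [0-2] is taken from the question's description of the input.

```java
import java.util.Arrays;

class CoordinateInputCheck {
    // Returns {x, y} for input such as "1 2", or null if the input is not
    // two digits in the range 0-2 separated by one whitespace character.
    static int[] parse(String input) {
        if (!input.matches("[0-2]\\s[0-2]")) {
            return null;
        }
        // The delimiter is the whitespace, not the whole "digit space digit" shape.
        String[] parts = input.split("\\s");
        return new int[] { Integer.parseInt(parts[0]), Integer.parseInt(parts[1]) };
    }

    public static void main(String[] args) {
        // Here the delimiter regex matches the entire input, so only empty strings
        // remain and split drops them: the array is empty and parts[0] would throw.
        System.out.println("1 2".split("\\d{1}\\s\\d{1}").length);
        System.out.println(Arrays.toString(parse("1 2")));
        System.out.println(Arrays.toString(parse("3 1")));
    }
}
```

Use matches to check the shape of the whole input and split only to cut it at the delimiter.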

JAVA regular expression to replace all occurrences of a particular word "working weird"

Does anyone see something wrong with this regex? All I want is for it to find every occurrence of "the" and replace it with whatever word the user chooses. The expression only changes some occurrences, and when it does, it removes the preceding whitespace, so the replacement ends up joined to the word before it.
Also, it should not replace then, there, their, they, etc.
private final String MY_REGEX = (" the | THE | thE | The | tHe | ThE ");
userInput = JTxtInput.getText();
String usersChoice = JTxtUserChoice.getText();
String usersChoiceOut = (usersChoice + " ");
Pattern pattern = Pattern.compile(MY_REGEX, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(userInput);
while (matcher.find())
{
userInput = userInput.replaceAll(MY_REGEX, usersChoiceOut);
JTxtOutput.setText(userInput);
System.out.println(userInput);
}
Ok this new code seems to replace all desired words and nothing else, also doing it without the spacing issues.
private final String MY_REGEX = ("the |THE |thE |The |tHe |ThE |THe ");
String usersChoiceOut = (usersChoice + " ");

The problem is because of the spaces in MY_REGEX. Check the following demo:
public class Main {
    public static void main(String[] args) {
        String str = "This is the eighth wonder of THE world! How about a new style of writing The as tHe";
        // Correct way
        String MY_REGEX = ("the|THE|thE|The|tHe|ThE");
        System.out.println(str.replaceAll(MY_REGEX, "###"));
    }
}
Outputs:
This is ### eighth wonder of ### world! How about a new style of writing ### as ###
whereas
public class Main {
    public static void main(String[] args) {
        String str = "This is the eighth wonder of THE world! How about a new style of writing The as tHe";
        // Incorrect way
        String MY_REGEX = ("the | THE | thE | The | tHe | ThE");
        System.out.println(str.replaceAll(MY_REGEX, "###"));
    }
}
Outputs:
This is ###eighth wonder of###world! How about a new style of writing###as tHe

The spaces in the alternation are meaningful: the regex tries to match them literally on both sides of the word.
Since you are already using Pattern.CASE_INSENSITIVE, the alternation is unnecessary. You could match a single the followed by a space, as in your updated question, and use the inline modifier (?i) to make the pattern case insensitive.
userInput = userInput.replaceAll("(?i)the ", usersChoiceOut);
If the should not be the end of a larger word, add a word boundary \b before it.
(?i)\bthe
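A variation on this idea, sketched below under the assumption that "the" should only be replaced as a whole word: putting \b on both sides means no spaces need to be matched or restored. The class and method names are invented for the example, and Matcher.quoteReplacement is added so that a user-supplied word containing $ or \ is inserted literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordReplaceCheck {
    // \b on both sides keeps "then", "they", "there" etc. from matching.
    private static final Pattern THE = Pattern.compile("\\bthe\\b", Pattern.CASE_INSENSITIVE);

    static String replaceThe(String text, String usersChoice) {
        // quoteReplacement stops '$' and '\' in the user's word from being
        // interpreted as group references or escapes.
        return THE.matcher(text).replaceAll(Matcher.quoteReplacement(usersChoice));
    }

    public static void main(String[] args) {
        System.out.println(replaceThe("The cat saw the dog, then they left THE room", "a"));
    }
}
```

Because no whitespace is consumed by the match, the spacing of the original sentence is preserved.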

How does [] make a difference in Java regex?

I have a regex for validation of UTF-8 characters.
String regex = "[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{S}\\p{C}]*"
I wanted to do a range check too so I modified it to
String regex = "[[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{S}\\p{C}]*]"
String rangeRegex = regex + "{0,30}"
Notice that it’s the same regex I just wrapped it with [ ].
Now I can validate with the range by using rangeRegex but regex is now not validating UTF-8 chars.
My question is: how is [] affecting regex? If I remove [] from the original regex it will validate UTF-8 chars but not with range. If I put [] it will validate with range but not without range!
sample test code -
public class Test {
static String regex = "[[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{S}\\p{C}]*]" ;
public static void main(String[] args) {
String userId = null;
//testUserId(userId);
userId = "";
testUserId(userId);
userId = "æÆbBcCćĆčČçďĎǳǲdzsDzs";
testUserId(userId);
userId = "test123";
testUserId(userId);
userId = "abcxyzsd";
testUserId(userId);
String zip = "i«♣│axy";
testZip(zip);
zip = "331fsdfsdfasdfasd02c3";
testZip(zip);
zip = "331";
testZip(zip);
}
/**
* without range check
* @param userId
*/
static void testUserId(String userId){
boolean pass = true;
if ( !stringValidator(userId, regex)) {
pass = false;
}
System.out.println(pass);
}
/**
* with a range check
* @param zip
*/
static void testZip(String zip){
boolean pass = true;
String regex1 = regex + "{0,10}";
if (StringUtils.isNotBlank(zip) && !stringValidator(zip, regex1)) {
pass = false;
}
System.out.println(pass);
}
static boolean stringValidator(String str, String regex) {
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(str);
return matcher.matches();
}
}

The explanations given are rather wrong for Java regex.
In Java, unescaped paired square brackets inside a character class are not treated as literal [ and ] characters. They have a special meaning in Java character classes:
[a-d[m-p]] a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]] d, e, or f (intersection)
[a-z&&[^bc]] a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]] a through z, and not m through p: [a-lq-z] (subtraction)
So, when you wrap your regex in [...], you get a union of the previous character class with a literal * character: the pattern matches a single character that is either in [\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{S}\\p{C}] or a literal *.
Also, [[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{S}\\p{C}]*] is equal to [\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{S}\\p{C}*] as * symbol inside a character class stops being a special character (a quantifier) and becomes a literal asterisk symbol.
If you use [[]], the engine will throw an exception: Unclosed character class near index 3
System.out.println("abc[]".replaceAll("[[abc]]", "")); // => []
System.out.println("abc[]".replaceAll("[[]]", "")); // => error
Whenever you need to check the length of a string with regex, you need anchors and a limiting quantifier. Anchors are automatically added when a regex is used with Matcher#matches method:
The matches method attempts to match the entire input sequence against the pattern.
Example code:
String regex = "[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{S}\\p{C}]";
String new_regex = regex + "{0,30}";
System.out.println("Some string".matches(new_regex)); // => true
UPDATE
Here is commented code of yours:
String userId = "";
testUserId(userId); // false - Correct as we test an empty string with an at-least-one-char regex
userId = "æÆbBcCćĆčČçďĎǳǲdzsDzs";
testUserId(userId); // false - Correct as we only match 1 character string, others fail
userId = "test123";
testUserId(userId); // false - see above
userId = "abcxyzsd";
testUserId(userId); // false - see above
String zip = "i«♣│axy";
testZip(zip); // true - OK, 7-symbol string matches against [...]{0,10} regex
zip = "331fsdfsdfasdfasd02c3";
testZip(zip); // false - OK, 21-symbol string does not match a regex that requires only 0 to 10 characters
zip = "331";
testZip(zip); // true - OK, 3-symbol string matches against [...]{0,10} regex
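The results annotated above can be reproduced with a smaller pattern. The sketch below uses [a-c] as a stand-in for the long list of Unicode categories. The class name and constants are made up for illustration.

```java
class NestedClassCheck {
    // A nested class plus '*': one character that is a, b, c or a literal asterisk.
    static final String UNION = "[[a-c]*]";
    // The original form: zero or more characters from the class.
    static final String REPEATED = "[a-c]*";
    // The class with a limiting quantifier for the length check.
    static final String BOUNDED = "[a-c]{0,3}";

    public static void main(String[] args) {
        System.out.println("*".matches(UNION));
        System.out.println("ab".matches(UNION));
        System.out.println("ab".matches(REPEATED));
        System.out.println("abcd".matches(BOUNDED));
    }
}
```

Because String.matches tests the entire input, UNION only ever accepts a string of exactly one character, which is why the userId tests fail once the brackets are added.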

* means 0 or more, so it is almost like {0,}. i.e. you can replace the * with {0,30} and it should do what you want:
[\p{L}\p{M}\p{N}\p{P}\p{Z}\p{S}\p{C}]{0,30}
[] creates a character class, so [[]] would be "a character class of just [" followed by ], since the first ] closes the character class prematurely, which doesn't really do what you want.
Also correct me if I'm wrong, but the character list you are generating is pretty much everything, so you could go with .{0,30} for the same effect.

How can I replace a named group's value

I have the regex
private static final String COPY_NUMBER_REGEX = ".*copy(?<copy_num>\\d+)";
And I need to replace the named group as follows:
private void setCopyNum(){
Pattern pa = Pattern.compile(COPY_NUMBER_REGEX);
Matcher ma = pa.matcher(template.getName());
if(ma.find()){
Integer numToReplace = Integer.valueOf(ma.group("copy_num")) + 1;
//How to replace the value of the original captured group with numToReplace?
}
}
The question is in the comment in fact. Is there something in Java Regex that allows us to replace named groups value? I mean to get a new String with a replaced value, of course. For instance:
input: Happy birthday template - copy(1)
output: Happy birthday template - copy(2)

Here's a quick and dirty solution:
// (?<=copy)        - lookbehind: preceded by "copy"
// (?<copynum>\\d+) - named group "copynum": one or more digits
final String COPY_NUMBER_REGEX = "(?<=copy)(?<copynum>\\d+)";
// using String instead of your custom Template object
String template = "blah copy41";
Pattern pa = Pattern.compile(COPY_NUMBER_REGEX);
Matcher ma = pa.matcher(template);
StringBuffer sb = new StringBuffer();
// iterating over the matches
while (ma.find()) {
    Integer numToReplace = Integer.valueOf(ma.group("copynum")) + 1;
    // the whole match is just the digits, so replacing the match replaces the group
    ma.appendReplacement(sb, String.valueOf(numToReplace));
}
ma.appendTail(sb);
System.out.println(sb.toString());
Output
blah copy42
Notes
copy_num is an invalid named group - no underscores allowed
The example is self-contained (would work in a main method). You'll need to adapt slightly to your context.
You might need to add escaped parentheses around your named group, if you need to actually match those: "(?<=copy\\()(?<copynum>\\d+)(?=\\))"
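Applied to the input from the question, the variant with escaped parentheses could look like the sketch below. It assumes Java 9 or later for the lambda overload of Matcher.replaceAll, and the names CopyNumberCheck and nextCopyName are hypothetical.

```java
import java.util.regex.Pattern;

class CopyNumberCheck {
    // The lookbehind and lookahead keep "copy(" and ")" out of the match,
    // so the whole match is just the digits of the named group.
    private static final Pattern COPY_NUM =
            Pattern.compile("(?<=copy\\()(?<copynum>\\d+)(?=\\))");

    static String nextCopyName(String name) {
        // m.group() is the matched number; the incremented value replaces it.
        return COPY_NUM.matcher(name)
                .replaceAll(m -> String.valueOf(Integer.parseInt(m.group()) + 1));
    }

    public static void main(String[] args) {
        System.out.println(nextCopyName("Happy birthday template - copy(1)"));
    }
}
```

A name without a copy(n) suffix is returned unchanged, since there is nothing for the pattern to match.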
