regex help in java

I'm trying to compare following strings with regex:
#[xyz="1","2"'"4"] ------- valid
#[xyz] ------------- valid
#[xyz="a5","4r"'"8dsa"] -- valid
#[xyz="asd"] -- invalid
#[xyz"asd"] --- invalid
#[xyz="8s"'"4"] - invalid
The valid pattern should be:
#[xyz, then an = sign, then some characters, then a comma, then some characters, then a ', then some characters, and finally ]. In other words, if any characters follow xyz, they must be in the format ="XXX","XXX"'"XXX".
Or only #[xyz], with no characters after xyz.
I have tried the following regex, but it did not work:
String regex = "#[xyz=\"[a-zA-z][0-9]\",\"[a-zA-z][0-9]\"'\"[a-zA-z][0-9]\"]";
Here the quotations (the part after xyz) are optional, the number of characters between the quotes is not fixed, and there may also be some characters before and after this pattern, as in asdadad #[xyz] adadad.

You can use the regex:
#\[xyz(?:="[a-zA-Z0-9]+","[a-zA-Z0-9]+"'"[a-zA-Z0-9]+")?\]
Expressed as a Java string it'll be:
String regex = "#\\[xyz(?:=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")?\\]";
What was wrong with your regex?
[...] defines a character class. To match a literal [ or ] you need to escape it with a preceding \.
[a-zA-z][0-9] matches a single letter followed by a single digit, but you want one or more alphanumeric characters, so you need [a-zA-Z0-9]+. Note also that the range A-z (with a lowercase z) covers the ASCII characters [, \, ], ^, _ and ` in addition to the letters, so write A-Z.
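To check such a pattern against the six strings from the question, a minimal sketch along these lines could be used. The class name XyzTagCheck and the helper isValid are made up for illustration, and matches() is used, so the whole input has to fit the pattern.

```java
import java.util.regex.Pattern;

final class XyzTagCheck {
    // "#[xyz", then an optional ="XXX","XXX"'"XXX" part, then "]".
    // Each XXX is one or more ASCII letters or digits.
    private static final Pattern TAG = Pattern.compile(
        "#\\[xyz(?:=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")?\\]");

    // Hypothetical helper: true when the entire input is a well-formed tag.
    static boolean isValid(String input) {
        return TAG.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "#[xyz=\"1\",\"2\"'\"4\"]",
            "#[xyz]",
            "#[xyz=\"a5\",\"4r\"'\"8dsa\"]",
            "#[xyz=\"asd\"]",
            "#[xyz\"asd\"]",
            "#[xyz=\"8s\"'\"4\"]"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```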

Use this:
String regex = "#\\[xyz(=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")?\\]";
When you write [a-zA-z][0-9], it expects one letter followed by one digit. You also have to escape the first and last square brackets, because square brackets have a special meaning in regexes.
Explanation:
[a-zA-Z0-9]+ means an alphanumeric character (but not an underscore) one or more times.
(=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")? means that the expression in parentheses can occur once or not at all.

Square brackets have a special meaning in regex (you used it yourself: they define character classes), so you need to escape them if you want to match them literally.
String regex = "#\\[xyz=\"[a-zA-Z][0-9]\",\"[a-zA-Z][0-9]\"'\"[a-zA-Z][0-9]\"\\]";
The next problem is that with [a-zA-Z][0-9] you define "first a letter, then a digit". You need to join those classes and add a quantifier:
String regex = "#\\[xyz=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\"\\]";

there could also be some characters before and after this pattern like
asdadad #[xyz] adadad.
The regex should be:
String regex = ".*#\\[xyz(=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")?\\].*";
The leading and trailing .* allow any text before and after the pattern, as you mentioned in your edit. As @ademiban said, (=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")? occurs once or not at all. The other mistakes are also very well explained in the other answers, +1 to all of them.
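The surrounding wildcards are only needed with String.matches() or Matcher.matches(), which require the entire input to match. As an alternative sketch (not what the answer above uses), Matcher.find() searches for the pattern anywhere in the input. The class name XyzTagFinder and its method names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XyzTagFinder {
    // Same tag pattern as before, with no wildcards around it.
    private static final Pattern TAG = Pattern.compile(
        "#\\[xyz(?:=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")?\\]");

    // True when a well-formed tag occurs somewhere in the input.
    static boolean containsTag(String input) {
        return TAG.matcher(input).find();
    }

    // Returns the first well-formed tag, or null when there is none.
    static String firstTag(String input) {
        Matcher m = TAG.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstTag("asdadad #[xyz] adadad"));
        System.out.println(containsTag("asdadad #[xyz=\"asd\"] adadad"));
    }
}
```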

Related

Erase any string that doesn't match a pattern using replaceall()

I need to replace ALL characters that don't follow a pattern with "".
I have strings like:
MCC-QX-1081
TEF-CO-QX-4949
SPARE-QX-4500
So far, the closest I have come is the following regex:
String regex = "[^QX,-,\\d]";
Using the String replaceAll method I get QX1081, but the expected result is QX-1081.

You're using a character class which matches single characters, not patterns.
You want something like
String resultString = subjectString.replaceAll("^.*?(QX-\\d+)?$", "$1");
which works as long as nothing follows the QX-digits part in your strings.
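Applied to the three sample strings from the question, that call can be tried out as follows. This is only a sketch; the class name QxSuffix and the method keepQx are placeholders.

```java
final class QxSuffix {
    // Lazily skips everything up to an optional trailing "QX-<digits>" and
    // replaces the whole string with that captured part (empty if absent).
    static String keepQx(String subject) {
        return subject.replaceAll("^.*?(QX-\\d+)?$", "$1");
    }

    public static void main(String[] args) {
        String[] samples = { "MCC-QX-1081", "TEF-CO-QX-4949", "SPARE-QX-4500" };
        for (String s : samples) {
            System.out.println(s + " -> " + keepQx(s));
        }
    }
}
```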

Put the dash at the end of the character class: [^QX,\d-]
Then you just have to use substring to drop the leading dash.
I don't know exactly what you expect for all strings, but to match a literal dash inside a character class it must be the first or last character of the class, or be escaped.

You are using a character class, in which you have to either escape the hyphen or put it at the start or at the end, as in [^QX,\d-]. Otherwise ,-, is read as a range from a comma to a comma. But changing that will give you -QX-1081, which is not the desired result.
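The difference between the two character classes can be seen with a quick experiment. This is only a sketch, and the class and method names are arbitrary.

```java
final class HyphenInClassDemo {
    // ",-," is parsed as a range from ',' to ',', so '-' is not part of
    // the class and is removed along with the letters.
    static String stripWithRange(String s) {
        return s.replaceAll("[^QX,-,\\d]", "");
    }

    // A trailing '-' is a literal hyphen, so hyphens are kept.
    static String stripWithLiteralHyphen(String s) {
        return s.replaceAll("[^QX,\\d-]", "");
    }

    public static void main(String[] args) {
        System.out.println(stripWithRange("MCC-QX-1081"));         // QX1081
        System.out.println(stripWithLiteralHyphen("MCC-QX-1081")); // -QX-1081
    }
}
```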
You could match your pattern and then replace with the first capturing group $1:
^(?:[A-Z]+-)+(QX-\d+)$
In a Java string literal you have to double the backslash to match a digit: \\d
That will match:
^ Start of the string
(?:[A-Z]+-)+ Repeat 1+ times: one or more uppercase characters followed by a hyphen
(QX-\d+) Capture in a group QX- followed by 1+ digits
$ End of the string
For example:
String result = "MCC-QX-1081".replaceAll("^(?:[A-Z]+-)+(QX-\\d+)$", "$1");
System.out.println(result); // QX-1081
Note that if you are doing just 1 replacement, you could also use replaceFirst

Formulating a regex with a single dot

I am trying to formulate a regex for the following scenario :
The String to match : mName87.com
So the string may consist of any number of alphanumeric characters, but can contain only a single dot anywhere in the string.
I formulated this regex: [a-zA-Z0-9.], but it matches even multiple dots (.).
What am I doing wrong here?

The regex you provided matches only a single character of the string you're trying to validate. There are a few things to take care of in your scenario:
You want to match over the whole string, so your regex must start with ^ (beginning of the string) and end with $ (end of the string).
Then you want to accept any number of alpha-numeric characters, this is done with [a-zA-Z0-9]+, here the + means one or more characters.
Then match the point: \. (you must escape it here)
Finally accept more characters again.
All together the regex would then be:
^[a-zA-Z0-9]+\.[a-zA-Z0-9]+$
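A small sketch for trying this pattern out is shown below. The class name SingleDotCheck and the helper hasSingleDot are invented for the example. Note that this pattern requires at least one alphanumeric character on each side of the dot.

```java
import java.util.regex.Pattern;

final class SingleDotCheck {
    // One or more alphanumerics, exactly one literal dot, one or more alphanumerics.
    private static final Pattern SINGLE_DOT =
        Pattern.compile("^[a-zA-Z0-9]+\\.[a-zA-Z0-9]+$");

    static boolean hasSingleDot(String input) {
        return SINGLE_DOT.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = { "mName87.com", "a.b.c", "abc", ".com" };
        for (String s : samples) {
            System.out.println(s + " -> " + hasSingleDot(s));
        }
    }
}
```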

You can use this regex:
\\w*\\.\\w*

Try with:
^[a-zA-Z0-9]+\.[a-zA-Z]+$

Use this regular expression: ^[a-zA-Z0-9]*\.[a-zA-Z0-9]*$

EDITED:
Try
([a-zA-Z0-9]+\.[a-zA-Z0-9]+)|(\.[a-zA-Z0-9]+)|([a-zA-Z0-9]+\.)
That is: [two words with the dot in the middle] OR [a word that starts with a dot] OR [a word that ends with a dot]

A regex that doesn't match with this character sequence

Here is my regex. I am trying to find all special characters so that I can escape them.
(\(|\)|\[|\]|\{|\}|\?|\+|\\|\.|\$|\^|\*|\||\!|\&|\-|\#|\#|\%|\_|\"|\:|\<|\>|\/|\;|\'|\`|\~)
My problem is that I don't want to escape certain special characters when they appear in a sequence
like this (.*)
So, let's consider an example.
String message = "Hi, Mr.Xyz! Your account number is :- (1234567890) , (,*) &$#%#*(....))(((";
After escaping with the current regex, what I get is:
Hi, Mr\.Xyz\! Your account number is \:\- \(1234567890\) , \(,\*\) \&\$\#\%\#\*\(\.\.\.\.\)\)\(\(\(
But I don't want to escape the (.*) part; I want to keep it as it is.
The regex above is used only for searching, so if it simply did not match the (.*) part, my problem would be solved.
Can anyone suggest a regex that doesn't escape that part of the string?

See @nhahtdh's answer for how to do this with a regex.
As an alternative, here is a solution which does not use a regex, using Guava's CharMatcher instead:
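import com.google.common.base.CharMatcher;

// Placeholder: list every character that should be escaped.
private static final CharMatcher SPECIAL
    = CharMatcher.anyOf("allspecialcharshere");
// The sequence that must be copied through unescaped.
private static final String NO_ESCAPE = "(.*)";

public String doEncode(String input)
{
    StringBuilder sb = new StringBuilder(input.length());
    String tmp = input;
    while (!tmp.isEmpty()) {
        // Copy the protected sequence verbatim and skip past it.
        if (tmp.startsWith(NO_ESCAPE)) {
            sb.append(NO_ESCAPE);
            tmp = tmp.substring(NO_ESCAPE.length());
            continue;
        }
        // Otherwise handle one character, escaping it if it is special.
        char c = tmp.charAt(0);
        if (SPECIAL.matches(c))
            sb.append('\\');
        sb.append(c);
        tmp = tmp.substring(1);
    }
    return sb.toString();
}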
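For readers without Guava on the classpath, the same loop can be sketched with plain JDK calls. This is a variant, not the code from the answer: the class name SequenceAwareEscaper is invented, the SPECIAL string is an assumption filled in from the alternation in the question, and an index is advanced instead of calling substring on every step.

```java
final class SequenceAwareEscaper {
    // Assumed set of characters to escape, copied from the question's alternation.
    private static final String SPECIAL = "()[]{}?+\\.$^*|!&-#%_\":<>/;'`~";
    // The sequence that must be copied through unescaped.
    private static final String NO_ESCAPE = "(.*)";

    static String escape(String input) {
        StringBuilder sb = new StringBuilder(input.length() * 2);
        int i = 0;
        while (i < input.length()) {
            // Copy the protected sequence verbatim and jump past it.
            if (input.startsWith(NO_ESCAPE, i)) {
                sb.append(NO_ESCAPE);
                i += NO_ESCAPE.length();
                continue;
            }
            // Otherwise handle one character, escaping it if it is special.
            char c = input.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
            i++;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("Mr.Xyz! (.*) (1)"));
    }
}
```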

This answer is to demonstrate the possibility only. Using it in production code is questionable.
It is possible with Java String replaceAll function:
String input = "Hi, Mr.Xyz! Your account number is :- (1234567890) , (.*) &$#%#*(....))(((";
String output = input.replaceAll("\\G((?:[^()\\[\\]{}?+\\\\.$^*|!&##%_\":<>/;'`~-]|\\Q(.*)\\E)*+)([()\\[\\]{}?+\\\\.$^*|!&##%_\":<>/;'`~-])", "$1\\\\$2");
Result:
"Hi, Mr\.Xyz\! Your account number is \:\- \(1234567890\) , (.*) \&\$\#\%\#\*\(\.\.\.\.\)\)\(\(\("
Another test:
String input = "(.*) sdfHi test message <> >>>>><<<<f<f<,,,,<> <>(.*) sdf (.*) sdf (.*)";
Result:
"(.*) sdfHi test message \<\> \>\>\>\>\>\<\<\<\<f\<f\<,,,,\<\> \<\>(.*) sdf (.*) sdf (.*)"
Explanation
Raw regex:
\G((?:[^()\[\]{}?+\\.$^*|!&##%_":<>/;'`~-]|\Q(.*)\E)*+)([()\[\]{}?+\\.$^*|!&##%_":<>/;'`~-])
Note that \ is escaped once more when the regex is specified inside the string, and " needs to be escaped. The resulting regex in string can be seen above.
Raw replacement string:
$1\\$2
Since $ has a special meaning in the replacement string and you want to keep it for $2, you need to escape the \ so that the \ won't escape the $. And when putting the replacement string inside a quoted string literal, you need to double the number of \ to escape each \.
Before we dissect the monster, let's talk about the idea. We consume non-special characters and the sequence that we don't want to replace, as many times as possible. The next character will then either be a special character that is not part of the sequence we don't want to replace, or the end of the string (which means we have found all the characters that need replacing, if any).
Naturally, we can think of any arbitrary string as a run of consecutive instances of the following pattern: [0 or more (non-special character or special pattern not to be replaced)][special character], with the string ending in [0 or more (non-special character or special pattern not to be replaced)].
When used with a regex without \G, the replaceAll function may find matches that are not consecutive, which can cut into the middle of the sequence not to be replaced and mess it up. \G means the boundary of the last match, and can be used to make sure the next match starts from where the last match left off.
\G: Starts from last match
((?:[^()\[\]{}?+\\.$^*|!&##%_":<>/;'`~-]|\Q(.*)\E)*+): Capture 0 or more of the non-special characters or the special pattern not to be replaced. Note that I have added the possessive quantifier + after *. This prevents the engine from backtracking when it cannot find the special character that we specify after this.
[^()\[\]{}?+\\.$^*|!&##%_":<>/;'`~-]: Negated character class of special characters.
\Q(.*)\E: Special sequence (.*) not to be replaced, literal quoted by \Q and \E.
([()\[\]{}?+\\.$^*|!&##%_":<>/;'`~-]): Capture the single special character.
The whole regex will match string with minimum length of 1 (the special character). The first capturing group contains the parts that shouldn't be replaced, and the 2nd capturing group contains the special character that should be replaced.
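A different way to get the same effect, shown here only as a sketch and not as the technique from this answer, is to match either the protected sequence or a single special character and decide per match what to emit. Because the alternation tries the (.*) literal first at each position, it is consumed whole before its characters can be seen individually. The class name AlternationEscaper is made up, and the character class is taken from the regex above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AlternationEscaper {
    // Group 1: the literal sequence (.*) to keep as is.
    // Group 2: a single special character that needs a backslash.
    private static final Pattern TOKEN = Pattern.compile(
        "(\\Q(.*)\\E)|([()\\[\\]{}?+\\\\.$^*|!&#%_\":<>/;'`~-])");

    static String escape(String input) {
        Matcher m = TOKEN.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String replacement = (m.group(1) != null)
                ? m.group(1)            // protected sequence, unchanged
                : "\\" + m.group(2);    // special character, escaped
            // quoteReplacement keeps '\' and '$' from being interpreted.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a.b (.*) c!"));
    }
}
```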

regex to find substring between special characters

I am running into this problem in Java.
I have data strings that contain entities enclosed between & and ;. For example:
&Text.ABC;, &Links.InsertSomething;
These entities can be anything from the ini file we have.
I need to find these strings in the input string and remove them. There can be zero, one, or more occurrences of these entities in the input string.
I am trying to use a regex to match them, and failing.
Can anyone suggest the regex for this problem?
Thanks!

Here is the regex:
"&[A-Za-z]+(\\.[A-Za-z]+)*;"
It starts by matching the character &, followed by one or more letters (both uppercase and lower case) ([A-Za-z]+). Then it matches a dot followed by one or more letters (\\.[A-Za-z]+). There can be any number of this, including zero. Finally, it matches the ; character.
You can use this regex in java like this:
Pattern p = Pattern.compile("&[A-Za-z]+(\\.[A-Za-z]+)*;"); // java.util.regex.Pattern
String subject = "foo &Bar; baz\n";
String result = p.matcher(subject).replaceAll("");
Or just
"foo &Bar; baz\n".replaceAll("&[A-Za-z]+(\\.[A-Za-z]+)*;", "");
If you want to remove whitespaces after the matched tokens, you can use this re:
"&[A-Za-z]+(\\.[A-Za-z]+)*;\\s*" // the "\\s*" matches any number of whitespace

And there is a nice online regular expression tester which uses the java regexp library.
http://www.regexplanet.com/simple/index.html
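For a quick local check without an online tester, the &...; pattern from the first answer can be run against entities shaped like the ones in the question. This is a rough sketch; the class and method names are invented for the example.

```java
import java.util.regex.Pattern;

final class EntityStripper {
    // '&', letters, optional ".letters" segments, then ';'.
    private static final Pattern ENTITY =
        Pattern.compile("&[A-Za-z]+(\\.[A-Za-z]+)*;");
    // Same pattern, also swallowing whitespace that follows the entity.
    private static final Pattern ENTITY_AND_SPACE =
        Pattern.compile("&[A-Za-z]+(\\.[A-Za-z]+)*;\\s*");

    static String strip(String input) {
        return ENTITY.matcher(input).replaceAll("");
    }

    static String stripWithTrailingSpace(String input) {
        return ENTITY_AND_SPACE.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "Hello &Text.ABC; world &Links.InsertSomething;!";
        System.out.println("[" + strip(input) + "]");
        System.out.println("[" + stripWithTrailingSpace(input) + "]");
    }
}
```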

You can try:
input=input.replaceAll("&[^.]+\\.[^;]+;(,\\s*&[^.]+\\.[^;]+;)*","");

regular expressions using java.util.regex API- java

How can I create a regular expression to search strings with a given pattern? For example I want to search all strings that match pattern '*index.tx?'. Now this should find strings with values index.txt,mainindex.txt and somethingindex.txp.
Pattern pattern = Pattern.compile("*.html");
Matcher m = pattern.matcher("input.html");
This code is obviously not working.

You need to learn regular expression syntax. It is not the same as using wildcards. Try this:
Pattern pattern = Pattern.compile("^.*index\\.tx.$");
There is a lot of information about regular expressions here. You may find the program RegexBuddy useful while you are learning regular expressions.

The code you posted does not work because:
The dot . is a special regex character. It means one instance of any character.
* means any number of occurrences of the preceding character.
Therefore, .* means any number of occurrences of any character.
So you would need something like
Pattern pattern = Pattern.compile(".*\\.html.*");
The \\ is needed because we want to match a literal dot, even though the dot is a special regex character.
This means: match a string that starts with any number of arbitrary characters, followed by a dot, followed by html, followed by anything.

* matches zero or more occurrences of the preceding token, so if you want to match zero or more of any character, use .* instead (. matches any char).
Modified regex should look something like this:
Pattern pattern = Pattern.compile("^.*\\.html$");
^ matches the start of the string
.* matches zero or more of any char
\\. matches the dot char (if not escaped it would match any char)
$ matches the end of the string
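Putting the pieces together for the '*index.tx?' pattern from the question, a minimal sketch of a wildcard-to-regex translation might look like this. The class name GlobDemo and the helper globToPattern are hypothetical and not part of java.util.regex. The helper only understands * and ? and quotes every other character literally.

```java
import java.util.regex.Pattern;

final class GlobDemo {
    // Hypothetical helper: '*' becomes ".*", '?' becomes ".", and every
    // other character is quoted so that it matches itself.
    static Pattern globToPattern(String glob) {
        StringBuilder regex = new StringBuilder();
        for (char c : glob.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // matches() requires the whole name to fit the pattern.
    static boolean globMatches(String glob, String name) {
        return globToPattern(glob).matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] names = { "index.txt", "mainindex.txt", "somethingindex.txp", "index.tx" };
        for (String name : names) {
            System.out.println(name + " -> " + globMatches("*index.tx?", name));
        }
    }
}
```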
