Regex to match only letters and numbers

Can you help with this code?
It seems easy, but always fails.
@Test
public void normalizeString() {
    StringBuilder ret = new StringBuilder();
    //Matcher matches = Pattern.compile( "([A-Z0-9])" ).matcher("P-12345678-P");
    Matcher matches = Pattern.compile( "([\\w])" ).matcher("P-12345678-P");
    for (int i = 1; i < matches.groupCount(); i++)
        ret.append(matches.group(i));
    assertEquals("P12345678P", ret.toString());
}

Constructing a Matcher does not automatically perform any matching. That's in part because Matcher supports two distinct matching behaviors, differing in whether the match is implicitly anchored to the beginning of the Matcher's region. It appears that you could achieve your desired result like so:
@Test
public void normalizeString() {
    StringBuilder ret = new StringBuilder();
    Matcher matches = Pattern.compile( "[A-Z0-9]+" ).matcher("P-12345678-P");
    while (matches.find()) {
        ret.append(matches.group());
    }
    assertEquals("P12345678P", ret.toString());
}
Note in particular the invocation of Matcher.find(), which was a key omission from your version. Also, the nullary Matcher.group() returns the substring matched by the last find().
Furthermore, although your use of Matcher.groupCount() isn't exactly wrong, it does lead me to suspect that you have the wrong idea about what it does. In particular, in your code it will always return 1 -- it inquires about the pattern, not about matches to it.
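To make the last two points concrete, here is a minimal sketch. The class and method names (GroupCountDemo, declaredGroups, groupFailsBeforeFind) are made up for illustration; only Pattern and Matcher come from java.util.regex. It shows that groupCount() reports the number of capturing groups in the pattern even though nothing has been matched, and that asking for a group before any find() or matches() call fails.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCountDemo {
    // Number of capturing groups declared in the regex; no matching is performed.
    static int declaredGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Returns true if group() throws because no match has been attempted yet.
    static boolean groupFailsBeforeFind(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        try {
            m.group();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(declaredGroups("([\\w])"));    // one pair of capturing parentheses
        System.out.println(declaredGroups("[A-Z0-9]+"));  // no capturing groups at all
        System.out.println(groupFailsBeforeFind("([\\w])", "P-12345678-P"));
    }
}
```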

First of all, you don't need to add a group, because the entire match can always be accessed as group 0. So instead of
(regex) and group(1)
you can use
regex and group(0)
Next, \\w is already a character class, so you don't need to surround it with another [ ]; that is like writing [[a-z]], which is the same as [a-z].
Now in your
for (int i = 1; i < matches.groupCount(); i++)
ret.append(matches.group(i));
you iterate over the groups starting from 1 but exclude the last one: groups are indexed from 1 to n, so i < n never reaches n. You would need to use i <= matches.groupCount() instead.
Also, it looks like you are confusing two things. This loop will not find all matches of the regex in the input. Such a loop is used to iterate over the groups of the regex after a match has been found.
So if the regex were something like (\w(\w))c and the match were abc, then
for (int i = 1; i <= matches.groupCount(); i++)
System.out.println(matches.group(i));
would print
ab
b
because
the first group, (\w(\w)), contains the two characters before c, and
the second group is nested inside the first one and contains the character right after the first character.
But to print them you would first need to let the regex engine iterate over your input and find() a match, or check whether the entire input matches() the regex. Otherwise you get an IllegalStateException, because the regex engine can't know which match you want the groups from (there can be many matches of the regex in the input).
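The nested-group example can be sketched as a small runnable program. NestedGroupsDemo and groupsOfFirstMatch are hypothetical names chosen for illustration; the loop runs only after find() has succeeded and uses <= so that the last group is included.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedGroupsDemo {
    // Collects groups 1..n of the first match, or an empty list if nothing matches.
    static List<String> groupsOfFirstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> groups = new ArrayList<>();
        if (m.find()) {                                   // a match must exist before group(i)
            for (int i = 1; i <= m.groupCount(); i++) {   // <= so the last group is included
                groups.add(m.group(i));
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // Outer group captures "ab", inner group captures "b".
        System.out.println(groupsOfFirstMatch("(\\w(\\w))c", "abc")); // [ab, b]
    }
}
```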
So what you may want to use is something like
StringBuilder ret = new StringBuilder();
Matcher matches = Pattern.compile( "[A-Z0-9]" ).matcher("P-12345678-P");
while (matches.find()){//find next match
ret.append(matches.group(0));
}
assertEquals("P12345678P", ret.toString());
Another (and probably simpler) solution would be to remove all the characters you don't want from your input. You could just use replaceAll with a negated character class [^...], like
String input = "P-12345678-P";
String result = input.replaceAll("[^A-Z0-9]+", "");
which produces a new string from which all characters that are not in A-Z0-9 have been removed (replaced with "").
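A minimal sketch of the replaceAll approach as a reusable helper follows. The names NormalizeDemo and normalize are assumptions for illustration; the pattern is compiled once so that repeated calls do not recompile it. Note that lowercase letters are removed too, since the class only allows A-Z and 0-9.

```java
import java.util.regex.Pattern;

class NormalizeDemo {
    // Everything that is NOT an uppercase ASCII letter or a digit.
    private static final Pattern UNWANTED = Pattern.compile("[^A-Z0-9]+");

    // Removes every run of unwanted characters from the input.
    static String normalize(String input) {
        return UNWANTED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(normalize("P-12345678-P")); // P12345678P
    }
}
```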

Related

Extract string between a set of multiple limiters with groups

As the title says, I have a string and I want to extract some data from it.
This is my String:
text = "|tab_PRO|1|1|#tRecordType#||0|tab_PRO|";
and I want to extract all the data between the pipes: tab_PRO, 1, 1... and so on.
I've tried:
Pattern p = Pattern.compile("\\|(.*?)\\|");
Matcher m = p.matcher(text);
while(m.find())
{
for(int i = 1; i< 10; i++) {
test = m.group(i);
System.out.println(test);
}
}
and with this I get the first group, which is tab_PRO. But I also get an error
java.lang.IndexOutOfBoundsException: No group 2
Now, I probably didn't understand quite well how groups work, but I thought that with this I could get the remaining data that I need. I can't figure out what I'm missing.
Thanks in advance

Use String.split(). Keep in mind that it expects a regex as its argument, and | is a regex metacharacter (alternation), so you need to escape it as \|. In a Java string literal the backslash itself must also be escaped, otherwise \| would be an invalid escape sequence, so you write two backslashes:
String[] parts = text.split("\\|");
See it working here:
https://ideone.com/WibjUm
If you want to go with your regex approach, you'll need to capture the run of characters after every |, restricting it to anything except |, possibly using a regex like \\|([^\\|]*).
In your loop, you iterate with m.find() and just use capture group 1, because it's the only group every match will have.
String text = "|tab_PRO|1|1|#tRecordType#||0|tab_PRO|";
Pattern p = Pattern.compile("\\|([^\\|]*)");
Matcher m = p.matcher(text);
while(m.find()){
System.out.println(m.group(1));
}
https://ideone.com/RNjZRQ
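One detail of split() is worth seeing in a sketch: because the text starts and ends with a pipe, the result begins with an empty string, and trailing empty strings are dropped unless a negative limit is passed. SplitDemo and its fields method are hypothetical names for illustration.

```java
import java.util.Arrays;

class SplitDemo {
    static final String TEXT = "|tab_PRO|1|1|#tRecordType#||0|tab_PRO|";

    // Splits on a literal pipe; keepTrailing controls whether trailing empty fields survive.
    static String[] fields(String text, boolean keepTrailing) {
        return keepTrailing ? text.split("\\|", -1) : text.split("\\|");
    }

    public static void main(String[] args) {
        // Leading empty string is kept, the trailing one is dropped by default.
        System.out.println(Arrays.toString(fields(TEXT, false)));
        // A negative limit keeps the trailing empty string as well.
        System.out.println(Arrays.toString(fields(TEXT, true)));
    }
}
```

If the empty leading field is unwanted, it can simply be skipped when iterating over the array.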

Try using .split() or .substring()

As mentioned in the comments, this is easier done with String.split.
As for your own code, you are unnecessarily using the inner loop, and that's leading to that exception. You only have one group, but the for loop will cause you to query more than one group. Your loop should be as simple as:
Pattern p = Pattern.compile("(?<=\\|)(.*?)\\|");
Matcher m = p.matcher(text);
while (m.find()) {
String test = m.group(1);
System.out.println(test);
}
And that prints (the empty field between the two adjacent pipes also produces an empty line, omitted here)
tab_PRO
1
1
#tRecordType#
0
tab_PRO
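Note that I had to add a look-behind assertion to your regex: each match consumes its closing |, so without the look-behind the next match could not start at that same pipe and every other field would be skipped.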
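The effect of the look-behind can be checked with a small sketch that runs both patterns over the same text. LookbehindDemo and captures are made-up names for illustration. With the original pattern each match consumes both of its pipes, so the search resumes after the closing pipe and skips the next field; with the look-behind the opening pipe is only inspected, not consumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindDemo {
    static final String TEXT = "|tab_PRO|1|1|#tRecordType#||0|tab_PRO|";

    // Returns group 1 of every match of the regex in the text.
    static List<String> captures(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Original pattern: only every other field is captured.
        System.out.println(captures("\\|(.*?)\\|", TEXT));
        // With look-behind: every field, including the empty one between "||".
        System.out.println(captures("(?<=\\|)(.*?)\\|", TEXT));
    }
}
```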

Check if a String satisfies a regex

I have a List of String and I want to filter out the Strings that don't match a regex pattern.
Input List = Orthopedic,Orthopedic/Ortho,Length(in.)
My code
for(String s : keyList){
Pattern p = Pattern.compile("[a-zA-Z0-9-_]");
Matcher m = p.matcher(s);
if (!m.find()){
System.out.println(s);
}
}
I expect the 2nd and 3rd strings to be printed, as they do not match the regex, but nothing is printed.

Explanation
You are not matching the entire input. Instead, you are trying to find the next matching part in the input. From Matcher#find's documentation:
Attempts to find the next subsequence of the input sequence that matches the pattern.
So your code will match an input if at least one character is one of a-zA-Z0-9-_.
Solution
If you want to match the whole region you should use Matcher#matches (documentation):
Attempts to match the entire region against the pattern.
And you probably want to adjust your pattern to allow multiple characters, for example by a pattern like
[a-zA-Z0-9-_]+
The + allows one or more repetitions of the pattern (? allows 0 or 1, and * allows 0 or more).
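The difference between the two calls can be seen in a minimal sketch. FindVsMatchesDemo and its two method names are assumptions made for illustration: find() succeeds as soon as one allowed character occurs anywhere, while matches() with the + quantifier requires the whole string to consist of allowed characters.

```java
import java.util.regex.Pattern;

class FindVsMatchesDemo {
    private static final Pattern ONE_CHAR = Pattern.compile("[a-zA-Z0-9_]");
    private static final Pattern WHOLE = Pattern.compile("[a-zA-Z0-9_]+");

    // True if at least one allowed character occurs somewhere in s.
    static boolean containsAllowedChar(String s) {
        return ONE_CHAR.matcher(s).find();
    }

    // True only if every character of s is allowed (and s is not empty).
    static boolean consistsOfAllowedChars(String s) {
        return WHOLE.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"Orthopedic", "Orthopedic/Ortho", "Length(in.)"}) {
            System.out.println(s + " find=" + containsAllowedChar(s)
                    + " matches=" + consistsOfAllowedChars(s));
        }
    }
}
```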
Notes
You have an extra - at the end of your pattern. You probably want to remove that. Or, if you intended to match the character literally, you need to escape it:
[a-zA-Z0-9\\-_]+
You can test your regex on sites like regex101.com, here's your pattern: regex101.com/r/xvT8V0/1.
Note that there is also String#matches (documentation). So you could write more compact code by just using s.matches("[a-zA-Z0-9_]+").
Also note that you can shortcut character sets like [a-zA-Z0-9_] by using predefined sets. The set \w (word character) matches exactly your desired pattern.
Since the pattern doesn't change, you might want to move it outside of the loop to slightly increase performance. The matcher depends on the current string, so it still has to be created (or reset) for each element.
Code
All in all your code might then look like:
// Compile once; the matcher needs the current string, so create it inside the loop.
Pattern p = Pattern.compile("[a-zA-Z0-9_]+");
for (String s : keyList) {
    Matcher m = p.matcher(s);
    if (!m.matches()) {
        System.out.println(s);
    }
}
Or compact (note the +, since \\w alone matches only a single character):
for (String s : keyList) {
    if (!s.matches("\\w+")) {
        System.out.println(s);
    }
}
Using streams:
keyList.stream()
    .filter(s -> !s.matches("\\w+"))
    .forEach(System.out::println);

You shouldn't construct a Pattern in a loop, you currently only match a single character, and you can use !String.matches(String) and a filter() operation. Like,
List<String> keyList = Arrays.asList("Orthopedic", "Orthopedic/Ortho", "Length(in.)");
keyList.stream().filter(x -> !x.matches("[a-zA-Z0-9-_]+"))
.forEachOrdered(System.out::println);
Outputs (as requested)
Orthopedic/Ortho
Length(in.)
Or, using the Pattern, like
List<String> keyList = Arrays.asList("Orthopedic", "Orthopedic/Ortho", "Length(in.)");
Pattern p = Pattern.compile("[a-zA-Z0-9-_]+");
keyList.stream().filter(x -> !p.matcher(x).matches()).forEachOrdered(System.out::println);

There are two problems:
1) the regular expression is wrong, it matches just one character.
2) you need to use m.matches() instead of m.find().

You can use matches instead of find:
//Added the + at the end and removed the extra -
Pattern p = Pattern.compile("[a-zA-Z0-9_]+");
for(String s : keyList){
Matcher m = p.matcher(s);
if (!m.matches()){
System.out.println(s);
}
}
Also note that the point of compiling a pattern is to reuse it, so put it outside the loop. Otherwise you may as well use:
for(String s : keyList){
if (!s.matches("[a-zA-Z0-9_]+")){
System.out.println(s);
}
}

“minus-sign” into this regular expression. How?

Consider:
String str = "XYhaku(ABH1235-123548)";
From the above string, I need only "ABH1235-123548" and so far I created a regular expression:
Pattern.compile("ABH\\d+")
But it returns false. So what is the correct regular expression for it?

I would just grab whatever is in the parenthesis:
Pattern p = Pattern.compile("\\((?<data>[A-Z\\d]+\\-\\d+)\\)");
Or, if you want to be even more open (any parenthesis):
Pattern p = Pattern.compile("\\((?<data>.+)\\)");
Then just nab it:
String s = /* some input */;
Matcher m = p.matcher(s);
if (m.find()) { //just find first
String tag = m.group("data"); //ABH1235-123548
}
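Put together as a runnable sketch, the named-group approach might look like the following. NamedGroupDemo and extractTag are hypothetical names, and returning an Optional is a design choice made here for illustration, not part of the answer; the hyphen is written unescaped because it needs no escaping outside a character class.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDemo {
    // Letters/digits, a hyphen, then digits, all inside literal parentheses.
    private static final Pattern TAG = Pattern.compile("\\((?<data>[A-Z\\d]+-\\d+)\\)");

    // Returns the content of the first matching parenthesised tag, if any.
    static Optional<String> extractTag(String input) {
        Matcher m = TAG.matcher(input);
        return m.find() ? Optional.of(m.group("data")) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extractTag("XYhaku(ABH1235-123548)").orElse("<none>")); // ABH1235-123548
    }
}
```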

\d only matches digits. To include other characters, use a character class:
Pattern.compile("ABH[\\d-]+")
Note that the - must be placed first or last in the character class, because otherwise it will be treated as a range indicator ([A-Z] matching every letter between A and Z, for example). Another way to avoid that would be to escape it, but that adds two more backslashes to your string...
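As a quick check of the character-class version, the sketch below compares the original pattern with the extended one using find(). HyphenClassDemo and firstMatch are made-up names for illustration. The original ABH\d+ does find something, but stops at the hyphen; a false result as described in the question would be consistent with matches() having been called on the whole string, which is only a guess since the question does not show that call.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HyphenClassDemo {
    // Returns the first substring matching the regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String str = "XYhaku(ABH1235-123548)";
        System.out.println(firstMatch("ABH\\d+", str));    // stops before the hyphen
        System.out.println(firstMatch("ABH[\\d-]+", str)); // hyphen placed last, so it is literal
    }
}
```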

how to exclude "<" in regex match

I have a String which looks like "<name><address> and <Phone_1>". I need to get a result like
1) <name>
2) <address>
3) <Phone_1>
I have tried using regex "<(.*)>" but it returns just one result.

The regex you want is
<([^<>]+?)><([^<>]+?)> and <([^<>]+?)>
Which will then spit out the stuff you want in the 3 capture groups. The full code would then look something like this:
Matcher m = Pattern.compile("<([^<>]+?)><([^<>]+?)> and <([^<>]+?)>").matcher(string);
if (m.find()) {
String name = m.group(1);
String address = m.group(2);
String phone = m.group(3);
}

The pattern .* in a regex is greedy. It will match as many characters as possible between the first < it finds and the last possible > it can find. In the case of your string it finds the first <, then looks for as much text as possible until a >, which it will find at the very end of the string.
You want a non-greedy or "lazy" pattern, which will match as few characters as possible: simply <(.+?)>. The question mark after the quantifier is the syntax for non-greedy matching.
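The greedy and lazy behaviours can be compared side by side in a minimal sketch. GreedyLazyDemo and captures are hypothetical names for illustration; both patterns are run with find() in a loop and group 1 of each match is collected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyLazyDemo {
    // Collects group 1 of every match of the regex in the input.
    static List<String> captures(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "<name><address> and <Phone_1>";
        // Greedy: one match spanning from the first < to the last >.
        System.out.println(captures("<(.*)>", s));
        // Lazy: one match per tag.
        System.out.println(captures("<(.+?)>", s));
    }
}
```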

This will work if you have dynamic number of groups.
Pattern p = Pattern.compile("(<\\w+>)");
Matcher m = p.matcher("<name><address> and <Phone_1>");
while (m.find()) {
System.out.println(m.group());
}

regex pattern matcher

Not too familiar with regex, but I have a block of code that does not seem to be working as expected, I think I know why, but would be looking for a solution.
Here is the string "whereClause"
where filter_2_id = 20 and acceptable_flag is true
String whereClause = report.getWhereClause();
String[] tokens = whereClause.split("filter_1_id");
Pattern p = Pattern.compile("(\\d{3})\\d+");
Matcher m = p.matcher(tokens[0]);
List<Integer> filterList = new ArrayList<Integer>();
if (m.find()) {
do {
String local = m.group();
filterList.add(Integer.parseInt(local));
} while (m.find());
}
When I am debugging, it gets to the if (m.find()) { but then just skips over the body. Is it because the regex pattern (\d{3}\d+) only matches numbers with more than 3 digits? I actually need it to scan for any set of numbers, so should I just use 0-9 instead?
Help/advice please

You can try the regular expression "=\\s*(\\d+)" and then modify m.group() to m.group(1). This should look for an equal sign, possibly followed by some whitespace, and then a sequence of one or more digits. Putting the digits part in parentheses creates a group, which will be group 1 (group 0 is the whole match).
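A minimal sketch of that suggestion, applied to the where clause from the question, follows. FilterIdDemo and filterIds are made-up names for illustration. Because the pattern requires an equal sign before the digits, the 2 inside filter_2_id is not picked up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FilterIdDemo {
    // An equal sign, optional whitespace, then one or more digits captured as group 1.
    private static final Pattern ASSIGNED_NUMBER = Pattern.compile("=\\s*(\\d+)");

    // Returns every number that directly follows an equal sign in the clause.
    static List<Integer> filterIds(String whereClause) {
        Matcher m = ASSIGNED_NUMBER.matcher(whereClause);
        List<Integer> ids = new ArrayList<>();
        while (m.find()) {
            ids.add(Integer.parseInt(m.group(1)));
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(filterIds("where filter_2_id = 20 and acceptable_flag is true")); // [20]
    }
}
```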
