Java Regular expression - java

I am not very familiar with regular expressions.
I want help with the following regular expressions:
1. A string that starts with an alphabetic word, followed by any letters or numbers, e.g. Abc 20 Jan to 15 Dec
2. A string for a decimal number, e.g. 450,122,224.00
3. Also, a check for whether a string contains a pattern like 'Page 2 of 20'
Thanks.

// 1. String starts with an alpha word and is then followed by
// any alpha or number, e.g. Abc 20 Jan to 15 Dec
// One or more alpha characters, followed by a space,
// followed by an alphanumeric character, followed by anything.
Pattern p = Pattern.compile("\\p{Alpha}+ \\p{Alnum}.*");
for (String s : new String[] {"Abc 20 Jan to 15 Dec", "hello world", "123 abc"})
    System.out.println(s + " matches: " + p.matcher(s).matches());
// 2. String for a decimal number, e.g. 450,122,224.00
p = Pattern.compile(
    "\\p{Digit}+(\\.\\p{Digit}+)?|" + // w/o thousand seps.
    "\\p{Digit}{1,3}(,\\p{Digit}{3})*\\.\\p{Digit}+"); // w/ thousand seps.
for (String s : new String[] { "450", "122", "224.00", "450,122,224.00", "0.0.3" })
    System.out.println(s + " matches: " + p.matcher(s).matches());
// 3. Also check whether a string matches a pattern like 'Page 2 of 20'
// "Page" followed by one or more digits, followed by "of",
// followed by one or more digits.
p = Pattern.compile("Page \\p{Digit}+ of \\p{Digit}+");
for (String s : new String[] {"Page 2 of 20", "Page 2 of X"})
    System.out.println(s + " matches: " + p.matcher(s).matches());
Output:
Abc 20 Jan to 15 Dec matches: true
hello world matches: true
123 abc matches: false
450 matches: true
122 matches: true
224.00 matches: true
450,122,224.00 matches: true
0.0.3 matches: false
Page 2 of 20 matches: true
Page 2 of X matches: false
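The third requirement asks whether a string contains the pattern, whereas matches() only succeeds when the entire string fits it. A minimal sketch of the difference, using the same pattern as above; the class name ContainsSketch and the sample line are made up for illustration:

```java
import java.util.regex.Pattern;

class ContainsSketch {
    // Same pattern as in the answer above.
    private static final Pattern PAGE = Pattern.compile("Page \\p{Digit}+ of \\p{Digit}+");

    // True only if the whole string is "Page N of M".
    static boolean isPageMarker(String s) {
        return PAGE.matcher(s).matches();
    }

    // True if "Page N of M" occurs anywhere in the string.
    static boolean containsPageMarker(String s) {
        return PAGE.matcher(s).find();
    }

    public static void main(String[] args) {
        String line = "Annual report - Page 2 of 20 - draft"; // hypothetical input
        System.out.println(isPageMarker(line));       // false
        System.out.println(containsPageMarker(line)); // true
    }
}
```

If the check really is "contains", find() is probably the call to use; matches() fits when the whole line must be the page marker.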

1.) /[A-Z][a-z]*(\s([\d]+)|\s([A-Za-z]+))+/
[A-Z][a-z]* being a capitalized word
\s([\d]+) being a number prefixed by a (white)space
\s([A-Za-z]+) being a word prefixed by a (white)space
2.) /(\d{1,3})(,(\d{3}))*(\.(\d{2}))/
(\d{1,3}) being a 1-to-3 digit number
(,(\d{3}))* being 0-or-more three-digit numbers prefixed by a comma
(\.(\d{2})) being a 2-digit decimal part (the dot must be escaped, otherwise it matches any character)
3.) /Page (\d+) of (\d+)/
(\d+) being one-or-more digits
When writing this (or any regex) I like to use this tool

1
I am not sure what you mean here. A word at the start, followed by any number of words and numbers?
Try this one:
^[a-zA-Z]+(\s+([a-zA-Z]+|\d+))+
2
Just a decimal number would be
\d+(\.\d+)?
Getting the commas in there:
\d{1,3}(,\d{3})*(\.\d+)?
3
Use
Page \d+ of \d+
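As a rough sketch of how the two decimal patterns above could be combined and used from Java: the class name DecimalSketch and the parse helper are hypothetical, and converting to BigDecimal is just one possible follow-up step.

```java
import java.math.BigDecimal;
import java.util.regex.Pattern;

class DecimalSketch {
    // Plain digits, or groups of three separated by commas; optional fraction.
    private static final Pattern DECIMAL =
            Pattern.compile("\\d+(\\.\\d+)?|\\d{1,3}(,\\d{3})*(\\.\\d+)?");

    static boolean isDecimal(String s) {
        return DECIMAL.matcher(s).matches();
    }

    // Hypothetical helper: strips the separators once the format is validated.
    static BigDecimal parse(String s) {
        if (!isDecimal(s)) {
            throw new NumberFormatException("Not a decimal number: " + s);
        }
        return new BigDecimal(s.replace(",", ""));
    }

    public static void main(String[] args) {
        System.out.println(parse("450,122,224.00")); // 450122224.00
        System.out.println(isDecimal("45,0"));       // false
    }
}
```

Note the doubled backslashes: inside a Java string literal, \d has to be written as \\d.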

Related

Regex not matching all numbers with delimiters

Need a single combined regex for the following pattern:
Prefix: 2221-2720 , Length: 16
Prefix: 51-55 , Length: 16
where the delimiter between digits can be a space ( ), minus sign (-), period (.), backslash (\), or equals sign (=). The condition is that at most one delimiter (of any type) may occur between any two digits.
Valid number - 230.293.217.952.148.4
Valid number - 230.293 217-952.148.4
Invalid number - 230..293.217.952.148.4
Invalid number - 230.293.-217. 952.148.4
A valid input is one where you have 16 digits separated by any/no delimiters as long as there are no two delimiters adjacent to each other.
Have come up with the following regex:
(2[\s=\\.-]*2[\s=\\.-]*2[\s=\\.-]*[1-9][\s=\\.-]*|2[\s=\\.-]*2[\s=\\.-]*[3-9][\s=\\.-]*[0-9][\s=\\.-]*|2[\s=\\.-]*[3-6][\s=\\.-]*[0-9](?:[\s=\\.-]*[0-9]){1}|2[\s=\\.-]*7[\s=\\.-]*[01][\s=\\.-]*[0-9][\s=\\.-]*|2[\s=\\.-]*7[\s=\\.-]*2[\s=\\.-]*0[\s=\\.-]*)[0-9](?:[\s=\\.-]*[0-9]){11}|(5[\s=\\.-]*[1-5][\s=\\.-]*)[0-9](?:[\s=\\.-]*[0-9]){13}
It does not match certain patterns. For example:
2 3 0 2 9 3 2 1 7 9 5 2 1 4 8 4
23-02-93-21-79-52-14-84
2 3 0 3 4 5 8 0 9 4 9 3 0 8 2 3
For the same numbers, it matches (as expected) the following patterns:
2302932179521484
230.293.217.952.148.4
2303458094930823
230.345.809.493.082.3
230-345-809-493-082-3
There seems to be an issue with delimiters. Kindly let me know what is wrong with my regex.
For this rule
A valid input is one where you have 16 digits separated by any/no
delimiters as long as there are no two delimiters adjacent to each
other
Prefix: 2221-2720 , Length: 16
Prefix: 51-55 , Length: 16
2221 can also be written as 2.2.-2.1
For these rules, it might be easier to write a pattern with 2 capture groups to match the whole string.
Then using some Java code, you can check the value of the capture groups for the ranges.
^((\d[ =\\.-]?\d)[ =\\.-]?\d[ =\\.-]?\d)(?:[ =\\.-]?\d){12}$
The pattern matches:
^ Start of string
( Capture group 1
(\d[ =\\.-]?\d) Capture group 2: match 2 digits with an optional single char from space = \ . - between them
[ =\\.-]?\d[ =\\.-]?\d Match 2 times: optionally 1 of the listed chars, then a single digit
) Close group 1
(?:[ =\\.-]?\d){12} Repeat 12 times: optionally 1 of the listed chars, then a single digit
$ End of string
Regex demo | Java demo
For example
String[] strings = {
    "2221.7.952.148.412.32",
    "230.293.217.952.148.4",
    "5511111111111111",
    "130.293 217-952.148.4",
    "30..293.217.952.148.4",
    "5..5",
    ".5.5."
};
String regex = "^((\\d[ =\\\\.-]?\\d)[ =\\\\.-]?\\d[ =\\\\.-]?\\d)(?:[ =\\\\.-]?\\d){12}$";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
for (String s : strings) {
    Matcher matcher = pattern.matcher(s);
    if (matcher.find()) {
        int grp1 = Integer.parseInt(matcher.group(1).replaceAll("\\D+", ""));
        int grp2 = Integer.parseInt(matcher.group(2).replaceAll("\\D+", ""));
        if ((grp1 >= 2221 && grp1 <= 2720) || (grp2 >= 51 && grp2 <= 55)) {
            System.out.println("Match for " + matcher.group());
        }
    }
}
Output
Match for 2221.7.952.148.412.32
Match for 230.293.217.952.148.4
Match for 5511111111111111
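The same idea can be wrapped in a reusable method and tried against the inputs from the question that the original regex missed. This is only a sketch: the class name PrefixSketch and the method name isValid are hypothetical, and matches() is used in place of the explicit ^ and $ anchors.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrefixSketch {
    // 16 digits, with at most one of: space = \ . - between neighbouring digits.
    private static final Pattern SIXTEEN_DIGITS = Pattern.compile(
            "((\\d[ =\\\\.-]?\\d)[ =\\\\.-]?\\d[ =\\\\.-]?\\d)(?:[ =\\\\.-]?\\d){12}");

    static boolean isValid(String input) {
        Matcher m = SIXTEEN_DIGITS.matcher(input);
        if (!m.matches()) {
            return false; // wrong digit count, adjacent delimiters, or stray characters
        }
        // Strip the delimiters from the captured prefixes before the range check.
        int firstFour = Integer.parseInt(m.group(1).replaceAll("\\D", ""));
        int firstTwo = Integer.parseInt(m.group(2).replaceAll("\\D", ""));
        return (firstFour >= 2221 && firstFour <= 2720)
                || (firstTwo >= 51 && firstTwo <= 55);
    }

    public static void main(String[] args) {
        System.out.println(isValid("2 3 0 2 9 3 2 1 7 9 5 2 1 4 8 4")); // true
        System.out.println(isValid("23-02-93-21-79-52-14-84"));         // true
        System.out.println(isValid("230..293.217.952.148.4"));          // false
    }
}
```

Keeping the range check in Java rather than in the regex makes the prefix rules easy to read and to change.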

How to skip some part of Regex in Java?

I have a PDF file, and the program reads it line by line.
Here is a snippet from the file:
I need to extract:
12000
The parsed line looks like the following:
Bolighus fullverdi 4 374 720 12 000 11 806
I can't find a way to skip the first 7 digits (4 374 720).
I tried to play with some matching like:
(\d+ ){3}
It finds 2 matches:
A regex that gets the value in this case:
\d+ 000
But I want to omit 000 from the regex. In different documents, it will fail.
How to solve this issue?
Maybe you can suggest some other solution to this problem?
UPDATE:
With @PushpeshKumarRajwanshi's answer everything is mostly done:
public static String groupNumbers(String pageLine) {
    String transformedLine = pageLine.replaceAll(" (?=\\d{3})", StringUtils.EMPTY);
    log.info("TRANSFORMED LINE: \n[{}]\nFrom ORIGINAL: \n[{}]", transformedLine, pageLine);
    return transformedLine;
}
public static List<String> getGroupedNumbersFromLine(String pageLine) {
    String groupedLine = groupNumbers(pageLine);
    List<String> numbers = Arrays.stream(groupedLine.split(" "))
            .filter(StringUtils::isNumeric)
            .collect(Collectors.toList());
    log.info("Get list of numbers: \n{}\nFrom line: \n[{}]", numbers, pageLine);
    return numbers;
}
However, I found one critical issue.
Sometimes the PDF file can look like the following:
Where the last 3 digits are a separate number.
And parsed line ends with:
313 400 6 000 370
Which produces an incorrect result:
313400, 6000370
instead of
313400, 6000, 370
UPDATE 2
Consider the next case:
Our line will look like:
Innbo Ekstra Nordea 1 500 000 1 302
it will produce 3 groups as a result:
1500000
1
302
In fact, only the second group is missing from the input.
Is it possible to make the regex more flexible when the second group is missing?
How to fix this behaviour?
Your numbers have a special pattern that can be exploited here. Notice that any space in this string that is followed by three digits can be removed to join the pieces into the actual number, which turns this string,
Bolighus fullverdi 4 374 720 12 000 11 806
to this,
Bolighus fullverdi 4374720 12000 11806
And thus you can capture the second number easily by using this regex,
.*\d+\s+(\d+)\s+\d+
and reading capture group 1.
Here is sample Java code for the same,
public static void main(String[] args) {
    String s = "Bolighus fullverdi 4 374 720 12 000 11 806";
    s = s.replaceAll(" (?=\\d{3})", "");
    System.out.println("Transformed string: " + s);
    Pattern p = Pattern.compile(".*\\d+\\s+(\\d+)\\s+\\d+");
    Matcher m = p.matcher(s);
    if (m.find()) {
        System.out.println(m.group(1));
    } else {
        System.out.println("Didn't match");
    }
}
Which outputs,
Transformed string: Bolighus fullverdi 4374720 12000 11806
12000
Hope this helps!
Edit:
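Here is the explanation for the regex \D*\d+\s+(\d+)\s+\d+, which captures the required data from the transformed string.
Bolighus fullverdi 4374720 12000 11806
\D* --> Matches any non-digit data before the numbers and here it matches Bolighus fullverdi (the code above used .* instead, which works there only because the first number is not captured)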
\d+ --> Matches one or more digits and here it matches 4374720
\s+ --> Matches one or more space which is present between the numbers.
(\d+) --> Matches one or more digits and captures it in group 1 where it matches 12000
\s+ --> Matches one or more space which is present between the numbers.
\d+ --> Matches one or more digits and here it matches 11806
As OP wanted to capture the second number, hence I only grouped (put parenthesis around intended capture part) second \d+ but if you want to capture first number or third number, you can simply group them as well like this,
\D*(\d+)\s+(\d+)\s+(\d+)
Then in java code, calling,
m.group(1) would give group 1 number which is 4374720
m.group(2) would give group 2 number which is 12000
m.group(3) would give group 3 number which is 11806
Hope this clarifies and let me know if you need anything further.
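One caveat with " (?=\\d{3})": it also removes a space that sits between a word and a three-digit number, and it does not check that exactly three digits follow. A possible stricter variant is sketched below, assuming a lookbehind for a digit and a negative lookahead after the three digits; the class and method names are hypothetical.

```java
class GroupDigitsSketch {
    // Removes a space only when a digit precedes it and exactly three digits follow.
    static String groupDigits(String line) {
        return line.replaceAll("(?<=\\d) (?=\\d{3}(?!\\d))", "");
    }

    public static void main(String[] args) {
        System.out.println(groupDigits("Bolighus fullverdi 4 374 720 12 000 11 806"));
        // Bolighus fullverdi 4374720 12000 11806
        System.out.println(groupDigits("Andre bygninger 313 400 6 000 370"));
        // Andre bygninger 313400 6000370
    }
}
```

The second line shows that this variant still cannot tell whether a trailing three-digit block is a number of its own; that ambiguity is in the data, not in the regex.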
Edit2
For covering the case for following string,
Andre bygninger 313 400 6 000 370
so that it captures 313400, 6000 and 370, I have to change the approach of the solution. And in this approach, I will not be transforming the string, but rather will capture the digits with spaces and once all three numbers are captured, will remove space between them. This solution will work for old string as well as new string above where we want to capture last three digits 370 as third number. But let's suppose we have following case,
Andre bygninger 313 400 6 000 370 423
where we have further 423 digits in the string, then it will be captured as following numbers,
313400, 6000370, 423
as it doesn't know whether 370 should go to 6000 or 423. So I have made the solution in a way where last three digits are captured as third number.
Here is a java code that you can use.
public static void main(String[] args) throws Exception {
    Pattern p = Pattern
            .compile(".*?(\\d{1,3}(?:\\s+\\d{3})*)\\s*(\\d{1,3}(?:\\s+\\d{3})*)\\s*(\\d{1,3}(?:\\s+\\d{3})*)");
    List<String> list = Arrays.asList("Bolighus fullverdi 4 374 720 12 000 11 806",
            "Andre bygninger 313 400 6 000 370");
    for (String s : list) {
        Matcher m = p.matcher(s);
        if (m.matches()) {
            System.out.println("For string: " + s);
            System.out.println(m.group(1).replaceAll(" ", ""));
            System.out.println(m.group(2).replaceAll(" ", ""));
            System.out.println(m.group(3).replaceAll(" ", ""));
        } else {
            System.out.println("For string: '" + s + "' Didn't match");
        }
        System.out.println();
    }
}
This code prints following output as you wanted,
For string: Bolighus fullverdi 4 374 720 12 000 11 806
4374720
12000
11806
For string: Andre bygninger 313 400 6 000 370
313400
6000
370
Here is the explanation for regex,
.*?(\\d{1,3}(?:\\s+\\d{3})*)\\s*(\\d{1,3}(?:\\s+\\d{3})*)\\s*(\\d{1,3}(?:\\s+\\d{3})*)
.*? --> Matches and consumes any input before the numbers
(\\d{1,3}(?:\\s+\\d{3})*) --> This pattern tries to capture first number which can start with one to three digits followed by space and exactly three digits and "space plus three digits" altogether can occur zero or more times.
\\s* --> Followed by zero or more space
And after that, same group (\\d{1,3}(?:\\s+\\d{3})*) is repeated two more times so it can capture numbers in three groups.
Since I have made three capturing groups, hence capturing has to take place in three groups for it to be a successful match. So for e.g. here is the mechanism of capturing this input,
Andre bygninger 313 400 6 000 370
First, .*? matches "Andre bygninger ". Then the first group (\\d{1,3}(?:\\s+\\d{3})*) matches 313 (because of \\d{1,3}), and (?:\\s+\\d{3})* matches a space and 400. It stops there because what follows is a space and then 6, which is just one digit and not three.
Similarly, the second group (\\d{1,3}(?:\\s+\\d{3})*) first matches 6 (because of \\d{1,3}), and then (?:\\s+\\d{3})* matches a space and 000. It stops there because it needs to leave some data for group 3, otherwise the regex match would fail.
Finally, the third group matches 370, as that is the only data left. So \\d{1,3} matches 370 and (?:\\s+\\d{3})* matches nothing, since it allows zero repetitions.
Hope that clarifies. Let me know if you still have any query.
Edit 22 Dec 2018 for grouping numbers in two groups only
As you want to group the data from this string,
Innbo Ekstra Nordea 1 500 000 1 302
into two groups of numbers, 1500000 and 1302, your regex needs only two groups. As I replied in the comment, it becomes this,
.*?(\\d{1,3}(?:\\s+\\d{3})*)\\s*(\\d{1,3}(?:\\s+\\d{3})*)
Here is the java code for same,
public static void main(String[] args) throws Exception {
    Pattern p = Pattern
            .compile(".*?(\\d{1,3}(?:\\s+\\d{3})*)\\s*(\\d{1,3}(?:\\s+\\d{3})*)");
    List<String> list = Arrays.asList("Innbo Ekstra Nordea 1 500 000 1 302");
    for (String s : list) {
        Matcher m = p.matcher(s);
        if (m.matches()) {
            System.out.println("For string: " + s);
            System.out.println(m.group(1).replaceAll(" ", ""));
            System.out.println(m.group(2).replaceAll(" ", ""));
        } else {
            System.out.println("For string: '" + s + "' Didn't match");
        }
        System.out.println();
    }
}
Which prints this like you expect.
For string: Innbo Ekstra Nordea 1 500 000 1 302
1500000
1302
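The two-group and three-group patterns differ only in how many times the number sub-pattern is repeated, so the regex could be built from the expected number of columns. A minimal sketch, assuming the caller already knows how many numbers a line should contain; the class name, extractNumbers, and its count parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberGroupsSketch {
    // One number: 1-3 digits followed by any number of " ddd" blocks.
    private static final String NUMBER = "(\\d{1,3}(?:\\s+\\d{3})*)";

    // Hypothetical helper: 'count' is the number of numeric columns the caller expects.
    static List<String> extractNumbers(String line, int count) {
        String regex = ".*?" + String.join("\\s*", Collections.nCopies(count, NUMBER));
        Matcher m = Pattern.compile(regex).matcher(line);
        List<String> numbers = new ArrayList<>();
        if (m.matches()) {
            for (int i = 1; i <= count; i++) {
                numbers.add(m.group(i).replaceAll("\\s+", ""));
            }
        }
        return numbers; // empty when the line does not match
    }

    public static void main(String[] args) {
        System.out.println(extractNumbers("Andre bygninger 313 400 6 000 370", 3));
        // [313400, 6000, 370]
        System.out.println(extractNumbers("Innbo Ekstra Nordea 1 500 000 1 302", 2));
        // [1500000, 1302]
    }
}
```

This does not detect a missing column on its own; the expected count still has to come from somewhere else, for example the table header.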
Instead of trying to match the part you are interested in, it might be easier to modify the string to leave nothing but what you need.
From your question it sounds like you always have a 7-digit number in the 2nd column of the table, so you could include that in the regex:
.*\d\s\d{3}\s\d{3}\s(\d+\s+\d+)\s+.*
.* matches all the words from the first column.
\d\s\d{3}\s\d{3} matches the 7 digits and 2 spaces in the 2nd column.
\s matches the space between the columns.
(\d+\s+\d+) captures the 2 sets of digits separated by a space (12 000 in your example).
Example program:
public static void main(String[] args) {
    String string = "Bolighus fullverdi 4 374 720 12 000 11 806";
    // Because it's a Java string, backslashes need to be escaped - hence the double \\
    String result = string.replaceAll(".*\\d\\s\\d{3}\\s\\d{3}\\s(\\d+\\s+\\d+)\\s+.*", "$1");
    System.out.println(result);
}

Java regex: how to select words starting with a specific letter and is x number of characters long?

This is the code I wrote that selects all names starting from A:
String longString = "Amal Kamal Jamal Amitha Farook Amani Tom Adele George Ariana";
String pattern = "(?i)(\\s|^)[a][A-Za-z]+(\\s|$)";
Pattern checkRegex = Pattern.compile(pattern);
Matcher regexMatcher = checkRegex.matcher(longString);
while (regexMatcher.find()) {
    System.out.println(regexMatcher.start() + " : " + regexMatcher.group());
}
Output is as expected
0 : Amal
16 : Amitha
30 : Amani
40 : Adele
53 : Ariana
Now I want to select names that are at least 5 characters long. So the expected output is: Amitha, Adele, Ariana.
When I type this only Ariana is returned. And I can't understand why.
String pattern = "(?i)(\\s|^)[a][A-Za-z]+(\\s|$){5,}";
Output
53 : Ariana
If I put a bracket around the whole expression (to say that this expression should be 5 characters long) Then output is nothing
String pattern = "(?i)((\\s|^)[a][A-Za-z]+(\\s|$)){5,}";
What is the correct way of writing this?
You quantified (\\s|$) while you need to quantify [a-zA-Z]. So, you only match texts that have 5 or more whitespaces or 5 or more ends of string (makes no sense of course) after the words. Also, you need to use {4,} as [a] already matches 1 letter.
Use this regex to fix the issue (although it is not the best one, see below why):
(?i)(\s|^)a[a-z]{4,}(\s|$)
Details
(?i) - case insensitive modifier
(\s|^) - either a whitespace or a start of a string
a - an a or A letter
[a-z]{4,} - any 4 or more ASCII letters
(\s|$) - either a whitespace or an end of a string (note: the whitespace will be consumed, and consecutive matching words will not be handled properly).
You may use "(?i)(?<!\\S)a[a-z]{4,}(?!\\S)" pattern to make sure you are matching a word in between whitespaces or start/end of string positions.
Or, use word boundaries - "(?i)\\ba[a-z]{4,}\\b".
See the Java online demo:
String longString = "Amal Kamal Jamal Amitha Farook Amani Tom Adele George Ariana";
String pattern = "(?i)(?<!\\S)a[a-z]{4,}(?!\\S)";
Pattern checkRegex = Pattern.compile(pattern);
Matcher regexMatcher = checkRegex.matcher(longString);
while (regexMatcher.find()) {
    System.out.println(regexMatcher.start() + " : " + regexMatcher.group());
}
Result:
17 : Amitha
31 : Amani
41 : Adele
54 : Ariana
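The note about consumed whitespace only shows up when two matching words are adjacent. A small sketch that makes it visible, assuming a made-up input and a hypothetical findAll helper:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundarySketch {
    // Collects every match of 'regex' in 'input', trimmed of surrounding whitespace.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group().trim());
        }
        return found;
    }

    public static void main(String[] args) {
        String names = "Amitha Amani Tom"; // two matching names in a row
        // The trailing (\s|$) consumes the space that the next word needs in front of it.
        System.out.println(findAll("(?i)(\\s|^)a[a-z]{4,}(\\s|$)", names));  // [Amitha]
        // Lookarounds check the surroundings without consuming them.
        System.out.println(findAll("(?i)(?<!\\S)a[a-z]{4,}(?!\\S)", names)); // [Amitha, Amani]
    }
}
```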

Regex that will match 6 characters that only allows digits, leading, and trailing spaces

The regex that I'm trying to implement should match the following data:
123456
12345 
 23456
     5
1     
      
  2   
 2345 
It should not match the following:
12 456
1234 6
 1 6
1 6
It should be 6 characters in total including the digits, leading, and trailing spaces. It could also be 6 characters of just spaces. If digits are used, there should be no space between them.
I have tried the following expressions to no avail:
^\s*[0-9]{6}$
\s*[0-9]\s*
You can just use a " *\d* *" pattern (spaces, then digits, then spaces) with a restrictive (?=.{6}$) lookahead:
^(?=.{6}$) *\d* *$
See the regex demo
Explanation:
^ - start of string
(?=.{6}$) - the string must consist of exactly 6 characters other than a newline
" *" - 0+ regular spaces (NOTE: to match any horizontal whitespace, use [^\S\r\n])
\d* - 0+ digits
" *" - 0+ regular spaces
$ - end of string.
Java demo (the last 4 are the test cases that should fail):
List<String> strs = Arrays.asList("123456", "12345 ", " 23456", "     5", // good
        "1     ", "      ", "  2   ", " 2345 ", // good
        "12 456", "1234 6", " 1 6", "1 6"); // bad
for (String str : strs)
    System.out.println(str.matches("(?=.{6}$) *\\d* *"));
Note that when used in String#matches(), you do not need the initial ^ and final $ anchors, as the method requires a full string match and so anchors the pattern by default.
You can also do:
^(?!.*?\d +\d)[ \d]{6}$
The zero width negative lookahead (?!.*?\d +\d) ensures that the lines having space(s) in between digits are not selected
[ \d]{6} matches the desired lines that have six characters having just space and/or digits.
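The two suggested patterns can be compared side by side. A minimal sketch with the six-character inputs written out with explicit padding; the class and method names are hypothetical:

```java
import java.util.Arrays;
import java.util.List;

class SixCharsSketch {
    // Pattern from the first answer: length check via lookahead, then spaces-digits-spaces.
    static boolean viaLookahead(String s) {
        return s.matches("(?=.{6}$) *\\d* *");
    }

    // Pattern from the second answer: six spaces/digits, no space between two digits.
    static boolean viaNegativeLookahead(String s) {
        return s.matches("(?!.*?\\d +\\d)[ \\d]{6}");
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList(
                "123456", "12345 ", "     5", "      ", "  2   ", // expected: true
                "12 456", "1234 6", "1 6");                       // expected: false
        for (String s : inputs) {
            System.out.println("[" + s + "] " + viaLookahead(s) + " " + viaNegativeLookahead(s));
        }
    }
}
```

Printing the input in brackets keeps leading and trailing spaces visible in the output.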

Understanding regular expression output [duplicate]

I need help understanding the output of the code below. I am unable to figure out the output of System.out.println(m.start() + m.group());. Can someone please explain it to me?
import java.util.regex.*;
class Regex2 {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("\\d*");
        Matcher m = p.matcher("ab34ef");
        boolean b = false;
        while(b = m.find()) {
            System.out.println(m.start() + m.group());
        }
    }
}
Output is:
0
1
234
4
5
6
Note that if I put System.out.println(m.start() );, output is:
0
1
2
4
5
6
Because you have included a * character, your pattern will match empty strings as well. When I change your code as I suggested in the comments, I get the following output:
0 ()
1 ()
2 (34)
4 ()
5 ()
6 ()
So you get an empty match at almost every position in the string, with the exception of 34, which matches the run of digits. Use \\d+ if you want to match digits without also matching empty strings.
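The modified code is not shown in this answer; a sketch that produces output in that shape might look like the following. The class name and the describeMatches helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchSketch {
    // Formats each match as "start (text)" so that empty matches stay visible.
    static List<String> describeMatches(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start() + " (" + m.group() + ")");
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(describeMatches("\\d*", "ab34ef"));
        // [0 (), 1 (), 2 (34), 4 (), 5 (), 6 ()]
        System.out.println(describeMatches("\\d+", "ab34ef"));
        // [2 (34)]
    }
}
```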
You used this regex - \d* - which basically means zero or more digits. Mind the zero!
So this pattern will match any group of digits, e.g. 34 plus any other position in the string, where the matched sequence will be the empty string.
So, you will have 6 matches, starting at indices 0,1,2,4,5,6. For match starting at index 2, the matched sequence is 34, while for the remaining ones, the match will be the empty string.
If you want to find only digits, you might want to use this pattern: \d+
\d* - matches zero or more digits in the expression.
The expression is ab34ef and its corresponding indices are 012345.
At index zero there is no digit, so start() prints 0 and group() prints nothing; at index 1 it prints 1 and nothing; at index 2 we find a match, so it prints 2 and 34. Next it prints 4 and nothing, and so on.
Another example:
Pattern pattern = Pattern.compile("\\d\\d");
Matcher matcher = pattern.matcher("123ddc2ab23");
while(matcher.find()) {
    System.out.println("start:" + matcher.start() + " end:" + matcher.end() + " group:" + matcher.group() + ";");
}
which prints:
start:0 end:2 group:12;
start:9 end:11 group:23;
You will find more information in the tutorial
