I have a PDF file, and the program reads it line by line.
Here is a snippet from the file:
I need to extract:
12000
The parsed line looks like the following:
Bolighus fullverdi 4 374 720 12 000 11 806
I can't find a way to skip the first 7 digits (4 374 720).
I tried some matching like:
(\d+ ){3}
It finds 2 matches:
A regex that gets the value in this case:
\d+ 000
But I want to omit 000 from the regex, because it will fail on other documents.
How can I solve this issue?
Maybe you can suggest some other solution to this problem?
UPDATE:
With @PushpeshKumarRajwanshi's answer everything is mostly done:
public static String groupNumbers(String pageLine) {
String transformedLine = pageLine.replaceAll(" (?=\\d{3})", StringUtils.EMPTY);
log.info("TRANSFORMED LINE: \n[{}]\nFrom ORIGINAL: \n[{}]", transformedLine, pageLine);
return transformedLine;
}
public static List<String> getGroupedNumbersFromLine(String pageLine) {
String groupedLine = groupNumbers(pageLine);
List<String> numbers = Arrays.stream(groupedLine.split(" "))
.filter(StringUtils::isNumeric)
.collect(Collectors.toList());
log.info("Get list of numbers: \n{}\nFrom line: \n[{}]", numbers, pageLine);
return numbers;
}
However, I found one critical issue.
Sometimes the PDF file can look like the following:
where the last 3 digits are a separate number.
And the parsed line ends with:
313 400 6 000 370
Which produces an incorrect result:
313400, 6000370
instead of
313400, 6000, 370
UPDATE 2
Consider the next case:
Our line will look like:
Innbo Ekstra Nordea 1 500 000 1 302
it will produce 3 groups as a result:
1500000
1
302
In fact, only the second group is missing from the input.
Is it possible to make the regex more flexible for the case where the second group is missing?
How can I fix this behaviour?
Your numbers follow a pattern that can be exploited here. Any space in this string that is followed by three digits can be removed to join the pieces into the actual number, which turns this string,
Bolighus fullverdi 4 374 720 12 000 11 806
to this,
Bolighus fullverdi 4374720 12000 11806
You can then capture the second number easily with this regex,
.*\d+\s+(\d+)\s+\d+
and read capture group 1.
Here is sample Java code for the same,
public static void main(String[] args) {
String s = "Bolighus fullverdi 4 374 720 12 000 11 806";
s = s.replaceAll(" (?=\\d{3})", "");
System.out.println("Transformed string: " + s);
Pattern p = Pattern.compile(".*\\d+\\s+(\\d+)\\s+\\d+");
Matcher m = p.matcher(s);
if (m.find()) {
System.out.println(m.group(1));
} else {
System.out.println("Didn't match");
}
}
Which outputs,
Transformed string: Bolighus fullverdi 4374720 12000 11806
12000
Hope this helps!
Edit:
Here is the explanation of the regex \D*\d+\s+(\d+)\s+\d+ for capturing the required data from the transformed string (it is the same pattern as above, with the leading .* tightened to \D*).
Bolighus fullverdi 4374720 12000 11806
\D* --> Matches any non-digit data before the numbers; here it matches "Bolighus fullverdi "
\d+ --> Matches one or more digits and here it matches 4374720
\s+ --> Matches one or more space which is present between the numbers.
(\d+) --> Matches one or more digits and captures it in group 1 where it matches 12000
\s+ --> Matches one or more space which is present between the numbers.
\d+ --> Matches one or more digits and here it matches 11806
As the OP wanted to capture the second number, I only grouped (put parentheses around) the second \d+, but if you want to capture the first or third number, you can simply group them as well, like this,
\D*(\d+)\s+(\d+)\s+(\d+)
Then in java code, calling,
m.group(1) would give group 1 number which is 4374720
m.group(2) would give group 2 number which is 12000
m.group(3) would give group 3 number which is 11806
Hope this clarifies and let me know if you need anything further.
Edit2
For covering the case for following string,
Andre bygninger 313 400 6 000 370
so that it captures 313400, 6000 and 370, I have to change the approach. Instead of transforming the string, I capture the digits together with their spaces and, once all three numbers are captured, remove the spaces inside each one. This works for the old string as well as the new string above, where the last three digits 370 must be captured as the third number. But suppose we have the following case,
Andre bygninger 313 400 6 000 370 423
where a further 423 follows in the string; then it will be captured as the following numbers,
313400, 6000370, 423
as the regex cannot know whether 370 belongs to 6000 or to 423. So I have made the solution capture the last three digits as the third number.
Here is Java code that you can use.
public static void main(String[] args) throws Exception {
Pattern p = Pattern
.compile(".*?(\\d{1,3}(?:\\s+\\d{3})*)\\s*(\\d{1,3}(?:\\s+\\d{3})*)\\s*(\\d{1,3}(?:\\s+\\d{3})*)");
List<String> list = Arrays.asList("Bolighus fullverdi 4 374 720 12 000 11 806",
"Andre bygninger 313 400 6 000 370");
for (String s : list) {
Matcher m = p.matcher(s);
if (m.matches()) {
System.out.println("For string: " + s);
System.out.println(m.group(1).replaceAll(" ", ""));
System.out.println(m.group(2).replaceAll(" ", ""));
System.out.println(m.group(3).replaceAll(" ", ""));
} else {
System.out.println("For string: '" + s + "' Didn't match");
}
System.out.println();
}
}
This code prints the following output, as you wanted,
For string: Bolighus fullverdi 4 374 720 12 000 11 806
4374720
12000
11806
For string: Andre bygninger 313 400 6 000 370
313400
6000
370
Here is the explanation for regex,
.*?(\\d{1,3}(?:\\s+\\d{3})*)\\s*(\\d{1,3}(?:\\s+\\d{3})*)\\s*(\\d{1,3}(?:\\s+\\d{3})*)
.*? --> Matches and consumes any input before the numbers
(\\d{1,3}(?:\\s+\\d{3})*) --> Captures the first number: one to three digits, followed by zero or more repetitions of "whitespace plus exactly three digits".
\\s* --> Followed by zero or more whitespace characters
After that, the same group (\\d{1,3}(?:\\s+\\d{3})*) is repeated two more times so that the numbers are captured in three groups.
Since there are three capturing groups, all three must capture something for the match to succeed. For example, here is how this input is captured,
Andre bygninger 313 400 6 000 370
First, .*? matches "Andre bygninger ". Then the first group (\\d{1,3}(?:\\s+\\d{3})*) matches 313 (because of \\d{1,3}), and (?:\\s+\\d{3})* matches a space and 400. It stops there because the next data is a space followed by 6, which is just one digit and not three.
Similarly, the second group (\\d{1,3}(?:\\s+\\d{3})*) first matches 6 (because of \\d{1,3}), and then (?:\\s+\\d{3})* matches 000 and stops, because it needs to leave some data for group 3 or else the regex match will fail.
Finally, the third group matches 370, as that is the only data left. So \\d{1,3} matches 370 and (?:\\s+\\d{3})* matches nothing, as it is a zero-or-more group.
Hope that clarifies. Let me know if you still have any query.
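To see the ambiguity described above in practice, here is a minimal sketch that runs the same three-group regex on both the three-number line and the line with the extra 423. The class name AmbiguousGrouping and the helper split are made up for illustration; only the regex itself comes from this answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AmbiguousGrouping {
    // One number: 1-3 digits followed by zero or more "whitespace + 3 digits" blocks.
    private static final String NUM = "(\\d{1,3}(?:\\s+\\d{3})*)";
    private static final Pattern THREE =
            Pattern.compile(".*?" + NUM + "\\s*" + NUM + "\\s*" + NUM);

    // Hypothetical helper: returns the three numbers without inner whitespace,
    // or an empty array when the line does not match.
    static String[] split(String line) {
        Matcher m = THREE.matcher(line);
        if (!m.matches()) {
            return new String[0];
        }
        String[] out = new String[3];
        for (int i = 0; i < 3; i++) {
            out[i] = m.group(i + 1).replaceAll("\\s+", "");
        }
        return out;
    }

    public static void main(String[] args) {
        // prints 313400, 6000, 370
        System.out.println(String.join(", ", split("Andre bygninger 313 400 6 000 370")));
        // prints 313400, 6000370, 423
        System.out.println(String.join(", ", split("Andre bygninger 313 400 6 000 370 423")));
    }
}
```

The second call shows that the middle group greedily keeps 370 and only gives back the final block, so the result depends on how many three-digit blocks follow.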
Edit 22 Dec 2018 for grouping numbers in two groups only
As you want to group the data from this string,
Innbo Ekstra Nordea 1 500 000 1 302
into two numbers, 1500000 and 1302, your regex needs only two groups and becomes this, as I replied in the comment,
.*?(\\d{1,3}(?:\\s+\\d{3})*)\\s*(\\d{1,3}(?:\\s+\\d{3})*)
Here is the Java code for the same,
public static void main(String[] args) throws Exception {
Pattern p = Pattern
.compile(".*?(\\d{1,3}(?:\\s+\\d{3})*)\\s*(\\d{1,3}(?:\\s+\\d{3})*)");
List<String> list = Arrays.asList("Innbo Ekstra Nordea 1 500 000 1 302");
for (String s : list) {
Matcher m = p.matcher(s);
if (m.matches()) {
System.out.println("For string: " + s);
System.out.println(m.group(1).replaceAll(" ", ""));
System.out.println(m.group(2).replaceAll(" ", ""));
} else {
System.out.println("For string: '" + s + "' Didn't match");
}
System.out.println();
}
}
Which prints what you expect.
For string: Innbo Ekstra Nordea 1 500 000 1 302
1500000
1302
Instead of trying to match the part you are interested in, it might be easier to modify the string so that only the part you need is left.
From your question it sounds like the 2nd column of the table always holds a 7-digit number, so you could build that into the regex:
.*\d\s\d{3}\s\d{3}\s(\d+\s+\d+)\s+.*
.* at the start matches all the words from the first column.
\d\s\d{3}\s\d{3} matches the 7 digits and 2 spaces in the 2nd column.
\s matches the space between the columns.
(\d+\s+\d+) matches the 2 sets of digits with a space between them (12 000 in your example).
\s+.* matches the rest of the line.
Example program:
public static void main(String[] args) {
String string = "Bolighus fullverdi 4 374 720 12 000 11 806";
// Because it's a java string, back-slashes need to be escaped - hence the double \\
String result = string.replaceAll(".*\\d\\s\\d{3}\\s\\d{3}\\s(\\d+\\s+\\d+)\\s+.*", "$1");
System.out.println(result);
}
Need a single combined regex for the following pattern:
Prefix: 2221-2720 , Length: 16
Prefix: 51-55 , Length: 16
where the delimiter between digits can be a space ( ), minus sign (-), period (.), backslash (\), or equals sign (=). The condition is that at most one delimiter may occur between any two digits.
Valid number - 230.293.217.952.148.4
Valid number - 230.293 217-952.148.4
Invalid number - 230..293.217.952.148.4
Invalid number - 230.293.-217. 952.148.4
A valid input is one where you have 16 digits separated by any/no delimiters as long as there are no two delimiters adjacent to each other.
Have come up with the following regex:
(2[\s=\\.-]*2[\s=\\.-]*2[\s=\\.-]*[1-9][\s=\\.-]*|2[\s=\\.-]*2[\s=\\.-]*[3-9][\s=\\.-]*[0-9][\s=\\.-]*|2[\s=\\.-]*[3-6][\s=\\.-]*[0-9](?:[\s=\\.-]*[0-9]){1}|2[\s=\\.-]*7[\s=\\.-]*[01][\s=\\.-]*[0-9][\s=\\.-]*|2[\s=\\.-]*7[\s=\\.-]*2[\s=\\.-]*0[\s=\\.-]*)[0-9](?:[\s=\\.-]*[0-9]){11}|(5[\s=\\.-]*[1-5][\s=\\.-]*)[0-9](?:[\s=\\.-]*[0-9]){13}
It does not match certain patterns. For example:
2 3 0 2 9 3 2 1 7 9 5 2 1 4 8 4
23-02-93-21-79-52-14-84
2 3 0 3 4 5 8 0 9 4 9 3 0 8 2 3
For the same numbers, it matches (as expected) the following patterns:
2302932179521484
230.293.217.952.148.4
2303458094930823
230.345.809.493.082.3
230-345-809-493-082-3
There seems to be an issue with delimiters. Kindly let me know what is wrong with my regex.
For this rule
A valid input is one where you have 16 digits separated by any/no
delimiters as long as there are no two delimiters adjacent to each
other
Prefix: 2221-2720 , Length: 16
Prefix: 51-55 , Length: 16
2221 can also be written with single delimiters between its digits, e.g. 2.2-2.1
For these rules, it might be easier to write a pattern with 2 capture groups to match the whole string.
Then using some Java code, you can check the value of the capture groups for the ranges.
^((\d[ =\\.-]?\d)[ =\\.-]?\d[ =\\.-]?\d)(?:[ =\\.-]?\d){12}$
The pattern matches:
^ Start of string
( Capture group 1
(\d[ =\\.-]?\d) Capture group 2: match 2 digits with an optional delimiter (space, =, \, . or -) between them
[ =\\.-]?\d[ =\\.-]?\d Match 2 more digits, each optionally preceded by one of the listed chars
) Close group 1
(?:[ =\\.-]?\d){12} Repeat 12 times: an optional delimiter followed by a single digit
$ End of string
For example
String strings[] = {
"2221.7.952.148.412.32",
"230.293.217.952.148.4",
"5511111111111111",
"130.293 217-952.148.4",
"30..293.217.952.148.4",
"5..5",
".5.5."
};
String regex = "^((\\d[ =\\\\.-]?\\d)[ =\\\\.-]?\\d[ =\\\\.-]?\\d)(?:[ =\\\\.-]?\\d){12}$";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
for (String s : strings) {
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
int grp1 = Integer.parseInt(matcher.group(1).replaceAll("\\D+", ""));
int grp2 = Integer.parseInt(matcher.group(2).replaceAll("\\D+", ""));
if ((grp1 >= 2221 && grp1 <= 2720) || (grp2 >=51 && grp2 <= 55)) {
System.out.println("Match for " + matcher.group());
}
}
}
Output
Match for 2221.7.952.148.412.32
Match for 230.293.217.952.148.4
Match for 5511111111111111
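As a variation on the code above, the check can be split into two steps: first validate the overall shape (16 digits with at most one delimiter between neighbours), then strip the delimiters and compare the prefix numerically. This is only a sketch; the class name CardPrefixCheck and the method isValid are hypothetical, and the ranges are the ones stated in the question.

```java
import java.util.regex.Pattern;

class CardPrefixCheck {
    // 16 digits; between two neighbouring digits at most one of: space = \ . -
    private static final Pattern SHAPE =
            Pattern.compile("\\d(?:[ =\\\\.-]?\\d){15}");

    // Hypothetical helper: true when the shape is valid and the prefix is in range.
    static boolean isValid(String input) {
        if (!SHAPE.matcher(input).matches()) {
            return false;
        }
        String digits = input.replaceAll("\\D", "");
        int prefix4 = Integer.parseInt(digits.substring(0, 4));
        int prefix2 = Integer.parseInt(digits.substring(0, 2));
        return (prefix4 >= 2221 && prefix4 <= 2720)
                || (prefix2 >= 51 && prefix2 <= 55);
    }

    public static void main(String[] args) {
        System.out.println(isValid("2 3 0 2 9 3 2 1 7 9 5 2 1 4 8 4")); // true
        System.out.println(isValid("23-02-93-21-79-52-14-84"));         // true
        System.out.println(isValid("230..293.217.952.148.4"));          // false
    }
}
```

Because the delimiter is optional before every digit, the inputs from the question that put a delimiter after every digit or every pair are accepted as well.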
I have the following method, which checks a given string against regexes in 3 stages. But for some reason the regex fails to reject some inputs. As far as I know, the patterns seem correct. Can someone please tell me what I am doing wrong here?
I have the following code:
public class App {
public static void main(String[] args) {
String identifier = "urn:abc:de:xyz:234567.1890123";
if (identifier.matches("^urn:abc:de:xyz:.*")) {
System.out.println("Match ONE");
if (identifier.matches("^urn:abc:de:xyz:[0-9]{6,12}.[0-9]{1,7}.*")) {
System.out.println("Match TWO");
if (identifier.matches("^urn:abc:de:xyz:[0-9]{6,12}.[a-zA-Z0-9.-_]{1,20}$")) {
System.out.println("Match Three");
}
}
}
}
}
Ideally, this code should generate the output
Match ONE
Match TWO
Match Three
only when identifier = "urn:abc:de:xyz:234567.1890123.abd12", but it produces the same output even if the identifier does not match the intended format, such as for the following inputs:
"urn:abc:de:xyz:234567.1890123"
"urn:abc:de:xyz:234567.1890ANC"
"urn:abc:de:xyz:234567.1890123"
"urn:abc:de:xyz:234567.1890ACB.123"
I do not understand why it allows alphanumeric characters after the first . and why it does not care about the characters after the second . either.
I would like my regex to check that the string has the following format:
1. The string starts with urn:abc:de:xyz:
2. Then it has 6 to 12 digits [0-9] (234567).
3. Then it has the decimal point .
4. Then it has 1 to 7 digits [0-9] (1890123).
5. Then it has another decimal point .
6. Finally it has 1 to 20 alphanumeric and special characters (ABC123.-_12).
This is a valid string for my regex: urn:abc:de:xyz:234567.1890123.ABC123.-_12
This is an invalid string for my regex, as it misses the elements from point 6:
urn:abc:de:xyz:234567.1890123
This is also an invalid string for my regex, as it violates point 4 (it has ABC instead of decimal digits):
urn:abc:de:xyz:234567.1890ABC.ABC123.-_12
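This part of the regex:
[0-9]{6,12}.[0-9]{1,7} matches 6 to 12 digits followed by any character followed by 1 to 7 digits
To match a literal dot, it needs to be escaped. Also, inside a character class .-_ is read as a range from . to _, which includes digits and uppercase letters, so the hyphen should be escaped as well. Try this:
^urn:abc:de:xyz:[0-9]{6,12}\.[0-9]{1,7}\.[a-zA-Z0-9.\-_]{1,20}$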
This one matches any number of dot-separated alphanumeric segments at the end of the string, as in your examples:
^urn:abc:de:xyz:\d{6,12}\.\d{1,7}(?:\.[\w-]{1,20})+$
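A quick way to check the second pattern against the strings from the question is a small test harness like the one below. It is a sketch only: the class name UrnCheck and the method isValid are invented for illustration, and the regex is the one with the repeated dot segment.

```java
import java.util.regex.Pattern;

class UrnCheck {
    // Escaped dots; one or more ".segment" parts of 1-20 word characters or hyphens.
    private static final Pattern URN = Pattern.compile(
            "^urn:abc:de:xyz:\\d{6,12}\\.\\d{1,7}(?:\\.[\\w-]{1,20})+$");

    // Hypothetical helper wrapping the full-string match.
    static boolean isValid(String identifier) {
        return URN.matcher(identifier).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("urn:abc:de:xyz:234567.1890123.ABC123.-_12")); // true
        System.out.println(isValid("urn:abc:de:xyz:234567.1890123"));             // false
        System.out.println(isValid("urn:abc:de:xyz:234567.1890ABC.ABC123.-_12")); // false
    }
}
```

Note that this pattern accepts any number of trailing segments, so a string with three or four dot-separated parts after the digits would also pass.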
I am a novice at regex. I am trying to strip all whitespace and special characters between digits in a string. Note that the string may contain other characters besides the digits.
For example, take this string,
String s1 = "This is Sample AmericanExp Card Number 3400 1000 2000 009";
What I am trying is :-
String s1 = "This is Sample AmericanExp Card Number 3400 1000 2000 009";
String regExp = "[^\\w]+";
String replacement = "";
String changed= s1.replaceAll(regExp, replacement);
System.out.println("changed->" + changed);
It gives the output ThisisSampleAmericanExpCardNumber340010002000009,
The required output is "This is Sample AmericanExp Card Number 340010002000009".
I appreciate the help; please also let me know the concept behind it.
EDIT:-
Now I am masking the card number and its PIN (PCI), so I have this formula
^((4\\d{3})|(5[1-5]\\d{2})|(6011))-?\\d{4}-?\\d{4}-?\\d{4}|3[4,7]\\d{13}$
which checks for some types of credit cards. I am modifying it to also check for the PIN and CVV (that is, to match 4- and 6-digit numbers as well).
Sample String = "Sample AmericanExp Card Number 3400 1000 2000 009 and PIN is 1234 , CVV = 654321"
I modified the formula as follows:
^((4\\d{3})|(5[1-5]\\d{2})|(6011))-?\\d{4}-?\\d{4}-?\\d{4}|3[47]\\d{13}$|^[0-9]{4}$|^[0-9]{6}$
which doesn't give me the correct output (it should also match 4- and 6-digit numbers).
You may use
.replaceAll("(?<=\\d)[\\W_]+(?=\\d)", "")
Or, if you need to deal with Unicode strings:
.replaceAll("(?U)(?<=[0-9])[\\W_]+(?=[0-9])", "")
Regex details:
(?<=\d) - a positive lookbehind that matches a position immediately preceded with a digit
[\W_]+ - one or more non-word or underscore characters
(?=\d) - a positive lookahead that matches a location immediately followed with a digit.
Note that (?U), the embedded form of the Pattern.UNICODE_CHARACTER_CLASS option, makes \W Unicode-aware, so it will no longer match Cyrillic and other non-ASCII letters.
See the Java demo:
String s1 = "This is Sample AmericanExp Card Number 3400 1000 2000 009";
System.out.println("changed -> " + s1.replaceAll("(?<=\\d)[\\W_]+(?=\\d)", ""));
// => changed -> This is Sample AmericanExp Card Number 340010002000009
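To illustrate what the two lookarounds protect, here is a small sketch with mixed separators. The class name DigitSeparatorDemo, the helper joinDigitRuns and the sample text are all made up; the regex is the first one from this answer.

```java
class DigitSeparatorDemo {
    // Removes runs of non-word characters or underscores,
    // but only when a digit sits directly on both sides of the run.
    static String joinDigitRuns(String text) {
        return text.replaceAll("(?<=\\d)[\\W_]+(?=\\d)", "");
    }

    public static void main(String[] args) {
        String sample = "Card 3400-1000_2000 009, PIN 1234 , CVV = 654321";
        // prints: Card 340010002000009, PIN 1234 , CVV = 654321
        System.out.println(joinDigitRuns(sample));
    }
}
```

The separators inside the card number disappear, while ", " after 009 and " , " after 1234 stay, because the next character there is a letter and the lookahead fails.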
I need help understanding the output of the code below. I am unable to figure out the output of System.out.println(m.start() + m.group());. Can someone please explain it to me?
import java.util.regex.*;
class Regex2 {
public static void main(String[] args) {
Pattern p = Pattern.compile("\\d*");
Matcher m = p.matcher("ab34ef");
boolean b = false;
while(b = m.find()) {
System.out.println(m.start() + m.group());
}
}
}
Output is:
0
1
234
4
5
6
Note that if I put System.out.println(m.start() );, output is:
0
1
2
4
5
6
Because you have included a * character, your pattern will match empty strings as well. When I change your code as I suggested in the comments, I get the following output:
0 ()
1 ()
2 (34)
4 ()
5 ()
6 ()
So you have a large number of empty matches (one at each position in the string), with the exception of 34, which matches the string of digits. Use \\d+ if you want to match digits without also matching empty strings.
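The changed code mentioned above is not shown in this thread, so the following is only a guess at its shape: it prints the start index and the matched text separately, which makes the empty matches visible. The class name EmptyMatchDemo and the helper describeMatches are hypothetical. It also explains the original 234: m.start() + m.group() concatenates the int 2 with the string "34".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchDemo {
    // Returns one "start (group)" entry per match so empty matches show up as "()".
    static List<String> describeMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + " (" + m.group() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [0 (), 1 (), 2 (34), 4 (), 5 (), 6 ()]
        System.out.println(describeMatches("\\d*", "ab34ef"));
        // prints [2 (34)]
        System.out.println(describeMatches("\\d+", "ab34ef"));
    }
}
```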
You used this regex - \d* - which basically means zero or more digits. Mind the zero!
So this pattern will match any group of digits, e.g. 34 plus any other position in the string, where the matched sequence will be the empty string.
So, you will have 6 matches, starting at indices 0,1,2,4,5,6. For match starting at index 2, the matched sequence is 34, while for the remaining ones, the match will be the empty string.
If you want to find only digits, you might want to use this pattern: \d+
\d* - matches zero or more digits in the expression.
The expression is ab34ef and its corresponding indices are 012345.
At index 0 there is no digit, so start() prints 0 and group() prints nothing; at index 1 it prints 1 and nothing; at index 2 we find a match, so it prints 2 and 34. Next it prints 4 and nothing, and so on.
Another example:
Pattern pattern = Pattern.compile("\\d\\d");
Matcher matcher = pattern.matcher("123ddc2ab23");
while(matcher.find()) {
System.out.println("start:" + matcher.start() + " end:" + matcher.end() + " group:" + matcher.group() + ";");
}
which will print:
start:0 end:2 group:12;
start:9 end:11 group:23;
You will find more information in the tutorial
I am not very familiar with regular expressions.
I would like help with the following regular expressions:
1. A string that starts with an alphabetic word, followed by any alphabetic words or numbers, e.g. Abc 20 Jan to 15 Dec
2. A string for a decimal number, e.g. 450,122,224.00
3. Also a check whether a string contains a pattern like 'Page 2 of 20'
Thanks.
// 1. String start with alpha word and then followed by
// any alpha or number. e.g. Abc 20 Jan to 15 Dec
// One or more alpha-characters, followed by a space,
// followed by some alpha-numeric character, followed by what ever
Pattern p = Pattern.compile("\\p{Alpha}+ \\p{Alnum}.*");
for (String s : new String[] {"Abc 20 Jan to 15 Dec", "hello world", "123 abc"})
System.out.println(s + " matches: " + p.matcher(s).matches());
// 2. String for a decimal number. e.g. 450,122,224.00
p = Pattern.compile(
"\\p{Digit}+(\\.\\p{Digit}+)?|" + // w/o thousand seps.
"\\p{Digit}{1,3}(,\\p{Digit}{3})*\\.\\p{Digit}+"); // w/ thousand seps.
for (String s : new String[] { "450", "122", "224.00", "450,122,224.00", "0.0.3" })
System.out.println(s + " matches: " + p.matcher(s).matches());
// 3. Also to check if String contain any pattern like 'Page 2 of 20'
// "Page" followed by one or more digits, followed by "of"
// followed by one or more digits.
p = Pattern.compile("Page \\p{Digit}+ of \\p{Digit}+");
for (String s : new String[] {"Page 2 of 20", "Page 2 of X"})
System.out.println(s + " matches: " + p.matcher(s).matches());
Output:
Abc 20 Jan to 15 Dec matches: true
hello world matches: true
123 abc matches: false
450 matches: true
122 matches: true
224.00 matches: true
450,122,224.00 matches: true
0.0.3 matches: false
Page 2 of 20 matches: true
Page 2 of X matches: false
1.) /[A-Z][a-z]*(\s([\d]+)|\s([A-Za-z]+))+/
[A-Z][a-z]* being a capitalized word
\s([\d]+) being a number prefixed by a (white)space
\s([A-Za-z]+) being a word prefixed by a (white)space
2.) /(\d{1,3})(,(\d{3}))*(\.(\d{2}))/
(\d{1,3}) being a 1-to-3 digit number
(,(\d{3}))* being 0-or-more three-digit numbers prefixed by a comma
(\.(\d{2})) being a 2-digit decimal part after a literal dot
3.) /Page (\d+) of (\d+)/
(\d+) being one-or-more digits
When writing this (or any regex) I like to use this tool
1
I am not sure what you mean here. A word at the start, followed by any number of words and numbers?
Try this one:
^[a-zA-Z]+(\s+([a-zA-Z]+|\d+))+
2
Just a decimal number would be
\d+(\.\d+)?
Getting the commas in there:
\d{1,3}(,\d{3})*(\.\d+)?
3
Use
Page \d+ of \d+
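The three patterns above can be tried out in Java roughly as follows. This is a sketch under a few assumptions: the class and method names are made up, a trailing $ is added to the first pattern so that the whole string is checked, and find() is used for the third one because the question asks whether the string contains the pattern.

```java
import java.util.regex.Pattern;

class SimpleFormatChecks {
    // 1: a word, then one or more words or numbers separated by whitespace
    private static final Pattern WORDS_AND_NUMBERS =
            Pattern.compile("^[a-zA-Z]+(\\s+([a-zA-Z]+|\\d+))+$");
    // 2: decimal number with comma thousand separators and optional fraction
    private static final Pattern GROUPED_DECIMAL =
            Pattern.compile("\\d{1,3}(,\\d{3})*(\\.\\d+)?");
    // 3: page marker anywhere in the text
    private static final Pattern PAGE_OF = Pattern.compile("Page \\d+ of \\d+");

    static boolean isWordsAndNumbers(String s) {
        return WORDS_AND_NUMBERS.matcher(s).matches();
    }

    static boolean isGroupedDecimal(String s) {
        return GROUPED_DECIMAL.matcher(s).matches();
    }

    static boolean containsPageMarker(String s) {
        return PAGE_OF.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isWordsAndNumbers("Abc 20 Jan to 15 Dec"));      // true
        System.out.println(isGroupedDecimal("450,122,224.00"));             // true
        System.out.println(containsPageMarker("Report - Page 2 of 20"));    // true
    }
}
```

With matches(), the second pattern rejects an ungrouped value such as 1234; the plain \d+(\.\d+)? pattern would be needed for that form.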