Java regex to identify 3 column table from ascii text - java

So I have incoming data that looks something like this:
Applications 7 days 6 days
And I'm trying to create regex that will match this line but not a line that has another column, like this:
Applications 7 days 6 days 5 days
The regex that I'm trying to use is:
^(.*?)(\s){4,}(.*?)(\s){4,}[^(\s){2}]+
Where [^(\s){2}]+ would mean selecting everything up to a double space. There are two problems with this:
it doesn't work to begin with;
the second line I have would still match it.
Is there any regex I can use to only match the 3 column table and not the 4 column, 5 column, etc.?

You should take care with character classes ([]) as some chars inside are treated literally (as if they were escaped).
Try this regex (demo here):
^((?:(?!\s\s).)+)(?:\s){4,}((?:(?!\s\s).)+)(?:\s){4,}((?:.(?!\s\s))+)$
I switched the (.*?) with ((?:(?!\s\s).)+) which will match everything up to a sequence of two spaces.
I added a $ at the end, so it wouldn't match lines with more than three columns.
I also added some ?: so the groups would become non-capturing groups.
Finally, I removed the character class from the end of the regex and added a negative look-ahead.
Columns not ending with spaces
This one will not accept lines where the last column ends with a space (demo here):
^((?:(?!\s\s).)+)(?:\s){4,}((?:(?!\s\s).)+)(?:\s){4,}((?:.(?!\s\s)(?!\s$))+)$
Notice the addition of a second negative look-ahead in the last group: (?!\s$).
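For use from Java, here is a minimal sketch of the first regex above. It assumes the columns in the real input are separated by at least four spaces, as the pattern requires; the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ThreeColumnLine {
    // Regex from the answer above; backslashes are doubled for a Java string literal.
    static final Pattern THREE_COLUMNS = Pattern.compile(
        "^((?:(?!\\s\\s).)+)(?:\\s){4,}((?:(?!\\s\\s).)+)(?:\\s){4,}((?:.(?!\\s\\s))+)$");

    // Returns the three column values, or null when the line is not a 3-column row.
    static String[] parse(String line) {
        Matcher m = THREE_COLUMNS.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String three = "Applications    7 days    6 days";
        String four = "Applications    7 days    6 days    5 days";
        System.out.println(String.join(" | ", parse(three))); // Applications | 7 days | 6 days
        System.out.println(parse(four) == null);              // true
    }
}
```

The three capturing groups hold the column values, so the same call both validates the line and splits it.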

Try this:
^[^\s]*(\s{2,}[^\s].*){2,}
This assumes that each column value is preceded by at least two spaces.

Related

Java 8 regex: a capturing group in a pattern doesn't match, yet the whole pattern does match

This is my first question. Nice to e-meet everyone.
I have created the following regex pattern in Java 8 (this is just a simplified example of what I actually have in my code - for the sake of clarity):
(?<!a)([0-9])\,([0-9])(?!a)|(?<!b)([0-9]) ([0-9])(?!b)|(?<!c)([0-9])([0-9])(?!c)
so in general it consists of three alternatives:
1st one matches two single digits separated with a comma, for example:
1,1
2,0
4,5
2nd one matches two single digits separated with a space, for example:
1 1
2 0
4 5
3rd one matches two single digits in a row, for example:
11
20
45
Each alternative uses lookarounds and their content has to be slightly different for each one of them - that's why I couldn't just put everything together like that:
([0-9])[, ]?([0-9])
Each of the matched digits is enclosed in a capturing group and now I have a second line to 'call out' these captured numbers like this:
(?<!n)($1 $2|$3 $4|$5 $6)(?!n)
So at the end I need to match a text that would have the same digits separated with single space and not surrounded by 'n'. So if any of the examples shown above would be matched by the pattern from the 1st line, the 2nd line pattern should match these:
1 1
2 0
4 5
11 11
22 00
44 55
And not any of these:
n1 1
2,0
45
asd asd asd
The problem is the following: it returns a match even if I do not have these captured digits in the tested text, but I do have space in it... So here I do not get match and that is correct:
aaaaaaaaa
bbbbbbbbb
aasdfasdf
but here I get a match on the following things (most apparently because there is a space/spaces):
abc abc
q w r t y
as df
Does anyone know if this is normal that despite the fact that the characters in capturing groups are not captured by the 1st line, the 'non capturing group' part (so a single space) will be matched and therefore the whole pattern returns match, as if a capturing group could be a zero-length match in the second line if nothing is captured by the first line? Thanks in advance for any comment on this.
Your regex matches whitespace because the resulting pattern for the 1,1 string is (?<!n)(1 1| | )(?!n), and it can match a space that is neither preceded nor followed by an n.
When a replacement backreference does not match any string in a .replaceAll/.replaceFirst it is assigned an empty string (it is assigned null when using .find() / .matches()), and thus you still get the blank alternatives in the resulting pattern.
You may leverage this functionality AND the fact that each alternative has exactly two capturing groups by concatenating replacement backreferences in the string replacement pattern, getting rid of the alternations altogether:
SEARCH: (?<!a)([0-9]),([0-9])(?!a)|(?<!b)([0-9]) ([0-9])(?!b)|(?<!c)([0-9])([0-9])(?!c)
REPLACE: (?<!n)$1$3$5 $2$4$6(?!n)
Note how the backreferences are concatenated: all backreferences to odd groups come first, then all backreferences to even groups are placed in a no-alternative pattern.
See the regex demo.
Note that even if the number of groups is different across the alternatives you may just add "fake" empty groups to each of them, and this approach will still work.
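Here is a rough Java sketch of the concatenation idea. It assumes the second pattern is built by calling replaceAll on the matched digits, with the odd-numbered backreferences placed before the space and the even-numbered ones after it; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class BackrefPatternBuilder {
    static final String SEARCH =
        "(?<!a)([0-9]),([0-9])(?!a)|(?<!b)([0-9]) ([0-9])(?!b)|(?<!c)([0-9])([0-9])(?!c)";
    // Odd groups ($1, $3, $5) first, then a space, then even groups ($2, $4, $6).
    // Groups that did not take part in the match contribute an empty string.
    static final String REPLACE = "(?<!n)$1$3$5 $2$4$6(?!n)";

    // Turns a matched digit pair such as "2,0" into the pattern text "(?<!n)2 0(?!n)".
    static String buildPattern(String digits) {
        return digits.replaceAll(SEARCH, REPLACE);
    }

    // True when the generated pattern is found anywhere in the input.
    static boolean found(String patternText, String input) {
        return Pattern.compile(patternText).matcher(input).find();
    }

    public static void main(String[] args) {
        String p = buildPattern("2,0");
        System.out.println(p);                   // (?<!n)2 0(?!n)
        System.out.println(found(p, "2 0"));     // true
        System.out.println(found(p, "n2 0"));    // false
        System.out.println(found(p, "abc abc")); // false
    }
}
```

Because only one alternative matches at a time, exactly one odd and one even backreference is non-empty, so no blank alternative is left behind to match a lone space.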

Regex format for a particular Match

I am trying to write a regex for the following format
PA-123456-067_TY
It's always PA, followed by a dash, 6 digits, another dash, then 3 digits, and ends with _TY
Apparently, when I write this regex to match the above format it shows the output correctly
^[^[PA]-]+-(([^-]+)-([^_]+))_([^.]+)
with all the Negation symbols ^
This does not work if I write the regex in the below format without negation symbols
[[PA]-]+-(([-]+)-([_]+))_([.]+)
Can someone explain to me why is this so?
The negation symbol means that the character cannot be anything within the specified class. Your regex is much more complicated than it needs to be and is therefore obfuscating what you really want.
You probably want something like this:
^PA-(\d+)-(\d+)_TY$
... which matches anything that starts with PA-, then includes two groups of numbers separated by a dash, then an underscore and the letters TY. If you want everything after the PA to be what you capture, but separated into the three groups, then it's a little more abstract:
^PA-(.+)-(.+)_(.+)$
This matches:
PA-
a capture group of any characters
a dash
another capture group of any characters
an underscore
all the remaining characters until end-of-line
Character classes [...] are saying match any single character in the list, so your first capture group (([^-]+)-([^_]+)) is looking for anything that isn't a dash any number of times followed by a dash (which is fine) followed by anything that isn't an underscore (again fine). Having the extra set of parentheses around that creates another capture group (probably group 1 as it's the first parentheses reached by the regex engine)... that part is OK but probably makes interpreting the answer less intuitive in this case.
In the re-write however, your first capture group (([-]+)-([_]+)) matches [-]+, which means "one or more dashes" followed by a dash, followed by any number of underscores followed by an underscore. Since your input does not have a dash immediately following PA-, the entire regex fails to find anything.
Putting the PA inside embedded character classes is also making things complicated. The first part of your first one is looking for, well, I'm not actually sure how [^[PA]-]+ is interpreted in practice but I suspect it's something like "not either a P or an A or a dash any number of times". The second one is looking for the opposite, I think. But you don't want any of that, you just want to start with nothing other than the actual sequence of characters you care about, which is just PA-.
Update: As per the clarifications in the comments on the original question, knowing you want fixed-size groups of digits, it would look like this:
^PA-(\d{6})-(\d{3})_TY$
That captures PA-, then a 6-digit number, then a dash, then a 3-digit number, then _TY. The six digit number and 3 digit numbers will be in capture groups 1 and 2, respectively.
If the sizes of those numbers could ever change, then replace {x} with + to just capture numbers regardless of max length.
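In case it helps, this is roughly how the fixed-size version could be used from Java to pull out the two numbers. It is only a sketch; the class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PartCode {
    // ^PA-(\d{6})-(\d{3})_TY$ with backslashes doubled for a Java string literal.
    static final Pattern CODE = Pattern.compile("^PA-(\\d{6})-(\\d{3})_TY$");

    // Returns { sixDigits, threeDigits }, or null when the input has a different shape.
    static String[] parts(String input) {
        Matcher m = CODE.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] p = parts("PA-123456-067_TY");
        System.out.println(p[0] + " and " + p[1]);            // 123456 and 067
        System.out.println(parts("PA-12345-067_TY") == null); // true, only five digits
    }
}
```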
According to your comment, this would be appropriate: PA-\d{6}-\d{3}_TY
EDIT: If you want to match a whole line, use it with anchors: ^PA-\d{6}-\d{3}_TY$

Regex for numbers

I'm trying to create a regex for numbers where 7 should appear at least once and 9 should not appear at all
/[^9]//d+
I'm not sure how to make it require 7 at least once
Also, it fails for the following example
123459: it accepts the string, even though there is a 9 in it
However, if my string is 95, it rejects it, which is right
Code
Method 1
See regex in use here
(?=\d*7)(?!\d*9)\d+
Method 2
See regex in use here
\b(?=\d*7)[0-8]+\b
Note: This method uses fewer steps (170) as opposed to Method 1 with 406 steps.
Alternatively, you can also replace [0-8] with [^9\D] as seen here, which is basically saying don't match 9 or \D (any non-digit character).
You can also use \b(?=[^7\D]*7)[0-8]+\b as seen here, which brings the number of steps down from 170 to 147.
Method 3
See regex in use here
\b[0-8]*7[0-8]*\b
Note: This method uses fewer steps than both methods above, at 139 steps. The only issue with this regex is that you need to identify valid characters in multiple locations in the pattern.
Results
Input
**VALID**
123456780
7
1237412
**INVALID**
9
12345680
1234567890
12341579
Output
Note: Shown below are strings that match.
123456780
7
1237412
Explanation
Method 1
(?=\d*7) Positive lookahead ensuring what follows is any digit any number of times, followed by 7 literally
(?!\d*9) Negative lookahead ensuring what follows is not any digit any number of times, followed by 9 literally
\d+ Any digit one or more times
Method 2
\b Assert the position as a word boundary
(?=\d*7) Positive lookahead ensuring what follows is any digit any number of times, followed by 7 literally
[0-8]+ Match any character present in the set 0-8
\b Assert the position as a word boundary
Method 3
\b Assert the position as a word boundary
[0-8]* Match any digit (except 9) any number of times
7 Match the digit 7 literally
[0-8]* Match any digit (except 9) any number of times
\b Assert the position as a word boundary
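As a quick check, Method 3 can be run from Java against the sample input listed under Results. This is a sketch only: it assumes the numbers are embedded in a single space-separated string, and the class name is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SevenNoNine {
    // Method 3: digits 0-8 around at least one literal 7, bounded by word boundaries.
    static final Pattern METHOD_3 = Pattern.compile("\\b[0-8]*7[0-8]*\\b");

    // Collects every number in the text that contains a 7 and no 9.
    static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = METHOD_3.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "123456780 7 1237412 9 12345680 1234567890 12341579";
        System.out.println(findAll(input)); // [123456780, 7, 1237412]
    }
}
```

The word boundaries are what stop the pattern from matching the 9-free prefix of a longer number such as 1234567890.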
One way to do it would be to use several lookaheads:
(?=[^7]*7)(?!.*9)^\d+$
See a demo on regex101.com.
Note that you need to double escape the backslashes in Java, so that it becomes:
(?=[^7]*7)(?!.*9)^\\d+$
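To validate one whole string rather than search inside text, the escaped pattern above can be dropped into Pattern.compile. A small sketch, with a hypothetical class name:

```java
import java.util.regex.Pattern;

class DigitRule {
    // Same pattern as above, written as a Java string literal.
    static final Pattern RULE = Pattern.compile("(?=[^7]*7)(?!.*9)^\\d+$");

    // True when the input is all digits, contains a 7 and contains no 9.
    static boolean isValid(String input) {
        return RULE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("123456780"));  // true
        System.out.println(isValid("12345680"));   // false, no 7
        System.out.println(isValid("1234567890")); // false, contains 9
    }
}
```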
This has got a bit complex, but it works for your use case:
(?=.*^[0-68-9]*7[0-68-9]*$)(?=^(?:(?!9).)*$).*$
The first lookahead matches exactly one occurrence of 7 and accepts only digits; the second checks that 9 does not occur.
Try here : https://regex101.com/r/5OHgIr/1
If I understand correctly, you need a regex that accepts all numbers that include at least one 7 and exclude 9, so try this:
(?:[0-8]*7[0-8]*)+
If you want to find only whole numbers inside normal text, add \s at the start and end of the regex.

Java Regex: Optional Matching

I've been using the following Regex to extract a zip code from a bunch of text:
"\\d{5}\\-?[1-9]?[1-9]?[1-9]?[1-9]?"
My intention of making the last 4 [1-9] optional (using ? ) was to be able to extract both 5 digit zip codes and 5 digit zip codes with + 4 such as 11001-1010.
However, it only matches the first two digits of the last four numbers even though I put 4 digits at the end.
For example, in the zip code 11001-1010 it would match 11001-10.
Anyone know why?
Simple answer to question: For zip code 11001-1010 your regex would only match 11001-1 because the optional 4 digits after the - cannot be 0.
For the unstated question of how to fix that, it depends on whether you only want to match an optional +4, or you want to also match +3, +2, +1, and +0, like your expression would.
Matching Zip5 with optional +4, e.g. matching 11001-1010 and 11001:
"\\d{5}(?:-\\d{4})?"
Matching Zip5 with optional +N, e.g. matching 11001-1010, 11001-101, 11001-10, 11001-1, 11001-, and 11001:
"\\d{5}(?:-\\d{0,4})?"
Update
Now, if you want to make sure it doesn't match the 56789-1234 of 123456789-123456789 or abcd56789-1234qwerty, you can add a word-boundary check:
"\\b\\d{5}(?:-\\d{4})?\\b"
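To see the difference in practice, here is a small Java sketch that runs the original pattern and the word-boundary version over some text. The class name, helper method and sample sentence are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZipExtractor {
    // Original pattern from the question: [1-9] cannot match the 0 in "1010".
    static final Pattern ORIGINAL = Pattern.compile("\\d{5}\\-?[1-9]?[1-9]?[1-9]?[1-9]?");
    // Zip5 with an optional +4, guarded by word boundaries.
    static final Pattern FIXED = Pattern.compile("\\b\\d{5}(?:-\\d{4})?\\b");

    // Returns every match of the pattern in the text, in order.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll(ORIGINAL, "11001-1010"));                // [11001-1]
        System.out.println(findAll(FIXED, "Ship to 11001-1010 or 90210.")); // [11001-1010, 90210]
        System.out.println(findAll(FIXED, "123456789-123456789"));          // []
    }
}
```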
It's stopping at the first 0 in the suffix,
"\d{5}\-?[1-9]?[1-9]?[1-9]?[1-9]?"
So in your example, it only matches up to 11001-1
Does "\d{5}\-?[0-9]?[0-9]?[0-9]?[0-9]?" work ok?
The other answers are probably cleaner, but that is the bug.
Looks ok per this
You can use \\d{5}\\-\\d{0,4} which allows you to match 0 to 4 digits after -.
EDIT
From the comment : But then the - won’t be optional.
For that you can use \\d{5}(\\-\\d{0,4})? to make group of - and digits after dash optional.

In Java regex - how to retain numbers ONLY when attached to string

I'm trying to tokenize text files that contain useful text but also many numbers that I don't want. However, using something like [^a-zA-Z0-9], I retain all digits (0-9).
I would like to retain digits ONLY if attached to characters OR hyphenated like "24hr" or "7-days".
So, input: "There are 3, 24hr positions available 7-days a week. Call 555-1212"
Returns a list of the following tokens: There are 24hr positions available 7-days a week Call
Thanks for any help!
\d+-?[A-Za-z]+|[A-Za-z]+-?\d+|[A-Za-z]+
See it here in action: http://regexr.com?318em
The square brackets [, ] represent something called a character class, which basically means match any single character in this class. [A-Za-z0-9]+ will therefore match any combination of letters and digits.
If you want to specify order you need to remove the digits from the character class and add another character class after it.
ex:
[0-9]+-?[a-zA-Z]+|[a-zA-Z]+-?[0-9]+|[a-zA-Z]+
[a-zA-Z]+ - matches 1 or more letters
-? - optionally matches a dash
[0-9]+ - matches 1 or more digits
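Instead of removing the unwanted parts, the pattern can also be used to collect the wanted tokens directly with Matcher.find(). A minimal Java sketch using the sample sentence from the question; the class and method names are invented:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenFilter {
    // digits + optional dash + letters | letters + optional dash + digits | letters only
    static final Pattern TOKEN =
        Pattern.compile("[0-9]+-?[a-zA-Z]+|[a-zA-Z]+-?[0-9]+|[a-zA-Z]+");

    // Returns the tokens that match, skipping bare numbers and punctuation.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "There are 3, 24hr positions available 7-days a week. Call 555-1212";
        System.out.println(tokenize(input));
        // [There, are, 24hr, positions, available, 7-days, a, week, Call]
    }
}
```

The order of the alternatives matters here: the digit-first alternative is tried before the letters-only one, so "24hr" is taken as one token rather than losing its digits.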
After lots of trial and error, this did it (note leading space):
\d[^-a-z]+ | -\d+|[^a-zA-Z0-9-]|[0-9]+-[0-9]+|\W-+|[0-9]+-\W
http://regexr.com?318hp
I hope this helps anyone else who needs it.
I'm using it in RapidMiner to remove unwanted tokens in text processing.
