On Which Line Number Was the Regex Match Found? - java

I would like to search a .java file using regular expressions, and I wonder if there is a way to detect on which lines in the file the matches are found.
For example if I look for the match hello with Java regular expressions, will some method tell me that the matches were found on lines 9, 15, and 30?

Possible... with Regex Trickery!
Disclaimer: This is not meant to be a practical solution, but an illustration of a way to use an extension of a terrific regex hack. Moreover, it only works on regex engines that allow capture groups to refer to themselves. For instance, you could use it in Notepad++, as it uses the PCRE engine—but not in Java.
Let's say your file is:
some code
more code
hey, hello!
more code
At the bottom of the file, paste :1:2:3:4:5:6:7, where : is a delimiter not found in the rest of the code, and where the numbers go at least as high as the number of lines.
Then, to get the line of the first hello, you can use:
(?m)(?:(?:^(?:(?!hello).)*(?:\r?\n))(?=[^:]+((?(1)\1):\d+)))*.*hello(?=[^:]+((?(1)\1)+:(\d+)))
The line number of the first line containing hello will be captured by Group 2.
In the demo, see Group 2 capture in the right pane.
The hack relies on a group referring to itself. In the classic @Qtax trick, this is done with (?>\1?). For diversity, I used a conditional instead.
Explanation
The first part of the regex is a line skipper, which captures an increasing amount of the line counter at the bottom to Group 1
The second part of the regex matches hello and captures the line number to Group 2
Inside the line skipper, (?:^(?:(?!hello).)*(?:\r?\n)) matches a line that doesn't contain hello.
Still inside the line skipper, the (?=[^:]+((?(1)\1):\d+)) lookahead first moves to the first : with [^:]+. The outer parentheses in ((?(1)\1):\d+) then capture to Group 1: if Group 1 is already set, the conditional (?(1)\1) matches its current contents; then, regardless, a colon and some digits are matched. This ensures that each time the line skipper matches a line, Group 1 expands to a longer portion of :1:2:3:4:5:6:7
The * matches the line skipper zero or more times
.*hello matches the line with hello
The lookahead (?=[^:]+((?(1)\1)+:(\d+))) is identical to the one in the line skipper, except that this time the digits are captured to Group 2: (\d+)
Reference
Qtax trick (recently awarded an additional bounty by @AmalMurali)
Replace a word with the number of the line on which it is found

If you are using a Unix-based OS / terminal, you could use sed:
sed -n '/regex/=' file
(from this StackOverflow response)

There are no methods in Java that will do it for you. You must read the file line-by-line and check for a match on each line. You can keep an index of the lines as you read them and do whatever you want with that index when a match is found.
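A minimal sketch of that line-by-line approach follows. The class and method names (LineMatchFinder, matchingLines) are hypothetical, and the sample text is the four-line example file from earlier in this thread.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class LineMatchFinder {

    // Returns the 1-based numbers of all lines in which the pattern is found.
    static List<Integer> matchingLines(BufferedReader reader, Pattern pattern) {
        List<Integer> hits = new ArrayList<>();
        int lineNumber = 0;
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                lineNumber++;
                // find() looks for the pattern anywhere in the line.
                if (pattern.matcher(line).find()) {
                    hits.add(lineNumber);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }

    public static void main(String[] args) {
        String sample = "some code\nmore code\nhey, hello!\nmore code\n";
        BufferedReader reader = new BufferedReader(new StringReader(sample));
        System.out.println(matchingLines(reader, Pattern.compile("hello"))); // prints [3]
    }
}
```

For a real file, the reader could come from Files.newBufferedReader instead of a StringReader. Note that this only finds matches that fit on a single line; a pattern that spans line breaks needs a different approach.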

Solution (workaround) M1
Just append line numbers to the file, line by line, before you process (regex match) it.
stackoverflow: how to append line numbers to the File
Solution (workaround) M2
Count all the newline characters that occur before the match group.
long count_NewLines = Pattern.compile("\\R")
    .matcher(content.substring(0, matcher.start()))
    .results()
    .count() + 1;
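To show the newline-counting workaround in context, here is a sketch that wraps it in a loop over all matches. The names NewlineCounter and lineNumbersOfMatches are made up for illustration; Matcher.results() requires Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NewlineCounter {

    // \R matches any line break and treats \r\n as a single one.
    private static final Pattern LINE_BREAK = Pattern.compile("\\R");

    // Returns the 1-based line number on which each match starts.
    static List<Long> lineNumbersOfMatches(String content, Pattern pattern) {
        List<Long> lineNumbers = new ArrayList<>();
        Matcher matcher = pattern.matcher(content);
        while (matcher.find()) {
            // Line breaks before the match start, plus one for the current line.
            long line = LINE_BREAK.matcher(content.substring(0, matcher.start()))
                    .results()
                    .count() + 1;
            lineNumbers.add(line);
        }
        return lineNumbers;
    }

    public static void main(String[] args) {
        String content = "some code\nmore code\nhey, hello!\nmore code\n";
        System.out.println(lineNumbersOfMatches(content, Pattern.compile("hello")));
    }
}
```

Unlike reading line by line, this works on the whole content, so the pattern itself may span line breaks. The trade-off is that the prefix is rescanned for every match, which can be slow for large files with many matches.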

Related

Regex to find delimiter and qualifier in csv string (java)

I'm trying to come up with a regex in Java that can be used to extract the delimiter and qualifier characters used in a given CSV string. My idea was that instead of matching the whole string, I'll just look for the last field, so in pseudocode my regex would look like this:
(match as much of the beginning as possible) followed by
Option 1: (delimiter)(qualifier)(any character)*?(qualifier)(end of string|any linebreak character)
Option 2: (delimiter)((?!reference to delimiter capturing group)[any character])*?(qualifier)(end of string|any linebreak character)
And the regex I came up with:
([\s\S])*((\W)(?!\3)(\W)[\s\S]*?\4($|\R))|((\W)((?!\7)[\s\S])*?($|\R))
Where group 3 is the delimiter, group 4 the qualifier and group 7 the delimiter of option 2.
regex 101 link with nonworking example
Is my concept already wrong or only my regex?
Edit: As pointed out in a comment, there can be ambiguous lines, but the regex doesn't have to find the delimiter/qualifier with 100% certainty on a single try. I'm fine with a regex that scans multiple lines to get the result. Also, this is to be used in a program where the user defines a simple definition of the data he wants to import (which doesn't include the delimiter/qualifier). Specifically, the definition includes the number of fields, which can be used to test which of the found delimiters is the right one if there isn't a clear answer even after multiple lines.

How to match n number of lines with regex

I have some text like this
Notes:
He jumped the sea-horse
but it looked ropey
Then he left
but sometimes it's like this
Notes:
Clay is only green
When it is seen
I need to capture 4 lines of text after "Notes" only so the output should be
Notes:
He jumped the sea-horse
but it looked ropey
Then he left
and for the second example
Notes:
I have tried matching the newlines but it only matches after the rest of the regex
Notes:.*\n{4}
How can I create a regex that allows me to repeat the match for the whole line and a newline four times? (is this a non-capturing group??)
You were close - in Notes:.*\n{4} the {4} is binding only to the newline, so it'll only capture "Notes:" followed by anything, followed by 4 blank lines.
You're looking for something like Notes:\n((?:.*\n){1,4})
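As a rough illustration of that pattern in Java (the question does not name a language, so this is an assumption, as are the names NotesCapture and linesAfterNotes), the lines after "Notes:" end up in group 1:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NotesCapture {

    // "Notes:" followed by a newline, then one to four complete lines in group 1.
    private static final Pattern NOTES = Pattern.compile("Notes:\\n((?:.*\\n){1,4})");

    // Returns the captured lines after "Notes:", or null if nothing matches.
    static String linesAfterNotes(String text) {
        Matcher m = NOTES.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "Notes:\nHe jumped the sea-horse\nbut it looked ropey\nThen he left\n";
        System.out.print(linesAfterNotes(text));
    }
}
```

The sketch assumes \n line endings and that every line, including the last one, ends with a newline; a final line without a trailing newline would not be captured by this pattern.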
If I understand the question correctly, and you want to capture the first 4 lines, not caring if they are blank or not, it may be better to not use regex at all and just split the text on the newline so you get an array of strings. Something like this:
String[] lines = text.split("\\r?\\n");
Then simply cherry pick the first four elements of the array.

Regex to capture text with unknown number of repeated groups in between

I'm trying to parse the number that follows "Dining:" in the following text, under SECOND LEVEL. So '666' should be returned.
MAIN LEVEL
Entrance: 11
Dining: 33
SECOND LEVEL
Entrance: 4444
Living: 5555
Dining: 666
THIRD LEVEL
Dining: 999
Kitchen: 000
Family: 33332
If I use something like (?:\bDining:\s)(.*\b) then it captures the first occurrence under MAIN. I'm trying to therefore specify SECOND LEVEL in the regex, followed by a repeating pattern of: new lines, multiple spaces, and then any text, until Dining: is found. This demo illustrates the two problems I encounter. The regex used is: (?:\bSECOND\sLEVEL(\n\s+.*)*Dining:)(.*\b)
A "Catastrophic backtracking" error appears until you delete the very last line containing Laundry: 1. Is this caused by too many matches or something?
Once you delete that line, the regex captures only the last match under OTHER LEVEL .. returning '2' as opposed to the match under SECOND LEVEL.
Sometimes Dining: will not exist under SECOND LEVEL and therefore nothing should be returned.
What is a regex that will only capture the SECOND LEVEL's Dining: number, and if it doesn't exist then returns nothing? Straight up regex preferred, no looping in Java if possible. Thanks
Use a negative lookahead based regex.
"(?m)^\\s*\\bSECOND LEVEL\\n(?:(?!\\n\\n)[\\s\\S])*\\bDining:\\s*(\\d+)"
DEMO
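Here is a sketch of how that regex could be used from Java. It assumes that the levels are separated by a blank line (that is what the (?!\n\n) check relies on) and that lines end with \n; the names SecondLevelDining and diningNumber are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SecondLevelDining {

    // After "SECOND LEVEL", consume characters only while no blank line
    // ("\n\n") starts at the current position, then look for "Dining:".
    private static final Pattern DINING = Pattern.compile(
            "(?m)^\\s*\\bSECOND LEVEL\\n(?:(?!\\n\\n)[\\s\\S])*\\bDining:\\s*(\\d+)");

    // Returns the Dining number under SECOND LEVEL, or null if there is none.
    static String diningNumber(String text) {
        Matcher m = DINING.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "MAIN LEVEL\n  Entrance: 11\n  Dining: 33\n\n"
                + "SECOND LEVEL\n  Entrance: 4444\n  Living: 5555\n  Dining: 666\n\n"
                + "THIRD LEVEL\n  Dining: 999\n  Kitchen: 000\n  Family: 33332\n";
        System.out.println(diningNumber(text)); // prints 666
    }
}
```

If the sections are not separated by blank lines, the lookahead never stops the scan and the greedy match can run on to a Dining: entry in a later level, so the separator assumption matters.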
The best example I know of for catastrophic backtracking from here is (x+x+)+y. That is to say, it cannot work out the correct boundaries for the capture groups containing x because there are too many ways to divide them.
For xxxxy there are several ways to match the four x's: each inner x+ matching one x with the outer group repeating twice; each inner x+ matching two x's with the outer group matching once; or one inner x+ matching three x's and the other matching one, with the outer group matching once. As you can see, that gets dangerous!
You had (?:\bSECOND\sLEVEL(\n\s+.*)*Dining:)(.*\b); note the (\n\s+.*)* part.
The .* can be a nightmare when combined with the preceding \n\s+ and wrapped in a *. It should be rewritten as (\n\s+[^\s\n][^\n]*)*, which ensures each quantifier ends before the next begins, minimising backtracking.
With this kind of thinking in mind I came up with the following regex to match your string:
(?<=SECOND LEVEL\n)(?:\s+(?:[^\s\n:][^\n:]*):[^\n]*)*\s+Dining:\s*([^\s\n][^\n$]*)

Java Regex: Optional Matching

I've been using the following Regex to extract a zip code from a bunch of text:
"\\d{5}\\-?[1-9]?[1-9]?[1-9]?[1-9]?"
My intention of making the last 4 [1-9] optional (using ? ) was to be able to extract both 5 digit zip codes and 5 digit zip codes with + 4 such as 11001-1010.
However, it only matches the first two digits of the last four numbers even though I put 4 digits at the end.
For example, in the zip code 11001-1010 it would match 11001-10.
Anyone know why?
Simple answer to question: For zip code 11001-1010 your regex would only match 11001-1 because the optional 4 digits after the - cannot be 0.
For the unstated question of how to fix that, it depends on whether you only want to match an optional +4, or you want to also match +3, +2, +1, and +0, like your expression would.
Matching Zip5 with optional +4, e.g. matching 11001-1010 and 11001:
"\\d{5}(?:-\\d{4})?"
Matching Zip5 with optional +N, e.g. matching 11001-1010, 11001-101, 11001-10, 11001-1, 11001-, and 11001:
"\\d{5}(?:-\\d{0,4})?"
Update
Now, if you want to make sure it doesn't match the 56789-1234 of 123456789-123456789 or abcd56789-1234qwerty, you can add a word-boundary check:
"\\b\\d{5}(?:-\\d{4})?\\b"
It's stopping at the first 0 in the suffix,
"\d{5}\-?[1-9]?[1-9]?[1-9]?[1-9]?"
So in your example, it only matches up to 11001-1
Does "\d{5}\-?[0-9]?[0-9]?[0-9]?[0-9]?" work ok?
The other answers are probably cleaner, but that is the bug.
Looks ok per this
You can use \\d{5}\\-\\d{0,4} which allows you to match 0 to 4 digits after -.
EDIT
From the comment: But then the - won’t be optional.
For that you can use \\d{5}(\\-\\d{0,4})? to make group of - and digits after dash optional.
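To tie the answers together, here is a small sketch that extracts ZIP codes with the word-boundary variant suggested above, "\\b\\d{5}(?:-\\d{4})?\\b". The names ZipCodeExtractor and extract are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ZipCodeExtractor {

    // Five digits, optionally followed by a dash and exactly four digits,
    // with word boundaries so that longer digit runs are not partially matched.
    private static final Pattern ZIP = Pattern.compile("\\b\\d{5}(?:-\\d{4})?\\b");

    // Returns every ZIP or ZIP+4 code found in the text, in order.
    static List<String> extract(String text) {
        List<String> zips = new ArrayList<>();
        Matcher m = ZIP.matcher(text);
        while (m.find()) {
            zips.add(m.group());
        }
        return zips;
    }

    public static void main(String[] args) {
        System.out.println(extract("Ship to 11001-1010 or 11001, not 123456789-123456789."));
    }
}
```

Because the suffix uses \d rather than [1-9], the zeros in 11001-1010 no longer cut the match short.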

Matching the last group of something in Java

I'm having trouble defining a regular expression (for a Java program) that gives me the last matching group of something. The reason is that I'm converting some text files (here: the export of a wiki) to the format of a new wiki.
For example, when I have the following text:
Here another include: [[Include(a/a-1)]]
The hierarchy of the pages is:
/a
/a-1
The old wiki referenced the hierarchy name, the new wiki will only have the title of the page. The new format should look like:
{include:a-1}
Currently I have the following regular expression:
/\[\[Include\(([^\)]+)\)\]\]/
which matches from the example above a/a-1, but I need a regular expression that matches only a-1.
Is it possible to construct a regular expression for java that matches the last group only?
So for the following original lines:
[[Include(a)]]
[[Include(a/b)]]
[[Include(a/a-1)]]
[[Include(a/a-1/a-2)]]
I would like to match only
a
b
a-1
a-2
This is the regex you're looking for. Group 1 has the text you want, see the captures pane at the bottom right of the demo, as well as the Substitutions pane at the bottom.
EDIT: per your request, replaced the [a-z0-9-] with [^/] (Did not update the regex101 demo as this regex, which I confirmed to work, breaks in regex101, which uses / as a delimiter, even when escaping the /. However here is another demo on regexplanet)
Search:
\[\[Include\((?:[^/]+\/)*([^/]+)\)\]\]
Replace:
{include:$1}
How does it work?
After the opening bracket of the Include, we match a combination of characters such as a-1 (made of letters, dash and digits) followed by a forward slash, zero or more times, then we capture the last such combination of characters.
In the few languages that support infinite-width lookbehinds, we could match what you want without relying on Group 1 captures.
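A sketch of the search and replace in Java follows. It deliberately uses a slight variation of the pattern above: the character class is [^/)] instead of [^/], because [^/] also matches ) and line breaks, which lets a single match run across several Include tags when the whole text is processed at once. The names IncludeConverter and convert are hypothetical.

```java
import java.util.regex.Pattern;

final class IncludeConverter {

    // Skip any "segment/" prefixes inside Include(...), then capture the last
    // path segment in group 1. [^/)] keeps each segment inside its own tag.
    private static final Pattern INCLUDE =
            Pattern.compile("\\[\\[Include\\((?:[^/)]+/)*([^/)]+)\\)\\]\\]");

    // Rewrites every [[Include(a/b/c)]] as {include:c}.
    static String convert(String wikiText) {
        return INCLUDE.matcher(wikiText).replaceAll("{include:$1}");
    }

    public static void main(String[] args) {
        String input = "[[Include(a)]]\n[[Include(a/b)]]\n"
                + "[[Include(a/a-1)]]\n[[Include(a/a-1/a-2)]]";
        System.out.println(convert(input));
    }
}
```

In the replacement string, $1 refers to group 1, and the surrounding braces are literal characters.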
