How to match n number of lines with regex - java

I have some text like this
Notes:
He jumped the sea-horse
but it looked ropey
Then he left
but sometimes it's like this
Notes:
Clay is only green
When it is seen
I need to capture only 4 lines of text after "Notes", so the output should be
Notes:
He jumped the sea-horse
but it looked ropey
Then he left
and for the second example
Notes:
I have tried matching the newlines but it only matches after the rest of the regex
Notes:.*\n{4}
How can I create a regex that allows me to repeat the match for the whole line and a newline four times? (is this a non-capturing group??)

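You were close. In Notes:.*\n{4} the {4} binds only to the newline, so the pattern only matches "Notes:" followed by anything on that line, followed by 4 consecutive newlines (that is, blank lines).
You're looking for something like Notes:\n((?:.*\n){1,4}). And yes, (?:...) is a non-capturing group: it lets the quantifier {1,4} apply to a whole line plus its newline, while the outer parentheses capture the repeated lines.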
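A minimal sketch of how that pattern could be used from Java. The class and method names (NotesCapture, linesAfterNotes) are made up for illustration, and the input is assumed to use \n line endings with a newline after every line that should be captured.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NotesCapture {
    // "Notes:" on its own line, then between one and four complete lines.
    // The non-capturing group (?:.*\n) is one line plus its newline.
    private static final Pattern NOTES =
            Pattern.compile("Notes:\\n((?:.*\\n){1,4})");

    // Returns the captured lines (newlines included), or "" if nothing matches.
    static String linesAfterNotes(String text) {
        Matcher m = NOTES.matcher(text);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String text = "Notes:\nClay is only green\nWhen it is seen\n";
        System.out.print(linesAfterNotes(text));
    }
}
```

Because every repetition must end in \n, a final line without a trailing newline is not captured by this sketch.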

If I understand the question correctly, and you want to capture the first 4 lines, not caring if they are blank or not, it may be better to not use regex at all and just split the text on the newline so you get an array of strings. Something like this:
String[] lines = text.split("\\r?\\n");
Then simply cherry-pick the first four elements of the array (or fewer, if the text is shorter).
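A rough sketch of that idea, assuming the text is split on \r?\n so that both Unix and Windows line endings work. The names FirstLines and firstLines are hypothetical.

```java
import java.util.Arrays;

class FirstLines {
    // Returns at most n leading lines of the text.
    static String[] firstLines(String text, int n) {
        String[] lines = text.split("\\r?\\n");
        // Arrays.copyOf would pad with nulls if n exceeded the length,
        // so clamp n to the number of lines actually present.
        return Arrays.copyOf(lines, Math.min(n, lines.length));
    }

    public static void main(String[] args) {
        String text = "Notes:\nClay is only green\nWhen it is seen";
        for (String line : firstLines(text, 4)) {
            System.out.println(line);
        }
    }
}
```

The clamp matters for the shorter example in the question, where fewer than four lines exist.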

Related

Complex RegEx pattern for replacing commas

I have a problem with a regex pattern that has to replace two commas with one comma followed by a space. If there is only one comma and no space after it, I want it to add that space.
Currently I am using this pattern: /([,]+)/g. However, when there is one comma that already has a space after it, it adds one more space.
Cases:
text,,text -> text, text
text, text -> text, text
text,text -> text, text
text,,text,,text -> text, text, text
(I am using Java)
Do you have any suggestions for what this regex pattern should look like? I am still a bit confused.
Thanks.

What you want is to ensure exactly one space after every chain of commas, not to add one in every case. You could do this with a lookahead, but I prefer to include any spaces that already follow the commas in the match and replace them as well.
str = str.replaceAll(",+ *", ", ");
This takes all (real) spaces after the comma(s) and replaces them too, so the single space that you insert is the only one. This will NOT work for line breaks: if the comma(s) is/are followed by a line break, you will end up with a space between the (new) comma and the line break. If you have line breaks and want to handle them explicitly, you need to proceed differently.
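To check the four cases from the question, a small sketch along these lines could be used. CommaNormalizer and normalizeCommas are illustrative names, not part of the original answer.

```java
class CommaNormalizer {
    // One or more commas plus any spaces after them become ", ".
    static String normalizeCommas(String s) {
        return s.replaceAll(",+ *", ", ");
    }

    public static void main(String[] args) {
        String[] cases = {
            "text,,text", "text, text", "text,text", "text,,text,,text"
        };
        for (String c : cases) {
            System.out.println(c + " -> " + normalizeCommas(c));
        }
    }
}
```

Since String is immutable, the result of replaceAll has to be used or assigned; calling it without keeping the return value changes nothing.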

Regex format for a particular Match

I am trying to write a regex for the following format
PA-123456-067_TY
It's always PA, followed by a dash, 6 digits, another dash, then 3 digits, and ends with _TY
Apparently, when I write this regex to match the above format it shows the output correctly
^[^[PA]-]+-(([^-]+)-([^_]+))_([^.]+)
with all the Negation symbols ^
This does not work if I write the regex in the below format without negation symbols
[[PA]-]+-(([-]+)-([_]+))_([.]+)
Can someone explain to me why is this so?

The negation symbol means that the character cannot be anything within the specified class. Your regex is much more complicated than it needs to be and is therefore obfuscating what you really want.
You probably want something like this:
^PA-(\d+)-(\d+)_TY$
... which matches anything that starts with PA-, then includes two groups of numbers separated by a dash, then an underscore and the letters TY. If you want everything after the PA to be what you capture, but separated into the three groups, then it's a little more abstract:
^PA-(.+)-(.+)_(.+)$
This matches:
PA-
a capture group of any characters
a dash
another capture group of any characters
an underscore
all the remaining characters until end-of-line
Character classes [...] are saying match any single character in the list, so your first capture group (([^-]+)-([^_]+)) is looking for anything that isn't a dash any number of times followed by a dash (which is fine) followed by anything that isn't an underscore (again fine). Having the extra set of parentheses around that creates another capture group (probably group 1 as it's the first parentheses reached by the regex engine)... that part is OK but probably makes interpreting the answer less intuitive in this case.
In the re-write however, your first capture group (([-]+)-([_]+)) matches [-]+, which means "one or more dashes" followed by a dash, followed by any number of underscores followed by an underscore. Since your input does not have a dash immediately following PA-, the entire regex fails to find anything.
Putting the PA inside embedded character classes is also making things complicated. The first part of your first one is looking for, well, I'm not actually sure how [^[PA]-]+ is interpreted in practice but I suspect it's something like "not either a P or an A or a dash any number of times". The second one is looking for the opposite, I think. But you don't want any of that, you just want to start without anything other than the actual sequence of characters you care about, which is just PA-.
Update: As per the clarifications in the comments on the original question, knowing you want fixed-size groups of digits, it would look like this:
^PA-(\d{6})-(\d{3})_TY$
That captures PA-, then a 6-digit number, then a dash, then a 3-digit number, then _TY. The six digit number and 3 digit numbers will be in capture groups 1 and 2, respectively.
If the sizes of those numbers could ever change, then replace {x} with + to just capture numbers regardless of max length.
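As a sketch of how the fixed-size version might be used in Java (the class name PartCodeParser and the method parse are invented for this example):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PartCodeParser {
    private static final Pattern CODE =
            Pattern.compile("^PA-(\\d{6})-(\\d{3})_TY$");

    // Returns {six-digit part, three-digit part}, or null if the input
    // does not have the expected format.
    static String[] parse(String input) {
        Matcher m = CODE.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = parse("PA-123456-067_TY");
        System.out.println(parts[0] + " / " + parts[1]);
    }
}
```

Matcher.matches() already requires the whole input to match, so the ^ and $ anchors are redundant here; they are kept so the pattern reads the same as above.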

According to your comment, this would be appropriate: PA-\d{6}-\d{3}_TY
EDIT: If you want to match a whole line, use it with anchors: ^PA-\d{6}-\d{3}_TY$

On Which Line Number Was the Regex Match Found?

I would like to search a .java file using regular expressions, and I wonder if there is a way to detect on which lines in the file the matches are found.
For example if I look for the match hello with Java regular expressions, will some method tell me that the matches were found on lines 9, 15, and 30?

Possible... with Regex Trickery!
Disclaimer: This is not meant to be a practical solution, but an illustration of a way to use an extension of a terrific regex hack. Moreover, it only works on regex engines that allow capture groups to refer to themselves. For instance, you could use it in Notepad++, as it uses the PCRE engine—but not in Java.
Let's say your file is:
some code
more code
hey, hello!
more code
At the bottom of the file, paste :1:2:3:4:5:6:7, where : is a delimiter not found in the rest of the code, and where the numbers go at least as high as the number of lines.
Then, to get the line of the first hello, you can use:
(?m)(?:(?:^(?:(?!hello).)*(?:\r?\n))(?=[^:]+((?(1)\1):\d+)))*.*hello(?=[^:]+((?(1)\1)+:(\d+)))
The line number of the first line containing hello will be captured by Group 2.
In the demo, see Group 2 capture in the right pane.
The hack relies on a group referring to itself. In the classic @Qtax trick, this is done with (?>\1?). For diversity, I used a conditional instead.
Explanation
The first part of the regex is a line skipper, which captures an increasing amount of the line counter at the bottom to Group 1
The second part of the regex matches hello and captures the line number to Group 2
Inside the line skipper, (?:^(?:(?!hello).)*(?:\r?\n)) matches a line that doesn't contain hello.
Still inside the line skipper, the (?=[^:]+((?(1)\1):\d+)) lookahead gets us to the first : with [^:]+ then the outer parentheses in ((?(1)\1):\d+) capture to Group 1: if Group 1 is set, (?(1)\1) matches Group 1 again; then, regardless, a colon and some digits. This ensures that each time the line skipper matches a line, Group 1 expands to a longer portion of :1:2:3:4:5:6:7
The * matches the line skipper zero or more times
.*hello matches the line with hello
The lookahead (?=[^:]+((?(1)\1)+:(\d+))) is identical to the one in the line skipper, except that this time the digits are captured to Group 2: (\d+)
Reference
Qtax trick (recently awarded an additional bounty by @AmalMurali)
Replace a word with the number of the line on which it is found

If you are using a Unix based OS / terminal, you could use sed:
sed -n '/regex/=' file
(from this StackOverflow response)

There are no methods in Java that will do it for you. You must read the file line-by-line and check for a match on each line. You can keep an index of the lines as you read them and do whatever you want with that index when a match is found.
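A minimal sketch of the line-by-line approach. To keep it self-contained it works on a String that is assumed to hold the file content (for a real file, something like Files.readAllLines could supply the lines instead). The names MatchLineNumbers and matchingLines are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class MatchLineNumbers {
    // Returns the 1-based numbers of all lines that contain a match.
    static List<Integer> matchingLines(String content, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<Integer> result = new ArrayList<>();
        // \R matches any line break sequence (Java 8 and later).
        String[] lines = content.split("\\R", -1);
        for (int i = 0; i < lines.length; i++) {
            if (pattern.matcher(lines[i]).find()) {
                result.add(i + 1);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String content = "some code\nmore code\nhey, hello!\nmore code";
        System.out.println(matchingLines(content, "hello")); // prints [3]
    }
}
```

This reports each line once even if it contains several matches, and it cannot find matches that span more than one line.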

Solution (workaround) M1
Append line numbers to the file, line by line, before you process (regex match) it.
stackoverflow: how to append line numbers to the File
Solution (workaround) M2
Count all the newline characters that occur before the start of the match, then add 1.
// Requires Java 9+ for Matcher.results(). "content" is the file text and
// "matcher" is the Matcher that has just found a match in it.
long lineNumber = Pattern.compile("\\R")
        .matcher(content.substring(0, matcher.start()))
        .results()
        .count() + 1;

Regular expression removing all words shorter than n

Well, I'm looking for a regexp in Java that deletes all words shorter than 3 characters.
I thought something like \s\w{1,2}\s would grab all the 1 and 2 letter words (a whitespace, one to two word characters and another whitespace), but it just doesn't work.
Where am I wrong?

I've got it working fairly well, but it took two passes.
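public class ShortWordRemover {
    public static void main(String[] args) {
        String passage = "Well, I'm looking for a regexp in Java that deletes all words shorter than 3 characters.";
        System.out.println(passage);
        // Pass 1: delete every run of one or two word characters or apostrophes.
        passage = passage.replaceAll("\\b[\\w']{1,2}\\b", "");
        // Pass 2: collapse the runs of whitespace that pass 1 leaves behind.
        passage = passage.replaceAll("\\s{2,}", " ");
        System.out.println(passage);
    }
}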
The first pass deletes all words containing fewer than three characters by replacing them with an empty string. Note that I had to include the apostrophe in the character class because the word "I'm" was giving me trouble without it. You may find other special characters in your text that you also need to include here.
The second pass is necessary because the first pass left a few spots where there were double spaces. This just collapses all occurrences of 2 or more spaces down to one. It's up to you whether you need to keep this or not, but I think it's better with the spaces collapsed.
Output:
Well, I'm looking for a regexp in Java that deletes all words shorter than 3 characters.
Well, looking for regexp Java that deletes all words shorter than characters.

If you don't want the whitespace matched, you might want to use
\b\w{1,2}\b
to get the word boundaries.
That's working for me in RegexBuddy using the Java flavor; for the test string
"The dog is fun a cat"
it highlights "is" and "a". Similarly for words at the beginning/end of a line.
You might want to post a code sample.
(And, as GameFreak just posted, you'll still end up with double spaces.)
EDIT:
\b\w{1,2}\b\s?
is another option. This will partially fix the space-stripping issue, although words at the end of a string or followed by punctuation can still cause issues. For example, "A dog is fun no?" becomes "dog fun ?" In any case, you're still going to have issues with capitalization (dog should now be Dog).

Try: \b\w{1,2}\b although you will still have to get rid of the double spaces that will show up.

If you have a string like this:
hello there my this is a short word
This regex will match all words in the string greater than or equal to 3 characters in length:
\w{3,}
Resulting in:
hello there this short word
That, to me, is the easiest approach. Why try to match what you don't want, when you can match what you want a lot easier? No double spaces, no leftovers, and the punctuation is under your control. The other approaches break on multiple spaces and aren't very robust.
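One way to sketch this match-what-you-want approach in Java is to collect the matches and join them with single spaces. LongWords and keepLongWords are names invented for the example, and joining with a space is an assumption about the desired output format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongWords {
    // Keeps only runs of at least minLength word characters.
    static String keepLongWords(String text, int minLength) {
        Matcher m = Pattern.compile("\\w{" + minLength + ",}").matcher(text);
        List<String> words = new ArrayList<>();
        while (m.find()) {
            words.add(m.group());
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        // prints: hello there this short word
        System.out.println(keepLongWords("hello there my this is a short word", 3));
    }
}
```

Note that punctuation is dropped entirely, because only the \w runs are kept.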

use of delimiter function from scanner for "abc-def"

I'm currently trying to filter a text-file which contains words that are separated with a "-". I want to count the words.
scanner.useDelimiter(("[.,:;()?!\" \t\n\r]+"));
The problem which occurs simply is: words that contain a "-" will get separated and counted for being two words. So just escaping with \- isn't the solution of choice.
How can I change the delimiter-expression, so that words like "foo-bar" will stay, but the "-" alone will be filtered out and ignored?
Thanks ;)

OK, I'm guessing at your question here: you mean that you have a text file with some "real" prose, i.e. sentences that actually make sense, are separated by punctuation and the like, etc., right?
Example:
This situation is ameliorated - as far as we can tell - by the fact that our most trusted allies, the Vorgons, continue to hold their poetry slam contests; the enemy has little incentive to interfere with that, even with their Mute-O-Matic devices.
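So, what you need as a delimiter is either any amount of whitespace and/or punctuation (which you already have covered with the regex you showed), or a hyphen that is surrounded by at least one whitespace character on each side. The regex operator for "or" is "|". There is a shortcut for the whitespace character class (spaces, tabs, and newlines) in many regex implementations: "\s", which must be written "\\s" inside a Java string literal. Alternatives are tried from left to right, so the hyphen case goes first; otherwise the whitespace before a lone hyphen would match on its own and the hyphen would come back as a token.
"\\s+-\\s+|[.,:;()?!\"\\s]+"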
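A small sketch of such a delimiter in use with Scanner. Here the hyphen alternative is listed first so that the regex engine tries it before the plain whitespace match. The names HyphenWords and words are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class HyphenWords {
    // Splits text into words, keeping "foo-bar" whole but dropping
    // a hyphen that stands alone between whitespace.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter("\\s+-\\s+|[.,:;()?!\"\\s]+");
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints: [foo-bar, baz, qux]
        System.out.println(words("foo-bar - baz, qux."));
    }
}
```

A lone hyphen that directly follows punctuation (for example ", - ") would still come back as a token with this expression, so filtering out tokens that consist only of hyphens may be needed as well.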

If possible try to use the pre-defined classes... makes the regex much easier to read. See java.util.regex.Pattern for options.
Maybe this is what you are looking for:
string.split("\\s+(\\W*\\s)?")
Reads: Match 1 or more whitespace chars optionally followed by zero or more non-word characters and a whitespace character.

This is not very simple. One thing to try would be {current-delimiter-chars}{zero-or-more-hyphens}{zero-or-more-current-delimiter-chars-or-hyphen}.
It might be easier to just ignore words returned by the scanner that consist entirely of hyphens.

Scanner scanner = new Scanner("one two2 - (three) four-five - ,....|");
scanner.useDelimiter("(\\B+-\\B+|[.,:;()?!\" \t|])+");
while (scanner.hasNext()) {
    System.out.println(scanner.next("\\w+(-\\w+)*"));
}
NB
the next(String) method asserts that you get only words since the original useDelimiter() method misses "|"
NB
you have used the regular expression "\r\n|\n" as line terminator. The JavaDocs for java.util.regex.Pattern shows other possible line terminators, so a more complete check would use the expression "\r\n|[\r\n\u2028\u2029\u0085]"

This should be simple enough: [^\\w-]\\W*|-\\W+
But of course if it's prose, and you want to exclude underscores:
[^\\p{Alnum}-]\\P{Alnum}*|-\\P{Alnum}+
or if you don't expect numerics:
[^\\p{Alpha}-]\\P{Alpha}*|-\\P{Alpha}+
EDIT: These are the easier forms. Keep in mind that the complete solution, which would also handle dashes at the beginning and end of lines, would follow this pattern: (?:^|[^\\w-])\\W*|-(?:\\W+|$)

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.
