Java regex: extract a line? - java

Given 3 lines, how can I extract the 2nd line using a regular expression?
line1
line2
line3
I used
pattern = Pattern.compile("line1.*(.*?).*line3");
But nothing appears

You can use Pattern.DOTALL flag like this:
String str = "line1\nline2\nline3";
Pattern pt = Pattern.compile("line1\n(.+?)\nline3", Pattern.DOTALL);
Matcher m = pt.matcher(str);
while (m.find())
System.out.printf("Matched - [%s]%n", m.group(1)); // outputs [line2]

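This won't work: by default . does not match line terminators, so the pattern can never get from line1 to line3. Even with Pattern.DOTALL, your first .* is greedy and takes everything up to line3, so the reluctant group (.*?) and the second .* are left with nothing to match.
Try to spell out the line breaks (\n) after line1 and before line3, or use the ^ and $ anchors together with Pattern.MULTILINE.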
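The behaviour described above can be checked with a small sketch. The class name GreedyDemo and the helper firstGroup are made up for illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyDemo {
    // Returns group 1 of the first match, or null when nothing matches.
    static String firstGroup(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String str = "line1\nline2\nline3";
        // No DOTALL: '.' stops at '\n', so the pattern cannot reach line3.
        System.out.println(firstGroup("line1.*(.*?).*line3", 0, str)); // null
        // DOTALL: there is a match, but the greedy first .* takes line2 and group 1 is empty.
        System.out.println("[" + firstGroup("line1.*(.*?).*line3", Pattern.DOTALL, str) + "]"); // []
        // Explicit line breaks around the group capture the middle line.
        System.out.println(firstGroup("line1\\n(.*)\\nline3", 0, str)); // line2
    }
}
```

The last pattern needs no flags at all, because the only newlines it has to cross are written out explicitly.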

Try pattern = Pattern.compile("line1\\s*(.*?)\\s*line3", Pattern.DOTALL);

You can extract everything between two non-empty lines:
(?<=.+\n).+(?=\n.+)

Related

RegEx Expression not matching

I have the following text
CHAPTER 1
Introduction
CHAPTER OVERVIEW
for which I created and tested (on http://regexr.com/) the following regex:
(CHAPTER\s{1}\d\n)
However, when I use the following code in Java, it fails:
String text = stripper.getText(document);//The text above
Pattern p = Pattern.compile("(CHAPTER\\s{1}\\d\\n)");
Matcher m = p.matcher(text);
if (m.find()) {
//do action
}
m.find() always returns false.
Your document may use DOS/Windows line endings (\r\n), where a carriage return \r precedes the \n. You can use either of these patterns:
Pattern p = Pattern.compile("CHAPTER\\s+\\d+\\R");
\R (requires Java 8) matches any single line break sequence, including \r\n, \n and \r. Alternatively, just use:
Pattern p = Pattern.compile("CHAPTER\\s+\\d+\\s");
since \s also matches any whitespace including newline characters.
Another alternative is to use the MULTILINE flag with the anchor $:
Pattern p = Pattern.compile("(?m)CHAPTER\\s+\\d+$");
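A minimal sketch comparing the original pattern with the three alternatives above. The sample text is an assumption (the real output of stripper.getText(document) is not shown in the question), and LineEndingDemo and found are illustrative names.

```java
import java.util.regex.Pattern;

final class LineEndingDemo {
    // True when the regex matches somewhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // Assumed input: the same three lines, but with Windows line endings.
        String text = "CHAPTER 1\r\nIntroduction\r\nCHAPTER OVERVIEW";
        System.out.println(found("CHAPTER\\s{1}\\d\\n", text));  // false: '\r' sits between the digit and '\n'
        System.out.println(found("CHAPTER\\s+\\d+\\R", text));   // true: \R accepts "\r\n"
        System.out.println(found("CHAPTER\\s+\\d+\\s", text));   // true: \s accepts '\r'
        System.out.println(found("(?m)CHAPTER\\s+\\d+$", text)); // true: $ matches before the line terminator
    }
}
```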
Your problem is in your source text. I think you forgot about the newlines, because this:
String text = "CHAPTER 1\n" +
"Introduction\n" +
"CHAPTER OVERVIEW";
Pattern p = Pattern.compile("(CHAPTER\\s{1}\\d\\n)");
Matcher m = p.matcher(text);
System.out.println(m.find());
prints true. I copied the string body from your question, and IntelliJ inserted the newlines. Try to debug what you really get from stripper.getText(document).
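One way to see what the extracted text really contains is to make the invisible characters visible before printing it. This is only a sketch: EscapeDemo and visible are hypothetical names, and the sample string stands in for the real extractor output.

```java
final class EscapeDemo {
    // Replaces carriage returns, newlines and tabs with their escaped, printable form.
    static String visible(String s) {
        return s.replace("\r", "\\r").replace("\n", "\\n").replace("\t", "\\t");
    }

    public static void main(String[] args) {
        // Stand-in for the text returned by the extractor.
        String text = "CHAPTER 1\r\nIntroduction";
        System.out.println(visible(text)); // CHAPTER 1\r\nIntroduction
    }
}
```

If a \r shows up right after the chapter number, a pattern that expects the digit to be followed directly by \n cannot match.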
You can pass a flag such as Pattern.MULTILINE as the second parameter of Pattern.compile.

Regex to match the beginning and the end of a string in Java

I want to extract a certain kind of string using regex in Java. I currently have this pattern:
pattern = "^\\a.+\\sed$\n";
It is supposed to match a string that starts with "a" and ends with "sed". This is not working. Did I miss something?
I removed the \n at the end of the pattern and replaced it with a "$":
It still doesn't get a match. The regex looks legit to me.
What I want to extract is the "a sed" from the temp string.
String temp = "afsgdhgd gfgshfdgadh a sed afdsgdhgdsfgdfagdfhh";
pattern = "(?s)^a.*sed$";
pr = Pattern.compile(pattern);
math = pr.matcher(temp);
UPDATE
You want to match a sed, so you can use a\\s+sed if there is only whitespace between a and sed:
String s = "afsgdhgd gfgshfdgadh a sed afdsgdhgdsfgdfagdfhh";
Pattern pattern = Pattern.compile("a\\s+sed");
Matcher matcher = pattern.matcher(s);
while (matcher.find()){
System.out.println(matcher.group(0));
}
Now, if there can be anything between a and sed, use a tempered greedy token:
Pattern pattern = Pattern.compile("(?s)a(?:(?!a|sed).)*sed");
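The tempered greedy token is the (?:(?!a|sed).)* part: it consumes one character at a time, but only while neither a nor sed starts at the current position.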
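To see what the tempered greedy token buys you, here is a small sketch that runs a greedy, a reluctant and a tempered pattern against the same input. TemperedTokenDemo and firstMatch are illustrative names, not part of the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TemperedTokenDemo {
    // Returns the text of the first match, or null when there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String s = "afsgdhgd gfgshfdgadh a sed afdsgdhgdsfgdfagdfhh";
        // Greedy and reluctant both start at the very first 'a' of the input.
        System.out.println(firstMatch("a.*sed", s));  // afsgdhgd gfgshfdgadh a sed
        System.out.println(firstMatch("a.*?sed", s)); // afsgdhgd gfgshfdgadh a sed
        // The tempered token refuses to step over another 'a', so only the last 'a' before "sed" works.
        System.out.println(firstMatch("(?s)a(?:(?!a|sed).)*sed", s)); // a sed
    }
}
```

A reluctant quantifier alone does not help here, because the regex engine still prefers the leftmost starting position.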
ORIGINAL ANSWER
The main problem with your regex is the \n at the end. $ is the end of the string, and you try to match one more character after the end of the string, which is impossible. Also, \\s matches a whitespace symbol, but you need a literal s.
You need to remove \\s and \n and make . match a newline. It is also advisable to use the * quantifier to allow 0 symbols in between:
pattern = "(?s)^a.*sed$";
The regex matches:
^ - start of string
a - a literal a
.* - 0 or more any characters (since (?s) modifier makes a . match any character including a newline)
sed - a literal letter sequence sed
$ - end of string
Your temp string cannot match the pattern (?s)^a.*sed$, because this pattern says that your temp string must begin with the character a and end with the sequence sed, which is not the case. Your string has trailing characters after the "sed" sequence.
If you only want to extract that a...sed portion of the whole string, try using the unanchored pattern "a.*sed" and use the find() method of the Matcher class:
Pattern pattern = Pattern.compile("a.*sed");
Matcher m = pattern.matcher(temp);
if (m.find())
{
System.out.println("Found string "+m.group());
System.out.println("From "+m.start()+" to "+m.end());
}
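The difference between requiring the whole input to match and searching inside it can be sketched as follows. AnchorDemo and its two helpers are made-up names for illustration; matches() and find() are the standard Matcher methods.

```java
import java.util.regex.Pattern;

final class AnchorDemo {
    private static final Pattern A_TO_SED = Pattern.compile("a.*sed");

    // matches() succeeds only if the pattern covers the entire input.
    static boolean wholeInputMatches(String input) {
        return A_TO_SED.matcher(input).matches();
    }

    // find() succeeds if the pattern occurs anywhere in the input.
    static boolean containsMatch(String input) {
        return A_TO_SED.matcher(input).find();
    }

    public static void main(String[] args) {
        String temp = "afsgdhgd gfgshfdgadh a sed afdsgdhgdsfgdfagdfhh";
        System.out.println(wholeInputMatches(temp));    // false: there is text after "sed"
        System.out.println(containsMatch(temp));        // true
        System.out.println(wholeInputMatches("a sed")); // true
    }
}
```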

How to match on single line for regex?

I have a regex to match a line and delete it, along with everything below it (while keeping everything above it).
Two-part question:
1) Why won't this pattern match the given String text below?
2) How can I be sure to just match on a single line and not multiple lines?
- The pattern has to be found on the same single line.
String text = "Keep this.\n\n\nPlease match junkhere this t-h-i-s is missing.\n"
+ "Everything should be deleted here but don't match this on this line" + "\n\n";
Pattern p = Pattern.compile("^(Please(\\s)(match)(\\s)(.*?)\\sthis\\s(.*))$", Pattern.DOTALL );
Matcher m = p.matcher(text);
if (m.find()) {
text = (m.replaceAll("")).replaceAll("[\n]+$", ""); // remove everything at and below "Please match ... this"
System.out.println(text);
}
Expected Output:
Keep this.
You are complicating your life...
First, as I said in the comment, use Pattern.MULTILINE.
Then, to truncate the string from the beginning of the match, use .substring():
final Pattern p = Pattern.compile("^Please\\s+match\\b.*?this",
Pattern.MULTILINE);
final Matcher m = p.matcher(input);
return m.find() ? input.substring(0, m.start()) : input;
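Remove DOTALL to make sure the match stays on a single line, replace \s with a literal space, and add MULTILINE so that ^ and $ anchor at line boundaries instead of only at the start and end of the whole input:
Pattern p = Pattern.compile("^(Please( )(match)( )(.*?) this (.*))$", Pattern.MULTILINE);
DOTALL makes a dot match newlines as well.
\s can match any whitespace, including newlines.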
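Putting the two answers together, a minimal sketch of the truncation could look like this. TruncateDemo and truncateAtMarker are hypothetical names; the pattern is the one from the substring() answer, and the trailing-newline cleanup mirrors the replaceAll call in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TruncateDemo {
    // MULTILINE lets ^ match at the start of any line; without DOTALL, .*? cannot leave that line.
    private static final Pattern MARKER =
            Pattern.compile("^Please\\s+match\\b.*?this", Pattern.MULTILINE);

    // Keeps everything above the first marker line and drops trailing newlines.
    static String truncateAtMarker(String input) {
        Matcher m = MARKER.matcher(input);
        String kept = m.find() ? input.substring(0, m.start()) : input;
        return kept.replaceAll("\\n+$", "");
    }

    public static void main(String[] args) {
        String text = "Keep this.\n\n\nPlease match junkhere this t-h-i-s is missing.\n"
                + "Everything should be deleted here but don't match this on this line" + "\n\n";
        System.out.println(truncateAtMarker(text)); // Keep this.
    }
}
```

Cutting with substring() at m.start() avoids having to write a pattern that also matches everything below the marker line.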

Remove occurrences of a given character sequence at the beginning of a string using Java Regex

I have a string that begins with one or more occurrences of the sequence "Re:". This "Re:" can appear in any variation, for example Re<any number of spaces>:, re:, re<any number of spaces>:, RE:, RE<any number of spaces>:, etc.
Sample sequence of string : Re: Re : Re : re : RE: This is a Re: sample string.
I want to define a java regular expression that will identify and strip off all occurrences of Re:, but only the ones at the beginning of the string and not the ones occurring within the string.
So the output should look like This is a Re: sample string.
Here is what I have tried:
String REGEX = "^(Re*\\p{Z}*:?|re*\\p{Z}*:?|\\p{Z}Re*\\p{Z}*:?)";
String INPUT = title;
String REPLACE = "";
Pattern p = Pattern.compile(REGEX);
Matcher m = p.matcher(INPUT);
while(m.find()){
m.appendReplacement(sb,REPLACE);
}
m.appendTail(sb);
I am using \p{Z} to match whitespace (I found this somewhere in this forum, as Java regex does not identify \s).
The problem I am facing with this code is that the search stops at the first match and exits the while loop.
Try something like this replace statement:
yourString = yourString.replaceAll("(?i)^(\\s*re\\s*:\\s*)+", "");
Explanation of the regex:
(?i) make it case insensitive
^ anchor to start of string
( start a group (this is the "re:")
\\s* any amount of optional whitespace
re "re"
\\s* optional whitespace
: ":"
\\s* optional whitespace
) end the group (the "re:" string)
+ one or more times
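The regex explained above can be tried on the sample title from the question. This is a sketch only: ReplyPrefixDemo and stripReplyPrefixes are illustrative names, and the precompiled pattern uses the CASE_INSENSITIVE flag instead of the inline (?i), which is equivalent for this pattern.

```java
import java.util.regex.Pattern;

final class ReplyPrefixDemo {
    // One or more "re:" prefixes (any case, optional whitespace) anchored at the start of the string.
    private static final Pattern REPLY_PREFIX =
            Pattern.compile("^(\\s*re\\s*:\\s*)+", Pattern.CASE_INSENSITIVE);

    static String stripReplyPrefixes(String title) {
        // The pattern is anchored with ^, so at most one match exists; replaceFirst is enough.
        return REPLY_PREFIX.matcher(title).replaceFirst("");
    }

    public static void main(String[] args) {
        String title = "Re: Re : Re : re : RE: This is a Re: sample string.";
        System.out.println(stripReplyPrefixes(title)); // This is a Re: sample string.
    }
}
```

Because the whole run of prefixes is consumed by a single match, no loop over find() is needed.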
In your regex:
String regex = "^(Re*\\p{Z}*:?|re*\\p{Z}*:?|\\p{Z}Re*\\p{Z}*:?)"
the * applies only to the preceding e and the colon is optional, so it matches strings like:
\p{Z}Reee\p{Z}: or
R\p{Z}\p{Z}\p{Z}
which makes no sense for what you are trying to do.
You'd be better off using a regex like the following:
yourString.replaceAll("(?i)^(\\s*re\\s*:\\s*)+", "");
or, to make @Doorknob happy, here's another way to achieve this, using a Matcher:
Pattern p = Pattern.compile("(?i)^(\\s*re\\s*:\\s*)+");
Matcher m = p.matcher(yourString);
if (m.find())
yourString = m.replaceAll("");
(which, as the documentation says, is exactly the same thing as yourString.replaceAll())
(I had the same regex as @Doorknob, but thanks to @jlordo for the replaceAll and @Doorknob for thinking about the (?i) case insensitivity part ;-) )

RegEx - problem with multiline input

I have a String with multiline content and want to select a multiline region, preferably using a regular expression (just because I'm trying to understand Java RegEx at the moment).
Consider the input like:
Line 1
abc START def
Line 2
Line 3
gh END jklm
Line 4
Assuming START and END are unique and the start/end markers for the region, I'd like to create a pattern/matcher to get the result:
def
Line 2
Line 3
gh
My current attempt is
Pattern p = Pattern.compile("START(.*)END");
Matcher m = p.matcher(input);
if (m.find())
System.out.println(m.group(1));
But the result is
gh
So m.start() seems to point at the beginning of the line that contains the 'end marker'. I tried to add Pattern.MULTILINE to the compile call but that (alone) didn't change anything.
Where is my mistake?
You want Pattern.DOTALL, so . matches newline characters. MULTILINE addresses a different issue, the ^ and $ anchors.
Pattern p = Pattern.compile("START(.*)END", Pattern.DOTALL);
You want to set Pattern.DOTALL (so you can match end of line characters with your . wildcard), see this test:
@Test
public void testMultilineRegex() throws Exception {
final String input = "Line 1\nabc START def\nLine 2\nLine 3\ngh END jklm\nLine 4";
final String expected = " def\nLine 2\nLine 3\ngh ";
final Pattern p = Pattern.compile("START(.*)END", Pattern.DOTALL);
final Matcher m = p.matcher(input);
if (m.find()) {
Assert.assertEquals(expected, m.group(1));
} else {
Assert.fail("pattern not found");
}
}
By default, the regex metacharacter . does not match a newline. You can try the regex:
START([\w\W]*)END
which uses [\w\W] in place of the dot.
[\w\W] is a character class that matches either a word character or a non-word character, so it effectively matches any character.
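Both suggestions, the DOTALL flag and the [\w\W] character class, can be compared side by side. The following is a sketch with made-up names (RegionDemo, withDotAll, withCharClass), using the input from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegionDemo {
    // Variant 1: DOTALL lets '.' cross line terminators.
    static String withDotAll(String input) {
        Matcher m = Pattern.compile("START(.*)END", Pattern.DOTALL).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // Variant 2: [\w\W] matches any character without needing a flag.
    static String withCharClass(String input) {
        Matcher m = Pattern.compile("START([\\w\\W]*)END").matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "Line 1\nabc START def\nLine 2\nLine 3\ngh END jklm\nLine 4";
        System.out.println(withDotAll(input).equals(withCharClass(input))); // true
        System.out.println("[" + withDotAll(input) + "]"); // prints the region, including the spaces around it
    }
}
```

Both variants are greedy. If START and END could occur more than once, a reluctant group such as (.*?) would stop at the first END after START instead of the last one.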
