I have the following text
CHAPTER 1
Introduction
CHAPTER OVERVIEW
for which I created and tested (at http://regexr.com/) the following regex:
(CHAPTER\s{1}\d\n)
However, when I use the following code in Java, it fails:
String text = stripper.getText(document);//The text above
Pattern p = Pattern.compile("(CHAPTER\\s{1}\\d\\n)");
Matcher m = p.matcher(text);
if (m.find()) {
//do action
}
m.find() always returns false.
Your document may use DOS line endings (\r\n), in which case a \r sits between the digit and the \n. You can use either of these patterns:
Pattern p = Pattern.compile("CHAPTER\\s+\\d+\\R");
\R (requires Java 8) matches any line break sequence, including \r\n, \n, and \r, after your digits. Or just use:
Pattern p = Pattern.compile("CHAPTER\\s+\\d+\\s");
since \s also matches any whitespace including newline characters.
Another alternative is to use MULTILINE flag with anchor $:
Pattern p = Pattern.compile("(?m)CHAPTER\\s+\\d+$");
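To see how these patterns behave when the extracted text uses \r\n line endings, here is a minimal sketch. The class name LineBreakDemo, the helper found(), and the sample string are assumptions made for illustration; only the patterns come from the posts above.

```java
import java.util.regex.Pattern;

class LineBreakDemo {
    // Returns true if the regex is found anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // Hypothetical sample: the text from the question with \r\n line endings.
        String text = "CHAPTER 1\r\nIntroduction\r\nCHAPTER OVERVIEW";

        // Original pattern: the \r between the digit and \n blocks the match.
        System.out.println(found("(CHAPTER\\s{1}\\d\\n)", text));   // false
        // \R consumes the whole \r\n sequence.
        System.out.println(found("CHAPTER\\s+\\d+\\R", text));      // true
        // \s matches the \r.
        System.out.println(found("CHAPTER\\s+\\d+\\s", text));      // true
        // In MULTILINE mode, $ matches just before the line terminator.
        System.out.println(found("(?m)CHAPTER\\s+\\d+$", text));    // true
    }
}
```

If the original pattern also fails on your real input, printing the text with the line breaks made visible (for example text.replace("\r", "\\r")) is a quick way to confirm which terminator the extractor produces.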
Your problem is in your source text. I think you are overlooking how the line breaks actually appear in it, because this:
String text = "CHAPTER 1\n" +
"Introduction\n" +
"CHAPTER OVERVIEW";
Pattern p = Pattern.compile("(CHAPTER\\s{1}\\d\\n)");
Matcher m = p.matcher(text);
System.out.println(m.find());
prints true. I copied the string body from your question and IntelliJ added the line breaks. Try debugging what stripper.getText(document) really returns.
You can pass a flag such as Pattern.MULTILINE as the second parameter to compile().
I am running into an issue where my code is unable to find regex occurrences. Code:
String content = "This\\ is\\ an\\ example.=This is an example\nThis\\ is\\ second\\:=This is second";
String regex = "\"^.*(?=\\=)\"gm";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(content);
List<String> mKeys = new ArrayList<>();
while (m.find()) {
mKeys.add(m.group());
}
mKeys turns out to be empty. I have already validated my regex here https://regex101.com/r/YResRc/3. I am expecting the list to contain two keys from the content.
Your content contains no " quotes, and no text gm, so why would you expect that regex to match?
FYI: Syntaxes like "foo"gm or /foo/gm are something other languages do for regex literals. Java doesn't do that.
The g flag is implied by the fact that you're using a find() loop. m is the MULTILINE flag, which affects ^ and $; you can specify it with the inline (?m) flag or by passing a second parameter to compile(), i.e. one of these ways:
Pattern p = Pattern.compile("foo", Pattern.MULTILINE);
Pattern p = Pattern.compile("(?m)foo");
Your regex should simply be:
(?m)^.*(?==)
which means: Match everything from the beginning of a line up to the last = sign on the line.
Test
String content = "This is an example.=This is an example\nThis is second:=This is second";
String regex = "(?m)^.*(?==)";
Matcher m = Pattern.compile(regex).matcher(content);
List<String> mKeys = new ArrayList<>();
while (m.find()) {
mKeys.add(m.group());
}
System.out.println(mKeys);
Output
[This is an example., This is second:]
I have a regex to match a line and delete it, along with everything below it (keeping everything above it).
Two Part Ask:
1) Why won't this pattern match the given String text below?
2) How can I be sure to just match on a single line and not multiple lines?
- The pattern has to be found on the same single line.
String text = "Keep this.\n\n\nPlease match junkhere this t-h-i-s is missing.\n"
+ "Everything should be deleted here but don't match this on this line" + "\n\n";
Pattern p = Pattern.compile("^(Please(\\s)(match)(\\s)(.*?)\\sthis\\s(.*))$", Pattern.DOTALL );
Matcher m = p.matcher(text);
if (m.find()) {
text = (m.replaceAll("")).replaceAll("[\n]+$", ""); // remove the "Please match ... this" line and everything below it
System.out.println(text);
}
Expected Output:
Keep this.
You are complicating your life...
First, as I said in the comment, use Pattern.MULTILINE.
Then, to truncate the string from the beginning of the match, use .substring():
final Pattern p = Pattern.compile("^Please\\s+match\\b.*?this",
Pattern.MULTILINE);
final Matcher m = p.matcher(input);
return m.find() ? input.substring(0, m.start()) : input;
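The snippet above can be wrapped into a runnable sketch as follows. The class name TruncateDemo and the method name truncate() are hypothetical; the pattern and the sample text are taken from the posts above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TruncateDemo {
    // MULTILINE lets ^ match at the start of every line, not only at the start of the input.
    private static final Pattern MARKER =
            Pattern.compile("^Please\\s+match\\b.*?this", Pattern.MULTILINE);

    // Keeps everything before the first line that matches the marker; returns the input unchanged otherwise.
    static String truncate(String input) {
        Matcher m = MARKER.matcher(input);
        return m.find() ? input.substring(0, m.start()) : input;
    }

    public static void main(String[] args) {
        String text = "Keep this.\n\n\nPlease match junkhere this t-h-i-s is missing.\n"
                + "Everything should be deleted here but don't match this on this line\n\n";
        // Trailing newlines are stripped separately, as in the question.
        System.out.println(truncate(text).replaceAll("\\n+$", "")); // prints "Keep this."
    }
}
```

Because . does not match line breaks without DOTALL, the marker has to be found within a single line, which covers the second part of the question.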
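Remove DOTALL so that the match stays on a single line, and replace \s with a literal space. Use MULTILINE, since without it ^ only matches at the very start of the input:
Pattern p = Pattern.compile("^(Please( )(match)( )(.*?) this (.*))$", Pattern.MULTILINE);
DOTALL makes a dot match line breaks as well.
\s matches any whitespace, including line breaks.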
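The two statements about DOTALL and \s can be checked with a short sketch. The class name DotAndWhitespaceDemo, the helper matches(), and the two-line sample are assumptions for illustration.

```java
import java.util.regex.Pattern;

class DotAndWhitespaceDemo {
    // Returns true if the regex, compiled with the given flags, matches the entire text.
    static boolean matches(String regex, int flags, String text) {
        return Pattern.compile(regex, flags).matcher(text).matches();
    }

    public static void main(String[] args) {
        String twoLines = "a\nb";
        System.out.println(matches("a.b", 0, twoLines));              // false: dot skips \n by default
        System.out.println(matches("a.b", Pattern.DOTALL, twoLines)); // true: DOTALL lets dot match \n
        System.out.println(matches("a\\sb", 0, twoLines));            // true: \s matches \n
        System.out.println(matches("a b", 0, twoLines));              // false: a literal space does not
    }
}
```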
I have a string that begins with one or more occurrences of the sequence "Re:". This "Re:" can be of any combinations, for ex. Re<any number of spaces>:, re:, re<any number of spaces>:, RE:, RE<any number of spaces>:, etc.
Sample sequence of string : Re: Re : Re : re : RE: This is a Re: sample string.
I want to define a java regular expression that will identify and strip off all occurrences of Re:, but only the ones at the beginning of the string and not the ones occurring within the string.
So the output should look like This is a Re: sample string.
Here is what I have tried:
String REGEX = "^(Re*\\p{Z}*:?|re*\\p{Z}*:?|\\p{Z}Re*\\p{Z}*:?)";
String INPUT = title;
String REPLACE = "";
Pattern p = Pattern.compile(REGEX);
Matcher m = p.matcher(INPUT);
while(m.find()){
m.appendReplacement(sb,REPLACE);
}
m.appendTail(sb);
I am using \p{Z} to match whitespace (I found this somewhere in this forum, as Java regex does not identify \s).
The problem I am facing with this code is that the search stops at the first match and exits the while loop.
Try something like this replace statement:
yourString = yourString.replaceAll("(?i)^(\\s*re\\s*:\\s*)+", "");
Explanation of the regex:
(?i) make it case insensitive
^ anchor to start of string
( start a group (this is the "re:")
\\s* any amount of optional whitespace
re "re"
\\s* optional whitespace
: ":"
\\s* optional whitespace
) end the group (the "re:" string)
+ one or more times
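Putting the replace statement and its explanation together, a minimal runnable sketch might look like this. The class name ReplyPrefixDemo and the method name stripReplyPrefixes() are hypothetical; the regex and the sample title come from the posts above.

```java
class ReplyPrefixDemo {
    // Removes every leading "Re:" variant (any case, optional spaces around the colon).
    // ^ anchors the repeated group to the start, so a "Re:" inside the title is left alone.
    static String stripReplyPrefixes(String title) {
        return title.replaceAll("(?i)^(\\s*re\\s*:\\s*)+", "");
    }

    public static void main(String[] args) {
        String title = "Re: Re : Re : re : RE: This is a Re: sample string.";
        System.out.println(stripReplyPrefixes(title)); // prints "This is a Re: sample string."
    }
}
```

The whole run of prefixes is consumed by a single match because the group is repeated with +, so no find() loop is needed.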
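In your regex:
String regex = "^(Re*\\p{Z}*:?|re*\\p{Z}*:?|\\p{Z}Re*\\p{Z}*:?)";
the * quantifier applies only to the e, and the colon is optional, so it matches strings like:
a separator followed by Reee, more separators, and a colon, or
a lone R followed by any number of separators and no colon at all,
which makes no sense for what you are trying to do.
You'd better use a regex like the following: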
yourString.replaceAll("(?i)^(\\s*re\\s*:\\s*)+", "");
or to make @Doorknob happy, here's another way to achieve this, using a Matcher:
Pattern p = Pattern.compile("(?i)^(\\s*re\\s*:\\s*)+");
Matcher m = p.matcher(yourString);
if (m.find())
yourString = m.replaceAll("");
(which, as the documentation says, yields exactly the same result as yourString.replaceAll())
(I had the same regex as @Doorknob, but thanks to @jlordo for the replaceAll and @Doorknob for thinking about the (?i) case insensitivity part ;-) )
Given 3 lines, how can I extract the 2nd line using a regular expression?
line1
line2
line3
I used
pattern = Pattern.compile("line1.*(.*?).*line3");
But nothing appears
You can use Pattern.DOTALL flag like this:
String str = "line1\nline2\nline3";
Pattern pt = Pattern.compile("line1\n(.+?)\nline3", Pattern.DOTALL);
Matcher m = pt.matcher(str);
while (m.find())
System.out.printf("Matched - [%s]%n", m.group(1)); // outputs [line2]
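This won't work. Without DOTALL, . does not match line breaks, so the pattern can never get from line1 to line3. Even with DOTALL, your first .* consumes everything up to line3, so the reluctant group is left empty, as is the second .*.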
Try to specify the line breaks (^ and $) after line1 / before line3.
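One way to read this suggestion is to anchor each line with ^ and $ in MULTILINE mode and match the line breaks between them explicitly. The sketch below assumes Java 8 or later for \R; the class name SecondLineDemo and the method name secondLine() are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SecondLineDemo {
    // ^ and $ delimit each line; \R consumes the line break between them.
    private static final Pattern P =
            Pattern.compile("^line1$\\R^(.*)$\\R^line3$", Pattern.MULTILINE);

    // Returns the line between line1 and line3, or null if the pattern is not found.
    static String secondLine(String text) {
        Matcher m = P.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(secondLine("line1\nline2\nline3")); // prints "line2"
    }
}
```

Since DOTALL is not set, (.*) cannot run past the end of the second line.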
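Try pattern = Pattern.compile("line1\\s+(.*?)\\s+line3", Pattern.DOTALL); the line breaks around the group have to be matched explicitly, otherwise the reluctant group matches the empty string.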
You can extract everything between two non-empty lines:
(?<=.+\n).+(?=\n.+)
I have a String with multiline content and want to select a multiline region, preferably using a regular expression (just because I'm trying to understand Java RegEx at the moment).
Consider the input like:
Line 1
abc START def
Line 2
Line 3
gh END jklm
Line 4
Assuming START and END are unique and the start/end markers for the region, I'd like to create a pattern/matcher to get the result:
def
Line 2
Line 3
gh
My current attempt is
Pattern p = Pattern.compile("START(.*)END");
Matcher m = p.matcher(input);
if (m.find())
System.out.println(m.group(1));
But the result is
gh
So m.start() seems to point at the beginning of the line that contains the 'end marker'. I tried to add Pattern.MULTILINE to the compile call but that (alone) didn't change anything.
Where is my mistake?
You want Pattern.DOTALL, so . matches newline characters. MULTILINE addresses a different issue, the ^ and $ anchors.
Pattern p = Pattern.compile("START(.*)END", Pattern.DOTALL);
You want to set Pattern.DOTALL (so you can match end of line characters with your . wildcard), see this test:
@Test
public void testMultilineRegex() throws Exception {
final String input = "Line 1\nabc START def\nLine 2\nLine 3\ngh END jklm\nLine 4";
final String expected = " def\nLine 2\nLine 3\ngh ";
final Pattern p = Pattern.compile("START(.*)END", Pattern.DOTALL);
final Matcher m = p.matcher(input);
if (m.find()) {
Assert.assertEquals(expected, m.group(1));
} else {
Assert.fail("pattern not found");
}
}
By default, the regex metachar . does not match a newline. You can try the regex:
START([\w\W]*)END
which uses [\w\W] in place of ..
[\w\W] is a char class to match a word-char and a non-word-char, so effectively matches everything.
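As a rough comparison, the sketch below runs the inline (?s) flag (the embedded form of Pattern.DOTALL) and the [\w\W] variant against the input from the question. The class name RegionDemo and the helper region() are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionDemo {
    // Returns capture group 1 of the first match, or null if nothing matches.
    static String region(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "Line 1\nabc START def\nLine 2\nLine 3\ngh END jklm\nLine 4";

        // (?s) is the inline equivalent of passing Pattern.DOTALL to compile().
        String viaDotAll = region("(?s)START(.*)END", input);
        // [\w\W] matches any character, line breaks included, without any flag.
        String viaCharClass = region("START([\\w\\W]*)END", input);

        System.out.println(viaDotAll);
        System.out.println(viaDotAll.equals(viaCharClass)); // true: both capture the same region
    }
}
```

Both variants are greedy, so if END could occur more than once the capture would extend to the last occurrence; the question states the markers are unique, so that does not matter here.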