Getting dialogue snippets from text using regular expressions - java

I'm trying to extract snippets of dialogue from a book text. For example, if I have the string
"What's the matter with the flag?" inquired Captain MacWhirr. "Seems all right to me."
Then I want to extract "What's the matter with the flag?" and "Seems all right to me.".
I found a regular expression to use here, which is "[^"\\]*(\\.[^"\\]*)*". This works great in Eclipse when I'm doing a Ctrl+F find regex on my book .txt file, but when I run the following code:
String regex = "\"[^\"\\\\]*(\\\\.[^\"\\\\]*)*\"";
String bookText = "\"What's the matter with the flag?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(bookText);
if(m.find())
System.out.println(m.group(1));
The only thing that prints is null. So am I not converting the regex into a Java string properly? Do I need to take into account the fact that Java Strings have a \" for the double quotes?

In natural language text, it's unlikely that " is escaped by a preceding backslash, so you should be able to use just the pattern "([^"]*)".
As a Java string literal, this is "\"([^\"]*)\"".
Here it is in Java:
String regex = "\"([^\"]*)\"";
String bookText = "\"What's the matter with the flag?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(bookText);
while (m.find()) {
System.out.println(m.group(1));
}
The above prints:
What's the matter with the flag?
Seems all right to me.
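If the snippets are needed as data rather than printed output, the same loop can collect them into a list. This is only a sketch: the class and method names are made up for illustration, and it assumes the book uses straight double quotes (curly quotes would need a different character class).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are not from the original answer.
class DialogueExtractor {
    // A double quote, any run of non-quote characters (captured), a double quote.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    // Returns every quoted snippet, without the quotes, in order of appearance.
    static List<String> extractQuotes(String text) {
        List<String> quotes = new ArrayList<>();
        Matcher m = QUOTED.matcher(text);
        while (m.find()) {
            quotes.add(m.group(1));
        }
        return quotes;
    }

    public static void main(String[] args) {
        String bookText = "\"What's the matter with the flag?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
        for (String quote : extractQuotes(bookText)) {
            System.out.println(quote);
        }
    }
}
```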
On escape sequences
Given this declaration:
String s = "\"";
System.out.println(s.length()); // prints "1"
The string s only has one character, ". The \ is part of an escape sequence that exists only at the Java source code level; the string itself contains no backslash.
See also
JLS 3.10.6 Escape Sequences for Character and String Literals
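One way to check what the regex engine actually receives is to print the string. The sketch below (the class name is just for illustration) prints the literal from the question; the output should be the same text that was typed into the Eclipse find dialog.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only the string literal comes from the question.
class EscapeLayers {
    // In source: \" is one quote character, \\\\ is two backslash characters.
    static final String REGEX = "\"[^\"\\\\]*(\\\\.[^\"\\\\]*)*\"";

    public static void main(String[] args) {
        // Prints: "[^"\\]*(\\.[^"\\]*)*"
        System.out.println(REGEX);
        // Pattern.pattern() returns the same text that was compiled.
        System.out.println(Pattern.compile(REGEX).pattern());
    }
}
```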
The problem with the original code
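There's actually nothing wrong with the pattern per se, but you're not capturing the right portion. Group 1 isn't capturing the quoted text; it only wraps the escaped-character part, which never matches in this text, so group(1) is null. Here's the pattern with the correct capturing group: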
String regex = "\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\"";
String bookText = "\"What's the matter?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(bookText);
while (m.find()) {
System.out.println(m.group(1));
}
For visual comparison, here's the original pattern, as a Java string literal:
String regex = "\"[^\"\\\\]*(\\\\.[^\"\\\\]*)*\""
                            ^^^^^^^^^^^^^^^^^
                            why capture this part?
And here's the modified pattern:
String regex = "\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\""
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                  we want to capture this part!
As mentioned before, though: this complicated pattern isn't necessary for natural language text, which isn't likely to contain escaped quotes.
See also
regular-expressions.info/Grouping and backreferences
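To see where the null came from, it can help to print group(0) and group(1) side by side for the original pattern. This is a small sketch with made-up class and method names: group(0) is the whole match including the quotes, while group(1) sits inside a (...)* that repeats zero times when the text has no backslashes, so it never participates in the match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern is the one from the question.
class GroupDemo {
    // The only capturing group wraps the escaped-character part inside (...)*.
    static final Pattern ORIGINAL = Pattern.compile("\"[^\"\\\\]*(\\\\.[^\"\\\\]*)*\"");

    // Returns {group(0), group(1)} for the first match, or null if nothing matches.
    static String[] firstMatchGroups(String text) {
        Matcher m = ORIGINAL.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        String[] groups = firstMatchGroups("\"Seems all right to me.\" he said.");
        System.out.println("group(0) = " + groups[0]); // the whole match, quotes included
        System.out.println("group(1) = " + groups[1]); // null: the group never participated
    }
}
```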

Related

Java regex extract capture group if it exists

I apparently don't understand Java's regex library or regex either for that matter.
for this string:
String text = "asdf 2013-05-12 asdf";
this regex explodes in my face:
String REGEX_FORMAT_1 = ".+?([0-9]{4}\\s?-\\s?[0-9]{2}\\s?-\\s?[0-9]{2}).+";
Pattern PATTERN_FORMAT_1 = Pattern.compile(REGEX_FORMAT_1);
Matcher matcher_1 = PATTERN_FORMAT_1.matcher(text);
if(matcher_1.matches()) {
String matchedGroup = matcher_1.group();
...
}
Semantically this makes sense to me but it seems I've totally misunderstood something. The regex works fine in some online regex editors like regex101 but not in others. Could someone please help me understand why I don't get the capture group containing 2013-05-12 ...
group() is equivalent to group(0) and returns the entire matched string. Use group(1) to pull out the first matched group.
String text = "asdf 2013-05-12 asdf";
String regex = ".+?([0-9]{4}\\s?-\\s?[0-9]{2}\\s?-\\s?[0-9]{2}).+";
Matcher matcher = Pattern.compile(regex).matcher(text);
if (matcher.matches()) {
String matchedGroup = matcher.group(1);
System.out.println(matchedGroup);
}
Output:
2013-05-12
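As an alternative sketch (not part of the answer above), find() can be used instead of matches(). Because find() searches for the pattern anywhere in the input, the leading .+? and trailing .+ are no longer needed, and group() then returns just the date. The class and method names here are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; only the date sub-pattern comes from the question.
class DateFinder {
    private static final Pattern DATE =
            Pattern.compile("[0-9]{4}\\s?-\\s?[0-9]{2}\\s?-\\s?[0-9]{2}");

    // Returns the first date-like substring, or null if there is none.
    static String findDate(String text) {
        Matcher m = DATE.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(findDate("asdf 2013-05-12 asdf")); // prints 2013-05-12
    }
}
```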

find substring using match regex

How can I use a regex to find a substring inside another string? Here are two strings:
String a= "?drug <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/possibleDiseaseTarget> ?disease .";
String b = "?drug <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/molecularWeightAverage> ?weight . ?drug <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/possibleDiseaseTarget> ?disease";
I want to match only
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/possibleDiseaseTarget>
Since this is not quite HTML, an XML/HTML parser won't help here, so you can try a regex. It seems that you want to find text of the form
?drug <someData> ?disease
To describe such text with a regex you need to escape ? (it is a regex special character, the "optional" quantifier meaning zero or one occurrence), so you need to place \ before it (which in a Java String literal is written as "\\").
Also, the part <someData> can be written as <[^>]+>, which means:
<,
one or more non > after it,
and finally >
So regex to match ?drug <someData> ?disease can be written as
"\\?drug <[^>]+> \\?disease"
But since we are interested only in the part <[^>]+> representing <someData>, we need the regex to group the content it finds. In short, if we surround part of a regex with parentheses, the string matched by that part is placed in a so-called group, and we can retrieve it from that group. The final regex can look like
"\\?drug (<[^>]+>) \\?disease"
         ^^^^^^^^^--- first group,
and can be used like
String a = "?drug <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/possibleDiseaseTarget> ?disease .";
String b = "?drug <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/molecularWeightAverage> ?weight . ?drug <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/possibleDiseaseTarget> ?disease";
Pattern p = Pattern.compile("\\?drug (<[^>]+>) \\?disease");
Matcher m = p.matcher(a);
while (m.find()) {
System.out.println(m.group(1));
}
System.out.println("-----------");
m = p.matcher(b);
while (m.find()) {
System.out.println(m.group(1));
}
which will produce as output
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/possibleDiseaseTarget>
-----------
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/possibleDiseaseTarget>
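If escaping each ? by hand feels error-prone, Pattern.quote can wrap the literal parts so that nothing in them is treated as a metacharacter. The following is a sketch of the same idea under that assumption; the class and method names are invented for illustration, and the test URIs in main are shortened placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper built around the regex from the answer above.
class AngleBracketFinder {
    // Pattern.quote wraps each literal in \Q...\E, so '?' loses its quantifier meaning.
    private static final Pattern TARGET = Pattern.compile(
            Pattern.quote("?drug ") + "(<[^>]+>)" + Pattern.quote(" ?disease"));

    // Returns the <...> part of every "?drug <...> ?disease" occurrence.
    static List<String> findTargets(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = TARGET.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Placeholder URIs, not the ones from the question.
        String b = "?drug <http://example.org/weight> ?weight . ?drug <http://example.org/target> ?disease";
        System.out.println(findTargets(b)); // prints [<http://example.org/target>]
    }
}
```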
There's no need to use a regex here if you only want to check whether the substring is present; just do this:
String substr = "<http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/possibleDiseaseTarget>";
System.out.println(b.contains(substr)); // prints true
System.out.println(a.contains(substr)); // prints true

Remove occurrences of a given character sequence at the beginning of a string using Java Regex

I have a string that begins with one or more occurrences of the sequence "Re:". This "Re:" can be of any combinations, for ex. Re<any number of spaces>:, re:, re<any number of spaces>:, RE:, RE<any number of spaces>:, etc.
Sample sequence of string : Re: Re : Re : re : RE: This is a Re: sample string.
I want to define a java regular expression that will identify and strip off all occurrences of Re:, but only the ones at the beginning of the string and not the ones occurring within the string.
So the output should look like This is a Re: sample string.
Here is what I have tried:
String REGEX = "^(Re*\\p{Z}*:?|re*\\p{Z}*:?|\\p{Z}Re*\\p{Z}*:?)";
String INPUT = title;
String REPLACE = "";
Pattern p = Pattern.compile(REGEX);
Matcher m = p.matcher(INPUT);
while(m.find()){
m.appendReplacement(sb,REPLACE);
}
m.appendTail(sb);
I am using \p{Z} to match whitespace (I found this somewhere in this forum, as Java regex does not identify \s).
The problem I am facing with this code is that the search stops at the first match, and escapes the while loop.
Try something like this replace statement:
yourString = yourString.replaceAll("(?i)^(\\s*re\\s*:\\s*)+", "");
Explanation of the regex:
(?i) make it case insensitive
^ anchor to start of string
( start a group (this is the "re:")
\\s* any amount of optional whitespace
re "re"
\\s* optional whitespace
: ":"
\\s* optional whitespace
) end the group (the "re:" string)
+ one or more times
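The pieces explained above can be wrapped into a small helper and checked against the sample from the question. This is a sketch with invented names. It uses replaceFirst rather than replaceAll, on the assumption that a pattern anchored with ^ (without MULTILINE) can match at most once, so the two behave the same here.

```java
import java.util.regex.Pattern;

// Hypothetical helper; only the regex comes from the answer above.
class ReplyPrefixStripper {
    // Case-insensitive, anchored at the start, one or more "re :" prefixes.
    private static final Pattern REPLY_PREFIX = Pattern.compile("(?i)^(\\s*re\\s*:\\s*)+");

    // Removes leading "Re:" prefixes and leaves any later "Re:" untouched.
    static String strip(String title) {
        return REPLY_PREFIX.matcher(title).replaceFirst("");
    }

    public static void main(String[] args) {
        String title = "Re: Re : Re : re : RE: This is a Re: sample string.";
        System.out.println(strip(title)); // prints: This is a Re: sample string.
    }
}
```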
In your regex:
String regex = "^(Re*\\p{Z}*:?|re*\\p{Z}*:?|\\p{Z}Re*\\p{Z}*:?)"
e* means "zero or more e" and :? makes the colon optional, so it matches strings like:
\p{Z}Reee\p{Z}: or
R\p{Z}\p{Z}\p{Z}
which makes no sense for what you are trying to do.
You'd better use a regex like the following:
yourString.replaceAll("(?i)^(\\s*re\\s*:\\s*)+", "");
or, to make @Doorknob happy, here's another way to achieve this, using a Matcher:
Pattern p = Pattern.compile("(?i)^(\\s*re\\s*:\\s*)+");
Matcher m = p.matcher(yourString);
if (m.find())
yourString = m.replaceAll("");
(which, as the documentation says, is exactly the same thing as yourString.replaceAll())
(I had the same regex as @Doorknob, but thanks to @jlordo for the replaceAll and @Doorknob for thinking about the (?i) case insensitivity part ;-) )

Replace string with part of the matching regex

I have a long string. I want to replace all the matches with part of the matching regex (group).
For example:
String = "This is a great day, is it not? If there is something, THIS IS it. <b>is</b>".
I want to replace all the words "is" by, let's say, "<h1>is</h1>". The case should remain the same as original. So the final string I want is:
This <h1>is</h1> a great day, <h1>is</h1> it not? If there <h1>is</h1> something,
THIS <h1>IS</h1> it. <b><h1>is</h1></b>.
The regex I was trying:
Pattern pattern = Pattern.compile("[.>, ](is)[.<, ]", Pattern.CASE_INSENSITIVE);
The Matcher class is commonly used in conjunction with Pattern. Use the Matcher.replaceAll() method to replace all matches in the string:
String str = "This is a great day...";
Pattern p = Pattern.compile("\\bis\\b", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(str);
String result = m.replaceAll("<h1>$0</h1>"); // $0 is the whole match, so "IS" keeps its original case
Note: \b matches at a word boundary, that is, the position between a word character and a non-word character (or the start or end of the input). This ensures that only the word "is" is matched and not words that merely contain the letters "i" and "s" (like "island").
Like this:
str = str.replaceAll(yourRegex, "<h1>$1</h1>");
The $1 refers to the text captured by group #1 in your regex.
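As a concrete sketch of the $1 approach, the snippet below assumes a word-boundary pattern with one capturing group and runs it on the sample sentence from the question. The class and method names are illustrative only. Because $1 inserts the text that was actually matched, "IS" stays upper case.

```java
import java.util.regex.Pattern;

// Hypothetical helper combining \b with a $1 replacement.
class WordWrapper {
    // Group 1 is the word "is" in whatever case it appeared.
    private static final Pattern IS_WORD =
            Pattern.compile("\\b(is)\\b", Pattern.CASE_INSENSITIVE);

    // Wraps every standalone "is" in <h1>...</h1>, preserving its case.
    static String wrapIs(String text) {
        return IS_WORD.matcher(text).replaceAll("<h1>$1</h1>");
    }

    public static void main(String[] args) {
        String input = "This is a great day, is it not? If there is something, THIS IS it. <b>is</b>";
        System.out.println(wrapIs(input));
    }
}
```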
Michael's answer is better, but if you happen to specifically only want [.>, ] and [.<, ] as boundaries, you can do it like this:
String input = "This is a great day, is it not? If there is something, THIS IS it. <b>is</b>";
Pattern p = Pattern.compile("(?<=[.>, ])(is)(?=[.<, ])", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(input);
String result = m.replaceAll("<h1>$1</h1>");
yourStr.replaceAll("(?i)([.>, ])(is)([.<, ])","$1<h1>$2</h1>$3")
(?i) makes the match case-insensitive. Wrap everything you want to reuse in parentheses, refer to the groups as $1, $2 and $3, and concatenate them into the replacement you want.
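One difference between the two boundary styles shows up when matches sit next to each other. A pattern that consumes the boundary characters uses up the space after one "is", so that space is no longer available as the leading boundary of the next "is". Lookarounds check the neighbours without consuming them. The sketch below (class and method names invented) compares the two on a contrived input.

```java
// Hypothetical comparison of the two regexes from the answers above.
class BoundaryStyles {
    // Boundary characters are part of the match and are put back via $1 and $3.
    static String wrapConsuming(String text) {
        return text.replaceAll("(?i)([.>, ])(is)([.<, ])", "$1<h1>$2</h1>$3");
    }

    // Boundary characters are only inspected by lookbehind and lookahead.
    static String wrapLookaround(String text) {
        return text.replaceAll("(?i)(?<=[.>, ])(is)(?=[.<, ])", "<h1>$1</h1>");
    }

    public static void main(String[] args) {
        String text = " is is is ";
        System.out.println("[" + wrapConsuming(text) + "]");  // the middle "is" is skipped
        System.out.println("[" + wrapLookaround(text) + "]"); // all three are wrapped
    }
}
```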
Simply use a backreference for that.
"This is a great day, is it not? If there is something, THIS IS it. <b>is</b>".replaceAll("(?i)([.>, ])(is)([.<, ])", "$1<h1>$2</h1>$3"); should do. Here $2 is the matched word, and $1 and $3 put the surrounding characters back.
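It may be a late addition, but if anyone also wants to catch an "is" that sits inside a longer word (the way a search for 'thing' should also find 'Something'), something like this works:
Pattern p = Pattern.compile("([^ ])is([^ \\.])");
Matcher m = p.matcher(input); // input is the string to search
String result = m.replaceAll("<h1>$1is$2</h1>");
This wraps the embedded "is", together with the character on each side of it, in <h1>...</h1>.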

how to use Matcher.replaceAll in java?

I have a file which contains "(*" and "*)". I want to remove everything between these two character sequences.
I used the following code, but it didn't do anything to my string.
String regex = "\\(\\*.*\\*\\)";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(input);
matcher.replaceAll("");
the 'input' is:
(* This program prints out a message. *)
program is
begin
write ("Hello, world!");
end;
You need to capture the return value of your matcher: its replaceAll method returns the replaced String.
Additionally, use a regexp to match what you want to match, this time a parenthesized String. If you don't have some strange inputs, it may look like this:
String regex = "\\(\\*.*\\*\\)";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(input);
String result = matcher.replaceAll("(\\*\\*)");
System.out.println(result);
This regexp in fact matches the whole region from the first comment start to the last comment end on a line (by default . does not match line terminators), which would usually not be what you want. To make it match non-greedily (reluctantly), use this regexp: \(\*.*?\*\) (with doubled backslashes in Java).
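A minimal sketch of the reluctant version follows, with invented class and method names. It makes two assumptions that go beyond the answer above: comments may span several lines, so Pattern.DOTALL is added to let . match line terminators, and the delimiters themselves are removed along with the comment text.

```java
import java.util.regex.Pattern;

// Hypothetical helper; assumes comments never nest.
class CommentStripper {
    // Reluctant .*? stops at the first "*)"; DOTALL lets '.' cross line breaks.
    private static final Pattern COMMENT =
            Pattern.compile("\\(\\*.*?\\*\\)", Pattern.DOTALL);

    // Returns the source with every (* ... *) region removed.
    static String stripComments(String source) {
        return COMMENT.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "(* This program prints out a message. *)\nprogram is\nbegin\nwrite (\"Hello, world!\");\nend;";
        System.out.println(stripComments(input));
    }
}
```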
