regex last word in a sentence ending with punctuation (period)

regex last word in a sentence ending with punctuation (period) - java

I'm looking for the regex pattern, not the Java code, to match the last word in an English (or European language) sentence. If the last word is, in this case, "hi" then I want to match "hi" and not "hi."
The regex (\w+)\.$ will match "hi.", whereas the output should be just "hi". What's the correct regex?
thufir#dur:~/NetBeansProjects/regex$
thufir#dur:~/NetBeansProjects/regex$ java -jar dist/regex.jar
trying
a b cd efg hi
matches:
hi
trying
a b cd efg hi.
matches:
thufir#dur:~/NetBeansProjects/regex$
code:
package regex;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String matchesLastWordFine = "a b cd efg hi";
lastWord(matchesLastWordFine);
String noMatchFound = matchesLastWordFine + ".";
lastWord(noMatchFound);
}
private static void lastWord(String sentence) {
System.out.println("\n\ntrying\n" + sentence + "\nmatches:");
Pattern pattern = Pattern.compile("(\\w+)$");
Matcher matcher = pattern.matcher(sentence);
String match = null;
while (matcher.find()) {
match = matcher.group();
System.out.println(match);
}
}
}
My code is in Java, but that's neither here nor there. I'm strictly looking for the regex, not the Java code. (Yes, I know it's possible to strip out the last character with Java.)
What regex should I put in the pattern?

You can use lookahead asserion. For example to match sentence without period:
[\w\s]+(?=\.)
and
[\w]+(?=\.)
For just last word (word before ".")

If you need to have the whole match be the last word you can use lookahead.
\w+(?=(\.))
This matches a set of word characters that are followed by a period, without matching the period.
If you want the last word in the line, regardless of wether the line ends on the end of a sentence or not you can use:
\w+(?=(\.?$))
Or if you want to also include ,!;: etc then
\w+(?=(\p{Punct}?$))

You can use matcher.group(1) to get the content of the first capturing group ((\w+) in your case). To say a little more, matcher.group(0) would return you the full match. So your regex is almost correct. An improvement is related to your use of $, which would catch the end of the line. Use this only if your sentence fill exactly the line!

With this regular expression (\w+)\p{Punct} you get a group count of 1, means you get one group with punctionation at matcher.group(0) and one without the punctuation at matcher.group(1).
To write the regular expression in Java, use: "(\\w+)\\p{Punct}"
To test your regular expressions online with Java (and actually a lot of other languages) see RegexPlanet

By using the $ operator you will only get a match at the end of a line. So if you have multiple sentences on one line you will not get a match in the middle one.
So you should just use:
(\w+)\.
the capture group will give the correct match.
You can see an example here

I don't understand why really, but this works:
package regex;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String matchesLastWordFine = "a b cd efg hi";
lastWord(matchesLastWordFine);
String noMatchFound = matchesLastWordFine + ".";
lastWord(noMatchFound);
}
private static void lastWord(String sentence) {
System.out.println("\n\ntrying\n" + sentence + "\nmatches:");
Pattern pattern = Pattern.compile("(\\w+)"); //(\w+)\.
Matcher matcher = pattern.matcher(sentence);
String match = null;
while (matcher.find()) {
match = matcher.group();
}
System.out.println(match);
}
}
I guess regex \w+ will match all the words (doh). Then the last word is what I was after. Too simple, really, I was trying to exclude punctuation, but I guess regex does that automagically for you..?

Related

Java Regular Expression to check for fixed length and more

I am not even sure if regular expressions are the best way to do this. Here is the requirement on a string:
To check length is 13 characters
First and Last 2 characters are always characters only.
Characters from 3 - 11 are numeric.
Please suggest whether regular expression is the best way to do it and what the regular expression would like to check such a thing?
Regards
Akhil

Use e.g.
"^[a-z]{2}[0-9]{9}[a-z]{2}$"
The square brackets say what is allowed, 'a-z' means small alphabetics between a and z. The curly says how many must be there. ^ means no characters before this, and $ means no characters after.
Usage:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class MatcherExample {
public static void main(String[] args) {
String text = "aa123456789bb";
String patternString = "^[a-z]{2}[0-9]{9}[a-z]{2}$";
Pattern pattern = Pattern.compile(patternString, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(text);
boolean matches = matcher.matches();
System.out.println("Matches: " + matches);
}
}

Match String ending with (regex) java

I am following the suggestions on the page, check if string ends with certain pattern
I am trying to display a string that is
Starts with anything
Has the letters ".mp4" in it
Ends explicitly with ', (apostrophe followed by comma)
Here is my Java code:
import java.util.*;
import java.lang.*;
import java.io.*;
import java.util.regex.*;
class Ideone
{
public static void main (String[] args) throws java.lang.Exception
{
// your code goes here
String str = " _file='ANyTypEofSTR1ngHere_133444556_266545797_10798866.mp4',";
Pattern p = Pattern.compile(".*.mp4[',]$");
Matcher m = p.matcher(str);
if(m.find())
System.out.println("yes");
else
System.out.println("no");
}
}
It prints "no". How should I declare my RegEx?

There are several issues in your regex:
"Has the letters .mp4 in it" means somewhere, not necessarily just in front of ',, so another .* should be inserted.
. matches any character. Use \. to match .
[,'] is a character group, i.e. exactly one of the characters in the brackets has to occur.
You can use the following regex instead:
Pattern p = Pattern.compile(".*\\.mp4.*',$");

Your character set [',] is checking whether the string ends with ' or , a single time.
If you want to match those character one or more times, use [',]+. However, you probably don't want to use a character set in this case since you said order is important.
To match an apostrophe followed by comma, just use:
.*\\.mp4',$
Also, since . has special meaning, you need to escape it in '.mp4'.

How to write a regular expressions that extracts tabbed pieces of text?

I have been trying to create a program to replace tab elements with spaces (assuming a tab is equivalent to 8 spaces, one or more of which taken by non-whitespace characters (letter).
I start to extract the text in a file from a scanner by the following:
try {
reader = new FileReader(file)
} catch (IOException io) {
println("File not found")
}
Scanner scanner = new Scanner(reader);
scanner.usedelimiter("//Z");
String text = Scanner.next();
And then I try parsing through pieces of text that end with a tab with ptrn1 below, and extract the length of the last word of each piece with ptrn2:
Pattern ptrn1 = Pattern.compile(".*\\t, Pattern.DOTALL);
Matcher matcher1 = ptrn1.matcher(text);
String nextPiece = matcher1.group();
println(matcher1.group()); /* gives me the first substring ending with tab*/
however:
Pattern ptrn2 = Pattern.compile("\\s.*\\t"); /*supposed to capture the last word in the string*/
Matcher matcher2 = ptrn2.matcher(nextPiece);
String lastword = matcher2.group();
The last line gives me an error since apparently it cannot match anything with the pattern ("\\s.\*\\t"). There is something wrong with this last regular expression, which is intended to say "any number of spaces, followed by any number of characters, followed by a tab. I have not been able to find out what is wrong with it though. I have tried ("\\s*.+\\t"), ("\\s*.*\\t"), and ("\s+.+\\t"); still no luck.
Later on, per recommendations below, I simplified the code and included the sample string in it. As follows:
import acm.program.*;
import acm.util.*;
import java.util.*;
import java.io.*;
import java.util.regex.*;
public class Untabify extends ConsoleProgram {
public void run(){
String s = "Be plain,\tgood son,\tand homely\tin thy drift.\tRiddling\tconfession\tfinds but riddling\tshrift. ";
Pattern ptrn1 =Pattern.compile(".*?\t", Pattern.DOTALL);
Pattern ptrn2 = Pattern.compile("[^\\s+]\t", Pattern.DOTALL);
String nextPiece;
Matcher matcher1 = ptrn1.matcher(s);
while (matcher1.find()){
nextPiece = matcher1.group();
println(nextPiece);
Matcher matcher2 = ptrn2.matcher(nextPiece);
println(matcher2.group());
}
}
}
The program variably crashes, first at "println(matcher2.group())"; and on the next run on "public void run()" with the message: "Debug Current Instruction Pointer" (what is the meaning of it?).

It would be useful to see a sample string. If you just want the last word before the tab, then you can use this:
([^\s]+)\t
Note the () are to put the last word in a group. [^\s]+ means 1 or more non-space.

You do not need to double-escape the tab character (i.e. \\t); \t will do fine. \t is interpreted as a tab character by the java String parser, and that tab character is sent to the regex parser, which interprets it as a tab character. You can see this answer for more information.
Also, you should use Pattern.DOTALL, not Pattern.Dotall.

The pattern "\\s.*\\t" must match a single whitespace character (\s) followed by 0 or more characters (.*) followed by a single tab (\t). If you want to capture the last word and a trailing tab you should use the word boundary escape \b
Pattern.compile("\\b.*\\b\t");
You could replace the . above to use \w or whatever your definition of a word character is if you don't want to match any character.
Here's the code you'd use to match any word immediately before a tab:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegEx {
public static void main(String args[]) {
String text = "ab cd\t ef gh\t ij";
Pattern pattern = Pattern.compile("\\b(\\w+)\\b\t", Pattern.DOTALL);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
}
}
The above will output
cd
gh
See the Regular Expression Tutorial, especially the sections on Predefined Character Classes and Boundary Matchers for more information.
You can get more detail and experiment with this regular expression on Regex101.

Find a number of a given number of digits between given separators

What regex/pattern can I use to find the following pattern in a string?
#nnnn:
nnnn can be any 4-digit long number as long as it is sorrounded by a hashtag and a colon.
I have tried the code below:
String string = "#8226:";
if(string.matches( ".*\\d:.*" )) {
System.out.println( "Yes" );
}
It DOES work, but it matches other strings like below:
"This is a string 1234: Hahaha!" // Outputs "Yes"
"Hello 1834: World!!!" // Outputs "Yes"
I want it to only match the pattern at the top of the question.
Can anybody tell me where did I go wrong?

It can be done with Regular Expression
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class FindPattern {
public static void main(String[] args) {
Pattern pattern = Pattern.compile("#[0-9]{4}:");
String text = "#1233:#3433:abc#3993: #a343:___#8888:ki";
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
System.out.println(matcher.group());
}
}
}
output is:
#1233:
#3433:
#3993:
#8888:

You have already a pattern: #nnnn:. The only problem is that this is not a java compatible regular expression. Let's convert.
# and : are valid character literals, so let these untouched.
As you probably know (according to your solution), a number is denoted with the \d sequence (note, there are some alternatives, e. g. [0-9], \p{Digit}). Just replace all ns with \d:
#\d\d\d\d:
There are four equal subpatterns here, so we can shorten it with a fixed quantifier:
#\d{4}:
You can now write string.matches("#\\d{4}:"). Note that this is slow because compiles the given regex pattern every time. If this code is called frequently, I would consider using a precompiled Pattern like:
Pattern HASH_NUMBER_COLON_PATTERN = Pattern.compile("#\\d{4}:");
// ...
if (HASH_NUMBER_COLON_PATTERN.matcher(yourString).matches()) {
// ...
}
Even better to use some regular expression builder library, such as regex-builder, JavaVerbalExpressions or RegexBee. These tools can make your intention very clear. RegexBee example:
Pattern HASH_NUMBER_COLON_PATTERN = Bee
.then(Bee.fixedChar('#'))
.then(Bee.intBetween(1000, 9999))
.then(Bee.fixedChar(':'))
.toPattern()

java regular expression

Can anyone please help me do the following in a java regular expression?
I need to read 3 characters from the 5th position from a given String ignoring whatever is found before and after.
Example : testXXXtest
Expected result : XXX

You don't need regex at all.
Just use substring: yourString.substring(4,7)
Since you do need to use regex, you can do it like this:
Pattern pattern = Pattern.compile(".{4}(.{3}).*");
Matcher matcher = pattern.matcher("testXXXtest");
matcher.matches();
String whatYouNeed = matcher.group(1);
What does it mean, step by step:
.{4} - any four characters
( - start capturing group, i.e. what you need
.{3} - any three characters
) - end capturing group, you got it now
.* followed by 0 or more arbitrary characters.
matcher.group(1) - get the 1st (only) capturing group.

You should be able to use the substring() method to accomplish this:
string example = "testXXXtest";
string result = example.substring(4,7);

This might help: Groups and capturing in java.util.regex.Pattern.
Here is an example:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class Example {
public static void main(String[] args) {
String text = "This is a testWithSomeDataInBetweentest.";
Pattern p = Pattern.compile("test([A-Za-z0-9]*)test");
Matcher m = p.matcher(text);
if (m.find()) {
System.out.println("Matched: " + m.group(1));
} else {
System.out.println("No match.");
}
}
}
This prints:
Matched: WithSomeDataInBetween
If you don't want to match the entire pattern rather to the input string (rather than to seek a substring that would match), you can use matches() instead of find(). You can continue searching for more matching substrings with subsequent calls with find().
Also, your question did not specify what are admissible characters and length of the string between two "test" strings. I assumed any length is OK including zero and that we seek a substring composed of small and capital letters as well as digits.

You can use substring for this, you don't need a regex.
yourString.substring(4,7);
I'm sure you could use a regex too, but why if you don't need it. Of course you should protect this code against null and strings that are too short.

Use the String.replaceAll() Class Method
If you don't need to be performance optimized, you can try the String.replaceAll() class method for a cleaner option:
String sDataLine = "testXXXtest";
String sWhatYouNeed = sDataLine.replaceAll( ".{4}(.{3}).*", "$1" );
References
https://docs.oracle.com/javase/1.5.0/docs/api/java/lang/String.html
http://www.vogella.com/tutorials/JavaRegularExpressions/article.html#using-regular-expressions-with-string-methods

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

regex last word in a sentence ending with punctuation (period) - java

You can use lookahead asserion. For example to match sentence without period: [\w\s]+(?=\.) and [\w]+(?=\.) For just last word (word before ".")

By using the $ operator you will only get a match at the end of a line. So if you have multiple sentences on one line you will not get a match in the middle one. So you should just use: (\w+)\. the capture group will give the correct match. You can see an example here

Related

Java Regular Expression to check for fixed length and more

Match String ending with (regex) java

How to write a regular expressions that extracts tabbed pieces of text?

Find a number of a given number of digits between given separators

java regular expression

Categories

Resources