Regular Expression - Starting with and ending with string - java

I would like to write a regular expression to match file names that start with "AMDF" or "SB700" and do not end with ".tmp". This will be used in Java.

Code
^(?:AMDF|SB700).*\.(?!tmp$)[^.]+$
Usage
import java.util.*;
import java.lang.*;
import java.io.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
class Ideone
{
    public static void main (String[] args) throws java.lang.Exception
    {
        final String regex = "^(?:AMDF|SB700).*\\.(?!tmp$)[^.]+$";
        final String[] files = {
            "AMDF123978sudjfadfs.ext",
            "SB700afddasjfkadsfs.ext",
            "AMDE41312312089fsas.ext",
            "SB701fs98dfjasdjfsd.ext",
            "AMDF123120381203113.tmp"
        };
        final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        for (String file : files) {
            final Matcher matcher = pattern.matcher(file);
            if (matcher.matches()) {
                System.out.println(matcher.group(0));
            }
        }
    }
}
Results
Input
AMDF123978sudjfadfs.ext
SB700afddasjfkadsfs.ext
AMDE41312312089fsas.ext
SB701fs98dfjasdjfsd.ext
AMDF123120381203113.tmp
Output
Below shows only matches.
AMDF123978sudjfadfs.ext
SB700afddasjfkadsfs.ext
Explanation
^ Assert position at the start of the line
(?:AMDF|SB700) Match either AMDF or SB700 literally
.* Match any character any number of times
\. Match a literal dot . character
(?!tmp$) Negative lookahead ensuring what follows doesn't match tmp literally (asserting the end of the line afterwards so as not to match .tmpx where x can be anything)
[^.]+ Match any character except . one or more times
$ Assert position at the end of the line

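Here is another pattern that works, provided at least four characters follow the AMDF or SB700 prefix (the lookahead checks that the last four characters are not .tmp):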
^(SB700|AMDF).*(?!\.tmp).{4}$

Another approach is to use a negative lookahead to assert that the file name does not end in .tmp, together with the anchor ^ to make sure that the file name starts with AMDF or SB700:
^(?!.*\.tmp$)(?:AMDF|SB700)\w*\.\w+$
Explanation
The beginning of the string ^
A negative lookahead (?!.*\.tmp$)
asserting that the string does not end with .tmp
A non capturing group which matches AMDF or SB700 (?:AMDF|SB700)
Match a word character zero or more times \w*
Match a dot \.
Match a word character one or more times \w+
The end of the string $
In Java it would look like:
^(?!.*\\.tmp$)(?:AMDF|SB700)\\w*\\.\\w+$
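As a rough sketch of how the lookahead pattern above might be used to filter a list of file names (the class name FileNameFilterSketch, the helper methods accepts and filter, and the sample names are made up for illustration):

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class FileNameFilterSketch {
    // Pattern from the answer above: must start with AMDF or SB700
    // and must not end with .tmp.
    private static final Pattern ACCEPTED =
            Pattern.compile("^(?!.*\\.tmp$)(?:AMDF|SB700)\\w*\\.\\w+$");

    // Hypothetical helper: true when the whole file name matches the pattern.
    static boolean accepts(String fileName) {
        return ACCEPTED.matcher(fileName).matches();
    }

    // Hypothetical helper: keeps only the accepted names, preserving order.
    static List<String> filter(List<String> fileNames) {
        return fileNames.stream()
                .filter(FileNameFilterSketch::accepts)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "AMDF123.ext", "SB700abc.ext", "AMDE413.ext", "AMDF123.tmp");
        System.out.println(filter(names)); // prints [AMDF123.ext, SB700abc.ext]
    }
}
```

Because \w does not match characters such as - or a space, a name like AMDF-1.ext is rejected by this pattern, whereas the first answer's .* would accept it.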


JAVA regex for "String, String."

Given the string "Neil, Gogte., Satyam, B.: Introduction to Java",
I need to extract only "Neil, Gogte." and "Satyam, B." using a regex. How can I do it?
You can use a Matcher and read each match with group():
String str = "Neil, Gogte., Satyam, B.: Introduction to Java";
Pattern pattern = Pattern.compile("([a-zA-Z]+, [a-zA-Z]+\\.)");
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
    String result = matcher.group();
    System.out.println(result);
}
You can use the following regex to split the string. It matches every position where a comma directly follows a dot (.,):
(?<=\.),\s*
(?<=\.) Positive lookbehind ensuring what precedes is a literal dot character .
,\s* Matches , followed by any number of whitespace characters
import java.util.*;
import java.util.regex.Pattern;
class Main {
    public static void main(String[] args) {
        final String s = "Neil, Gogte., Satyam, B.: Introduction to Java";
        final Pattern r = Pattern.compile("(?<=\\.),\\s*");
        String[] result = r.split(s);
        Arrays.stream(result).forEach(System.out::println);
    }
}
Result:
Neil, Gogte.
Satyam, B.: Introduction to Java
You might use this regex to match your names:
[A-Z][a-z]+, [A-Z][a-z]*\.
In Java:
[A-Z][a-z]+, [A-Z][a-z]*\\.
That would match
[A-Z] Match an uppercase character
[a-z]+ Match one or more lowercase characters
, Match comma and a whitespace
[A-Z] Match an uppercase character
[a-z]* Match zero or more lowercase characters
\. Match a dot
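A minimal sketch of how the pattern above could be applied in Java, collecting every match into a list (the class name NameExtractorSketch and the helper extractNames are hypothetical, not part of the answer):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameExtractorSketch {
    // Pattern from the answer above, with the dot escaped for a Java string literal.
    private static final Pattern NAME = Pattern.compile("[A-Z][a-z]+, [A-Z][a-z]*\\.");

    // Hypothetical helper: returns every "Surname, Given." style match in order.
    static List<String> extractNames(String text) {
        List<String> names = new ArrayList<>();
        Matcher matcher = NAME.matcher(text);
        while (matcher.find()) {
            names.add(matcher.group());
        }
        return names;
    }

    public static void main(String[] args) {
        String str = "Neil, Gogte., Satyam, B.: Introduction to Java";
        // Print one match per line, since the matches themselves contain commas.
        for (String name : extractNames(str)) {
            System.out.println(name);
        }
    }
}
```

For the sample string this should print "Neil, Gogte." and "Satyam, B." on separate lines; the title part is skipped because "Introduction" is not followed by a comma.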

Match String ending with (regex) java

I am following the suggestions on the page "check if string ends with certain pattern".
I am trying to match a string that:
Starts with anything
Has the letters ".mp4" in it
Ends explicitly with ', (apostrophe followed by comma)
Here is my Java code:
import java.util.*;
import java.lang.*;
import java.io.*;
import java.util.regex.*;
class Ideone
{
    public static void main (String[] args) throws java.lang.Exception
    {
        // your code goes here
        String str = " _file='ANyTypEofSTR1ngHere_133444556_266545797_10798866.mp4',";
        Pattern p = Pattern.compile(".*.mp4[',]$");
        Matcher m = p.matcher(str);
        if (m.find())
            System.out.println("yes");
        else
            System.out.println("no");
    }
}
It prints "no". How should I declare my RegEx?
There are several issues in your regex:
"Has the letters .mp4 in it" means somewhere, not necessarily just in front of ',, so another .* should be inserted.
. matches any character. Use \. to match .
[',] is a character class, i.e. exactly one of the characters in the brackets has to occur, so it cannot match the two-character sequence ',.
You can use the following regex instead:
Pattern p = Pattern.compile(".*\\.mp4.*',$");
Your character set [',] is checking whether the string ends with ' or , a single time.
If you want to match those characters one or more times, use [',]+. However, you probably don't want to use a character set in this case, since you said order is important.
To match an apostrophe followed by comma, just use:
.*\\.mp4',$
Also, since . has special meaning, you need to escape it in '.mp4'.
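To make the difference concrete, here is a small sketch that runs the question's original pattern and the corrected one against the same input (the class name Mp4SuffixSketch and the helper matches are invented for this example):

```java
import java.util.regex.Pattern;

class Mp4SuffixSketch {
    // Original pattern from the question: [',] matches only ONE character,
    // so the two-character ending ', can never satisfy it.
    static final Pattern ORIGINAL = Pattern.compile(".*.mp4[',]$");
    // Corrected pattern from the answer: literal dot, anything, then ', at the end.
    static final Pattern CORRECTED = Pattern.compile(".*\\.mp4.*',$");

    // Hypothetical helper mirroring the question's use of find().
    static boolean matches(Pattern pattern, String input) {
        return pattern.matcher(input).find();
    }

    public static void main(String[] args) {
        String str = " _file='ANyTypEofSTR1ngHere_133444556_266545797_10798866.mp4',";
        System.out.println(matches(ORIGINAL, str));  // prints false
        System.out.println(matches(CORRECTED, str)); // prints true
    }
}
```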

Regex that allows only single separators between words

I need to construct a regular expression such that it should not allow / at the start or end, and there should not be more than one / in sequence.
Valid Expression: AB/CD
Valid Expression: AB
Invalid Expression: //AB//CD//
Invalid Expression: ///////
Invalid Expression: AB////////
The / character is just a separator between two words. Its length should not be more than one between words.
Assuming you only want to allow alphanumerics (including underscore) between slashes, it's pretty trivial:
boolean foundMatch = subject.matches("\\w+(?:/\\w+)*");
Explanation:
\w+ # Match one or more alnum characters
(?: # Start a non-capturing group
/ # Match a single slash
\w+ # Match one or more alnum characters
)* # Match that group any number of times
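A quick way to check this pattern against the examples from the question is a small sketch like the following (the class name SeparatorCheckSketch and the helper isValid are hypothetical):

```java
class SeparatorCheckSketch {
    // Pattern from the answer above: word characters separated by single slashes.
    // String.matches() requires the whole string to match, so no anchors are needed.
    static boolean isValid(String subject) {
        return subject.matches("\\w+(?:/\\w+)*");
    }

    public static void main(String[] args) {
        String[] samples = {"AB/CD", "AB", "//AB//CD//", "///////", "AB////////"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + isValid(sample));
        }
    }
}
```

With the samples above, only AB/CD and AB should be reported as valid.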
This regex does it:
^(?!/)(?!.*//).*[^/]$
So in java:
if (str.matches("(?!/)(?!.*//).*[^/]"))
Note that ^ and $ are implied by matches(), because matches must match the whole string to be true.
[a-zA-Z]+(/[a-zA-Z]+)+
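It matches:
a/b
a/b/c
aa/vv/cc
It does not match the following. Note that a single word is rejected, although the question treats AB as valid; changing the final + to * would accept it: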
a
/a/b
a//b
a/b/
Demo
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Reg {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("[a-zA-Z]+(/[a-zA-Z]+)+");
        Matcher matcher = pattern.matcher("a/b/c");
        System.out.println(matcher.matches());
    }
}

How to write a regular expressions that extracts tabbed pieces of text?

I have been trying to create a program to replace tab characters with spaces (assuming a tab is equivalent to 8 spaces, one or more of which are taken by non-whitespace characters such as letters).
I start by reading the text of a file with a Scanner as follows:
try {
    reader = new FileReader(file);
} catch (IOException io) {
    println("File not found");
}
Scanner scanner = new Scanner(reader);
scanner.useDelimiter("\\Z");
String text = scanner.next();
And then I try parsing through pieces of text that end with a tab with ptrn1 below, and extract the length of the last word of each piece with ptrn2:
Pattern ptrn1 = Pattern.compile(".*\\t", Pattern.DOTALL);
Matcher matcher1 = ptrn1.matcher(text);
String nextPiece = matcher1.group();
println(matcher1.group()); /* gives me the first substring ending with tab*/
however:
Pattern ptrn2 = Pattern.compile("\\s.*\\t"); /*supposed to capture the last word in the string*/
Matcher matcher2 = ptrn2.matcher(nextPiece);
String lastword = matcher2.group();
The last line gives me an error since apparently it cannot match anything with the pattern "\\s.*\\t". There is something wrong with this last regular expression, which is intended to say "any number of spaces, followed by any number of characters, followed by a tab". I have not been able to find out what is wrong with it, though. I have tried "\\s*.+\\t", "\\s*.*\\t", and "\\s+.+\\t"; still no luck.
Later on, per recommendations below, I simplified the code and included the sample string in it. As follows:
import acm.program.*;
import acm.util.*;
import java.util.*;
import java.io.*;
import java.util.regex.*;
public class Untabify extends ConsoleProgram {
    public void run() {
        String s = "Be plain,\tgood son,\tand homely\tin thy drift.\tRiddling\tconfession\tfinds but riddling\tshrift. ";
        Pattern ptrn1 = Pattern.compile(".*?\t", Pattern.DOTALL);
        Pattern ptrn2 = Pattern.compile("[^\\s+]\t", Pattern.DOTALL);
        String nextPiece;
        Matcher matcher1 = ptrn1.matcher(s);
        while (matcher1.find()) {
            nextPiece = matcher1.group();
            println(nextPiece);
            Matcher matcher2 = ptrn2.matcher(nextPiece);
            println(matcher2.group());
        }
    }
}
The program variably crashes, first at "println(matcher2.group())"; and on the next run on "public void run()" with the message: "Debug Current Instruction Pointer" (what is the meaning of it?).
It would be useful to see a sample string. If you just want the last word before the tab, then you can use this:
([^\s]+)\t
Note the () are to put the last word in a group. [^\s]+ means 1 or more non-space.
You do not need to double-escape the tab character (i.e. \\t); \t will do fine. \t is interpreted as a tab character by the java String parser, and that tab character is sent to the regex parser, which interprets it as a tab character. You can see this answer for more information.
Also, you should use Pattern.DOTALL, not Pattern.Dotall.
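Here is a minimal sketch of the ([^\s]+)\t idea, written with the equivalent \S+. It also illustrates a point that matters for the question's code: Matcher.group() may only be called after a successful find() or matches(), otherwise it throws IllegalStateException, which is what println(matcher2.group()) runs into. The class name LastWordBeforeTabSketch and both helper methods are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastWordBeforeTabSketch {
    // One or more non-whitespace characters (captured) followed by a tab.
    private static final Pattern WORD_BEFORE_TAB = Pattern.compile("(\\S+)\t");

    // Hypothetical helper: collects every word that sits directly before a tab.
    static List<String> wordsBeforeTabs(String text) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD_BEFORE_TAB.matcher(text);
        while (matcher.find()) {          // find() must succeed before group() is used
            words.add(matcher.group(1));
        }
        return words;
    }

    // Hypothetical helper: shows that group() without a prior find() throws.
    static boolean groupWithoutFindThrows(String text) {
        Matcher matcher = WORD_BEFORE_TAB.matcher(text);
        try {
            matcher.group();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(wordsBeforeTabs("ab cd\t ef gh\t ij")); // prints [cd, gh]
        System.out.println(groupWithoutFindThrows("ab cd\t"));     // prints true
    }
}
```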
The pattern "\\s.*\\t" must match a single whitespace character (\s) followed by 0 or more characters (.*) followed by a single tab (\t). If you want to capture the last word and a trailing tab, you should use the word boundary escape \b:
Pattern.compile("\\b.*\\b\t");
You could replace the . above to use \w or whatever your definition of a word character is if you don't want to match any character.
Here's the code you'd use to match any word immediately before a tab:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegEx {
    public static void main(String args[]) {
        String text = "ab cd\t ef gh\t ij";
        Pattern pattern = Pattern.compile("\\b(\\w+)\\b\t", Pattern.DOTALL);
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            System.out.println(matcher.group(1));
        }
    }
}
The above will output
cd
gh
See the Regular Expression Tutorial, especially the sections on Predefined Character Classes and Boundary Matchers for more information.
You can get more detail and experiment with this regular expression on Regex101.

regex last word in a sentence ending with punctuation (period)

I'm looking for the regex pattern, not the Java code, to match the last word in an English (or European language) sentence. If the last word is, in this case, "hi" then I want to match "hi" and not "hi."
The regex (\w+)\.$ will match "hi.", whereas the output should be just "hi". What's the correct regex?
thufir#dur:~/NetBeansProjects/regex$
thufir#dur:~/NetBeansProjects/regex$ java -jar dist/regex.jar
trying
a b cd efg hi
matches:
hi
trying
a b cd efg hi.
matches:
thufir#dur:~/NetBeansProjects/regex$
code:
package regex;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
    public static void main(String[] args) {
        String matchesLastWordFine = "a b cd efg hi";
        lastWord(matchesLastWordFine);
        String noMatchFound = matchesLastWordFine + ".";
        lastWord(noMatchFound);
    }
    private static void lastWord(String sentence) {
        System.out.println("\n\ntrying\n" + sentence + "\nmatches:");
        Pattern pattern = Pattern.compile("(\\w+)$");
        Matcher matcher = pattern.matcher(sentence);
        String match = null;
        while (matcher.find()) {
            match = matcher.group();
            System.out.println(match);
        }
    }
}
My code is in Java, but that's neither here nor there. I'm strictly looking for the regex, not the Java code. (Yes, I know it's possible to strip out the last character with Java.)
What regex should I put in the pattern?
You can use a lookahead assertion. For example, to match the sentence without the period:
[\w\s]+(?=\.)
and, for just the last word (the word before the "."):
[\w]+(?=\.)
If you need to have the whole match be the last word you can use lookahead.
\w+(?=(\.))
This matches a set of word characters that are followed by a period, without matching the period.
If you want the last word in the line, regardless of whether the line ends at the end of a sentence or not, you can use:
\w+(?=(\.?$))
Or if you want to also include ,!;: etc then
\w+(?=(\p{Punct}?$))
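As a rough sketch, the last variant could be wrapped in a small helper like this (the class name LastWordSketch and the method lastWord are made up; the inner capturing group is dropped because only the whole match is used, and null is returned when nothing matches):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastWordSketch {
    // A run of word characters that is followed by optional punctuation
    // and then the end of the input; the lookahead keeps the punctuation
    // out of the match itself.
    private static final Pattern LAST_WORD = Pattern.compile("\\w+(?=\\p{Punct}?$)");

    // Hypothetical helper: returns the last word, or null if there is none.
    static String lastWord(String sentence) {
        Matcher matcher = LAST_WORD.matcher(sentence);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(lastWord("a b cd efg hi"));  // prints hi
        System.out.println(lastWord("a b cd efg hi.")); // prints hi
        System.out.println(lastWord("Really, why?"));   // prints why
    }
}
```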
You can use matcher.group(1) to get the content of the first capturing group ((\w+) in your case). To say a little more, matcher.group(0) returns the full match. So your regex is almost correct. One improvement concerns your use of $, which matches at the end of the line. Use it only if your sentence fills the line exactly!
With the regular expression (\w+)\p{Punct} you get a group count of 1, which means matcher.group(0) returns the word with its punctuation and matcher.group(1) returns the word without it.
To write the regular expression in Java, use: "(\\w+)\\p{Punct}"
To test your regular expressions online with Java (and actually a lot of other languages) see RegexPlanet
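A short sketch of the group(0) versus group(1) distinction for (\w+)\p{Punct}, keeping only the last match in the sentence (the class name GroupIndexSketch and the helper lastWordGroups are hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupIndexSketch {
    // A word (captured as group 1) immediately followed by one punctuation character.
    private static final Pattern WORD_THEN_PUNCT = Pattern.compile("(\\w+)\\p{Punct}");

    // Hypothetical helper: returns {group(0), group(1)} of the last match, or null.
    static String[] lastWordGroups(String sentence) {
        Matcher matcher = WORD_THEN_PUNCT.matcher(sentence);
        String[] last = null;
        while (matcher.find()) {
            last = new String[] {matcher.group(0), matcher.group(1)};
        }
        return last;
    }

    public static void main(String[] args) {
        String[] groups = lastWordGroups("a b cd efg hi.");
        System.out.println(groups[0]); // prints hi.
        System.out.println(groups[1]); // prints hi
    }
}
```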
By using the $ operator you will only get a match at the end of a line. So if you have multiple sentences on one line you will not get a match in the middle one.
So you should just use:
(\w+)\.
The capture group will give the correct match.
I don't understand why really, but this works:
package regex;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
    public static void main(String[] args) {
        String matchesLastWordFine = "a b cd efg hi";
        lastWord(matchesLastWordFine);
        String noMatchFound = matchesLastWordFine + ".";
        lastWord(noMatchFound);
    }
    private static void lastWord(String sentence) {
        System.out.println("\n\ntrying\n" + sentence + "\nmatches:");
        Pattern pattern = Pattern.compile("(\\w+)"); //(\w+)\.
        Matcher matcher = pattern.matcher(sentence);
        String match = null;
        while (matcher.find()) {
            match = matcher.group();
        }
        System.out.println(match);
    }
}
I guess regex \w+ will match all the words (doh). Then the last word is what I was after. Too simple, really, I was trying to exclude punctuation, but I guess regex does that automagically for you..?
