How does this group() catch the text? - java

I've come across this HackerRank problem, where the regex should match the string between the HTML tags. The regex and the string are:
String str="<h1>Hello World!</h1>";
String regex="<(.+)>([^<]+)</\\1>";
Also, what if str has more than one HTML tag, like String str="<h1><h1>Hello World!</h1></h1>"? How does ([^<]+) catch this str?
My question is why ([^<]+) matches str and ([a-zA-Z]+) does not.
Here is the full source code:
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
/* Solution assumes we can't have the symbol "<" as text between tags */
public class Solution {
    public static void main(String[] args) {
        // The pattern from the question: tag name, content, matching closing tag
        String regex = "<(.+)>([^<]+)</\\1>";
        Scanner scan = new Scanner(System.in);
        int testCases = Integer.parseInt(scan.nextLine());
        while (testCases-- > 0) {
            String line = scan.nextLine();
            boolean matchFound = false;
            Pattern r = Pattern.compile(regex);
            Matcher m = r.matcher(line);
            while (m.find()) {
                // group(2) is the text between the tags
                System.out.println(m.group(2));
                matchFound = true;
            }
            if (!matchFound) {
                System.out.println("None");
            }
        }
    }
}
Don't mind if I'm stupid to ask this question and thank you in advance!

This regex guarantees that each match contains only one tag pair, assuming well-formed HTML input.
The initial <(.+)> captures the name of your tag. The capture group will also get any attributes it can. Since + is a greedy quantifier, it will capture multiple tags if it can.
The trailing </\\1> matches against whatever the first group captured. That's why, if your HTML is well formed, the expression won't capture multiple tags or tags with attributes:
Opening tag <h1>, closing tag </h1> ✓
Opening tag <h1 attr="value">, closing tag </h1>, but expecting </h1 attr="value">
Opening tag <h1><h2>, closing tag </h2></h1>, but expecting </h1><h2>
That's why the tag can be matched with .+ rather safely, while the contents must be matched with [^<]+. You want to make sure you don't grab any stray tags in the content, but any other character at all is allowed. [^<]+ (pronounced "not <, at least once") allows things like !, while [a-zA-Z]+ certainly would not.

If the input string is Hello World! then ([a-zA-Z]+) will not properly match because of the exclamation point (!) and the space characters.
To be more clear, here is what each regex means:
([a-zA-Z]+) Match a sequence (1 or more characters) that is made up of letters of the alphabet (upper or lower case)
([^<]+) Match a sequence (1 or more characters) so long as a character is not a < character
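To see the difference in practice, here is a minimal sketch that runs both variants of the pattern against the strings from the question. The class name TagContentDemo and the helper extract are made up for illustration; only the regexes and inputs come from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagContentDemo {
    // Hypothetical helper: returns every group(2) capture,
    // i.e. the text between a tag and its matching closing tag.
    static List<String> extract(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String anyButLt = "<(.+)>([^<]+)</\\1>";
        String lettersOnly = "<(.+)>([a-zA-Z]+)</\\1>";

        // Space and '!' are allowed by [^<]+
        System.out.println(extract(anyButLt, "<h1>Hello World!</h1>"));
        // Nested case: only the inner <h1>...</h1> pair can satisfy the backreference
        System.out.println(extract(anyButLt, "<h1><h1>Hello World!</h1></h1>"));
        // [a-zA-Z]+ stops at the space, so the closing tag never lines up
        System.out.println(extract(lettersOnly, "<h1>Hello World!</h1>"));
    }
}
```

In the nested case the outer tag cannot match because the content directly after the first `<h1>` starts with `<`, which [^<]+ refuses; the engine then retries from the inner tag and succeeds there.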

Related

Match String ending with (regex) java

I am following the suggestions on the page, check if string ends with certain pattern
I am trying to display a string that is
Starts with anything
Has the letters ".mp4" in it
Ends explicitly with ', (apostrophe followed by comma)
Here is my Java code:
import java.util.*;
import java.lang.*;
import java.io.*;
import java.util.regex.*;
class Ideone
{
public static void main (String[] args) throws java.lang.Exception
{
// your code goes here
String str = " _file='ANyTypEofSTR1ngHere_133444556_266545797_10798866.mp4',";
Pattern p = Pattern.compile(".*.mp4[',]$");
Matcher m = p.matcher(str);
if(m.find())
System.out.println("yes");
else
System.out.println("no");
}
}
It prints "no". How should I declare my RegEx?
There are several issues in your regex:
"Has the letters .mp4 in it" means somewhere, not necessarily just in front of ',, so another .* should be inserted.
. matches any character. Use \. to match .
[,'] is a character group, i.e. exactly one of the characters in the brackets has to occur.
You can use the following regex instead:
Pattern p = Pattern.compile(".*\\.mp4.*',$");
Your character set [',] checks whether the string ends with ' or , a single time.
If you want to match those characters one or more times, use [',]+. However, you probably don't want to use a character set in this case since you said order is important.
To match an apostrophe followed by comma, just use:
.*\\.mp4',$
Also, since . has special meaning, you need to escape it in '.mp4'.
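As a quick check, the following sketch compares the original pattern with the two suggested ones on the string from the question. The class name Mp4Demo and the helper found are illustrative names, not part of the original code.

```java
import java.util.regex.Pattern;

public class Mp4Demo {
    // Hypothetical helper: true if the pattern is found anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String str = " _file='ANyTypEofSTR1ngHere_133444556_266545797_10798866.mp4',";

        // Original: [',] consumes only ONE of the two trailing characters
        System.out.println(found(".*.mp4[',]$", str));
        // Escaped dot, ".mp4" anywhere, then apostrophe + comma at the end
        System.out.println(found(".*\\.mp4.*',$", str));
        // Escaped dot, ".mp4" directly before apostrophe + comma
        System.out.println(found(".*\\.mp4',$", str));
    }
}
```

The first call returns false because the character class matches the final comma only, which leaves the apostrophe between "mp4" and the end unaccounted for; the other two return true.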

How to write a regular expressions that extracts tabbed pieces of text?

I have been trying to write a program that replaces tabs with spaces (assuming a tab stop is 8 columns wide, one or more of which are taken by non-whitespace characters such as letters).
I start to extract the text in a file from a scanner by the following:
try {
    reader = new FileReader(file);
} catch (IOException io) {
    println("File not found");
}
Scanner scanner = new Scanner(reader);
scanner.useDelimiter("\\Z"); // \Z = end of input, so next() returns the whole file
String text = scanner.next();
And then I try parsing through pieces of text that end with a tab with ptrn1 below, and extract the length of the last word of each piece with ptrn2:
Pattern ptrn1 = Pattern.compile(".*\\t", Pattern.DOTALL);
Matcher matcher1 = ptrn1.matcher(text);
String nextPiece = matcher1.group();
println(matcher1.group()); /* gives me the first substring ending with tab*/
however:
Pattern ptrn2 = Pattern.compile("\\s.*\\t"); /*supposed to capture the last word in the string*/
Matcher matcher2 = ptrn2.matcher(nextPiece);
String lastword = matcher2.group();
The last line gives me an error, apparently because nothing matches the pattern ("\\s.*\\t"). Something is wrong with this last regular expression, which is intended to say "any number of spaces, followed by any number of characters, followed by a tab". I have not been able to find out what is wrong with it, though. I have tried ("\\s*.+\\t"), ("\\s*.*\\t"), and ("\s+.+\\t"); still no luck.
Later on, per recommendations below, I simplified the code and included the sample string in it. As follows:
import acm.program.*;
import acm.util.*;
import java.util.*;
import java.io.*;
import java.util.regex.*;
public class Untabify extends ConsoleProgram {
public void run(){
String s = "Be plain,\tgood son,\tand homely\tin thy drift.\tRiddling\tconfession\tfinds but riddling\tshrift. ";
Pattern ptrn1 =Pattern.compile(".*?\t", Pattern.DOTALL);
Pattern ptrn2 = Pattern.compile("[^\\s+]\t", Pattern.DOTALL);
String nextPiece;
Matcher matcher1 = ptrn1.matcher(s);
while (matcher1.find()){
nextPiece = matcher1.group();
println(nextPiece);
Matcher matcher2 = ptrn2.matcher(nextPiece);
println(matcher2.group());
}
}
}
The program variably crashes, first at "println(matcher2.group())"; and on the next run on "public void run()" with the message: "Debug Current Instruction Pointer" (what is the meaning of it?).
It would be useful to see a sample string. If you just want the last word before the tab, then you can use this:
([^\s]+)\t
Note the () are to put the last word in a group. [^\s]+ means 1 or more non-space.
You do not need to double-escape the tab character (i.e. \\t); \t will do fine. \t is interpreted as a tab character by the java String parser, and that tab character is sent to the regex parser, which interprets it as a tab character. You can see this answer for more information.
Also, you should use Pattern.DOTALL, not Pattern.Dotall.
The pattern "\\s.*\\t" must match a single whitespace character (\s) followed by 0 or more characters (.*) followed by a single tab (\t). If you want to capture the last word and a trailing tab you should use the word boundary escape \b
Pattern.compile("\\b.*\\b\t");
You could replace the . above to use \w or whatever your definition of a word character is if you don't want to match any character.
Here's the code you'd use to match any word immediately before a tab:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegEx {
public static void main(String args[]) {
String text = "ab cd\t ef gh\t ij";
Pattern pattern = Pattern.compile("\\b(\\w+)\\b\t", Pattern.DOTALL);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
}
}
The above will output
cd
gh
See the Regular Expression Tutorial, especially the sections on Predefined Character Classes and Boundary Matchers for more information.
You can get more detail and experiment with this regular expression on Regex101.
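One more thing worth checking in the question's code: Matcher.group() throws an IllegalStateException unless a match attempt such as find() or matches() has succeeded first, and matcher2.group() is called there without one. Here is a minimal sketch using the sample string from the question and the ([^\s]+)\t pattern suggested above; the class name LastWordBeforeTab and the helper wordsBeforeTabs are hypothetical, and plain System.out is used instead of the ACM ConsoleProgram.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastWordBeforeTab {
    // Hypothetical helper: collects the run of non-whitespace
    // characters that sits directly before each tab.
    static List<String> wordsBeforeTabs(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = Pattern.compile("([^\\s]+)\t").matcher(text);
        // find() must succeed before group() may be called
        while (m.find()) {
            words.add(m.group(1));
        }
        return words;
    }

    public static void main(String[] args) {
        String s = "Be plain,\tgood son,\tand homely\tin thy drift.\tRiddling\tconfession\tfinds but riddling\tshrift. ";
        for (String word : wordsBeforeTabs(s)) {
            // The word length is what an untabify step would need
            System.out.println(word + " (" + word.length() + ")");
        }
    }
}
```

Note that [^\s]+ keeps trailing punctuation, so the first capture is "plain," rather than "plain"; swap in \w+ if punctuation should not count as part of the word.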

regex last word in a sentence ending with punctuation (period)

I'm looking for the regex pattern, not the Java code, to match the last word in an English (or European language) sentence. If the last word is, in this case, "hi" then I want to match "hi" and not "hi."
The regex (\w+)\.$ will match "hi.", whereas the output should be just "hi". What's the correct regex?
thufir@dur:~/NetBeansProjects/regex$
thufir@dur:~/NetBeansProjects/regex$ java -jar dist/regex.jar
trying
a b cd efg hi
matches:
hi
trying
a b cd efg hi.
matches:
thufir@dur:~/NetBeansProjects/regex$
code:
package regex;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String matchesLastWordFine = "a b cd efg hi";
lastWord(matchesLastWordFine);
String noMatchFound = matchesLastWordFine + ".";
lastWord(noMatchFound);
}
private static void lastWord(String sentence) {
System.out.println("\n\ntrying\n" + sentence + "\nmatches:");
Pattern pattern = Pattern.compile("(\\w+)$");
Matcher matcher = pattern.matcher(sentence);
String match = null;
while (matcher.find()) {
match = matcher.group();
System.out.println(match);
}
}
}
My code is in Java, but that's neither here nor there. I'm strictly looking for the regex, not the Java code. (Yes, I know it's possible to strip out the last character with Java.)
What regex should I put in the pattern?
You can use a lookahead assertion. For example, to match the sentence without the period:
[\w\s]+(?=\.)
and to match just the last word (the word before the "."):
[\w]+(?=\.)
If you need to have the whole match be the last word you can use lookahead.
\w+(?=(\.))
This matches a set of word characters that are followed by a period, without matching the period.
If you want the last word in the line, regardless of whether the line ends at the end of a sentence or not, you can use:
\w+(?=(\.?$))
Or if you want to also include ,!;: etc then
\w+(?=(\p{Punct}?$))
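For anyone who wants to try the lookahead from Java, here is a small sketch. The class name LastWordDemo and the helper lastWord are illustrative; the pattern is the \p{Punct} variant above with the inner capturing group dropped, since only the whole match is needed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastWordDemo {
    // Word characters followed by optional punctuation and end of input.
    // The lookahead checks the punctuation without consuming it.
    private static final Pattern LAST_WORD = Pattern.compile("\\w+(?=\\p{Punct}?$)");

    // Hypothetical helper: returns the last word, or null if there is none.
    static String lastWord(String sentence) {
        Matcher m = LAST_WORD.matcher(sentence);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(lastWord("a b cd efg hi"));
        System.out.println(lastWord("a b cd efg hi."));
        System.out.println(lastWord("a b cd efg hi!"));
    }
}
```

All three calls print hi, because the period or exclamation mark is only inspected by the lookahead and never becomes part of the match.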
You can use matcher.group(1) to get the content of the first capturing group ((\w+) in your case); matcher.group(0) returns the full match. So your regex is almost correct. One caveat concerns your use of $, which matches only at the end of the line. Use it only if your sentence fills the line exactly!
With the regular expression (\w+)\p{Punct} you get a group count of 1: matcher.group(0) holds the word with its punctuation and matcher.group(1) holds the word without it.
To write the regular expression in Java, use: "(\\w+)\\p{Punct}"
To test your regular expressions online with Java (and actually a lot of other languages) see RegexPlanet
By using the $ operator you will only get a match at the end of a line. So if you have multiple sentences on one line you will not get a match in the middle one.
So you should just use:
(\w+)\.
the capture group will give the correct match.
You can see an example here
I don't understand why really, but this works:
package regex;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String matchesLastWordFine = "a b cd efg hi";
lastWord(matchesLastWordFine);
String noMatchFound = matchesLastWordFine + ".";
lastWord(noMatchFound);
}
private static void lastWord(String sentence) {
System.out.println("\n\ntrying\n" + sentence + "\nmatches:");
Pattern pattern = Pattern.compile("(\\w+)"); //(\w+)\.
Matcher matcher = pattern.matcher(sentence);
String match = null;
while (matcher.find()) {
match = matcher.group();
}
System.out.println(match);
}
}
I guess regex \w+ will match all the words (doh). Then the last word is what I was after. Too simple, really, I was trying to exclude punctuation, but I guess regex does that automagically for you..?

Java Regex to Validate Full Name allow only Spaces and Letters

I want a regex that validates only letters and spaces; basically, it is for validating a full name. Ex: Mr Steve Collins or Steve Collins. I tried the regex "[a-zA-Z]+\.?" but it didn't work. Can someone assist me, please?
p.s. I use Java.
public static boolean validateLetters(String txt) {
String regx = "[a-zA-Z]+\\.?";
Pattern pattern = Pattern.compile(regx,Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(txt);
return matcher.find();
}
What about:
Peter Müller
François Hollande
Patrick O'Brian
Silvana Koch-Mehrin
Validating names is a difficult problem, because valid names do not consist only of the letters A-Z.
At least you should use the Unicode property for letters and add more special characters. A first approach could be e.g.:
String regx = "^[\\p{L} .'-]+$";
\\p{L} is a Unicode Character Property that matches any kind of letter from any language
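A minimal sketch of how this first approach behaves on the names listed above. The class name NameDemo and the helper isValidName are made up for illustration, and the non-ASCII letters are written as Unicode escapes only so the source file does not depend on its encoding.

```java
import java.util.regex.Pattern;

public class NameDemo {
    // Any Unicode letter, plus space, dot, apostrophe and hyphen.
    private static final Pattern NAME = Pattern.compile("^[\\p{L} .'-]+$");

    // Hypothetical helper: true if the whole string matches the pattern.
    static boolean isValidName(String name) {
        return NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "Peter M\u00fcller",        // Peter Müller
            "Fran\u00e7ois Hollande",   // François Hollande
            "Patrick O'Brian",
            "Silvana Koch-Mehrin",
            "Steve Collins 3"           // contains a digit
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isValidName(s));
        }
    }
}
```

This is still a permissive check: a string made up only of spaces or dots also passes, so treat it as a starting point rather than a complete validator.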
try this regex (allowing Alphabets, Dots, Spaces):
"^[A-Za-z\s]{1,}[\.]{0,1}[A-Za-z\s]{0,}$" //regular
"^\pL+[\pL\pZ\pP]{0,}$" //unicode
This will also ensure DOT never comes at the start of the name.
For those who use java/android and struggle with this matter try:
"^\\p{L}+[\\p{L}\\p{Z}\\p{P}]{0,}"
This works with names like
José Brasão
You could even try this expression ^[a-zA-Z\\s]*$ for checking a string with only letters and spaces (nothing else).
For me it worked. Hope it works for you as well.
Or go through this piece of code once:
CharSequence inputStr = expression;
Pattern pattern = Pattern.compile(new String ("^[a-zA-Z\\s]*$"));
Matcher matcher = pattern.matcher(inputStr);
if(matcher.matches())
{
//if pattern matches
}
else
{
//if pattern does not matches
}
please try this regex (allow only Alphabets and space)
"[a-zA-Z][a-zA-Z ]*"
if you want it for IOS then,
NSString *yourstring = @"hello";
NSString *Regex = @"[a-zA-Z][a-zA-Z ]*";
NSPredicate *TestResult = [NSPredicate predicateWithFormat:@"SELF MATCHES %@",Regex];
if ([TestResult evaluateWithObject:yourstring] == true)
{
// validation passed
}
else
{
// invalid name
}
Regex pattern for matching only alphabets and white spaces:
String regexUserName = "^[A-Za-z\\s]+$";
Accept only character with space :-
if (!(Pattern.matches("^[\\p{L} .'-]+$", name.getText()))) {
JOptionPane.showMessageDialog(null, "Please enter a valid character", "Error", JOptionPane.ERROR_MESSAGE);
name.setFocusable(true);
}
My personal choice is:
^\p{L}+[\p{L}\p{Pd}\p{Zs}']*\p{L}+$|^\p{L}+$, Where:
^\p{L}+ - It should start with 1 or more letters.
[\p{Pd}\p{Zs}'\p{L}]* - It can have letters, space character (including invisible), dash or hyphen characters and ' in any order 0 or more times.
\p{L}+$ - It should finish with 1 or more letters.
|^\p{L}+$ - Or it just should contain 1 or more letters (It is done to support single letter names).
Support for dots (full stops) was dropped, as in British English it can be dropped in Mr or Mrs, for example.
To validate for only letters and spaces, try this
String name1_exp = "^[a-zA-Z]+[\\-'\\s]?[a-zA-Z ]+$";
Validates such values as:
"", "FIR", "FIR ", "FIR LAST"
/^[A-Za-z]*$|^[A-Za-z]+\s[A-Za-z]*$/
check this out.
String name validation only accept alphabets and spaces
public static boolean validateLetters(String txt) {
String regx = "^[a-zA-Z\\s]+$";
Pattern pattern = Pattern.compile(regx,Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(txt);
return matcher.find();
}
To support languages like Hindi, which can contain \p{Mark} characters in between letters, my solution is:
^[\p{L}\p{M}]+([\p{L}\p{Pd}\p{Zs}'.]*[\p{L}\p{M}])+$|^[\p{L}\p{M}]+$
You can find all the test cases for this here
https://regex101.com/r/3XPOea/1/tests
@amal, this code will meet your requirement. Only letters and spaces are allowed, no digits. Note that the pattern also accepts leading or trailing spaces. "^" denotes the beginning of the line and "$" denotes the end of the line.
public static boolean validateLetters(String txt) {
String regx = "^[a-zA-Z ]+$";
Pattern pattern = Pattern.compile(regx,Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(txt);
return matcher.find();
}
Try with this:
public static boolean userNameValidation(String name){
return name.matches("(?i)(^[a-z])((?![? .,'-]$)[ .]?[a-z]){3,24}$");
}
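For Java, you can use the pattern below for name validation, which is built from Alpha (letters) and Blank (spaces or tabs). It matches any character that is neither, so a name is valid when the pattern finds nothing:
"[^\\p{Alpha}\\p{Blank}]"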
Can get a reference from Wikipedia for ASCII values also.

Regex to match only commas not in parentheses?

I have a string that looks something like the following:
12,44,foo,bar,(23,45,200),6
I'd like to create a regex that matches the commas, but only the commas that are not inside of parentheses (in the example above, all of the commas except for the two after 23 and 45). How would I do this (Java regular expressions, if that makes a difference)?
Assuming that there can be no nested parens (otherwise, you can't use a Java Regex for this task because recursive matching is not supported):
Pattern regex = Pattern.compile(
", # Match a comma\n" +
"(?! # only if it's not followed by...\n" +
" [^(]* # any number of characters except opening parens\n" +
" \\) # followed by a closing parens\n" +
") # End of lookahead",
Pattern.COMMENTS);
This regex uses a negative lookahead assertion to ensure that the next following parenthesis (if any) is not a closing parenthesis. Only then the comma is allowed to match.
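The same lookahead can also be used to split the string, written here in compact form without Pattern.COMMENTS. This is a sketch under the same no-nested-parentheses assumption; the class name SplitOutsideParens and the helper of the same name are illustrative.

```java
import java.util.Arrays;
import java.util.List;

public class SplitOutsideParens {
    // Hypothetical helper: splits on commas that are not followed by
    // "non-( characters and then a closing parenthesis".
    static List<String> splitOutsideParens(String s) {
        return Arrays.asList(s.split(",(?![^(]*\\))"));
    }

    public static void main(String[] args) {
        // Sample string from the question
        System.out.println(splitOutsideParens("12,44,foo,bar,(23,45,200),6"));
    }
}
```

The two commas inside (23,45,200) are skipped because a closing parenthesis is reached before any opening one, so the group stays together as a single element.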
Paul, resurrecting this question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
Also the existing solution checks that the comma is not followed by a parenthesis, but that does not guarantee that it is embedded in parentheses.
The regex is very simple:
\(.*?\)|(,)
The left side of the alternation matches complete set of parentheses. We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right commas because they were not matched by the expression on the left.
In this demo, you can see the Group 1 captures in the lower right pane.
You said you want to match the commas, but you can use the same general idea to split or replace.
To match the commas, you need to inspect Group 1. This full program's only goal in life is to do just that.
import java.util.*;
import java.io.*;
import java.util.regex.*;
import java.util.List;
class Program {
public static void main (String[] args) throws java.lang.Exception {
String subject = "12,44,foo,bar,(23,45,200),6";
Pattern regex = Pattern.compile("\\(.*?\\)|(,)");
Matcher regexMatcher = regex.matcher(subject);
List<String> group1Caps = new ArrayList<String>();
// put Group 1 captures in a list
while (regexMatcher.find()) {
if(regexMatcher.group(1) != null) {
group1Caps.add(regexMatcher.group(1));
}
} // end of building the list
// What are all the matches?
System.out.println("\n" + "*** Matches ***");
if(group1Caps.size()>0) {
for (String match : group1Caps) System.out.println(match);
}
} // end main
} // end Program
Here is a live demo
To use the same technique for splitting or replacing, see the code samples in the article in the reference.
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
I don’t understand this obsession with regular expressions, given that they are unsuited to most tasks they are used for.
String beforeParen = longString.substring(0, longString.indexOf('(')) + longString.substring(longString.indexOf(')') + 1);
int firstComma = beforeParen.indexOf(',');
while (firstComma != -1) {
/* do something. */
firstComma = beforeParen.indexOf(',', firstComma + 1);
}
(Of course this assumes that there is always exactly one opening parenthesis and one matching closing parenthesis somewhere after it.)
