Java regEx URL matching issue - java

and as usual thank you in advance.
I am trying to familiarize myself with regEx and I am having an issue matching a URL.
Here is an example URL:
www.examplesite.com/dir/2012/06/19/title-of-some-story/FAQKZjC3veXSalP9zxFgZP/htmlpage.html
here is what my regex breakdown looks like:
[site]/[dir]*?/[year]/[month]/[day]/[storyTitle]?/[id]/htmlpage.html
the [id] is a string 22 characters in length that can be either uppercase or lowercase letters, as well as numbers. However, I do not want to extract that from the URL. Just clarifying
Now, I need to extract two values from this url.
First,
I need to extract the dirs(s). However, the [dir] is optional, but also can be as many as wanted. In other words that parameter could not be there, or it could be dir1/dir2/dir3 ..etc . So, going off my first example :
www.examplesite.com/dir1/dir2/dir3/2012/06/19/title-of-some-story/FAQKZjC3veXSalP9zxFgZP/htmlpage.html
Here I would need to extract dir1/dir2/dir3 where a dir is a string that is a single word with all lowercase letters (ie sports/mlb/games). There are no numbers in the dir, only using that as an example.
But in this example of a valid URL:
www.examplesite.com/2012/06/19/title-of-some-story/FAQKZjC3veXSalP9zxFgZP/htmlpage.html
There is no [dir] so I would not extract anything. thus, the [dir] is optional
Secondly,
I need to extract the [storyTitle] where the [storyTitle] is also optional just like the [dir] above, but however if there is a storyTitle there can only be one.
So going off my previous examples
www.examplesite.com/dir/2012/06/19/title-of-some-story/FAQKZjC3veXSalP9zxFgZP/htmlpage.html
would be valid where I need to extract 'title-of-some-story' where story titles are dash separated strings that are always lowercase. The example belowis also valid:
www.examplesite.com/dir/2012/06/19/FAQKZjC3veXSalP9zxFgZP/htmlpage.html
In the above example, there is no [storyTitle] thus making it optional
Lastly, just to be thorough, a URL without a [dir] and without a [storyTitle] are also valid. Example:
www.examplesite.com/2012/06/19/FAQKZjC3veXSalP9zxFgZP/htmlpage.html
Is a valid URL. Any input would be helpful I hope I am clear.

Here is one example that will work.
public static void main(String[] args) {
Pattern p = Pattern.compile("(?:http://)?.+?(/.+?)?/\\d+/\\d{2}/\\d{2}(/.+?)?/\\w{22}");
String[] strings ={
"www.examplesite.com/dir1/dir2/4444/2012/06/19/title-of-some-story/FAQKZjC3veXSalP9zxFgZP/htmlpage.html",
"www.examplesite.com/2012/06/19/title-of-some-story/FAQKZjC3veXSalP9zxFgZP/htmlpage.html",
"www.examplesite.com/dir/2012/06/19/title-of-some-story/FAQKZjC3veXSalP9zxFgZP/htmlpage.html",
"www.examplesite.com/dir/2012/06/19/FAQKZjC3veXSalP9zxFgZP/htmlpage.html",
"www.examplesite.com/2012/06/19/FAQKZjC3veXSalP9zxFgZP/htmlpage.html"
};
for (int idx = 0; idx < strings.length; idx++) {
Matcher m = p.matcher(strings[idx]);
if (m.find()) {
String dir = m.group(1);
String title = m.group(2);
if (title != null) {
title = title.substring(1); // remove the leading /
}
System.out.println(idx+": Dir: "+dir+", Title: "+title);
}
}
}

Here is an all regex solution.
Edit: Allows for http://
Java source:
import java.util.*;
import java.lang.*;
import java.util.regex.*;
class Main
{
public static void main (String[] args) throws java.lang.Exception
{
String url = "http://www.examplesite.com/dir/2012/06/19/title-of-some-story/FAQKZjC3veXSalP9zxFgZP/htmlpage.html";
String url2 = "www.examplesite.com/dir/dir2/dir3/2012/06/19/FAQKZjC3veXSalP9zxFgZP/htmlpage.html";
String url3 = "www.examplesite.com/2012/06/19/title-of-some-story/FAQKZjC3veXSalP9zxFgZP/htmlpage.html";
String patternStr = "(?:http://)?[^/]*[/]?([\\S]*)/[\\d]{4}/[\\d]{2}/[\\d]{2}[/]?([\\S]*)/[\\S]*/[\\S]*";
// Compile regular expression
Pattern pattern = Pattern.compile(patternStr);
// Match 1st url
System.out.println("Match 1st URL:");
Matcher matcher = pattern.matcher(url);
if (matcher.find()) {
System.out.println("URL: " + matcher.group(0));
System.out.println("DIR: " + matcher.group(1));
System.out.println("TITLE: " + matcher.group(2));
}
else{ System.out.println("No match."); }
// Match 2nd url
System.out.println("\nMatch 2nd URL:");
matcher = pattern.matcher(url2);
if (matcher.find()) {
System.out.println("URL: " + matcher.group(0));
System.out.println("DIR: " + matcher.group(1));
System.out.println("TITLE: " + matcher.group(2));
}
else{ System.out.println("No match."); }
// Match 3rd url
System.out.println("\nMatch 3rd URL:");
matcher = pattern.matcher(url3);
if (matcher.find()) {
System.out.println("URL: " + matcher.group(0));
System.out.println("DIR: " + matcher.group(1));
System.out.println("TITLE: " + matcher.group(2));
}
else{ System.out.println("No match."); }
}
}
Output:
Match 1st URL:
URL: http://www.examplesite.com/dir/2012/06/19/title-of-some-story/FAQKZjC3veXSalP9zxFgZP/htmlpage.html
DIR: dir
TITLE: title-of-some-story
Match 2nd URL:
URL: www.examplesite.com/dir/dir2/dir3/2012/06/19/FAQKZjC3veXSalP9zxFgZP/htmlpage.html
DIR: dir/dir2/dir3
TITLE:
Match 3rd URL:
URL: www.examplesite.com/2012/06/19/title-of-some-story/FAQKZjC3veXSalP9zxFgZP/htmlpage.html
DIR:
TITLE: title-of-some-story

Related

Regex to grab validate email address in complete XML string or normal string

Need to grab string text of email value in big XML/normal string.
Been working with Regex for it and as of now below Regex is working correctly for normal String
Regex : ^[\\w!#$%&'*+/=?`{|}~^-]+(?:\\.[\\w!#$%&'*+/=?`{|}~^-]+)*#(?:[a-zA-Z0-9-]+\\.)+[a-zA-Z]{1,6}$
Text : paris#france.c
but in case when above text is enclosed in XML tag it fails to return.
<email>paris#france.c</email>
I am trying to amend some change to this regex so that it will work for both of the scenarios
You have put ^ at the beginning which means the "Start of the string", and $ at the end which means the "End of the string". Now, look at your string:
<email>paris#france.c</email>
Do you think, it starts and ends with an email address?
I have removed them and also escaped the - in your regex. Here you can check the following auto-generated Java code with the updated regex.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Example {
public static void main(String[] args) {
final String regex = "[\\w!#$%&'*+/=?`\\{|\\}~^\\-]+(?:\\\\.[\\w!#$%&'*+/=?`\\{|\\}~^\\-]+)*#(?:[a-zA-Z0-9-]+\\.)+[a-zA-Z]{1,6}";
final String string = "paris#france.c\n"
+ "<email>paris#france.c</email>";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}
}
}
Output:
Full match: paris#france.c
Full match: paris#france.c

Regex seperate 2 numbers by komma

I'm trying to make a regex to allow only a case of a number then "," and another number or same case seperated by ";" like
57,1000
57,1000;6393,1000
So far i made this: Pattern.compile("\\b[0-9;,]{1,5}?\\d+;([0-9]{1,5},?)+").matcher("57,1000").find();
which work if case is 57,1000;6393,1000 but it also allow letters and don't work when case 57,1000
try Regex "(\d+,\d+(;\d+,\d+)?)"
#Test
void regex() {
Pattern p = Pattern.compile("(\\d+,\\d+)(;\\d+,\\d+)?");
Assertions.assertTrue(p.matcher("57,1000").matches());
Assertions.assertTrue(p.matcher("57,1000;6393,1000").matches());
}
How about like this. Just look for two numbers separated by a comma and capture them.
String[] data = {"57,1000",
"57,1000;6393,1000"};
Pattern p = Pattern.compile("(\\d+),(\\d+)");
for (String str : data) {
System.out.println("For String : " + str);
Matcher m = p.matcher(str);
while (m.find()) {
System.out.println(m.group(1) + " " + m.group(2));
}
System.out.println();
}
prints
For String : 57,1000
57 1000
For String : 57,1000;6393,1000
57 1000
6393 1000
If you just want to match those, you can do the following: It matches a single instance of the string followed by an optional one preceded by a semi-colon.
String regex = "(\\d+,\\d+)(;(\\d+,\\d+))?";
for (String str : data) {
System.out.println("Testing String " + str + " : " +str.matches(regex));
}
prints
Testing String 57,1000 : true
Testing String 57,1000;6393,1000 : true

java regex take variable between two tag

I am very new in regex and need your help. I wanna take numbers and letters between two span.
<span>454.000 $</span>
I wanna take 454.000 $. There are 12 space before . Please help me.
This Should Work.
Regexp:
\s+<.+>(.+)<.+>
Input:
<span>454.000 $</span>
Output:
454.000 $
JAVA CODE:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String regex = "\\s+<.+>(.+)<.+>";
final String string = " <span>454.000 $</span>";
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}
See: https://regex101.com/r/2zg5Ws/1
Capturing group using pattern matching is something like below
String x = " <span>454.000 $</span> ";
Pattern p = Pattern.compile("<span>(.*?)</span>");
Matcher m = p.matcher(x);
if (m.find()) {
System.out.println(">> "+ m.group(1)); // output 454.000 $
}
But for such cases I always prefer to use the replaceAll() as it is shorter version of code:
String num = x.replaceAll(".*<span>(.*?)</span>.*", "$1");
// num has 454.000 $
For the replace it is actually capturing the group from the text and replacing the whole text with that group ($1). This solution depends upon how your input string is.

Regex to match [[Wikipedia:Manual of Style#Links|]] # in java

I have been trying to match the following string -
String temp = "[[Wikipedia:Manual of Style#Links|]]" ;
with the regex
boolean a = temp.matches("\\[\\[Wikipedia:[a-zA-Z_0-9]*#[a-zA-Z_0-9]*\\|\\]\\]");
"\\[\\[Wikipedia:(.*?)#(.*?)\\|\\]\\]"
"\\[\\[Wikipedia:(.*)*#(.+)*\\|\\]\\]"
"\\[\\[(.*?)#(.*?)\\|\\]\\]"
But none of them are giving any positive matches.
Straight away I can see a problem: you are using a character class without a space to match input with spaces.
Try this:
boolean a = temp.matches("\\[\\[Wikipedia:[\\w ]*#[\\w ]+\\|\\]\\]");
Note that [a-zA-Z_0-9] can be replaced by [\w] (but would include letters/numbers from all languages, which should be fine)
public static void main(String[] args) {
String temp = "[[Wikipedia:Manual of Style#Links|]]";
Pattern pattern = Pattern.compile("\\[\\[Wikipedia:([\\w ]+)#([\\w ]+)\\|\\]\\]");
Matcher matcher = pattern.matcher(temp);
if(matcher.find()) {
System.out.println("Manual of Style: " + matcher.group(1));
System.out.println("links : " + matcher.group(2));
}
}
or
temp.matches("\\[\\[Wikipedia:([\\w ]+)#([\\w ]+)\\|\\]\\]");
Just add a space to your custom character class:
String temp = "[[Wikipedia:Manual of Style#Links|]]" ;
temp.matches("\\[\\[Wikipedia:[a-zA-Z_0-9 ]*#[a-zA-Z_0-9]*\\|\\]\\]"); //true

How can I recognize the indefinite articles "a" or "an"?

My task is to devise a regular expression that will recognize the indefinite article in English – the word “a” or “an” i.e. to write a regular expression to identify the word a or the word an. I must test the expression by writing a test driver which reads a file containing approximately ten lines of text. Your program should count the occurrences of the words “a” and “an”. I shall not match the characters a and an in words such as than.
This is my code so far:
import java.io.IOException;
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class RegexeFindText {
public static void main(String[] args) throws IOException {
// Input for matching the regexe pattern
String file_name = "Testing.txt";
ReadFile file = new ReadFile(file_name);
String[] aryLines = file.OpenFile();
String asString = Arrays.toString(aryLines);
// Regexe to be matched
String regexe = ""; //<<--this is where the problem lies
int i;
for ( i=0; i < aryLines.length; i++ ) {
System.out.println( aryLines[ i ] ) ;
}
// Step 1: Allocate a Pattern object to compile a regexe
Pattern pattern = Pattern.compile(regexe);
//Pattern pattern = Pattern.compile(regexe, Pattern.CASE_INSENSITIVE);
// case- insensitive matching
// Step 2: Allocate a Matcher object from the compiled regexe pattern,
// and provide the input to the Matcher
Matcher matcher = pattern.matcher(asString);
// Step 3: Perform the matching and process the matching result
// Use method find()
while (matcher.find()) { // find the next match
System.out.println("find() found the pattern \"" + matcher.group()
+ "\" starting at index " + matcher.start()
+ " and ending at index " + matcher.end());
}
// Use method matches()
if (matcher.matches()) {
System.out.println("matches() found the pattern \"" + matcher.group()
+ "\" starting at index " + matcher.start()
+ " and ending at index " + matcher.end());
} else {
System.out.println("matches() found nothing");
}
// Use method lookingAt()
if (matcher.lookingAt()) {
System.out.println("lookingAt() found the pattern \"" + matcher.group()
+ "\" starting at index " + matcher.start()
+ " and ending at index " + matcher.end());
} else {
System.out.println("lookingAt() found nothing");
}
}
}
What do I have to use to find those words within my text?
Here's the regex that will match "a" or "an":
String regex = "\\ban?\\b";
Let's break that regex down:
\b means word boundary (a single back slash is written as "\\" in java)
a is simply a literal "a"
n? means zero or one literal "n"

Categories