Java - Regex problem


I have a list of URLs of type
http://www.example.com/pk/ca,
http://www.example.com/pk,
http://www.example.com/anthingcangoeshere/pk, and
http://www.example.com/pkisnotnecessaryhere.
Now I want to find only those URLs that end with /pk or /pk/ and have nothing between .com and /pk.

Your problem isn't fully defined, so I can't give you an exact answer, but this should be a usable starting point:
^[^:]+://[^/]+\.com/pk/?$
These strings will match:
http://www.example.com/pk
http://www.example.com/pk/
https://www.example.com/pk
These strings won't match:
http://www.example.co.uk/pk
http://www.example.com/pk/ca
http://www.example.com/anthingcangoeshere/pk
http://www.example.com/pkisnotnecessaryhere
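A minimal Java sketch of applying the pattern above to the URLs from the question. The class and method names (PkUrlFilter, isPkUrl) are made up for illustration, and List.of assumes Java 9 or later.

```java
import java.util.List;
import java.util.regex.Pattern;

final class PkUrlFilter {
    // Same pattern as above; backslashes are doubled for the Java string literal.
    private static final Pattern PK_URL =
        Pattern.compile("^[^:]+://[^/]+\\.com/pk/?$");

    // True only if the whole URL is scheme://host.com/pk or scheme://host.com/pk/
    static boolean isPkUrl(String url) {
        return PK_URL.matcher(url).matches();
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "http://www.example.com/pk/ca",
            "http://www.example.com/pk",
            "http://www.example.com/anthingcangoeshere/pk",
            "http://www.example.com/pkisnotnecessaryhere");
        for (String url : urls) {
            System.out.println(url + " -> " + isPkUrl(url));
        }
    }
}
```

Compiling the pattern once into a static field avoids recompiling it for every URL in the list.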

String pattern = "^http://www\\.example\\.com/pk/?$";
Hope this helps.
Some details, assuming the pattern is applied with Matcher.find(): if you don't add ^ to the beginning of the pattern, then foobarhttp://www.example.com/pk/ will be accepted too. If you don't add $ to the end of the pattern, then http://www.example.com/pk/foobar will be accepted too.
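The effect of the anchors can be checked with a small sketch. It assumes the search is done with Matcher.find(), which looks for the regex anywhere in the input (String.matches() and Matcher.matches() always test the whole string, so the anchors are implicit there). AnchorDemo and found are hypothetical names.

```java
import java.util.regex.Pattern;

final class AnchorDemo {
    // find() succeeds if the regex matches any substring of the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String junkBefore = "foobarhttp://www.example.com/pk/";
        String junkAfter = "http://www.example.com/pk/foobar";

        // Without ^ the leading "foobar" is simply skipped over.
        System.out.println(found("http://www\\.example\\.com/pk/?$", junkBefore));
        System.out.println(found("^http://www\\.example\\.com/pk/?$", junkBefore));

        // Without $ the trailing "foobar" is simply ignored.
        System.out.println(found("^http://www\\.example\\.com/pk/?", junkAfter));
        System.out.println(found("^http://www\\.example\\.com/pk/?$", junkAfter));
    }
}
```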

Directly translating your request "[...] URLs that ends with /pk or /pk/ and don't have anything in between .com and /pk", with the additional assumption that there shall always be a ".com", yields this regex:
If you use find():
\.com/pk/?$
If you use matches():
.*\.com/pk/?
The other answers here give more restrictive patterns that allow only URLs closer to your examples. In particular, my pattern does not validate that the given string is a syntactically valid URL.
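A short sketch of the two variants side by side, to show that they accept the same inputs. FindVsMatches, viaFind and viaMatches are illustrative names only.

```java
import java.util.regex.Pattern;

final class FindVsMatches {
    // For find(): the regex may match anywhere, so $ pins it to the end.
    private static final Pattern FOR_FIND = Pattern.compile("\\.com/pk/?$");
    // For matches(): the whole input must match, so .* absorbs the prefix.
    private static final Pattern FOR_MATCHES = Pattern.compile(".*\\.com/pk/?");

    static boolean viaFind(String url) {
        return FOR_FIND.matcher(url).find();
    }

    static boolean viaMatches(String url) {
        return FOR_MATCHES.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.example.com/pk",
            "http://www.example.com/pk/",
            "http://www.example.com/pk/ca",
            "not a url at all.com/pk"  // accepted: the pattern does not validate URLs
        };
        for (String s : samples) {
            System.out.println(s + " -> " + viaFind(s) + " / " + viaMatches(s));
        }
    }
}
```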

String pattern = "^https?://(www\\.)?[^/]+\\.com/pk/?$";

Related

RegExp for File URL

I wrote a RegExp to validate a file URL:
file:/{2,3}[a-zA-Z0-9]*\\|?/.*
for URLs like
file:/D:/workspace/project/build/libs/myjar-1.0.jar
But it doesn't work.
I am looking for a pattern that matches only URLs like this one and no others.
The pattern should return false for URLs like
file:/workspace/project/build/libs/myjar-1.0.jar
and
file:/D:/workspace/project/build/libs/myjar-1.0
These should not match.
Please help.

Complete question rewrite
Given the OP's updated criteria, the regex you're looking for is file\:\/\w\:\/[^\s]*\.jar; make sure the g (global) and i (case-insensitive) modifiers are enabled.
See a working example on Regex101
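If the check is done in Java rather than on Regex101, the same regex could be used roughly as follows. This is only a sketch: JarFileUrl and isJarFileUrl are made-up names, the i modifier becomes Pattern.CASE_INSENSITIVE, and there is no g modifier in Java, so matches() is used to test the whole string.

```java
import java.util.regex.Pattern;

final class JarFileUrl {
    // file:/ + one word character as drive letter + :/ + path without spaces + .jar
    // Note that \w also accepts digits and underscore; [a-zA-Z] would be stricter.
    private static final Pattern JAR_URL =
        Pattern.compile("file:/\\w:/\\S*\\.jar", Pattern.CASE_INSENSITIVE);

    static boolean isJarFileUrl(String url) {
        return JAR_URL.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(isJarFileUrl("file:/D:/workspace/project/build/libs/myjar-1.0.jar"));
        System.out.println(isJarFileUrl("file:/workspace/project/build/libs/myjar-1.0.jar"));
        System.out.println(isJarFileUrl("file:/D:/workspace/project/build/libs/myjar-1.0"));
    }
}
```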

If there is no context or rule your regex should be:
/file\:.*/i
See it here: http://regex101.com/r/yY5xG6

Regex to extract valid Http or Https

I'm currently having some issues with a regex to extract a URL.
I want my regex to take URLS such as:
http://stackoverflow.com/questions/ask
https://stackoverflow.com
http://local:1000
https://local:1000
Through some tutorials, I've learned that this regex will find all of the above: ^(http|https)\://.*$. However, it will also take http://local:1000;http://invalid http://khttp:// as a single string when it shouldn't take it at all.
I understand that my expression isn't written to exclude this, but my issue is I cannot think of how to write it so it checks for this scenario.
Any help is greatly appreciated!
Edit:
Looking at my issue, it seems I could eliminate it if I could check that '//' doesn't occur in my string after the initial http:// or https://. Any ideas on how to implement that?
Sorry, this will be done in Java.
I also need to add the following constraint: a string such as http://local:80/test:90 should fail because of the duplicate port. In other words, I need a constraint that allows only two : symbols in a valid string (one after http/s and one before the port).

This will only produce a match if there is no :// after its first appearance in the string.
^https?:\/\/(?!.*:\/\/)\S+
Note that trying to parse a valid url from within a string is very complex, see
In search of the perfect URL validation regex, so the above does not attempt to do that.
It will just match the protocol and following non-space characters.
In Java
Pattern reg = Pattern.compile("^https?:\\/\\/(?!.*:\\/\\/)\\S+");
Matcher m = reg.matcher("http://somesite.com");
if (m.find()) {
System.out.println(m.group());
} else {
System.out.println("No match");
}

Check whether your programming language already has a parser; for example, PHP has parse_url().
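Since the question is about Java, the corresponding built-in parser there is java.net.URI. The following is a rough sketch, not a complete validator: HttpUrlCheck and isHttpUrl are hypothetical names, and the "two colons only" rule from the question is not enforced, because a colon inside a path such as /test:90 is legal URI syntax.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class HttpUrlCheck {
    // Accepts strings that parse as a URI with an http/https scheme and a host.
    static boolean isHttpUrl(String candidate) {
        try {
            URI uri = new URI(candidate);
            String scheme = uri.getScheme();
            boolean httpScheme = "http".equalsIgnoreCase(scheme)
                || "https".equalsIgnoreCase(scheme);
            return httpScheme && uri.getHost() != null;
        } catch (URISyntaxException e) {
            // Thrown for illegal characters, e.g. the space in the bad example.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isHttpUrl("http://stackoverflow.com/questions/ask"));
        System.out.println(isHttpUrl("https://local:1000"));
        System.out.println(isHttpUrl("http://local:1000;http://invalid http://khttp://"));
    }
}
```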

From http://net.tutsplus.com/tutorials/other/8-regular-expressions-you-should-know/
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
This may change based on the programming language/tool

/[A-Za-z]+://[A-Za-z0-9-]+.[A-Za-z0-9-:%&;?#/.=]+/g

Need to write a java regex expression which matches urls with http or https but does not contains specific file extensions

I need to create a Java regex which matches URLs with http or https but does not match URLs with specific file extensions.
I can get the URLs with http or https using the following expression, but I am unable to complete the second part, which is eliminating URLs with certain extensions (e.g. js|css|jpg etc.).
I guess I need to use negations but I am not sure how to do it.
String regex = "\\s*(?i)(http|https)\\s*://\\s*((\"[^\"]*\"|'[^']*'|([^'\">\\s]+)))";
Please help me to modify this regex to meet this requirement.

An easy way to implement this in Java is to use the Pattern class (from java.util.regex). To accomplish what you're suggesting, you could use two separate regex objects to check the conditions for the URL. For example (using the string regex from your question):
Scanner in = new Scanner(System.in);
String input = in.nextLine();
Pattern one = Pattern.compile(regex);
Pattern two = Pattern.compile("([^\\s]+(\\.(?i)(js|css|jpg|etc))$)");
if(one.matcher(input).matches() && !two.matcher(input).matches())
System.out.println("It matches!");
else System.out.println("Nope!");
In short, using two Pattern objects makes your code more readable and easier to manage, since you're checking multiple aspects of a URL input string.
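For comparison, the negation the question asks about can also be expressed in one regex with a negative lookahead. This sketch deliberately uses a simplified URL pattern (scheme plus non-whitespace characters) instead of the regex from the question, and the names UrlExtensionFilter and isAllowed are illustrative only.

```java
import java.util.regex.Pattern;

final class UrlExtensionFilter {
    // (?i)             case-insensitive matching
    // (?!\S*\.(...)$)  fail if the rest of the string ends in a blocked extension
    private static final Pattern ALLOWED =
        Pattern.compile("(?i)^https?://(?!\\S*\\.(?:js|css|jpg)$)\\S+$");

    static boolean isAllowed(String url) {
        return ALLOWED.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("http://example.com/index.html"));
        System.out.println(isAllowed("https://example.com/app.js"));
        System.out.println(isAllowed("HTTP://EXAMPLE.COM/STYLE.CSS"));
    }
}
```

The lookahead only inspects the very end of the string, so a URL such as http://example.com/photo.jpg?size=large would still be accepted; whether that is acceptable depends on the inputs.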

You need an anchor to look behind - see regex to match url that should give you the expression you need. The regex you have currently will match malformed urls with disallowed characters.
Here's a good site to check your expressions: http://www.regexplanet.com/advanced/java/index.html

Regex for university emails

I am looking to validate email addresses by making sure they have a specific university subdomain, e.g. if the user says they attend Oxford University, I want to check that their email ends in .ox.ac.uk
If I have the '.ox.ac.uk' part stored as a variable, how can I incorporate this with a regex to check the whole email is valid and ends in that variable suffix?
Many thanks!

We are using this email pattern (derived from this regular-expressions.info article):
^[\w!#$%&'*+/=?^`{|}~-]+(?:\.[\w!#$%&'*+/=?^`{|}~-]+)*@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?$
You should be able to extend it with your needed suffix:
^[\w!#$%&'*+/=?^`{|}~-]+(?:\.[\w!#$%&'*+/=?^`{|}~-]+)*@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+ox\.ac\.uk$
Note that I replaced the TLD part [a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])? with your required suffix ox\.ac\.uk (\. is used to match a literal dot). The suffix's leading dot is dropped because the preceding group already ends with \.; keeping both would require two consecutive dots.
Edit: one additional note: if you use String#matches(...) or Matcher#matches() there's no need for the leading ^ and the trailing $, since the entire string would have to match anyways.
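To plug in a suffix that is stored in a variable, one option in Java is to build the regex from parts and escape the suffix with Pattern.quote(). The sketch below assumes the suffix starts with a dot (as in '.ox.ac.uk') and that at least one domain label must precede it; UniversityEmail and hasSuffix are made-up names.

```java
import java.util.regex.Pattern;

final class UniversityEmail {
    // Local part and domain label, taken from the email pattern above.
    private static final String LOCAL =
        "[\\w!#$%&'*+/=?^`{|}~-]+(?:\\.[\\w!#$%&'*+/=?^`{|}~-]+)*";
    private static final String LABEL =
        "[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?";

    // suffix is expected to start with a dot, e.g. ".ox.ac.uk"
    static boolean hasSuffix(String email, String suffix) {
        String regex = LOCAL + "@" + LABEL + "(?:\\." + LABEL + ")*"
            + Pattern.quote(suffix);  // dots in the suffix are matched literally
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
            .matcher(email)
            .matches();  // whole-string match, so no ^ or $ needed
    }

    public static void main(String[] args) {
        String suffix = ".ox.ac.uk";
        System.out.println(hasSuffix("jane.doe@cs.ox.ac.uk", suffix));
        System.out.println(hasSuffix("jane.doe@cam.ac.uk", suffix));
    }
}
```

With this construction an address directly at the bare domain, such as jane@ox.ac.uk, is rejected because it does not end in '.ox.ac.uk'; passing the suffix without its leading dot and adjusting the regex would be needed to allow it.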

Assuming you are using PHP:
$ending = '.ox.ac.uk';
if(preg_match('/'.preg_quote($ending).'$/i', $email_address)) //... your code
Further info: the preg_quote() is necessary so that characters get escaped if they have a special meaning. In your case it's the dots.
edit: To check if the whole email is valid, see other questions, it is asked a lot. Just wanted to help with your special case.

How to change this regex to properly extract tag attributes - should be simple

I need to "grab" an attribute of a custom HTML tag. I know this sort of question has been asked many times before, but regex really messes with my head, and I can't seem to get it working.
A sample of XML that I need to work with is
<!-- <editable name="nameValue"> --> - content goes here - <!-- </editable> -->
I want to be able to grab the value of the name attribute, which in this case is nameValue. What I have is shown below but this returns a null value.
My regex string (for a Java app, hence the \ to escape the ") is:
"(.)?<!-- <editable name=(\".*\")?> -->.*<!-- </editable> -->(.)?"
I am trying to grab the attribute together with its quotation marks, since I figure this is the easiest and most general pattern to match. Well, it just doesn't work; any help will help me keep my hair.

I don't think you need the (.)?s at the beginning and end of your regex. And you need to put in a capturing group for getting only the content-goes-here bit:
This worked for me:
String xml = "RANDOM STUFF<!-- <editable name=\"nameValue\"> --> - content goes here - <!-- </editable> -->RANDOM STUFF";
Pattern p = Pattern.compile("<!-- <editable name=(\".*\")?> -->(.*)<!-- </editable> -->");
Matcher m = p.matcher(xml);
if (m.find()) {
System.out.println(m.group(2));
} else {
System.out.println("no match found");
}
This prints:
- content goes here -

Your search is greedy. Use "\<\!-- \<editable name=\"(.*?)\"\> --\>.*?\<\!-- \<\/editable\> --\>" (added ?). Please note that this one will not work correctly with nested <editable> elements.
If you don't want to perform syntax checking, you could also simply go with: "\<\!-- \<editable name=\"(.*?)\"\> --\>" or even "\<editable name=\"(.*?)\"\>" for better simplicity and performance.
Edit: should be
Pattern re = Pattern.compile( "\\<editable name=\"(.*?)\"\\>" );

I use JavaScript, but it should help to make the expression non-greedy where possible and to use negated character classes instead of match-anything wildcards. I'm not sure how similar Java's regexps are, but instead of the expression \".*\" try \"[^\"]*\". That matches only characters that aren't a quote, so the expression can't run past the end of the attribute value.
Hope that helps
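In Java, the negated character class suggested above could look like the following sketch. It returns the value of the name attribute without the surrounding quotes; EditableName and extractName are illustrative names, and returning null when nothing matches is just one possible convention.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EditableName {
    // [^"]* stops at the first closing quote, so the capture cannot run
    // past the end of the attribute value.
    private static final Pattern NAME =
        Pattern.compile("<editable name=\"([^\"]*)\">");

    // Returns the first name attribute value, or null if there is none.
    static String extractName(String html) {
        Matcher m = NAME.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String xml = "<!-- <editable name=\"nameValue\"> --> - content goes here - <!-- </editable> -->";
        System.out.println(extractName(xml));
    }
}
```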

Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
You may find the answer using TagSoup helpful.

