Regular expression String for URL in JAVA - java

I would like to validate URLs in Java with a regular expression. I found this regex in a comment and tried to use it in my code as follows:
private static final String PATTERN_URL = "/((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+#)?[A-Za-z0-9.-]+(:[0-9]+)?|(?:www.|[-;:&=\+\$,\w]+#)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%#.\w_]*)#?(?:[\w]*))?)/";
.....
if (!urlString.matches(PATTERN_URL)) {
System.err.println("Invalid URL");
return false;
}
But I get a compile-time error at the declaration of PATTERN_URL. I don't know how to format it, and I am worried that modifying it will make the regex invalid. Can anyone fix it without changing what it matches? Thanks for your help.

The regex itself is usable, but it needs two changes to work as a Java string. First, escape every backslash so that it survives the string literal:
\ --> \\
Second, drop the leading and trailing /. Those are JavaScript regex delimiters; in Java they are matched as literal slashes, so matches() would reject ordinary URLs such as http://example.com.
Resulting in this:
"((([A-Za-z]{3,9}:(?:\\/\\/)?)(?:[-;:&=\\+\\$,\\w]+#)?[A-Za-z0-9.-]+(:[0-9]+)?|(?:www.|[-;:&=\\+\\$,\\w]+#)[A-Za-z0-9.-]+)((?:\\/[\\+~%\\/.\\w-_]*)?\\??(?:[-\\+=&;%#.\\w_]*)#?(?:[\\w]*))?)"
The Java compiler turns each \\ in the string literal back into a single \, so java.util.regex.Pattern receives exactly the regex you want. You can prove this by printing it:
System.out.println(Pattern.compile(PATTERN_URL));
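To see both effects in isolation, here is a minimal sketch that uses a deliberately tiny pattern instead of the full URL regex. The class name EscapeDemo and the pattern are made up for illustration only.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Hypothetical, deliberately simple pattern: word characters, a literal dot, word characters.
    // In Java source every regex backslash is written twice.
    static final String SIMPLE = "\\w+\\.\\w+";

    // The same pattern wrapped in JavaScript-style delimiters.
    // In Java the two slashes are ordinary characters that must appear in the input.
    static final String WITH_DELIMITERS = "/" + SIMPLE + "/";

    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Printing the compiled pattern shows single backslashes.
        System.out.println(Pattern.compile(SIMPLE));

        System.out.println(matches(SIMPLE, "example.com"));             // true
        System.out.println(matches(WITH_DELIMITERS, "example.com"));    // false
        System.out.println(matches(WITH_DELIMITERS, "/example.com/"));  // true
    }
}
```

The second and third checks show why the delimiters have to go: with them in place, only inputs that literally start and end with a slash can match.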

Related

Regex pattern to split colon char with a condition

I have a string like this :
http://schemas/identity/claims/usertype:External
My goal is to split that string into two parts at the colon delimiter, but the regex must not split at the colon in "http://", so the string is split into:
http://schemas/identity/claims/usertype
External
I have tried regex like this :
(http:\/\/+schemas\/identity\/claims\/usertype)
So it will be :
http://schemas/identity/claims/usertype
:External
After that I would replace the remaining colon with an empty string.
But I don't think this is best practice; I rarely use regex.
Do you have any suggestions to simplify the regex?
Thanks in advance
This is an X/Y problem. Fortunately, you asked the question in a great way, by explaining the underlying problem you are trying to solve (namely: Pull some string out of a URL), and then describing the direction you've chosen to solve your problem (which is bad, see below), and then asking about a problem you have with this solution (which is irrelevant, as the entire solution is bad).
URLs aren't parsable like this. You shouldn't treat them as a string you can lop into pieces. For example, the server part can contain a colon too, in front of a port number. In front of the server part, there can be authentication info, which can also contain a colon. It's rarely used, of course.
Try this one, which shows the problem with your approach:
https://joe:joe@google.com:443/
That link just works. Port 443 was the default anyway, and Google ignores the authentication header that the browser ends up sending, but the point is, a URL may contain this stuff.
But rzwitserloot, it.. won't! I know!
That's a bad programming mindset. That mindset leads to security issues. Why go for a solution that burdens your codebase with unstated assumptions (assumption: the places that provide a URL to this code are under my control and will never send port or auth headers)? If the 'server' part is configurable in a config file, will you mention in said config file that you cannot add a port? Will you remember 4 years from now?
The solution that does it right isn't going to burden your code with all these unstated (or very unwieldy if stated) assumptions.
Okay, so what is the right way?
First, toss that string into the constructor of java.net.URI. Then, use the methods there to get what you actually want, which is the path part. That is a string you can pull apart:
URI uri = new URI("http://schemas/identity/claims/usertype:External");
String path = uri.getPath();
String newPath = path.replaceAll(":.*", "");
String type = path.replaceAll(".*?:", "");
URI newUri = uri.resolve(newPath);
System.out.println(newUri);
System.out.println(type);
prints:
http://schemas/identity/claims/usertype
External
NB: Toss some ports or auth stuff in there (a relative URL works too, as long as its path starts with /) - do whatever you like, this code is far more robust in the face of changing the base URL than any attempt to count colons is going to be.
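As a rough sketch of that robustness claim, the same idea can be wrapped in a helper and fed a URL that carries both credentials and a port. The class and method names (ClaimSplitter, splitType) and the sample credentials are invented for this example, and it assumes an absolute URL whose path contains the colon.

```java
import java.net.URI;

class ClaimSplitter {
    // Returns { URL without the ":suffix", suffix }.
    // Assumption: the input is an absolute URL and the colon of interest is in its path.
    static String[] splitType(String url) {
        URI uri = URI.create(url);
        String path = uri.getRawPath();  // raw form, so escapes survive the round trip
        int colon = (path == null) ? -1 : path.lastIndexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("no ':' in path of " + url);
        }
        // resolve() keeps scheme, user info, host and port and swaps in the shortened path.
        String base = uri.resolve(path.substring(0, colon)).toString();
        return new String[] { base, path.substring(colon + 1) };
    }

    public static void main(String[] args) {
        String[] parts = splitType(
                "http://joe:secret@schemas:8080/identity/claims/usertype:External");
        System.out.println(parts[0]);  // http://joe:secret@schemas:8080/identity/claims/usertype
        System.out.println(parts[1]);  // External
    }
}
```

The colons in the credentials and in front of the port never reach the splitting logic, because only the path component is inspected.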
Use Negative Lookbehind and split
Regex:
"(?<!(http|https)):"
Regex in context:
public static void main(String[] args) {
String input = "http://schemas/identity/claims/usertype:External";
validateURI(input);
List<String> result = Arrays.asList(input.split("(?<!(http|https)):"));
result.forEach(System.out::println);
}
private static void validateURI(String input) {
try {
new URI(input);
} catch (URISyntaxException e) {
System.out.println("Invalid URI!!!");
e.printStackTrace();
}
}
Output:
http://schemas/identity/claims/usertype
External
I think this might help you:
public class Separator{
public static void main(String[] args) {
String input = "http://schemas/identity/claims/usertype:External";
String[] splitted = input.split("\\:");
System.out.println(splitted[splitted.length-1]);
}
}
Output
External

Java Regexp to match domain of url

I would like to use Java regex to match a domain of a url, for example,
for www.table.google.com, I would like to get 'google' out of the url, namely, the second last word in this URL string.
Any help will be appreciated !!!
It really depends on the complexity of your inputs...
Here is a pretty simple regex:
.+\\.(.+)\\..+
It captures whatever sits between the last two dots; \\. is a literal dot.
And here are some examples for that pattern: https://regex101.com/r/L52oz6/1.
As you can see, it works for simple inputs but not for complex urls.
But why reinvent the wheel? There are plenty of really good libraries that correctly parse any complex URL. Still, for simple inputs a small regex is easily built. If this does not solve the problem for your inputs, let me know and I will adjust the pattern.
Note that you can also just use simple splitting like:
String[] elements = input.split("\\.");
String secondToLastElement = elements[elements.length - 2];
But don't forget the index-bound checking.
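A minimal sketch of the splitting approach with the bounds check in place could look like the following. It first lets java.net.URI pull out the host when the input is a full URL; the class and method names are hypothetical, and treating a scheme-less input as a bare host name is an assumption of this sketch.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

class HostLabels {
    // Returns the second-to-last dot-separated label of the host, if there is one.
    static Optional<String> secondToLastLabel(String urlOrHost) {
        String host;
        try {
            host = new URI(urlOrHost).getHost();
        } catch (URISyntaxException e) {
            return Optional.empty();
        }
        if (host == null) {
            host = urlOrHost;  // assumption: no scheme means the input is a bare host name
        }
        String[] elements = host.split("\\.");
        if (elements.length < 2) {
            return Optional.empty();  // the index-bound check
        }
        return Optional.of(elements[elements.length - 2]);
    }

    public static void main(String[] args) {
        System.out.println(secondToLastLabel("www.table.google.com"));                // Optional[google]
        System.out.println(secondToLastLabel("http://www.table.google.com:8080/a.b")); // Optional[google]
        System.out.println(secondToLastLabel("localhost"));                           // Optional.empty
    }
}
```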
Or, if you are looking for a very quick solution, walk through the input starting from the last position. Work your way backward until you find the first dot, then continue until you find the second dot. Then extract the part in between with input.substring(index1, index2).
There is already a method for exactly that purpose, namely String#lastIndexOf (see the documentation).
Take a look at this code snippet:
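String input = ...
int indexLastDot = input.lastIndexOf('.');
// Start the backward search just before the last dot; otherwise the same dot is found again.
int indexSecondToLastDot = input.lastIndexOf('.', indexLastDot - 1);
// substring takes (begin, end); the + 1 skips the dot itself.
String secondToLastWord = input.substring(indexSecondToLastDot + 1, indexLastDot);
This assumes the input contains at least one dot. Without one, lastIndexOf returns -1 and substring throws, so don't forget bounds checking.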
The advantage of this approach is that it is really fast: it scans the string directly, without compiling a regex or allocating the array that split creates.
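Put together with the bounds checks, the lastIndexOf idea might look like this sketch. The class and method names are invented for the example, and returning null for inputs without a usable dot is just one possible convention.

```java
class LastIndexOfDemo {
    // Returns the text between the last two dots, or from the start of the
    // string to the last dot if there is only one. Returns null if no dot
    // is found or nothing precedes it.
    static String secondToLastWord(String input) {
        int last = input.lastIndexOf('.');
        if (last <= 0) {
            return null;
        }
        // The search starts at last - 1 so the last dot itself is skipped.
        // The result is -1 when there is only one dot, which makes the
        // substring below start at index 0.
        int previous = input.lastIndexOf('.', last - 1);
        return input.substring(previous + 1, last);
    }

    public static void main(String[] args) {
        System.out.println(secondToLastWord("www.table.google.com"));  // google
        System.out.println(secondToLastWord("google.com"));            // google
        System.out.println(secondToLastWord("localhost"));             // null
    }
}
```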
My attempt:
(?<scheme>https?:\/\/)?(?<subdomain>\S*?)(?<domainword>[^.\s]+)(?<tld>\.[a-z]+|\.[a-z]{2,3}\.[a-z]{2,3})(?=\/|$)
Works correctly for:
http://www.foo.stackoverflow.com
http://www.stackoverflow.com
http://www.stackoverflow.com/
http://stackoverflow.com
https://www.stackoverflow.com
www.stackoverflow.com
stackoverflow.com
http://www.stackoverflow.com
http://www.stackoverflow.co.uk
foo.www.stackoverflow.com
foo.www.stackoverflow.co.uk
foo.www.stackoverflow.co.uk/a/b/c
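For use from Java, the pattern above needs its backslashes doubled, and the named group can then be read with Matcher.group(String). The following is only a sketch: the class and method names are made up, and the \/ escapes are written as plain / because Java has no regex delimiters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DomainWord {
    // Same regex as above, written as a Java string literal.
    private static final Pattern URL_PARTS = Pattern.compile(
            "(?<scheme>https?://)?(?<subdomain>\\S*?)(?<domainword>[^.\\s]+)"
            + "(?<tld>\\.[a-z]+|\\.[a-z]{2,3}\\.[a-z]{2,3})(?=/|$)");

    // Returns the "domainword" group, or null when the pattern finds nothing.
    static String domainWord(String url) {
        Matcher m = URL_PARTS.matcher(url);
        return m.find() ? m.group("domainword") : null;
    }

    public static void main(String[] args) {
        System.out.println(domainWord("www.table.google.com"));               // google
        System.out.println(domainWord("http://www.stackoverflow.co.uk"));     // stackoverflow
        System.out.println(domainWord("foo.www.stackoverflow.co.uk/a/b/c"));  // stackoverflow
    }
}
```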
private static final Pattern URL_MATCH_GET_SECOND_AND_LAST =
        Pattern.compile("www\\.(.*)\\.google\\.(.*)", Pattern.CASE_INSENSITIVE);
String sURL = "www.table.google.com";
// One matcher is enough; find() both tests the input and fills the groups.
Matcher matchURL = URL_MATCH_GET_SECOND_AND_LAST.matcher(sURL);
if (matchURL.find()) {
    String sFirst = matchURL.group(1);   // "table"
    String sSecond = matchURL.group(2);  // "com"
}

Encode URL with US-ASCII character set

I refer to the following web site:
http://coderstoolbox.net/string/#!encoding=xml&action=encode&charset=us_ascii
Choosing "URL", "Encode", and "US-ASCII", the input is converted to the desired output.
How do I produce the same output with Java code?
Thanks in advance.
I used this and it seems to work fine.
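import java.util.regex.Pattern;
import java.util.stream.Collectors;

public static String encode(String input) {
    // Letters and digits are kept as-is; everything else is escaped.
    Pattern doNotReplace = Pattern.compile("[a-zA-Z0-9]");
    return input.chars().mapToObj(c -> {
        String s = String.valueOf((char) c);
        if (doNotReplace.matcher(s).matches()) {
            return s;
        }
        // Two uppercase hex digits below 256, %uXXXX above. Formatting here
        // (instead of upper-casing the whole result) leaves unescaped letters untouched.
        return c < 256 ? String.format("%%%02X", c) : String.format("%%u%04X", c);
    }).collect(Collectors.joining(""));
}
PS: I'm using 256 to limit the u prefix to characters that do not fit into two hex digits. Characters below 256 (ASCII is 0-127, the rest is the Latin-1 range) need no prefix. Note that %uXXXX is a non-standard form of percent-encoding.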
Alternate option:
There is a built-in Java class (java.net.URLEncoder) that does URL Encoding. But it works a little differently (For example, it does not replace the Space character with %20, but replaces with a + instead. Something similar happens with other characters too). See if it helps:
String encoded = URLEncoder.encode(input, "US-ASCII");
Hope this helps!
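If the + for spaces is the only difference that matters, one common workaround is to post-process the result of URLEncoder. The sketch below assumes that %20 is the desired form; the class and method names are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class FormEncodingDemo {
    static String encodeWithPercent20(String input) {
        try {
            // URLEncoder produces application/x-www-form-urlencoded output,
            // in which a space becomes '+'.
            String encoded = URLEncoder.encode(input, "US-ASCII");
            // A literal '+' in the input is already %2B at this point,
            // so every remaining '+' stands for a space.
            return encoded.replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // US-ASCII is a charset every Java platform must support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeWithPercent20("a b+c&d"));  // a%20b%2Bc%26d
    }
}
```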
You can use ESAPI.encoder().encodeForURL(linkString) from the OWASP ESAPI library.
For background on percent-encoding, see https://en.wikipedia.org/wiki/Percent-encoding
please comment if that does not satisfy your requirement or face any other issue.
Thanks

Need help on validating domains on the basis of ASCII and BASE64 encoded UTF-8 string

I am doing some tests related to ldap in java using JDK 1.7
I have configuration file from which I am reading value of one property like "dc=domain1,dc=com" to pass that later to ldap for searching operations.
Here I want to validate the value which is coming from properties file and that value should be only ASCII or Base64 encoded UTF-8 strings.
I have written following regex to validate the string but seems like it is having some issues.
here is my sample code:
public class ValidateDN {
public static void main(String[] args) {
String istr = "dc=domain1,dc=com";
String myregex = "^dc=[a-zA-Z0-9\\-\\.]*[,dc=[a-zA-Z0-9\\-\\.]*]*";
if (istr.matches(myregex)){
System.out.println("String matches");
}
else{
System.out.println("String not matching");
}
}
}
It should pass all strings like:
dc=com
dc=domain1,dc=com
dc=domain2,dc=domain1,dc=com
It should fail for the values:
dc=domain1,dc=com,d
dc=domain1,dc=com,dc
(incomplete key or invalid syntax)
Can anyone suggest what should be done here to validate this properly?
You have a major error in your regex: you're using square brackets instead of parentheses. Square brackets define a character class, which matches any single one of the listed characters, not a sequence of characters.
Further, your regex can be simplified to (add . to the class if your values contain dots):
dc=[\w-]+(,dc=[\w-]+)*
As LDAP DNs may contain spaces, you may want to consider using:
\s*dc\s*=\s*[\w-]+\s*(,\s*dc\s*=\s*[\w-]+\s*)*
Remember to escape the backslashes as necessary when inserting into your code.
I believe the problem you are having is due to the structure of your regex.
Your regex:
"^dc=[a-zA-Z0-9\\-\\.]*[,dc=[a-zA-Z0-9\\-\\.]*]*"
has a flaw in the second bracketed part. Specifically:
[,dc=[a-zA-Z0-9\\-\\.]*]*
It should be changed to (,dc=[a-zA-Z0-9\\-\\.]*)* so that the literal ",dc=" is matched as a sequence, followed by the inner character class.
The complete regex that should work is:
^dc=[a-zA-Z0-9\\-\\.]*(,dc=[a-zA-Z0-9\\-\\.]*)*
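A quick way to check the corrected regex is to run it against the sample values from the question. This is only a sketch; the class and method names are invented here.

```java
import java.util.regex.Pattern;

class DnValidator {
    // The corrected regex from above, as a Java string literal.
    private static final Pattern DC_LIST =
            Pattern.compile("^dc=[a-zA-Z0-9\\-\\.]*(,dc=[a-zA-Z0-9\\-\\.]*)*");

    static boolean isValid(String dn) {
        // matches() requires the whole input to fit the pattern.
        return DC_LIST.matcher(dn).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "dc=com",
            "dc=domain1,dc=com",
            "dc=domain2,dc=domain1,dc=com",
            "dc=domain1,dc=com,d",
            "dc=domain1,dc=com,dc"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

The first three samples are accepted and the last two are rejected, because the trailing ",d" and ",dc" cannot be matched as a complete ",dc=" group.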

Regex for match web extensions

I want to check whether a URL points to an image, JavaScript file, PDF, etc.
String url = "www.something.com/script/sample.js?xyz=xyz";
The regex below works fine, but only without the ?xyz=xyz part:
".*(mov|jpg|gif|pdf|js)$"
When I remove the $ at the end, so that .js no longer has to be at the end, it returns false.
.*(mov|jpg|gif|pdf|js).*$ allows any optional text after the file extension. The capturing group captures the file extension.
Use the regex as below:
.*\\.(mov|jpg|gif|pdf|js)\\?
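This matches a dot (.) followed by one of your extensions and then a literal ?.
The first dot (.) matches any character, while the second one, prefixed by \\, matches a literal dot just before your extension list. Note that String.matches must match the whole input, so to accept text after the ? either append .* or use Matcher.find().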
Why not use java.net.URL to parse the URL string? It avoids a lot of mismatching problems:
try {
    URL url = new URL(urlString);
    // getPath() excludes the query string; getFile() would include "?xyz=xyz".
    String filename = url.getPath();
    // now test if the filename ends with one of your desired extensions.
} catch (MalformedURLException e) {
    // The URL cannot be parsed, e.g. because it has no protocol such as http://.
}
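Since the sample URL in the question has no protocol, a variant based on java.net.URI, which also accepts relative references, may be more forgiving. The sketch below is one way to do the extension test; the class name, method name and the choice to compare case-insensitively are assumptions of this example.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class ExtensionCheck {
    private static final Set<String> EXTENSIONS =
            new HashSet<>(Arrays.asList("mov", "jpg", "gif", "pdf", "js"));

    static boolean hasKnownExtension(String url) {
        String path;
        try {
            // getPath() leaves out the query string and the fragment.
            path = URI.create(url).getPath();
        } catch (IllegalArgumentException e) {
            return false;  // not parsable as a URI
        }
        if (path == null) {
            return false;  // opaque URI such as mailto:
        }
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        if (dot <= slash) {
            return false;  // no dot in the last path segment
        }
        String extension = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        return EXTENSIONS.contains(extension);
    }

    public static void main(String[] args) {
        System.out.println(hasKnownExtension("www.something.com/script/sample.js?xyz=xyz"));  // true
        System.out.println(hasKnownExtension("http://example.com/our.moving.day"));           // false
    }
}
```

Because only the text after the last dot of the path is compared, inputs such as "our.moving.day" are rejected, which the regex variants with a trailing .* would accept.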
I'm not a big fan of this, but try:
.*\\.(mov|jpg|gif|pdf|js).*$
The problem is that it will accept things like "our.moving.day".
Also, post your code. There is always more than one way to skin a cat, and perhaps there is something wrong with your code, not the regex.
Also, try regex testers. There are a ton of them out there. I'm a big fan of:
http://rubular.com/ and http://gskinner.com/RegExr/ (but they are mostly for PHP/Ruby)
