RegExp for File URL - java

I wrote a RegExp to validate a file URL:
file:/{2,3}[a-zA-Z0-9]*\\|?/.*
for URL like
file:/D:/workspace/project/build/libs/myjar-1.0.jar
But it doesn't work.
I am looking for a pattern that matches only URLs like this one and nothing else.
The pattern should return false for URLs like
file:/workspace/project/build/libs/myjar-1.0.jar
and
file:/D:/workspace/project/build/libs/myjar-1.0
Neither of these should match.
Please help.

Complete question rewrite
Given the OP's updated criteria, the regex you're looking for is file\:\/\w\:\/[^\s]*\.jar; make sure you enable the g (global) and i (case-insensitive) modifiers.
See a working example on Regex101
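Since the question is tagged Java, here is a minimal sketch of how that pattern might be used there. The class and method names (JarFileUrlCheck, isJarFileUrl) are made up for illustration, and the case-insensitive flag stands in for the i modifier; Java has no g modifier, so matches() is used to test the whole string.

```java
import java.util.regex.Pattern;

class JarFileUrlCheck {
    // The answer's pattern as a Java string literal; ':' and '/' need no escaping in Java.
    private static final Pattern JAR_FILE_URL =
            Pattern.compile("file:/\\w:/\\S*\\.jar", Pattern.CASE_INSENSITIVE);

    // matches() succeeds only if the entire input fits the pattern.
    static boolean isJarFileUrl(String url) {
        return JAR_FILE_URL.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(isJarFileUrl("file:/D:/workspace/project/build/libs/myjar-1.0.jar")); // true
        System.out.println(isJarFileUrl("file:/workspace/project/build/libs/myjar-1.0.jar"));    // false
        System.out.println(isJarFileUrl("file:/D:/workspace/project/build/libs/myjar-1.0"));     // false
    }
}
```

Note that \w also accepts digits and underscores as the "drive letter"; [A-Za-z] would be stricter if that matters.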

If there is no further context or rule, your regex should be:
/file\:.*/i
See it here: http://regex101.com/r/yY5xG6

Related

Get domain name from URL without java.net in Java

I would like to know whether GWT offers an easy way to get the host name from a URL WITHOUT java.net.*, since the client side doesn't support this package.
Input: ejw.example.com/anotherexample?fun=no
Output: example.com
Input: https://www3.example.com/yeteagainanotherexample?fun=miserable
Output: example.com
com.google.gwt.user.client.Window.Location.getHost()
Try com.google.gwt.regexp.shared.RegExp, which provides regular expressions with features like JavaScript's RegExp, plus JavaScript String's replace and split methods (which can take a RegExp parameter).
Try the regex below, which captures everything between the first dot and the first forward slash; the result is available as group 1.
https?://\w+\.(.*?)/
Here is a demo on Debuggex and RegExr.
Sample code:
String url = "https://www3.example.com/yeteagainanotherexample?fun=miserable";
System.out.println(RegExp.compile("https?://\\w+\\.(.*?)/").exec(url).getGroup(1));
Note: I am not good at regex, so please change the pattern as per your needs. Read more:
What is a good regular expression to match a URL?
JavaScript Regex to match a URL in a field of text
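The first sample input in the question has no scheme and no guaranteed trailing slash, so here is a hedged variant that makes the scheme optional. It is only a sketch: it uses java.util.regex so it can run on a plain JVM, and in GWT client code the same pattern string would presumably go through com.google.gwt.regexp.shared.RegExp instead. The names DomainExtractor and domainOf are hypothetical, and the approach simply drops the first host label, so an input like example.com/path would yield com.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DomainExtractor {
    // Optional scheme, the first host label, a dot, then the rest of the host
    // up to the first '/', '?', '#' or ':'.
    private static final Pattern HOST_TAIL =
            Pattern.compile("^(?:https?://)?[^./?#:]+\\.([^/?#:]+)");

    // Returns the host without its first label, or null if the input does not fit.
    static String domainOf(String url) {
        Matcher m = HOST_TAIL.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(domainOf("ejw.example.com/anotherexample?fun=no"));
        System.out.println(domainOf("https://www3.example.com/yeteagainanotherexample?fun=miserable"));
    }
}
```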

Undoing automatic linkification using Java and Regex

I am working with a database whose entries contain automatically generated HTML links: each URL was converted to
<a href="URL">URL</a>
I want to undo these links: the new software will generate the links on the fly. Is there a way in Java to use .replaceAll or a Regex method that will replace the fragments with just the URL (only for those cases where the URLs match)?
To clarify, based on the questions below: the existing entries will contain one or more instances of linkified URLs. Showing an example of just one:
I visited <a href="http://www.amazon.com/">http://www.amazon.com/</a> to buy a book.
should be replaced with
I visited http://www.amazon.com/ to buy a book.
If the URL in the href differs in any way from the link text, the replacement should not occur.
You can use this pattern with replaceAll method:
<a (?>[^h>]++|\Bh|h(?!ref\b))*href\s*=\s*["']?(http://)?([^\s"']++)["']?[^>]*>\s*+(?:http://)?\2\s*+<\/a\s*+>
replacement: $1$2
I wrote the pattern in raw form, so don't forget to escape the double quotes and double the backslashes before using it in a Java string literal.
The main advantage of this pattern is that URLs are compared without the http:// prefix, so more links are matched.
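For reference, this is roughly what the pattern looks like once escaped for Java. It is a sketch only; the class and method names (LinkUndoer, undoLinks) are invented for the example.

```java
class LinkUndoer {
    // The raw pattern from above, with double quotes escaped and backslashes doubled.
    private static final String LINK_REGEX =
            "<a (?>[^h>]++|\\Bh|h(?!ref\\b))*href\\s*=\\s*[\"']?(http://)?([^\\s\"']++)[\"']?[^>]*>"
            + "\\s*+(?:http://)?\\2\\s*+<\\/a\\s*+>";

    // Replaces each link whose text equals its href with the bare URL ($1 is the optional "http://").
    static String undoLinks(String html) {
        return html.replaceAll(LINK_REGEX, "$1$2");
    }

    public static void main(String[] args) {
        String linked = "I visited <a href=\"http://www.amazon.com/\">http://www.amazon.com/</a> to buy a book.";
        System.out.println(undoLinks(linked)); // I visited http://www.amazon.com/ to buy a book.

        // The link text differs from the href, so this one is left alone.
        System.out.println(undoLinks("See <a href=\"http://www.amazon.com/\">Amazon</a>."));
    }
}
```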
First, a reminder that regular expressions are not great for parsing XML/HTML: this HTML should parse out the same as what you've got, but it's really hard to write a regex for it:
<
a
foo="bar"
href="URL">
<nothing/>URL
</a
>
That's why we say "don't use regular expressions to parse XML!"
But it's often a great shortcut. What you're looking for is a back-reference:
\1
This will match when the quoted string and the contents of the a-element are the same. The \1 matches whatever was captured in group 1. You can also use named capturing groups if you like a little more documentation in your regular expressions. See Pattern for more options.
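A minimal sketch of the back-reference idea with a named group, assuming the links have exactly the simple shape <a href="URL">URL</a> with double quotes and no other attributes. The class name, method name and group name (BackReferenceUnlink, unlink, url) are my own choices, not something from the question.

```java
class BackReferenceUnlink {
    // (?<url>...) captures the href value; \k<url> requires the link text to repeat it exactly.
    private static final String SIMPLE_LINK =
            "<a\\s+href=\"(?<url>[^\"]+)\"\\s*>\\k<url></a>";

    // ${url} in the replacement inserts the text captured by the named group.
    static String unlink(String html) {
        return html.replaceAll(SIMPLE_LINK, "${url}");
    }

    public static void main(String[] args) {
        System.out.println(unlink(
                "I visited <a href=\"http://www.amazon.com/\">http://www.amazon.com/</a> to buy a book."));
        // Href and text differ, so nothing is replaced here.
        System.out.println(unlink("<a href=\"http://www.amazon.com/\">click here</a>"));
    }
}
```

Anything beyond that simple shape (extra attributes, single quotes, line breaks inside the tag) would need a more tolerant pattern or a real HTML parser.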

Regex matches only part of a URL - why?

I am very weak in regex, and the regex I am using (found on the internet) only partially solves my problem. I need to add an anchor tag to a URL from text input using Java. Here is my code:
String text ="Hi please visit www.google.com";
String reg = "\\b(([\\w-]+://?|www[.])[^\\s()<>]+(?:\\([\\w\\d]+\\)|([^[:punct:]\\s]|/)))";
String s = text.replaceAll(reg, "<a href='$1'>$1</a>");
System.out.println(""+s);
The output currently is Hi please visit <a href='www.google.c'>www.google.c</a>om. What's wrong with the regex?
I need to parse a text and display a URL entered from text field as hot link in a jsp page. The actual output expected would be
Hi please visit <a href='www.google.com'>www.google.com</a>
Edit
Following regex
(http(s)?://)?(www(\.\w+)+[^\s.,"']*)
works like a charm for URLs ending in .com but fails for other extensions such as .jsp. Is there any way to make it work for all sorts of extensions?
To answer your question why the regex doesn't work: It doesn't observe Java's regex syntax rules.
Specifically:
[^[:punct:]\s]
doesn't work as you expect it to because Java doesn't recognize POSIX shorthands like [:punct:]. Instead, it treats that as a nested character class. That again leads to the ^ becoming illegal in that context, so Java ignores it, leaving you with a character class that matches the same as
[:punct\s]
which only matches the c of com, therefore ending your match there.
As for your question of how to find URLs in a block of text, I suggest you read Jan Goyvaerts' excellent blog entry Detecting URLs in a block of text. You'll need to decide for yourself how sensitive and how specific you want to make your regex.
For example, the solution proposed at the end of the post would translate to Java as
String resultString = subjectString.replaceAll(
    "(?imx)\\b(?:(?:https?|ftp|file)://|www\\.|ftp\\.)\n" +
    "(?:\\([-A-Z0-9+&@\\#/%=~_|$?!:,.]*\\)|\n" +
    " [-A-Z0-9+&@\\#/%=~_|$?!:,.])*\n" +
    "(?:\\([-A-Z0-9+&@\\#/%=~_|$?!:,.]*\\)|\n" +
    " [A-Z0-9+&@\\#/%=~_|$])", "$0");
Java recognises POSIX character classes (see javadoc), but the syntax is a little different. It looks like this instead:
\p{Punct}
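To illustrate, here is the regex from the question with only [:punct:] swapped for \p{Punct}. This is a sketch rather than a recommendation of that regex, and the class and method names (PunctClassFix, linkify) are made up.

```java
class PunctClassFix {
    // The question's regex, unchanged except that [^[:punct:]\s] became [^\p{Punct}\s].
    private static final String URL_REGEX =
            "\\b(([\\w-]+://?|www[.])[^\\s()<>]+(?:\\([\\w\\d]+\\)|([^\\p{Punct}\\s]|/)))";

    // Wraps every match in an anchor tag, as in the question's code.
    static String linkify(String text) {
        return text.replaceAll(URL_REGEX, "<a href='$1'>$1</a>");
    }

    public static void main(String[] args) {
        // Prints: Hi please visit <a href='www.google.com'>www.google.com</a>
        System.out.println(linkify("Hi please visit www.google.com"));
    }
}
```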
But I would simplify your regex for a URL to:
(?i)(http(s)?://)?((www(\.\w+)+|(\d{1,3}\.){3}\d{1,3})[^\s,"']*(?<!\.))
And elaborate it only if you find a test case that breaks it.
As a Java line it would be:
text = text.replaceAll("(?i)(http(s)?://)?((www(\\.\\w+)+|(\\d{1,3}\\.){3}\\d{1,3})[^\\s,\"']*(?<!\\.))", "<a href=\"http$2://$3\">$3</a>");
Note the neat capture of the "s" in "https" (if found) that is restored if required.
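A small runnable sketch of this simplified regex, to show what happens with and without the "s". The replacement string used here (an anchor whose href is rebuilt as http, plus group 2, plus ://, plus group 3) is my assumption about the intended output format, and the class and method names (SimpleLinker, linkify) are hypothetical.

```java
class SimpleLinker {
    // Group 2 is the optional "s" of https; group 3 is the host plus whatever follows it.
    private static final String URL_REGEX =
            "(?i)(http(s)?://)?((www(\\.\\w+)+|(\\d{1,3}\\.){3}\\d{1,3})[^\\s,\"']*(?<!\\.))";

    // Assumed replacement: rebuild the scheme from group 2 and show group 3 as the link text.
    static String linkify(String text) {
        return text.replaceAll(URL_REGEX, "<a href=\"http$2://$3\">$3</a>");
    }

    public static void main(String[] args) {
        System.out.println(linkify("Hi please visit www.google.com"));
        // The lookbehind (?<!\.) keeps the sentence-ending dot out of the link.
        System.out.println(linkify("See https://www.example.com/index.jsp."));
    }
}
```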

Regex to extract valid Http or Https

I'm currently having some issues with a regex to extract a URL.
I want my regex to accept URLs such as:
http://stackoverflow.com/questions/ask
https://stackoverflow.com
http://local:1000
https://local:1000
Through some tutorials, I've learned that this regex will find all of the above: ^(http|https)\://.*$ However, it will also take http://local:1000;http://invalid http://khttp:// as a single string when it shouldn't take it at all.
I understand that my expression isn't written to exclude this, but my issue is I cannot think of how to write it so it checks for this scenario.
Any help is greatly appreciated!
Edit:
Looking at my issue, it seems that I could eliminate my issue as long as I can implement a check to make sure '//' doesn't occur in my string after the initial http:// or https://, any ideas on how to implement?
Sorry, this will be done in Java.
I also need to add the following constraint: a string such as http://local:80/test:90 should fail because of the duplicate port; that is, I need a constraint that allows only two : symbols in total in a valid string (one after http/https and one before the port).
This will only produce a match if there is no :// after its first appearance in the string.
^https?:\/\/(?!.*:\/\/)\S+
Note that trying to parse a valid url from within a string is very complex, see
In search of the perfect URL validation regex, so the above does not attempt to do that.
It will just match the protocol and following non-space characters.
In Java
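import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Match the scheme, then reject any input that contains a second "://".
Pattern reg = Pattern.compile("^https?:\\/\\/(?!.*:\\/\\/)\\S+");
Matcher m = reg.matcher("http://somesite.com");
if (m.find()) {
    System.out.println(m.group());
} else {
    System.out.println("No match");
}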
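The extra constraint from the question (only one more colon, and only in front of the port) is not handled by the pattern above. Here is one possible sketch, assuming the host contains no user info or IPv6 literal and that no colon is allowed anywhere in the path. StrictHttpUrl and isValid are names I made up for the example.

```java
import java.util.regex.Pattern;

class StrictHttpUrl {
    // Scheme, a host without ':' or '/', an optional numeric port, an optional path without ':'.
    private static final Pattern STRICT =
            Pattern.compile("https?://[^\\s:/]+(?::\\d+)?(?:/[^\\s:]*)?");

    // matches() anchors the pattern at both ends, so trailing junk makes it fail.
    static boolean isValid(String url) {
        return STRICT.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("http://stackoverflow.com/questions/ask")); // true
        System.out.println(isValid("https://local:1000"));                     // true
        System.out.println(isValid("http://local:80/test:90"));                // false
        System.out.println(isValid("http://local:1000;http://invalid http://khttp://")); // false
    }
}
```

Because no colon can appear after the port, a second :// can never occur either, so the negative lookahead is not needed in this variant.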
Check your programming language to see if it already has a parser. For example, PHP has parse_url().
From http://net.tutsplus.com/tutorials/other/8-regular-expressions-you-should-know/
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
This may change based on the programming language/tool
/[A-Za-z]+://[A-Za-z0-9-]+.[A-Za-z0-9-:%&;?#/.=]+/g

Java - Regex problem

I have a list of URLs of type
http://www.example.com/pk/ca,
http://www.example.com/pk,
http://www.example.com/anthingcangoeshere/pk, and
http://www.example.com/pkisnotnecessaryhere.
Now, I want to find only those URLs that end with /pk or /pk/ and have nothing in between .com and /pk.
Your problem isn't fully defined so I can't give you an exact answer but this should be a start you can use:
^[^:]+://[^/]+\.com/pk/?$
These strings will match:
http://www.example.com/pk
http://www.example.com/pk/
https://www.example.com/pk
These strings won't match:
http://www.example.co.uk/pk
http://www.example.com/pk/ca
http://www.example.com/anthingcangoeshere/pk
http://www.example.com/pkisnotnecessaryhere
String pattern = "^http://www\\.example\\.com/pk/?$";
Hope this helps.
Some details (these apply when searching with find()): if you don't add ^ to the beginning of the pattern, then foobarhttp://www.example.com/pk/ will be accepted too. If you don't add $ to the end of the pattern, then http://www.example.com/pk/foobar will be accepted too.
Directly translating your request "[...] URLs that ends with /pk or /pk/ and don't have anything in between .com and /pk", with the additional assumption that there shall always be a ".com", yields this regex:
If you use find():
\.com/pk/?$
If you use matches():
.*\.com/pk/?
Other answers given here give more restrictive patterns, allowing only URLs that are closer to your examples. In particular, my pattern does not validate that the given string is a syntactically valid URL.
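To make the find() versus matches() distinction concrete, here is a short sketch that runs both forms against the URLs from the question. The class and method names (PkUrlFilter, viaFind, viaMatches) are only for illustration.

```java
import java.util.regex.Pattern;

class PkUrlFilter {
    // find() searches anywhere in the input, so the pattern needs $ to pin it to the end.
    private static final Pattern FIND_FORM = Pattern.compile("\\.com/pk/?$");
    // matches() must consume the whole input, so a leading .* replaces the anchors.
    private static final Pattern MATCHES_FORM = Pattern.compile(".*\\.com/pk/?");

    static boolean viaFind(String url) {
        return FIND_FORM.matcher(url).find();
    }

    static boolean viaMatches(String url) {
        return MATCHES_FORM.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.example.com/pk/ca",
            "http://www.example.com/pk",
            "http://www.example.com/anthingcangoeshere/pk",
            "http://www.example.com/pkisnotnecessaryhere"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + viaFind(url) + " / " + viaMatches(url));
        }
    }
}
```

Only http://www.example.com/pk passes both checks; the other three are rejected by both.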
String pattern = "^https?://(www\\.)?.+\\.com/pk/?$";
