I would like to extract URLs from text entered by a user (I assume that a URL must start with http://). This is my first attempt:
Pattern pattern = Pattern.compile("http://[^ ]+");
but if the user types something like this:
"look at somepage (http://somepage.net)"
"look at http://somepage1.net, http://somepage2.net and sth else"
"Please visit our page http://somepage.net."
the extracted URL ends with an incorrect(?) character. How can I avoid this?
I could require that a URL cannot end with [,.)] etc. and may end only with [A-Za-z] or /, but this breaks URLs with unusual endings such as http://site.com/read.php?key=F#$.)
The answer is that you cannot do this with 100% accuracy.
A URL like "http://somepage1.net," is technically legal, and there is no way of knowing for sure whether the "," is part of the URL or just punctuation.
A URL like "http://somepage1.net or something" is technically illegal, but typical end users don't know this. (They are used to browsers that do all sorts of funky things to what they type at their browser.)
Probably the best you can do is to use a regex to extract legal URLs, and then trim text punctuation characters from the right end of each URL, on the assumption that they are not intended to be part of the URL.
You could also treat matching quotes or left / right brackets as denoting URL boundaries; e.g.
The secret URL is "http://example.com/?" ... don't leave off the "?"
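The "extract, then trim" approach above can be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class name `UrlExtractor` and the exact set of characters treated as trailing punctuation are my own choices, not something the answer prescribes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlExtractor {
    // Candidate URL: "http://" followed by any run of non-whitespace characters.
    private static final Pattern CANDIDATE = Pattern.compile("http://\\S+");

    // Trailing characters assumed (hypothetically) to be sentence punctuation
    // rather than part of the URL. Adjust this set to taste.
    private static final Pattern TRAILING = Pattern.compile("[.,;:!?)\\]}'\"]+$");

    public static List<String> extract(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            // Strip the assumed punctuation from the right end of each candidate.
            urls.add(TRAILING.matcher(m.group()).replaceFirst(""));
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(extract("look at somepage (http://somepage.net)"));
        System.out.println(extract("look at http://somepage1.net, http://somepage2.net and sth else"));
        System.out.println(extract("Please visit our page http://somepage.net."));
    }
}
```

The trade-off is exactly the one described above: a URL that really does end in punctuation, such as http://site.com/read.php?key=F#$.) or the quoted http://example.com/? example, loses its last characters with this sketch.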
Related
Good day,
I'm looking for a regex that validates URLs and file locations that will work within a struts 2 environment.
What I mean by in a struts 2 environment, is the string will be input into a textfield:
<s:textfield name="linkAddr.urlAddress" id="linkAddr" maxlength="2500"/>
In struts 2, as you know, if someone inputs google.ca, it will return
APP_LOCATION/NAMESPACE/google.ca
, and will not point to google, despite the input normally being correct.
Therefore, I want a regex that takes this into account: the user MUST type http, https, ftp, or \\ (in the case of a file located on a shared drive).
EDIT:
Some examples:
I want to allow:
http://foo.com/blah_blah_(wikipedia)_(again)
http://www.example.com/wpstyle/?p=364
https://www.example.com/foo/?bar=baz&inga=42&quux
http://✪df.ws/123
ftp://foo.bar/baz
http://foo.bar/?q=Test%20URL-encoded%20stuff
http://1337.net
http://a.b-c.de
\\asdf.233.net\natdfs\AAA\HQ\FFEE\FFEE_H0E\GV1\AAA\FFFEEE\Web Dev\Web Applications Team\Web Applications Team Document.docx
Try this for your regex:
((http://|https://|ftp://)([\S.]+))|((\\\\)(.+)(\.)(\w+))
Your case is a little complicated because of the last example, and I think this regex will validate some URLs that you don't want to accept, since it also attempts to cover subdomains, etc. Still, you can try it out and make adjustments where necessary.
This regex will check if your string starts with http://, https:// or ftp://, followed by any number of characters besides whitespace or newline, or if it starts with \\ and is followed by any number of characters ending with a file extension (eg, .doc). If it doesn't have a file extension, it will be invalid.
You can test out the regex and anything else you come up with at RegExr!
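For use on the Java side of a Struts 2 action, the regex above could be wrapped roughly like this. It is a minimal sketch: the class name `LinkAddressValidator` and the method `isValid` are hypothetical, and how it is hooked into Struts validation is left open.

```java
import java.util.regex.Pattern;

public class LinkAddressValidator {
    // The regex from the answer as a Java string literal. Every regex backslash
    // is doubled, so the two literal backslashes of a share path take eight here.
    private static final Pattern LINK = Pattern.compile(
            "((http://|https://|ftp://)([\\S.]+))|((\\\\\\\\)(.+)(\\.)(\\w+))");

    public static boolean isValid(String input) {
        // matches() requires the whole input to match, not just a part of it.
        return input != null && LINK.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("http://www.example.com/wpstyle/?p=364"));
        System.out.println(isValid("google.ca"));
        System.out.println(isValid("\\\\asdf.233.net\\natdfs\\AAA\\Team Document.docx"));
    }
}
```

Because `matches()` is used, an input containing whitespace after the scheme (for example "http://foo bar") is rejected, while a share path may contain spaces as long as it ends in a file extension.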
I need to download a list of all the pages on some domain that have specific URL endings.
For example, I have a webpage, like http://brnensky.denik.cz/, which is a Czech webpage with news. Every article has URL ending with post date, like http://brnensky.denik.cz/zpravy_region/ruzova-kola-usnadni-presun-po-brne-20140418.html.
So I would like to find the list of all URLs that begin with http://brnensky.denik.cz/, then whatever, and then for example -20140418.html. Is it possible to achieve?
I'm trying to solve this in Java, but also any other way would help.
Regex would be
^http://brnensky\.denik\.cz.*[0-9]{8}\.html
Logic
The URL begins with the site address and ends with the date followed by .html; the date is always an 8-digit string.
You may have to escape '/' depending on the tool or language used to implement this expression.
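In Java, the pattern above could be applied to a list of URLs along these lines. This is a sketch only: it assumes the list of candidate URLs has already been collected somehow (crawling the site or reading a sitemap is a separate problem), and the class name `ArticleFilter` is made up for the example.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class ArticleFilter {
    // Site prefix, anything in between, then "-" + 8 digits + ".html" at the end.
    private static final Pattern ARTICLE = Pattern.compile(
            "^http://brnensky\\.denik\\.cz/.*-[0-9]{8}\\.html$");

    public static List<String> articles(List<String> urls) {
        return urls.stream()
                .filter(u -> ARTICLE.matcher(u).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
                "http://brnensky.denik.cz/",
                "http://brnensky.denik.cz/zpravy_region/ruzova-kola-usnadni-presun-po-brne-20140418.html");
        System.out.println(articles(urls));
    }
}
```

To select a single day, `[0-9]{8}` can be replaced with the literal date, for example `20140418`.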
I am very weak in regex, and the regex I am using (found on the internet) only partially solves my problem. I need to wrap a URL from text input in an anchor tag using Java. Here is my code:
String text ="Hi please visit www.google.com";
String reg = "\\b(([\\w-]+://?|www[.])[^\\s()<>]+(?:\\([\\w\\d]+\\)|([^[:punct:]\\s]|/)))";
String s = text.replaceAll(reg, "<a href='$1'>$1</a>");
System.out.println(""+s);
The output currently is Hi please visit <a href='www.google.c'>www.google.c</a>om. What's wrong with the regex?
I need to parse a text and display a URL entered from text field as hot link in a jsp page. The actual output expected would be
Hi please visit <a href='www.google.com'>www.google.com</a>
Edit
Following regex
(http(s)?://)?(www(\.\w+)+[^\s.,"']*)
works like a charm for URLs ending with .com but fails for other extensions like .jsp. Is there any way to make it work for all sorts of extensions?
To answer your question why the regex doesn't work: It doesn't observe Java's regex syntax rules.
Specifically:
[^[:punct:]\s]
doesn't work as you expect it to because Java doesn't recognize POSIX shorthands like [:punct:]. Instead, it treats that as a nested character class. That again leads to the ^ becoming illegal in that context, so Java ignores it, leaving you with a character class that matches the same as
[:punct\s]
which only matches the c of com, therefore ending your match there.
As for your question of how to find URLs in a block of text, I suggest you read Jan Goyvaerts' excellent blog entry Detecting URLs in a block of text. You'll need to decide yourself how sensitive and how specific you want to make your regex.
For example, the solution proposed at the end of the post would translate to Java as
String resultString = subjectString.replaceAll(
"(?imx)\\b(?:(?:https?|ftp|file)://|www\\.|ftp\\.)\n" +
"(?:\\([-A-Z0-9+&@\\#/%=~_|$?!:,.]*\\)|\n" +
" [-A-Z0-9+&@\\#/%=~_|$?!:,.])*\n" +
"(?:\\([-A-Z0-9+&@\\#/%=~_|$?!:,.]*\\)|\n" +
" [A-Z0-9+&@\\#/%=~_|$])", "<a href=\"$0\">$0</a>");
Java recognises posix expressions (see javadoc), but the syntax is a little different. It looks like this instead:
\p{Punct}
But I would simplify your regex for a URL to:
(?i)(http(s)?://)?((www(\.\w+)+|(\d{1,3}\.){3}\d{1,3})[^\s,"']*(?<!\.))
And elaborate it only if you find a test case that breaks it.
As a java line it would be:
text = text.replaceAll("(?i)(http(s)?://)?((www(\\.\\w+)+|(\\d{1,3}\\.){3}\\d{1,3})[^\\s,\"']*(?<!\\.))", "<a href=\"http$2://$3\">http$2://$3</a>");
Note the neat capture of the "s" in "https" (if found) that is restored if required.
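To illustrate the `\p{Punct}` point on its own, here is the asker's original pattern with only the POSIX class rewritten in Java syntax and nothing else changed. Treat it as a sketch rather than a recommended URL matcher; the class name `Linkifier` is invented for the example.

```java
public class Linkifier {
    // The asker's regex with [^[:punct:]\s] rewritten as [^\p{Punct}\s].
    private static final String REG =
            "\\b(([\\w-]+://?|www[.])[^\\s()<>]+(?:\\([\\w\\d]+\\)|([^\\p{Punct}\\s]|/)))";

    public static String linkify(String text) {
        // $1 is the whole matched URL; it is used for both the href and the label.
        return text.replaceAll(REG, "<a href='$1'>$1</a>");
    }

    public static void main(String[] args) {
        System.out.println(linkify("Hi please visit www.google.com"));
        // prints: Hi please visit <a href='www.google.com'>www.google.com</a>
    }
}
```

With this change the final character of the match must be something other than punctuation or whitespace (or a slash), so a trailing full stop after the URL stays outside the link.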
I am looking to validate email addresses by making sure they have a specific university subdomain, e.g. if the user says they attend Oxford University, I want to check that their email ends in .ox.ac.uk
If I have the '.ox.ac.uk' part stored as a variable, how can I incorporate this with a regex to check the whole email is valid and ends in that variable suffix?
Many thanks!
We are using this email pattern (derived from this regular-expressions.info article):
^[\w!#$%&'*+/=?^`{|}~-]+(?:\.[\w!#$%&'*+/=?^`{|}~-]+)*@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?$
You should be able to extend it with your needed suffix:
^[\w!#$%&'*+/=?^`{|}~-]+(?:\.[\w!#$%&'*+/=?^`{|}~-]+)*@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+(?:ox\.ac\.uk)$
Note that I replaced the TLD part [a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])? with your required suffix (?:ox\.ac\.uk). The dot in front of the suffix is already matched by the preceding group, which ends in \. (\. is used to match the dot only).
Edit: one additional note: if you use String#matches(...) or Matcher#matches() there's no need for the leading ^ and the trailing $, since the entire string would have to match anyways.
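If the suffix is held in a variable, as the question describes, one way to build the pattern in Java is sketched below. The class name `UniversityEmailValidator` and the method `hasSuffix` are hypothetical. The sketch assumes the suffix includes its leading dot (as in '.ox.ac.uk') and that at least one domain label precedes it, so an address directly at ox.ac.uk would not pass.

```java
import java.util.regex.Pattern;

public class UniversityEmailValidator {
    // Local-part atom and domain label, taken from the email pattern above.
    private static final String ATOM = "[\\w!#$%&'*+/=?^`{|}~-]+";
    private static final String LABEL = "[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?";

    public static boolean hasSuffix(String email, String suffix) {
        String regex = ATOM + "(?:\\." + ATOM + ")*@"
                + LABEL + "(?:\\." + LABEL + ")*"
                // Pattern.quote makes the dots in the suffix match literally.
                + Pattern.quote(suffix);
        // matches() checks the entire string, so no ^ or $ is needed.
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
                .matcher(email)
                .matches();
    }

    public static void main(String[] args) {
        System.out.println(hasSuffix("jane.doe@cs.ox.ac.uk", ".ox.ac.uk"));
        System.out.println(hasSuffix("jane.doe@example.com", ".ox.ac.uk"));
    }
}
```

`Pattern.quote` plays the same role here as `preg_quote()` does in PHP: without it, each dot in the suffix would match any character.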
Assuming you are using php.
$ending = '.ox.ac.uk';
if(preg_match('/'.preg_quote($ending).'$/i', $email_address)) //... your code
Further info: the preg_quote() is necessary so that characters get escaped if they have a special meaning. In your case it's the dots.
edit: To check if the whole email is valid, see other questions, it is asked a lot. Just wanted to help with your special case.
Is there any real way to represent a URL (which more than likely will also have a query string) as a filename in Java without obscuring the original URL completely?
My first approach was to simply escape invalid characters with arbitrary replacements (for example, replacing "/" with "_", etc).
The problem with replacements such as underscores is that a URL like "app/my_app" would become "app_my_app", making it impossible to tell what the original URL was.
I have also attempted to encode all the special characters, however again, seeing crazy %3e %20 etc is really not clear.
Thank you for any suggestions.
Well, you should know exactly what you want here. Keep in mind that the restrictions on file names vary between systems. On a Unix system you probably only need to escape the slash somehow, whereas on Windows you need to take care of the colon and the question mark as well.
I guess the safest thing would be to percent-encode anything that could potentially clash (everything non-alphanumeric would be a good candidate, although you might adapt this to the platform). It's still somewhat readable and you're guaranteed to get the original URL back.
Why? URL-encoding is already defined in an RFC: there's not much point in reinventing it. Basically you must have an escape character such as %, otherwise you can't tell whether a character represents itself or an escape. E.g. in your example app_my_app could represent app/my/app. You therefore also need a double-escape convention so you can represent the escape character itself. It is not simple.
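A minimal Java sketch of the percent-encoding approach, using the standard `URLEncoder` and `URLDecoder` classes, might look like this. The class name `UrlFileNames` is made up, the `Charset` overloads assume Java 10 or later, and the extra handling of `*` is my own addition for Windows file names.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class UrlFileNames {
    public static String toFileName(String url) {
        // URLEncoder keeps letters, digits, '.', '-', '_' and '*' as they are,
        // turns a space into '+', and percent-encodes everything else.
        String encoded = URLEncoder.encode(url, StandardCharsets.UTF_8);
        // '*' is not allowed in Windows file names, so escape it as well.
        return encoded.replace("*", "%2A");
    }

    public static String toUrl(String fileName) {
        // Decoding reverses both the percent escapes and the '+' for spaces.
        return URLDecoder.decode(fileName, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = toFileName("http://example.com/app/my_app?key=1");
        System.out.println(name);
        System.out.println(toUrl(name));
    }
}
```

Because `%` is itself escaped (as `%25`), "app/my_app" and "app_my_app" map to different file names and both decode back unambiguously. File name length limits are not handled in this sketch.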