Regex for validating URLs or File locations within Struts 2 - java

Good day,
I'm looking for a regex that validates URLs and file locations and that will work within a Struts 2 environment.
By "within a Struts 2 environment" I mean that the string will be entered into a textfield:
<s:textfield name="linkAddr.urlAddress" id="linkAddr" maxlength="2500"/>
In struts 2, as you know, if someone inputs google.ca, it will return
APP_LOCATION/NAMESPACE/google.ca
, and will not point to Google, despite the input normally being correct.
Therefore, I want a regex that takes this into account: the user MUST type http, https, ftp, or \\ (in the case of a file located on a shared drive).
EDIT:
Some examples:
I want to allow:
http://foo.com/blah_blah_(wikipedia)_(again)
http://www.example.com/wpstyle/?p=364
https://www.example.com/foo/?bar=baz&inga=42&quux
http://✪df.ws/123
ftp://foo.bar/baz
http://foo.bar/?q=Test%20URL-encoded%20stuff
http://1337.net
http://a.b-c.de
\\asdf.233.net\natdfs\AAA\HQ\FFEE\FFEE_H0E\GV1\AAA\FFFEEE\Web Dev\Web Applications Team\Web Applications Team Document.docx

Try this for your regex:
((http://|https://|ftp://)([\S.]+))|((\\\\)(.+)(\.)(\w+))
Your case is a little complicated because of the last example (the shared-drive path), and I think this regex will validate some URLs that you don't want validated, since it also tries to cover subdomains and so on, but you can try it out and adjust where necessary.
This regex checks whether your string starts with http://, https:// or ftp:// followed by any number of non-whitespace characters, or whether it starts with \\ and is followed by any number of characters ending with a file extension (e.g. .doc). A shared-drive path without a file extension is treated as invalid.
You can test out the regex and anything else you come up with at RegExr!
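To see how the regex above behaves from Java, here is a minimal sketch. The class name LinkAddressValidator and the method isValid are made up for illustration; only the pattern itself comes from the answer, and how you wire the check into a Struts 2 action or validator is left open.

```java
import java.util.regex.Pattern;

final class LinkAddressValidator {
    // The regex from the answer, escaped for a Java string literal.
    // Left branch: http://, https:// or ftp:// followed by non-whitespace characters.
    // Right branch: two backslashes, then anything ending in a dot plus a file extension.
    private static final Pattern LINK = Pattern.compile(
            "((http://|https://|ftp://)([\\S.]+))|((\\\\\\\\)(.+)(\\.)(\\w+))");

    // matches() requires the whole input to match, so no ^ or $ anchors are needed.
    static boolean isValid(String input) {
        return input != null && LINK.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("https://www.example.com/foo/?bar=baz&inga=42&quux"));
        System.out.println(isValid("google.ca"));
    }
}
```

Because a scheme or a leading \\ is required, bare input such as google.ca is rejected before Struts 2 can turn it into a relative link.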

Related

Regex expression for comma and dash separated text of items

I have a Java web application where I get some input from the user. Once I have this input I have to parse it, and the parsing depends on what kind of input I get. I decided to use Java's Pattern class for some of the predefined user inputs.
So I need the last 2 regex patterns:
a) Enumeration:
input can be - A03,B24.1,A25.7
The simple way would be to check whether there is a comma in there ([^,]+), but that would end up requiring a lot of changes to the parsing function, which I would like to avoid. So, in addition to the comma, it should check that each item:
starts with a letter
has a minimum of 3 characters (letters combined with numbers)
can have one dot in the word
and that the input has a minimum of 1 comma (updated it)
b) Mixed
input can be A03,B24.1-B35.5,A25.7
So everything the Enumeration part requires, with the addition that it can contain dashes (at least one).
I've tried multiple online regex generators but didn't get it right. It would be much appreciated if you could help.
Here is what I have for the case where the input is just a simple range such as B24.1-B35.5:
"='.{1}\\d{0,2}-.{1}\\d{0,2}'|='.{1}\\d{1,2}.\\d{1,2}-.{1}\\d{1,2}.\\d{1,2}'";
Edit1: Valid and Invalid inputs
for a) Enumeration
A03,B24.1,A25.7 Valid
A03,B24.1 Valid
A03,B24.1-B25.1 Invalid, because in this case (enumeration) it should not contain a dash
A03 Invalid, because there is no comma
for b) Mixed
everything that an enumeration has, with the addition that it can contain a dash too.
You can use this regex for (a) Enumeration part as per your rules:
[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?(?:,[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?)+
Rules:
Verifies that each segment starts with a letter
Minimum of three letters or numbers [A-Za-z][A-Za-z0-9]{2,}
Optionally followed by a dot . and one or more letters or digits, i.e. (?:\.[A-Za-z0-9]{1,})?
The same thing repeated, separated by a comma ,. There must also be at least one comma, hence the +, i.e. (?:,[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?)+
?: to indicate non-capturing group
Using [A-Za-z0-9] instead of \w to avoid underscores
Regex101 Demo
For (b) Mixed, you haven't shared too many valid and invalid cases, but based on my current understanding here's what I have:
[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?(?:[,-][A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?)+
Note that , from previous regex has been replaced with [,-] to allow - as well!
Regex101 Demo
// Will match
A03,B24.1-B35.5,A25.7
A03,B24.1,A25.7
A03,B24.1-B25.1
Hope this helps!
EDIT: Making sure each group starts with a letter (and not a number)
Thanks to @diginoise and @anubhava for pointing this out! Changed [A-Za-z0-9]{3,} to [A-Za-z][A-Za-z0-9]{2,}
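As a rough sketch of how the two patterns from this answer could be used with java.util.regex.Pattern, as the question asks: the class and method names below are invented for illustration, and {1,} is written as the equivalent +.

```java
import java.util.regex.Pattern;

final class CodeListPatterns {
    // One item: a letter, then two or more letters/digits, then an optional ".xxx" part.
    private static final String CODE = "[A-Za-z][A-Za-z0-9]{2,}(?:\\.[A-Za-z0-9]+)?";

    // (a) Enumeration: items separated by commas, at least one comma.
    static final Pattern ENUMERATION = Pattern.compile(CODE + "(?:," + CODE + ")+");

    // (b) Mixed: items separated by commas or dashes, at least one separator.
    static final Pattern MIXED = Pattern.compile(CODE + "(?:[,-]" + CODE + ")+");

    static boolean isEnumeration(String input) {
        return ENUMERATION.matcher(input).matches();
    }

    static boolean isMixed(String input) {
        return MIXED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isEnumeration("A03,B24.1,A25.7"));
        System.out.println(isMixed("A03,B24.1-B35.5,A25.7"));
    }
}
```

Note that, as written in the answer, the Mixed pattern also accepts a pure enumeration such as A03,B24.1; if a dash must be present, that has to be checked separately.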
As I said in the comments, I would split the input on commas and verify each segment separately. Your domain (ICD-10-CM codes) is very well defined, and I would be very wary of any input that could be invalid yet still pass validation.
Here is my solution:
regex
([A-TV-Z][0-9][A-Z0-9](\.?[A-Z0-9]{0,4})?)
... however I would avoid that.
Since your domain is (most likely) medical software, people's lives (or at least their well-being) are at stake. Not to mention astronomical damages and the lawyers ever-chasing ambulances. Therefore avoid the easy solution, and implement the bomb-proof one.
You could use the regex to establish that given code is definitely not valid. However if a code passes your regex it does not mean that it is valid.
bomb proof method
See this example: O09.7, O09.70, O09.71, O09.72, O09.73 are valid entries, but O09.1 is not valid.
Therefore just get all possible codes. According to this gist there are 42784 different codes. Load them into memory, and any code that is not in the set is not valid. You could compress the list and be clever about the in-memory encoding to save space, but verbatim all codes take under 300 kB on disk, so a few MB at most in memory. That is not a massive price to pay for people not having the left kidney removed instead of the right one.
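A minimal sketch of that set-lookup idea follows. The class KnownCodeValidator and its methods are hypothetical, and the handful of codes in main is only a stand-in taken from the example above; a real application would load the complete code list from a file or database.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class KnownCodeValidator {
    private final Set<String> knownCodes;

    private KnownCodeValidator(Set<String> knownCodes) {
        this.knownCodes = knownCodes;
    }

    // Convenience factory for the sketch; real code would read the full list instead.
    static KnownCodeValidator of(String... codes) {
        return new KnownCodeValidator(new HashSet<>(Arrays.asList(codes)));
    }

    // Splits the input on commas and requires every segment to be a known code.
    // The -1 limit keeps trailing empty segments, so "O09.7," is rejected.
    boolean allKnown(String input) {
        for (String segment : input.split(",", -1)) {
            if (!knownCodes.contains(segment.trim())) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        KnownCodeValidator validator = of("O09.7", "O09.70", "O09.71", "O09.72", "O09.73");
        System.out.println(validator.allKnown("O09.7,O09.70"));
        System.out.println(validator.allKnown("O09.7,O09.1"));
    }
}
```

Ranges written with a dash would need their own handling (for example, checking both end points), which this sketch does not attempt.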

Regex to validate words do not contain numbers or special characters

I am developing a Java app running on Android. I am trying to pick all words which do not contain any embedded digits or symbols.
The best I have come up with is:
\b[a-zA-Z]+[a-zA-Z]*+\b
Test Data:
this is a test , an0ther gr8 WW##ee one, w1n 1test test1 end
This results in picking the following: this, is, a, test, WW##ee, one, end
I need to eliminate the WW##ee from the results.
You shouldn't use the word boundary meta-character \b here, since it matches at the position right after WW, where the next character is a hash #. That position is itself a word boundary. So you should pick a different approach:
(?<![\S&&[^,]])[a-zA-Z]+(?![\S&&[^,]])
Using the character class intersection feature of Java's regex, you can define which punctuation characters are allowed to precede or follow a word. Here that is only a comma ,.
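Here is a small sketch that runs the lookaround pattern above against the test data from the question. The class name WordPicker and the method pickWords are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordPicker {
    // [\S&&[^,]] means "any non-whitespace character except a comma".
    // A run of letters is accepted only if no such character touches it on either side.
    private static final Pattern WORD =
            Pattern.compile("(?<![\\S&&[^,]])[a-zA-Z]+(?![\\S&&[^,]])");

    static List<String> pickWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String data = "this is a test , an0ther gr8 WW##ee one, w1n 1test test1 end";
        // prints [this, is, a, test, one, end]
        System.out.println(pickWords(data));
    }
}
```

To allow more punctuation around words, add the characters to the inner negated class, for example [^,.] to also permit a full stop.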
You could use look behind and look ahead to check there is no #.
\b(?<!\#)[a-zA-Z]+(?!\#)\b
My solution has evolved a bit as I have gotten additional help with this. So, this is now my best solution but still a bit lacking. I have not been able to accept "as-is" while rejecting "-this-" and a similar case of accept "and/or" while rejecting "/slash/". Also for simplicity I have made the input data single word per line.
^(?:[\p{P}\p{S}])?((?:[\p{L}\p{Pd}'])+)(?:[\p{P}\p{S}])?$
as-is is picked valid
-this- is valid but I wish it weren't
and/or is not valid but I wish it would be picked
/slash/ "slash" is picked valid
(test) "test" is picked valid
[test] "test" is picked valid
<test> "test" is picked valid

Download list of pages from some domain with URL constraint

I need to download a list of all the pages on some domain that have specific URL endings.
For example, I have a webpage, like http://brnensky.denik.cz/, which is a Czech webpage with news. Every article has URL ending with post date, like http://brnensky.denik.cz/zpravy_region/ruzova-kola-usnadni-presun-po-brne-20140418.html.
So I would like to find the list of all URLs that begin with http://brnensky.denik.cz/, then whatever, and then for example -20140418.html. Is it possible to achieve?
I'm trying to solve this in Java, but also any other way would help.
Regex would be
^http://brnensky\.denik\.cz.*[0-9]{8}\.html
Logic
The match begins with the site URL and ends with date.html, where the date is always an 8-digit string.
You may have to escape '/' depending on the tool or language used to implement this expression.
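Since the question mentions Java, here is a minimal sketch of applying that regex to a list of candidate URLs. It only covers the filtering step; collecting the URLs in the first place (from a sitemap or a crawl) is not shown, and the class DatedArticleFilter is a made-up name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class DatedArticleFilter {
    // Same regex as above; no '/' escaping is needed in Java.
    // matches() anchors the pattern at both ends, so ^ and $ are left out.
    private static final Pattern DATED_ARTICLE =
            Pattern.compile("http://brnensky\\.denik\\.cz.*[0-9]{8}\\.html");

    static boolean isDatedArticle(String url) {
        return DATED_ARTICLE.matcher(url).matches();
    }

    // Keeps only the URLs that look like dated articles on the site.
    static List<String> filter(List<String> urls) {
        List<String> result = new ArrayList<>();
        for (String url : urls) {
            if (isDatedArticle(url)) {
                result.add(url);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(isDatedArticle(
                "http://brnensky.denik.cz/zpravy_region/ruzova-kola-usnadni-presun-po-brne-20140418.html"));
    }
}
```

To select a single day, the [0-9]{8} part can be replaced with the literal date, such as 20140418.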

Regex for university emails

I am looking to validate email addresses by making sure they have a specific university subdomain, e.g. if the user says they attend Oxford University, I want to check that their email ends in .ox.ac.uk
If I have the '.ox.ac.uk' part stored as a variable, how can I incorporate this with a regex to check the whole email is valid and ends in that variable suffix?
Many thanks!
We are using this email pattern (derived from this regular-expressions.info article):
^[\w!#$%&'*+/=?^`{|}~-]+(?:\.[\w!#$%&'*+/=?^`{|}~-]+)*@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?$
You should be able to extend it with your needed suffix:
^[\w!#$%&'*+/=?^`{|}~-]+(?:\.[\w!#$%&'*+/=?^`{|}~-]+)*@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+(?:ox\.ac\.uk)$
Note that I replaced the TLD part [a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])? with your required suffix (?:ox\.ac\.uk), where \. matches a literal dot only. The dot in front of ox is already consumed by the preceding group, which ends in \., so the suffix itself must not start with another \.
Edit: one additional note: if you use String#matches(...) or Matcher#matches() there's no need for the leading ^ and the trailing $, since the entire string would have to match anyways.
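If the suffix is held in a variable, one way to avoid building a regex from it at all is to validate the address with the general pattern and then compare the ending as a plain string. The sketch below assumes that two-step approach; it is not the only option (Pattern.quote would also let you splice the suffix into the regex), and the class and method names are hypothetical. The general pattern is written with a literal @ between local part and domain.

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class UniversityEmailCheck {
    // Building blocks of the general email pattern from this answer.
    private static final String ATOM = "[\\w!#$%&'*+/=?^`{|}~-]+";
    private static final String LABEL = "[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?";

    private static final Pattern EMAIL = Pattern.compile(
            ATOM + "(?:\\." + ATOM + ")*@(?:" + LABEL + "\\.)+" + LABEL);

    // suffix is expected in the form ".ox.ac.uk" (with the leading dot).
    static boolean isUniversityEmail(String email, String suffix) {
        if (email == null || suffix == null) {
            return false;
        }
        boolean wellFormed = EMAIL.matcher(email).matches();
        // Domain names are case-insensitive, so compare in lower case.
        boolean hasSuffix = email.toLowerCase(Locale.ROOT)
                .endsWith(suffix.toLowerCase(Locale.ROOT));
        return wellFormed && hasSuffix;
    }

    public static void main(String[] args) {
        System.out.println(isUniversityEmail("jane.doe@cs.ox.ac.uk", ".ox.ac.uk"));
        System.out.println(isUniversityEmail("jane.doe@gmail.com", ".ox.ac.uk"));
    }
}
```

With a suffix that starts with a dot, an address directly at ox.ac.uk (no subdomain) is rejected; drop the leading dot and check for "@" or "." before the suffix if such addresses should pass.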
Assuming you are using PHP:
$ending = '.ox.ac.uk';
if(preg_match('/'.preg_quote($ending).'$/i', $email_address)) //... your code
Further info: preg_quote() is necessary so that characters with a special meaning in a regex get escaped. In your case those are the dots.
Edit: to check whether the whole email is valid, see other questions; it is asked a lot. I just wanted to help with your special case.

How to catch URLs given by user in text

I would like to extract the URLs a user includes in his/her text (I assume that a URL must start with http://). This is my first attempt:
Pattern pattern = Pattern.compile("http://[^ ]+");
but if user types something like this:
"look at somepage (http://somepage.net)"
"look at http://somepage1.net, http://somepage2.net and sth else"
"Please visit our page http://somepage.net."
the URL ends up with an incorrect(?) character at the end. How can I avoid this?
I could match on the assumption that a URL can't end with [,.)] etc. and may end only with [A-Za-z] or /, but that breaks URLs with unusual endings such as http://site.com/read.php?key=F#$.)
The answer is that you cannot do this with 100% accuracy.
A URL like "http://somepage1.net," is technically legal, and there is no way of knowing for sure whether the "," is part of the URL or just punctuation.
A URL like "http://somepage1.net or something" is technically illegal, but typical end users don't know this. (They are used to browsers that do all sorts of funky things to what they type at their browser.)
Probably the best you can do is use a regex to extract legal URLs, and then trim text punctuation characters from the right end of each URL, on the assumption that they are not intended to be part of the URL.
You could also treat matching quotes or left / right brackets as denoting URL boundaries; e.g.
The secret URL is "http://example.com/?" ... don't leave off the "?"
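The extract-then-trim heuristic described above could be sketched as follows. This is an illustration under assumptions, not a complete solution: the class UrlExtractor is a made-up name, the set of trailing characters to strip is a guess, and \S+ is used instead of [^ ]+ so that tabs and newlines also end a URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlExtractor {
    // Step 1: grab everything from http:// up to the next whitespace.
    private static final Pattern CANDIDATE = Pattern.compile("http://\\S+");

    // Step 2: strip punctuation that is more likely part of the sentence than of the URL.
    private static final Pattern TRAILING_PUNCTUATION = Pattern.compile("[.,;:!?)\\]\"']+$");

    static List<String> extract(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = CANDIDATE.matcher(text);
        while (matcher.find()) {
            String url = TRAILING_PUNCTUATION.matcher(matcher.group()).replaceAll("");
            // Skip leftovers such as a bare "http://".
            if (url.length() > "http://".length()) {
                urls.add(url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(extract("look at http://somepage1.net, http://somepage2.net and sth else"));
    }
}
```

As the answer warns, this is lossy: a URL that genuinely ends in ?, ) or a dot gets truncated, so the quoted example with the trailing "?" would lose its question mark.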
