Extract specific lines from multiple text files - java

I want to print certain lines from multiple text files in a folder, depending on the file name. Consider the following text files named from 3 words separated by underscore:
Small_Apple_Red.txt
Small_Orange_Yellow.txt
Large_Apple_Green.txt
Large_Orange_Green.txt
How can the following be achieved?
if (first word of file name is "Small") {
// Print row 3 column 2 of the file (space delimited);
}
if (second word of file name is "Orange") {
// print row 1 column 4 of the file;
}
Is this possible with awk?

Try as follow.
Use glob for to handle the files in a folder.
Then use regular expression for check the file name. Here
grep uses for extract the particular content from a file.
my $path = "folderpath";
while (my $file = glob("$path/*"))
{
if($file =~/\/Small_Apple/)
{
open my $fh, "<", "$file";
print grep{/content what you want/ } <$fh>;
}
}

use strict;
use warnings;
my #file_names = ("Small_Apple_Red.txt",
"Small_Orange_Yellow.txt",
"Large_Apple_Green.txt",
"Large_Orange_Green.txt");
foreach my $file ( #file_names) {
if ( $file =~ /^Small/){ // "^" marks the begining of the string
print "\n $file has the first word small";
}
elsif ( $file =~ /.*?_Orange/){ // .*? is non-greedy, this means that it matches anything<br>
// until the first "_" is found
print "\n $file has the second word orange";
}
}
There is still a special case it your file has "Small_Orange" you have to decide which is more important. If the second word is more important, then switch the content from if section with the content from elsif section

In Awk:
awk 'FILENAME ~ /^Large/ {print $1,$4}
FILENAME ~ /^Small/ {print $3,$2}' *
in Perl:
perl -naE 'say "$F[0] $F[3]" if $ARGV =~ /^Large/;
say "$F[2] $F[1]" if $ARGV =~ /^Small/ ' *

Try this one:
use strict;
use warnings;
use Cwd;
use File::Basename;
my $dir = getcwd(); #or shift the input values from the user
my #txtfiles = glob("$dir/*.txt");
foreach my $each_txt_file (#txtfiles)
{
open(DATA, $each_txt_file) || die "reason: $!";
my #allLines = <DATA>;
(my $removeExt = $each_txt_file)=~s/\.txt$//g;
my($word1, $word2, $word3) = split/\_/, basename $removeExt; #Select the file name with matching case
if($word1=~m/small/i) #Select your match case
{
my #split_space = "";
my #allrows = split /\n/, $allLines[1]; #Mentioned the row number
my #allcolns = split /\s/, $allrows[0];
print "\n", $allcolns[1]; #Mentioned the column number
}
}

Related

Regex for domain extraction [duplicate]

How can I check if a given string is a valid URL address?
My knowledge of regular expressions is basic and doesn't allow me to choose from the hundreds of regular expressions I've already seen on the web.
I wrote my URL (actually IRI, internationalized) pattern to comply with RFC 3987 (http://www.faqs.org/rfcs/rfc3987.html). These are in PCRE syntax.
For absolute IRIs (internationalized):
/^[a-z](?:[-a-z0-9\+\.])*:(?:\/\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:])*#)?(?:\[(?:(?:(?:[0-9a-f]{1,4}:){6}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|::(?:[0-9a-f]{1,4}:){5}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){4}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,1}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){3}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,2}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){2}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,3}[0-9a-f]{1,4})?::[0-9a-f]{1,4}:(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,4}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,5}[0-9a-f]{1,4})?::[0-9a-f]{1,4}|(?:(?:[0-9a-f]{1,4}:){0,6}[0-9a-f]{1,4})?::)|v[0-9a-f]+\.[-a-z0-9\._~!\$&'\(\)\*\+,;=:]+)\]|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3}|(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=])*)(?::[0-9]*)?(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:#]))*)*|\/(?:(?:(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:#]))+)(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:#]))*)*)?|(?:(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:#]))+)(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:#]))*)*|(?!(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:#])))(?:\?(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:#])|[\x{E000}-\x{F8FF}\x{F0000}-\x{FFFFD}\x{100000}-\x{10FFFD}\/\?])*)?(?:\#(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:#])|[\/\?])*)?$/i
To also allow relative IRIs:
/^(?:[a-z](?:[-a-z0-9\+\.])*:(?:\/\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:])*#)?(?:\[(?:(?:(?:[0-9a-f]{1,4}:){6}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|::(?:[0-9a-f]{1,4}:){5}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){4}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,1}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){3}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,2}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){2}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,3}[0-9a-f]{1,4})?::[0-9a-f]{1,4}:(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,4}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,5}[0-9a-f]{1,4})?::[0-9a-f]{1,4}|(?:(?:[0-9a-f]{1,4}:){0,6}[0-9a-f]{1,4})?::)|v[0-9a-f]+\.[-a-z0-9\._~!\$&'\(\)\*\+,;=:]+)\]|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3}|(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=])*)(?::[0-9]*)?(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:#]))*)*|\/(?:(?:(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:#]))+)(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:#]))*)*)?|(?:(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:#]))+)(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:#]))*)*|(?!(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:#])))(?:\?(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:#])|[\x{E000}-\x{F8FF}\x{F0000}-\x{FFFFD}\x{100000}-\x{10FFFD}\/\?])*)?(?:\#(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:#])|[\/\?])*)?|(?:\/\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:])*#)?(?:\[(?:(?:(?:[0-9a-f]{1,4}:){6}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|::(?:[0-9a-f]{1,4}:){5}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){4}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,1}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){3}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,2}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){2}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,3}[0-9a-f]{1,4})?::[0-9a-f]{1,4}:(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,4}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,5}[0-9a-f]{1,4})?::[0-9a-f]{1,4}|(?:(?:[0-9a-f]{1,4}:){0,6}[0-9a-f]{1,4})?::)|v[0-9a-f]+\.[-a-z0-9\._~!\$&'\(\)\*\+,;=:]+)\]|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3}|(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=])*)(?::[0-9]*)?(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:#]))*)*|\/(?:(?:(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:#]))+)(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:#]))*)*)?|(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=#])+)(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:#]))*)*|(?!(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:#])))(?:\?(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:#])|[\x{E000}-\x{F8FF}\x{F0000}-\x{FFFFD}\x{100000}-\x{10FFFD}\/\?])*)?(?:\#(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:#])|[\/\?])*)?)$/i
How they were compiled (in PHP):
<?php
/* Regex convenience functions (character class, non-capturing group) */
function cc($str, $suffix = '', $negate = false) {
return '[' . ($negate ? '^' : '') . $str . ']' . $suffix;
}
function ncg($str, $suffix = '') {
return '(?:' . $str . ')' . $suffix;
}
/* Preserved from RFC3986 */
$ALPHA = 'a-z';
$DIGIT = '0-9';
$HEXDIG = $DIGIT . 'a-f';
$sub_delims = '!\\$&\'\\(\\)\\*\\+,;=';
$gen_delims = ':\\/\\?\\#\\[\\]#';
$reserved = $gen_delims . $sub_delims;
$unreserved = '-' . $ALPHA . $DIGIT . '\\._~';
$pct_encoded = '%' . cc($HEXDIG) . cc($HEXDIG);
$dec_octet = ncg(implode('|', array(
cc($DIGIT),
cc('1-9') . cc($DIGIT),
'1' . cc($DIGIT) . cc($DIGIT),
'2' . cc('0-4') . cc($DIGIT),
'25' . cc('0-5')
)));
$IPv4address = $dec_octet . ncg('\\.' . $dec_octet, '{3}');
$h16 = cc($HEXDIG, '{1,4}');
$ls32 = ncg($h16 . ':' . $h16 . '|' . $IPv4address);
$IPv6address = ncg(implode('|', array(
ncg($h16 . ':', '{6}') . $ls32,
'::' . ncg($h16 . ':', '{5}') . $ls32,
ncg($h16, '?') . '::' . ncg($h16 . ':', '{4}') . $ls32,
ncg($h16 . ':' . $h16, '?') . '::' . ncg($h16 . ':', '{3}') . $ls32,
ncg(ncg($h16 . ':', '{0,2}') . $h16, '?') . '::' . ncg($h16 . ':', '{2}') . $ls32,
ncg(ncg($h16 . ':', '{0,3}') . $h16, '?') . '::' . $h16 . ':' . $ls32,
ncg(ncg($h16 . ':', '{0,4}') . $h16, '?') . '::' . $ls32,
ncg(ncg($h16 . ':', '{0,5}') . $h16, '?') . '::' . $h16,
ncg(ncg($h16 . ':', '{0,6}') . $h16, '?') . '::',
)));
$IPvFuture = 'v' . cc($HEXDIG, '+') . cc($unreserved . $sub_delims . ':', '+');
$IP_literal = '\\[' . ncg(implode('|', array($IPv6address, $IPvFuture))) . '\\]';
$port = cc($DIGIT, '*');
$scheme = cc($ALPHA) . ncg(cc('-' . $ALPHA . $DIGIT . '\\+\\.'), '*');
/* New or changed in RFC3987 */
$iprivate = '\x{E000}-\x{F8FF}\x{F0000}-\x{FFFFD}\x{100000}-\x{10FFFD}';
$ucschar = '\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}' .
'\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}' .
'\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}' .
'\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}' .
'\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}' .
'\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}';
$iunreserved = '-' . $ALPHA . $DIGIT . '\\._~' . $ucschar;
$ipchar = ncg($pct_encoded . '|' . cc($iunreserved . $sub_delims . ':#'));
$ifragment = ncg($ipchar . '|' . cc('\\/\\?'), '*');
$iquery = ncg($ipchar . '|' . cc($iprivate . '\\/\\?'), '*');
$isegment_nz_nc = ncg($pct_encoded . '|' . cc($iunreserved . $sub_delims . '#'), '+');
$isegment_nz = ncg($ipchar, '+');
$isegment = ncg($ipchar, '*');
$ipath_empty = '(?!' . $ipchar . ')';
$ipath_rootless = ncg($isegment_nz) . ncg('\\/' . $isegment, '*');
$ipath_noscheme = ncg($isegment_nz_nc) . ncg('\\/' . $isegment, '*');
$ipath_absolute = '\\/' . ncg($ipath_rootless, '?'); // Spec says isegment-nz *( "/" isegment )
$ipath_abempty = ncg('\\/' . $isegment, '*');
$ipath = ncg(implode('|', array(
$ipath_abempty,
$ipath_absolute,
$ipath_noscheme,
$ipath_rootless,
$ipath_empty
))) . ')';
$ireg_name = ncg($pct_encoded . '|' . cc($iunreserved . $sub_delims . '#'), '*');
$ihost = ncg(implode('|', array($IP_literal, $IPv4address, $ireg_name)));
$iuserinfo = ncg($pct_encoded . '|' . cc($iunreserved . $sub_delims . ':'), '*');
$iauthority = ncg($iuserinfo . '#', '?') . $ihost . ncg(':' . $port, '?');
$irelative_part = ncg(implode('|', array(
'\\/\\/' . $iauthority . $ipath_abempty . '',
'' . $ipath_absolute . '',
'' . $ipath_noscheme . '',
'' . $ipath_empty . ''
)));
$irelative_ref = $irelative_part . ncg('\\?' . $iquery, '?') . ncg('\\#' . $ifragment, '?');
$ihier_part = ncg(implode('|', array(
'\\/\\/' . $iauthority . $ipath_abempty . '',
'' . $ipath_absolute . '',
'' . $ipath_rootless . '',
'' . $ipath_empty . ''
)));
$absolute_IRI = $scheme . ':' . $ihier_part . ncg('\\?' . $iquery, '?');
$IRI = $scheme . ':' . $ihier_part . ncg('\\?' . $iquery, '?') . ncg('\\#' . $ifragment, '?');
$IRI_reference = ncg($IRI . '|' . $irelative_ref);
Edit 7 March 2011: Because of the way PHP handles backslashes in quoted strings, these are unusable by default. You'll need to double-escape backslashes except where the backslash has a special meaning in regex. You can do that this way:
$escape_backslash = '/(?<!\\)\\(?![\[\]\\\^\$\.\|\*\+\(\)QEnrtaefvdwsDWSbAZzB1-9GX]|x\{[0-9a-f]{1,4}\}|\c[A-Z]|)/';
$absolute_IRI = preg_replace($escape_backslash, '\\\\', $absolute_IRI);
$IRI = preg_replace($escape_backslash, '\\\\', $IRI);
$IRI_reference = preg_replace($escape_backslash, '\\\\', $IRI_reference);
I've just written up a blog post for a great solution for recognizing URLs in most used formats such as:
www.google.com
http://www.google.com
mailto:somebody#google.com
somebody#google.com
www.url-with-querystring.com/?url=has-querystring
The regular expression used is:
/((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+#)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+#)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%#.\w_]*)#?(?:[\w]*))?)/
What platform? If using .NET, use System.Uri.TryCreate, not a regex.
For example:
static bool IsValidUrl(string urlString)
{
Uri uri;
return Uri.TryCreate(urlString, UriKind.Absolute, out uri)
&& (uri.Scheme == Uri.UriSchemeHttp
|| uri.Scheme == Uri.UriSchemeHttps
|| uri.Scheme == Uri.UriSchemeFtp
|| uri.Scheme == Uri.UriSchemeMailto
/*...*/);
}
// In test fixture...
[Test]
void IsValidUrl_Test()
{
Assert.True(IsValidUrl("http://www.example.com"));
Assert.False(IsValidUrl("javascript:alert('xss')"));
Assert.False(IsValidUrl(""));
Assert.False(IsValidUrl(null));
}
(Thanks to #Yoshi for the tip about javascript:)
Here's what RegexBuddy uses.
(\b(https?|ftp|file)://)?[-A-Za-z0-9+&##/%?=~_|!:,.;]+[-A-Za-z0-9+&##/%=~_|]
It matches these below (inside the ** ** marks):
**http://www.regexbuddy.com**
**http://www.regexbuddy.com/**
**http://www.regexbuddy.com/index.html**
**http://www.regexbuddy.com/index.html?source=library**
**http://www.regexbuddy.com/index.html?source=library#copyright**
You can download RegexBuddy at http://www.regexbuddy.com/download.html.
Mathias Bynens has a great article on the best comparison of a lot of regular expressions: In search of the perfect URL validation regex
The best one posted is a little long, but it matches just about anything you can throw at it.
JavaScript version
/^(?:(?:(?:https?|ftp):)?\/\/)(?:\S+(?::\S*)?#)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z0-9\u00a1-\uffff][a-z0-9\u00a1-\uffff_-]{0,62})?[a-z0-9\u00a1-\uffff]\.)+(?:[a-z\u00a1-\uffff]{2,}\.?))(?::\d{2,5})?(?:[/?#]\S*)?$/i
PHP version (uses % symbol as delimiter)
%^(?:(?:(?:https?|ftp):)?\/\/)(?:\S+(?::\S*)?#)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z0-9\x{00a1}-\x{ffff}][a-z0-9\x{00a1}-\x{ffff}_-]{0,62})?[a-z0-9\x{00a1}-\x{ffff}]\.)+(?:[a-z\x{00a1}-\x{ffff}]{2,}\.?))(?::\d{2,5})?(?:[/?#]\S*)?$%iuS
With regard to eyelidness' answer post that reads "This is based on my reading of the URI specification.": Thanks Eyelidness, yours is the perfect solution I sought, as it is based on the URI spec! Superb work. :)
I had to make two amendments. The first to get the regexp to match IP address URLs correctly in PHP (v5.2.10) with the preg_match() function.
I had to add one more set of parenthesis to the line above "IP Address" around the pipes:
)|((\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])\.){3}(?#
Not sure why.
I have also reduced the top level domain minimum length from 3 to 2 letters to support .co.uk and similar.
Final code:
/^(https?|ftp):\/\/(?# protocol
)(([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+(?# username
)(:([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+)?(?# password
)#)?(?# auth requires #
)((([a-z0-9]\.|[a-z0-9][a-z0-9-]*[a-z0-9]\.)*(?# domain segments AND
)[a-z][a-z0-9-]*[a-z0-9](?# top level domain OR
)|((\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])\.){3}(?#
)(\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])(?# IP address
))(:\d+)?(?# port
))(((\/+([a-z0-9$_\.\+!\*\'\(\),;:#&=-]|%[0-9a-f]{2})*)*(?# path
)(\?([a-z0-9$_\.\+!\*\'\(\),;:#&=-]|%[0-9a-f]{2})*)(?# query string
)?)?)?(?# path and query string optional
)(#([a-z0-9$_\.\+!\*\'\(\),;:#&=-]|%[0-9a-f]{2})*)?(?# fragment
)$/i
This modified version was not checked against the URI specification so I can't vouch for it's compliance, it was altered to handle URLs on local network environments and two digit TLDs as well as other kinds of Web URL, and to work better in the PHP setup I use.
As PHP code:
define('URL_FORMAT',
'/^(https?):\/\/'. // protocol
'(([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+'. // username
'(:([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+)?'. // password
'#)?(?#'. // auth requires #
')((([a-z0-9]\.|[a-z0-9][a-z0-9-]*[a-z0-9]\.)*'. // domain segments AND
'[a-z][a-z0-9-]*[a-z0-9]'. // top level domain OR
'|((\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])\.){3}'.
'(\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])'. // IP address
')(:\d+)?'. // port
')(((\/+([a-z0-9$_\.\+!\*\'\(\),;:#&=-]|%[0-9a-f]{2})*)*'. // path
'(\?([a-z0-9$_\.\+!\*\'\(\),;:#&=-]|%[0-9a-f]{2})*)'. // query string
'?)?)?'. // path and query string optional
'(#([a-z0-9$_\.\+!\*\'\(\),;:#&=-]|%[0-9a-f]{2})*)?'. // fragment
'$/i');
Here is a test program in PHP which validates a variety of URLs using the regex:
<?php
define('URL_FORMAT',
'/^(https?):\/\/'. // protocol
'(([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+'. // username
'(:([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+)?'. // password
'#)?(?#'. // auth requires #
')((([a-z0-9]\.|[a-z0-9][a-z0-9-]*[a-z0-9]\.)*'. // domain segments AND
'[a-z][a-z0-9-]*[a-z0-9]'. // top level domain OR
'|((\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])\.){3}'.
'(\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])'. // IP address
')(:\d+)?'. // port
')(((\/+([a-z0-9$_\.\+!\*\'\(\),;:#&=-]|%[0-9a-f]{2})*)*'. // path
'(\?([a-z0-9$_\.\+!\*\'\(\),;:#&=-]|%[0-9a-f]{2})*)'. // query string
'?)?)?'. // path and query string optional
'(#([a-z0-9$_\.\+!\*\'\(\),;:#&=-]|%[0-9a-f]{2})*)?'. // fragment
'$/i');
/**
* Verify the syntax of the given URL.
*
* #access public
* #param $url The URL to verify.
* #return boolean
*/
function is_valid_url($url) {
if (str_starts_with(strtolower($url), 'http://localhost')) {
return true;
}
return preg_match(URL_FORMAT, $url);
}
/**
* String starts with something
*
* This function will return true only if input string starts with
* niddle
*
* #param string $string Input string
* #param string $niddle Needle string
* #return boolean
*/
function str_starts_with($string, $niddle) {
return substr($string, 0, strlen($niddle)) == $niddle;
}
/**
* Test a URL for validity and count results.
* #param url url
* #param expected expected result (true or false)
*/
$numtests = 0;
$passed = 0;
function test_url($url, $expected) {
global $numtests, $passed;
$numtests++;
$valid = is_valid_url($url);
echo "URL Valid?: " . ($valid?"yes":"no") . " for URL: $url. Expected: ".($expected?"yes":"no").". ";
if($valid == $expected) {
echo "PASS\n"; $passed++;
} else {
echo "FAIL\n";
}
}
echo "URL Tests:\n\n";
test_url("http://localserver/projects/public/assets/javascript/widgets/UserBoxMenu/widget.css", true);
test_url("http://www.google.com", true);
test_url("http://www.google.co.uk/projects/my%20folder/test.php", true);
test_url("https://myserver.localdomain", true);
test_url("http://192.168.1.120/projects/index.php", true);
test_url("http://192.168.1.1/projects/index.php", true);
test_url("http://projectpier-server.localdomain/projects/public/assets/javascript/widgets/UserBoxMenu/widget.css", true);
test_url("https://2.4.168.19/project-pier?c=test&a=b", true);
test_url("https://localhost/a/b/c/test.php?c=controller&arg1=20&arg2=20", true);
test_url("http://user:password#localhost/a/b/c/test.php?c=controller&arg1=20&arg2=20", true);
echo "\n$passed out of $numtests tests passed.\n\n";
?>
Thanks again to eyelidness for the regex!
The post Getting parts of a URL (Regex) discusses parsing a URL to identify its various components. If you want to check if a URL is well-formed, it should be sufficient for your needs.
If you need to check if it's actually valid, you'll eventually have to try to access whatever's on the other end.
In general, though, you'd probably be better off using a function that's supplied to you by your framework or another library. Many platforms include functions that parse URLs. For example, there's Python's urlparse module, and in .NET you could use the System.Uri class's constructor as a means of validating the URL.
This might not be a job for regexes, but for existing tools in your language of choice. You probably want to use existing code that has already been written, tested, and debugged.
In PHP, use the parse_url function.
Perl: URI module.
Ruby: URI module.
.NET: 'Uri' class
Regexes are not a magic wand you wave at every problem that happens to involve strings.
This will match all URLs
with or without http/https
with or without www
...including sub-domains and those new top-level domain name extensions such as
.museum,
.academy,
.foundation
etc. which can have up to 63 characters (not just .com, .net, .info etc.)
(([\w]+:)?//)?(([\d\w]|%[a-fA-f\d]{2,2})+(:([\d\w]|%[a-fA-f\d]{2,2})+)?#)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,63}(:[\d]+)?(/([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-f\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)?
Because today maximum length of the available top-level domain name extension is 13 characters such as .international, you can change the number 63 in expression to 13 to prevent someone misusing it.
as javascript
var urlreg=/(([\w]+:)?\/\/)?(([\d\w]|%[a-fA-f\d]{2,2})+(:([\d\w]|%[a-fA-f\d]{2,2})+)?#)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,63}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-f\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)?/;
$('textarea').on('input',function(){
var url = $(this).val();
$(this).toggleClass('invalid', urlreg.test(url) == false)
});
$('textarea').trigger('input');
textarea{color:green;}
.invalid{color:red;}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<textarea>http://www.google.com</textarea>
<textarea>http//www.google.com</textarea>
<textarea>googlecom</textarea>
<textarea>https://www.google.com</textarea>
Wikipedia Article: List of all internet top-level domains
Non-validating URI-reference Parser
For reference purposes, here's the IETF Spec: (TXT | HTML). In particular, Appendix B. Parsing a URI Reference with a Regular Expression demonstrates how to parse a valid regex. This is described as,
for an example of a non-validating URI-reference parser that will take any given string and extract the URI components.
Here's the regex they provide:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
As someone else said, it's probably best to leave this to a lib/framework you're already using.
^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$
live demo: https://regex101.com/r/HUNasA/2
I have tested various expressions to match my requirements.
As a user I can hit browser search bar with following strings:
valid urls
https://www.google.com
http://www.google.com
http://google.com/
https://google.com/
www.google.com
google.com
https://www.google.com.ua
http://www.google.com.ua
http://google.com.ua
https://google.com.ua/
www.google.com.ua
google.com.ua
https://mail.google.com
http://mail.google.com
mail.google.com
invalid urls
http://google
https://google.c
google
google.
.google
.google.com
goole.c
...
The best regular expression for URL for me would be:
"(([\w]+:)?//)?(([\d\w]|%[a-fA-F\d]{2,2})+(:([\d\w]|%[a-fA-f\d]{2,2})+)?#)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(/([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-f\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)?"
Here is a good rule that covers all possible cases: ports, params and etc
/(https?:\/\/(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z0-9][a-z0-9-]{0,61}[a-z0-9])(:?\d*)\/?([a-z_\/0-9\-#.]*)\??([a-z_\/0-9\-#=&]*)/g
I wrote a little groovy version that you can run
it matches the following URLs (which is good enough for me)
public static void main(args) {
String url = "go to http://www.m.abut.ly/abc its awesome"
url = url.replaceAll(/https?:\/\/w{0,3}\w*?\.(\w*?\.)?\w{2,3}\S*|www\.(\w*?\.)?\w*?\.\w{2,3}\S*|(\w*?\.)?\w*?\.\w{2,3}[\/\?]\S*/ , { it ->
"woof${it}woof"
})
println url
}
http://google.com
http://google.com/help.php
http://google.com/help.php?a=5
http://www.google.com
http://www.google.com/help.php
http://www.google.com?a=5
google.com?a=5
google.com/help.php
google.com/help.php?a=5
http://www.m.google.com/help.php?a=5 (and all its permutations)
www.m.google.com/help.php?a=5 (and all its permutations)
m.google.com/help.php?a=5 (and all its permutations)
The important thing for any URLs that don't start with http or www is that they must include a / or ?
I bet this can be tweaked a little more but it does the job pretty nice for being so short and compact... because you can pretty much split it in 3:
find anything that starts with http:
https?:\/\/w{0,3}\w*?\.\w{2,3}\S*
find anything that starts with www:
www\.\w*?\.\w{2,3}\S*
or find anything that must have a text then a dot then at least 2 letters and then a ? or /:
\w*?\.\w{2,3}[\/\?]\S*
I was not able to find the regex I was looking for so I modified a regex to fullfill my requirements, and apparently it seems to work fine now. My requirements were:
Match URLs w/o protocol (www.gooogle.com)
Match URLs with query parameters and path (http://subdomain.web-site.com/cgi-bin/perl.cgi?key1=value1&key2=value2e)
Don't match URLs where there are not acceptable characters (e.g. "'£), for instance: (www.google.com/somthing"/somethingmore)
Here what I came up with, any suggestion is appreciated:
#Test
public void testWebsiteUrl(){
String regularExpression = "((http|ftp|https):\\/\\/)?[\\w\\-_]+(\\.[\\w\\-_]+)+([\\w\\-\\.,#?^=%&:/~\\+#]*[\\w\\-\\#?^=%&/~\\+#])?";
assertTrue("www.google.com".matches(regularExpression));
assertTrue("www.google.co.uk".matches(regularExpression));
assertTrue("http://www.google.com".matches(regularExpression));
assertTrue("http://www.google.co.uk".matches(regularExpression));
assertTrue("https://www.google.com".matches(regularExpression));
assertTrue("https://www.google.co.uk".matches(regularExpression));
assertTrue("google.com".matches(regularExpression));
assertTrue("google.co.uk".matches(regularExpression));
assertTrue("google.mu".matches(regularExpression));
assertTrue("mes.intnet.mu".matches(regularExpression));
assertTrue("cse.uom.ac.mu".matches(regularExpression));
assertTrue("http://www.google.com/path".matches(regularExpression));
assertTrue("http://subdomain.web-site.com/cgi-bin/perl.cgi?key1=value1&key2=value2e".matches(regularExpression));
assertTrue("http://www.google.com/?queryparam=123".matches(regularExpression));
assertTrue("http://www.google.com/path?queryparam=123".matches(regularExpression));
assertFalse("www..dr.google".matches(regularExpression));
assertFalse("www:google.com".matches(regularExpression));
assertFalse("https://www#.google.com".matches(regularExpression));
assertFalse("https://www.google.com\"".matches(regularExpression));
assertFalse("https://www.google.com'".matches(regularExpression));
assertFalse("http://www.google.com/path'".matches(regularExpression));
assertFalse("http://subdomain.web-site.com/cgi-bin/perl.cgi?key1=value1&key2=value2e'".matches(regularExpression));
assertFalse("http://www.google.com/?queryparam=123'".matches(regularExpression));
assertFalse("http://www.google.com/path?queryparam=12'3".matches(regularExpression));
}
function validateURL(textval) {
var urlregex = new RegExp(
"^(http|https|ftp)\://([a-zA-Z0-9\.\-]+(\:[a-zA-Z0-9\.&%\$\-]+)*#)*((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])|localhost|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-9\-]+\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-zA-Z]{2}))(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\?\'\\\+&%\$#\=~_\-]+))*$");
return urlregex.test(textval);
}
Matches
http://site.com/dir/file.php?var=moo | ftp://user:pass#site.com:21/file/dir
Non-Matches
site.com | http://site.com/dir//
function validateURL(textval) {
var urlregex = new RegExp(
"^(http|https|ftp)\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9\-\._\?\,\'/\\\+&%\$#\=~])*$");
return urlregex.test(textval);
}
Matches
http://www.asdah.com/~joe | ftp://ftp.asdah.co.uk:2828/asdah%20asdah.gif | https://asdah.gov/asdh-ah.as
If you really search for the ultimate match, you probably find it on "A Good Url Regular Expression?".
But a regex that really matches all possible domains and allows anything that is allowed according to RFCs is horribly long and unreadable, trust me ;-)
I hope it's helpful for you...
^(http|https):\/\/+[\www\d]+\.[\w]+(\/[\w\d]+)?
Here is a regex I made which extracts the different parts from an URL:
^((?:(?:http|ftp|ws)s?|sftp):\/\/?)?([^:/\s.#?]+\.[^:/\s#?]+|localhost)(:\d+)?((?:\/\w+)*\/)?([\w\-.]+[^#?\s]+)?([^#]+)?(#[\w-]*)?$
((?:(?:http|ftp|ws)s?|sftp):\/\/?)?(group 1): extracts the protocol
([^:/\s.#?]+\.[^:/\s#?]+|localhost)(group 2): extracts the hostname
(:\d+)?(group 3): extracts the port number
((?:\/\w+)*\/)?([\w\-.]+[^#?\s]+)?(groups 4 & 5): extracts the path part
([^#]+)?(group 6): extracts the query part
(#[\w-]*)?(group 7): extracts the hash part
For every part of the regex listed above, you can remove the ending ? to force it (or add one to make it facultative). You can also remove the ^ at the beginning and $ at the end of the regex so it won't need to match the whole string.
See it on regex101.
Note: this regex is not 100% safe and may accept some strings which are not necessarily valid URLs but it does indeed validate some criterias. Its main goal was to extract the different parts of an URL not to validate it.
I've been working on an in-depth article discussing URI validation using regular expressions. It is based on RFC3986.
Regular Expression URI Validation
Although the article is not yet complete, I have come up with a PHP function which does a pretty good job of validating HTTP and FTP URLs. Here is the current version:
// function url_valid($url) { Rev:20110423_2000
//
// Return associative array of valid URI components, or FALSE if $url is not
// RFC-3986 compliant. If the passed URL begins with: "www." or "ftp.", then
// "http://" or "ftp://" is prepended and the corrected full-url is stored in
// the return array with a key name "url". This value should be used by the caller.
//
// Return value: FALSE if $url is not valid, otherwise array of URI components:
// e.g.
// Given: "http://www.jmrware.com:80/articles?height=10&width=75#fragone"
// Array(
// [scheme] => http
// [authority] => www.jmrware.com:80
// [userinfo] =>
// [host] => www.jmrware.com
// [IP_literal] =>
// [IPV6address] =>
// [ls32] =>
// [IPvFuture] =>
// [IPv4address] =>
// [regname] => www.jmrware.com
// [port] => 80
// [path_abempty] => /articles
// [query] => height=10&width=75
// [fragment] => fragone
// [url] => http://www.jmrware.com:80/articles?height=10&width=75#fragone
// )
function url_valid($url) {
if (strpos($url, 'www.') === 0) $url = 'http://'. $url;
if (strpos($url, 'ftp.') === 0) $url = 'ftp://'. $url;
if (!preg_match('/# Valid absolute URI having a non-empty, valid DNS host.
^
(?P<scheme>[A-Za-z][A-Za-z0-9+\-.]*):\/\/
(?P<authority>
(?:(?P<userinfo>(?:[A-Za-z0-9\-._~!$&\'()*+,;=:]|%[0-9A-Fa-f]{2})*)#)?
(?P<host>
(?P<IP_literal>
\[
(?:
(?P<IPV6address>
(?: (?:[0-9A-Fa-f]{1,4}:){6}
| ::(?:[0-9A-Fa-f]{1,4}:){5}
| (?: [0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){4}
| (?:(?:[0-9A-Fa-f]{1,4}:){0,1}[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){3}
| (?:(?:[0-9A-Fa-f]{1,4}:){0,2}[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){2}
| (?:(?:[0-9A-Fa-f]{1,4}:){0,3}[0-9A-Fa-f]{1,4})?:: [0-9A-Fa-f]{1,4}:
| (?:(?:[0-9A-Fa-f]{1,4}:){0,4}[0-9A-Fa-f]{1,4})?::
)
(?P<ls32>[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}
| (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
)
| (?:(?:[0-9A-Fa-f]{1,4}:){0,5}[0-9A-Fa-f]{1,4})?:: [0-9A-Fa-f]{1,4}
| (?:(?:[0-9A-Fa-f]{1,4}:){0,6}[0-9A-Fa-f]{1,4})?::
)
| (?P<IPvFuture>[Vv][0-9A-Fa-f]+\.[A-Za-z0-9\-._~!$&\'()*+,;=:]+)
)
\]
)
| (?P<IPv4address>(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))
| (?P<regname>(?:[A-Za-z0-9\-._~!$&\'()*+,;=]|%[0-9A-Fa-f]{2})+)
)
(?::(?P<port>[0-9]*))?
)
(?P<path_abempty>(?:\/(?:[A-Za-z0-9\-._~!$&\'()*+,;=:#]|%[0-9A-Fa-f]{2})*)*)
(?:\?(?P<query> (?:[A-Za-z0-9\-._~!$&\'()*+,;=:#\\/?]|%[0-9A-Fa-f]{2})*))?
(?:\#(?P<fragment> (?:[A-Za-z0-9\-._~!$&\'()*+,;=:#\\/?]|%[0-9A-Fa-f]{2})*))?
$
/mx', $url, $m)) return FALSE;
switch ($m['scheme']) {
case 'https':
case 'http':
if ($m['userinfo']) return FALSE; // HTTP scheme does not allow userinfo.
break;
case 'ftps':
case 'ftp':
break;
default:
return FALSE; // Unrecognized URI scheme. Default to FALSE.
}
// Validate host name conforms to DNS "dot-separated-parts".
if ($m['regname']) { // If host regname specified, check for DNS conformance.
if (!preg_match('/# HTTP DNS host name.
^ # Anchor to beginning of string.
(?!.{256}) # Overall host length is less than 256 chars.
(?: # Group dot separated host part alternatives.
[A-Za-z0-9]\. # Either a single alphanum followed by dot
| # or... part has more than one char (63 chars max).
[A-Za-z0-9] # Part first char is alphanum (no dash).
[A-Za-z0-9\-]{0,61} # Internal chars are alphanum plus dash.
[A-Za-z0-9] # Part last char is alphanum (no dash).
\. # Each part followed by literal dot.
)* # Zero or more parts before top level domain.
(?: # Explicitly specify top level domains.
com|edu|gov|int|mil|net|org|biz|
info|name|pro|aero|coop|museum|
asia|cat|jobs|mobi|tel|travel|
[A-Za-z]{2}) # Country codes are exactly two alpha chars.
\.? # Top level domain can end in a dot.
$ # Anchor to end of string.
/ix', $m['host'])) return FALSE;
}
$m['url'] = $url;
for ($i = 0; isset($m[$i]); ++$i) unset($m[$i]);
return $m; // return TRUE == array of useful named $matches plus the valid $url.
}
This function utilizes two regexes; one to match a subset of valid generic URIs (absolute ones having a non-empty host), and a second to validate the DNS "dot-separated-parts" host name. Although this function currently validates only HTTP and FTP schemes, it is structured such that it can be easily extended to handle other schemes.
I use this regex:
((https?:)?//)?(([\d\w]|%[a-fA-f\d]{2,2})+(:([\d\w]|%[a-fA-f\d]{2,2})+)?#)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,63}(:[\d]+)?(/([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-f\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)?
To support both:
http://stackoverflow.com
https://stackoverflow.com
And:
//stackoverflow.com
Here's a ready-to-go Java version from the Android source code. This is the best one I've found.
public static final Matcher WEB = Pattern.compile(new StringBuilder()
.append("((?:(http|https|Http|Https|rtsp|Rtsp):")
.append("\\/\\/(?:(?:[a-zA-Z0-9\\$\\-\\_\\.\\+\\!\\*\\'\\(\\)")
.append("\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,64}(?:\\:(?:[a-zA-Z0-9\\$\\-\\_")
.append("\\.\\+\\!\\*\\'\\(\\)\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,25})?\\#)?)?")
.append("((?:(?:[a-zA-Z0-9][a-zA-Z0-9\\-]{0,64}\\.)+") // named host
.append("(?:") // plus top level domain
.append("(?:aero|arpa|asia|a[cdefgilmnoqrstuwxz])")
.append("|(?:biz|b[abdefghijmnorstvwyz])")
.append("|(?:cat|com|coop|c[acdfghiklmnoruvxyz])")
.append("|d[ejkmoz]")
.append("|(?:edu|e[cegrstu])")
.append("|f[ijkmor]")
.append("|(?:gov|g[abdefghilmnpqrstuwy])")
.append("|h[kmnrtu]")
.append("|(?:info|int|i[delmnoqrst])")
.append("|(?:jobs|j[emop])")
.append("|k[eghimnrwyz]")
.append("|l[abcikrstuvy]")
.append("|(?:mil|mobi|museum|m[acdghklmnopqrstuvwxyz])")
.append("|(?:name|net|n[acefgilopruz])")
.append("|(?:org|om)")
.append("|(?:pro|p[aefghklmnrstwy])")
.append("|qa")
.append("|r[eouw]")
.append("|s[abcdeghijklmnortuvyz]")
.append("|(?:tel|travel|t[cdfghjklmnoprtvwz])")
.append("|u[agkmsyz]")
.append("|v[aceginu]")
.append("|w[fs]")
.append("|y[etu]")
.append("|z[amw]))")
.append("|(?:(?:25[0-5]|2[0-4]") // or ip address
.append("[0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\\.(?:25[0-5]|2[0-4][0-9]")
.append("|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(?:25[0-5]|2[0-4][0-9]|[0-1]")
.append("[0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}")
.append("|[1-9][0-9]|[0-9])))")
.append("(?:\\:\\d{1,5})?)") // plus option port number
.append("(\\/(?:(?:[a-zA-Z0-9\\;\\/\\?\\:\\#\\&\\=\\#\\~") // plus option query params
.append("\\-\\.\\+\\!\\*\\'\\(\\)\\,\\_])|(?:\\%[a-fA-F0-9]{2}))*)?")
.append("(?:\\b|$)").toString()
).matcher("");
For Python, this is the actual URL validating regex used in Django 1.5.1:
import re
regex = re.compile(
r'^(?:http|ftp)s?://' # http:// or https://
r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|' # domain...
r'localhost|' # localhost...
r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|' # ...or ipv4
r'\[?[A-F0-9]*:[A-F0-9:]+\]?)' # ...or ipv6
r'(?::\d+)?' # optional port
r'(?:/?|[/?]\S+)$', re.IGNORECASE)
This does both ipv4 and ipv6 addresses as well as ports and GET parameters.
Found in the code here, Line 44.
This one works for me very well. (https?|ftp)://(www\d?|[a-zA-Z0-9]+)?\.[a-zA-Z0-9-]+(\:|\.)([a-zA-Z0-9.]+|(\d+)?)([/?:].*)?
For convenience here's a one-liner regexp for URL's that will also match localhost where you're more likely to have ports than .com or similar.
(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9#:%._\+~#=]{2,256}(\.[a-z]{2,6}|:[0-9]{3,4})\b([-a-zA-Z0-9#:%_\+.~#?&\/\/=]*)
I found the following Regex for URLs, tested successfully with 500+ URLs:
/\b(?:(?:https?|ftp):\/\/)(?:\S+(?::\S*)?#)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:\/[^\s]*)?\b/gi
I know it looks ugly, but the good thing is that it works. :)
Explanation and demo with 581 random URLs on regex101.
Source: In search of the perfect URL validation regex
To Match a URL there are various option and it depend on you requirement.
below are few.
_(^|[\s.:;?\-\]<\(])(https?://[-\w;/?:#&=+$\|\_.!~*\|'()\[\]%#,☺]+[\w/#](\(\))?)(?=$|[\s',\|\(\).:;?\-\[\]>\)])_i
#\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))#iS
And there is a link which gives you more than 10 different variations of validation for URL.
https://mathiasbynens.be/demo/url-regex
I tried to formulate my version of url. My requirement was to capture instances in a String where possible url can be cse.uom.ac.mu - noting that it is not preceded by http nor www
String regularExpression = "((((ht{2}ps?://)?)((w{3}\\.)?))?)[^.&&[a-zA-Z0-9]][a-zA-Z0-9.-]+[^.&&[a-zA-Z0-9]](\\.[a-zA-Z]{2,3})";
assertTrue("www.google.com".matches(regularExpression));
assertTrue("www.google.co.uk".matches(regularExpression));
assertTrue("http://www.google.com".matches(regularExpression));
assertTrue("http://www.google.co.uk".matches(regularExpression));
assertTrue("https://www.google.com".matches(regularExpression));
assertTrue("https://www.google.co.uk".matches(regularExpression));
assertTrue("google.com".matches(regularExpression));
assertTrue("google.co.uk".matches(regularExpression));
assertTrue("google.mu".matches(regularExpression));
assertTrue("mes.intnet.mu".matches(regularExpression));
assertTrue("cse.uom.ac.mu".matches(regularExpression));
//cannot contain 2 '.' after www
assertFalse("www..dr.google".matches(regularExpression));
//cannot contain 2 '.' just before com
assertFalse("www.dr.google..com".matches(regularExpression));
// to test case where url www must be followed with a '.'
assertFalse("www:google.com".matches(regularExpression));
// to test case where url www must be followed with a '.'
//assertFalse("http://wwwe.google.com".matches(regularExpression));
// to test case where www must be preceded with a '.'
assertFalse("https://www#.google.com".matches(regularExpression));
whats wrong with plain and simple FILTER_VALIDATE_URL ?
$url = "http://www.example.com";
if(!filter_var($url, FILTER_VALIDATE_URL))
{
echo "URL is not valid";
}
else
{
echo "URL is valid";
}
I know its not the question exactly but it did the job for me when I needed to validate urls so thought it might be useful to others who come across this post looking for the same thing

Update a lot of xml files with a new tag

I have around 150 xml files placed in a folder that needs to be updated with a new tag.
Current:
<entry key="mergeTemplates" value="false"/>
<entry key="sysDescriptions"/>
New:
<entry key="mergeTemplates" value="false"/>
<entry key="requestable">
<value>
<Boolean>true</Boolean>
</value>
</entry>
<entry key="sysDescriptions">
I did try java's "replace" method. But wasnt able to accomplish it.
Tried the "sed" command on Unix as well.
Any suggestions on the best way or tool to accomplish this?
In general, you should not attempt to process XML data with line-oriented tools. Use something like xmlstarlet instead:
xmlstarlet ed -i "//entry[#key='sysDescriptions']" -t elem -n "new_entry" \
-i "//new_entry" -t attr -n "key" -v "requestable" \
--subnode "//new_entry" -t elem -n "value" \
--subnode "//new_entry/value" -t elem -n "Boolean" \
--subnode "//new_entry/value/Boolean" -t text -n "dummy" -v "true" \
-r "//new_entry" -v "entry" input.xml
For the sake of readability, I inserted a new element called new_entry, and finally renamed it. Make sure that no such element exists in your input file.
You've tagged it perl, so I'll offer a perl solution. The best advice I can offer generally is to use a parser because XML is a parsable language, and good ones exist. I particularly like XML::Twig for this sort of job (XML::LibXML is pretty good too, but doesn't do inplace editing).
I strongly urge avoiding regular expressions - XML is not well suited to parsing via regex, because it's contextual and regex isn't.
here's a bunch of perfectly valid changes to XML you can make, like unary tags, indenting and line splitting that leave it semantically identical, but break regex messily. Thus a future change that someone makes - that as far as they're concerned is valid/trivial like reformatting the XML - will break 'downstream' because your script doesn't handle it properly. Furthermore - xpath is a lot like regex, but is contextual and thus well suited to XML parsing/processing.
#!/usr/bin/env perl
use warnings;
use strict;
use XML::Twig;
my $twig = XML::Twig -> parse (\*DATA);
my $to_insert = XML::Twig::Elt -> new ( 'entry', {key => "requestable"} );
$to_insert -> insert_new_elt ( 'value' ) -> insert_new_elt('Boolean', "true" );
print "Generated new XML:\n";
$to_insert -> print;
my $insert_this = $to_insert -> cut;
my $insert_after = $twig -> findnodes ('//entry[#key="mergeTemplates"]',0);
$to_insert -> paste ( after => $insert_after );
print "Generated XML:\n";
$twig -> set_pretty_print('indented');
$twig -> print;
__DATA__
<xml>
<entry key="mergeTemplates" value="false"/>
<entry key="sysDescriptions"/>
</xml>
This can be adapted to using XML::Twig's parsefile_inplace method quite handily:
#!/usr/bin/env perl
use warnings;
use strict;
use XML::Twig;
sub insert_merge {
my ( $twig, $insert_after ) = #_;
my $to_insert = XML::Twig::Elt->new( 'entry', { key => "requestable" } );
$to_insert->insert_new_elt('value')->insert_new_elt( 'Boolean', "true" );
$to_insert->paste( after => $insert_after );
$twig -> flush;
}
my $twig =
XML::Twig->new(
twig_handlers => { '//entry[#key="mergeTemplates"]' => \&insert_merge },
pretty_print => 'indented' );
#glob finds files, if you want something more extensive then File::Find::Rule
foreach my $filename ( glob ( "/path/to/dir/*xml" ) ) {
$twig->parsefile_inplace($filename);
}
With sed these things are relatively easy:
You can match an address with a regex:
/^<entry key="mergeTemplates" value="false"\/>$/
See how there is a few characters that needs to be escaped as they would have special meaning. Also uses ^ (start of input) and $ (end of input).
When you have an address you can run a command on in, in this case we want the append command:
/^<entry key="mergeTemplates" value="false"\/>$/a\
<entry key="requestable">\
<value>\
<Boolean>true</Boolean>\
</value>\
</entry>
That's is that's the full sed script. To run it you can save it in a file (insert_xml.sed), and use sed -f:
sed -f insert_xml.sed input_file.xml
Use the -i flag to make inplace edits, it will either be -i (GNU) or -i '' (Free BSD). Using -i.bak (GNU) or -i .bak (Free BSD) will create a backup with the filename plus .bak
And then write a for loop for the files needing the update:
for file in *.xml; do
sed -i.bak -f insert_xml.sed "$file"
done
This is by no means an efficient solution, but it should work just fine for 150 files. If you have SSD it should complete in a blink of an eye.
It assumes you have tags on separate lines and new tag should be inserted after every entry key="mergeTemplates" (if it is not, depending on the case, the code can be slightly modified to use Matcher with chunked read instead of lines or read by two lines to detect second tag).
public void addTextAfterLine(String inputFolder, String prefixLine,
String text) throws IOException {
// iterate over files in input dir
try (DirectoryStream<Path> dirStream = Files
.newDirectoryStream(new File(inputFolder).toPath())) {
for (Path inputPath : dirStream) {
File inputFile = inputPath.toFile();
String inputFileName = inputFile.getName();
if (!inputFileName.endsWith(".xml") || inputFile.isDirectory())
continue;
File outputTmpFile = new File(inputFolder, inputFile.getName()
+ ".tmp");
// read line by line and write to output
try (BufferedReader inputReader = new BufferedReader(
new InputStreamReader(new FileInputStream(inputFile),
StandardCharsets.UTF_8));
BufferedWriter outputWriter = new BufferedWriter(
new OutputStreamWriter(new FileOutputStream(
outputTmpFile), StandardCharsets.UTF_8))) {
String line = inputReader.readLine();
while (line != null) {
outputWriter.write(line);
outputWriter.write('\n');
if (line.equals(prefixLine)) {
// add text after prefix line
outputWriter.write(text);
}
line = inputReader.readLine();
}
}
// delete original file and rename modified to original name
Files.delete(inputPath);
outputTmpFile.renameTo(inputFile);
}
}
}
public static void main(String[] args) throws IOException {
final String inputFolder = "/tmp/xml/input";
final String prefixLine = "<entry key=\"mergeTemplates\" value=\"false\"/>";
final String newText =
"<entry key=\"requestable\">\n"
+ " <value>\n"
+ " <Boolean>true</Boolean>\n"
+ " </value>\n"
+ "</entry>\n"
;
new TagInsertSample()
.addTextAfterLine(inputFolder, prefixLine, newText);
}
You can also use an advanced editor (e.g. Notepad++ on Windows), with find and replace in files command. Just replace the line <entry key="mergeTemplates" value="false"/> with <entry key="mergeTemplates" value="false"/>\n..new entry.
There are many notes here that you should not process XML with text processing tool. This is true if you are developing a generic system or library, to process unknown files. However, just to achieve a task on your files with known format, there is no need for XML complications and text processing fits just fine.
Preempting comments with the question "how do you know it is not going to be a generic system", I'm pretty confident that while developing a generic production system nobody will ask for "java, perl, Unix sed or any other tool".

Getting output from Perl script called from Java code

I've read a lot of questions/examples on this issue, but unfortunately I have not been able to solve my problem. I need to call a Perl script (that I cannot change) from Java code, and then I need to get the output from that script.
The script is used to take student programming homework assignments, and check them for copying by comparing all of them together. The script can take ~45 seconds to run, but only requires the arguments to be properly formatted, there is no interactivity.
My issue is when I call the script from my Java code I get the first line of output from the script but nothing else. I'm using Runtime.exe() and then waitFor() to wait for the script to finish. However the waitFor() function returns before the script actually finishes. I don't know any Perl so I'm not sure if the script is doing something that 'confuses' the java Process object, or if there's an issue in my code.
Process run = Runtime.getRuntime().exec(cmd);
output = new BufferedReader(new InputStreamReader(run.getInputStream()));
run.waitFor();
String temp;
while((temp = output.readLine()) != null){
System.out.println(temp);
}
The Perl script..
use IO::Socket;
#
# As of the date this script was written, the following languages were supported. This script will work with
# languages added later however. Check the moss website for the full list of supported languages.
#
#languages = ("c", "cc", "java", "ml", "pascal", "ada", "lisp", "scheme", "haskell", "fortran", "ascii", "vhdl", "perl", "matlab", "python", "mips", "prolog", "spice", "vb", "csharp", "modula2", "a8086", "javascript", "plsql", "verilog");
$server = 'moss.stanford.edu';
$port = '7690';
$noreq = "Request not sent.";
$usage = "usage: moss [-x] [-l language] [-d] [-b basefile1] ... [-b basefilen] [-m #] [-c \"string\"] file1 file2 file3 ...";
#
# The userid is used to authenticate your queries to the server; don't change it!
#
$userid=[REDACTED];
#
# Process the command line options. This is done in a non-standard
# way to allow multiple -b's.
#
$opt_l = "c"; # default language is c
$opt_m = 10;
$opt_d = 0;
$opt_x = 0;
$opt_c = "";
$opt_n = 250;
$bindex = 0; # this becomes non-zero if we have any base files
while (#ARGV && ($_ = $ARGV[0]) =~ /^-(.)(.*)/) {
($first,$rest) = ($1,$2);
shift(#ARGV);
if ($first eq "d") {
$opt_d = 1;
next;
}
if ($first eq "b") {
if($rest eq '') {
die "No argument for option -b.\n" unless #ARGV;
$rest = shift(#ARGV);
}
$opt_b[$bindex++] = $rest;
next;
}
if ($first eq "l") {
if ($rest eq '') {
die "No argument for option -l.\n" unless #ARGV;
$rest = shift(#ARGV);
}
$opt_l = $rest;
next;
}
if ($first eq "m") {
if($rest eq '') {
die "No argument for option -m.\n" unless #ARGV;
$rest = shift(#ARGV);
}
$opt_m = $rest;
next;
}
if ($first eq "c") {
if($rest eq '') {
die "No argument for option -c.\n" unless #ARGV;
$rest = shift(#ARGV);
}
$opt_c = $rest;
next;
}
if ($first eq "n") {
if($rest eq '') {
die "No argument for option -n.\n" unless #ARGV;
$rest = shift(#ARGV);
}
$opt_n = $rest;
next;
}
if ($first eq "x") {
$opt_x = 1;
next;
}
#
# Override the name of the server. This is used for testing this script.
#
if ($first eq "s") {
$server = shift(#ARGV);
next;
}
#
# Override the port. This is used for testing this script.
#
if ($first eq "p") {
$port = shift(#ARGV);
next;
}
die "Unrecognized option -$first. $usage\n";
}
#
# Check a bunch of things first to ensure that the
# script will be able to run to completion.
#
#
# Make sure all the argument files exist and are readable.
#
print "Checking files . . . \n";
$i = 0;
while($i < $bindex)
{
die "Base file $opt_b[$i] does not exist. $noreq\n" unless -e "$opt_b[$i]";
die "Base file $opt_b[$i] is not readable. $noreq\n" unless -r "$opt_b[$i]";
die "Base file $opt_b is not a text file. $noreq\n" unless -T "$opt_b[$i]";
$i++;
}
foreach $file (#ARGV)
{
die "File $file does not exist. $noreq\n" unless -e "$file";
die "File $file is not readable. $noreq\n" unless -r "$file";
die "File $file is not a text file. $noreq\n" unless -T "$file";
}
if ("#ARGV" eq '') {
die "No files submitted.\n $usage";
}
print "OK\n";
#
# Now the real processing begins.
#
$sock = new IO::Socket::INET (
PeerAddr => $server,
PeerPort => $port,
Proto => 'tcp',
);
die "Could not connect to server $server: $!\n" unless $sock;
$sock->autoflush(1);
sub read_from_server {
$msg = <$sock>;
print $msg;
}
sub upload_file {
local ($file, $id, $lang) = #_;
#
# The stat function does not seem to give correct filesizes on windows, so
# we compute the size here via brute force.
#
open(F,$file);
$size = 0;
while (<F>) {
$size += length($_);
}
close(F);
print "Uploading $file ...";
open(F,$file);
$file =~s/\s/\_/g; # replace blanks in filename with underscores
print $sock "file $id $lang $size $file\n";
while (<F>) {
print $sock $_;
}
close(F);
print "done.\n";
}
print $sock "moss $userid\n"; # authenticate user
print $sock "directory $opt_d\n";
print $sock "X $opt_x\n";
print $sock "maxmatches $opt_m\n";
print $sock "show $opt_n\n";
#
# confirm that we have a supported languages
#
print $sock "language $opt_l\n";
$msg = <$sock>;
chop($msg);
if ($msg eq "no") {
print $sock "end\n";
die "Unrecognized language $opt_l.";
}
# upload any base files
$i = 0;
while($i < $bindex) {
&upload_file($opt_b[$i++],0,$opt_l);
}
$setid = 1;
foreach $file (#ARGV) {
&upload_file($file,$setid++,$opt_l);
}
print $sock "query 0 $opt_c\n";
print "Query submitted. Waiting for the server's response.\n";
&read_from_server();
print $sock "end\n";
close($sock);
Thank you for any input you may have on my problem.
You can use my native Java client for MOSS instead .
Don't mind the version number. I've been using it in production successfully.

Get contents of brackets using regex in a list of values

I'm trying to look for a regex (Coldfusion or Java) that can get me the contents between the brackets for each (param \d+) without fail. I've tried dozens of different types of regexes and the closest one I got is this one:
\(param \d+\) = \[(type='[^']*', class='[^']*', value='(?:[^']|'')*', sqltype='[^']*')\]
Which would be perfect, if the string that I get back from CF escaped single quotes from the value parameter. But it doesn't so it fails miserably. Going the route of a negative lookahead like so:
\[(type='[^']*', class='[^']*', value='(?:(?!', sqltype).)*', sqltype='[^']*')\]
Is great, unless for some unnatured reason there's a piece of code that quite literally has , sqltype in the value. I find it hard to believe I can't simply tell regex to scoop out the contents of every open and closed bracket it finds but then again, I don't know enough regex to know its limits.
Here's an example string of what I'm trying to parse:
(param 1) = [type='IN', class='java.lang.Integer', value='47', sqltype='cf_sql_integer'] , (param 2) = [type='IN', class='java.lang.String', value='asf , O'Reilly, really?', sqltype='cf_sql_varchar'] , (param 3) = [type='IN', class='java.lang.String', value='Th[is]is Ev'ery'thing That , []can break it ', sqltype= ', sqltype='cf_sql_varchar']
For the curious this is a sub-question to Copyable Coldfusion SQL Exception.
EDIT
This is my attempt at implementing #Mena's answer in CF9.1. Sadly it doesn't finish processing the string. I had to replace the \\ with \ just to get it to run at first, but my implementation might still be at fault.
This is the string given (pipes are just to denote boundary):
| (param 1) = [type='IN', class='java.lang.Integer', value='47', sqltype='cf_sql_integer'] , (param 2) = [type='IN', class='java.lang.String', value='asf , O'Reilly], really?', sqltype='cf_sql_varchar'] , (param 3) = [type='IN', class='java.lang.String', value='Th[is]is Ev'ery'thing That , []can break it ', sqltype ', sqltype='cf_sql_varchar'] |
This is my implementation:
<cfset var outerPat = createObject("java","java.util.regex.Pattern").compile(javaCast("string", "\((.+?)\)\s?\=\s?\[(.+?)\](\s?,|$)"))>
<cfset var innerPat = createObject("java","java.util.regex.Pattern").compile(javaCast("string", "(.+?)\s?\=\s?'(.+?)'\s?,\s?"))>
<cfset var outerMatcher = outerPat.matcher(javaCast("string", arguments.params))>
<cfdump var="Start"><br />
<cfloop condition="outerMatcher.find()">
<cfdump var="#outerMatcher.group(1)#"> (<cfdump var="#outerMatcher.group(2)#">)<br />
<cfset var innerMatcher = innerPat.matcher(javaCast("string", outerMatcher.group(2)))>
<cfloop condition="innerMatcher.find()">
<cfoutput>|__</cfoutput><cfdump var="#innerMatcher.group(1)#"> --> <cfdump var="#innerMatcher.group(2)#"><br />
</cfloop>
<br />
</cfloop>
<cfabort>
And this is what printed:
Start
param 1 ( type='IN', class='java.lang.Integer', value='47', sqltype='cf_sql_integer' )
|__ type --> IN
|__ class --> java.lang.Integer
|__ value --> 47
param 2 ( type='IN', class='java.lang.String', value='asf , O'Reilly )
|__ type --> IN
|__ class --> java.lang.String
End
Here's a Java regex pattern that works for your sample input.
(?x)
# lookbehind to check for start of string or previous param
# java lookbehinds must have max length, so limits sqltype
(?<=^|sqltype='cf_sql_[a-z]{1,16}']\ ,\ )
# capture the full string for replacing in the orig sql
# and just the position to verify against the match position
(\(param\ (\d+)\))
\ =\ \[
# type and class wont contain quotes
type='([^']++)'
,\ class='([^']++)'
# match any non-quote, then lazily keep going
,\ value='([^']++.*?)'
# sqltype is always alphanumeric
,\ sqltype='cf_sql_[a-z]+'
\]
# lookahead to check for end of string or next param
(?=$|\ ,\ \(param\ \d+\)\ =\ \[)
(The (?x) flag is for comment mode, which ignores unescaped whitespace and between a hash and end of line.)
And here's that pattern implemented in CFML (tested on CF9,0,1,274733). It uses cfRegex (a library which makes it easier to work with Java regex in CFML) to get the results of that pattern, and then does a couple of checks to make sure the expected number of params are found.
<cfsavecontent variable="Input">
(param 1) = [type='IN', class='java.lang.Integer', value='47', sqltype='cf_sql_integer']
, (param 2) = [type='IN', class='java.lang.String', value='asf , O'Reilly, really?', sqltype='cf_sql_varchar']
, (param 3) = [type='IN', class='java.lang.String', value='Th[is]is Ev'ery'thing That , []can break it ', sqltype= ', sqltype='cf_sql_varchar']
</cfsavecontent>
<cfset Input = trim(Input).replaceall('\n','')>
<cfset cfcatch =
{ params = input
, sql = 'SELECT stuff FROM wherever WHERE (param 3) is last param'
}/>
<cfsavecontent variable="ParamRx">(?x)
# lookbehind to check for start or previous param
# java lookbehinds must have max length, so limits sqltype
(?<=^|sqltype='cf_sql_[a-z]{1,16}']\ ,\ )
# capture the full string for replacing in the orig sql
# and just the position to verify against the match position
(\(param\ (\d+)\))
\ =\ \[
# type and class wont contain quotes
type='([^']++)'
,\ class='([^']++)'
# match any non-quote, then lazily keep going if needed
,\ value='([^']++.*?)'
# sqltype is always alphanumeric
,\ sqltype='cf_sql_[a-z]+'
\]
# lookahead to check for end or next param
(?=$|\ ,\ \(param\ \d+\)\ =\ \[)
</cfsavecontent>
<cfset FoundParams = new Regex(ParamRx).match
( text = cfcatch.params
, returntype = 'full'
)/>
<cfset LastParamPos = cfcatch.sql.lastIndexOf('(param ') + 7 />
<cfset LastParam = ListFirst( Mid(cfcatch.sql,LastParamPos,3) , ')' ) />
<cfif LastParam NEQ ArrayLen(FoundParams) >
<cfset ProblemsDetected = true />
<cfelse>
<cfset ProblemsDetected = false />
<cfloop index="i" from=1 to=#ArrayLen(FoundParams)# >
<cfif i NEQ FoundParams[i].Groups[2] >
<cfset ProblemsDetected = true />
</cfif>
</cfloop>
</cfif>
<cfif ProblemsDetected>
<big>Something went wrong!</big>
<cfelse>
<big>All seems fine</big>
</cfif>
<cfdump var=#FoundParams# />
This will actually work if you embed an entire param inside the value of another param. It fails if you try two (or more), but at least least the checks should detect this failure.
Here's what the dump output should look like:
Hopefully everything here makes sense - let me know if any questions.
I would probably use a dedicated parser for that, but here's an example on how to do it with two Patterns and nested loops:
// the input String
String input = "(param 1) = " +
"[type='IN', class='java.lang.Integer', value='47', sqltype='cf_sql_integer'] , " +
"(param 2) = " +
"[type='IN', class='java.lang.String', value='asf , O'Reilly, really?', " +
"sqltype='cf_sql_varchar'] , " +
"(param 3) = " +
"[type='IN', class='java.lang.String', value='Th[is]is Ev'ery'thing That , " "[]can break it ', sqltype= ', sqltype='cf_sql_varchar']";
// the Pattern defining the round-bracket expression and the following
// square-bracket list. Both values within the brackets are grouped for back-reference
// note that what prevents the 3rd case from breaking is that the closing square bracket
// is expected to be either followed by optional space + comma, or end of input
Pattern outer = Pattern.compile("\\((.+?)\\)\\s?\\=\\s?\\[(.+?)\\](\\s?,|$)");
// the Pattern defining the key-value pairs within the square-bracket groups
// note that both key and value are grouped for back-reference
Pattern inner = Pattern.compile("(.+?)\\s?\\=\\s?'(.+?)'\\s?,\\s?");
Matcher outerMatcher = outer.matcher(input);
// iterating over the outer Pattern (type x) = [myKey = myValue, ad lib.], or end of input
while (outerMatcher.find()) {
System.out.println(outerMatcher.group(1));
Matcher innerMatcher = inner.matcher(outerMatcher.group(2));
// iterating over the inner Pattern myKey = myValue
while (innerMatcher.find()) {
System.out.println("\t" + innerMatcher.group(1) + " --> " + innerMatcher.group(2));
}
}
Output:
param 1
type --> IN
class --> java.lang.Integer
value --> 47
param 2
type --> IN
class --> java.lang.String
value --> asf , O'Reilly, really?
param 3
type --> IN
class --> java.lang.String
value --> Th[is]is Ev'ery'thing That , []can break it

How can I access blocks of text as an attribute that are matched using a greedy=false option in ANTLR?

I have a rule in my ANTLR grammar like this:
COMMENT : '/*' (options {greedy=false;} : . )* '*/' ;
This rule simply matches c-style comments, so it will accept any pair of /* and */ with any arbitrary text lying in between, and it works fine.
What I want to do now is capture all the text between the /* and the */ when the rule matches, to make it accessible to an action. Something like this:
COMMENT : '/*' e=((options {greedy=false;} : . )*) '*/' {System.out.println("got: " + $e.text);
This approach doesn't work, during parsing it gives "no viable alternative" upon reaching the first character after the "/*"
I'm not really clear on if/how this can be done - any suggestions or guidance welcome, thanks.
Note that you can simply do:
getText().substring(2, getText().length()-2)
on the COMMENT token since the first and the last 2 characters will always be /* and */.
You could also remove the options {greedy=false;} : since both .* and .+ are ungreedy (although without the . they are greedy) (i).
EDIT
Or use setText(...) on the Comment token to discard the /* and */ immediately. A little demo:
file T.g:
grammar T;
#parser::members {
public static void main(String[] args) throws Exception {
ANTLRStringStream in = new ANTLRStringStream(
"/* abc */ \n" +
" \n" +
"/* \n" +
" DEF \n" +
"*/ "
);
TLexer lexer = new TLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
TParser parser = new TParser(tokens);
parser.parse();
}
}
parse
: ( Comment {System.out.printf("parsed :: >\%s<\%n", $Comment.getText());} )+ EOF
;
Comment
: '/*' .* '*/' {setText(getText().substring(2, getText().length()-2));}
;
Space
: (' ' | '\t' | '\r' | '\n') {skip();}
;
Then generate a parser & lexer, compile all .java files and run the parser containing the main method:
java -cp antlr-3.2.jar org.antlr.Tool T.g
javac -cp antlr-3.2.jar *.java
java -cp .:antlr-3.2.jar TParser
(or `java -cp .;antlr-3.2.jar TParser` on Windows)
which will produce the following output:
parsed :: > abc <
parsed :: >
DEF
<
(i) The Definitive ANTLR Reference, Chapter 4, Extended BNF Subrules, page 86.
Try this:
COMMENT :
'/*' {StringBuilder comment = new StringBuilder();} ( options {greedy=false;} : c=. {comment.appendCodePoint(c);} )* '*/' {System.out.println(comment.toString());};
Another way which will actually return the StringBuilder object so you can use it in your program:
COMMENT returns [StringBuilder comment]:
'/*' {comment = new StringBuilder();} ( options {greedy=false;} : c=. {comment.append((char)c);} )* '*/';

Categories