Regex to extract valid Http or Https - java

I'm currently having some issues with a regex to extract a URL.
I want my regex to take URLS such as:
http://stackoverflow.com/questions/ask
https://stackoverflow.com
http://local:1000
https://local:1000
Through some tutorials, I've learned that this regex will find all of the above: ^(http|https)\://.*$ However, it will also take http://local:1000;http://invalid http://khttp:// as a single string when it shouldn't take it at all.
I understand that my expression isn't written to exclude this, but my issue is I cannot think of how to write it so it checks for this scenario.
Any help is greatly appreciated!
Edit:
Looking at my issue, it seems that I could eliminate it as long as I can implement a check to make sure '//' doesn't occur in my string after the initial http:// or https://. Any ideas on how to implement this?
Sorry, this will be done with Java.
I also need to add the following constraint: a string such as http://local:80/test:90 should fail because of the duplicate port. In other words, I need a constraint that allows only two : symbols in total in a valid string (one after http/https and one before the port).

This will only produce a match if there is no :// after its first appearance in the string.
^https?:\/\/(?!.*:\/\/)\S+
Note that trying to parse a valid URL from within a string is very complex (see In search of the perfect URL validation regex), so the above does not attempt to do that.
It will just match the protocol and following non-space characters.
In Java
Pattern reg = Pattern.compile("^https?:\\/\\/(?!.*:\\/\\/)\\S+");
Matcher m = reg.matcher("http://somesite.com");
if (m.find()) {
    System.out.println(m.group());
} else {
    System.out.println("No match");
}
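The extra constraint from the question's edit (no more than one colon after the scheme) could be sketched by adding a second negative lookahead to the pattern above. This is only an illustration: the class name UrlFilter and the method accepts are made up here, and the colon rule is read from the question, not from any URL standard (it would, for example, reject IPv6 literals).

```java
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative and not part of the original answer.
class UrlFilter {
    // ^https?://          the scheme
    // (?!.*://)           no further "://" anywhere after the scheme
    // (?![^:]*:[^:]*:)    fewer than two ':' after the scheme (at most one, for the port)
    // \S+$                the rest of the string, without whitespace
    private static final Pattern URL =
            Pattern.compile("^https?://(?!.*://)(?![^:]*:[^:]*:)\\S+$");

    static boolean accepts(String candidate) {
        return URL.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://stackoverflow.com/questions/ask",
            "https://local:1000",
            "http://local:1000;http://invalid http://khttp://",
            "http://local:80/test:90"
        };
        for (String s : samples) {
            System.out.println(accepts(s) + "  " + s);
        }
    }
}
```

The two lookaheads are kept separate on purpose: a string such as http://a://b contains only one colon after the scheme, so the colon count alone would not reject it.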

Check your programming language to see if it already has a parser. E.g. php has parse_url()
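For Java, the standard library's java.net.URI can play that role. A minimal sketch, assuming that "valid" here means an http or https scheme plus a host the parser can extract; the class name UrlCheck and the method isHttpUrl are hypothetical:

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper; names are illustrative only.
class UrlCheck {
    static boolean isHttpUrl(String candidate) {
        try {
            URI uri = new URI(candidate);   // throws on illegal characters such as spaces
            String scheme = uri.getScheme();
            if (scheme == null) {
                return false;
            }
            boolean http = scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https");
            // getHost() is null when the authority cannot be parsed as host[:port]
            return http && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isHttpUrl("http://stackoverflow.com/questions/ask"));
        System.out.println(isHttpUrl("http://local:1000;http://invalid http://khttp://"));
    }
}
```

Note that a parser follows the URI grammar, so it accepts http://local:80/test:90 (a colon is legal in a path); the question's extra colon rule would still need its own check.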

From http://net.tutsplus.com/tutorials/other/8-regular-expressions-you-should-know/
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
This may change based on the programming language/tool

/[A-Za-z]+://[A-Za-z0-9-]+.[A-Za-z0-9-:%&;?#/.=]+/g

Related

Simple Java regular expression matching fails

Before y'all jump on me for posting something similar to previous questions asked, yes, there seem to be a number of regex related questions but nothing which seems to help me, or at least that I can see.
I am trying to parse strings in JAVA using PATTERN and MATCHER and am really having no joy. My regular expression seems to match my input string when I use a few of the online regular expression testing websites but Java simply does not match my expression.
My input string is:
"Big apple" title="Little Apple" type="Container" url="http://malcolm.com/testing"
The regular expression I am using to match is ".*" title="(.*)" type="Container" url="(.*)"
Essentially I want to pull out the text within the second and the fourth set of quotes. There will always be 4 sets of quotes with text within and around.
I am coding as follows:
Variable XMLSubstring contains the string above (including the quotes) and is as stated, even when I print it out.
Pattern p = Pattern.compile(".* title=\"(.*)\" type=\"Container\" url=\"(.*)\"");
Matcher m = p.matcher(XMLSubstring);
It doesn't appear to be rocket science I'm attempting but I'm pulling my hair out trying to debug the bloody thing.
Is there something wrong with my regex pattern?
Is there something wrong with the code I am using?
Am I simply a moron and should stop coding with immediate effect?
EDIT & UPDATE: I have found the problem. My string had a space at the end of it which was breaking the parser! How silly, and I think based on that, I need to accept the third suggestion of mine and give up programming. Thanks all for your assistance.
Try this,
String str="\"Big apple\" title=\"Little Apple\" type=\"Container\" url=\"http://malcolm.com/testing\"";
Pattern p=Pattern.compile(".* title=\\\".*\\\" type=\\\"Container\\\" url=\\\".*\\\"");
Matcher m=p.matcher(str);
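Since the root cause turned out to be a trailing space, one option is to call find() rather than matches(), so that text before and after the attributes is ignored (the question does not show which of the two was used, so that part is a guess). A sketch assuming the attribute order from the question; [^"]* replaces .* so that each group stops at its own closing quote. The class name AttrExtract is made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class AttrExtract {
    private static final Pattern ATTRS = Pattern.compile(
            "title=\"([^\"]*)\"\\s+type=\"Container\"\\s+url=\"([^\"]*)\"");

    // Returns {title, url}, or null when the attributes are not present.
    static String[] extract(String input) {
        Matcher m = ATTRS.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        // Note the trailing space that broke the original attempt.
        String input = "\"Big apple\" title=\"Little Apple\" type=\"Container\" "
                + "url=\"http://malcolm.com/testing\" ";
        String[] result = extract(input);
        System.out.println(result[0]);
        System.out.println(result[1]);
    }
}
```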

Need to write a java regex expression which matches urls with http or https but does not contains specific file extensions

I need to create a java regex expression which matches URLS with http or https but should not match urls with specific file extensions.
I can get the URLs with http or https using the following expression, but I am unable to complete the second part, which is eliminating URLs with certain extensions (e.g. js, css, jpg, etc.).
I guess I need to use negations but I am not sure how to do it.
String regex = "\\s*(?i)(http|https)\\s*://\\s*((\"[^\"]*\"|'[^']*'|([^'\">\\s]+)))";
Please help me to modify this regex to meet this requirement.
An easy way to implement this in Java is to use the Pattern class (from java.util.regex). To accomplish what you're suggesting, you could use two separate regex objects to check the conditions for the URL. For example (using the string regex from your question):
Scanner in = new Scanner(System.in);
String input = in.nextLine();
Pattern one = Pattern.compile(regex);
Pattern two = Pattern.compile("([^\\s]+(\\.(?i)(js|css|jpg|etc))$)");
if (one.matcher(input).matches() && !two.matcher(input).matches())
    System.out.println("It matches!");
else
    System.out.println("Nope!");
In short, using two Pattern objects makes your code more readable and easy to manage, since you're considering multiple aspects about an input string of a URL.
You need an anchor to look behind - see regex to match url that should give you the expression you need. The regex you have currently will match malformed urls with disallowed characters.
Here's a good site to check your expressions: http://www.regexplanet.com/advanced/java/index.html
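The two checks can also be folded into a single pattern with a negative lookahead. A sketch, assuming the URL contains no whitespace and that the extension is whatever follows the last dot at the very end of the string (so a query string such as ?v=1 after .js would slip through). The extension list and the class name ExtensionFilter are placeholders, not part of the question.

```java
import java.util.regex.Pattern;

// Hypothetical helper; names and the extension list are illustrative only.
class ExtensionFilter {
    // (?!\S*\.(?:js|css|jpg)$) rejects the URL when it ends in one of the listed extensions.
    private static final Pattern ALLOWED = Pattern.compile(
            "^https?://(?!\\S*\\.(?:js|css|jpg)$)\\S+$",
            Pattern.CASE_INSENSITIVE);

    static boolean isAllowed(String url) {
        return ALLOWED.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("http://example.com/index.html"));
        System.out.println(isAllowed("https://example.com/app.js"));
    }
}
```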

Android: Matcher.find() never returns

First of all, here is a chunk of affected code:
// (somewhere above, data is initialized as a String with a value)
Pattern detailsPattern = Pattern.compile("**this is a valid regex, omitted due to length**", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
Matcher detailsMatcher = detailsPattern.matcher(data);
Log.i("Scraper", "Initialized pattern and matcher, data length "+data.length());
boolean found = detailsMatcher.find();
Log.i("Scraper", "Found? "+((found)?"yep":"nope"));
I omitted the regex inside Pattern.compile because it's very long, but I know it works with the given data set; or if it doesn't, it shouldn't break anything anyway.
The trouble is, I do get the feedback I/Scraper(23773): Initialized pattern and matcher, data length 18861 but I never see the "Found?" line, it is just stuck on the find() call.
Is this a known Android bug? I've tried it over and over and just can't get it to work. Somehow, I think something over the past few days broke this because my app was working fine before, and I have in the past couple days received several comments of the app not working so it is clearly affecting other users as well.
How can I further debug this?
Some regexes can take a very, very long time to evaluate. In particular, regexes that have lots of quantifiers can cause the regex engine to do a huge amount of backtracking to explore all of the possible ways that the input string might match. And if it is going to fail, it has to explore all of those possibilities.
(Here is an example:
regex = "a*a*a*a*a*a*b"; // 6 quantifiers
input = "aaaaaaaaaaaaaaaaaaaa"; // 20 characters
A typical regex engine will do in the region of 20^6 character comparisons before deciding that the input string does not match.)
If you showed us the regex and the string you are trying to match, we could give a better diagnosis, and possibly offer some alternatives. But if you are trying to extract information from HTML, then the best solution is to not use regexes at all. There are HTML parsers that are specifically designed to deal with real-world HTML.
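To make the backtracking example above concrete: the six adjacent a* quantifiers accept exactly the same strings as a single a*, so the pattern can be collapsed without changing what it matches, and the collapsed form leaves the engine far fewer ways to split the input. A small sketch (the class and method names are illustrative only; it checks that the two patterns agree, it does not measure time):

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class BacktrackDemo {
    static final Pattern SLOW = Pattern.compile("a*a*a*a*a*a*b"); // 6 quantifiers
    static final Pattern FAST = Pattern.compile("a*b");           // same language, 1 quantifier

    static boolean found(Pattern p, String input) {
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        String noMatch = "aaaaaaaaaaaaaaaaaaaa"; // 20 characters, no 'b'
        System.out.println(found(SLOW, noMatch) + " " + found(FAST, noMatch));
        System.out.println(found(SLOW, "aaab") + " " + found(FAST, "aaab"));
    }
}
```

The same idea applies to a long scraping regex: wherever two neighbouring quantified parts can match the same characters, merging or tightening them reduces the work done on a failed match.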
How long is the string you are trying to parse?
How long and how complicated is the regex you are trying to match?
Have you tried to break your regex down into simpler bits? Adding the bits back one after another will let you see when it breaks and maybe why.
Make a regex such as [a-zA-Z]* and pass it as the argument to compile(). This example allows only lowercase and uppercase letters.
Read my blogpost on android validation for more info.
I had the same issue and I solved it by replacing every wildcard . with [\s\S]. I really don't know why it worked for me, but it did. I come from the JavaScript world, and I know that there that expression is faster to evaluate.

Java - Regex problem

I have a list of URLs of type
http://www.example.com/pk/ca,
http://www.example.com/pk,
http://www.example.com/anthingcangoeshere/pk, and
http://www.example.com/pkisnotnecessaryhere.
Now, I want to find only those URLs that end with /pk or /pk/ and don't have anything in between .com and /pk.
Your problem isn't fully defined so I can't give you an exact answer but this should be a start you can use:
^[^:]+://[^/]+\.com/pk/?$
These strings will match:
http://www.example.com/pk
http://www.example.com/pk/
https://www.example.com/pk
These strings won't match:
http://www.example.co.uk/pk
http://www.example.com/pk/ca
http://www.example.com/anthingcangoeshere/pk
http://www.example.com/pkisnotnecessaryhere
String pattern = "^http://www\\.example\\.com/pk/?$";
Hope this helps.
Some details: if you don't add ^ to the beginning of the pattern, then foobarhttp://www.example.com/pk/ will be accepted too. If you don't add $ to the end of the pattern, then http://www.example.com/pk/foobar will be accepted too.
Directly translating your request "[...] URLs that ends with /pk or /pk/ and don't have anything in between .com and /pk", with the additional assumption that there shall always be a ".com", yields this regex:
If you use find():
\.com/pk/?$
If you use matches():
.*\.com/pk/?
Other answers given here give more restrictive patterns, allowing only URLs that are closer to your examples. In particular, my pattern does not validate that the given string is a syntactically valid URL.
String pattern = "^https?://(www\\.)?.+\\.com/pk/?$";
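A quick way to compare the patterns above is to run one of them against the sample URLs from the question. The sketch below uses the ^[^:]+://[^/]+\.com/pk/?$ pattern from the first answer; the class name PkCheck and the method endsWithPk are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class PkCheck {
    // [^/]+ keeps the host part from crossing a '/', so nothing can sit between .com and /pk.
    private static final Pattern PK = Pattern.compile("^[^:]+://[^/]+\\.com/pk/?$");

    static boolean endsWithPk(String url) {
        return PK.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.example.com/pk",
            "http://www.example.com/pk/",
            "http://www.example.com/pk/ca",
            "http://www.example.com/anthingcangoeshere/pk",
            "http://www.example.com/pkisnotnecessaryhere"
        };
        for (String s : samples) {
            System.out.println(endsWithPk(s) + "  " + s);
        }
    }
}
```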

How to change this regex to properly extract tag attributes - should be simple

I need to "grab" an attribute of a custom HTML tag. I know this sort of question has been asked many times before, but regex really messes with my head, and I can't seem to get it working.
A sample of XML that I need to work with is
<!-- <editable name="nameValue"> --> - content goes here - <!-- </editable> -->
I want to be able to grab the value of the name attribute, which in this case is nameValue. What I have is shown below but this returns a null value.
My regex string (for a Java app, hence the \ to escape the ") is:
"(.)?<!-- <editable name=(\".*\")?> -->.*<!-- </editable> -->(.)?"
I am trying to grab the attribute together with its quotation marks, since I figure this is the easiest and most general pattern to match. It just doesn't work; any help will help me keep my hair.
I don't think you need the (.)?s at the beginning and end of your regex. And you need to put in a capturing group for getting only the content-goes-here bit:
This worked for me:
String xml = "RANDOM STUFF<!-- <editable name=\"nameValue\"> --> - content goes here - <!-- </editable> -->RANDOM STUFF";
Pattern p = Pattern.compile("<!-- <editable name=(\".*\")?> -->(.*)<!-- </editable> -->");
Matcher m = p.matcher(xml);
if (m.find()) {
    System.out.println(m.group(2));
} else {
    System.out.println("no match found");
}
This prints:
- content goes here -
Your search is greedy. Use "\<\!-- \<editable name=\"(.*?)\"\> --\>.*?\<\!-- \<\/editable\> --\>" (added ?). Please note that this one will not work correctly with nested <editable> elements.
If you don't want to perform syntax checking, you could also simply go with: "\<\!-- \<editable name=\"(.*?)\"\> --\>" or even "\<editable name=\"(.*?)\"\>" for better simplicity and performance.
Edit: should be
Pattern re = Pattern.compile( "\\<editable name=\"(.*?)\"\\>" );
I use JavaScript, but it should help to make the expression non-greedy where possible and to use negated character classes instead of any-character matches. I'm not sure how similar regexes are in Java, but instead of using the expression \".*\" try using \"[^\"]*\". That will match only characters within the attribute value that aren't a quote, meaning the expression can't match beyond the attribute value.
Hope that helps
Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
You may find the answer using TagSoup helpful.
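If a regex is still preferred for this narrow case, the name attribute itself (rather than the content between the comments) can be captured with the negated character class suggested above. A sketch that assumes the exact comment layout from the question; the class name EditableName and the method nameOf are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class EditableName {
    // [^"]* cannot run past the closing quote, so no non-greedy quantifier is needed.
    private static final Pattern NAME =
            Pattern.compile("<!-- <editable name=\"([^\"]*)\"> -->");

    // Returns the value of the name attribute, or null when no opening marker is found.
    static String nameOf(String xml) {
        Matcher m = NAME.matcher(xml);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String xml = "RANDOM STUFF<!-- <editable name=\"nameValue\"> --> - content goes here - "
                + "<!-- </editable> -->RANDOM STUFF";
        System.out.println(nameOf(xml));
    }
}
```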
