Java Regexp to match domain of url - java

I would like to use Java regex to match a domain of a url, for example,
for www.table.google.com, I would like to get 'google' out of the url, namely, the second last word in this URL string.
Any help will be appreciated !!!

It really depends on the complexity of your inputs...
Here is a pretty simple regex:
.+\\.(.+)\\..+
It captures whatever sits between the last two dots (\\.).
And here are some examples for that pattern: https://regex101.com/r/L52oz6/1.
As you can see, it works for simple inputs but not for complex urls.
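For reference, a minimal sketch of how that pattern could be used from Java. The class and method names are made up for illustration. Note that the pattern needs at least three dot-separated parts, so a bare google.com yields no match here.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SecondToLastLabel {
    // The greedy leading ".+" swallows as much as it can, so the capture
    // group ends up holding the text between the last two dots.
    private static final Pattern LABEL = Pattern.compile(".+\\.(.+)\\..+");

    // Hypothetical helper: returns the second-to-last label, if the input has one.
    static Optional<String> secondToLast(String host) {
        Matcher m = LABEL.matcher(host);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(secondToLast("www.table.google.com")); // Optional[google]
        System.out.println(secondToLast("google.com"));           // Optional.empty
    }
}
```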
But why reinvent the wheel? There are plenty of really good libraries that correctly parse any complex url. Still, for simple inputs a small regex is easily built. If this does not solve the problem for your inputs, please let me know and I will adjust the regex pattern.
Note that you can also just use simple splitting like:
String[] elements = input.split("\\.");
String secondToLastElement = elements[elements.length - 2];
But don't forget the index-bound checking.
Or, if you are looking for a very quick solution, walk through the input starting from the last position. Work your way backwards until you find the first dot, then continue until you find the second dot. Then extract that part with input.substring(index1, index2);.
There is also already a delegate method for exactly that purpose, namely String#lastIndexOf (see the documentation).
Take a look at this code snippet:
String input = ...
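int indexLastDot = input.lastIndexOf('.');
// Search backwards from the character before the last dot; otherwise the same dot is found again.
int indexSecondToLastDot = input.lastIndexOf('.', indexLastDot - 1);
String secondToLastWord = input.substring(indexSecondToLastDot + 1, indexLastDot);
Note that the second search starts at indexLastDot - 1 and that substring takes the smaller index first. Also don't forget bound checking: lastIndexOf returns -1 when no dot is found.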
The advantage of this approach is that it is really fast, it can directly work on the internal structures of Strings without creating copies.
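A possible version of the lastIndexOf approach with the bound checks included. This is only a sketch; the class and method names are my own, and returning an empty Optional for inputs with fewer than two dots is an assumption about what you want in that case.

```java
import java.util.Optional;

class DomainWord {
    // Hypothetical helper: text between the last two dots, or empty if there are fewer than two dots.
    static Optional<String> secondToLastWord(String input) {
        int lastDot = input.lastIndexOf('.');
        if (lastDot <= 0) {
            return Optional.empty(); // no dot at all, or nothing in front of the only dot
        }
        // Start one position before the last dot so the same dot is not found twice.
        int secondToLastDot = input.lastIndexOf('.', lastDot - 1);
        if (secondToLastDot < 0) {
            return Optional.empty(); // only one dot in the input
        }
        return Optional.of(input.substring(secondToLastDot + 1, lastDot));
    }

    public static void main(String[] args) {
        System.out.println(secondToLastWord("www.table.google.com")); // Optional[google]
        System.out.println(secondToLastWord("google.com"));           // Optional.empty
    }
}
```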

My attempt:
(?<scheme>https?:\/\/)?(?<subdomain>\S*?)(?<domainword>[^.\s]+)(?<tld>\.[a-z]+|\.[a-z]{2,3}\.[a-z]{2,3})(?=\/|$)
Demo. Works correctly for:
http://www.foo.stackoverflow.com
http://www.stackoverflow.com
http://www.stackoverflow.com/
http://stackoverflow.com
https://www.stackoverflow.com
www.stackoverflow.com
stackoverflow.com
http://www.stackoverflow.com
http://www.stackoverflow.co.uk
foo.www.stackoverflow.com
foo.www.stackoverflow.co.uk
foo.www.stackoverflow.co.uk/a/b/c
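In Java the named groups can be read back with Matcher#group(String). The following is a sketch with made-up class and method names, using the same pattern with the backslashes doubled for a Java string literal. One limitation I noticed while tracing it by hand: because the second tld alternative accepts any two short labels, a host like www.abc.com is read as domain word "www" with tld ".abc.com", so short domain names are misclassified.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDemo {
    // Same pattern as above, escaped for a Java string literal.
    private static final Pattern URL = Pattern.compile(
        "(?<scheme>https?:\\/\\/)?(?<subdomain>\\S*?)(?<domainword>[^.\\s]+)"
        + "(?<tld>\\.[a-z]+|\\.[a-z]{2,3}\\.[a-z]{2,3})(?=\\/|$)");

    // Hypothetical helper: returns the "domainword" group, or null if nothing matched.
    static String domainWord(String url) {
        Matcher m = URL.matcher(url);
        return m.find() ? m.group("domainword") : null;
    }

    public static void main(String[] args) {
        System.out.println(domainWord("http://www.foo.stackoverflow.com"));  // stackoverflow
        System.out.println(domainWord("foo.www.stackoverflow.co.uk/a/b/c")); // stackoverflow
        System.out.println(domainWord("www.table.google.com"));              // google
        System.out.println(domainWord("www.abc.com"));                       // www (the limitation noted above)
    }
}
```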

private static final Pattern URL_MATCH_GET_SECOND_AND_LAST =
Pattern.compile("www\\.(.*)\\.google\\.(.*)", Pattern.CASE_INSENSITIVE);
String sURL = "www.table.google.com";
// Create the matcher once; a second matcher with its own find() only repeats the work.
Matcher matchURL = URL_MATCH_GET_SECOND_AND_LAST.matcher(sURL);
if (matchURL.find()) {
String sFirst = matchURL.group(1); // "table"
String sSecond = matchURL.group(2); // "com"
}

Related

Regex pattern to split colon char with a condition

I have a string like this :
http://schemas/identity/claims/usertype:External
My goal is to split that string into two parts at the colon delimiter, but the regex must not split on the colon in "http://". The string should be split into:
http://schemas/identity/claims/usertype
External
I have tried regex like this :
(http:\/\/+schemas\/identity\/claims\/usertype)
So it will be :
http://schemas/identity/claims/usertype
:External
After that I would replace the remaining colon with an empty string.
But I don't think this is best practice, since I rarely use regex.
Do you have any suggestion to simplify the regex?
Thanks in advance
This is an X/Y problem. Fortunately, you asked the question in a great way, by explaining the underlying problem you are trying to solve (namely: Pull some string out of a URL), and then describing the direction you've chosen to solve your problem (which is bad, see below), and then asking about a problem you have with this solution (which is irrelevant, as the entire solution is bad).
URLs aren't parsable like this. You shouldn't treat them as a string you can lop into pieces like this. For example, the server part can contain colons too: For port number. In front of the server part, there can be an authentication which can also contain a colon. It's rarely used, of course.
Try this one, which shows the problem with your approach:
https://joe:joe@google.com:443/
That link just works. Port 443 is the default anyway, and Google ignores the authentication header that ends up being sent, but the point is, a URL may contain this stuff.
But rzwitserloot, it.. won't! I know!
That's bad programming mindset. That mindset leads to security issues. Why go for a solution that burdens your codebase with unstated assumptions (assumption: The places that provide a URL to this code are under my control and will never send port or auth headers)? If the 'server' part is configurable in a config file, will you mention in said config file that you cannot add a port? Will you remember 4 years from now?
The solution that does it right isn't going to burden your code with all these unstated (or very unwieldy if stated) assumptions.
Okay, so what is the right way?
First, toss that string into the constructor of java.net.URI. Then, use the methods there to get what you actually want, which is the path part. That is a string you can pull apart:
URI uri = new URI("http://schemas/identity/claims/usertype:External");
String path = uri.getPath();
String newPath = path.replaceAll(":.*", "");
String type = path.replaceAll(".*?:", "");
URI newUri = uri.resolve(newPath);
System.out.println(newUri);
System.out.println(type);
prints:
http://schemas/identity/claims/usertype
External
NB: Toss some ports or auth stuff in there, or make it a relative URL - do whatever you like, this code is far more robust in the face of changing the base URL than any attempt to count colons is going to be.
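To illustrate the robustness claim, here is a small sketch that runs the same idea against a URL with auth and a port. The class and method names are hypothetical, the test URL is made up, and splitting at the last colon of the path is my assumption about the desired behaviour.

```java
import java.net.URI;

class ClaimSplitter {
    // Hypothetical helper: splits the path part at its last colon.
    // Returns {pathBeforeColon, valueAfterColon}; the value is "" if the path has no colon.
    static String[] splitPathAtLastColon(String url) {
        // URI.create throws IllegalArgumentException for a malformed URI.
        String path = URI.create(url).getPath();
        if (path == null) {
            return new String[] {"", ""}; // opaque URIs such as mailto: have no path
        }
        int colon = path.lastIndexOf(':');
        if (colon < 0) {
            return new String[] {path, ""};
        }
        return new String[] {path.substring(0, colon), path.substring(colon + 1)};
    }

    public static void main(String[] args) {
        // The colons in the auth and port parts are not touched, only the one in the path.
        String[] parts = splitPathAtLastColon("https://joe:joe@google.com:443/claims/usertype:External");
        System.out.println(parts[0]); // /claims/usertype
        System.out.println(parts[1]); // External
    }
}
```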
Use Negative Lookbehind and split
Regex:
"(?<!(http|https)):"
Regex in context:
public static void main(String[] args) {
String input = "http://schemas/identity/claims/usertype:External";
validateURI(input);
List<String> result = Arrays.asList(input.split("(?<!(http|https)):"));
result.forEach(System.out::println);
}
private static void validateURI(String input) {
try {
new URI(input);
} catch (URISyntaxException e) {
System.out.println("Invalid URI!!!");
e.printStackTrace();
}
}
Output:
http://schemas/identity/claims/usertype
External
I think this might help you:
public class Separator{
public static void main(String[] args) {
String input = "http://schemas/identity/claims/usertype:External";
String[] splitted = input.split("\\:");
System.out.println(splitted[splitted.length-1]);
}
}
Output
External

find the path param using regex in the url

what is the regular expression to find the path param from the url?
http://localhost:8080/domain/v1/809pA8
https://localhost:8080/domain/v1/809pA8
Want to retrieve the value(809pA8) from the above URL using regular expression, java is preferable.
I would suggest you do something like
url.substring(url.lastIndexOf('/') + 1);
If you really prefer regexps, you could do
Matcher m = Pattern.compile("/([^/]+)$").matcher(url);
if (m.find())
value = m.group(1);
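If the URL might carry a query string or fragment, both approaches above would include it in the result. A sketch that goes through java.net.URI first avoids that; the class name, method name, and the extra query string in the example are my own.

```java
import java.net.URI;

class LastSegment {
    // Hypothetical helper: last segment of the path, ignoring query string and fragment.
    static String lastPathSegment(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return ""; // opaque URI, no path to inspect
        }
        // lastIndexOf returns -1 when there is no slash, so this then returns the whole path.
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        System.out.println(lastPathSegment("http://localhost:8080/domain/v1/809pA8"));            // 809pA8
        System.out.println(lastPathSegment("https://localhost:8080/domain/v1/809pA8?debug=true")); // 809pA8
    }
}
```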
I would try:
String url = "http://localhost:8080/domain/v1/809pA8";
String value = String.valueOf(url.subSequence(url.lastIndexOf('/'), url.length()-1));
No need for regex here, I think.
EDIT: I'm sorry I made a mistake:
String url = "http://localhost:8080/domain/v1/809pA8";
String value = String.valueOf(url.subSequence(url.lastIndexOf('/')+1, url.length()));
See this code working here: https://ideone.com/E30ddC
For your simple case, regex is an overkill, as others noted. But, if you have more cases and this is why you prefer regex, give Spring's AntPathMatcher#extractUriTemplateVariables a look, if you're using Spring. It's actually better equipped for extracting path variables than regex directly. Here are some good examples.

How to replace double slash with single slash for an url

Given a url like "http://google.com//view/All/builds", I want to replace the double slash with a single slash. For example, the above url should display as "http://google.com/view/All/builds".
I don't know regular expressions. Can anyone help me achieve this using regular expressions?
To avoid replacing the first // in http:// use the following regex :
String to = from.replaceAll("(?<!http:)//", "/");
PS: if you want to handle https use (?<!(http:|https:))// instead.
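A runnable sketch of the lookbehind idea, with made-up class and method names. It differs from the regex above in one detail: I wrote //+ instead of // so that a run of three or more slashes also collapses to one. It still only knows about http: and https:.

```java
class SlashCollapser {
    // Negative lookbehind: skip a slash run that directly follows "http:" or "https:".
    // "//+" (two or more slashes) is a variation on the "//" used in the answer above.
    private static final String DOUBLE_SLASH = "(?<!(http:|https:))//+";

    // Hypothetical helper name.
    static String collapse(String url) {
        return url.replaceAll(DOUBLE_SLASH, "/");
    }

    public static void main(String[] args) {
        System.out.println(collapse("http://google.com//view/All/builds")); // http://google.com/view/All/builds
        System.out.println(collapse("https://google.com///view"));          // https://google.com/view
    }
}
```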
Is Regex the right approach?
In case you wanted this solution as part of an exercise to improve your regex skills, then fine. But what is it that you're really trying to achieve? You're probably trying to normalize a URL. Replacing // with / is one aspect of normalizing a URL. But what about other aspects, like removing redundant ./ and collapsing ../ with their parent directories? What about different protocols? What about ///? What about the // at the start? What about /// at the start in case of file:///?
If you want to write a generic, reusable piece of code, using a regular expression is probably not the best approach. And it's reinventing the wheel. Instead, consider java.net.URI.normalize().
java.net.URI.normalize()
java.lang.String
String inputUrl = "http://localhost:1234//foo//bar//buzz";
String normalizedUrl = new URI(inputUrl).normalize().toString();
java.net.URL
URL inputUrl = new URL("http://localhost:1234//foo//bar//buzz");
URL normalizedUrl = inputUrl.toURI().normalize().toURL();
java.net.URI
URI inputUri = new URI("http://localhost:1234//foo//bar//buzz");
URI normalizedUri = inputUri.normalize();
Regex
In case you do want to use a regular expression, think of all possibilities. What if, in future, this should also process other protocols, like https, file, ftp, fish, and so on? So, think again, and probably use URI.normalize(). But if you insist on a regular expression, maybe use this one:
String normalizedUri = uri.replaceAll("(?<!\\w+:/?)//+", "/");
Compared to other solutions, this works with all URLs that look similar to HTTP URLs just with different protocols instead of http, like https, file, ftp and so on, and it will keep the triple-slash /// in case of file:///. But, unlike java.net.URI.normalize(), this does not remove redundant ./, it does not collapse ../ with their parent directories, it does not handle other aspects of URL normalization that you and I might have forgotten about, and it will not be updated automatically with newer RFCs about URLs, URIs, and such.
String to = from.replaceAll("(?<!(http:|https:))[//]+", "/");
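Here [//]+ is a character class, so it matches any run of one or more slashes that does not start directly after http: or https:, and replaces the run with a single slash.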
Here is the regexp:
/(?<=[^:\s])(\/+\/)/g
It finds multiple slashes in url preserving ones after protocol regardless of it.
Handles also protocol relative urls which start from //.
@Test
public void shouldReplaceMultipleSlashes() {
assertEquals("http://google.com/?q=hi", replaceMultipleSlashes("http://google.com///?q=hi"));
assertEquals("https://google.com/?q=hi", replaceMultipleSlashes("https:////google.com//?q=hi"));
assertEquals("//somecdn.com/foo/", replaceMultipleSlashes("//somecdn.com/foo///"));
}
private static String replaceMultipleSlashes(String url) {
return url.replaceAll("(?<=[^:\\s])(\\/+\\/)", "/");
}
Literally means:
(\/+\/) - find group: /+ one or more slashes followed by / slash
(?<=[^:\s]) - the group must follow (positive lookbehind) a character from the negated set [^:\s], that is, anything except : colon and \s whitespace
g - global search flag
I suggest you simply use String.replace, which is documented at http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#replace(java.lang.CharSequence, java.lang.CharSequence)
Something like
myString.replace("//", "/");
If you want to keep the first occurrence (the // after the protocol):
String[] parts = str.split("//", 2);
str = parts[0] + "//" + parts[1].replaceAll("//", "/");
Which is the simplest way (without regular expression). I don't know the regular expression corresponding, if there is an expert looking at the thread.... ;)

Extracting an "encompassing" string based on a term within the string

I have a java function to extract a string out of the HTML Page source for any website...The function basically accepts the site name, along with a term to search for. Now, this search term is always contained within javascript tags. What I need to do is to pull the entire javascript (within the tags) that contains the search term.
Here is an example -
<script type="text/javascript">
//Roundtrip
rtTop = Number(new Date());
document.documentElement.className += ' jsenabled';
</script>
For the javascript snippet above, my search term would be "rtTop". Once found, I want my function to return the string containing everything within the script tags.
Any novel solution? Thanks.
You could use a regular expression along the lines of
String someHTML = //get your HTML from wherever
Pattern pattern = Pattern.compile("<script type=\"text/javascript\">(.*?rtTop.*?)</script>",Pattern.DOTALL);
Matcher myMatcher = pattern.matcher(someHTML);
myMatcher.find();
String result = myMatcher.group(1);
I wish I could just comment on JacobM's answer, but I think I need more stackCred.
You could use an HTML parser, that's usually the better solution. That said, for limited scopes, I often use regEx. It's a mean beast though. One change I would make to JacobM's pattern is to replace the attributes within the opening element with [^<]+
That will allow you to match even if the "type" isn't present or if it has some other oddities. I'd also wrap the .*? with parens to make using the values later a little easier.
* UPDATE *
Borrowing from JacobM's answer. I'd change the pattern a little bit to handle multiple elements.
String someHTML = //get your HTML from wherever
String lKeyword = "rtTop";
String lRegexPattern = "(.*)(<script[^>]*>(((?!</).)*)"+lKeyword +"(((?!</).)*)</script>)(.*)";
Pattern pattern = Pattern.compile(lRegexPattern ,Pattern.DOTALL);
Matcher myMatcher = pattern.matcher(someHTML);
myMatcher.find();
String lPreKeyword = myMatcher.group(3);
String lPostKeyword = myMatcher.group(5);
String result = lPreKeyword + lKeyword + lPostKeyword;
An example of this pattern in action can be found here. Like I said, parsing HTML via regex can get real ugly real fast.
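Another option, still regex-based and with the same caveats about parsing HTML this way: match each script element on its own and check the body for the keyword afterwards, so a match can never span two script blocks. This is only a sketch; the class and method names are made up, and it assumes the closing tag is written exactly as </script>.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScriptFinder {
    // One script element per match; the lazy ".*?" stops at the first closing tag.
    private static final Pattern SCRIPT = Pattern.compile(
        "<script\\b[^>]*>(.*?)</script>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: body of the first script element that contains the term.
    static Optional<String> scriptContaining(String html, String term) {
        Matcher m = SCRIPT.matcher(html);
        while (m.find()) {
            String body = m.group(1);
            if (body.contains(term)) {
                return Optional.of(body);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<script>var a = 1;</script>"
            + "<script type=\"text/javascript\">\nrtTop = Number(new Date());\n</script>";
        System.out.println(scriptContaining(html, "rtTop").orElse("not found"));
    }
}
```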

Very Simple Regex Question

I have a very simple regex question. Suppose I have 2 conditions:
url =http://www.abc.com/cde/def
url =https://www.abc.com/sadfl/dsaf
How can I extract the baseUrl using regex?
Sample output:
http://www.abc.com
https://www.abc.com
Like this:
String baseUrl;
Pattern p = Pattern.compile("^(([a-zA-Z]+://)?[a-zA-Z0-9.-]+\\.[a-zA-Z]+(:\\d+)?)/");
Matcher m = p.matcher(str);
if (m.find())
baseUrl = m.group(1);
However, you should use the URI class instead, like this:
URI uri = new URI(str);
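Spelled out a bit further, the URI route might look like the sketch below. The class and method names are hypothetical, and rejecting URLs without a scheme or authority is my assumption. getAuthority() keeps a port (and any user info) if the URL has one.

```java
import java.net.URI;

class BaseUrl {
    // Hypothetical helper: scheme plus authority, e.g. "http://www.abc.com".
    static String baseUrl(String url) {
        // URI.create throws IllegalArgumentException for a malformed URI.
        URI uri = URI.create(url);
        if (uri.getScheme() == null || uri.getAuthority() == null) {
            throw new IllegalArgumentException("Not an absolute URL: " + url);
        }
        return uri.getScheme() + "://" + uri.getAuthority();
    }

    public static void main(String[] args) {
        System.out.println(baseUrl("http://www.abc.com/cde/def"));     // http://www.abc.com
        System.out.println(baseUrl("https://www.abc.com/sadfl/dsaf")); // https://www.abc.com
    }
}
```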
A one liner without regexp:
String baseUrl = url.substring(0, url.indexOf('/', url.indexOf("//")+2));
/^(https?\:\/\/[^\/]+).*/$1/
This will capture ANYTHING that starts with http and $1 will contain everything from the beginning to the first / after the //
Except for write-and-throw-away scripts, you should always refrain from parsing complex syntaxes (e-mail addresses, urls, html pages, etc etc) using regexes.
believe me, you will get bitten eventually.
I'm pretty sure that there is a Java class that will allow path manipulations, but if it has to be a regex,
https?://[^/]+
would work. (s? included to also handle https:)
Looks like the simplest solution to your two specific examples would be the pattern:
[^/]*//[^/]+
i.e.: non-slash (0 or more times), two slashes, non-slash (1 or more times). You can be stricter than that if you wish, as the two existing answers are doing in different ways -- one will reject e.g. URLs starting with ftp:, the other will reject domains with underscores (but accept URLs without a leading protocol://, thereby being even broader than mine in that respect). This variety of answers (all correct wrt your scant specs;-) should suggest to you that your specs are too vague and should be tightened.
Here's a regex that should satisfy the problem as given.
https?://[^/]*
I'm assuming you're asking this partly to gain more knowledge of regexes. If, however, you're trying to pull the host from a URL, it's arguably much more correct to use Java's more robust parsing methods:
String urlStr = "https://www.abc.com/stuff";
URL url = new URL(urlStr);
String host = url.getHost();
String protocol = url.getProtocol();
URL baseUrl = new URL(protocol, host, ""); // there is no two-argument (protocol, host) constructor
This is better, as it should catch more cases if your input URL isn't as strict as described above.
Old post.. thought I might as well put a simple answer to a simple regex Q:
(http|https):\/\/(www.)?(\w+)?\.(\w+)?
