I am developing a http robot, and I developed this regular expression
(((?:f|ht)tp(?:s)?\\://)?|www)([^/]+) to detect and extract the hostname from a link (href).
Now I put here the results of the tests:
String -> http://www.meloteca.com/empresas-editoras.htm
Returns http://www.meloteca.com
String -> www.meloteca.com/empresas-editoras.htm
Returns www.meloteca.com
String -> /empresas-editoras.htm
Returns empresas-editoras.htm (without the slash)
In this case I was expecting that the regular expressions did not return any value? Why is this happening?
The same thing if I try with the following String
String -> empresas-editoras.htm
Returns empresas-editoras.htm
The snippet of code :
Pattern padrao = Pattern.compile("(((?:f|ht)tp(?:s)?\\://)?|www)([^/]+)");
Matcher mat = padrao.matcher("empresas-editoras.htm");
if(mat.find())
System.out.println("Host->"+mat.group());
It'd be better to use the URI class, and its methods like getHost() and getPath(), rather than a regular expression. The rules for constructing URIs are more complex than you probably realize, and your regex is likely to have lots of corner cases that won't be handled correctly.
If you remove one of the question marks, like this:
(((?:f|ht)tp(?:s)?\\://)|www)([^/]+)
it should work better.
The alternative ((?:f|ht)tp(?:s)?\\://)? is optional, so it can be the empty string, and then ([^/]+) just will match any string not containing /.
Related
I have a part of HTML from a website in the below String format:
srcset=" /tesla_theme/assets/img/homepage/mobile/homepage-models--touch#200w.jpg?20170808 200w, /tesla_theme/assets/img/homepage/mobile/homepage-models--touch#338w.jpg?20170808 338w, /tesla_theme/assets/img/homepage/mobile/homepage-models--touch#445w.jpg?20170808 445w, tesla_theme/assets/img/homepage/mobile/homepage-models--touch#542w.jpg?20170808 542w, /tesla_theme/assets/img/homepage/mobile/homepage-models--touch#750w.jpg?20170808 750w"
I want to add http://tesla.com in front of all the urls in the srcset element like http://tesla_theme/assets/img/homepage/mobile/homepage-models--touch#750w.jpg?20170808 750w
I believe this could be done using regex, but I am not sure.
How do I do this using Java if I have multiple srcset elements in a html string variable, and I want to replace all of the srcset url.'s and add the server url in front?
Note: The /tesla_theme will not be consistent, so I cannot use replaceAll, instead, i will have to use regex.
You can simply use String Class replace method as below, It will replace all "/_tesla" in the given String. No special regex required unless you have a kind of pattern instead of "/tesla"
String srcset=" /tesla_theme/assets/img/homepage/mobile/homepage-models--touch#200w.jpg?20170808 200w, /tesla_theme/assets/img/homepage/mobile/homepage-models--touch#338w.jpg?20170808 338w, /tesla_theme/assets/img/homepage/mobile/homepage-models--touch#445w.jpg?20170808 445w, tesla_theme/assets/img/homepage/mobile/homepage-models--touch#542w.jpg?20170808 542w, /tesla_theme/assets/img/homepage/mobile/homepage-models--touch#750w.jpg?20170808 750w";
String requiredSrcSet = srcset.replace("/tesla_", "http://tesla_");
I would like to use Java regex to match a domain of a url, for example,
for www.table.google.com, I would like to get 'google' out of the url, namely, the second last word in this URL string.
Any help will be appreciated !!!
It really depends on the complexity of your inputs...
Here is a pretty simple regex:
.+\\.(.+)\\..+
It fetches something that is inside dots \\..
And here are some examples for that pattern: https://regex101.com/r/L52oz6/1.
As you can see, it works for simple inputs but not for complex urls.
But why reinventing the wheel, there are plenty of really good libraries that correctly parse any complex url. But sure, for simple inputs a small regex is easily build. So if that does not solve the problem for your inputs then please callback, I will adjust the regex pattern then.
Note that you can also just use simple splitting like:
String[] elements = input.split("\\.");
String secondToLastElement = elements[elements.length - 2];
But don't forget the index-bound checking.
Or if you search for a very quick solution than walk through the input starting from the last position. Work your way through until you found the first dot, continue until the second dot was found. Then extract that part with input.substring(index1, index2);.
There is also already a delegate method for exactly that purpose, namely String#lastIndexOf (see the documentation).
Take a look at this code snippet:
String input = ...
int indexLastDot = input.lastIndexOf('.');
int indexSecondToLastDot = input.lastIndexOf('.', indexLastDot);
String secondToLastWord = input.substring(indexLastDot, indexSecondToLastDot);
Maybe the bounds are off by 1, haven't tested the code, but you get the idea. Also don't forget bound checking.
The advantage of this approach is that it is really fast, it can directly work on the internal structures of Strings without creating copies.
My attempt:
(?<scheme>https?:\/\/)?(?<subdomain>\S*?)(?<domainword>[^.\s]+)(?<tld>\.[a-z]+|\.[a-z]{2,3}\.[a-z]{2,3})(?=\/|$)
Demo. Works correctly for:
http://www.foo.stackoverflow.com
http://www.stackoverflow.com
http://www.stackoverflow.com/
http://stackoverflow.com
https://www.stackoverflow.com
www.stackoverflow.com
stackoverflow.com
http://www.stackoverflow.com
http://www.stackoverflow.co.uk
foo.www.stackoverflow.com
foo.www.stackoverflow.co.uk
foo.www.stackoverflow.co.uk/a/b/c
private static final Pattern URL_MATCH_GET_SECOND_AND_LAST =
Pattern.compile("www.(.*)//.google.(.*)", Pattern.CASE_INSENSITIVE);
String sURL = "www.table.google.com";
if (URL_MATCH_GET_SECOND_AND_LAST.matcher(sURL).find()){
Matcher matchURL = URL_MATCH_GET_SECOND_AND_LAST .matcher(sURL);
if (matchURL .find()) {
String sFirst = matchURL.group(1);
String sSecond= matchURL.group(2);
}
}
what is the regular expression to find the path param from the url?
http://localhost:8080/domain/v1/809pA8
https://localhost:8080/domain/v1/809pA8
Want to retrieve the value(809pA8) from the above URL using regular expression, java is preferable.
I would suggest you do something like
url.substring(url.lastIndexOf('/') + 1);
If you really prefer regexps, you could do
Matcher m = Pattern.compile("/([^/]+)$").matcher(url);
if (m.find())
value = m.group(1);
I would try:
String url = "http://localhost:8080/domain/v1/809pA8";
String value = String.valueOf(url.subSequence(url.lastIndexOf('/'), url.length()-1));
No need for regex here, I think.
EDIT: I'm sorry I made a mistake:
String url = "http://localhost:8080/domain/v1/809pA8";
String value = String.valueOf(url.subSequence(url.lastIndexOf('/')+1, url.length()));
See this code working here: https://ideone.com/E30ddC
For your simple case, regex is an overkill, as others noted. But, if you have more cases and this is why you prefer regex, give Spring's AntPathMatcher#extractUriTemplateVariables a look, if you're using Spring. It's actually better equipped for extracting path variables than regex directly. Here are some good examples.
For the given url like "http://google.com//view/All/builds", i want to replace the double slash with single slash. For example the above url should display as "http://google.com/view/All/builds"
I dint know regular expressions. Can any one help me, how can i achieve this using regular expressions.
To avoid replacing the first // in http:// use the following regex :
String to = from.replaceAll("(?<!http:)//", "/");
PS: if you want to handle https use (?<!(http:|https:))// instead.
Is Regex the right approach?
In case you wanted this solution as part of an exercise to improve your regex skills, then fine. But what is it that you're really trying to achieve? You're probably trying to normalize a URL. Replacing // with / is one aspect of normalizing a URL. But what about other aspects, like removing redundant ./ and collapsing ../ with their parent directories? What about different protocols? What about ///? What about the // at the start? What about /// at the start in case of file:///?
If you want to write a generic, reusable piece of code, using a regular expression is probably not the best appraoch. And it's reinventing the wheel. Instead, consider java.net.URI.normalize().
java.net.URI.normalize()
java.lang.String
String inputUrl = "http://localhost:1234//foo//bar//buzz";
String normalizedUrl = new URI(inputUrl).normalize().toString();
java.net.URL
URL inputUrl = new URL("http://localhost:1234//foo//bar//buzz");
URL normalizedUrl = inputUrl.toURI().normalize().toURL();
java.net.URI
URI inputUri = new URI("http://localhost:1234//foo//bar//buzz");
URI normalizedUri = inputUri.normalize();
Regex
In case you do want to use a regular expression, think of all possibilities. What if, in future, this should also process other protocols, like https, file, ftp, fish, and so on? So, think again, and probably use URI.normalize(). But if you insist on a regular expression, maybe use this one:
String noramlizedUri = uri.replaceAll("(?<!\\w+:/?)//+", "/");
Compared to other solutions, this works with all URLs that look similar to HTTP URLs just with different protocols instead of http, like https, file, ftp and so on, and it will keep the triple-slash /// in case of file:///. But, unlike java.net.URI.normalize(), this does not remove redundant ./, it does not collapse ../ with their parent directories, it does not other aspects of URL normalization that you and I might have forgotten about, and it will not be updated automatically with newer RFCs about URLs, URIs, and such.
String to = from.replaceAll("(?<!(http:|https:))[//]+", "/");
will match two or more slashes.
Here is the regexp:
/(?<=[^:\s])(\/+\/)/g
It finds multiple slashes in url preserving ones after protocol regardless of it.
Handles also protocol relative urls which start from //.
#Test
public void shouldReplaceMultipleSlashes() {
assertEquals("http://google.com/?q=hi", replaceMultipleSlashes("http://google.com///?q=hi"));
assertEquals("https://google.com/?q=hi", replaceMultipleSlashes("https:////google.com//?q=hi"));
assertEquals("//somecdn.com/foo/", replaceMultipleSlashes("//somecdn.com/foo///"));
}
private static String replaceMultipleSlashes(String url) {
return url.replaceAll("(?<=[^:\\s])(\\/+\\/)", "/");
}
Literally means:
(\/+\/) - find group: /+ one or more slashes followed by / slash
(?<=[^:\s]) - which follows the group (*posiive lookbehind) of this (*negated set) [^:\s] that excludes : colon and \s whitespace
g - global search flag
I suggest you simply use String.replace which documentation is http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#replace(java.lang.CharSequence, java.lang.CharSequence)
Something like
`myString.replace("//", "/");
If you want to remove the first occurence:
String[] parts = str.split("//", 2);
str = parts[0] + "//" + parts[1].replaceAll("//", "/");
Which is the simplest way (without regular expression). I don't know the regular expression corresponding, if there is an expert looking at the thread.... ;)
I need to check if an URL is valid or not. The URL should contain some subdirectories like as:
example.com/test/test1/example/a.html
The URL should contain subdirectories test, test1 and example. How can I check if the URL is valid using regex in Java?
String url = "example.com/test/test1/example/a.html";
List<String> parts = Arrays.asList(url.split("/"));
return (parts.contains("test") && parts.contains("test1") && parts.contains("example"));
Since you want to do in regex, how about this...
Pattern p = Pattern.compile("example\\.com/test/test1/example/[\\w\\W]*");
System.out.println("OK: " + p.matcher("example.com/test/test1/example/a.html").find());
System.out.println("KO: " + p.matcher("example.com/test/test2/example/a.html").find());
You can simply pass your URL as an argument to the java.net.URL(String) constructor and check if the constructor throws java.net.MalformedURLException.
EDIT If, however, you simply want to check if a given string contains a given substring, use the String.contains(CharSequence) method. For example:
String url = "example.com/test/test1/example/a.html";
if (url.contains("/test/test1/")) {
// ...
}
This question is answered here using regular expressions:
Regular expression to match URLs in Java
But you can use the library Apache Commons Validators to use some tested validators instead to write your own.
Here is the library:
http://commons.apache.org/validator/
And here the javadoc of the URL Validator.
http://commons.apache.org/validator/apidocs/org/apache/commons/validator/UrlValidator.html