Getting file extension from http url using Java - java

I know about FilenameUtils.getExtension() from Apache Commons IO.
But in my case I'm extracting extensions from http(s) URLs, so if I have something like
https://your_url/logo.svg?position=5
this method returns svg?position=5
Is there a good way to handle this situation, ideally without writing the logic myself?

You can use the java.net.URL class from the Java standard library. It is very useful in cases like this. You can do something like this:
String url = "https://your_url/logo.svg?position=5";
URL fileIneed = new URL(url);
Then you have a lot of getter methods on the "fileIneed" variable. In your case, "getPath()" returns this:
fileIneed.getPath() ---> "/logo.svg"
Then pass that to the Apache method you are already using, and you get the "svg" String:
FilenameUtils.getExtension(fileIneed.getPath()) ---> "svg"
Java URL class docs >>>
https://docs.oracle.com/javase/7/docs/api/java/net/URL.html
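For illustration, the same idea can be sketched without the Apache dependency. This is only a minimal sketch, not the exact method from the answer: it swaps java.net.URL for java.net.URI (whose getPath() also leaves out the query string), and the class name UrlExtension and the method extensionOf are hypothetical names made up for this example.

```java
import java.net.URI;

final class UrlExtension {
    private UrlExtension() {}

    // Returns the extension of the last path segment, or "" if there is none.
    // URI.create throws IllegalArgumentException if the string is not a valid URI.
    static String extensionOf(String url) {
        String path = URI.create(url).getPath(); // the query string is not part of the path
        if (path == null) {
            return ""; // opaque URIs such as mailto: have no path
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1);
    }
}
```

Calling UrlExtension.extensionOf("https://example.com/logo.svg?position=5") should give "svg", because the query string never reaches the extension logic.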

If you want a brandname® solution, then consider using the Apache method after stripping off the query string, if it exists:
String url = "https://your_url/logo.svg?position=5";
url = url.replaceAll("\\?.*$", "");
String ext = FilenameUtils.getExtension(url);
System.out.println(ext);
If you want a one-liner which does not even require an external library, then consider this option using String#replaceAll:
String url = "https://your_url/logo.svg?position=5";
String ext = url.replaceAll(".*/[^.]+\\.([^?]+)\\??.*", "$1");
System.out.println(ext);
svg
Here is an explanation of the regex pattern used above:
.*/ match everything up to, and including, the LAST path separator
[^.]+ then match one or more non-dot characters, i.e. match the filename
\. match a dot
([^?]+) match AND capture one or more non-? characters, which is the extension
\??.* match an optional ? followed by the rest of the query string, if present
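One thing to keep in mind with the replaceAll one-liner: if the pattern does not match, replaceAll returns the input unchanged, so you get the whole URL back instead of an extension. A hedged sketch of the same regex with an explicit match check (the class name RegexExtension and the method extensionOf are hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexExtension {
    // Same pattern as in the answer above, compiled once.
    private static final Pattern EXT = Pattern.compile(".*/[^.]+\\.([^?]+)\\??.*");

    private RegexExtension() {}

    // Returns the captured extension, or "" when the pattern does not match at all.
    static String extensionOf(String url) {
        Matcher m = EXT.matcher(url);
        return m.matches() ? m.group(1) : "";
    }
}
```

The pattern still has limits. When the last path segment has no dot but the host does, as in https://example.com/logo, the regex backtracks to an earlier slash and captures "com/logo". Parsing the URL first avoids that.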

Related

Trying to replace part of a string starts with /x2D

In JMeter, I used a Regular Expression Extractor to extract part of an HTML response. I then passed that to a BeanShell Post Processor. However, I am having trouble replacing \x2D with -. Is there a way to do this or perhaps do I need to extract the response as
String yourvar = vars.get("accessToken");
String anotherVar = yourvar.replace("data.access_token = '","");
String finalAccessToken = anotherVar.replace("\x2D","-");
vars.put("finalAccessToken",finalAccessToken);
It is not liking the "\x2D" part. It works if I find \x2D but the original string only has .
You need to escape the backslash in your target String parameter.
final String finalAccessToken = anotherVar.replace("\\x2D", "-");
If that's not what you're asking for, add more info to the question. That's all I was able to understand.
It is recommended to use JMeter's built-in test elements where possible. In your particular case you might be interested in the __strReplace() custom JMeter Function:
Install Custom JMeter Functions bundle using JMeter Plugins Manager
Use the following expression to make the replacement:
${__strReplace(${anotherVar},\\\x2D,-,)}
If you want to go for scripting - make sure to use JSR223 PostProcessor and Groovy language. Be aware that you will still need to escape backslash with another backslash like:
String finalAccessToken = anotherVar.replace("\\x2D","-");
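To see why the double backslash is needed: in Java source, "\x2D" is not a valid escape sequence, while "\\x2D" is the four-character text backslash, x, 2, D. String.replace treats both arguments literally (no regex), so no further escaping is required. A small plain-Java sketch, outside of JMeter, with a hypothetical class and method name:

```java
final class HexEscapeReplace {
    private HexEscapeReplace() {}

    // Replaces the literal four characters \x2D with a hyphen.
    // "\\x2D" in source code is: backslash, 'x', '2', 'D'.
    static String unescapeHyphen(String s) {
        return s.replace("\\x2D", "-");
    }
}
```

For example, passing the text abc\x2Ddef (written "abc\\x2Ddef" in source) should give abc-def.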

How to replace double slash with single slash for an url

For a given URL like "http://google.com//view/All/builds", I want to replace the double slash with a single slash. For example, the above URL should become "http://google.com/view/All/builds".
I don't know regular expressions. Can anyone help me achieve this using regular expressions?
To avoid replacing the first // in http:// use the following regex :
String to = from.replaceAll("(?<!http:)//", "/");
PS: if you want to handle https use (?<!(http:|https:))// instead.
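A hedged, self-contained sketch of the https-aware variant above (the class name SlashCollapser and the method collapse are hypothetical):

```java
final class SlashCollapser {
    private SlashCollapser() {}

    // Replaces each "//" with "/" unless it directly follows "http:" or "https:".
    static String collapse(String url) {
        return url.replaceAll("(?<!(http:|https:))//", "/");
    }
}
```

Note that the pattern matches slashes in pairs, so in a single pass a run of three slashes is only reduced to two: http://a.com///b becomes http://a.com//b.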
Is Regex the right approach?
In case you wanted this solution as part of an exercise to improve your regex skills, then fine. But what is it that you're really trying to achieve? You're probably trying to normalize a URL. Replacing // with / is one aspect of normalizing a URL. But what about other aspects, like removing redundant ./ and collapsing ../ with their parent directories? What about different protocols? What about ///? What about the // at the start? What about /// at the start in case of file:///?
If you want to write a generic, reusable piece of code, using a regular expression is probably not the best approach. And it's reinventing the wheel. Instead, consider java.net.URI.normalize().
java.net.URI.normalize()
java.lang.String
String inputUrl = "http://localhost:1234//foo//bar//buzz";
String normalizedUrl = new URI(inputUrl).normalize().toString();
java.net.URL
URL inputUrl = new URL("http://localhost:1234//foo//bar//buzz");
URL normalizedUrl = inputUrl.toURI().normalize().toURL();
java.net.URI
URI inputUri = new URI("http://localhost:1234//foo//bar//buzz");
URI normalizedUri = inputUri.normalize();
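As a small illustration of the other normalization aspects mentioned above, here is a sketch of normalize() removing ./ and collapsing ../ segments. It uses URI.create instead of new URI(...) only to avoid the checked URISyntaxException; the class name NormalizeDemo and the method normalize are hypothetical.

```java
import java.net.URI;

final class NormalizeDemo {
    private NormalizeDemo() {}

    // Parses the string as a URI and returns its normalized form as a string.
    // URI.create throws IllegalArgumentException for syntactically invalid input.
    static String normalize(String url) {
        return URI.create(url).normalize().toString();
    }
}
```

For example, http://localhost:1234/foo/./bar/../buzz should come back as http://localhost:1234/foo/buzz.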
Regex
In case you do want to use a regular expression, think of all possibilities. What if, in future, this should also process other protocols, like https, file, ftp, fish, and so on? So, think again, and probably use URI.normalize(). But if you insist on a regular expression, maybe use this one:
String normalizedUri = uri.replaceAll("(?<!\\w+:/?)//+", "/");
Compared to other solutions, this works with all URLs that look similar to HTTP URLs just with different protocols instead of http, like https, file, ftp and so on, and it will keep the triple-slash /// in case of file:///. But, unlike java.net.URI.normalize(), this does not remove redundant ./, it does not collapse ../ with their parent directories, it does not handle other aspects of URL normalization that you and I might have forgotten about, and it will not be updated automatically with newer RFCs about URLs, URIs, and such.
String to = from.replaceAll("(?<!(http:|https:))[//]+", "/");
will match one or more slashes (the character class [//] is just a single slash), so a run of any length collapses to one slash.
Here is the regexp:
/(?<=[^:\s])(\/+\/)/g
It finds multiple slashes in url preserving ones after protocol regardless of it.
Handles also protocol relative urls which start from //.
@Test
public void shouldReplaceMultipleSlashes() {
assertEquals("http://google.com/?q=hi", replaceMultipleSlashes("http://google.com///?q=hi"));
assertEquals("https://google.com/?q=hi", replaceMultipleSlashes("https:////google.com//?q=hi"));
assertEquals("//somecdn.com/foo/", replaceMultipleSlashes("//somecdn.com/foo///"));
}
private static String replaceMultipleSlashes(String url) {
return url.replaceAll("(?<=[^:\\s])(\\/+\\/)", "/");
}
Literally means:
(\/+\/) - find group: /+ one or more slashes followed by / slash
(?<=[^:\s]) - where the group is preceded (positive lookbehind) by a character from the negated set [^:\s], i.e. anything except a : colon or \s whitespace
g - global search flag
I suggest you simply use String.replace, whose documentation is at http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#replace(java.lang.CharSequence, java.lang.CharSequence)
Something like
myString.replace("//", "/");
If you want to keep the first occurrence (the one after the protocol):
String[] parts = str.split("//", 2);
str = parts[0] + "//" + parts[1].replaceAll("//", "/");
Which is the simplest way (without regular expression). I don't know the regular expression corresponding, if there is an expert looking at the thread.... ;)

Regex for match web extensions

I want to check whether a URL points to an image, a JavaScript file, a PDF, etc.
String url = "www.something.com/script/sample.js?xyz=xyz";
The regex below works fine, but only without the ?xyz=xyz part:
".*(mov|jpg|gif|pdf|js)$"
When I remove the $ at the end, so that .js no longer has to be at the end, it returns false.
.*(mov|jpg|gif|pdf|js).*$ allows you to have any optional text after the file extension. The capturing group captures the file extension.
Use the regex as below:
.*\\.(mov|jpg|gif|pdf|js)\\?
This matches a literal dot (.) followed by one of your extensions and terminated by ?
The first dot (.) matches any character, while the second dot, prefixed by \\, matches a literal dot just before your extension list.
Why not use java.net.URL to parse the URL string? It could avoid lots of mismatching problems:
try {
URL url = new URL(urlString);
// getPath() excludes the query string; getFile() would include it
String filename = url.getPath();
// now test if the filename ends with one of your desired extensions.
} catch (Exception e) {
// In this case the URL cannot be parsed.
}
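The "test if the filename ends with your desired extensions" step could look roughly like the sketch below. It is an assumption-laden illustration, not the answerer's code: it swaps in java.net.URI, which accepts the scheme-less string from the question as a relative reference (new URL(...) would throw MalformedURLException for it), and the names ExtensionCheck and hasKnownExtension are hypothetical.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class ExtensionCheck {
    // Extension list taken from the question.
    private static final Set<String> EXTENSIONS =
            new HashSet<>(Arrays.asList("mov", "jpg", "gif", "pdf", "js"));

    private ExtensionCheck() {}

    // True if the last path segment ends in one of the known extensions.
    static boolean hasKnownExtension(String url) {
        String path = URI.create(url).getPath(); // query string is excluded
        if (path == null) {
            return false;
        }
        int dot = path.lastIndexOf('.');
        int slash = path.lastIndexOf('/');
        if (dot < 0 || dot < slash) {
            return false; // no dot in the last segment
        }
        return EXTENSIONS.contains(path.substring(dot + 1).toLowerCase(Locale.ROOT));
    }
}
```

Because only the text after the last dot is compared, something like /our.moving.day is rejected even though it contains ".mov".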
I'm not a big fan of this, but try:
.*\\.(mov|jpg|gif|pdf|js).*$
The problem is that it will accept things like "our.moving.day"
Also, post your code. There is always more than one way to skin a cat, and perhaps there is something wrong with your code, not the regex.
Also, try regex testers; there are a ton of them out there. I'm a big fan of:
http://rubular.com/ and http://gskinner.com/RegExr/ (but they are mostly for PHP/Ruby)

How to validate the URL using regex in Java?

I need to check whether a URL is valid or not. The URL should contain some subdirectories, like this:
example.com/test/test1/example/a.html
The URL should contain subdirectories test, test1 and example. How can I check if the URL is valid using regex in Java?
String url = "example.com/test/test1/example/a.html";
List<String> parts = Arrays.asList(url.split("/"));
return (parts.contains("test") && parts.contains("test1") && parts.contains("example"));
Since you want to do in regex, how about this...
Pattern p = Pattern.compile("example\\.com/test/test1/example/[\\w\\W]*");
System.out.println("OK: " + p.matcher("example.com/test/test1/example/a.html").find());
System.out.println("KO: " + p.matcher("example.com/test/test2/example/a.html").find());
You can simply pass your URL as an argument to the java.net.URL(String) constructor and check if the constructor throws java.net.MalformedURLException.
EDIT If, however, you simply want to check if a given string contains a given substring, use the String.contains(CharSequence) method. For example:
String url = "example.com/test/test1/example/a.html";
if (url.contains("/test/test1/")) {
// ...
}
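If the directories must appear next to each other and in that order (the List.contains check in the first answer ignores order), one possible sketch is to split on / and look for the sequence with Collections.indexOfSubList. The names PathCheck and hasDirectories are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class PathCheck {
    private PathCheck() {}

    // True if the given directory names occur consecutively, in order,
    // among the slash-separated parts of the URL string.
    static boolean hasDirectories(String url, String... dirs) {
        List<String> parts = Arrays.asList(url.split("/"));
        return Collections.indexOfSubList(parts, Arrays.asList(dirs)) >= 0;
    }
}
```

With this, example.com/test/test1/example/a.html passes for ("test", "test1", "example"), while example.com/example/test1/test/a.html does not.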
This question is answered here using regular expressions:
Regular expression to match URLs in Java
But you can use the Apache Commons Validator library to get tested validators instead of writing your own.
Here is the library:
http://commons.apache.org/validator/
And here the javadoc of the URL Validator.
http://commons.apache.org/validator/apidocs/org/apache/commons/validator/UrlValidator.html

Very Simple Regex Question

I have a very simple regex question. Suppose I have 2 conditions:
url =http://www.abc.com/cde/def
url =https://www.abc.com/sadfl/dsaf
How can I extract the baseUrl using regex?
Sample output:
http://www.abc.com
https://www.abc.com
Like this:
String baseUrl = null;
Pattern p = Pattern.compile("^(([a-zA-Z]+://)?[a-zA-Z0-9.-]+\\.[a-zA-Z]+(:\\d+)?)/");
Matcher m = p.matcher(str);
// find() rather than matches(): the pattern only covers the start of the URL
if (m.find())
baseUrl = m.group(1);
However, you should use the URI class instead, like this:
URI uri = new URI(str);
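A hedged sketch of how the URI class could then produce the base URL: combine the scheme and the authority (host plus optional port). It uses URI.create rather than new URI(...) only to avoid the checked URISyntaxException, and the names BaseUrl and baseUrl are hypothetical.

```java
import java.net.URI;

final class BaseUrl {
    private BaseUrl() {}

    // Returns scheme://authority for an absolute, hierarchical URL.
    // Assumes the input has both a scheme and an authority, as in the question.
    static String baseUrl(String url) {
        URI uri = URI.create(url);
        return uri.getScheme() + "://" + uri.getAuthority();
    }
}
```

For the two inputs in the question this should give http://www.abc.com and https://www.abc.com.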
A one liner without regexp:
String baseUrl = url.substring(0, url.indexOf('/', url.indexOf("//")+2));
/^(https?\:\/\/[^\/]+).*/$1/
This will capture ANYTHING that starts with http and $1 will contain everything from the beginning to the first / after the //
Except for write-and-throw-away scripts, you should always refrain from parsing complex syntaxes (e-mail addresses, URLs, HTML pages, etc.) using regexes.
Believe me, you will get bitten eventually.
I'm pretty sure that there is a Java class that will allow path manipulations, but if it has to be a regex,
https?://[^/]+
would work. (s? included to also handle https:)
Looks like the simplest solution to your two specific examples would be the pattern:
[^/]*//[^/]+
i.e.: non-slash (0 or more times), two slashes, non-slash (1 or more times). You can be stricter than that if you wish, as the two existing answers are doing in different ways -- one will reject e.g. URLs starting with ftp:, the other will reject domains with underscores (but accept URLs without a leading protocol://, thereby being even broader than mine in that respect). This variety of answers (all correct wrt your scant specs;-) should suggest to you that your specs are too vague and should be tightened.
Here's a regex that should satisfy the problem as given.
https?://[^/]*
I'm assuming you're asking this partly to gain more knowledge of regexes. If, however, you're trying to pull the host from a URL, it's arguably much more correct to use Java's more robust parsing methods:
String urlStr = "https://www.abc.com/stuff";
URL url = new URL(urlStr);
String host = url.getHost();
String protocol = url.getProtocol();
URL baseUrl = new URL(protocol, host, "");
This is better, as it should catch more cases if your input URL isn't as strict as described above.
Old post.. thought I might as well put a simple answer to a simple regex Q:
(http|https):\/\/(www.)?(\w+)?\.(\w+)?
