Regex to match web file extensions - Java

I want to check whether a URL points to an image, a JavaScript file, a PDF, etc.
String url = "www.something.com/script/sample.js?xyz=xyz";
The regex below works fine, but only without the ?xyz=xyz query string:
".*(mov|jpg|gif|pdf|js)$"
When I remove the $ at the end, so that .js no longer has to be at the end of the string, it still returns false.

.*(mov|jpg|gif|pdf|js).*$ allows any optional text after the file extension. The capturing group captures the file extension.

Use the regex as below:
.*\\.(mov|jpg|gif|pdf|js)\\?
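This matches a dot (.) followed by one of your extensions and a terminating ?. Because String.matches() must match the entire input, use this pattern with Matcher.find(), or append .* to it.
The first dot (.) matches any character, while the second dot, prefixed by \\, matches a literal dot just before your extension list.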
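The difference between find() and matches() can be sketched as follows. This is only an illustration; the class and method names are made up here and are not part of the original answer.

```java
import java.util.regex.Pattern;

final class ExtensionRegexDemo {
    // A dot, one of the listed extensions, then a literal '?'.
    private static final Pattern EXT_BEFORE_QUERY =
            Pattern.compile(".*\\.(mov|jpg|gif|pdf|js)\\?");

    // find() succeeds as soon as some part of the input matches.
    static boolean foundAnywhere(String url) {
        return EXT_BEFORE_QUERY.matcher(url).find();
    }

    // matches() succeeds only if the pattern covers the whole input.
    static boolean matchesWhole(String url) {
        return EXT_BEFORE_QUERY.matcher(url).matches();
    }

    public static void main(String[] args) {
        String url = "www.something.com/script/sample.js?xyz=xyz";
        System.out.println(foundAnywhere(url)); // true
        System.out.println(matchesWhole(url));  // false: text follows the '?'
    }
}
```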

Why not use java.net.URL to parse the URL string? It avoids a lot of mismatching problems:
try {
    URL url = new URL(urlString);
    // getPath() excludes the query string; getFile() would include it.
    String filename = url.getPath();
    // Now test whether the filename ends with one of your desired extensions.
} catch (MalformedURLException e) {
    // The URL could not be parsed.
}
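A minimal sketch of the parse-then-test idea, under a few assumptions: it uses java.net.URI instead of java.net.URL, because the URL constructor rejects a string without a protocol such as the one in the question, and the class name, method name and extension list are chosen here for illustration only.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class UrlExtensionCheck {
    // Extensions taken from the question's regex.
    private static final List<String> EXTENSIONS =
            Arrays.asList("mov", "jpg", "gif", "pdf", "js");

    static boolean hasKnownExtension(String urlString) {
        final String path;
        try {
            path = new URI(urlString).getPath(); // path only, no query string
        } catch (URISyntaxException e) {
            return false; // not parseable at all
        }
        if (path == null) {
            return false; // opaque URIs such as mailto: have no path
        }
        int dot = path.lastIndexOf('.');
        // A dot before the last slash belongs to a folder or host, not the file.
        if (dot < 0 || dot < path.lastIndexOf('/')) {
            return false;
        }
        String ext = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        return EXTENSIONS.contains(ext);
    }

    public static void main(String[] args) {
        System.out.println(hasKnownExtension("www.something.com/script/sample.js?xyz=xyz")); // true
        System.out.println(hasKnownExtension("http://www.something.com/our.moving.day"));    // false
    }
}
```

Comparing the text after the last dot against a list avoids the partial matches that a loose regex accepts.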

I'm not a big fan of this, but try:
.*\\.(mov|jpg|gif|pdf|js).*$
The problem is that it will accept things like "our.moving.day".
Also, post your code. There is always more than one way to skin a cat, and perhaps there is something wrong with your code, not the regex.
Also, try regex testers; there are a ton of them out there. I'm a big fan of:
http://rubular.com/ and http://gskinner.com/RegExr/ (but they are mostly for PHP/Ruby)

Related

Getting file extension from http url using Java

I know about FilenameUtils.getExtension() from Apache Commons IO.
But in my case I'm processing extensions from http(s) URLs, so if I have something like
https://your_url/logo.svg?position=5
this method returns svg?position=5
Is there a good way to handle this situation, ideally without writing the logic myself?
You can use Java's URL class, which has a lot of useful methods for cases like this. You could do something like this:
String url = "https://your_url/logo.svg?position=5";
URL fileIneed = new URL(url);
Then you have a lot of getter methods on the fileIneed variable. In your case, getPath() returns this:
fileIneed.getPath() ---> "/logo.svg"
Then use the Apache library that you are already using, and you will get the "svg" string:
FilenameUtils.getExtension(fileIneed.getPath()) ---> "svg"
Java URL class docs:
https://docs.oracle.com/javase/7/docs/api/java/net/URL.html
If you want a brandname® solution, then consider using the Apache method after stripping off the query string, if it exists:
String url = "https://your_url/logo.svg?position=5";
url = url.replaceAll("\\?.*$", "");
String ext = FilenameUtils.getExtension(url);
System.out.println(ext);
If you want a one-liner which does not even require an external library, then consider this option using String#replaceAll:
String url = "https://your_url/logo.svg?position=5";
String ext = url.replaceAll(".*/[^.]+\\.([^?]+)\\??.*", "$1");
System.out.println(ext);
svg
Here is an explanation of the regex pattern used above:
.*/ match everything up to, and including, the LAST path separator
[^.]+ then match any number of non dots, i.e. match the filename
\. match a dot
([^?]+) match AND capture any non ? character, which is the extension
\??.* match an optional ? followed by the rest of the query string, if present
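One thing to keep in mind with the replaceAll one-liner is that replaceAll returns the input unchanged when the pattern does not match. A hedged sketch of the same pattern used through Matcher, so that "no extension" can be told apart from a result (class and method names are invented for this example):

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlExtensionRegex {
    // Same pattern as in the one-liner above.
    private static final Pattern EXTENSION =
            Pattern.compile(".*/[^.]+\\.([^?]+)\\??.*");

    // Returns the captured extension, or empty if the pattern does not match.
    static Optional<String> extensionOf(String url) {
        Matcher m = EXTENSION.matcher(url);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extensionOf("https://your_url/logo.svg?position=5")); // Optional[svg]
        System.out.println(extensionOf("https://your_url/logo"));                // Optional.empty
    }
}
```

The pattern can still be fooled: for "https://example.com/logo" it backtracks into the host and captures "com/logo", so parsing the URL first is the safer route when hosts contain dots.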

How to replace double slash with single slash for an url

For a given URL like "http://google.com//view/All/builds", I want to replace the double slash with a single slash. For example, the above URL should be displayed as "http://google.com/view/All/builds".
I don't know regular expressions. Can anyone help me achieve this using regular expressions?
To avoid replacing the first // in http://, use the following regex:
String to = from.replaceAll("(?<!http:)//", "/");
PS: if you also want to handle https, use (?<!(http:|https:))// instead.
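A small runnable sketch of the lookbehind approach. The class and method names are hypothetical, and the second method, using //+ to consume a whole run of slashes, is an addition for comparison rather than part of the answer above.

```java
final class DoubleSlashDemo {
    // Replaces each "//" that is not directly preceded by "http:" or "https:".
    static String collapsePairs(String url) {
        return url.replaceAll("(?<!(http:|https:))//", "/");
    }

    // Same lookbehind, but "//+" consumes a whole run of slashes at once.
    static String collapseRuns(String url) {
        return url.replaceAll("(?<!(http:|https:))//+", "/");
    }

    public static void main(String[] args) {
        System.out.println(collapsePairs("http://google.com//view/All/builds")); // http://google.com/view/All/builds
        System.out.println(collapsePairs("http://google.com///view"));           // http://google.com//view
        System.out.println(collapseRuns("http://google.com///view"));            // http://google.com/view
    }
}
```

Because replaceAll does not rescan its own output, the pairwise version reduces three slashes only to two.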
Is Regex the right approach?
In case you wanted this solution as part of an exercise to improve your regex skills, then fine. But what is it that you're really trying to achieve? You're probably trying to normalize a URL. Replacing // with / is one aspect of normalizing a URL. But what about other aspects, like removing redundant ./ and collapsing ../ with their parent directories? What about different protocols? What about ///? What about the // at the start? What about /// at the start in case of file:///?
If you want to write a generic, reusable piece of code, using a regular expression is probably not the best approach. And it's reinventing the wheel. Instead, consider java.net.URI.normalize().
java.net.URI.normalize()
java.lang.String
String inputUrl = "http://localhost:1234//foo//bar//buzz";
String normalizedUrl = new URI(inputUrl).normalize().toString();
java.net.URL
URL inputUrl = new URL("http://localhost:1234//foo//bar//buzz");
URL normalizedUrl = inputUrl.toURI().normalize().toURL();
java.net.URI
URI inputUri = new URI("http://localhost:1234//foo//bar//buzz");
URI normalizedUri = inputUri.normalize();
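The snippets above omit the checked URISyntaxException. A minimal runnable sketch that handles it and also shows dot-segment removal; the helper name normalizeOrNull and the choice to return null for unparseable input are assumptions made for this example.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class NormalizeDemo {
    // Returns the normalized URL, or null if the input is not a valid URI.
    static String normalizeOrNull(String url) {
        try {
            return new URI(url).normalize().toString();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(normalizeOrNull("http://localhost:1234//foo//bar//buzz"));
        System.out.println(normalizeOrNull("http://localhost:1234/foo/./bar/../buzz"));
    }
}
```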
Regex
In case you do want to use a regular expression, think of all possibilities. What if, in future, this should also process other protocols, like https, file, ftp, fish, and so on? So, think again, and probably use URI.normalize(). But if you insist on a regular expression, maybe use this one:
String normalizedUri = uri.replaceAll("(?<!\\w+:/?)//+", "/");
Compared to other solutions, this works with all URLs that look similar to HTTP URLs but use a different protocol instead of http, like https, file, ftp and so on, and it will keep the triple slash /// in the case of file:///. But, unlike java.net.URI.normalize(), this does not remove redundant ./, it does not collapse ../ with their parent directories, it does not handle other aspects of URL normalization that you and I might have forgotten about, and it will not be updated automatically with newer RFCs about URLs, URIs, and such.
String to = from.replaceAll("(?<!(http:|https:))[//]+", "/");
will match one or more consecutive slashes (a single slash is simply replaced by itself), so runs of two or more collapse to one.
Here is the regexp:
/(?<=[^:\s])(\/+\/)/g
It finds multiple slashes in a URL while preserving the ones after the protocol, whatever the protocol is.
It also handles protocol-relative URLs, which start with //.
@Test
public void shouldReplaceMultipleSlashes() {
assertEquals("http://google.com/?q=hi", replaceMultipleSlashes("http://google.com///?q=hi"));
assertEquals("https://google.com/?q=hi", replaceMultipleSlashes("https:////google.com//?q=hi"));
assertEquals("//somecdn.com/foo/", replaceMultipleSlashes("//somecdn.com/foo///"));
}
private static String replaceMultipleSlashes(String url) {
return url.replaceAll("(?<=[^:\\s])(\\/+\\/)", "/");
}
Literally, this means:
(\/+\/) - find group: \/+ one or more slashes followed by \/ one more slash
(?<=[^:\s]) - a positive lookbehind requiring that the character before the group belongs to the negated set [^:\s], i.e. is neither a : colon nor \s whitespace
g - global search flag
I suggest you simply use String.replace, which is documented at http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#replace(java.lang.CharSequence, java.lang.CharSequence)
Something like:
myString.replace("//", "/");
If you want to keep the first occurrence (the // after the protocol):
String[] parts = str.split("//", 2);
str = parts[0] + "//" + parts[1].replaceAll("//", "/");
This is the simplest way I know of. I don't know the corresponding regular expression; maybe an expert looking at the thread does. ;)
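The split-based idea fails with an ArrayIndexOutOfBoundsException when the input contains no // at all, because split then returns a single element. A guarded sketch, with hypothetical names, that also collapses runs longer than two slashes:

```java
final class KeepFirstDoubleSlash {
    // Keeps the first "//" (normally the one after the protocol) and
    // collapses every later run of two or more slashes into one.
    static String collapseAfterFirst(String str) {
        String[] parts = str.split("//", 2);
        if (parts.length < 2) {
            return str; // no "//" in the input, nothing to do
        }
        return parts[0] + "//" + parts[1].replaceAll("/{2,}", "/");
    }

    public static void main(String[] args) {
        System.out.println(collapseAfterFirst("http://google.com//view/All/builds"));
        System.out.println(collapseAfterFirst("/view/All"));
    }
}
```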

Regular expression String for URL in Java

I would like to validate URLs in Java with a regular expression. I found this comment and tried to use it in my code as follows:
private static final String PATTERN_URL = "/((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+(:[0-9]+)?|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?)/";
.....
if (!urlString.matches(PATTERN_URL)) {
System.err.println("Invalid URL");
return false;
}
But I get a compile-time error on the declaration of my PATTERN_URL variable. I have no idea how to format it, and I am worried that it will become an invalid regex if I modify it. Can anyone fix it for me without changing the original meaning? Thanks for your help.
Your regex looks fine. You just need to format it for a Java string literal by escaping all the backslashes:
\ --> \\
Resulting in this:
"/((([A-Za-z]{3,9}:(?:\\/\\/)?)(?:[-;:&=\\+\\$,\\w]+@)?[A-Za-z0-9.-]+(:[0-9]+)?|(?:www.|[-;:&=\\+\\$,\\w]+@)[A-Za-z0-9.-]+)((?:\\/[\\+~%\\/.\\w-_]*)?\\??(?:[-\\+=&;%@.\\w_]*)#?(?:[\\w]*))?)/"
When the Java compiler reads this string literal, each \\ becomes a single \, so java.util.regex.Pattern receives exactly the regex you want. Note that Java has no regex delimiters, so the leading and trailing / are matched as literal slashes; drop them if you use the pattern with String.matches(). You can check the resulting regex by printing it:
System.out.println(Pattern.compile(PATTERN_URL));
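To see both effects on something shorter than the URL pattern, here is a small sketch with a toy regex. The constants and class name are made up for illustration. It shows that a doubled backslash in Java source is a single backslash in the regex, and that JavaScript-style / delimiters are treated as ordinary characters in Java.

```java
import java.util.regex.Pattern;

final class JavaRegexLiteralDemo {
    // In Java source, "\\d+" is the three-character regex \d+
    static final String DIGITS = "\\d+";
    // Slashes are not delimiters in Java: this regex expects a literal
    // '/' before and after the digits.
    static final String DELIMITED_DIGITS = "/\\d+/";

    public static void main(String[] args) {
        System.out.println(Pattern.compile(DIGITS));             // prints \d+
        System.out.println("123".matches(DIGITS));               // true
        System.out.println("123".matches(DELIMITED_DIGITS));     // false
        System.out.println("/123/".matches(DELIMITED_DIGITS));   // true
    }
}
```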

Use RegEx in Java to extract parameters in between parentheses

I'm writing a utility to extract the names of header files from JSPs. I have no problem reading the JSPs line by line and finding the lines I need. I am having a problem extracting the specific text needed using regex. After looking at many similar questions I'm hitting a brick wall.
An example of the String I'll be matching from within is:
<jsp:include page="<%=Pages.getString(\"MY_HEADER\")%>" flush="true"></jsp:include>
All I need is MY_HEADER for this example. Any time I have this tag:
<%=Pages.getString
I need what comes between this:
<%=Pages.getString(\" and this: )%>
Here is what I have currently (which is not working, I might add) :
String currentLine;
while ((currentLine = fileReader.readLine()) != null)
{
Pattern pattern = Pattern.compile("<%=Pages\\.getString\\(\\\\\"([^\\\\]*)");
Matcher matcher = pattern.matcher(currentLine);
while(matcher.find()) {
System.out.println(matcher.group(1).toString());
}}
I need to be able to use the Java RegEx API and regex to extract those header names.
Any help on this issue is greatly appreciated. Thanks!
EDIT:
Resolved this issue, thankfully. The tricky part was that, even with the right regex, the String I was feeding to it always contains two \ characters, as in (\"MY_HEADER\"), and these needed to be escaped in the pattern.
Here is what worked (thanks to the help ;-)):
Pattern pattern = Pattern.compile("<%=Pages\\.getString\\(\\\\\"([^\\\\\"]*)");
This should do the trick:
<%=Pages\\.getString\\(\\\\\"([^\\\\]*)
Yeah that's a scary number of back slashes. matcher.group(1) should return MY_HEADER. It starts at the \" and matches everything until the next \ (which I assume here will be at \")%>.)
Of course, if your target text contains a backslash (\), this will not work. But you didn't give an indication that you'd ever be looking for something like <%=Pages.getString(\"Fun!\Yay!\")%> -- where this regex would only return Fun! and ignore the rest.
EDIT
The reason your test case was failing is because you were using this test string:
String currentLine = "<%=Pages.getString(\"MY_HEADER\")%>";
This is the equivalent of reading it in from a file and seeing:
<%=Pages.getString("MY_HEADER")%>
Note the lack of any \. You need to use this instead:
String currentLine = "<%=Pages.getString(\\\"MY_HEADER\\\")%>";
Which is the equivalent of what you want.
This is test code that works:
String currentLine = "<%=Pages.getString(\\\"MY_HEADER\\\")%>";
Pattern pattern = Pattern.compile("<%=Pages\\.getString\\(\\\\\"([^\\\\]*)");
Matcher matcher = pattern.matcher(currentLine);
while(matcher.find()) {
System.out.println(matcher.group(1).toString());
}
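If it is unclear whether the lines read from the file contain the literal backslashes, one option is to make the backslash optional in the pattern. The following is a sketch under that assumption, not the pattern from the answer above; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HeaderNameExtractor {
    // Regex: <%=Pages\.getString\(\\?"([^"\\]*)
    // The backslash before the opening quote is optional, and the name
    // ends at the next quote or backslash.
    private static final Pattern HEADER = Pattern.compile(
            "<%=Pages\\.getString\\(\\\\?\"([^\"\\\\]*)");

    // Collects every header name found in one line of JSP source.
    static List<String> headerNames(String line) {
        List<String> names = new ArrayList<>();
        Matcher matcher = HEADER.matcher(line);
        while (matcher.find()) {
            names.add(matcher.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // Line containing literal backslashes, as in the JSP file.
        System.out.println(headerNames("<%=Pages.getString(\\\"MY_HEADER\\\")%>"));
        // Line without backslashes.
        System.out.println(headerNames("<%=Pages.getString(\"MY_HEADER\")%>"));
    }
}
```

Compiling the pattern once as a constant also avoids recompiling it for every line read.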

Simple regex extract folders

What would be the most efficient way to cover all cases when retrieving folder1/folder22
from:
http://localhost:8080/folder1/folder22/file.jpg
or
http://domain.com/folder1/folder22/file.jpg
or
http://127.0.0.0.1:8080/folder1/folder22/file.jpg
There may be one or more folders/sub-folders. Basically, I would like to strip the domain name, the port if present, and the file name at the end.
Thanks for your time.
What about the URL class and getPath()?
Maybe it's not the most efficient way, but one of the simplest I think:
String[] urls = {
"http://localhost:8080/folder1/folder22/file.jpg",
"http://domain.com/folder1/folder22/file.jpg",
"http://127.0.0.0.1:8080/folder1/folder22/file.jpg" };
for (String url : urls)
System.out.println(new File(new URL(url).getPath()).getParent());
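Note that java.io.File uses the platform's separator, so the File-based version returns backslashes on Windows. A sketch that stays with plain string operations on the parsed path; it uses java.net.URI, and the class and method names are assumptions for this example.

```java
import java.net.URI;

final class FolderPath {
    // Returns the folder part of the URL path without leading slash,
    // or an empty string if the file sits directly under the root.
    static String folderOf(String url) {
        String path = URI.create(url).getPath();
        int lastSlash = path.lastIndexOf('/');
        String dir = lastSlash > 0 ? path.substring(0, lastSlash) : "";
        return dir.startsWith("/") ? dir.substring(1) : dir;
    }

    public static void main(String[] args) {
        System.out.println(folderOf("http://localhost:8080/folder1/folder22/file.jpg"));
        System.out.println(folderOf("http://domain.com/folder1/folder22/file.jpg"));
    }
}
```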
You should probably use Java's URL parser for this, but if it has to be a regex:
\b(?=/).*(?=/[^/\r\n]*)
will match /folder1/folder22 in all your examples.
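String resultString = null;
try {
    Pattern regex = Pattern.compile("\\b(?=/).*(?=/[^/\r\n]*)");
    Matcher regexMatcher = regex.matcher(subjectString);
    if (regexMatcher.find()) {
        resultString = regexMatcher.group();
    }
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}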
Explanation:
\b: Assert position at a word boundary (this will work before a single slash, but not between slashes or after a :)
(?=/): Assert that the next character is a slash.
.*: Match anything until...
(?=/[^/\r\n]*): ...exactly one last / (and anything else except slashes or newlines) follows.
^.+/([^/]+/[^/]+)/[^/]+$
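The pattern above captures the last two folders in group 1. A minimal Java sketch of how it might be used; the class and method names are invented here, and returning null on no match is an arbitrary choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LastTwoFolders {
    private static final Pattern LAST_TWO =
            Pattern.compile("^.+/([^/]+/[^/]+)/[^/]+$");

    // Returns the two path segments before the file name, or null if
    // the input has fewer segments than the pattern needs.
    static String lastTwoFolders(String url) {
        Matcher m = LAST_TWO.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(lastTwoFolders("http://localhost:8080/folder1/folder22/file.jpg")); // folder1/folder22
        System.out.println(lastTwoFolders("http://domain.com/a/b/c/file.jpg"));                // b/c
    }
}
```

The pattern always takes exactly two segments: with deeper paths it drops the leading folders, and with only one folder, as in http://domain.com/folder1/file.jpg, it captures the host as well (domain.com/folder1).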
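One way to get the last two directories from a URL in PHP is the following:
preg_match("/\/((?:[^\/]+\/){2})[^\/]+$/", $path, $matches);
If it matches, $matches[1] contains the last two directories with a trailing slash (for example folder1/folder22/), whether $path holds just the path or the full URL. The outer capturing group is needed because a repeated group only keeps its last iteration.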
