Regex to match S3 file and a directory - java

I have the following pattern that is used to match an S3 URL:
Pattern.compile("^s3://([^/]+)/(.*?([^/]+))$");
This matches
s3://bucket/path/key
But does not match a directory
s3://bucket/path/directory/
Is there an easy way to change the pattern to match a directory?

There's only the final slash missing in the regex. You might try this:
^s3://([^/]+)/(.*?([^/]+)/?)$
The only change is the optional /? added just before the closing parenthesis of the second group.

The pattern below works just fine in JS, and I guess it will work in any other language that supports regex too. Somebody might want to use it for a sample path like this:
s3://bucket_name/folder1/folder2/file1.json
^s3:\/\/([^/]+)\/([\w\W]+)\.(.*)
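For reference, here is a minimal Java sketch of the first pattern with the optional trailing slash. The class name S3UrlDemo and the parse helper are made up for illustration; only the pattern itself comes from the answer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class S3UrlDemo {
    // Group 1: bucket, group 2: full key (may end in "/"), group 3: last path segment.
    private static final Pattern S3_URL =
            Pattern.compile("^s3://([^/]+)/(.*?([^/]+)/?)$");

    /** Returns {bucket, key, lastSegment}, or null if the URL does not match. */
    static String[] parse(String url) {
        Matcher m = S3_URL.matcher(url);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] samples = { "s3://bucket/path/key", "s3://bucket/path/directory/" };
        for (String sample : samples) {
            String[] parts = parse(sample);
            System.out.println(sample + " -> "
                    + (parts == null ? "no match" : String.join(" | ", parts)));
        }
    }
}
```

Both the file form and the directory form match; for the directory, group 2 keeps its trailing slash while group 3 holds just the directory name.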

Related

How to replace double slash with single slash for an url

For a given URL like "http://google.com//view/All/builds", I want to replace the double slash with a single slash. For example, the above URL should become "http://google.com/view/All/builds".
I don't know regular expressions. Can anyone help me achieve this using regular expressions?
To avoid replacing the first // in http://, use the following regex:
String to = from.replaceAll("(?<!http:)//", "/");
PS: if you want to handle https use (?<!(http:|https:))// instead.
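A small sketch of the lookbehind variant that covers both http and https. The class and method names are hypothetical; the regex is the one from the PS above.

```java
final class DoubleSlashDemo {
    // Negative lookbehind: leave alone a "//" that directly follows "http:" or "https:".
    static String collapse(String from) {
        return from.replaceAll("(?<!(http:|https:))//", "/");
    }

    public static void main(String[] args) {
        System.out.println(collapse("http://google.com//view/All/builds"));
        System.out.println(collapse("https://google.com//view/All/builds"));
    }
}
```

Note that this pattern replaces exactly two slashes at a time, so a run of three slashes is only reduced to two.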
Is Regex the right approach?
In case you wanted this solution as part of an exercise to improve your regex skills, then fine. But what is it that you're really trying to achieve? You're probably trying to normalize a URL. Replacing // with / is one aspect of normalizing a URL. But what about other aspects, like removing redundant ./ and collapsing ../ with their parent directories? What about different protocols? What about ///? What about the // at the start? What about /// at the start in case of file:///?
If you want to write a generic, reusable piece of code, using a regular expression is probably not the best approach. And it's reinventing the wheel. Instead, consider java.net.URI.normalize().
java.net.URI.normalize()
java.lang.String
String inputUrl = "http://localhost:1234//foo//bar//buzz";
String normalizedUrl = new URI(inputUrl).normalize().toString();
java.net.URL
URL inputUrl = new URL("http://localhost:1234//foo//bar//buzz");
URL normalizedUrl = inputUrl.toURI().normalize().toURL();
java.net.URI
URI inputUri = new URI("http://localhost:1234//foo//bar//buzz");
URI normalizedUri = inputUri.normalize();
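The snippets above leave out exception handling: the URI(String) constructor throws the checked URISyntaxException. Here is a minimal sketch that wraps the call; the class name, the helper name, and the choice to return the input unchanged on failure are assumptions made for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class NormalizeDemo {
    /** Returns the normalized URI string, or the input unchanged if it cannot be parsed. */
    static String normalizeOrSame(String input) {
        try {
            return new URI(input).normalize().toString();
        } catch (URISyntaxException e) {
            // Hypothetical fallback; a real caller might prefer to report the error.
            return input;
        }
    }

    public static void main(String[] args) {
        System.out.println(normalizeOrSame("http://localhost:1234//foo//bar//buzz"));
        System.out.println(normalizeOrSame("http://localhost:1234/foo/./bar/../buzz"));
    }
}
```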
Regex
In case you do want to use a regular expression, think of all possibilities. What if, in future, this should also process other protocols, like https, file, ftp, fish, and so on? So, think again, and probably use URI.normalize(). But if you insist on a regular expression, maybe use this one:
String normalizedUri = uri.replaceAll("(?<!\\w+:/?)//+", "/");
Compared to other solutions, this works with all URLs that look like HTTP URLs but use a different protocol, such as https, file, ftp and so on, and it keeps the triple slash /// in case of file:///. But, unlike java.net.URI.normalize(), this does not remove redundant ./, it does not collapse ../ with their parent directories, it does not handle other aspects of URL normalization that you and I might have forgotten about, and it will not be updated automatically with newer RFCs about URLs, URIs, and such.
String to = from.replaceAll("(?<!(http:|https:))[//]+", "/");
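will match one or more slashes (the character class [//] is the same as [/]), so runs of three or more slashes are collapsed as well.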
Here is the regexp:
/(?<=[^:\s])(\/+\/)/g
It finds runs of multiple slashes in a URL while preserving the ones after the protocol, whatever the protocol is.
It also handles protocol-relative URLs, which start with //.
@Test
public void shouldReplaceMultipleSlashes() {
    assertEquals("http://google.com/?q=hi", replaceMultipleSlashes("http://google.com///?q=hi"));
    assertEquals("https://google.com/?q=hi", replaceMultipleSlashes("https:////google.com//?q=hi"));
    assertEquals("//somecdn.com/foo/", replaceMultipleSlashes("//somecdn.com/foo///"));
}

private static String replaceMultipleSlashes(String url) {
    return url.replaceAll("(?<=[^:\\s])(\\/+\\/)", "/");
}
Literally, this means:
(\/+\/) - a group of one or more slashes (\/+) followed by one more slash (\/)
(?<=[^:\s]) - a positive lookbehind requiring that the group is preceded by a character from the negated set [^:\s], that is, anything except a colon or whitespace
g - global search flag
I suggest you simply use String.replace, which is documented at http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#replace(java.lang.CharSequence, java.lang.CharSequence)
Something like
myString.replace("//", "/");
If you want to keep the first occurrence (the // after the protocol):
String[] parts = str.split("//", 2);
str = parts[0] + "//" + parts[1].replaceAll("//", "/");
This is the simplest way I know. I don't know the corresponding single regular expression; maybe there is an expert looking at the thread... ;)
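One caveat with the split approach: if the input contains no // at all, parts has only one element and parts[1] throws ArrayIndexOutOfBoundsException. A guarded sketch follows; the class and method names are made up, and /{2,} is used instead of // so that longer runs of slashes are collapsed too.

```java
final class KeepFirstDoubleSlash {
    /** Keeps the first "//" and collapses every later run of slashes into one. */
    static String collapseAfterFirst(String str) {
        String[] parts = str.split("//", 2);
        if (parts.length < 2) {
            return str; // no "//" in the input, nothing to do
        }
        return parts[0] + "//" + parts[1].replaceAll("/{2,}", "/");
    }

    public static void main(String[] args) {
        System.out.println(collapseAfterFirst("http://google.com//view/All/builds"));
        System.out.println(collapseAfterFirst("google.com/view"));
    }
}
```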

Regex for match web extensions

I want to check whether a URL points to an image, a JavaScript file, a PDF, etc.
String url = "www.something.com/script/sample.js?xyz=xyz";
The regex below works fine, but only without the ?xyz=xyz part:
".*(mov|jpg|gif|pdf|js)$"
When I remove the $ at the end, so that the regex no longer requires .js to be at the end, it returns false.
.*(mov|jpg|gif|pdf|js).*$ allows you to have any optional text after the file extension. The capturing group captures the file extension.
Use the regex as below:
.*\\.(mov|jpg|gif|pdf|js)\\?
This matches a dot (.) followed by one of your extensions and terminated by ?.
The first dot (.) matches any character, while the second dot, prefixed by \\, matches a literal dot just before your extension list.
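One likely reason for the "gives false" result in the question is that String.matches and Matcher.matches require the whole input to match, whereas Matcher.find looks for a match anywhere in the input. The sketch below uses find; the class name, the helper name, and the trailing (?:\\?|$) alternative (which also accepts URLs without a query string) are my own additions, not part of the answers above.

```java
import java.util.regex.Pattern;

final class ExtensionRegexDemo {
    // A literal dot, one of the extensions, then either "?" or the end of the input.
    private static final Pattern EXTENSION =
            Pattern.compile("\\.(mov|jpg|gif|pdf|js)(?:\\?|$)");

    static boolean hasExtension(String url) {
        // find() searches anywhere in the input; matches() would need the whole string to match.
        return EXTENSION.matcher(url).find();
    }

    public static void main(String[] args) {
        String url = "www.something.com/script/sample.js?xyz=xyz";
        System.out.println(url.matches(".*(mov|jpg|gif|pdf|js)")); // whole-string match
        System.out.println(hasExtension(url));                     // search anywhere
    }
}
```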
Why not use java.net.URL to parse the URL string? It could avoid a lot of mismatching problems:
try {
    URL url = new URL(urlString);
    // getPath() excludes the query string; getFile() would include it
    String filename = url.getPath();
    // now test whether the filename ends with one of your desired extensions
} catch (Exception e) {
    // in this case the URL cannot be parsed
}
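A fuller sketch of the parsing idea. It uses java.net.URI rather than java.net.URL so that the scheme-less sample from the question can still be parsed as a relative reference; that swap, the class name, and the helper name are my own assumptions for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class ExtensionCheck {
    private static final Set<String> EXTENSIONS =
            new HashSet<>(Arrays.asList("mov", "jpg", "gif", "pdf", "js"));

    static boolean hasKnownExtension(String url) {
        String path;
        try {
            path = new URI(url).getPath(); // the path never includes the query string
        } catch (URISyntaxException e) {
            return false;
        }
        if (path == null) {
            return false; // opaque URIs such as mailto: have no path
        }
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        if (dot <= slash) {
            return false; // no dot in the last path segment
        }
        String extension = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        return EXTENSIONS.contains(extension);
    }

    public static void main(String[] args) {
        System.out.println(hasKnownExtension("www.something.com/script/sample.js?xyz=xyz"));
        System.out.println(hasKnownExtension("our.moving.day"));
    }
}
```

Because only the text after the last dot of the last path segment is compared, inputs like "our.moving.day" are rejected.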
I'm not a big fan of this, but try:
.*\\.(mov|jpg|gif|pdf|js).*$
The problem is that it will accept things like "our.moving.day".
Also, post your code. There is always more than one way to skin a cat, and perhaps there is something wrong with your code, not the regex.
Also, try regex testers; there are a ton of them out there. I'm a big fan of:
http://rubular.com/ and http://gskinner.com/RegExr/ (but they are mostly for PHP/Ruby)

Search and replace "/" at end of url's using regular expressions in java

Below is my regular expression:
\\bhttps?://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]\\b
When the request URL is of the form http://www.example.com/, the last character is not replaced in my shortened URL, and the / is appended at the end.
The regex is not able to find the last /.
Please help with this.
I think that / would be a word boundary, so maybe it works better if you add a ? to the end, so it reads:
\\bhttps?://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]\\b?
What about:
if (url.endsWith("/"))
    url = url.substring(0, url.length() - 1);
Or, if you need to use regular expressions, you can do something like this:
url = url.replaceAll("(\\bhttps?://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*)/(\\b?)","$1$2");
If all you want is to remove the trailing / (which is what your question directly asks), and you know the URL ends with one, you can simply do:
url = url.substring(0, url.lastIndexOf('/'));
Remember to KISS often.
You could simply use:
url = url.replaceAll("/+$", "");
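To compare the approaches from the answers above, here is a short sketch; the class and method names are made up for illustration. It shows that the anchored regex is a no-op when there is no trailing slash, while the lastIndexOf variant cuts off the last path segment in that case.

```java
final class TrailingSlashDemo {
    /** Removes one or more slashes at the very end of the input. */
    static String stripTrailingSlashes(String url) {
        return url.replaceAll("/+$", "");
    }

    /** The lastIndexOf variant: only safe when the URL is known to end with "/". */
    static String cutAtLastSlash(String url) {
        return url.substring(0, url.lastIndexOf('/'));
    }

    public static void main(String[] args) {
        System.out.println(stripTrailingSlashes("http://www.example.com/"));
        System.out.println(stripTrailingSlashes("http://www.example.com/a"));
        System.out.println(cutAtLastSlash("http://www.example.com/a"));
    }
}
```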

Simple regex extract folders

What would be the most efficient way, covering all cases, to retrieve folder1/folder22
from:
http://localhost:8080/folder1/folder22/file.jpg
or
http://domain.com/folder1/folder22/file.jpg
or
http://127.0.0.0.1:8080/folder1/folder22/file.jpg
So there may be one or more folders/sub-folders. Basically, I would like to strip the domain name, the port if present, and the file name at the end.
Thanks for your time.
What about the URL class and getPath()?
Maybe it's not the most efficient way, but one of the simplest I think:
String[] urls = {
    "http://localhost:8080/folder1/folder22/file.jpg",
    "http://domain.com/folder1/folder22/file.jpg",
    "http://127.0.0.0.1:8080/folder1/folder22/file.jpg" };
for (String url : urls)
    System.out.println(new File(new URL(url).getPath()).getParent());
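Since java.io.File uses the platform's separator, the snippet above would likely print backslashes on Windows. Here is a sketch that stays with plain string operations on the parsed path; the class and method names are hypothetical, and URI.create throws IllegalArgumentException for input it cannot parse.

```java
import java.net.URI;

final class FolderPathDemo {
    /** Returns the folder part of the URL path without leading or trailing slash. */
    static String folders(String url) {
        String path = URI.create(url).getPath(); // e.g. "/folder1/folder22/file.jpg"
        if (path == null) {
            return "";
        }
        int lastSlash = path.lastIndexOf('/');
        if (lastSlash <= 0) {
            return ""; // the file sits directly under the root, or there is no path
        }
        return path.substring(1, lastSlash);
    }

    public static void main(String[] args) {
        System.out.println(folders("http://localhost:8080/folder1/folder22/file.jpg"));
        System.out.println(folders("http://domain.com/folder1/folder22/file.jpg"));
    }
}
```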
You should probably use Java's URL parser for this, but if it has to be a regex:
\b(?=/).*(?=/[^/\r\n]*)
will match /folder1/folder22 in all your examples.
// requires java.util.regex.Pattern and java.util.regex.Matcher
Pattern regex = Pattern.compile("\\b(?=/).*(?=/[^/\r\n]*)");
Matcher regexMatcher = regex.matcher(subjectString);
if (regexMatcher.find()) {
    resultString = regexMatcher.group();
}
Explanation:
\b: Assert position at a word boundary (this will work before a single slash, but not between slashes or after a :)
(?=/): Assert that the next character is a slash.
.*: Match anything until...
(?=/[^/\r\n]*): ...exactly one last / (and anything else except slashes or newlines) follows.
^.+/([^/]+/[^/]+)/[^/]+$
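A minimal Java sketch of how the pattern above could be used; the class and method names are made up. Note that it always captures exactly the last two folders, so with deeper paths the earlier folders are dropped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LastTwoFoldersDemo {
    // Group 1: the two path segments directly before the file name.
    private static final Pattern LAST_TWO = Pattern.compile("^.+/([^/]+/[^/]+)/[^/]+$");

    /** Returns the last two folders, or null if the input does not match. */
    static String lastTwoFolders(String url) {
        Matcher m = LAST_TWO.matcher(url);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(lastTwoFolders("http://localhost:8080/folder1/folder22/file.jpg"));
        System.out.println(lastTwoFolders("http://domain.com/a/b/c/file.jpg"));
    }
}
```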
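The best way to get the last two directories from a URL is the following:
preg_match("/\/((?:[^\/]+\/){2})[^\/]+$/", $path, $matches);
If it matches, $matches[1] will contain what you want (with a trailing slash), whether $path is just a path or a full URL. The repetition sits inside the capturing group because a repeated capturing group only keeps its last iteration.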

How to select a file path using regex

I would like to create a Java regular expression that selects everything from file: to the last forward slash (/) in the file path. This is so I can replace it with a different path.
<!DOCTYPE "file:C:/Documentum/XML%20Applications/joesdev/goodnews/book.dtd"/>
<myBook>cool book</myBook>
Does anyone have any ideas? Thanks!!
You just want to go to the last slash before the end-quote, right? If so:
file:[^"]+/
(the string "file:", then anything but ", ending with a /)
Properly escaped:
String regex = "file:[^\"]+/";
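A sketch of the replacement step using this regex. The class and method names are hypothetical, and keeping the file: prefix in the replacement is an assumption about what the question wants.

```java
import java.util.regex.Matcher;

final class DoctypePathDemo {
    /** Replaces everything from "file:" up to the last "/" before the closing quote. */
    static String replaceDirectory(String text, String newDir) {
        // quoteReplacement guards against "$" or "\" in the new path being read as group references.
        String replacement = Matcher.quoteReplacement("file:" + newDir + "/");
        return text.replaceFirst("file:[^\"]+/", replacement);
    }

    public static void main(String[] args) {
        String original =
                "<!DOCTYPE \"file:C:/Documentum/XML%20Applications/joesdev/goodnews/book.dtd\"/>";
        System.out.println(replaceDirectory(original, "C:/Documentum/badnews"));
    }
}
```

The greedy [^"]+ runs up to the closing quote and then backtracks to the last slash, so only the directory part is replaced and the file name is kept.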
You could try to process this yourself, but a better scheme would be to just pick out the parts between the quotes and use java.io.File to separate the directory name from the filename. That way you don't have to worry about / vs \ or various escape characters.
String newPath = "C:/Documentum/badnews";
String originalPath = "<!DOCTYPE \"file:C:/Documentum/XML%20Applications/joesdev/goodnews/book.dtd\"/>";
System.out.println(originalPath.replaceFirst("file:C:((/[/\\w%]+))", newPath));
Try this:
"file:.*/[^/]*"/>
