Preserving escaped characters when constructing URIs in Java

Preserving escaped characters when constructing URIs in Java - java

The documentation for java.net.URI specifies that
For any URI u that ... and that does not encode characters except those that must be quoted, the following identities also hold...
But what about URIs that do encode characters that don't need to be quoted?
URI test1 = new URI("http://foo.bar.baz/%E2%82%AC123");
URI test2 = new URI(test1.getScheme(), test1.getUserInfo(), test1.getHost(), test1.getPort(), test1.getPath(), test1.getQuery(), test1.getFragment());
assert test1.equals(test2); // blows up
This fails, because what test2 comes out as, is http://foo.bar.baz/€123 -- with the escaped characters un-escaped.
My question, then, is: how can I construct a URI equal to test1 -- preserving the escaped characters -- out of its components? It's no good using getRawPath() instead of getPath(), because then the escaping characters themselves get escaped, and you end up with http://foo.bar.baz/%25E2%2582%25AC123.
Additional notes:
Don't ask why I need to preserve escaped characters that in theory don't need to be escaped -- trust me, you don't want to know.
In reality I don't want to preserve all of the original URL, just most of it -- possibly replacing the host, port, protocol, even parts of the path, so new URI(test1.toString()) is not the answer. Maybe the answer is to do everything with strings and replicate the URI class's ability to parse and construct URIs in my own code, but that seems daft.
Updated to add:
Note that the same issue exists with query parameters etc. -- it's not just the path.

I think this hack will work for you:
URI test1 = new URI("http://foo.bar.baz/example%E2%82%AC123");
URI test2 = new URI(test1.getScheme(),
test1.getUserInfo(),
test1.getHost(),
test1.getPort(),
test1.getPath(),
test1.getQuery(),
test1.getFragment());
test2 = new URI(test2.toASCIIString());
assert test1.equals(test2);
System.out.println(test1);
System.out.println(test2);
}
I use an additional step using toASCIIString()

Related

Getting file extension from http url using Java

Now I know about FilenameUtils.getExtension() from apache.
But in my case I'm processing extensions from http(s) urls, so in case I have something like
https://your_url/logo.svg?position=5
this method is gonna return svg?position=5
Is there the best way to handle this situation? I mean without writing this logic by myself.

You can use the URL library from JAVA. It has a lot of utility in this cases. You should do something like this:
String url = "https://your_url/logo.svg?position=5";
URL fileIneed = new URL(url);
Then, you have a lot of getter methods for the "fileIneed" variable. In your case the "getPath()" will retrieve this:
fileIneed.getPath() ---> "/logo.svg"
And then use the Apache library that you are using, and you will have the "svg" String.
FilenameUtils.getExtension(fileIneed.getPath()) ---> "svg"
JAVA URL library docs >>>
https://docs.oracle.com/javase/7/docs/api/java/net/URL.html

If you want a brandname® solution, then consider using the Apache method after stripping off the query string, if it exists:
String url = "https://your_url/logo.svg?position=5";
url = url.replaceAll("\\?.*$", "");
String ext = FilenameUtils.getExtension(url);
System.out.println(ext);
If you want a one-liner which does not even require an external library, then consider this option using String#replaceAll:
String url = "https://your_url/logo.svg?position=5";
String ext = url.replaceAll(".*/[^.]+\\.([^?]+)\\??.*", "$1");
System.out.println(ext);
svg
Here is an explanation of the regex pattern used above:
.*/ match everything up to, and including, the LAST path separator
[^.]+ then match any number of non dots, i.e. match the filename
\. match a dot
([^?]+) match AND capture any non ? character, which is the extension
\??.* match an optional ? followed by the rest of the query string, if present

Spring RestTemplate getForObject URL not working for Apple iTunes

I created the following simple test to query iTunes:
#Test
fun loadArtist()
{
val restTemplate = RestTemplate()
val builder = UriComponentsBuilder.fromHttpUrl("https://itunes.apple.com/search")
builder.queryParam("term", "howling wolf")
builder.queryParam("entity", "allArtist")
builder.queryParam("limit", 1)
println("\n\nURL ${builder.toUriString()}")
val result = restTemplate.getForObject(builder.toUriString(), String::class.java);
println("Got artist: $result")
}
And the output was unexpected:
URL https://itunes.apple.com/search?term=howling%20wolf&entity=allArtist&limit=1
Got artist:
{
"resultCount":0,
"results": []
}
Pasting the generated URL into a browser does give expected results - artist returned.
https://itunes.apple.com/search?term=howling%20wolf&entity=allArtist&limit=1
Also, hard-coding the query works:
val result = restTemplate.getForObject("https://itunes.apple.com/search?term=howling%20wolf&entity=allArtist&limit=1", String::class.java);
. . the problem only seems to occur for term queries that include spaces.
What went wrong? Other than assemble the URL by hand, how to fix?

Seems like a case of double encoding the whitespace. From the RestTemplate Javadoc:
For each HTTP method there are three variants: two accept a URI
template string and URI variables (array or map) while a third accepts
a URI. Note that for URI templates it is assumed encoding is
necessary, e.g. restTemplate.getForObject("http://example.com/hotel
list") becomes "http://example.com/hotel%20list". This also means if
the URI template or URI variables are already encoded, double encoding
will occur, e.g. http://example.com/hotel%20list becomes
http://example.com/hotel%2520list). To avoid that use a URI method
variant to provide (or re-use) a previously encoded URI. To prepare
such an URI with full control over encoding, consider using
UriComponentsBuilder.
So it looks like getForObject will actually query for https://itunes.apple.com/search?term=howling%2520wolf&entity=allArtist&limit=1 and thus result in an empty result. You can always just replace whitespaces with a "+" in your term or try to make one of those classes skip the encoding process.

How to replace double slash with single slash for an url

For the given url like "http://google.com//view/All/builds", i want to replace the double slash with single slash. For example the above url should display as "http://google.com/view/All/builds"
I dint know regular expressions. Can any one help me, how can i achieve this using regular expressions.

To avoid replacing the first // in http:// use the following regex :
String to = from.replaceAll("(?<!http:)//", "/");
PS: if you want to handle https use (?<!(http:|https:))// instead.

Is Regex the right approach?
In case you wanted this solution as part of an exercise to improve your regex skills, then fine. But what is it that you're really trying to achieve? You're probably trying to normalize a URL. Replacing // with / is one aspect of normalizing a URL. But what about other aspects, like removing redundant ./ and collapsing ../ with their parent directories? What about different protocols? What about ///? What about the // at the start? What about /// at the start in case of file:///?
If you want to write a generic, reusable piece of code, using a regular expression is probably not the best appraoch. And it's reinventing the wheel. Instead, consider java.net.URI.normalize().
java.net.URI.normalize()
java.lang.String
String inputUrl = "http://localhost:1234//foo//bar//buzz";
String normalizedUrl = new URI(inputUrl).normalize().toString();
java.net.URL
URL inputUrl = new URL("http://localhost:1234//foo//bar//buzz");
URL normalizedUrl = inputUrl.toURI().normalize().toURL();
java.net.URI
URI inputUri = new URI("http://localhost:1234//foo//bar//buzz");
URI normalizedUri = inputUri.normalize();
Regex
In case you do want to use a regular expression, think of all possibilities. What if, in future, this should also process other protocols, like https, file, ftp, fish, and so on? So, think again, and probably use URI.normalize(). But if you insist on a regular expression, maybe use this one:
String noramlizedUri = uri.replaceAll("(?<!\\w+:/?)//+", "/");
Compared to other solutions, this works with all URLs that look similar to HTTP URLs just with different protocols instead of http, like https, file, ftp and so on, and it will keep the triple-slash /// in case of file:///. But, unlike java.net.URI.normalize(), this does not remove redundant ./, it does not collapse ../ with their parent directories, it does not other aspects of URL normalization that you and I might have forgotten about, and it will not be updated automatically with newer RFCs about URLs, URIs, and such.

String to = from.replaceAll("(?<!(http:|https:))[//]+", "/");
will match two or more slashes.

Here is the regexp:
/(?<=[^:\s])(\/+\/)/g
It finds multiple slashes in url preserving ones after protocol regardless of it.
Handles also protocol relative urls which start from //.
#Test
public void shouldReplaceMultipleSlashes() {
assertEquals("http://google.com/?q=hi", replaceMultipleSlashes("http://google.com///?q=hi"));
assertEquals("https://google.com/?q=hi", replaceMultipleSlashes("https:////google.com//?q=hi"));
assertEquals("//somecdn.com/foo/", replaceMultipleSlashes("//somecdn.com/foo///"));
}
private static String replaceMultipleSlashes(String url) {
return url.replaceAll("(?<=[^:\\s])(\\/+\\/)", "/");
}
Literally means:
(\/+\/) - find group: /+ one or more slashes followed by / slash
(?<=[^:\s]) - which follows the group (*posiive lookbehind) of this (*negated set) [^:\s] that excludes : colon and \s whitespace
g - global search flag

I suggest you simply use String.replace which documentation is http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#replace(java.lang.CharSequence, java.lang.CharSequence)
Something like
`myString.replace("//", "/");
If you want to remove the first occurence:
String[] parts = str.split("//", 2);
str = parts[0] + "//" + parts[1].replaceAll("//", "/");
Which is the simplest way (without regular expression). I don't know the regular expression corresponding, if there is an expert looking at the thread.... ;)

Encode URL query parameters

How can I encode URL query parameter values? I need to replace spaces with %20, accents, non-ASCII characters etc.
I tried to use URLEncoder but it also encodes / character and if I give a string encoded with URLEncoder to the URL constructor I get a MalformedURLException (no protocol).

URLEncoder has a very misleading name. It is according to the Javadocs used encode form parameters using MIME type application/x-www-form-urlencoded.
With this said it can be used to encode e.g., query parameters. For instance if a parameter looks like &/?# its encoded equivalent can be used as:
String url = "http://host.com/?key=" + URLEncoder.encode("&/?#");
Unless you have those special needs the URL javadocs suggests using new URI(..).toURL which performs URI encoding according to RFC2396.
The recommended way to manage the encoding and decoding of URLs is to use URI
The following sample
new URI("http", "host.com", "/path/", "key=| ?/#ä", "fragment").toURL();
produces the result http://host.com/path/?key=%7C%20?/%23ä#fragment. Note how characters such as ?&/ are not encoded.
For further information, see the posts HTTP URL Address Encoding in Java or how to encode URL to avoid special characters in java.
EDIT
Since your input is a string URL, using one of the parameterized constructor of URI will not help you. Neither can you use new URI(strUrl) directly since it doesn't quote URL parameters.
So at this stage we must use a trick to get what you want:
public URL parseUrl(String s) throws Exception {
URL u = new URL(s);
return new URI(
u.getProtocol(),
u.getAuthority(),
u.getPath(),
u.getQuery(),
u.getRef()).
toURL();
}
Before you can use this routine you have to sanitize your string to ensure it represents an absolute URL. I see two approaches to this:
Guessing. Prepend http:// to the string unless it's already present.
Construct the URI from a context using new URL(URL context, String spec)

So what you're saying is that you want to encode part of your URL but not the whole thing. Sounds to me like you'll have to break it up into parts, pass the ones that you want encoded through the encoder, and re-assemble it to get your whole URL.

How to find out if string has already been URL encoded?

How could I check if string has already been encoded?
For example, if I encode TEST==, I get TEST%3D%3D. If I again encode last string, I get TEST%253D%253D, I would have to know before doing that if it is already encoded...
I have encoded parameters saved, and I need to search for them. I don't know for input parameters, what will they be - encoded or not, so I have to know if I have to encode or decode them before search.

Decode, compare to original. If it does differ, original is encoded. If it doesn't differ, original isn't encoded. But still it says nothing about whether the newly decoded version isn't still encoded. A good task for recursion.
I hope one can't write a quine in urlencode, or this algorithm would get stuck.
Exception: When a string contains "+" character url decoder replaces it with a space even though the string is not url encoded

Use regexp to check if your string contains illegal characters (i.e. characters which cannot be found in URL-encoded string, like whitespace).

Try decoding the url. If the resulting string is shorter than the original then the original URL was already encoded, else you can safely encode it (either it is not encoded, or even post encoding the url stays as is, so encoding again will not result in a wrong url). Below is sample pseudo (inspired by ruby) code:
# Returns encoded URL for any given URL after determining whether it is already encoded or not
def escape(url)
unescaped_url = URI.unescape(url)
if (unescaped_url.length < url.length)
return url
else
return URI.escape(url)
end
end

You can't know for sure, unless your strings conform to a certain pattern, or you keep track of your strings. As you noted by yourself, a String that is encoded can also be encoded, so you can't be 100% sure by looking at the string itself.

Check your URL for suspicious characters[1].
List of candidates:
WHITE_SPACE ,", < , > , { , } , | , \ , ^ , ~ , [ , ] , . and `
I use:
private static boolean isAlreadyEncoded(String passedUrl) {
boolean isEncoded = true;
if (passedUrl.matches(".*[\\ \"\\<\\>\\{\\}|\\\\^~\\[\\]].*")) {
isEncoded = false;
}
return isEncoded;
}
For the actual encoding I proceed with:
https://stackoverflow.com/a/49796882/1485527
Note: Even if your URL doesn't contain unsafe characters you might want to apply, e.g. Punnycode encoding to the host name. So there is still much space for additional checks.
[1] A list of candidates can be found in the section "unsafe" of the URL spec at Page 2.
In my understanding '%' or '#' should be left out in the encoding check, since these characters can occur in encoded URLs as well.

Using Spring UriComponentsBuilder:
import java.net.URI;
import org.springframework.web.util.UriComponentsBuilder;
private URI getProperlyEncodedUri(String uriString) {
try {
return URI.create(uriString);
} catch (IllegalArgumentException e) {
return UriComponentsBuilder.fromUriString(uriString).build().toUri();
}
}

If you want to be sure that string is encoded correctly (if it needs to be encoded) - just decode and encode it once again.
metacode:
100%_correctly_encoded_string = encode(decode(input_string))
already encoded string will remain untouched. Unencoded string will be encoded. String with only url-allowed characters will remain untouched too.

According to the spec (https://www.rfc-editor.org/rfc/rfc3986) all URLs MUST start with a scheme followed by a :
Since colons are required as the delimiter between a scheme and the rest of the URI, any string that contains a colon is not encoded.
(This assumes you will not be given an incomplete URI with no scheme.)
So you can test if the string contains a colon, if not, urldecode it, and if that string contains a colon, the original string was url encoded, if not, check if the strings are different and if so, urldecode again and if not, it is not a valid URI.
You can make this loop simpler if you know what schemes you can expect.

Thanks to this answer I coded a function (JS Language) that encodes the URL just once with encodeURI so you can call it to make sure is encoded just once and you don't need to know if the URL is already encoded.
ES6:
var getUrlEncoded = sURL => {
if (decodeURI(sURL) === sURL) return encodeURI(sURL)
return getUrlEncoded(decodeURI(sURL))
}
Pre ES6:
var getUrlEncoded = function(sURL) {
if (decodeURI(sURL) === sURL) return encodeURI(sURL)
return getUrlEncoded(decodeURI(sURL))
}
Here are some tests so you can see the URL is only encoded once:
getUrlEncoded("https://example.com/media/Screenshot27 UI Home.jpg")
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"
getUrlEncoded(encodeURI("https://example.com/media/Screenshot27 UI Home.jpg"))
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"
getUrlEncoded(encodeURI(encodeURI("https://example.com/media/Screenshot27 UI Home.jpg")))
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"
getUrlEncoded(decodeURI("https://example.com/media/Screenshot27 UI Home.jpg"))
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"
getUrlEncoded(decodeURI(decodeURI("https://example.com/media/Screenshot27 UI Home.jpg")))
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Preserving escaped characters when constructing URIs in Java - java

Related

Getting file extension from http url using Java

Spring RestTemplate getForObject URL not working for Apple iTunes

How to replace double slash with single slash for an url

Encode URL query parameters

How to find out if string has already been URL encoded?

Categories

Resources