Encoding a URL in Java while leaving special characters untouched - java

I am trying to construct a URL in Java, with the query parameters encoded. The encoder should escape the character '+'. All the methods that I've come across encode it. Is there any way I can accomplish this, or do I need to write a custom encoder?
Thanks!

Try "UriComponentsBuilder" if you're using Spring. This package also contains other options that may fulfill your requirement.
UriComponentsBuilder.fromPath("URL String");
If you're not using Spring, you can use "UrlValidator" from "apache.commons":
String[] schemes = {"http","https"}; // DEFAULT schemes = "http", "https", "ftp"
UrlValidator urlValidator = new UrlValidator(schemes);
if (urlValidator.isValid("urlString")) {
System.out.println("URL is valid");
} else {
System.out.println("URL is invalid");
}

The URI class should be able to do this for you. Specifically, use
URI(String scheme, String userInfo, String host,
    int port, String path, String query, String fragment)
or
URI(String scheme, String authority, String path, String query,
    String fragment)
to create the URI object, where the components that you provided are not encoded. Then use URI.toString() or URI.toASCIIString() to produce the properly encoded URL.
The encoder should escape the character '+'. All the methods that I've come across encode it.
Encoding the '+' as %2B is the correct behavior. The URL / URI specifications do not support any other way of escaping it. An unencoded '+' in a query string is decoded as a space character by form-style (application/x-www-form-urlencoded) decoders, no matter how you try to escape it. That's what the form-encoding rules say.
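To make the difference concrete, here is a minimal sketch comparing the multi-argument URI constructor with URLEncoder. The class name PlusInQueryDemo, the host example.com and the sample query are assumptions made up for illustration; only the java.net calls come from the JDK.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

final class PlusInQueryDemo {
    // Builds a URI from unencoded components (hypothetical host and path).
    // Illegal characters such as spaces are quoted; '+' is legal in a query and is left alone.
    static String viaUriConstructor(String query) {
        try {
            return new URI("http", "example.com", "/search", query, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Form-style encoding of a single value: '+' becomes %2B and a space becomes '+'.
    static String viaUrlEncoder(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(viaUriConstructor("q=a+b c")); // http://example.com/search?q=a+b%20c
        System.out.println(viaUrlEncoder("a+b c"));       // a%2Bb+c
    }
}
```

So the URI constructor does leave '+' untouched, but a server that decodes the query as form data will most likely still read that '+' as a space.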

Related

Getting file extension from http url using Java

Now I know about FilenameUtils.getExtension() from apache.
But in my case I'm processing extensions from http(s) urls, so in case I have something like
https://your_url/logo.svg?position=5
this method is gonna return svg?position=5
Is there a better way to handle this situation? I mean without writing this logic myself.
You can use the URL class from Java. It is very useful in cases like this. You should do something like this:
String url = "https://your_url/logo.svg?position=5";
URL fileIneed = new URL(url);
Then, you have a lot of getter methods for the "fileIneed" variable. In your case the "getPath()" will retrieve this:
fileIneed.getPath() ---> "/logo.svg"
And then use the Apache library that you are using, and you will have the "svg" String.
FilenameUtils.getExtension(fileIneed.getPath()) ---> "svg"
JAVA URL library docs >>>
https://docs.oracle.com/javase/7/docs/api/java/net/URL.html
If you want a brandname® solution, then consider using the Apache method after stripping off the query string, if it exists:
String url = "https://your_url/logo.svg?position=5";
url = url.replaceAll("\\?.*$", "");
String ext = FilenameUtils.getExtension(url);
System.out.println(ext);
If you want a one-liner which does not even require an external library, then consider this option using String#replaceAll:
String url = "https://your_url/logo.svg?position=5";
String ext = url.replaceAll(".*/[^.]+\\.([^?]+)\\??.*", "$1");
System.out.println(ext);
svg
Here is an explanation of the regex pattern used above:
.*/ match everything up to, and including, the LAST path separator
[^.]+ then match any number of non dots, i.e. match the filename
\. match a dot
([^?]+) match AND capture any non ? character, which is the extension
\??.* match an optional ? followed by the rest of the query string, if present
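If neither the regex nor the Apache dependency appeals, the same idea (take the path, then look for the last dot in the file name) can be sketched with the JDK alone. The class and method names below are hypothetical, and example.com stands in for the real host.

```java
import java.net.URI;

final class UrlExtension {
    // Returns the extension of the last path segment, or "" if there is none.
    static String extensionOf(String url) {
        String path = URI.create(url).getPath(); // query string and fragment are not part of the path
        if (path == null) {
            return ""; // opaque URIs such as mailto: have no path
        }
        String fileName = path.substring(path.lastIndexOf('/') + 1);
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1);
    }

    public static void main(String[] args) {
        System.out.println(extensionOf("https://example.com/logo.svg?position=5")); // svg
    }
}
```

Looking for the dot only in the last segment avoids picking up dots in directory names.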

Extract parameters from URL

I have a problem with the + character (and maybe others) in URIBuilder. It is supposed to take a decoded URL, but when I extract the query parameters the + is replaced:
String decodedUrl = "www.foo.com?sign=AZrhQaTRSiys5GZtlwZ+H3qUyIY=&more=boo";
URIBuilder builder = new URIBuilder(decodedUrl);
List<NameValuePair> params = builder.getQueryParams();
String sign = params.get(0).getValue();
The value of sign is AZrhQaTRSiys5GZtlwZ H3qUyIY=, with a space instead of the +. How can I extract the correct value?
other way is:
URI uri = new URI(decodedUrl);
String query = uri.getQuery();
the value of query is sign=AZrhQaTRSiys5GZtlwZ+H3qUyIY=&more=boo in this case is correct, but I have to strip it. Is there another way to do that?
Use it differently:
String decodedUrl = "www.foo.com";
URIBuilder builder = new URIBuilder(decodedUrl);
builder.addParameter("sign", "AZrhQaTRSiys5GZtlwZ+H3qUyIY=");
builder.addParameter("more", "boo");
List<NameValuePair> params = builder.getQueryParams();
String sign = params.get(0).getValue();
The addParameter method is responsible for adding parameters to the builder. The constructor of the builder should include the base URL only.
If this URL is given to you as is, then the + is already decoded and stands for the space character. If you are the one who generates this URL, you probably skipped the URL encoding step (which can be done using the code snippet above).
Read a bit about URL encoding: http://en.wikipedia.org/wiki/Query_string#URL_encoding
That is because if you send a space as a parameter in a URL it is encoded as +. This happens because there are rules about which characters are valid in a URL. See the URL RFC.
It is necessary to encode any characters disallowed in a URL, including spaces and other binary data not in the allowed character set, using the standard convention of the "%" character followed by two hexadecimal digits.
If you want to have + as a symbol in a URL you need to encode it as %2B. For example, 2+2 is encoded as 2%2B2, and i am as i+am. So in your case I believe you have the correct result, as AZrhQaTRSiys5GZtlwZ+H3qUyIY decodes into AZrhQaTRSiys5GZtlwZ H3qUyIY.
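If the URL really does arrive with a literal + that must be kept, one workaround is to parse the raw query yourself and protect the + before percent-decoding. This is only a sketch under that assumption: the class RawQueryParams is hypothetical, and a scheme and "/" were added to the sample URL so that java.net.URI can parse it.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class RawQueryParams {
    // Splits the raw query into name/value pairs, treating '+' as a literal plus sign.
    static Map<String, String> parse(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        String rawQuery = URI.create(url).getRawQuery();
        if (rawQuery == null || rawQuery.isEmpty()) {
            return params;
        }
        for (String pair : rawQuery.split("&")) {
            String[] kv = pair.split("=", 2); // limit 2 keeps '=' characters inside the value
            String value = kv.length > 1 ? kv[1] : "";
            params.put(decodeKeepingPlus(kv[0]), decodeKeepingPlus(value));
        }
        return params;
    }

    // URLDecoder turns '+' into a space, so the '+' is rewritten to %2B first.
    private static String decodeKeepingPlus(String s) {
        try {
            return URLDecoder.decode(s.replace("+", "%2B"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        String url = "http://www.foo.com/?sign=AZrhQaTRSiys5GZtlwZ+H3qUyIY=&more=boo";
        System.out.println(parse(url).get("sign")); // AZrhQaTRSiys5GZtlwZ+H3qUyIY=
    }
}
```

This deliberately departs from form decoding, so it is only appropriate when you know the sender did not encode spaces as +.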

URL encode/decode on file name replace spaces with +, need alternative.

My product is a web application.
I have files that I upload and download later on, to/from my server.
I am using java.net.URLDecoder.decode() when uploading files with Unicode characters and java.net.URLEncoder.encode() when downloading files, in order to save the file name and finally return it to the client as expected, with no question marks and stuff (?????).
The problem is that if the file name contains spaces, encode/decode replaces them with the + character, which is perfectly normal because that's how they are implemented, but as you can understand it does not fit my purpose.
The question is what alternative do I have to overcome this situation?
Is there a built-in method for that, or a 3rd party package?
You could also convert a space to %20.
See: URL encoding the space character: + or %20?
There are also various other Java libraries that do URL encoding with %20. Here are two examples:
Guava:
UrlEscapers.urlPathSegmentEscaper().escape(urlToEscape);
Spring Framework:
UriUtils.encodePath(urlToEscape, Charsets.UTF_8.toString());
You don't say where this filename is used. The characters to encode will be different depending on whether, for instance, it is in a URI query string or fragment part.
You probably want to have a look at Guava's (15.0+) Escapers; and, in particular here, UnicodeEscaper implementations and its derived class PercentEscaper. Guava already provides a few of them usable in various parts of URLs.
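EDIT: here is how to do it with Guava:
import com.google.common.escape.Escaper;
import com.google.common.net.PercentEscaper;

public final class FilenameEscaper
{
    // PercentEscaper is a final class, so wrap an instance instead of extending it.
    // No extra safe characters; false means a space becomes %20 rather than '+'.
    private static final Escaper ESCAPER = new PercentEscaper("", false);

    public static String escape(String filename)
    {
        return ESCAPER.escape(filename);
    }
}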
Done! See here. Of course, you may want to declare that some more characters than the default ones are safe.
Also have a look at RFC 5987 to make a better encoder.
This worked for me:
URLEncoder.encode(someString, "UTF-8").replace("+", "%20");
I found the cure!
I just needed to use java.net.URI for that:
public static String encode(String urlString) throws URISyntaxException
{
    // The constructor throws URISyntaxException if urlString is not a valid URI.
    URI uri = new URI(urlString);
    return uri.toASCIIString();
}
The toASCIIString() method escapes the non-ASCII characters, so when the string arrives at the browser it is shown correctly.
Had the same problem with spaces. Combination of URL and URI solved it:
URL url = new URL("file:/E:/Program Files/IBM/SDP/runtimes/base");
URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(), url.getPath(), url.getQuery(), url.getRef());
* Please note that URLEncoder is used for web forms application/x-www-form-urlencoded mime-type - not http network addresses.
* Source: https://stackoverflow.com/a/749829/435605
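To compare two of the JDK-only approaches from the answers above side by side, here is a small sketch. The class FileNameEncoding and the sample file names are made up for illustration; which variant fits depends on where the name ends up (a path, a query value, a header).

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

final class FileNameEncoding {
    // Form encoding, then '+' swapped for %20. A literal '+' in the name was already turned into %2B.
    static String encodeForm(String name) {
        try {
            return URLEncoder.encode(name, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Lets java.net.URI quote a path; '/' is kept and a space becomes %20.
    static String encodePath(String path) {
        try {
            return new URI(null, null, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeForm("my file.txt"));        // my%20file.txt
        System.out.println(encodeForm("a+b c"));              // a%2Bb%20c
        System.out.println(encodePath("/files/my file.txt")); // /files/my%20file.txt
    }
}
```

Note that encodeForm also escapes '/', so it suits a single file name, while encodePath is meant for a whole path.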

Encode URL query parameters

How can I encode URL query parameter values? I need to replace spaces with %20, accents, non-ASCII characters etc.
I tried to use URLEncoder but it also encodes / character and if I give a string encoded with URLEncoder to the URL constructor I get a MalformedURLException (no protocol).
URLEncoder has a very misleading name. According to the Javadocs, it is used to encode form parameters using the MIME type application/x-www-form-urlencoded.
With this said it can be used to encode e.g., query parameters. For instance if a parameter looks like &/?# its encoded equivalent can be used as:
String url = "http://host.com/?key=" + URLEncoder.encode("&/?#");
Unless you have those special needs, the URL javadocs suggest using new URI(..).toURL, which performs URI encoding according to RFC2396.
The recommended way to manage the encoding and decoding of URLs is to use URI
The following sample
new URI("http", "host.com", "/path/", "key=| ?/#ä", "fragment").toURL();
produces the result http://host.com/path/?key=%7C%20?/%23ä#fragment. Note how characters such as ?&/ are not encoded.
For further information, see the posts HTTP URL Address Encoding in Java or how to encode URL to avoid special characters in java.
EDIT
Since your input is a string URL, using one of the parameterized constructors of URI will not help you. Neither can you use new URI(strUrl) directly, since it doesn't quote URL parameters.
So at this stage we must use a trick to get what you want:
public URL parseUrl(String s) throws Exception {
URL u = new URL(s);
return new URI(
u.getProtocol(),
u.getAuthority(),
u.getPath(),
u.getQuery(),
u.getRef()).
toURL();
}
Before you can use this routine you have to sanitize your string to ensure it represents an absolute URL. I see two approaches to this:
Guessing. Prepend http:// to the string unless it's already present.
Construct the URI from a context using new URL(URL context, String spec)
So what you're saying is that you want to encode part of your URL but not the whole thing. Sounds to me like you'll have to break it up into parts, pass the ones that you want encoded through the encoder, and re-assemble it to get your whole URL.
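The "break it up and re-assemble" approach might look like the sketch below when the base URL and the parameters are available separately. QueryBuilder and its methods are hypothetical names; the %20 replacement is there only because the question asks for %20 instead of +.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;

final class QueryBuilder {
    // Encodes a single name or value; never pass a whole URL through this.
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Appends the encoded parameters to a base URL that has no query string yet.
    static String withParams(String baseUrl, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(baseUrl);
        char separator = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(separator).append(enc(e.getKey())).append('=').append(enc(e.getValue()));
            separator = '&';
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = java.util.Collections.singletonMap("key", "&/?# x");
        System.out.println(withParams("http://host.com/path/", params));
        // http://host.com/path/?key=%26%2F%3F%23%20x
    }
}
```

Because only the names and values go through the encoder, the scheme, host and path separators stay intact and the result can be handed to the URL constructor.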

How to find out if string has already been URL encoded?

How could I check if string has already been encoded?
For example, if I encode TEST==, I get TEST%3D%3D. If I again encode last string, I get TEST%253D%253D, I would have to know before doing that if it is already encoded...
I have encoded parameters saved, and I need to search for them. I don't know for input parameters, what will they be - encoded or not, so I have to know if I have to encode or decode them before search.
Decode, compare to original. If it does differ, original is encoded. If it doesn't differ, original isn't encoded. But still it says nothing about whether the newly decoded version isn't still encoded. A good task for recursion.
I hope one can't write a quine in urlencode, or this algorithm would get stuck.
Exception: When a string contains "+" character url decoder replaces it with a space even though the string is not url encoded
Use regexp to check if your string contains illegal characters (i.e. characters which cannot be found in URL-encoded string, like whitespace).
Try decoding the url. If the resulting string is shorter than the original then the original URL was already encoded, else you can safely encode it (either it is not encoded, or even post encoding the url stays as is, so encoding again will not result in a wrong url). Below is sample pseudo (inspired by ruby) code:
# Returns encoded URL for any given URL after determining whether it is already encoded or not
def escape(url)
unescaped_url = URI.unescape(url)
if (unescaped_url.length < url.length)
return url
else
return URI.escape(url)
end
end
You can't know for sure, unless your strings conform to a certain pattern, or you keep track of your strings. As you noted yourself, a String that is already encoded can be encoded again, so you can't be 100% sure by looking at the string itself.
Check your URL for suspicious characters[1].
List of candidates:
WHITE_SPACE ,", < , > , { , } , | , \ , ^ , ~ , [ , ] , . and `
I use:
private static boolean isAlreadyEncoded(String passedUrl) {
boolean isEncoded = true;
if (passedUrl.matches(".*[\\ \"\\<\\>\\{\\}|\\\\^~\\[\\]].*")) {
isEncoded = false;
}
return isEncoded;
}
For the actual encoding I proceed with:
https://stackoverflow.com/a/49796882/1485527
Note: Even if your URL doesn't contain unsafe characters, you might want to apply, e.g., Punycode encoding to the host name. So there is still much room for additional checks.
[1] A list of candidates can be found in the section "unsafe" of the URL spec at Page 2.
In my understanding '%' or '#' should be left out in the encoding check, since these characters can occur in encoded URLs as well.
Using Spring UriComponentsBuilder:
import java.net.URI;
import org.springframework.web.util.UriComponentsBuilder;
private URI getProperlyEncodedUri(String uriString) {
try {
return URI.create(uriString);
} catch (IllegalArgumentException e) {
return UriComponentsBuilder.fromUriString(uriString).build().toUri();
}
}
If you want to be sure that string is encoded correctly (if it needs to be encoded) - just decode and encode it once again.
metacode:
100%_correctly_encoded_string = encode(decode(input_string))
already encoded string will remain untouched. Unencoded string will be encoded. String with only url-allowed characters will remain untouched too.
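A Java sketch of this decode-then-encode idea, extended to decode repeatedly until the string stops changing, could look like this. EncodeOnce is a hypothetical name, and URLEncoder/URLDecoder are an assumed choice of codec; with them a literal + in unencoded input is read as a space, as noted in an earlier answer.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class EncodeOnce {
    // Decodes until the value is stable, then encodes exactly once.
    static String normalize(String input) {
        String current = input;
        while (true) {
            String decoded;
            try {
                decoded = URLDecoder.decode(current, "UTF-8");
            } catch (IllegalArgumentException e) {
                break; // malformed escape such as a stray '%': treat the value as raw text
            } catch (UnsupportedEncodingException e) {
                throw new IllegalStateException(e); // UTF-8 is always available
            }
            if (decoded.equals(current)) {
                break;
            }
            current = decoded;
        }
        try {
            return URLEncoder.encode(current, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("TEST=="));         // TEST%3D%3D
        System.out.println(normalize("TEST%253D%253D")); // TEST%3D%3D
    }
}
```

The weak spot is raw input that happens to contain a valid escape sequence, for example a literal "%3D": it cannot be told apart from encoded input and will be decoded.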
According to the spec (https://www.rfc-editor.org/rfc/rfc3986) all URLs MUST start with a scheme followed by a :
Since colons are required as the delimiter between a scheme and the rest of the URI, any string that contains a colon is not encoded.
(This assumes you will not be given an incomplete URI with no scheme.)
So you can test if the string contains a colon, if not, urldecode it, and if that string contains a colon, the original string was url encoded, if not, check if the strings are different and if so, urldecode again and if not, it is not a valid URI.
You can make this loop simpler if you know what schemes you can expect.
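The colon test described above can be sketched as a small loop. SchemeColonCheck is a hypothetical name, URLDecoder is an assumed stand-in for "urldecode", and the sketch relies on the stated assumption that every valid input has a scheme.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class SchemeColonCheck {
    // Decodes until a ':' appears; fails if decoding stops changing the string first.
    static String decodeUntilScheme(String s) {
        String current = s;
        while (!current.contains(":")) {
            String decoded = decode(current);
            if (decoded.equals(current)) {
                throw new IllegalArgumentException("No scheme found, not a valid URI: " + s);
            }
            current = decoded;
        }
        return current;
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeUntilScheme("https%253A%252F%252Fexample.com")); // https://example.com
    }
}
```

The loop stops as soon as the scheme delimiter is visible, so escapes that belong to the URL itself, such as %20 in a path, are left in place.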
Thanks to this answer I coded a function (in JavaScript) that encodes the URL just once with encodeURI, so you can call it to make sure the URL is encoded just once, and you don't need to know whether it is already encoded.
ES6:
var getUrlEncoded = sURL => {
if (decodeURI(sURL) === sURL) return encodeURI(sURL)
return getUrlEncoded(decodeURI(sURL))
}
Pre ES6:
var getUrlEncoded = function(sURL) {
if (decodeURI(sURL) === sURL) return encodeURI(sURL)
return getUrlEncoded(decodeURI(sURL))
}
Here are some tests so you can see the URL is only encoded once:
getUrlEncoded("https://example.com/media/Screenshot27 UI Home.jpg")
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"
getUrlEncoded(encodeURI("https://example.com/media/Screenshot27 UI Home.jpg"))
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"
getUrlEncoded(encodeURI(encodeURI("https://example.com/media/Screenshot27 UI Home.jpg")))
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"
getUrlEncoded(decodeURI("https://example.com/media/Screenshot27 UI Home.jpg"))
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"
getUrlEncoded(decodeURI(decodeURI("https://example.com/media/Screenshot27 UI Home.jpg")))
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"
