How to find out if string has already been URL encoded?

How to find out if string has already been URL encoded? - java

How could I check if string has already been encoded?
For example, if I encode TEST==, I get TEST%3D%3D. If I again encode last string, I get TEST%253D%253D, I would have to know before doing that if it is already encoded...
I have encoded parameters saved, and I need to search for them. I don't know for input parameters, what will they be - encoded or not, so I have to know if I have to encode or decode them before search.

Decode, compare to original. If it does differ, original is encoded. If it doesn't differ, original isn't encoded. But still it says nothing about whether the newly decoded version isn't still encoded. A good task for recursion.
I hope one can't write a quine in urlencode, or this algorithm would get stuck.
Exception: When a string contains "+" character url decoder replaces it with a space even though the string is not url encoded

Use regexp to check if your string contains illegal characters (i.e. characters which cannot be found in URL-encoded string, like whitespace).

Try decoding the url. If the resulting string is shorter than the original then the original URL was already encoded, else you can safely encode it (either it is not encoded, or even post encoding the url stays as is, so encoding again will not result in a wrong url). Below is sample pseudo (inspired by ruby) code:
# Returns encoded URL for any given URL after determining whether it is already encoded or not
def escape(url)
unescaped_url = URI.unescape(url)
if (unescaped_url.length < url.length)
return url
else
return URI.escape(url)
end
end

You can't know for sure, unless your strings conform to a certain pattern, or you keep track of your strings. As you noted by yourself, a String that is encoded can also be encoded, so you can't be 100% sure by looking at the string itself.

Check your URL for suspicious characters[1].
List of candidates:
WHITE_SPACE ,", < , > , { , } , | , \ , ^ , ~ , [ , ] , . and `
I use:
private static boolean isAlreadyEncoded(String passedUrl) {
boolean isEncoded = true;
if (passedUrl.matches(".*[\\ \"\\<\\>\\{\\}|\\\\^~\\[\\]].*")) {
isEncoded = false;
}
return isEncoded;
}
For the actual encoding I proceed with:
https://stackoverflow.com/a/49796882/1485527
Note: Even if your URL doesn't contain unsafe characters you might want to apply, e.g. Punnycode encoding to the host name. So there is still much space for additional checks.
[1] A list of candidates can be found in the section "unsafe" of the URL spec at Page 2.
In my understanding '%' or '#' should be left out in the encoding check, since these characters can occur in encoded URLs as well.

Using Spring UriComponentsBuilder:
import java.net.URI;
import org.springframework.web.util.UriComponentsBuilder;
private URI getProperlyEncodedUri(String uriString) {
try {
return URI.create(uriString);
} catch (IllegalArgumentException e) {
return UriComponentsBuilder.fromUriString(uriString).build().toUri();
}
}

If you want to be sure that string is encoded correctly (if it needs to be encoded) - just decode and encode it once again.
metacode:
100%_correctly_encoded_string = encode(decode(input_string))
already encoded string will remain untouched. Unencoded string will be encoded. String with only url-allowed characters will remain untouched too.

According to the spec (https://www.rfc-editor.org/rfc/rfc3986) all URLs MUST start with a scheme followed by a :
Since colons are required as the delimiter between a scheme and the rest of the URI, any string that contains a colon is not encoded.
(This assumes you will not be given an incomplete URI with no scheme.)
So you can test if the string contains a colon, if not, urldecode it, and if that string contains a colon, the original string was url encoded, if not, check if the strings are different and if so, urldecode again and if not, it is not a valid URI.
You can make this loop simpler if you know what schemes you can expect.

Thanks to this answer I coded a function (JS Language) that encodes the URL just once with encodeURI so you can call it to make sure is encoded just once and you don't need to know if the URL is already encoded.
ES6:
var getUrlEncoded = sURL => {
if (decodeURI(sURL) === sURL) return encodeURI(sURL)
return getUrlEncoded(decodeURI(sURL))
}
Pre ES6:
var getUrlEncoded = function(sURL) {
if (decodeURI(sURL) === sURL) return encodeURI(sURL)
return getUrlEncoded(decodeURI(sURL))
}
Here are some tests so you can see the URL is only encoded once:
getUrlEncoded("https://example.com/media/Screenshot27 UI Home.jpg")
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"
getUrlEncoded(encodeURI("https://example.com/media/Screenshot27 UI Home.jpg"))
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"
getUrlEncoded(encodeURI(encodeURI("https://example.com/media/Screenshot27 UI Home.jpg")))
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"
getUrlEncoded(decodeURI("https://example.com/media/Screenshot27 UI Home.jpg"))
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"
getUrlEncoded(decodeURI(decodeURI("https://example.com/media/Screenshot27 UI Home.jpg")))
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"

Related

How to handle (object replacement character) in URL

Using Jsoup to scrape URLS and one of the URLS I keep getting has this  symbol in it. I have tried decoding the URL:
url = URLDecoder.decode(url, "UTF-8" );
but it still remains in the code looking like this:
I cant find much online about this other than it is "The object replacement character, sometimes used to represent an embedded object in a document when it is converted to plain text."
But if this is the case I should be able to print the symbol if it is plain text but when I run
System.out.println("");
I get the following complication error:
and it reverts back to the last save.
Sample URL: https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/
NOTE: If you decode the url then compare it to the decoded url it comes back as not the same e.g.:
String url = URLDecoder.decode("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/", "UTF-8");
if(url.contains("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles?/")){
System.out.println("The same");
}else {
System.out.println("Not the same");
}

That's not a compilation error. That's the eclipse code editor telling you it can't save the source code to a file, because you have told it to save the file in a cp1252 encoding, but that encoding can't express a .
Put differently, your development environment is currently configured to store source code in the cp1252 encoding, which doesn't support the character you want, so you either configure your development environment to store source code using a more flexible encoding (such as UTF-8 the error message suggests), or avoid having that character in your source code, for instance by using its unicode escape sequence instead:
System.out.println("\ufffc");
Note that as far as the Java language and runtime are concerned,  is a character like any other, so there may not be a particular need to "handle" it. Also, I am unsure why you'd expect URLDecoder to do anything if the URL hasn't been URL-encoded to begin with.

"ef bf bc" is a 3 bytes UTF-8 character so as the error says, there's no representation for that character in "CP1252" Windows page encoding.
An option could be to replace that percent encoding sequence with an ascii representation to make the filename for saving:
String url = URLDecoder.decode("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/".replace("%ef%bf%bc", "-xEFxBFxBC"), "UTF-8");
url ==> "https://www.breightgroup.com/job/hse-advisor-emb ... contract-roles-xEFxBFxBC/"
Another option using CharsetDecoder
String urlDec = URLDecoder.decode("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/", "UTF-8");
CharsetDecoder decoder = Charset.forName("CP1252").newDecoder().onMalformedInput(CodingErrorAction.REPLACE).onUnmappableCharacter(CodingErrorAction.REPLACE);
String urlDec = URLDecoder.decode("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/", "UTF-8");
ByteBuffer buffer = ByteBuffer.wrap(urlDec.getBytes(Charset.forName("UTF-8")));
decoder.decode(buffer).toString();
Result
"https://www.breightgroup.com/job/hse-advisor-embedded-contract-rolesï¿¼/"

I found the issue resolved by just replacing URLs with this symbol because there are other URLs with Unicode symbols that were invisible that couldnt be converted ect..
So I just compared the urls to the following regex if it returns false then I just bypass it. Hope this helps someone out:
boolean newURL = url.matches("^[a-zA-Z0-9_:;/.&|%!+=#?-]*$");

File.toURI does not encode plus sign

I just want to check my own sanity with this question here. I have a filename which has a + (plus) character in it, which is perfectly valid on some operating systems and filesystems (e.g. MacOS and HFS+).
However, I am seeing an issue where I think that java.io.File#toURI() is not operating correctly.
For example:
new File("hello+world.txt").toURI().toString()
On my Mac machine returns:
file:/Users/aretter/code/rocksdb/hello+world.txt
However IMHO, that is not correct, because the + (plus) character from the filename has not been encoded in the URI. The URI does not represent the original filename at all, a + in a URI has a very different meaning to a + character in a filename.
So if we decode the URI, the plus will now be replaced with a (space) character, and we have lost information. e.g.:
URLDecoder.decode(new File("hello+world.txt").toURI().toURL().toString)
Which results in:
file:/Users/aretter/code/rocksdb/hello world.txt
What I would have expected instead would be something like:
new File("hello+world.txt").toURI().toString()
resulting in:
file:/Users/aretter/code/rocksdb/hello%2Bworld.txt
So that when it is later used and decoded the plus sign is preserved.
I am struggling to believe that such an obvious bug could be present in Java SE. Can someone point out where I am mistaken?
Also, if there is a workaround, I would like to hear about it please? Keep in mind that I am not actually providing static strings as filenames to File, but rather reading a directory of files from disk, of which some of those files may contain a + (plus) character.

Let me try to clarify,
'+' plus character is used as encoding character to encode ' ' space in context of HTML form (a.k.a. application/x-www-form-urlencoded MIME format).
'%20' character is used as encoding character to encode ' ' space in context of URL/URI format.
'+' plus character is threat as a normal character in context of URL and it is not encoded in any form (e.g. %20).
So when you call the new File("hello+world.txt").toURI().toString() does not perform any encoding for '+' character(simply because it is not required).
Now come to URLDecoder, this class is an utility class for HTML form decoding. It treat the '+' plus as encoded character and hence decode it to ' ' space character. In your example, this class tread the URI's to string value as normal html form field's value (not the URI value). This class should never be used to decode the full URI/URL value as it is not designed for this purpose)
From java docs of URLDecoder#decode(String),
Decodes a x-www-form-urlencoded string. The platform's default
encoding is used to determine what characters are represented by any
consecutive sequences of the form "%xy".
Hope it helps.
Update #1 based on comments:
As per section 2.2, If data for a URI component has conflicts with a reserved character, then the conflicting data must be percent-encoded before the URI is formed.
It is also an important point that different parts of URI has different set of reserved words depending on the their context. For example, / sign is reserved only in path part of URI, + sign is reserved in query string part. So there is no need to escape / in query part and similarly there is no need to escape + in path part.
In your example, URI producer File.toURI does not encode + sign in path part of URI (since +' is not considered as reserved word in path part) and you see the +' sign in to URI's to string representation.
You may refers to URI recommendation for more details.
Related answer:
https://stackoverflow.com/a/1006074/1700467
https://stackoverflow.com/a/2678602/1700467
https://stackoverflow.com/a/4571518/1700467

I'm assuming, you wanted to encode + sign in your filename to %2B. So, that you get back it as + sign when you decode it back.
If that is the case, then you need to use URLEncoder.encode
System.out.println(URLEncoder.encode(new File("hello+world.txt").toURI().toString()));
It will encode all special characters including + sign. The output would be
file%3A%2Fhome%2FT8hvs7%2Fhello%2Bworld.txt
Now, to decode use URLDecoder.decode
System.out.println(URLDecoder.decode("file%3A%2Fhome%2FwQCXni%2Fhello%2Bworld.txt"));
It will display
file:/home/wQCXni/hello+world.txt

Obviously this is not a bug, documentation clearly says
The plus sign "+" is converted into a space character " " .
You can do something like that: https://ideone.com/JHDkM4
import java.util.*;
import java.lang.*;
import java.io.*;
import static java.lang.System.out;
class Ideone
{
public static void main (String[] args) throws java.lang.Exception
{
out.println(new File("hello+world.txt").toURI().toString());
out.println(java.net.URLDecoder.decode(new File("hello+world.txt").toURI().toURL().toString()));
out.println(new File("hello+world.txt").toURI().toString().replaceAll("\\+", "%2B"));
}
}

If the URI represents a file, let the File class decode the URI.
Let's say we have a URI for a file, for example to get the filepath of a jar file :
URI uri = MyClass.class.getProtectionDomain().getCodeSource().getLocation().toURI();
System.out.println(uri.toString());
=> BAD : will display the plus sign, but %20 for spaces
System.out.println(URLDecoder.decode(uri.toString(), StandardCharsets.UTF_8.toString()));
=> BAD : will display spaces instead of %20, but also instead of the plus sign
System.out.println(new File(uri).getAbsolutePath());
=> GOOD

Try to escape the plus sign with a backslash \
So do
new File("hello\+world.txt").toURI().toString()

Encode URL with US-ASCII character set

I refer to the following web site:
http://coderstoolbox.net/string/#!encoding=xml&action=encode&charset=us_ascii
Choosing "URL", "Encode", and "US-ASCII", the input is converted to the desired output.
How do I produce the same output with Java codes?
Thanks in advance.

I used this and it seems to work fine.
public static String encode(String input) {
Pattern doNotReplace = Pattern.compile("[a-zA-Z0-9]");
return input.chars().mapToObj(c->{
if(!doNotReplace.matcher(String.valueOf((char)c)).matches()){
return "%" + (c<256?Integer.toHexString(c):"u"+Integer.toHexString(c));
}
return String.valueOf((char)c);
}).collect(Collectors.joining("")).toUpperCase();
}
PS: I'm using 256 to limit the placement of the prefix U to non-ASCII characters. No need of prefix U for standard ASCII characters which are within 256.
Alternate option:
There is a built-in Java class (java.net.URLEncoder) that does URL Encoding. But it works a little differently (For example, it does not replace the Space character with %20, but replaces with a + instead. Something similar happens with other characters too). See if it helps:
String encoded = URLEncoder.encode(input, "US-ASCII");
Hope this helps!

You can use ESAPi.encoder().encodeForUrl(linkString)
Check more details on encodeForUrl https://en.wikipedia.org/wiki/Percent-encoding
please comment if that does not satisfy your requirement or face any other issue.
Thanks

Any concerns about executing URLDecoder against a URL that was not encoded?

Currently incorporating the URLEncoder and URLDecoder into some code.
There are numerous URLs already saved that will get processed by the URLDecoder routine that was not initially processed by the URLEncoder routine.
Based on some testing it doesn't appear there will be an issue, but granted I have not tested all the scenarios.
I did notice some characters like the / which would normally get encoded are processed just find by the decoding routine even if not initially encoded.
This lead me to an oversimplified analysis. It appears the URLDecoder routine essentially checks the URL for a % and the next 2 bytes (provided UTF-8 is used). As long as there aren't any % within the previously saved off URLs then there shouldn't be an issue when processed by the URLDecoder routine. Does that sound about right?

Yes, while it will work for "simple" cases, you might encounter a) exceptions or b) unexpected behaviour if calling URLDecoder.decode for an unencoded URL that contains certain special chars.
Consider the following example: It will throw a java.lang.IllegalArgumentException: URLDecoder: Incomplete trailing escape (%) pattern for the third test and it will alter the URL without exception for the second test (while the regular encoding/decoding works without issues):
import java.net.URLDecoder;
import java.net.URLEncoder;
public class Test {
public static void main(String[] args) throws Exception {
test("http://www.foo.bar/");
test("http://www.foo.bar/?q=a+b");
test("http://www.foo.bar/?q=äöüß%"); // Will throw exception
}
private static void test(String url) throws Exception {
String encoded = URLEncoder.encode(url, "UTF-8");
String decoded = URLDecoder.decode(encoded, "UTF-8");
System.out.println("encoded: " + encoded);
System.out.println("decoded: " + decoded);
System.out.println(URLDecoder.decode(decoded, "UTF-8"));
}
}
Output (notice how the + sign disappears):
encoded: http%3A%2F%2Fwww.foo.bar%2F
decoded: http://www.foo.bar/
http://www.foo.bar/
encoded: http%3A%2F%2Fwww.foo.bar%2F%3Fq%3Da%2Bb
decoded: http://www.foo.bar/?q=a+b
http://www.foo.bar/?q=a b
encoded: http%3A%2F%2Fwww.foo.bar%2F%3Fq%3D%C3%A4%C3%B6%C3%BC%C3%9F%25
decoded: http://www.foo.bar/?q=äöüß%
Exception in thread "main" java.lang.IllegalArgumentException: URLDecoder: Incomplete trailing escape (%) pattern
at java.net.URLDecoder.decode(Unknown Source)
at Test.test(Test.java:16)
See the javadoc of URLDecoder for the two cases as well:
The plus sign "+" is converted into a space character " " .
A sequence of the form "%xy" will be treated as representing a byte where xy is the two-digit hexadecimal representation of the 8 bits.
Then, all substrings that contain one or more of these byte sequences
consecutively will be replaced by the character(s) whose encoding
would result in those consecutive bytes. The encoding scheme used to
decode these characters may be specified, or if unspecified, the
default encoding of the platform will be used.
If you are sure that your unencoded URLs do not contain + or % then I'd say it's safe to call URLDecoder.decode. Otherwise I'd advise to implement additional checks, e.g. try to decode and compare with the original (cf. this question on SO).

Encode URL query parameters

How can I encode URL query parameter values? I need to replace spaces with %20, accents, non-ASCII characters etc.
I tried to use URLEncoder but it also encodes / character and if I give a string encoded with URLEncoder to the URL constructor I get a MalformedURLException (no protocol).

URLEncoder has a very misleading name. It is according to the Javadocs used encode form parameters using MIME type application/x-www-form-urlencoded.
With this said it can be used to encode e.g., query parameters. For instance if a parameter looks like &/?# its encoded equivalent can be used as:
String url = "http://host.com/?key=" + URLEncoder.encode("&/?#");
Unless you have those special needs the URL javadocs suggests using new URI(..).toURL which performs URI encoding according to RFC2396.
The recommended way to manage the encoding and decoding of URLs is to use URI
The following sample
new URI("http", "host.com", "/path/", "key=| ?/#ä", "fragment").toURL();
produces the result http://host.com/path/?key=%7C%20?/%23ä#fragment. Note how characters such as ?&/ are not encoded.
For further information, see the posts HTTP URL Address Encoding in Java or how to encode URL to avoid special characters in java.
EDIT
Since your input is a string URL, using one of the parameterized constructor of URI will not help you. Neither can you use new URI(strUrl) directly since it doesn't quote URL parameters.
So at this stage we must use a trick to get what you want:
public URL parseUrl(String s) throws Exception {
URL u = new URL(s);
return new URI(
u.getProtocol(),
u.getAuthority(),
u.getPath(),
u.getQuery(),
u.getRef()).
toURL();
}
Before you can use this routine you have to sanitize your string to ensure it represents an absolute URL. I see two approaches to this:
Guessing. Prepend http:// to the string unless it's already present.
Construct the URI from a context using new URL(URL context, String spec)

So what you're saying is that you want to encode part of your URL but not the whole thing. Sounds to me like you'll have to break it up into parts, pass the ones that you want encoded through the encoder, and re-assemble it to get your whole URL.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How to find out if string has already been URL encoded? - java

Use regexp to check if your string contains illegal characters (i.e. characters which cannot be found in URL-encoded string, like whitespace).

You can't know for sure, unless your strings conform to a certain pattern, or you keep track of your strings. As you noted by yourself, a String that is encoded can also be encoded, so you can't be 100% sure by looking at the string itself.

Related

How to handle (object replacement character) in URL

File.toURI does not encode plus sign

Encode URL with US-ASCII character set

Any concerns about executing URLDecoder against a URL that was not encoded?

Encode URL query parameters

Categories

Resources

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How to find out if string has already been URL encoded? - java

Use regexp to check if your string contains illegal characters (i.e. characters which cannot be found in URL-encoded string, like whitespace).

You can't know for sure, unless your strings conform to a certain pattern, or you keep track of your strings. As you noted by yourself, a String that is encoded can also be encoded, so you can't be 100% sure by looking at the string itself.

Related

How to handle ￼ (object replacement character) in URL

File.toURI does not encode plus sign

Encode URL with US-ASCII character set

Any concerns about executing URLDecoder against a URL that was not encoded?

Encode URL query parameters

Categories

Resources

How to handle (object replacement character) in URL