File.toURI does not encode plus sign - java

I just want to check my own sanity with this question here. I have a filename which has a + (plus) character in it, which is perfectly valid on some operating systems and filesystems (e.g. MacOS and HFS+).
However, I am seeing an issue where I think that java.io.File#toURI() is not operating correctly.
For example:
new File("hello+world.txt").toURI().toString()
On my Mac machine returns:
file:/Users/aretter/code/rocksdb/hello+world.txt
However, IMHO that is not correct, because the + (plus) character from the filename has not been encoded in the URI. The URI does not represent the original filename at all: a + in a URI has a very different meaning from a + character in a filename.
So if we decode the URI, the plus will now be replaced with a (space) character, and we have lost information. e.g.:
URLDecoder.decode(new File("hello+world.txt").toURI().toURL().toString())
Which results in:
file:/Users/aretter/code/rocksdb/hello world.txt
What I would have expected instead would be something like:
new File("hello+world.txt").toURI().toString()
resulting in:
file:/Users/aretter/code/rocksdb/hello%2Bworld.txt
So that when it is later used and decoded the plus sign is preserved.
I am struggling to believe that such an obvious bug could be present in Java SE. Can someone point out where I am mistaken?
Also, if there is a workaround, I would like to hear about it please? Keep in mind that I am not actually providing static strings as filenames to File, but rather reading a directory of files from disk, of which some of those files may contain a + (plus) character.

Let me try to clarify,
The '+' (plus) character encodes a ' ' (space) in the context of HTML forms (a.k.a. the application/x-www-form-urlencoded MIME format).
The sequence '%20' encodes a ' ' (space) in the context of the URL/URI format.
The '+' (plus) character is treated as a normal character in the context of a URL and is not percent-encoded in any form.
So when you call new File("hello+world.txt").toURI().toString(), it does not encode the '+' character, simply because that is not required.
Now to URLDecoder: this class is a utility class for HTML form decoding. It treats '+' as an encoded character and hence decodes it to a ' ' (space). In your example, this class treats the URI's string value as a normal HTML form field value (not as a URI). It should never be used to decode a full URI/URL value, as it is not designed for that purpose.
From java docs of URLDecoder#decode(String),
Decodes a x-www-form-urlencoded string. The platform's default
encoding is used to determine what characters are represented by any
consecutive sequences of the form "%xy".
Hope it helps.
Update #1 based on comments:
As per section 2.2 of RFC 3986, if data for a URI component conflicts with a reserved character, then the conflicting data must be percent-encoded before the URI is formed.
It is also important that different parts of a URI have different sets of reserved characters depending on their context. For example, the / sign is reserved only in the path part of a URI, and the + sign is reserved in the query string part. So there is no need to escape / in the query part, and similarly there is no need to escape + in the path part.
In your example, the URI producer File.toURI does not encode the + sign in the path part of the URI (since '+' is not considered a reserved character in the path part), so you see the '+' sign in the URI's string representation.
You may refer to the URI specification for more details.
Related answer:
https://stackoverflow.com/a/1006074/1700467
https://stackoverflow.com/a/2678602/1700467
https://stackoverflow.com/a/4571518/1700467
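The difference between URI path decoding and form decoding can be checked directly with java.net.URI. This is only a minimal sketch and not part of the answer above: the class name UriPathVsFormDecoding, its helper methods and the path /tmp/hello+world and space.txt are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;

class UriPathVsFormDecoding {

    // Builds a file URI from a raw path. The multi-argument URI constructor
    // percent-encodes characters that are illegal in a path (such as spaces)
    // and leaves legal ones (such as '+') alone.
    static URI fileUri(String path) {
        try {
            return new URI("file", null, path, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Form-decodes a string the way the question did (the wrong tool for URIs).
    static String formDecode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        URI uri = fileUri("/tmp/hello+world and space.txt");
        System.out.println(uri);                        // file:/tmp/hello+world%20and%20space.txt
        System.out.println(uri.getPath());              // /tmp/hello+world and space.txt
        System.out.println(formDecode(uri.toString())); // file:/tmp/hello world and space.txt
    }
}
```

URI.getPath() undoes only the percent-encoding, so the plus sign survives, while URLDecoder additionally applies the form rule that turns '+' into a space.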

I'm assuming you want to encode the + sign in your filename as %2B, so that you get it back as a + sign when you decode it.
If that is the case, then you need to use URLEncoder.encode:
System.out.println(URLEncoder.encode(new File("hello+world.txt").toURI().toString()));
It will encode all special characters including + sign. The output would be
file%3A%2Fhome%2FT8hvs7%2Fhello%2Bworld.txt
Now, to decode use URLDecoder.decode
System.out.println(URLDecoder.decode("file%3A%2Fhome%2FwQCXni%2Fhello%2Bworld.txt"));
It will display
file:/home/wQCXni/hello+world.txt

Obviously this is not a bug, documentation clearly says
The plus sign "+" is converted into a space character " " .
You can do something like that: https://ideone.com/JHDkM4
import java.util.*;
import java.lang.*;
import java.io.*;
import static java.lang.System.out;

class Ideone
{
    public static void main (String[] args) throws java.lang.Exception
    {
        // The URI keeps the plus sign as-is
        out.println(new File("hello+world.txt").toURI().toString());
        // Form decoding turns the plus sign into a space
        out.println(java.net.URLDecoder.decode(new File("hello+world.txt").toURI().toURL().toString()));
        // Workaround: percent-encode the plus sign by hand
        out.println(new File("hello+world.txt").toURI().toString().replaceAll("\\+", "%2B"));
    }
}

If the URI represents a file, let the File class decode the URI.
Let's say we have a URI for a file, for example to get the filepath of a jar file :
URI uri = MyClass.class.getProtectionDomain().getCodeSource().getLocation().toURI();
System.out.println(uri.toString());
=> BAD : will display the plus sign, but %20 for spaces
System.out.println(URLDecoder.decode(uri.toString(), StandardCharsets.UTF_8.toString()));
=> BAD : will display spaces instead of %20, but also instead of the plus sign
System.out.println(new File(uri).getAbsolutePath());
=> GOOD
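The same idea applies to files read from a directory, as in the question: go from File (or Path) to URI and back without ever touching URLDecoder. A minimal sketch, with a made-up class name and file name; the file does not need to exist for the round trip to work.

```java
import java.io.File;
import java.net.URI;
import java.nio.file.Paths;

class FileUriRoundTrip {

    // File -> URI -> File: the File(URI) constructor decodes the path correctly.
    static String viaFile(String fileName) {
        URI uri = new File(fileName).toURI();
        return new File(uri).getName();
    }

    // Path -> URI -> Path: the java.nio equivalent of the same round trip.
    static String viaPath(String fileName) {
        URI uri = Paths.get(fileName).toUri();
        return Paths.get(uri).getFileName().toString();
    }

    public static void main(String[] args) {
        String name = "hello+world and space.txt"; // hypothetical file name
        System.out.println(new File(name).toURI()); // '+' kept, spaces shown as %20
        System.out.println(viaFile(name));          // original name, '+' and spaces intact
        System.out.println(viaPath(name));          // original name, '+' and spaces intact
    }
}
```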

Try to escape the plus sign with a backslash \
So do
new File("hello\+world.txt").toURI().toString()

Related

How to handle U+FFFC (object replacement character) in URL

I am using Jsoup to scrape URLs, and one of the URLs I keep getting has the U+FFFC symbol in it. I have tried decoding the URL:
url = URLDecoder.decode(url, "UTF-8" );
but it still remains in the code looking like this:
I can't find much online about this other than that it is "The object replacement character, sometimes used to represent an embedded object in a document when it is converted to plain text."
But if this is the case I should be able to print the symbol if it is plain text but when I run
System.out.println("");
I get the following compilation error:
and it reverts back to the last save.
Sample URL: https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/
NOTE: If you decode the url then compare it to the decoded url it comes back as not the same e.g.:
String url = URLDecoder.decode("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/", "UTF-8");
if(url.contains("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles?/")){
System.out.println("The same");
}else {
System.out.println("Not the same");
}
That's not a compilation error. That's the Eclipse code editor telling you it can't save the source code to a file, because you have told it to save the file in the cp1252 encoding, and that encoding can't express the U+FFFC character.
Put differently, your development environment is currently configured to store source code in the cp1252 encoding, which doesn't support the character you want. So you either configure your development environment to store source code using a more flexible encoding (such as UTF-8, as the error message suggests), or avoid having that character in your source code, for instance by using its Unicode escape sequence instead:
System.out.println("\ufffc");
Note that as far as the Java language and runtime are concerned, U+FFFC is a character like any other, so there may not be a particular need to "handle" it. Also, I am unsure why you'd expect URLDecoder to do anything if the URL hasn't been URL-encoded to begin with.
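Because it is an ordinary character, it can be detected or removed with plain String methods once the URL has been decoded. A minimal sketch, assuming the sample URL from the question; the class and method names are made up, and whether dropping the character is the right thing to do depends on the site.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class ObjectReplacementStripper {

    // Written as a Unicode escape so the source file stays pure ASCII.
    static final String OBJECT_REPLACEMENT = "\uFFFC";

    // Percent-decodes the URL as UTF-8, then drops every U+FFFC character.
    static String decodeAndStrip(String encodedUrl) {
        try {
            return URLDecoder.decode(encodedUrl, "UTF-8").replace(OBJECT_REPLACEMENT, "");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        String url = "https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/";
        // prints https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles/
        System.out.println(decodeAndStrip(url));
    }
}
```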
"ef bf bc" is a 3 bytes UTF-8 character so as the error says, there's no representation for that character in "CP1252" Windows page encoding.
An option could be to replace that percent encoding sequence with an ascii representation to make the filename for saving:
String url = URLDecoder.decode("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/".replace("%ef%bf%bc", "-xEFxBFxBC"), "UTF-8");
url ==> "https://www.breightgroup.com/job/hse-advisor-emb ... contract-roles-xEFxBFxBC/"
Another option is to use a CharsetDecoder:
String urlDec = URLDecoder.decode("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/", "UTF-8");
CharsetDecoder decoder = Charset.forName("CP1252").newDecoder().onMalformedInput(CodingErrorAction.REPLACE).onUnmappableCharacter(CodingErrorAction.REPLACE);
ByteBuffer buffer = ByteBuffer.wrap(urlDec.getBytes(Charset.forName("UTF-8")));
decoder.decode(buffer).toString();
Result
"https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles/"
I resolved the issue by just filtering out URLs with this symbol, because there are other URLs with invisible Unicode symbols that couldn't be converted, etc.
So I just match each URL against the following regex, and if it returns false I bypass that URL. Hope this helps someone out:
boolean newURL = url.matches("^[a-zA-Z0-9_:;/.&|%!+=#?-]*$");
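Wrapped into a reusable check, the filter might look like the sketch below. The class and method names are made up. One assumption worth stating: the check is applied to the decoded URL, because the percent-encoded form %ef%bf%bc consists only of allowed characters and would pass.

```java
import java.util.regex.Pattern;

class PlainUrlFilter {

    // Same character whitelist as the regex above, compiled once for reuse.
    private static final Pattern PLAIN_URL =
            Pattern.compile("^[a-zA-Z0-9_:;/.&|%!+=#?-]*$");

    // True when the URL consists only of whitelisted ASCII characters.
    static boolean isPlain(String url) {
        return PLAIN_URL.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPlain("https://example.com/job/some-role/"));        // true
        System.out.println(isPlain("https://example.com/job/some-role\uFFFC/"));  // false
    }
}
```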

Need help on validating domains on the basis of ASCII and BASE64 encoded UTF-8 string

I am doing some tests related to ldap in java using JDK 1.7
I have configuration file from which I am reading value of one property like "dc=domain1,dc=com" to pass that later to ldap for searching operations.
Here I want to validate the value which is coming from properties file and that value should be only ASCII or Base64 encoded UTF-8 strings.
I have written following regex to validate the string but seems like it is having some issues.
here is my sample code:
public class ValidateDN {
    public static void main(String[] args) {
        String istr = "dc=domain1,dc=com";
        String myregex = "^dc=[a-zA-Z0-9\\-\\.]*[,dc=[a-zA-Z0-9\\-\\.]*]*";
        if (istr.matches(myregex)){
            System.out.println("String matches");
        }
        else{
            System.out.println("String not matching");
        }
    }
}
It should pass all strings like:
dc=com
dc=domain1,dc=com
dc=domain2,dc=domain1,dc=com
It should fail for the values:
dc=domain1,dc=com,d
dc=domain1,dc=com,dc
(incomplete key or invalid syntax)
Can anyone suggest what should be done here to validate this properly?
You have a major error in your regex - you're using square brackets instead of parentheses. Square brackets define a character class, which matches any single one of the listed characters, not a sequence of characters.
Further, your regex can be simplified to:
(dc=[\w-]+,?)*
As LDAP DNs may contain spaces, you may want to consider using:
(\s*dc\s*=\s*[\w-]+\s*,?)*
Remember to escape the backslashes as necessary when inserting into your code.
I believe the problem you are having is due to the structure of your regex.
Your regex:
"^dc=[a-zA-Z0-9\\-\\.]*[,dc=[a-zA-Z0-9\\-\\.]*]*"
has a flaw with the second character class. Specifically:
[,dc=[a-zA-Z0-9\\-\\.]*]*
It should be changed to (,dc=[a-zA-Z0-9\\-\\.]*)* for the sake of having the literal ",dc=" match as well as the inner character class match.
The complete regex that should work is:
^dc=[a-zA-Z0-9\\-\\.]*(,dc=[a-zA-Z0-9\\-\\.]*)*
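For reference, here is a minimal sketch of the corrected pattern run against the examples from the question. It is not the answerer's exact regex: it uses + instead of * so that an empty value such as "dc=" is rejected, which is an assumption about what the validation should do, and the class name DcValidator is made up.

```java
import java.util.regex.Pattern;

class DcValidator {

    // One or more comma-separated "dc=value" components; values may contain
    // letters, digits, dots and hyphens. Requires at least one character per value.
    private static final Pattern DC_CHAIN =
            Pattern.compile("dc=[a-zA-Z0-9.-]+(,dc=[a-zA-Z0-9.-]+)*");

    // matches() anchors the pattern at both ends, so trailing junk is rejected.
    static boolean isValid(String dn) {
        return DC_CHAIN.matcher(dn).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("dc=com"));                       // true
        System.out.println(isValid("dc=domain2,dc=domain1,dc=com")); // true
        System.out.println(isValid("dc=domain1,dc=com,d"));          // false
        System.out.println(isValid("dc=domain1,dc=com,dc"));         // false
    }
}
```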

Any concerns about executing URLDecoder against a URL that was not encoded?

Currently incorporating the URLEncoder and URLDecoder into some code.
There are numerous URLs already saved that will get processed by the URLDecoder routine that was not initially processed by the URLEncoder routine.
Based on some testing it doesn't appear there will be an issue, but granted I have not tested all the scenarios.
I did notice that some characters like the /, which would normally get encoded, are processed just fine by the decoding routine even if not initially encoded.
This led me to an oversimplified analysis. It appears the URLDecoder routine essentially checks the URL for a % and the next 2 bytes (provided UTF-8 is used). As long as there aren't any % within the previously saved URLs, there shouldn't be an issue when they are processed by the URLDecoder routine. Does that sound about right?
Yes, while it will work for "simple" cases, you might encounter a) exceptions or b) unexpected behaviour if calling URLDecoder.decode for an unencoded URL that contains certain special chars.
Consider the following example: It will throw a java.lang.IllegalArgumentException: URLDecoder: Incomplete trailing escape (%) pattern for the third test and it will alter the URL without exception for the second test (while the regular encoding/decoding works without issues):
import java.net.URLDecoder;
import java.net.URLEncoder;

public class Test {
    public static void main(String[] args) throws Exception {
        test("http://www.foo.bar/");
        test("http://www.foo.bar/?q=a+b");
        test("http://www.foo.bar/?q=äöüß%"); // Will throw exception
    }

    private static void test(String url) throws Exception {
        String encoded = URLEncoder.encode(url, "UTF-8");
        String decoded = URLDecoder.decode(encoded, "UTF-8");
        System.out.println("encoded: " + encoded);
        System.out.println("decoded: " + decoded);
        System.out.println(URLDecoder.decode(decoded, "UTF-8"));
    }
}
Output (notice how the + sign disappears):
encoded: http%3A%2F%2Fwww.foo.bar%2F
decoded: http://www.foo.bar/
http://www.foo.bar/
encoded: http%3A%2F%2Fwww.foo.bar%2F%3Fq%3Da%2Bb
decoded: http://www.foo.bar/?q=a+b
http://www.foo.bar/?q=a b
encoded: http%3A%2F%2Fwww.foo.bar%2F%3Fq%3D%C3%A4%C3%B6%C3%BC%C3%9F%25
decoded: http://www.foo.bar/?q=äöüß%
Exception in thread "main" java.lang.IllegalArgumentException: URLDecoder: Incomplete trailing escape (%) pattern
at java.net.URLDecoder.decode(Unknown Source)
at Test.test(Test.java:16)
See the javadoc of URLDecoder for the two cases as well:
The plus sign "+" is converted into a space character " " .
A sequence of the form "%xy" will be treated as representing a byte where xy is the two-digit hexadecimal representation of the 8 bits.
Then, all substrings that contain one or more of these byte sequences
consecutively will be replaced by the character(s) whose encoding
would result in those consecutive bytes. The encoding scheme used to
decode these characters may be specified, or if unspecified, the
default encoding of the platform will be used.
If you are sure that your unencoded URLs do not contain + or % then I'd say it's safe to call URLDecoder.decode. Otherwise I'd advise to implement additional checks, e.g. try to decode and compare with the original (cf. this question on SO).
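One such additional check could be a wrapper that falls back to the original string when decoding fails. This is a minimal sketch with made-up names. It only guards against the exception case; it cannot tell whether a + was meant literally, so the second caveat above still applies.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class SafeUrlDecoder {

    // Decodes as UTF-8, but returns the input unchanged when it contains a
    // malformed escape (for example a stray '%'), instead of throwing.
    static String decodeOrOriginal(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (IllegalArgumentException e) {
            return value; // not a valid encoded string, keep it as-is
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeOrOriginal("http%3A%2F%2Fwww.foo.bar%2F"));  // http://www.foo.bar/
        System.out.println(decodeOrOriginal("100%"));                         // 100%
        System.out.println(decodeOrOriginal("http://www.foo.bar/?q=a+b"));    // http://www.foo.bar/?q=a b
    }
}
```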

Japanese Character Encoding in Base64

I have been asked to fix a bug in our email processing software.
When a message whose subject is encoded in RFC 2047 like this:
=?ISO-2022-JP?B?GyRCR1s/LiVGJTklSC1qRnxLXDhsGyhC?=
is received, it is incorrectly decoded - one of the Japanese characters is not rendered properly. It is rendered like this: 配信テスト?日本語 when it should be 配信テスト㈱日本語
(I do not understand Japanese) - clearly one of the characters, the one which looks like it's in brackets, has not been rendered.
The decoding is carried out by javax.mail.internet.MimeUtility.decodeText()
If I try it with an on-line decoder (the only one I've found is here) it seems to work OK, so I was suspecting a bug in MimeUtility.
So I tried some experiments, in the form of this little program:
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import org.apache.commons.codec.binary.Base64;

public class Encoding {
    private static final Charset CHARSET = Charset.forName("ISO-2022-JP");

    public static void main(String[] args) throws UnsupportedEncodingException {
        String control = "繋がって";
        String subject= "配信テスト㈱日本語";
        String controlBase64 = japaneseToBase64(control);
        System.out.println(controlBase64);
        System.out.println(base64ToJapanese(controlBase64));
        String subjectBase64 = japaneseToBase64(subject);
        System.out.println(subjectBase64);
        System.out.println(base64ToJapanese(subjectBase64));
    }

    private static String japaneseToBase64(String in) {
        return Base64.encodeBase64String(in.getBytes(CHARSET));
    }

    private static String base64ToJapanese(String in) {
        return new String(Base64.decodeBase64(in), CHARSET);
    }
}
(The Base64 and Hex classes are in org.apache.commons.codec)
When I run it, here's the output:
GyRCN1IkLCRDJEYbKEI=
繋がって
GyRCR1s/LiVGJTklSCEpRnxLXDhsGyhC
配信テスト?日本語
The first, shorter Japanese string is a control, and this returns the same as the input, having been converted into Base64 and back again, using Charset ISO-2022-JP. All OK there.
The second Japanese string is the one with the dodgy character. As you see, it returns with a ? instead of the character. The Base64 encoding output is also different from the original subject encoding.
Sorry if this is long, I wanted to be thorough. What's going on, and how can I decode this character correctly?
The bug is not in your software, but the subject string itself is incorrectly encoded. Other software may be able to decode the text by making further assumptions about the content, just as it is often assumed that characters in the range 0x80-0x9f are Cp1252-encoded, although ISO-8859-1 or ISO-8859-15 is specified.
ISO-2022-JP is a multi-charset encoding, using escape sequences to switch between the actually used character set. Your encoded string starts with ESC $ B, indicating that the character set JIS X 0208-1983 is used. The offending character is encoded as 0x2d6a. That code point is not defined in the referred character set, but later added to JIS X 0213:2000, a newer version of the JIS X character set specifications.
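To see the bytes described above, the Base64 payload of the subject can be dumped as hex. A minimal sketch, assuming Java 8+ so that java.util.Base64 can be used instead of the commons-codec class from the question; the class and method names are made up for illustration.

```java
import java.util.Base64;

class SubjectBytes {

    // Decodes the Base64 payload and renders every byte as two hex digits.
    static String hex(String base64) {
        byte[] bytes = Base64.getDecoder().decode(base64);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Converts one byte of a two-byte JIS code to its 1-based row or cell number.
    static int rowOrCell(int codeByte) {
        return codeByte - 0x20;
    }

    public static void main(String[] args) {
        // Payload of the RFC 2047 encoded subject from the question
        String payload = "GyRCR1s/LiVGJTklSC1qRnxLXDhsGyhC";
        // prints 1b2442475b3f2e2546253925482d6a467c4b5c386c1b2842
        // 1b2442 = ESC $ B, 2d6a = the offending character, 1b2842 = ESC ( B
        System.out.println(hex(payload));
        System.out.println(rowOrCell(0x2d) + "/" + rowOrCell(0x6a)); // prints 13/74
    }
}
```

The sixth two-byte pair after the opening escape sequence is 0x2d6a, which is row 13, cell 74 when counted from 1.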
Try using "MS932" or "Shift-JIS" as your encoding, that is:
private static final Charset CHARSET = Charset.forName("MS932");
There are different scripts in Japanese like kanji, katakana. Some of the encoding like Cp132 will not support some characters of Japanese. The problem you face is because of the encoding "ISO-2022-JP" you have used in your code.
ISO-2022-JP uses pairs of bytes, called ku and ten, that index into a 94×94 table of characters. The pair that fails has ku 12 and ten 73, which is not listed in table of valid characters I have (based on JIS X 0208). All of ku=12 seems to be unused.
Wikipedia doesn't list any updates to JIS X 0208, either. Perhaps the sender is using some sort of vendor-defined extension?
Despite the fact that ISO-2022-JP is a variable width encoding, it seems as though Java doesn't support the section of the character set that it lies in (possibly as a result of the missing escape sequences in ISO-2022-JP-2 that are present in ISO-2022-JP-3 and ISO-2022-JP-2004 which aren't supported). UTF-8, UTF-16 and UTF-32 do however support all of the characters.
UTF-32:
AAB+SwAAMEwAADBjAAAwZg==
繋がって
AACRTQAAT+EAADDGAAAwuQAAMMgAADIxAABl5QAAZywAAIqe
配信テスト㈱日本語
As an extra tidbit, regardless of whether UTF-32 was used, when the strings were printed as-is they retained their natural encoding and appeared normally.

How to find out if string has already been URL encoded?

How could I check if string has already been encoded?
For example, if I encode TEST==, I get TEST%3D%3D. If I encode that string again, I get TEST%253D%253D, so I would have to know beforehand whether it is already encoded.
I have encoded parameters saved, and I need to search for them. I don't know whether the input parameters will be encoded or not, so I have to know whether to encode or decode them before searching.
Decode, compare to original. If it does differ, original is encoded. If it doesn't differ, original isn't encoded. But still it says nothing about whether the newly decoded version isn't still encoded. A good task for recursion.
I hope one can't write a quine in urlencode, or this algorithm would get stuck.
Exception: When a string contains "+" character url decoder replaces it with a space even though the string is not url encoded
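In Java, the decode-and-compare idea could be sketched as follows, with the plus sign protected before decoding so that a literal + does not count as a difference. The class and method names are made up. The plus protection is an assumption: it means a form-encoded "a+b" (meaning "a b") is reported as not encoded, so pick whichever behaviour fits your data.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class EncodingCheck {

    // Returns true when decoding changes the string, i.e. it contains
    // at least one valid percent escape.
    static boolean looksEncoded(String s) {
        try {
            // Protect literal plus signs so the form decoder does not turn them into spaces.
            String decoded = URLDecoder.decode(s.replace("+", "%2B"), "UTF-8");
            return !decoded.equals(s);
        } catch (IllegalArgumentException e) {
            return false; // a malformed escape means this is not a valid encoded string
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(looksEncoded("TEST%3D%3D")); // true
        System.out.println(looksEncoded("TEST=="));     // false
        System.out.println(looksEncoded("a+b"));        // false
    }
}
```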
Use regexp to check if your string contains illegal characters (i.e. characters which cannot be found in URL-encoded string, like whitespace).
Try decoding the url. If the resulting string is shorter than the original then the original URL was already encoded, else you can safely encode it (either it is not encoded, or even post encoding the url stays as is, so encoding again will not result in a wrong url). Below is sample pseudo (inspired by ruby) code:
# Returns encoded URL for any given URL after determining whether it is already encoded or not
def escape(url)
  unescaped_url = URI.unescape(url)
  if (unescaped_url.length < url.length)
    return url
  else
    return URI.escape(url)
  end
end
You can't know for sure, unless your strings conform to a certain pattern, or you keep track of your strings. As you noted yourself, a String that is already encoded can be encoded again, so you can't be 100% sure by looking at the string itself.
Check your URL for suspicious characters[1].
List of candidates:
WHITE_SPACE ,", < , > , { , } , | , \ , ^ , ~ , [ , ] , . and `
I use:
private static boolean isAlreadyEncoded(String passedUrl) {
    boolean isEncoded = true;
    if (passedUrl.matches(".*[\\ \"\\<\\>\\{\\}|\\\\^~\\[\\]].*")) {
        isEncoded = false;
    }
    return isEncoded;
}
For the actual encoding I proceed with:
https://stackoverflow.com/a/49796882/1485527
Note: Even if your URL doesn't contain unsafe characters, you might want to apply, e.g., Punycode encoding to the host name. So there is still much room for additional checks.
[1] A list of candidates can be found in the section "unsafe" of the URL spec at Page 2.
In my understanding '%' or '#' should be left out in the encoding check, since these characters can occur in encoded URLs as well.
Using Spring UriComponentsBuilder:
import java.net.URI;
import org.springframework.web.util.UriComponentsBuilder;
private URI getProperlyEncodedUri(String uriString) {
    try {
        return URI.create(uriString);
    } catch (IllegalArgumentException e) {
        return UriComponentsBuilder.fromUriString(uriString).build().toUri();
    }
}
If you want to be sure that string is encoded correctly (if it needs to be encoded) - just decode and encode it once again.
metacode:
100%_correctly_encoded_string = encode(decode(input_string))
An already encoded string will remain untouched. An unencoded string will be encoded. A string with only URL-allowed characters will remain untouched too.
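With the JDK's form encoder, the metacode above could be sketched like this. The class and method names are made up, and the sketch assumes the input is a single parameter value without a literal % or +: a stray % makes decode throw, and a + comes back as a +, whatever it originally meant.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class EncodeOnce {

    // encode(decode(input)): encoded input stays encoded, plain input gets encoded.
    static String normalize(String input) {
        try {
            return URLEncoder.encode(URLDecoder.decode(input, "UTF-8"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("TEST=="));     // TEST%3D%3D
        System.out.println(normalize("TEST%3D%3D")); // TEST%3D%3D
        System.out.println(normalize("TEST"));       // TEST
    }
}
```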
According to the spec (https://www.rfc-editor.org/rfc/rfc3986) all URLs MUST start with a scheme followed by a :
Since colons are required as the delimiter between a scheme and the rest of the URI, any string that contains a colon is not encoded.
(This assumes you will not be given an incomplete URI with no scheme.)
So you can test whether the string contains a colon. If it does not, URL-decode it. If the decoded string contains a colon, the original string was URL-encoded. If it still does not, check whether the two strings differ: if they do, URL-decode again; if they do not, it is not a valid URI.
You can make this loop simpler if you know what schemes you can expect.
Thanks to this answer I coded a function (JS Language) that encodes the URL just once with encodeURI so you can call it to make sure is encoded just once and you don't need to know if the URL is already encoded.
ES6:
var getUrlEncoded = sURL => {
    if (decodeURI(sURL) === sURL) return encodeURI(sURL)
    return getUrlEncoded(decodeURI(sURL))
}
Pre ES6:
var getUrlEncoded = function(sURL) {
    if (decodeURI(sURL) === sURL) return encodeURI(sURL)
    return getUrlEncoded(decodeURI(sURL))
}
Here are some tests so you can see the URL is only encoded once:
getUrlEncoded("https://example.com/media/Screenshot27 UI Home.jpg")
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"
getUrlEncoded(encodeURI("https://example.com/media/Screenshot27 UI Home.jpg"))
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"
getUrlEncoded(encodeURI(encodeURI("https://example.com/media/Screenshot27 UI Home.jpg")))
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"
getUrlEncoded(decodeURI("https://example.com/media/Screenshot27 UI Home.jpg"))
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"
getUrlEncoded(decodeURI(decodeURI("https://example.com/media/Screenshot27 UI Home.jpg")))
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"
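A rough Java counterpart of the same decode-until-stable idea is sketched below. It is not a direct port: the JDK has no encodeURI, and URLEncoder is a form encoder that also escapes : and /, so this version is meant for single parameter values rather than whole URLs. The class name, the method name and the round limit are made up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class EncodeExactlyOnce {

    private static final int MAX_ROUNDS = 10; // arbitrary safety limit for the loop

    // Decodes repeatedly until the value stops changing, then encodes it once.
    static String encodeOnce(String value) {
        try {
            String current = value;
            for (int i = 0; i < MAX_ROUNDS; i++) {
                String decoded = URLDecoder.decode(current, "UTF-8");
                if (decoded.equals(current)) {
                    break; // nothing left to decode
                }
                current = decoded;
            }
            return URLEncoder.encode(current, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeOnce("TEST=="));         // TEST%3D%3D
        System.out.println(encodeOnce("TEST%3D%3D"));     // TEST%3D%3D
        System.out.println(encodeOnce("TEST%253D%253D")); // TEST%3D%3D
    }
}
```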
