What is the deal with Java's bizarre file protocol handling?
I mean, on Windows, UNC paths get turned into five slashes, and I get why that happens, but on Linux an absolute path gets turned into file:/local/path/to/file
Shouldn't that have three slashes?
I'm assuming the authors of Java aren't incompetent, so is there an explanation for why that's acceptable?
Let’s start with the documentation of the URI class:
A hierarchical URI is subject to further parsing according to the syntax
[scheme:][//authority][path][?query][#fragment]
As you can see, the authority is optional. This is supported by the URI specification, section 3:
The scheme and path components are required, though the path may be empty (no characters). When authority is present, the path must either be empty or begin with a slash ("/") character. When authority is not present, the path cannot begin with two slash characters ("//").
A file: URI can have an authority separator, //, with an effectively empty authority after it, but it serves no purpose, so there is no harm in omitting it. It’s still a fully compliant URI.
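To see that both spellings carry the same path, here is a minimal sketch using java.net.URI. The class name FileUriForms and the helper pathOf are made up for this illustration; only URI.create and getPath come from the JDK.

```java
import java.net.URI;

// Hypothetical demo class; only java.net.URI is a real JDK API here.
final class FileUriForms {

    /** Returns the decoded path component of the given URI string. */
    static String pathOf(String uri) {
        return URI.create(uri).getPath();
    }

    public static void main(String[] args) {
        // One slash: the URI has no authority component at all.
        System.out.println(pathOf("file:/local/path/to/file"));
        // Three slashes: "//" introduces an empty authority, then the path begins.
        System.out.println(pathOf("file:///local/path/to/file"));
        // Both lines print /local/path/to/file
    }
}
```

Either form parses to the same path, which is why omitting the empty authority loses nothing.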
Related
I am using the Java URL constructor "URL(URL context, String spec)" found here but the constructed URL is not what I expect - it is leaving out a path segment provided in the context parameter.
As an example, this code
new URL(new URL("http://asdf.com/z"), "a/b/c");
produces a URL with value
http://asdf.com/a/b/c
So it has left out the "z" path segment.
I have two questions:
What is the meaning of "context" the first parameter here in the java doc? I could not find mention of it in the URL specification nor did I really find it in java doc.
Is leaving out the "z" expected behavior?
Thanks!
What is the meaning of "context" the first parameter here in the java doc?
It's like the "base URL" of the spec parameter. If context is https://example.com, and spec is /foo, the constructor would create https://example.com/foo. It's similar to (but not exactly the same as, as we'll see later) asking "I am currently on https://example.com, and I want to go to /foo, what would my final URL be?"
Is leaving out the "z" expected behavior?
Yes. If you follow through the rules of resolving a relative URL against a base URL in RFC 2396 with regard to this case, you will reach this step:
(6) If this step is reached, then we are resolving a relative-path
reference. The relative path needs to be merged with the base
URI's path. Although there are many ways to do this, we will
describe a simple method using a separate string buffer.
(a) All but the last segment of the base URI's path component is
copied to the buffer. In other words, any characters after the
last (right-most) slash character, if any, are excluded.
(b) The reference's path component is appended to the buffer
string.
The "last segment" here refers to z, and that is not added to the buffer. Right after that, the path a/b/c "is appended to the buffer". Steps (c) onwards deal with removing . and .., which is irrelevant here.
Note that RFC 2396 doesn't say you MUST implement the algorithm in this way, but that whatever your implementation is, your output must match the output of that algorithm:
The above algorithm is intended to provide an example by which the
output of implementations can be tested -- implementation of the
algorithm itself is not required.
So yeah, this is expected. To keep the /z, you should add another / after the z:
new URL(new URL("http://asdf.com/z/"), "a/b/c")
This way the "last segment" becomes the empty string.
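A quick way to watch the merge rule in action is the sketch below. It swaps in URI.resolve for the URL constructor from the question; URI.resolve is documented to follow the same RFC 2396 resolution rules. The class name RelativeResolution and the helper resolve are invented for this example.

```java
import java.net.URI;

// Hypothetical demo class; URI.create and URI.resolve are real JDK methods.
final class RelativeResolution {

    /** Resolves a relative reference against a base URI and returns the result as a string. */
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).toString();
    }

    public static void main(String[] args) {
        // The last segment "z" is dropped before the relative path is appended.
        System.out.println(resolve("http://asdf.com/z", "a/b/c"));   // http://asdf.com/a/b/c
        // With a trailing slash the last segment is empty, so "z" survives.
        System.out.println(resolve("http://asdf.com/z/", "a/b/c"));  // http://asdf.com/z/a/b/c
    }
}
```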
You can treat the context like the current directory in file system.
With the context "http://asdf.com/z", the current directory is "http://asdf.com/", so using "a/b/c" as the spec results in the full path "http://asdf.com/a/b/c".
I used to use a URI to represent the base folder, and URI.resolve(filename) to get the URI of the actual file I want to write to disk.
I have now discovered that the resolve method rejects many characters that the OS allows in file names; those have to be percent-encoded (%HEX).
I was not aware of that limitation and do not know how far the encoding has to go. Encoding is mostly used in parameter values, and I can barely think of a situation where I have seen it in a path.
So is it safe to assume that URI.resolve(URLEncoder.encode(filename)) does the trick? Are there better alternatives, or edge cases I should know about?
It's actually URI.create(en) which fails; for example, using "!##$%^&()" (which is a valid, if very strange, filename) throws IllegalArgumentException: Malformed escape pair at index 4
As for URLEncoder.encode(filename) - It is deprecated and URLEncoder.encode(filename, encoding) should be used instead.
From my experience, filename URI resolution is best handled by new File(f).toURI() as for a given abstract pathname f, it is guaranteed that:
new File(f.toURI()).equals( f.getAbsoluteFile())
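As a rough sketch of that approach, the code below builds the URI from a base folder and a raw file name through File, so the percent-encoding is left to the JDK. The class FileNameToUri, the helper toFileUri, and the folder name "base" are assumptions made for this illustration.

```java
import java.io.File;
import java.net.URI;

// Hypothetical demo class; File(File, String), File.toURI and File(URI) are real JDK APIs.
final class FileNameToUri {

    /** Builds a file: URI for fileName inside baseDir; File.toURI does the escaping. */
    static URI toFileUri(File baseDir, String fileName) {
        return new File(baseDir, fileName).toURI();
    }

    public static void main(String[] args) {
        File base = new File("base");        // assumed base folder, need not exist
        String name = "!##$%^&()";           // the awkward file name from above
        URI uri = toFileUri(base, name);
        System.out.println(uri);             // '#', '%' and '^' appear percent-encoded
        // Round trip back to the same absolute file, as the File.toURI contract promises.
        System.out.println(new File(uri).equals(new File(base, name).getAbsoluteFile()));
    }
}
```

No file is created here; File only manipulates the path name, so the sketch can be run anywhere.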
Since Path.resolve does not accept an array of strings, it is possible to pass a relative path containing multiple path components, e.g. "foo/bar/baz".
My question is
if the forward slash in such a relative path will work as intended across platforms?
I have seen some answers on here that allege Java treats forward slashes as a "universal separator", but no citations to support them.
/ should be a valid path separator on all major platforms of today. See for instance File.separator vs Slash in Paths (maybe it's even a dup?)
If you're the pedantic type you can use FileSystem.getSeparator.
Note that you can also do
root.resolve(Paths.get("foo", "bar", "baz"));
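A small sketch of that idea, which never writes a separator character in the source. The class ResolveComponents and the helper resolveAll are made-up names, and the sketch assumes root belongs to the default file system (Paths.get always creates default-file-system paths).

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical demo class; Paths.get(String, String...) and Path.resolve(Path) are real JDK APIs.
final class ResolveComponents {

    /** Resolves several name components against root without spelling out a separator. */
    static Path resolveAll(Path root, String first, String... more) {
        // Assumption: root comes from the default file system, like the result of Paths.get.
        return root.resolve(Paths.get(first, more));
    }

    public static void main(String[] args) {
        Path root = Paths.get("root");
        Path viaVarargs = resolveAll(root, "foo", "bar", "baz");
        Path viaChain = root.resolve("foo").resolve("bar").resolve("baz");
        System.out.println(viaVarargs.equals(viaChain)); // true
    }
}
```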
No. The typical / in Path objects is called a name separator. It is defined in the FileSystem object from which the Path was created.
You can retrieve it with FileSystem#getSeparator().
Returns the name separator, represented as a string.
The name separator is used to separate names in a path string. An
implementation may support multiple name separators in which case this
method returns an implementation specific default name separator. This
separator is used when creating path strings by invoking the
toString() method.
In the case of the default provider, this method returns the same
separator as java.io.File.separator.
You can retrieve a Path's FileSystem with Path#getFileSystem().
As far as I know, all typical file systems will use / as a separator, but you could write your own FileSystem implementation which doesn't.
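If you would rather ask the file system than assume, something like the following sketch works. SeparatorLookup and separatorOf are names invented for this example; the JDK calls are Path#getFileSystem() and FileSystem#getSeparator().

```java
import java.io.File;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical demo class built on Path.getFileSystem() and FileSystem.getSeparator().
final class SeparatorLookup {

    /** Returns the default name separator of the file system that created the path. */
    static String separatorOf(Path path) {
        return path.getFileSystem().getSeparator();
    }

    public static void main(String[] args) {
        Path relative = Paths.get("foo", "bar", "baz");
        String separator = separatorOf(relative);
        // For the default provider this matches java.io.File.separator.
        System.out.println(separator.equals(File.separator));
        // toString() joins the names with that same separator.
        System.out.println(relative.toString().equals(String.join(separator, "foo", "bar", "baz")));
    }
}
```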
You can first call FileSystem#getPath("foo", "bar", "baz") on the path's file system to get a Path, and then, instead of passing a String to Path.resolve(), use the overload that accepts a Path.
On many websites (especially Gmail, Yahoo, or Hotmail), you will notice that the URL
is followed by something like: yahoo.com/abc/bcd.html;_x=12323;_y=2322;
What are these _x and _y parameters? How do I access them in server-side code?
They are parameters in the URL (as distinct from the query string), this article has a good discussion, including this helpful diagram:
<scheme>://<username>:<password>@<host>:<port>/<path>;<parameters>?<query>#<fragment>
Note that they're not "parameters" in the sense used in the Java EE ServletRequest#getParameter and such (there when they say "parameter" they mean query string or POST arguments, which are different).
This is defined in §3.3 of RFC 2396:
The path may consist of a sequence of path segments separated by a
single slash "/" character. Within a path segment, the characters
"/", ";", "=", and "?" are reserved. Each path segment may include a
sequence of parameters, indicated by the semicolon ";" character.
The parameters are not significant to the parsing of relative
references.
(For the avoidance of doubt: The term "path" above does not include the query string, see the beginning of §3.)
RFC 2396 is obsoleted by RFC 3986, though, which amends the above markedly:
Aside from dot-segments in hierarchical paths, a path segment is
considered opaque by the generic syntax. URI producing applications
often use the reserved characters allowed in a segment to delimit
scheme-specific or dereference-handler-specific subcomponents. For
example, the semicolon (";") and equals ("=") reserved characters are
often used to delimit parameters and parameter values applicable to
that segment. The comma (",") reserved character is often used for
similar purposes. For example, one URI producer might use a segment
such as "name;v=1.1" to indicate a reference to version 1.1 of
"name", whereas another might use a segment such as "name,1.1" to
indicate the same. Parameter types may be defined by scheme-specific
semantics, but in most cases the syntax of a parameter is specific to
the implementation of the URI's dereferencing algorithm.
They're just characters that may appear in the URL. You access them by parsing the URL, because they're not regular query string parameters.
Those are parameters to a segment in the path part of the URI.
The URI syntax is defined in RFC 3986 as follows:
URI       = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
hier-part = "//" authority path-abempty
          / path-absolute
          / path-rootless
          / path-empty
[...]
The following are two example URIs and their component parts:
  foo://example.com:8042/over/there?name=ferret#nose
  \_/   \______________/\_________/ \_________/ \__/
   |           |            |            |        |
scheme     authority       path        query   fragment
   |   _____________________|__
  / \ /                        \
  urn:example:animal:ferret:nose
In this example (http://yahoo.com/abc/bcd.html;_x=12323;_y=2322;), these parameters are part of the path component. Essentially, this is just a convention used within that application for it to be able to identify resources.
Generally speaking, although paths in HTTP URIs are often similar to what you would find on a file system, they don't have to be related to the file system structure in any way. This is purely an implementation decision from the engine that dereferences the URI (i.e. the HTTP server implementation and what dispatches the request to whatever will produce a response).
Strictly speaking, the query is also an integral part of the URI (so many discussions you'll find on "RESTful" URIs are pointless, except for some SEO techniques).
Because parameters are passed via the query component when using HTML forms, many HTTP frameworks expose its content by splitting the query for you into a map. For example, in a Java/Servlet context, the query string (getQueryString()) automatically populates the parameters returned by getParameter(...).
If you want to get parameters from bcd.html;_x=12323;_y=2322;, you'll have to split the path yourself.
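Splitting the path yourself could look roughly like the sketch below. It is only an illustration: the class PathParameters and the method ofLastSegment are hypothetical names, it assumes a hierarchical URI, it looks only at the last path segment, and it returns names and values raw, without percent-decoding.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; not part of the Servlet API or the JDK.
final class PathParameters {

    /** Extracts ";name=value" parameters from the last path segment of a hierarchical URI. */
    static Map<String, String> ofLastSegment(String uri) {
        String rawPath = URI.create(uri).getRawPath();
        String lastSegment = rawPath.substring(rawPath.lastIndexOf('/') + 1);
        // parts[0] is the segment name (e.g. "bcd.html"); the rest are parameters.
        String[] parts = lastSegment.split(";");
        Map<String, String> params = new LinkedHashMap<>();
        for (int i = 1; i < parts.length; i++) {
            if (parts[i].isEmpty()) {
                continue; // skip empty entries such as ";;"
            }
            int eq = parts[i].indexOf('=');
            String name = eq < 0 ? parts[i] : parts[i].substring(0, eq);
            String value = eq < 0 ? "" : parts[i].substring(eq + 1);
            params.put(name, value);
        }
        return params;
    }

    public static void main(String[] args) {
        System.out.println(ofLastSegment("http://yahoo.com/abc/bcd.html;_x=12323;_y=2322;"));
        // prints {_x=12323, _y=2322}
    }
}
```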
I have to create a file based on a string provided to me.
For this example, let's say the file name is "My file w/ stuff.txt".
When Java creates the file using
File file = new File("My file w/ stuff.txt")
Even though the default Windows separator is '\', it assumes that the '/' slash is a file separator. So a future call to file.getName() would return " stuff.txt". This causes problems for my program.
Is there any way to prevent this behaviour?
According to this Wikipedia page, the Windows APIs treat / as equivalent to \. Even if you somehow managed to embed a / in a pathname component in (for example) a File object, the chances are that Windows at some point will treat it as a path separator.
So your options are:
Let Windows treat the / as it would normally; i.e. let it treat the character as a pathname separator. (Users should know this. It is a "computer literacy" thing ... for Windows users.)
As above, but with a warning to the user about the /.
Check for / AND \ characters, and reject both saying that a filename (i.e. a pathname component) cannot contain pathname separators.
Use some encoding scheme to encode reserved characters before attempting to create the files. You must also then apply the (reverse) decoding scheme at all points where you need to show the user their "file name with slashes".
Note: if the user can see the actual file paths (e.g. via a command shell) you can't hide the encoding / decoding from them. Arguably, that is worse than the "problem" you were trying to solve in the first place.
There is no escaping scheme that the Windows OS will accept for this purpose. You would need to use something like % encoding; e.g. replace / with %2F. Basically you need to "hide" the slash character from Windows by replacing it with other characters in the OS-level pathname!
The best option depends on details of your application; e.g. whether you can report problems to the person who entered the bogus filename.
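For the encoding option, a minimal sketch might look like this. The class FileNameCodec and its two methods are invented for illustration, and the choice to escape only /, \ and % (the escape character itself) is an assumption; other characters Windows rejects, such as : or ?, are not handled.

```java
import java.io.File;

// Hypothetical codec: hides separators from the OS by percent-encoding them.
final class FileNameCodec {

    /** Replaces '/', '\\' and '%' with %XX so the result is a single path component. */
    static String encode(String name) {
        StringBuilder out = new StringBuilder();
        for (char c : name.toCharArray()) {
            if (c == '/' || c == '\\' || c == '%') {
                out.append(String.format("%%%02X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Reverses encode(); assumes the input was produced by encode(). */
    static String decode(String encoded) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '%' && i + 2 < encoded.length()) {
                out.append((char) Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits just consumed
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String stored = encode("My file w/ stuff.txt");
        System.out.println(new File(stored).getName()); // My file w%2F stuff.txt
        System.out.println(decode(stored));             // My file w/ stuff.txt
    }
}
```

Escaping % as well keeps the scheme reversible: a name that already contains "%2F" is stored as "%252F" and decodes back unchanged.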
If a string is being provided to you (from an external source), it doesn't sound like you can prevent that string from containing certain characters. If you have some sort of GUI to create the string, then you can always restrict it there. Otherwise, whatever method is creating your file should check for a slash and either return an error or handle it as you see fit.
Since neither forward nor backward slashes are allowed in Windows file names, they should be cleaned out of the Strings used to name files.
Well, how could you stop it being a folder separator? It is a folder separator. If you could just decide for yourself what was and what wasn't a folder separator, then the whole system would come crashing down.