Java URL constructor ignores path segment - java

I am using the Java URL constructor "URL(URL context, String spec)", but the constructed URL is not what I expect: it leaves out a path segment provided in the context parameter.
As an example, this code
new URL(new URL("http://asdf.com/z"), "a/b/c");
produces a URL with value
http://asdf.com/a/b/c
So it has left out the "z" path segment.
I have two questions:
What is the meaning of "context", the first parameter here in the Javadoc? I could not find any mention of it in the URL specification, nor did I really find it explained in the Javadoc.
Is leaving out the "z" expected behavior?
Thanks!

What is the meaning of "context", the first parameter here in the Javadoc?
It's like the "base URL" of the spec parameter. If context is https://example.com, and spec is /foo, the constructor would create https://example.com/foo. It's similar to (but not exactly the same as, as we'll see later) asking "I am currently on https://example.com, and I want to go to /foo, what would my final URL be?"
Is leaving out the "z" expected behavior?
Yes. If you follow the rules for resolving a relative URL against a base URL in RFC 2396 for this case, you will reach this step:
(6) If this step is reached, then we are resolving a relative-path
reference. The relative path needs to be merged with the base
URI's path. Although there are many ways to do this, we will
describe a simple method using a separate string buffer.
(a) All but the last segment of the base URI's path component is
copied to the buffer. In other words, any characters after the
last (right-most) slash character, if any, are excluded.
(b) The reference's path component is appended to the buffer
string.
The "last segment" here refers to z, and that is not added to the buffer. Right after that, the path a/b/c "is appended to the buffer". Steps (c) onwards deal with removing . and .., which is irrelevant here.
Note that RFC 2396 doesn't say you MUST implement the algorithm in this way, but that whatever your implementation is, your output must match the output of that algorithm:
The above algorithm is intended to provide an example by which the
output of implementations can be tested -- implementation of the
algorithm itself is not required.
So yeah, this is expected. To keep the /z, you should add another / after the z:
new URL(new URL("http://asdf.com/z/"), "a/b/c")
This way the "last segment" becomes the empty string.
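To see the difference side by side, here is a minimal sketch that resolves the same spec against both contexts. The class name UrlContextDemo and the helper resolve are made up for illustration. Note that newer JDKs mark the URL constructors as deprecated, so you may see a warning when compiling.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class UrlContextDemo {
    // Hypothetical helper: resolves spec against context with the two-argument URL constructor.
    static String resolve(String context, String spec) {
        try {
            return new URL(new URL(context), spec).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // No trailing slash: "z" is the last segment of the base path and is dropped.
        System.out.println(resolve("http://asdf.com/z", "a/b/c"));  // http://asdf.com/a/b/c
        // Trailing slash: the last segment is the empty string, so "z" is kept.
        System.out.println(resolve("http://asdf.com/z/", "a/b/c")); // http://asdf.com/z/a/b/c
    }
}
```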

You can treat the context like the current directory in a file system.
With the context "http://asdf.com/z", the current directory is "http://asdf.com/", so using "a/b/c" as the spec results in the full path "http://asdf.com/a/b/c".

Related

URI getRawQuery vs getQuery

I think that using getQuery loses information, is dangerous and that instead only getRawQuery should be used, and that any query parameter values that are known to be encoded should be manually decoded (once the raw query is split on the & characters) with URLDecoder.
Case in point: Assume you have the URL www.example.com with two query parameters:
a parameter url with value www.otherexample.com?b=2&c=3
a nondescript parameter d with value 4.
The parameter url should be url-encoded, so the URI that your application sees is:
www.example.com?url=www%2Eotherexample%2Ecom%3Fb%3D2%26c%3D3&d=4
Now, if you obtain the query part with getQuery, you get the following:
url=www.otherexample.com?b=2&c=3&d=4
Notice that you've already lost information, as you can't say whether d is a query parameter of www.example.com or of www.otherexample.com.
If instead you obtain the query part with getRawQuery, you get the following:
url=www%2Eotherexample%2Ecom%3Fb%3D2%26c%3D3&d=4
This time, no information is lost and all's well. You can parse the query part and URL-decode the value of the url parameter if you like.
Am I missing anything?
You're correct.
URI.getQuery() is broken and you shouldn't use it.
Strange thing is I can't find any confirmation of this apart from your post, which made me think maybe URI.getQuery could be useful for something. But after some testing of my own I'm pretty sure it just shouldn't be used unless your application's query string doesn't follow the convention of separating arguments with ampersand.
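The difference described in the question can be reproduced with a short sketch. The http:// scheme and the trailing slash in the example URI are assumptions added so the string parses as a hierarchical URI, and parseQuery is a hypothetical helper, not part of the JDK. It assumes Java 10 or later for URLDecoder.decode(String, Charset). Keep in mind that URLDecoder follows form-encoding rules, so it also turns '+' into a space.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class RawQueryDemo {
    // Hypothetical helper: split the raw query on '&' first, then decode each name and value.
    static Map<String, String> parseQuery(URI uri) {
        Map<String, String> params = new LinkedHashMap<>();
        String raw = uri.getRawQuery();
        if (raw == null || raw.isEmpty()) {
            return params;
        }
        for (String pair : raw.split("&")) {
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(URLDecoder.decode(name, StandardCharsets.UTF_8),
                       URLDecoder.decode(value, StandardCharsets.UTF_8));
        }
        return params;
    }

    public static void main(String[] args) {
        URI uri = URI.create(
            "http://www.example.com/?url=www%2Eotherexample%2Ecom%3Fb%3D2%26c%3D3&d=4");
        System.out.println(uri.getQuery());    // decoded: inner and outer '&' look the same
        System.out.println(uri.getRawQuery()); // still percent-encoded
        System.out.println(parseQuery(uri));   // two parameters: url and d
    }
}
```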
EDIT 11/11/2019
As pointed out in a comment below, while you can use URI.getRawQuery() to work around the broken URI.getQuery() method, you can't just use the raw query as the query argument to the multi-argument URI constructor, as that constructor is also broken.
You can't use the multi-argument URI constructor if any of the query string arguments contain an ampersand. You could argue this is a bug, but the documentation of the expected behaviour contradicts itself so it's not clear which behaviour is correct. The javadoc of the multi-argument constructor says "Any character that is not a legal URI character is quoted". This implies that an escaped octet should NOT be quoted because the main class documentation includes it as a legal character ("The set of all legal URI characters consists of the unreserved, reserved, escaped, and other characters"). But further down, it documents the observed behaviour that the percent character ('%') is always quoted by the multi-argument constructors, which one assumes is without regard for whether it's part of an escaped octet.
Regardless of whether there is ever any acknowledgement that the documentation is contradictory, or what the correct behaviour should be, it is almost certain the current behaviour will never be altered. The only work-around is not to use the multi-argument constructors if you need the URI to end up containing the quoted ampersand octet "%26". Use the single-argument constructor instead, after doing your own encoding and quoting of special characters.
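A small sketch of the behaviour described above, using the five-argument constructor (scheme, authority, path, query, fragment). The host, the query values, and the helper name viaMultiArg are made up for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriConstructorDemo {
    // Hypothetical helper: builds a URI with the five-argument constructor.
    static String viaMultiArg(String query) {
        try {
            return new URI("http", "www.example.com", "/", query, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // '%' is quoted again, so an intended "%26" ends up as "%2526".
        System.out.println(viaMultiArg("url=a%26b"));
        // A literal '&' is a legal query character, so it is left alone and splits the parameter.
        System.out.println(viaMultiArg("url=a&b"));
        // The single-argument form keeps input that you encoded yourself.
        System.out.println(URI.create("http://www.example.com/?url=a%26b"));
    }
}
```

Neither multi-argument call produces a URI containing "%26", which is why doing the encoding yourself and using the single-argument constructor is the workaround.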

URI.resolve() does not support the full spectrum of allowed file name characters

I used to use a URI element to represent the base folder and URI.resolve(filename) to get the URI of the actual file I would like to write to disk.
Now I have found that, for apparent reasons, the resolve method does not support many characters that the OS supports in file names, and those have to be encoded using %HEX.
I was not aware of that limitation, nor of how far the encoding really goes. Percent-encoding is often used in parameter values, and I can barely come up with a situation where I have seen it in the path.
So is it safe to assume that using URI.resolve(URLEncoder.encode(filename)) does the trick? Are there any better alternatives or edge cases I should know about?
It's actually URI.create(en) which fails; for example, using "!##$%^&()" (which is a valid, if very strange, filename) throws IllegalArgumentException: Malformed escape pair at index 4
As for URLEncoder.encode(filename): it is deprecated, and URLEncoder.encode(filename, encoding) should be used instead.
From my experience, filename URI resolution is best handled by new File(f).toURI() as for a given abstract pathname f, it is guaranteed that:
new File(f.toURI()).equals( f.getAbsoluteFile())
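As a rough sketch of that approach, the following builds the URI from a File instead of resolving a raw file name against a base URI. The class name, the helper toUri, and the use of the temp directory as base folder are assumptions for illustration only.

```java
import java.io.File;
import java.net.URI;

final class FileToUriDemo {
    // Hypothetical helper: File.toURI() takes care of escaping characters that are illegal in a URI.
    static URI toUri(File baseDir, String filename) {
        return new File(baseDir, filename).toURI();
    }

    public static void main(String[] args) {
        File base = new File(System.getProperty("java.io.tmpdir"));
        String name = "!##$%^&()";
        URI uri = toUri(base, name);
        System.out.println(uri); // '#', '%' and '^' appear percent-encoded
        // Round trip documented for File.toURI().
        File original = new File(base, name);
        System.out.println(new File(uri).equals(original.getAbsoluteFile()));
    }
}
```

Nothing is written to disk here; the file does not have to exist for the conversion to work.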

Regex to dynamically ignore some parts of a given path

Consider that I have the following string which is a "complete path":
/A/B/C/D/E
And there is also the "simplified path" string:
/A/B/E
I have a situation where some parts of the string can be omitted and the string still represents the full path. I know this is strange, but I can't change it.
Basically for this case, I need a regex to ignore the last two paths before the current path (dynamically as I have no specific information of them), to confirm that these two strings have a correlation.
The only thing I could come up with was:
Take the current path (([^\/]+$)) from both strings and compare.
Check in Java if the complete string contains the simplified one.
But I think there must be a cleaner way to do this.
I came up with the following solution:
Search string:
[^\/]+\/[^\/]+\/([^\/]+$)
Replace string: \1
Check it here
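In Java, the search and replace above could be sketched as follows. Java's replaceAll refers to the capture group as $1 rather than \1, and the slash does not need escaping inside a Java regex. The class and method names are made up for illustration.

```java
final class SimplifiedPathDemo {
    // Replaces the last three segments with the final one, dropping the two segments before it.
    static String simplify(String completePath) {
        return completePath.replaceAll("[^/]+/[^/]+/([^/]+$)", "$1");
    }

    public static void main(String[] args) {
        String simplified = simplify("/A/B/C/D/E");
        System.out.println(simplified);                  // /A/B/E
        System.out.println(simplified.equals("/A/B/E")); // true
    }
}
```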
If both paths point to the same file/directory, then you could make use of the Files class.
It has a method, Files#isSameFile, to which you pass two Path instances; it checks whether both paths locate the same file in the file system. This simple line would check whether /A/B/E and /A/B/C/D/E are actually the same directory.
System.out.println(Files.isSameFile(Paths.get("/A/B/C/D/E"), Paths.get("/A/B/E")));
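Note that isSameFile is a file system check, not a string comparison: it declares IOException, and the files generally have to exist. The following sketch illustrates this with a temporary directory in which two differently spelled paths lead to the same file. All names here are made up, and the layout only loosely mirrors the /A/B/C/D/E example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class SameFileDemo {
    // Creates dir/E and dir/C, then compares dir/E with dir/C/../E.
    static boolean demo() {
        try {
            Path dir = Files.createTempDirectory("samefile");
            Path sub = Files.createDirectories(dir.resolve("C"));
            Path direct = Files.createFile(dir.resolve("E"));
            Path indirect = sub.resolve("..").resolve("E");
            boolean same = Files.isSameFile(direct, indirect);
            // Clean up the temporary files again.
            Files.delete(direct);
            Files.delete(sub);
            Files.delete(dir);
            return same;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```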

Will forward slash work across platforms in Path.resolve?

Since Path.resolve does not accept an array of strings, it is possible to pass a relative path containing multiple path components, e.g. "foo/bar/baz".
My question is
if the forward slash in such a relative path will work as intended across platforms?
I have seen some answers on here that allege Java treats forward slashes as a "universal separator", but no citations to support them.
/ should be a valid path separator on all major platforms of today. See for instance File.separator vs Slash in Paths (maybe it's even a dup?)
If you're the pedantic type you can use FileSystem.getSeparator.
Note that you can also do
root.resolve(Paths.get("foo", "bar", "baz"));
No. The typical / in Path objects is called a name separator. It is defined in the FileSystem object from which the Path was created.
You can retrieve it with FileSystem#getSeparator().
Returns the name separator, represented as a string.
The name separator is used to separate names in a path string. An
implementation may support multiple name separators in which case this
method returns an implementation specific default name separator. This
separator is used when creating path strings by invoking the
toString() method.
In the case of the default provider, this method returns the same
separator as java.io.File.separator.
You can retrieve a Path's FileSystem with Path#getFileSystem().
As far as I know, all typical file systems accept / as a separator (on Windows the default separator is \, but / is accepted as well), but you could write your own FileSystem implementation which doesn't.
You can first call FileSystem.getPath("foo", "bar", "baz") to get a Path and then, instead of passing a String to Path.resolve(), use the overload which accepts a Path.
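A quick way to check this on your own platform is to compare both ways of resolving. This is only a sketch against the default file system; the relative root "root" and the method names are made up, and a custom FileSystem implementation could behave differently.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;

final class ResolveSeparatorDemo {
    // Resolves a single string that contains forward slashes.
    static Path viaString(Path root) {
        return root.resolve("foo/bar/baz");
    }

    // Resolves a Path built from separate names, so no separator is hard-coded.
    static Path viaNames(Path root) {
        return root.resolve(Paths.get("foo", "bar", "baz"));
    }

    public static void main(String[] args) {
        Path root = Paths.get("root");
        System.out.println(FileSystems.getDefault().getSeparator());
        System.out.println(viaString(root));
        System.out.println(viaNames(root));
        System.out.println(viaString(root).equals(viaNames(root)));
    }
}
```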

How to escape forward slash in java so that to use it in path

I am trying to escape a forward slash in a String so that it can be used in a path in Java.
For example, the String "Test/World".
Now I want to use the above string in a path. At the same time, I have to make sure that "Test/World" appears as it is in the path. Sorry if this is a duplicate, but I couldn't find any satisfactory solution for this.
My purpose is to use the above string to create nodes in Zookeeper.
Example:
If I use the following string to create a node in Zookeeper, then I should get "Test/World" as a single node, not as separate nodes. Zookeeper accepts "/" as the path separator, which in some cases I don't want.
/zookeeper/HellowWorld/Test/World
Thanks
You should know about File.separator. This is safer than \ or / because Linux and Windows use different file separators. Using File.separator will make your program run regardless of the platform it is being run on; after all, that is the point of the JVM. A forward slash will work, but File.separator will make your end users more confident that it will.
And you don't need to escape "/". You should also see the answer to this question.
String fileP = "Test" + File.separator + "World";
In order to escape a character in Java use "\"
for example:
String strPath = "directory\\file.txt";
I believe you do not need to escape forward slashes such as: "/"
Let me rephrase your question. You are trying to create a node in zookeeper and it should be /zookeeper/HelloWorld/NodeName. But the last part "NodeName" is actually "Test/World", and you are looking for ways to escape "/" so the node name can be "Test/World".
I don't think it would work escaping the char, unless you tried with unicode.
Try \u002F which is the equivalent for /.
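One caveat worth checking before relying on this: in Java source code, a Unicode escape is translated by the compiler before the string literal is built, so the resulting String should be identical to one written with a plain slash. A minimal sketch (the class name is made up):

```java
final class UnicodeEscapeDemo {
    // The escape below is replaced by '/' at compile time.
    static String escaped() {
        return "Test\u002FWorld";
    }

    public static void main(String[] args) {
        System.out.println(escaped());                      // Test/World
        System.out.println(escaped().equals("Test/World")); // true
    }
}
```

If that holds, whatever receives the string still sees an ordinary slash.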
We are trying to solve exactly the same problem (using a filesystem path as a node name in Zookeeper) and we haven't found a way to have '/' in a node name.
One solution would be to replace '/' with some character that cannot appear in your node name. For paths that would be '/' or '\0', which won't help us in this case.
The other possibility is to replace '/' with a string of characters allowed in node names, e.g. "Test/World" -> "Test%#World", "Test%World" -> "Test%%World", and add escaping/de-escaping to saving and loading.
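The escaping and de-escaping could look roughly like this. It is a sketch of the mapping given above ('/' becomes "%#", '%' becomes "%%") and assumes that '%' and '#' are acceptable in your node names; the class and method names are made up.

```java
final class NodeNameEscaper {
    // Replaces '%' with "%%" and '/' with "%#"; all other characters are copied unchanged.
    static String escape(String name) {
        StringBuilder out = new StringBuilder();
        for (char c : name.toCharArray()) {
            if (c == '%') {
                out.append("%%");
            } else if (c == '/') {
                out.append("%#");
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Reverses escape(); rejects input that contains an unknown or dangling escape.
    static String unescape(String escaped) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < escaped.length(); i++) {
            char c = escaped.charAt(i);
            if (c != '%') {
                out.append(c);
                continue;
            }
            if (i + 1 >= escaped.length()) {
                throw new IllegalArgumentException("Dangling '%' at index " + i);
            }
            char next = escaped.charAt(++i);
            if (next == '%') {
                out.append('%');
            } else if (next == '#') {
                out.append('/');
            } else {
                throw new IllegalArgumentException("Unknown escape at index " + (i - 1));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("Test/World"));           // Test%#World
        System.out.println(escape("Test%World"));           // Test%%World
        System.out.println(unescape(escape("Test/World"))); // Test/World
    }
}
```

Escaping '%' itself is what keeps the mapping reversible: a literal "%#" in a name is stored as "%%#" and cannot be mistaken for an escaped slash.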
If there is any more straightforward way, I'd love to hear it.
I don't know anything about Zookeeper. But it looks to me as though you're trying to keep a list of strings like "zookeeper", "HellowWorld", "Test/World", that you then want to use either to create Zookeeper nodes or to create a pathname in a file system. (I'm assuming that if you're working with a file system, you're going to have a subdirectory Test and a file or subdirectory World in the Test subdirectory. If you're actually trying to create a single file or directory named Test/World, give up. Both Linux and Windows will fight with you.)
If that's the case, then don't try to represent the "path" as a simple String that you pass around in your program. Instead, represent it as a String[] or ArrayList<String>, and then convert it to a filesystem path name only when you need a filesystem path name. Or, better, define your own class with a getFilesystemPath method. Converting your list of node names to a pathname String too early, and then trying to reconstruct the list from the String later, is a poor approach because you throw away data that you need later (in particular, you're throwing away information about which / characters are separators and which ones are part of node names).
EDIT: If you also need a single path name for Zookeeper, as you mentioned in another comment, I can't help you since I don't know Zookeeper and haven't found anything in a quick look at the docs. If there is a way to escape the slash for Zookeeper, then I still recommend defining your own class, with a getFilesystemPath method and a getZookeeperPath method, since the two methods will probably return different Strings in certain cases. The class would internally keep the names as an array or ArrayList.
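A minimal sketch of such a class, assuming Java 10 or later. The class name NodePath is made up, and only getFilesystemPath is shown; a getZookeeperPath method would depend on whatever escaping Zookeeper turns out to support, so it is left out here.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

final class NodePath {
    // Node names are kept separately, so a '/' inside a name is never confused with a separator.
    private final List<String> names;

    NodePath(List<String> names) {
        this.names = List.copyOf(names);
    }

    List<String> names() {
        return names;
    }

    // Converts to a file system path only when one is needed.
    // Here a '/' inside a name does become a directory boundary.
    Path getFilesystemPath() {
        return Paths.get("", names.toArray(new String[0]));
    }

    public static void main(String[] args) {
        NodePath path = new NodePath(List.of("zookeeper", "HellowWorld", "Test/World"));
        System.out.println(path.names());             // three names, the last one intact
        System.out.println(path.getFilesystemPath()); // relative path with four elements
    }
}
```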
