Spring Boot Hadoop, WebHDFS and Apache Knox - Java

I have a Spring Boot application which accesses HDFS through WebHDFS, proxied by Apache Knox and secured by Kerberos. I created my own KnoxWebHdfsFileSystem with a custom scheme (swebhdfsknox) as a subclass of WebHdfsFileSystem which only changes the URLs to contain the Knox proxy prefix. So it effectively remaps requests from the form:
http://host:port/webhdfs/v1/...
to the Knox one:
http://host:port/gateway/default/webhdfs/v1/...
I do this by overriding two methods:
public URI getUri()
URL toUrl(Op op, Path fspath, Param<?, ?>... parameters)
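Roughly, the subclass looks like this (a trimmed sketch, not the full class; toUrl is package-private in hadoop-hdfs 2.6.0, so the subclass has to live in the org.apache.hadoop.hdfs.web package, and the gateway/topology prefix is the one from my URLs above):

// Sketch only: error handling and the fs.swebhdfsknox.impl registration are omitted.
package org.apache.hadoop.hdfs.web;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.web.resources.HttpOpParam.Op;
import org.apache.hadoop.hdfs.web.resources.Param;

public class KnoxWebHdfsFileSystem extends WebHdfsFileSystem {

    // Knox prefix for the default topology (adjust to your gateway/topology)
    private static final String KNOX_PREFIX = "/gateway/default";

    @Override
    public String getScheme() {
        // custom scheme; mapped to this class via fs.swebhdfsknox.impl in the configuration
        return "swebhdfsknox";
    }

    @Override
    public URI getUri() {
        try {
            // report the custom scheme, keep the authority (host:port)
            return new URI(getScheme(), super.getUri().getAuthority(), null, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalStateException(e);
        }
    }

    @Override
    URL toUrl(Op op, Path fspath, Param<?, ?>... parameters) throws IOException {
        // let the parent build the plain WebHDFS URL, then splice in the proxy prefix:
        // /webhdfs/v1/... -> /gateway/default/webhdfs/v1/...
        URL url = super.toUrl(op, fspath, parameters);
        return new URL(url.getProtocol(), url.getHost(), url.getPort(),
                KNOX_PREFIX + url.getFile());
    }
}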
So far so good. I let Spring Boot create an FsShell for me and use it for various operations such as listing files, mkdir etc. All work fine. Except copyFromLocal which, as documented, requires two steps and a redirect. On the last step, when the filesystem tries to PUT to the final URL received in the Location header, it fails with this error:
org.apache.hadoop.security.AccessControlException: Authentication required
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:334) ~[hadoop-hdfs-2.6.0.jar:na]
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$200(WebHdfsFileSystem.java:91) ~[hadoop-hdfs-2.6.0.jar:na]
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$FsPathOutputStreamRunner$1.close(WebHdfsFileSystem.java:787) ~[hadoop-hdfs-2.6.0.jar:na]
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:54) ~[hadoop-common-2.6.0.jar:na]
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:112) ~[hadoop-common-2.6.0.jar:na]
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:366) ~[hadoop-common-2.6.0.jar:na]
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:338) ~[hadoop-common-2.6.0.jar:na]
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:302) ~[hadoop-common-2.6.0.jar:na]
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1889) ~[hadoop-common-2.6.0.jar:na]
at org.springframework.data.hadoop.fs.FsShell.copyFromLocal(FsShell.java:265) ~[spring-data-hadoop-core-2.2.0.RELEASE.jar:2.2.0.RELEASE]
at org.springframework.data.hadoop.fs.FsShell.copyFromLocal(FsShell.java:254) ~[spring-data-hadoop-core-2.2.0.RELEASE.jar:2.2.0.RELEASE]
I suspect the problem is somehow the redirect, but I can't figure out what is wrong here. If I do the same requests via curl, the file is successfully uploaded to HDFS.

This is a known issue with using existing Hadoop clients against Apache Knox using the HadoopAuth provider for Kerberos on Knox. If you were to use curl or some other REST client it would likely work for you. The existing Hadoop Java client doesn't expect a SPNEGO challenge from the DataNode - which is what the PUT in the second step is talking to. The DataNode expects the block access token/delegation token issued by the NameNode in the first step to be present. The Knox gateway however will require SPNEGO authentication for every request to that topology.
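To make the flow concrete, here is roughly what curl does successfully when driven by hand, sketched in Java with plain HttpURLConnection (hypothetical host and file names; the SPNEGO/credential setup is omitted, which is exactly the part the DataNode-bound request needs):

import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class TwoStepCreate {
    public static void main(String[] args) throws IOException {
        // Step 1: PUT op=CREATE to the gateway. WebHDFS answers 307 and puts
        // the DataNode write location into the Location header.
        URL create = new URL(
                "https://host:8443/gateway/default/webhdfs/v1/tmp/test.txt?op=CREATE");
        HttpURLConnection step1 = (HttpURLConnection) create.openConnection();
        step1.setRequestMethod("PUT");
        step1.setInstanceFollowRedirects(false); // capture the redirect instead of following it
        int status = step1.getResponseCode();    // expected: 307 Temporary Redirect
        String location = step1.getHeaderField("Location");
        step1.disconnect();

        // Step 2: PUT the actual bytes to the redirect target. This second
        // request is the one that must answer the gateway's SPNEGO challenge.
        HttpURLConnection step2 = (HttpURLConnection) new URL(location).openConnection();
        step2.setRequestMethod("PUT");
        step2.setDoOutput(true);
        try (OutputStream out = step2.getOutputStream()) {
            Files.copy(Paths.get("local-file.txt"), out);
        }
        System.out.println(status + " then " + step2.getResponseCode()); // expected: 201 Created
    }
}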
This issue is on the roadmap to be addressed, and it will likely become more pressing as interest moves toward access from inside the cluster rather than only accessing resources through the gateway from the outside.
The following JIRA tracks this item and, as you can see from the title, is related to DistCp, which is a similar use case:
https://issues.apache.org/jira/browse/KNOX-482
Feel free to take a look and lend a hand with testing or developing - it would all be most welcome!
Another possibility would be to change the Hadoop Java client to deal with a SPNEGO challenge on the DataNode request as well.

Related

How to secure google cron service tasks on GAE flexible env?

I want a URL:
To be called only by the google cron service
Not to be called by a user in a web browser
What's in the Google docs didn't work: when the cron service calls the servlet, it also gives me a 403 error - forbidden access...
And there is no security-related information regarding the app.yaml file for the flexible env.
Two observations I have made:
Google states that "Google App Engine issues Cron requests from the IP address 0.1.0.1". But I got another IP address launching the cron job:
From this IP address, the HTTP headers actually contain X-Appengine-Cron (with the value true)
Do you have any ideas?
The referenced doc snippet mentioning the securing method based on a login: admin config in the handlers section of the app.yaml file is incorrect - the handlers section is applicable to the (non-Java) standard environment app.yaml, not the flexible environment one. So you might want to remove such undocumented config, just to be sure it doesn't have some unexpected/undesired negative effect.
Checking just the X-Appengine-Cron header should be sufficient: it can only be set by the cron service of your app. From Securing URLs for cron:
Requests from the Cron Service will also contain a HTTP header:
X-Appengine-Cron: true
The X-Appengine-Cron header is set internally by Google App Engine. If your request handler finds this header it can trust that the request is a cron request. If the header is present in an external user request to your app, it is stripped, except for requests from logged in administrators of the application, who are allowed to set the header for testing purposes.
As for why exactly the response to the cron request is 403 - you should show your handler code which is (most likely) the one responsible for building the reply.
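For illustration, a minimal handler-side check could look like this (a hypothetical servlet sketch; class and path names are made up):

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical cron handler: rejects anything that does not carry the
// header, which App Engine strips from external requests.
public class CronTaskServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        if (!"true".equals(req.getHeader("X-Appengine-Cron"))) {
            resp.sendError(HttpServletResponse.SC_FORBIDDEN); // not the cron service
            return;
        }
        // ... perform the scheduled work ...
        resp.setStatus(HttpServletResponse.SC_OK);
    }
}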

Prerender.io not caching pages - followed all steps as per documentation

We are trying to use prerender.io with our application konnectnow.com, developed in AngularJS, Spring and Hibernate and hosted on an Amazon server.
Here are the steps I followed:
Signed up at prerender.io and got the token: cFeRZcsv3JnAftreuhMO
Checked the documentation, understood that I need to install a middleware, and decided to use the Spring one.
Added the filter to web.xml and the dependency to the pom as mentioned at https://github.com/greengerong/prerender-java
Added #! to the URLs in all the pages.
Restarted tomcat server.
Logged into prerender.io and found that nothing is getting crawled.
For testing purposes the URL konnectnow.com/#!/planpage was changed to konnectnow.com/?_escaped_fragment_=/planpage
Nothing comes up; I got an error that the page isn't working.
Checked Crawl Stats at prerender.io and found:
Status Code: 505, Cache Hit: Miss, Response Time(sec): 1.51sec, URL:
http://localhost:8080/#!/planpage
Not sure why it uses localhost.
Can someone help me make this work?
We recommend using HTML5 push state instead of the #! in your URLs if possible. HTML5 push state is better since nothing after a # is sent to the server, which can lead to issues for the crawlers that are detected by their user agent (Facebook, Twitter, etc.).
You should set the forwardedURLHeader in order to have the Prerender Java middleware use a different host for your website instead of your proxy URL.
https://github.com/greengerong/prerender-java#forwardedurlheader
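If you register the filter from Java config rather than web.xml, a sketch might look like this (filter class name taken from the prerender-java README; the header name is an example and the token a placeholder; note the registration class moved packages between Spring Boot versions):

import org.springframework.boot.web.servlet.FilterRegistrationBean; // org.springframework.boot.context.embedded in older Boot versions
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import com.github.greengerong.PreRenderSEOFilter;

@Configuration
public class PrerenderConfig {

    @Bean
    public FilterRegistrationBean prerenderFilter() {
        FilterRegistrationBean registration = new FilterRegistrationBean();
        registration.setFilter(new PreRenderSEOFilter());
        registration.addInitParameter("prerenderToken", "YOUR_NEW_TOKEN");
        // header that carries the original request URL when behind a proxy (example name)
        registration.addInitParameter("forwardedURLHeader", "X-Forwarded-Url");
        registration.addUrlPatterns("/*");
        return registration;
    }
}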
I also see that you posted your prerender token publicly so we regenerated your token to prevent someone else from using it. Please find your new token when you log into your Prerender.io account. I've also emailed you there.

Hadoop WebHDFS Java Client API enable SSL and Basic Authentication

I have a Spring Boot application that uses spring-yarn-boot:2.2.0.RELEASE to get access to a Hadoop filesystem (HDFS). Operations that I do are LISTSTATUS, GETFILESTATUS and OPEN (to read a file). HDFS URI is specified through application.properties:
spring.hadoop.fsUri=webhdfs://127.0.0.1:50070/webhdfs/v1/
I create a bean to which I provide the Hadoop Configuration (which Spring somehow automagically prepares for me on startup):
// SimplerFileSystem wraps the FileSystem with convenience overloads
SimplerFileSystem fs = new SimplerFileSystem(FileSystem.get(configuration));
// FsShell offers shell-like operations (ls, cat, copyFromLocal, ...) on the same configuration
FsShell shell = new FsShell(configuration);
And everything works well as expected, but problems came when I got two new requirements.
The first is that HDFS will be protected with SSL from now on. I can't seem to find any way to tell my application that the fsUri starting with webhdfs:// is actually an https connection. And if I give the https URL directly, I get an exception:
java.io.IOException: No FileSystem for scheme: https
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
... which is caused by this code: FileSystem.get(configuration).
This is driving me crazy; I can't seem to find a way past it.
The second requirement is that I need to authenticate against WebHDFS with basic authentication. For this I also can't find any means in the client API.
Has anyone done this before and has any instructions to share? Or does anyone know a different client API that I can use to accomplish this?
One option is to implement the REST calls myself with RestTemplate or any other REST service consumer API, but this doesn't look like such a special use case, so I'm really hoping something has been done already.
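For the record, that fallback is only a few lines - a sketch with a hypothetical host, path and credentials (SSL trust setup not shown):

import java.nio.charset.StandardCharsets;
import java.util.Base64;

import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.http.HttpMethod;
import org.springframework.http.ResponseEntity;
import org.springframework.web.client.RestTemplate;

public class WebHdfsRestClient {
    public static void main(String[] args) {
        RestTemplate rest = new RestTemplate();

        // Basic auth header built by hand; hypothetical credentials.
        HttpHeaders headers = new HttpHeaders();
        String auth = Base64.getEncoder()
                .encodeToString("user:password".getBytes(StandardCharsets.UTF_8));
        headers.set(HttpHeaders.AUTHORIZATION, "Basic " + auth);

        // Plain WebHDFS REST call: list the /tmp directory.
        ResponseEntity<String> response = rest.exchange(
                "https://127.0.0.1:50470/webhdfs/v1/tmp?op=LISTSTATUS",
                HttpMethod.GET, new HttpEntity<Void>(headers), String.class);
        System.out.println(response.getBody());
    }
}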
EDIT:
Found a solution to the HTTPS problem: one should use swebhdfs:// as the URL prefix and everything will work. Still haven't found a solution to the basic auth problem.
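For example, the property above becomes something like this (50470 being the usual dfs.namenode.https-address port - adjust to your cluster):
spring.hadoop.fsUri=swebhdfs://127.0.0.1:50470/webhdfs/v1/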

Hadoop Java Client API messes up my fsURI

I am trying to access HDFS in the Hadoop Sandbox with the help of the Java API from a Spring Boot application. To specify the URI to access the filesystem I use the configuration parameter spring.hadoop.fsUri. HDFS itself is protected by Apache Knox (which to me should act just as a proxy that handles authentication). So if I call the proxy URI with curl, I use the exact same semantics as I would without Apache Knox. Example:
curl -k -u guest:guest-password https://sandbox.hortonworks.com:8443/gateway/knox_sample/webhdfs/v1?op=GETFILESTATUS
Problem is that I can't access this gateway using the Hadoop client library. Root URL in the configuration parameter is:
spring.hadoop.fsUri=swebhdfs://sandbox.hortonworks.com:8443/gateway/knox_sample/webhdfs/v1
All the requests fail with a 404 error, and the reason why is visible from the logs:
2015-11-19 16:42:15.058 TRACE 26476 --- [nio-8090-exec-9] o.a.hadoop.hdfs.web.WebHdfsFileSystem : url=https://sandbox.hortonworks.com:8443/webhdfs/v1/?op=GETFILESTATUS&user.name=tarmo
It destroys my originally provided fsUri. If I debug the internals of the Hadoop API, I see that it takes only the authority part sandbox.hortonworks.com:8443 and appends /webhdfs/v1/ to it from a constant. So whatever my original URI is, in the end it will be https://my-provided-hostname/webhdfs/v1. I understand that it might have something to do with the swebhdfs:// beginning, but I can't use https:// directly because in that case an exception is thrown saying there is no such filesystem as https.
Googling this, I found an old mailing list thread where someone had the same problem, but no one ever answered the poster.
Does anyone know what can be done to solve this problem?
I apologize for being so late in this response.
You may be able to leverage the Apache Knox Default Topology URL. In your description, you happen to be using a topology called knox_sample. In order to access that topology as the "Default Topology", you would have to configure it as the default topology name. See: http://knox.apache.org/books/knox-0-7-0/user-guide.html#Default+Topology+URLs
The default "Default Topology" name is sandbox

Hosting a war file on cloudfoundry with custom domain? (or alternatives)

Say I have a Java web app inside a war file that is hosted on Cloud Foundry at the URL mycoolapp.cfapps.io, which works perfectly. I now need to host it on a custom domain, mycoolapp.com, and I have purchased the domain.
What is the process to host it on my own domain? Can I do it via Cloud Foundry?
My app needs SSL. Currently https://mycoolapp.cfapps.io works, but I need it to work on my custom domain. What will be involved in this? (I think I need to get a certificate for my domain, but what next?)
In the app some confidential information is embedded in URLs (this cannot be changed), so I'd also need to ensure that the provider cannot know the URLs accessed (apart from the base URL). Can this be done? If not, what are the alternatives?
It could be done by creating a CNAME record for your app (see the Azure example here). Unfortunately, it seems that Cloud Foundry (CF) does not support this yet. As I understand it, this is because the CF router determines the exact virtual machine (and, hence, IP) by parsing the URL and determining the route from the host name (mycoolapp in your case). Ideally there would be an interface in CF where you could register all CNAME aliases for your app (as implemented for Azure websites).
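For illustration, such a record would look like this in zone-file notation (hypothetical names; note a CNAME can only sit on a subdomain like www, not on the zone apex):
www.mycoolapp.com. IN CNAME mycoolapp.cfapps.io.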
If CNAME records were supported, this would also work for HTTPS, since a CNAME basically resolves to an IP address, and there would surely be an interface for you to upload a certificate for your domain. That leads to the problems mentioned below about SSL termination. But, again, as far as I know, this is not supported by CF yet.
That is a question about the internal structure of the run.pivotal.io deployment of CF. Conceptually HTTPS will do the trick, as it encrypts URL parameters. However, I suppose that SSL terminates on the router (the certificate is issued for *.cfapps.io - a single cert for all apps - you can check this in the browser when connecting to your app over HTTPS). That likely means that internally CF has access to ALL data of your request, which leads to my question about SSL termination in CF, which currently has no answer. I hope CF will provide a way to terminate SSL on the final server processing the request.
UPDATE:
Cloud Foundry has proposed its own way to support custom domains - using the CloudFlare proxy. If using a proxy that decrypts your data is acceptable to you, it could be used.
