My Apache Spark application takes various input files and stores the results and logs in other files. The input files are provided along with the application which is supposed to run on the Amazon cloud (EMR seemed preferable to EC2).
Now, I know that I'm supposed to create an uber-jar containing my input files and the application that accesses them. However, how do I retrieve the generated files from the cloud, once the execution finishes?
As additional information, the files are created and written using relative paths in the code.
Assuming you mean that you want to access the output generated by the Spark application outside the cluster, the usual thing to do is to write it to S3. Then you may of course read the data directly from S3 from outside the EMR cluster.
Related
I have a general question about the architecture I should use for my specific problem.
I have a .TSV file with some information, and my task is to create a REST API app that consumes this .TSV file and exposes 3 REST API endpoints. Each endpoint will return JSON data that I processed from the .TSV file.
My question is: should I create a POST method that uploads the TSV file, save it (e.g. to the session), and then run the logic through the API endpoints?
Or should I POST the content of the TSV file as JSON in every request to the specific endpoint?
I don't know how to glue it all together.
There is no requirement for a DB. The program will be tested just with numerous requests through the API, and I don't know how to process or store the .TSV content in my app so that one user can call all three endpoints sequentially on the same data without re-uploading the TSV file.
It's better to upload the file once and then do the processing on the server. The file is uploaded in a single request, which is better than sending it with multiple requests.
I believe the solution will depend on the size of the file. Storing the file in memory is not a good approach if the file is very large. Saving the file in a session may not be good either, because you will not be able to scale your service out in the future. Even storing the file in a /tmp directory can be a bad approach, because the solution still does not scale.
A good idea is to use a storage service like AWS S3, Google Firebase, or a similar one. When one of your three REST endpoints is called, your application checks whether that file has already been processed; if not, it reads the file, processes it, and saves the result to your S3 bucket (if you don't want to keep the processed files, you can use a retention policy on S3 to delete them after a set period of time).
Only after this do you return the result. As you can see, this is a synchronous solution.
If the file processing needs a lot of CPU and takes a long time, you will need an asynchronous solution. Instead of processing the file directly when the REST API is called, you create another application that reads the file from S3, processes it, and saves the result, all asynchronously. Your REST API then only fetches the result from S3 and returns it.
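The synchronous flow described above (upload once, process on first use, reuse the stored result afterwards) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the class name TsvResultStore, its methods, and the file id are all hypothetical, and two in-memory maps stand in for the storage bucket so that the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: the two maps stand in for a storage bucket
// (one for uploaded files, one for processed results).
final class TsvResultStore {
    private final Map<String, String> uploads = new ConcurrentHashMap<>();
    private final Map<String, List<String[]>> processed = new ConcurrentHashMap<>();

    // Called once by the upload endpoint; the client sends fileId with later requests.
    void upload(String fileId, String tsvContent) {
        uploads.put(fileId, tsvContent);
        processed.remove(fileId); // a re-upload invalidates the stored result
    }

    // Called by each REST endpoint: parse on first use, reuse the stored rows afterwards.
    List<String[]> rows(String fileId) {
        String content = uploads.get(fileId);
        if (content == null) {
            throw new IllegalArgumentException("unknown file id: " + fileId);
        }
        return processed.computeIfAbsent(fileId, id -> parse(content));
    }

    // Splits the content into lines and each non-empty line into tab-separated fields.
    static List<String[]> parse(String tsv) {
        List<String[]> rows = new ArrayList<>();
        for (String line : tsv.split("\\R")) {
            if (!line.isEmpty()) {
                rows.add(line.split("\t", -1));
            }
        }
        return rows;
    }
}
```

Each of the three endpoints would call rows(fileId) and build its JSON from the result. Swapping the maps for a real storage service keeps the same shape and lets several server instances share the data.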
We have a use case of downloading a large file hosted on a Network File System. That means it will be accessible through nfs://.
I need a Java/Scala library that can access/read/move the file to my local machine or to HDFS.
From what I have read so far, there are issues with the available APIs:
1. WebNFS changed to Yanfs
2. Yanfs has no activity: https://java.net/projects/yanfs/sources/svn/show
3. No maven repository dependency to use in project
4. No Documentation
If I have to access the files programmatically (not by mounting) using Java/Scala, what is my best bet?
I am working on a Java Web-Application project using servlets, eclipse, and tomcat.
I would like to be able to dynamically store/create persistent files from servlets and allow the user to access the files using a link, without storing the files in the database.
I have read that getServletContext().getRealPath("/") is volatile and gets reset every time the server is restarted.
I have also read that creating a directory like "$HOME/.ourapp" would solve this. However, I cannot find out how to set up Tomcat (running from Eclipse) so that the user can access the files through a link.
Question: how do I set up Eclipse with Tomcat so that the website "http://localhost/" and the file "http://localhost/temp-xx.txt" are served from the same host, where "temp-xx.txt" is persistent data generated dynamically by a servlet, accessible to the user, and not deleted when the server is restarted?
This gets complicated, because Tomcat can serve files using the DefaultServlet (it just sends files back to the client, exactly as you'd expect from a web server), but it caches files internally, so modifying the file system underneath it can cause some surprising behavior.
You can disable caching for the DefaultServlet but I've seen reports that it still behaves in surprising ways. The only fool-proof solution I've seen is to write your own servlet that streams the files from wherever they are stored.
But writing your own streaming servlet isn't as simple as you might think. If you want it to be high-performance, you'll want to enable all the nice HTTP features like range-requests, eTags, If-Modified-Since and all that stuff that the DefaultServlet already provides. Perhaps you should start with using the DefaultServlet and see how far it will get you.
The configuration is actually really easy: just add a <Resources> element to your META-INF/context.xml file and use a nested <PostResources> element. You can find the documentation in the Tomcat users' guide for resources.
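As a rough illustration, assuming Tomcat 8 or later (where extra resource sets are declared as elements nested inside <Resources>), such a context.xml might look like the fragment below. The base directory and the /files mount point are assumptions made up for this example; adjust them to wherever your servlet writes its files.

```xml
<!-- META-INF/context.xml: a sketch; the base path and mount point are assumptions -->
<Context>
  <Resources>
    <PostResources className="org.apache.catalina.webresources.DirResourceSet"
                   base="/home/appuser/.ourapp"
                   webAppMount="/files" />
  </Resources>
</Context>
```

With this in place, a file written to /home/appuser/.ourapp/temp-xx.txt should be served under the web application's /files/temp-xx.txt path, subject to the caching caveats mentioned above.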
I am using CloudBees to deploy my Java EE application. I need to write and read files there, and I can't find any cloud file system from CloudBees. Please suggest a free cloud file storage system and Java code to access it.
Using jclouds you can store stuff in several different clouds while using a consistent API. http://www.jclouds.org/
You can store files; however, they will be ephemeral and not shared in the cluster. To get persistent, shared storage, you would need to store them in a DB, S3, or similar (there is also a WebDAV option).
The file system on RUN@cloud is indeed neither persistent nor distributed. Files stored there will "disappear" when the application is redeployed or restarted, and will not be replicated if the application scales out to multiple nodes in the cluster.
The best option AFAIK is to use a storage service (Amazon S3, to benefit from low network latency from your RUN instance) through the jclouds neutral API (http://www.jclouds.org/documentation/quickstart/aws/), which can also be configured to use filesystem storage (http://www.jclouds.org/documentation/quickstart/filesystem/) so that you can test on your own computer, and to cache the filestore content in the temp directory, as defined by System.getProperty("java.io.tmpdir"), to get the best performance.
This requires a "warm-up" phase for the cache to be populated from the filestore, but you then get the best of both worlds.
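The temp-directory cache idea can be sketched as a small read-through cache. This is only an illustration under assumptions: TempDirCache and its remoteFetch function are hypothetical names, the function stands in for whatever blob-store download you use (jclouds or otherwise), and keys are assumed to be plain file names without path separators. It relies on java.io.tmpdir, the standard JVM system property for the temp directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.function.Function;

// Hypothetical read-through cache: remoteFetch stands in for a blob-store download.
final class TempDirCache {
    private final Path cacheDir;
    private final Function<String, byte[]> remoteFetch;

    TempDirCache(String name, Function<String, byte[]> remoteFetch) {
        // java.io.tmpdir is the standard system property naming the JVM temp directory.
        this.cacheDir = Paths.get(System.getProperty("java.io.tmpdir"), name);
        this.remoteFetch = remoteFetch;
    }

    // Returns the cached copy if present; otherwise fetches, stores locally, and returns it.
    byte[] get(String key) {
        try {
            Files.createDirectories(cacheDir);
            Path local = cacheDir.resolve(key); // assumes key is a plain file name
            if (Files.exists(local)) {
                return Files.readAllBytes(local);
            }
            byte[] data = remoteFetch.apply(key);
            Files.write(local, data);
            return data;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the local copy vanishes on redeploy, the first read of each key after a restart goes to the remote store again, which is the "warm-up" phase mentioned above.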
I have a webapp with an architecture I'm not thrilled with. In particular, I have a servlet that handles a very large file upload (via commons-fileupload), then processes the file, passing it to a service/repository layer.
What has been suggested to me is that I simply have my servlet upload the file, and a service on the backend do the processing. I like the idea, but I have no idea how to go about it. I do not know JMS.
Other details:
- App is a GWT app split into the recommended client/server/shared subpackages, using an MVP architecture.
- Currently, I am only running in GWT hosted mode, but am planning to move to Tomcat in the very near future.
I'm perfectly willing to learn whatever I need to in order to get this working (in fact, that's the point of writing the app). I'm not expecting anyone to write code for me, but can someone point me in the right direction to get started?
There are many options for this scenario, but the simplest may be just copying the uploaded file to a known location on the file system, and have a background daemon monitor the location and process when it finds it.
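One way to sketch the "known location plus background monitor" idea is a polling pass that processes whatever it finds and then moves it out of the way. The names here (InboxPoller, pollOnce, the inbox and done directories, the processor callback) are hypothetical and chosen only for illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of a single polling pass over an upload directory.
final class InboxPoller {
    // Processes each regular file found in inbox, then moves it to done.
    // Returns the names of the files handled in this pass.
    static List<String> pollOnce(Path inbox, Path done, Consumer<Path> processor)
            throws IOException {
        Files.createDirectories(done);

        // Snapshot the directory first so files are not moved while it is being listed.
        List<Path> pending = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(inbox)) {
            for (Path file : files) {
                if (Files.isRegularFile(file)) {
                    pending.add(file);
                }
            }
        }

        List<String> handled = new ArrayList<>();
        for (Path file : pending) {
            processor.accept(file);
            Files.move(file, done.resolve(file.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
            handled.add(file.getFileName().toString());
        }
        return handled;
    }
}
```

A daemon could call pollOnce on a schedule, for example from a ScheduledExecutorService. To keep the poller from picking up half-written uploads, the servlet can write to a temporary name elsewhere and rename the file into the inbox only once the upload is complete.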
@Jason, there are many ways to solve your problem.
i) Dump your file data into a database column of type BLOB, and have a DB polling thread that periodically polls the table for newly inserted files.
ii) Dump the file into the file system and have a file monitoring process.
The benefit of i) over ii) is that the DB is a centralized and fast resource, whereas file systems are generally slow and non-centralized in nature.
So basically the servlet would dump either to the DB or to the file system. As for who processes the dumped file: a) it could be the monitoring process discussed above, or b) you can use JMS, which is asynchronous in nature; that is, the servlet puts a trigger event in a queue, which asynchronously triggers a new processing thread.
Don't introduce JMS into your system unnecessarily if you are OK with a monitoring process.
This sounds interesting and familiar to me :). We do it in a similar way.
We have four projects, all of which include file upload and file processing (Image/Video/PDF/Docs), etc. So we created a single project to handle all file processing; it works as follows:
All four projects and the File Processor use Amazon S3/our file storage, so file storage is shared among all five projects.
We make a request to the File Processor, providing details in XML via an HTTP request, which include the file path on S3/Storage, AWS authentication details, and file conversion/processing parameters. The File Processor does the processing, puts the processed files on S3/Storage, constructs XML with the details of the processed files, and sends the XML in the response.
We use Spring Framework and Tomcat.
Since this is foremost a learning exercise, you need to pick an easy to use JMS provider. This discussion suggested FFMQ just one year ago.
Since you are starting with a simple processor, you can keep it simple and use a JMS Queue.
In the simplest form, each message sent by the servlet has to correspond to a single job. You can either put the entire payload of the upload in the message, or just send a filename as a reference to the content. These are details you can refactor later.
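To make the "one message per job, carrying a filename reference" shape concrete without pulling in a JMS provider, here is a sketch that uses an in-process java.util.concurrent queue as a stand-in. It is not JMS: a real JMS queue also works across JVMs and can persist messages. The names UploadJobQueue, Job, submit and take are hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// In-process stand-in for a JMS queue; illustrates the message shape only.
final class UploadJobQueue {
    // One message per job; it carries only a reference to the uploaded file.
    static final class Job {
        final String fileName;

        Job(String fileName) {
            this.fileName = fileName;
        }
    }

    private final BlockingQueue<Job> queue = new LinkedBlockingQueue<>();

    // Producer side (the servlet): enqueue the job and return immediately.
    void submit(String fileName) {
        queue.add(new Job(fileName));
    }

    // Consumer side (the processor): wait up to timeoutMillis for the next job.
    // Returns null if no job arrived in time.
    Job take(long timeoutMillis) throws InterruptedException {
        return queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```

Sending only the filename keeps messages small, which matters for very large uploads. With a real provider, submit would roughly correspond to sending a message to the queue and take to receiving one on the processor side.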
On the processor side, if you are using Java EE, you can use a message-driven bean (MDB). If you are not, then I would suggest a 3-JVM solution: one each for Tomcat, the JMS server, and the message processor. This article includes the basics of a message consuming client.