I have a MapReduce job which writes its output to a file in HDFS. But instead of writing it to HDFS, I want the output to be written directly to a webpage. I have created a web project in Eclipse and written the driver, mapper and reducer classes in it. When I ran it with the Tomcat server, it didn't work.
So how can the output be displayed on a webpage?
If you are using the MapR distribution, you can write the output of your MapReduce job to the regular file system (not HDFS), but fixing your issue will require more info.
HDFS (by itself) is not really designed for low-latency random reads/writes. A few options you do have, however, are WebHDFS / HttpFS, which expose a REST API to HDFS: http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.6.0/hadoop-project-dist/hadoop-hdfs/WebHDFS.html and http://hadoop.apache.org/docs/r2.4.1/hadoop-hdfs-httpfs/. You could have the webserver pull whatever file you want and serve it on the webpage. I don't think this is a very good solution, however.
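If you do go the WebHDFS route, serving a result file is just an HTTP GET against the OPEN operation, which the webserver can stream back to the browser. A rough sketch; the NameNode host, port, and output path below are placeholders, not values from your setup:

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class WebHdfsFetch {

        // Streams an HDFS file to any OutputStream, e.g. a servlet response's stream.
        public static void copyToResponse(OutputStream out) throws Exception {
            // op=OPEN reads a file; the NameNode redirects the call to a DataNode.
            URL url = new URL(
                "http://namenode-host:50070/webhdfs/v1/user/me/job-output/part-r-00000?op=OPEN");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setInstanceFollowRedirects(true);
            try (InputStream in = conn.getInputStream()) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            } finally {
                conn.disconnect();
            }
        }
    }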
A better solution might be to have MapReduce output to HBase (http://hbase.apache.org/) and have your webserver pull from HBase. It is far better suited for low-latency random read / writes.
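For illustration, the read side on the webserver can then be a single Get against the table the job wrote to. A minimal sketch with the HBase client API, assuming hypothetical table, row key, and column names:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class JobOutputReader {

        // Looks up one row of the job's output and returns it as a String for the page.
        public static String fetchValue(String rowKey) throws Exception {
            Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("job_output"))) {
                Result result = table.get(new Get(Bytes.toBytes(rowKey)));
                byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("result"));
                return value == null ? null : Bytes.toString(value);
            }
        }
    }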
I have my data on hdfs, the folder structure is something like,
hdfs://ns1/abc/20200101/00/00/
hdfs://ns1/abc/20200101/00/01/
hdfs://ns1/abc/20200101/00/02/
......
Basically, we create a folder every minute and put hundreds of files in it.
We have a Spark (2.3) application (written in Java) which processes data on a daily basis, so the input path we use is like hdfs://ns1/abc/20200101, simple and straightforward. But sometimes a few files are corrupt or zero-sized, and this causes the whole Spark job to fail.
So is there a simple way to just ignore any bad files? I have tried --conf spark.sql.files.ignoreCorruptFiles=true, but it doesn't help at all.
Or can we use some 'file pattern' on the command line when submitting the Spark job, since those bad files usually have a different file extension?
Or, since I'm using JavaSparkContext#newAPIHadoopFile(path, ...) to read data from HDFS, is there any trick I can do with JavaSparkContext#newAPIHadoopFile(path, ...) so that it will ignore bad files?
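For reference, the daily read currently looks roughly like this (simplified; TextInputFormat and the app name stand in for what we actually use):

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class DailyRead {
        public static void main(String[] args) {
            JavaSparkContext jsc = new JavaSparkContext(new SparkConf().setAppName("daily-abc"));
            // Reads a whole day's folder tree in one go; any corrupt or zero-size file
            // under this path currently fails the entire job.
            JavaPairRDD<LongWritable, Text> rdd = jsc.newAPIHadoopFile(
                    "hdfs://ns1/abc/20200101",
                    TextInputFormat.class, LongWritable.class, Text.class,
                    jsc.hadoopConfiguration());
            System.out.println(rdd.count());
        }
    }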
Thanks.
I'm working on a small side-project for our company that does the following:
PDF-based documents received through Office 365 Outlook are temporarily stored in OneDrive, using Power Automate
Text data is extracted from the PDFs using a few Java libraries
Based on the extracted data, an appropriate filename and filepath are created
The PDFs are permanently saved in OneDrive
The issue right now is that my Java program is locally run, i.e. points 2, 3 and 4 require code running 24/7 on my PC. I'd like to transition to a cloud-based solution.
What is the easiest way to accomplish this? The solution doesn't have to be free, but shouldn't cost more than $20/mo. Our company already has an Azure subscription, though I'm not familiar yet with Azure.
What you are looking for is a solution that uses a serverless computing execution model. Azure Functions seems to be a possible choice here. It does seem to have input bindings that respond to OneDrive files, and likewise output bindings.
The cost will depend on the number of documents, not the time the solution is available. I assume we are talking about a small number of documents a month so this will come out cheaper than other execution models.
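As one possible wiring (an assumption on my part, not the only option): Power Automate could POST each PDF, base64-encoded, to an HTTP-triggered Java function that runs your existing extraction code and returns the computed name/path. A minimal skeleton with the Azure Functions Java library:

    import com.microsoft.azure.functions.ExecutionContext;
    import com.microsoft.azure.functions.HttpMethod;
    import com.microsoft.azure.functions.HttpRequestMessage;
    import com.microsoft.azure.functions.HttpResponseMessage;
    import com.microsoft.azure.functions.HttpStatus;
    import com.microsoft.azure.functions.annotation.AuthorizationLevel;
    import com.microsoft.azure.functions.annotation.FunctionName;
    import com.microsoft.azure.functions.annotation.HttpTrigger;
    import java.util.Base64;
    import java.util.Optional;

    public class ProcessPdfFunction {

        @FunctionName("ProcessPdf")  // hypothetical function name
        public HttpResponseMessage run(
                @HttpTrigger(name = "req", methods = {HttpMethod.POST},
                             authLevel = AuthorizationLevel.FUNCTION)
                HttpRequestMessage<Optional<String>> request,
                final ExecutionContext context) {

            // Assumed convention: the flow sends the PDF as a base64 string in the body.
            byte[] pdfBytes = Base64.getDecoder().decode(request.getBody().orElse(""));

            // ...run the existing PDF text extraction here and derive the target name/path...
            String suggestedPath = "/Invoices/2020/invoice-001.pdf";  // hypothetical result

            return request.createResponseBuilder(HttpStatus.OK)
                          .body(suggestedPath)
                          .build();
        }
    }

The flow can then take the returned path and do the final move within OneDrive itself, so the function never needs OneDrive credentials of its own.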
I am new to Spark. I have a scenario where I need to read and process a CSV file from AWS S3. This file is generated on a daily basis, so I need to read and process it and dump the data into Postgres.
I want to process this huge file in parallel to save time and memory.
I came up with two designs, but I am a little bit confused about Spark, as the Spark context requires a connection to be open to the S3 bucket.
Use Spark Streaming to read the CSV from S3, process it and convert it into JSON row by row, and append the JSON data to a JSONB column in Postgres.
Use Spring & Java -> download the file onto the server, then start processing and convert it into JSON.
Can anyone point me in the right direction?
If it's daily, and only 100MB, you don't really need much in the way of large-scale tooling. I'd estimate < a minute for the basic download and process, even remotely, after which comes the Postgres load, for which Postgres offers its own bulk-load path (COPY).
Try doing this locally, with an aws s3 cp to copy to your local system, then try the load into Postgres.
I wouldn't bother with any parallel tooling; even Spark is going to want to work with 32-64MB blocks, so you won't get more than 2-3 workers. And if the file is .gz, you get exactly one.
That said, if you want to learn Spark, you could do this in spark-shell. Download the file locally first though, just to save time and money.
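If you do try it in Spark, the whole pipeline is only a few lines. A sketch in Java; the path, JDBC URL, table and credentials are placeholders, and the Postgres JDBC driver is assumed to be on the classpath:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class CsvToPostgres {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("CsvToPostgres").getOrCreate();

            // Read the daily CSV; an s3a:// path works the same way once the
            // S3 connector and credentials are configured.
            Dataset<Row> df = spark.read()
                    .option("header", "true")
                    .option("inferSchema", "true")
                    .csv("/tmp/daily.csv");

            // Bulk-write to Postgres over JDBC.
            df.write()
              .mode(SaveMode.Append)
              .format("jdbc")
              .option("url", "jdbc:postgresql://db-host:5432/mydb")
              .option("dbtable", "daily_data")
              .option("user", "etl")
              .option("password", "secret")
              .save();

            spark.stop();
        }
    }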
I am going to implement a system using HDFS and Accumulo. I have a number of files in my HDFS, and I need to process them with a Spark job and save them to Accumulo. I could not find any good examples using Google.
Could someone provide an example on how to set up such a workflow?
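One common pattern (a sketch, not a verified end-to-end example) is to map each record to an Accumulo Mutation and write the pair RDD out through AccumuloOutputFormat, using the Accumulo 1.x mapreduce API. The instance name, ZooKeeper hosts, credentials, and table/column names below are all placeholders:

    import java.nio.charset.StandardCharsets;
    import org.apache.accumulo.core.client.ClientConfiguration;
    import org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class HdfsToAccumulo {
        public static void main(String[] args) throws Exception {
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("hdfs-to-accumulo"));
            JavaRDD<String> lines = sc.textFile("hdfs:///data/input");

            // Turn each line into a Mutation; the key of the pair is the target table name.
            JavaPairRDD<Text, Mutation> mutations = lines.mapToPair(line -> {
                Mutation m = new Mutation("row-" + line.hashCode());
                m.put("cf", "raw", new Value(line.getBytes(StandardCharsets.UTF_8)));
                return new Tuple2<>(new Text("my_table"), m);
            });

            // Configure AccumuloOutputFormat through a Job object, then hand its
            // Configuration to Spark.
            Job job = Job.getInstance(sc.hadoopConfiguration());
            job.setOutputFormatClass(AccumuloOutputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Mutation.class);
            AccumuloOutputFormat.setConnectorInfo(job, "accumulo_user", new PasswordToken("secret"));
            AccumuloOutputFormat.setZooKeeperInstance(job,
                    ClientConfiguration.loadDefault()
                            .withInstance("accumulo-instance")
                            .withZkHosts("zk1:2181,zk2:2181"));
            AccumuloOutputFormat.setDefaultTableName(job, "my_table");
            AccumuloOutputFormat.setCreateTables(job, true);

            mutations.saveAsNewAPIHadoopDataset(job.getConfiguration());
            sc.stop();
        }
    }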
I need to transfer files from one HDFS folder to another HDFS folder in Java code.
May I ask whether there is an API that we can call to transfer files among HDFS paths?
Also, I'd like to ask whether there is any way to invoke a MapReduce job from Java code. Of course, this Java code is not running in HDFS.
Thank you very much and have a great weekend!
May I ask whether there is an API that we can call to transfer files among HDFS paths?
Use the o.a.h.hdfs.DistributedFileSystem#rename method to move a file from one folder in HDFS to another. The method is overloaded, and one of the overloads takes Options.Rename as a parameter.
FYI ... I haven't checked the code, but I think that rename only involves changes to the namespace and not any actual block movement.
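A minimal sketch of the move, using the generic FileSystem API, which resolves to DistributedFileSystem for hdfs:// paths; the two paths below are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsMove {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);      // DistributedFileSystem when the default FS is hdfs://

            Path src = new Path("/data/incoming/part-00000");
            Path dst = new Path("/data/archive/part-00000");

            // A namespace-only operation in HDFS: no blocks are copied.
            boolean moved = fs.rename(src, dst);
            System.out.println("moved = " + moved);
        }
    }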
Also, I'd like to ask whether there is any way to invoke a MapReduce job from Java code. Of course, this Java code is not running in HDFS.
Hadoop is written in Java, so there should be a way :) Use the o.a.h.mapreduce.Job#submit and o.a.h.mapreduce.Job#waitForCompletion methods.
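A bare-bones driver showing the pattern; the identity Mapper/Reducer and the /input and /output paths below are placeholders for your own classes and paths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SubmitFromJava {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();  // point this at your cluster's *-site.xml files
            Job job = Job.getInstance(conf, "submitted-from-java");

            job.setJarByClass(SubmitFromJava.class);
            job.setMapperClass(Mapper.class);          // identity mapper/reducer as placeholders
            job.setReducerClass(Reducer.class);
            job.setOutputKeyClass(LongWritable.class); // matches what the default TextInputFormat emits
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path("/input"));
            FileOutputFormat.setOutputPath(job, new Path("/output"));

            // waitForCompletion blocks until the job finishes; submit() returns immediately.
            boolean ok = job.waitForCompletion(true);
            System.exit(ok ? 0 : 1);
        }
    }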