Spark with HDFS as input and Accumulo as output - java

I am going to implement a system using HDFS and Accumulo. I have a number of files on HDFS, and I need to process them with a Spark job and save the results to Accumulo. I could not find any good examples using Google.
Could someone provide an example of how to set up such a workflow?
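A minimal sketch of such a job in Java, assuming the Accumulo 1.x mapreduce API (AccumuloOutputFormat) and Spark's saveAsNewAPIHadoopFile; the HDFS path, instance name, ZooKeeper hosts, credentials, table name, and the row/column layout are all placeholders to replace with your own:

    import java.nio.charset.StandardCharsets;

    import org.apache.accumulo.core.client.ClientConfiguration;
    import org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class HdfsToAccumulo {
      public static void main(String[] args) throws Exception {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("hdfs-to-accumulo"));

        // Read the input files from HDFS (placeholder path).
        JavaRDD<String> lines = sc.textFile("hdfs:///user/me/input/*");

        // Turn each line into an Accumulo Mutation; the key of the pair is the target table name.
        // Row id and column family/qualifier below are placeholders for your own schema.
        JavaPairRDD<Text, Mutation> mutations = lines.mapToPair(line -> {
          Mutation m = new Mutation(new Text(String.valueOf(line.hashCode())));
          m.put(new Text("cf"), new Text("line"), new Value(line.getBytes(StandardCharsets.UTF_8)));
          return new Tuple2<>(new Text("my_table"), m);
        });

        // Configure AccumuloOutputFormat through a Hadoop Job object.
        Job job = Job.getInstance(sc.hadoopConfiguration());
        AccumuloOutputFormat.setConnectorInfo(job, "accumulo_user", new PasswordToken("secret"));
        AccumuloOutputFormat.setZooKeeperInstance(job,
            ClientConfiguration.loadDefault().withInstance("myInstance").withZkHosts("zk1:2181"));
        AccumuloOutputFormat.setDefaultTableName(job, "my_table");
        AccumuloOutputFormat.setCreateTables(job, true);

        // The output path is ignored by AccumuloOutputFormat; the mutations go to the table.
        mutations.saveAsNewAPIHadoopFile("/ignored", Text.class, Mutation.class,
            AccumuloOutputFormat.class, job.getConfiguration());

        sc.stop();
      }
    }

Newer Accumulo releases replaced these static setters with a builder-style configuration, so check the client documentation for your version.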

Related

Is there any way to read and write ORC files in Java without Hadoop?

My service receives files in different formats and extracts information from them. One of those formats is ORC. Is there a way to read it from RAM (as a byte array) without Hadoop or other additional systems? I can't find any way on Google. :(
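There is no pure in-memory entry point that I know of in the ORC Java reader: orc-core still depends on Hadoop classes such as Path and Configuration, but only as library jars, not as a running cluster. A common workaround, sketched below, is to spill the byte array to a temporary local file and read it with the orc-core API; treating the first column as a string column is just an assumption for the example:

    import java.io.File;
    import java.nio.file.Files;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    import org.apache.orc.OrcFile;
    import org.apache.orc.Reader;
    import org.apache.orc.RecordReader;

    public class OrcFromBytes {
      public static void read(byte[] orcBytes) throws Exception {
        // The reader wants a Path, so write the in-memory bytes to a temporary local file.
        File tmp = File.createTempFile("in-memory", ".orc");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), orcBytes);

        Reader reader = OrcFile.createReader(new Path(tmp.getAbsolutePath()),
            OrcFile.readerOptions(new Configuration()));
        System.out.println("schema: " + reader.getSchema());

        RecordReader rows = reader.rows();
        VectorizedRowBatch batch = reader.getSchema().createRowBatch();
        while (rows.nextBatch(batch)) {
          // Example only: treat the first column as a string column.
          BytesColumnVector col0 = (BytesColumnVector) batch.cols[0];
          for (int r = 0; r < batch.size; r++) {
            System.out.println(col0.toString(r));
          }
        }
        rows.close();
      }
    }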

reading CSV file from s3 using spark

I am new to Spark. I have a scenario where I need to read and process a CSV file from AWS S3. The file is generated on a daily basis, so I need to read and process it and dump the data into Postgres.
I want to process this huge file in parallel to save time and memory.
I came up with two designs, but I am a little confused about Spark, since the Spark context requires a connection to be open to the S3 bucket.
1. Use Spark Streaming to read the CSV from S3, process it and convert it into JSON row by row, and append the JSON data to a JSONB column in Postgres.
2. Use Spring and Java: download the file to the server, then start processing and convert it into JSON.
Can anyone point me in the right direction?
If it's daily, and only 100MB, you don't really need much in the way of large-scale tooling. I'd estimate under a minute for the basic download and processing, even remotely, after which comes the Postgres load, which Postgres offers the COPY command for.
Try doing this locally: use aws s3 cp to copy the file to your local system, then try the load into Postgres.
I wouldn't bother with any parallel tooling; even Spark is going to want to work with 32-64MB blocks, so you won't get more than 2-3 workers. And if the file is .gz, you get exactly one.
That said, if you want to learn Spark, you could do this in spark-shell. Download locally first though, just to save time and money.
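If you do end up using Spark for it, a minimal Java sketch looks like the following; it assumes the hadoop-aws (s3a) connector and the Postgres JDBC driver are on the classpath, and the bucket, key, table, host, and credentials are placeholders:

    import java.util.Properties;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class S3CsvToPostgres {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("s3-csv-to-postgres").getOrCreate();

        // S3 credentials are usually supplied via the environment, IAM role, or core-site.xml;
        // setting them here is only one option.
        spark.sparkContext().hadoopConfiguration().set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
        spark.sparkContext().hadoopConfiguration().set("fs.s3a.secret.key", "YOUR_SECRET_KEY");

        // Read the daily CSV (bucket and key are placeholders).
        Dataset<Row> df = spark.read()
            .option("header", "true")
            .option("inferSchema", "true")
            .csv("s3a://my-bucket/daily/report.csv");

        // Write straight to Postgres over JDBC (host, table, and credentials are placeholders).
        Properties props = new Properties();
        props.setProperty("user", "postgres");
        props.setProperty("password", "secret");
        props.setProperty("driver", "org.postgresql.Driver");
        df.write().mode(SaveMode.Append)
            .jdbc("jdbc:postgresql://db-host:5432/mydb", "daily_report", props);

        spark.stop();
      }
    }

Note that Spark's JDBC writer issues INSERTs over JDBC rather than using Postgres COPY, so for a single daily file the plain download-and-COPY route above may well be simpler.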

Writing mapreduce output directly onto a webpage

I have a MapReduce job which writes its output to a file in HDFS. But instead of writing it to HDFS, I want the output to be written directly to a webpage. I have created a web project in Eclipse and written the driver, mapper and reducer classes in it. When I run it with the Tomcat server, it doesn't work.
So how can the output be displayed on a webpage?
If you are using the MapR distribution, you can write the output of your MapReduce job to the file system (not HDFS), but fixing your issue will require more info.
HDFS (by itself) is not really designed for low-latency random reads/writes. A couple of options you do have, however, are WebHDFS and HttpFS, which expose a REST API to HDFS: http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.6.0/hadoop-project-dist/hadoop-hdfs/WebHDFS.html and http://hadoop.apache.org/docs/r2.4.1/hadoop-hdfs-httpfs/. You could have the web server pull whatever file you want and serve it on the webpage. I don't think this is a very good solution, however.
A better solution might be to have MapReduce output to HBase (http://hbase.apache.org/) and have your web server pull from HBase. It is far better suited to low-latency random reads/writes.
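For completeness, pulling a result file over WebHDFS from Java is a short HTTP call. A rough sketch, where the namenode host/port, file path, and user name are placeholders; in a servlet you would copy the stream to the HTTP response instead of stdout:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class WebHdfsFetch {
      public static void main(String[] args) throws Exception {
        // OPEN returns the file content; host, port, path, and user are placeholders.
        URL url = new URL("http://namenode-host:50070/webhdfs/v1/user/me/output/part-r-00000"
            + "?op=OPEN&user.name=me");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // WebHDFS redirects the OPEN call to a datanode; HttpURLConnection follows it by default.

        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
          String line;
          while ((line = in.readLine()) != null) {
            System.out.println(line); // in a servlet, write this to the response instead
          }
        }
      }
    }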

Using local file system as Flume source

I've just started learning Big Data, and at the moment I'm working on Flume. The common example I've encountered is processing tweets (the example from Cloudera) using some Java.
Just for testing and simulation purposes, can I use my local file system as a Flume source? In particular, some Excel or CSV files? Do I also need some Java code, aside from the Flume configuration file, as in the Twitter extraction?
Will this source be event-driven or pollable?
Thanks for your input.
I assume you are using a Cloudera sandbox and are talking about putting a file on the sandbox, local to the Flume agent you are planning to kick off. A Flume agent contains a:
Source
Channel
Sink
These should sit local to the Flume agent. The list of available Flume sources is in the user guide: https://flume.apache.org/FlumeUserGuide.html. You could use an Exec source if you just want to stream data from a file with the tail or cat commands.
You could also use a Spooling Directory Source, which will watch the specified directory for new files and parse events out of new files as they appear.
Have a good read of the user guide; it contains everything you need.
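As a concrete starting point, a Spooling Directory Source needs no Java code at all, only an agent configuration like the rough sketch below (the agent name, directory, and logger sink are placeholders for testing; Flume reads the files as text events, so CSV works directly but Excel files would need converting first):

    # spool.conf: single agent "a1" with spooldir source -> memory channel -> logger sink
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Watch a local directory; files dropped here are turned into events line by line
    a1.sources.r1.type = spooldir
    a1.sources.r1.spoolDir = /home/cloudera/flume-in
    a1.sources.r1.channels = c1

    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000

    # Logger sink just prints events (handy for testing); swap for an HDFS sink later
    a1.sinks.k1.type = logger
    a1.sinks.k1.channel = c1

Start it with: flume-ng agent --conf conf --conf-file spool.conf --name a1 -Dflume.root.logger=INFO,console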

About transfer file in hdfs

I need to transfer files from one HDFS folder to another HDFS folder in Java code.
May I ask: is there an API we can call to transfer files between HDFS paths?
I'd also like to ask: is there any way to invoke a MapReduce job from Java code? Of course, this Java code is not running in Hadoop.
Thank you very much and have a great weekend!
May I ask: is there an API we can call to transfer files between HDFS paths?
Use the o.a.h.hdfs.DistributedFileSystem#rename method to move a file from one HDFS folder to another. The method is overloaded, and one of the overloads takes Options.Rename as a parameter.
FYI: I haven't checked the code, but I think the rename only involves changes to the namespace and not any actual block movement.
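A minimal sketch using the public FileSystem API (FileSystem.get returns a DistributedFileSystem when fs.defaultFS points at HDFS); the paths are placeholders, and the cluster's core-site.xml/hdfs-site.xml are assumed to be on the classpath:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsMove {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);

        // rename() moves the file within the namespace; no data blocks are copied.
        boolean moved = fs.rename(new Path("/data/incoming/file.txt"),
                                  new Path("/data/archive/file.txt"));
        System.out.println("moved: " + moved);
        fs.close();
      }
    }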
I'd also like to ask: is there any way to invoke a MapReduce job from Java code? Of course, this Java code is not running in Hadoop.
Hadoop is written in Java, so there should be a way. :) Use the o.a.h.mapreduce.Job#submit and o.a.h.mapreduce.Job#waitForCompletion methods.
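A rough sketch of submitting a job from a plain Java program; it uses the identity Mapper/Reducer as stand-ins for your own classes, and the input/output paths are placeholders. The cluster's configuration files (core-site.xml, yarn-site.xml, mapred-site.xml) need to be on the classpath so the job is submitted to the cluster rather than run locally:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SubmitJob {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "my-job");
        job.setJarByClass(SubmitJob.class);
        job.setMapperClass(Mapper.class);    // identity mapper; replace with your own
        job.setReducerClass(Reducer.class);  // identity reducer; replace with your own
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));

        // waitForCompletion(true) submits the job and blocks, printing progress;
        // use job.submit() instead if you just want to fire it off asynchronously.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }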
