How to process a range of HBase rows using Spark? - java

I am trying to use HBase as a data source for Spark, so the first step is creating an RDD from an HBase table. Since Spark works with Hadoop input formats, I found a way to create an RDD covering all rows (http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase), but how do we create an RDD for a range scan?
All suggestions are welcome.

Here is an example of using a Scan in Spark (Scala):
import java.io.{ByteArrayOutputStream, DataOutputStream}

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Base64

// Serialize the Scan into the Base64 string that TableInputFormat expects.
// Note: scan.write() relies on the old Writable-based Scan (HBase 0.94.x);
// on newer HBase versions use TableMapReduceUtil.convertScanToString(scan)
// instead, as in the Java example further down.
def convertScanToString(scan: Scan): String = {
  val out = new ByteArrayOutputStream
  val dos = new DataOutputStream(out)
  scan.write(dos)
  Base64.encodeBytes(out.toByteArray)
}

val conf = HBaseConfiguration.create()
val scan = new Scan()
scan.setCaching(500)
scan.setCacheBlocks(false)

conf.set(TableInputFormat.INPUT_TABLE, "table_name")
conf.set(TableInputFormat.SCAN, convertScanToString(scan))

val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])
rdd.count
You need to add the related HBase libraries to the Spark classpath and make sure they are compatible with your Spark version. Tip: you can run the hbase classpath command to find them.

You can set the configuration below:
val conf = HBaseConfiguration.create() // also set any other HBase params you need
conf.set(TableInputFormat.SCAN_ROW_START, "row2")
conf.set(TableInputFormat.SCAN_ROW_STOP, "stoprowkey")
This will load the RDD only for the records in that row-key range.
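For completeness, here is a minimal Java sketch of the same range-limited scan using those configuration keys. The table name and row keys are the placeholder values from above, and you may still need to point HBaseConfiguration at your cluster (for example via hbase-site.xml on the classpath):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HBaseRangeScan {
    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext(
                new SparkConf().setMaster("local[2]").setAppName("HBaseRangeScan"));

        // HBase configuration: table name plus the row-key range to scan
        Configuration conf = HBaseConfiguration.create();
        conf.set(TableInputFormat.INPUT_TABLE, "table_name");
        conf.set(TableInputFormat.SCAN_ROW_START, "row2");
        conf.set(TableInputFormat.SCAN_ROW_STOP, "stoprowkey");

        // Only rows in [row2, stoprowkey) end up in the RDD
        JavaPairRDD<ImmutableBytesWritable, Result> rdd = jsc.newAPIHadoopRDD(
                conf, TableInputFormat.class, ImmutableBytesWritable.class, Result.class);
        System.out.println(rdd.count());

        jsc.stop();
    }
}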

Here is a Java example that uses TableMapReduceUtil.convertScanToString(Scan scan):
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HbaseScan {

    public static void main(String... args) throws IOException, InterruptedException {
        // Spark conf
        SparkConf sparkConf = new SparkConf().setMaster("local[4]").setAppName("My App");
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);

        // HBase conf
        Configuration conf = HBaseConfiguration.create();
        conf.set(TableInputFormat.INPUT_TABLE, "big_table_name");

        // Create a scan restricted to the row-key range [a, d)
        Scan scan = new Scan();
        scan.setCaching(500);
        scan.setCacheBlocks(false);
        scan.setStartRow(Bytes.toBytes("a"));
        scan.setStopRow(Bytes.toBytes("d"));

        // Submit the scan into the HBase conf
        conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan));

        // Get the RDD
        JavaPairRDD<ImmutableBytesWritable, Result> source = jsc
                .newAPIHadoopRDD(conf, TableInputFormat.class,
                        ImmutableBytesWritable.class, Result.class);

        // Process the RDD
        System.out.println(source.count());
    }
}
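Once you have the source pairs, you usually map them to something easier to work with. A minimal sketch, assuming a column family "cf" with a qualifier "col1" (both are placeholders for whatever your table actually contains):
// Additional imports needed: scala.Tuple2, org.apache.spark.api.java.JavaRDD,
// org.apache.spark.api.java.function.Function

// Continuing from the 'source' RDD above: read cf:col1 from each Result
JavaRDD<String> values = source.map(
        new Function<Tuple2<ImmutableBytesWritable, Result>, String>() {
            @Override
            public String call(Tuple2<ImmutableBytesWritable, Result> pair) {
                byte[] cell = pair._2().getValue(Bytes.toBytes("cf"), Bytes.toBytes("col1"));
                return cell == null ? null : Bytes.toString(cell);
            }
        });

for (String v : values.take(10)) {
    System.out.println(v);
}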

Related

Copy JSON file from Local to HDFS

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class HdfsWriter extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        //String localInputPath = args[0];
        Path outputPath = new Path(args[0]); // argument for the output (HDFS) location
        Configuration conf = getConf();
        FileSystem fs = FileSystem.get(conf);
        OutputStream os = fs.create(outputPath);
        // The local data set is read through a buffered input stream
        InputStream is = new BufferedInputStream(new FileInputStream("/home/acadgild/acadgild.txt"));
        // Copy the data set from the input stream to the output stream (closes both streams)
        IOUtils.copyBytes(is, os, conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int returnCode = ToolRunner.run(new HdfsWriter(), args);
        System.exit(returnCode);
    }
}
I need to move the data from local to HDFS.
I got the above code from another blog and it is not working; can anyone help me with this?
I also need to parse the JSON using MapReduce, group it by DateTime, and move it to HDFS.
MapReduce is a distributed job-processing framework; for each mapper, "local" means the local filesystem of the node that mapper is running on.
What you want is to read from the local filesystem of a given node, put the data onto HDFS, and then process it via MapReduce.
There are multiple tools available for copying from the local filesystem of one node to HDFS, for example:
hdfs dfs -put localPath hdfsPath (shell command)
Flume
You can also do the copy programmatically, as sketched below.
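If you would rather do the copy from Java than from the shell, here is a minimal sketch using the Hadoop FileSystem API (both paths are placeholders):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS must point at your cluster, e.g. via core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy the local file into HDFS
        fs.copyFromLocalFile(new Path("/home/acadgild/acadgild.txt"),
                new Path("/user/acadgild/acadgild.txt"));

        fs.close();
    }
}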

Java - A master URL must be set in your configuration

I am trying to run an algorithm in Apache Spark. I am getting the
"A master URL must be set in your configuration" error even though I set the configuration:
SparkSession spark = SparkSession.builder().appName("Sp_LogistcRegression").config("spark.master", "local").getOrCreate();
This is the code I work with:
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class Sp_LogistcRegression {

    public void trainLogisticregression(String path, String model_path) throws IOException {
        //SparkConf conf = new SparkConf().setAppName("Linear Regression Example");
        //JavaSparkContext sc = new JavaSparkContext(conf);
        SparkSession spark = SparkSession.builder().appName("Sp_LogistcRegression")
                .config("spark.master", "local").getOrCreate();

        Dataset<Row> training = spark.read().option("header", "true").csv(path);
        System.out.print(training.count());

        LogisticRegression lr = new LogisticRegression().setMaxIter(10).setRegParam(0.3);
        // Fit the model
        LogisticRegressionModel lrModel = lr.fit(training);
        lrModel.save(model_path);

        spark.close();
    }
}
This is my test case:
import java.io.File;
import java.io.IOException;

import org.junit.Test;

public class Sp_LogistcRegressionTest {

    Sp_LogistcRegression spl = new Sp_LogistcRegression();

    @Test
    public void test() throws IOException {
        String filename = "datas/seg-large.csv";
        ClassLoader classLoader = getClass().getClassLoader();
        File file1 = new File(classLoader.getResource(filename).getFile());
        spl.trainLogisticregression(file1.getAbsolutePath(), "/tmp");
    }
}
Why am I getting this error? I checked the solutions in
Spark - Error "A master URL must be set in your configuration" when submitting an app
but it doesn't work. Any clues?
Your
SparkSession spark = SparkSession.builder().appName("Sp_LogistcRegression").config("spark.master", "local").getOrCreate();
should be
SparkSession spark = SparkSession.builder().appName("Sp_LogistcRegression").master("local").getOrCreate();
Or, when you run the application, pass the master to spark-submit:
spark-submit --class mainClass --master local yourJarFile

Exception in thread "main" java.io.IOException: No input paths specified in job

I am trying to read a JSON file using Spark in Java. The few changes I tried were:
SparkConf conf = new SparkConf().setAppName("Search").setMaster("local[*]");
DataFrame df = sqlContext.read().json("../Users/pshah/Desktop/sample.json/*");
Code:
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class ParseData {
    public static void main(String args[]) {
        SparkConf conf = new SparkConf().setAppName("Search").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);

        // Create the DataFrame
        DataFrame df = sqlContext.read().json("/Users/pshah/Desktop/sample.json");

        // Show the content of the DataFrame
        df.show();
    }
}
Error:
Exception in thread "main" java.io.IOException: No input paths specified in job
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:198)
I wrote the same code and hit the same problem. I had put the people.json file under the project directory src/main/resources, and the reason for the error is that the program could not find the file there. After I copied people.json to the program's working directory, the program worked fine.
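Alternatively, you can resolve the file's location explicitly rather than relying on the working directory. A minimal sketch, assuming people.json sits in src/main/resources (so it is on the classpath when running from the IDE or exploded classes, not from inside a jar), continuing from the ParseData code above:
// Resolve the classpath resource to an absolute path and read it with Spark
java.net.URL resource = ParseData.class.getResource("/people.json");
if (resource == null) {
    throw new IllegalStateException("people.json not found on the classpath");
}
DataFrame df = sqlContext.read().json(resource.getPath());
df.show();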

How to Save a dataframe in mongoDB using Apache spark and java library

I have a CSV file. I load it into the program using SQLContext, which gives me a DataFrame. Now I want to store this CSV data in a MongoDB collection, but I am not able to convert it into a JavaPairRDD. Please help.
My code is:
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.bson.BSONObject;
import com.mongodb.hadoop.MongoOutputFormat;

public class CSVReader {
    public static void main(String args[]) {
        SparkConf conf = new SparkConf().setAppName("sparkConnection").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        /* Load a csv file from the given location */
        DataFrame df = sqlContext.read()
                .format("com.databricks.spark.csv")
                .option("inferSchema", "true") // automatically infer the column types
                .option("header", "true")      // use the first line as the header
                .load("D:/SparkFiles/abc.csv");
    }
}
You clearly haven't researched this enough; if you had, you would know that a DataFrame is nothing but a combination of a schema and an RDD.
Assuming your posted code works fine, you can get the underlying RDD from the DataFrame as df.rdd (in Java, df.javaRDD()), and from there build the pair RDD you need, as sketched below.
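One possible way to go from the DataFrame to MongoDB with the mongo-hadoop connector you already import is sketched below, continuing from the CSVReader code above. The output URI "mongodb://localhost:27017/mydb.abc" is a placeholder, and the exact connector API may differ between versions:
// Additional imports needed: scala.Tuple2, org.bson.BasicBSONObject,
// org.apache.spark.api.java.function.PairFunction

// Tell the connector which database/collection to write to (placeholder URI)
Configuration outputConfig = new Configuration();
outputConfig.set("mongo.output.uri", "mongodb://localhost:27017/mydb.abc");

// Convert each Row of the DataFrame into a BSON document
JavaPairRDD<Object, BSONObject> mongoRdd = df.javaRDD().mapToPair(
        new PairFunction<Row, Object, BSONObject>() {
            @Override
            public Tuple2<Object, BSONObject> call(Row row) {
                BSONObject doc = new BasicBSONObject();
                for (String field : row.schema().fieldNames()) {
                    doc.put(field, row.getAs(field));
                }
                // A null key lets MongoDB generate the _id
                return new Tuple2<Object, BSONObject>(null, doc);
            }
        });

// The path argument is not used by MongoOutputFormat but must be non-empty
mongoRdd.saveAsNewAPIHadoopFile("file:///not-used", Object.class, BSONObject.class,
        MongoOutputFormat.class, outputConfig);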

Apache Spark - JavaSparkContext cannot be converted to SparkContext error

I'm having considerable difficulty translating the Spark examples to runnable code (as evidenced by my previous question here).
The answers provided there helped me with that particular example, but now I am trying to experiment with the Multilayer Perceptron example and straight away I am encountering errors.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel;
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier;
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;
import org.apache.spark.ml.param.ParamMap;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

public class SimpleANN {
    public static void main(String[] args) {
        String path = "file:/usr/local/share/spark-1.5.0/data/mllib/sample_multiclass_classification_data.txt";
        SparkConf conf = new SparkConf().setAppName("Simple ANN");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Load training data
        JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc, path).toJavaRDD();
        ...
        ...
    }
}
I get the following error
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project simple-ann: Compilation failure
[ERROR] /Users/robo/study/spark/ann/src/main/java/SimpleANN.java:[23,61] incompatible types: org.apache.spark.api.java.JavaSparkContext cannot be converted to org.apache.spark.SparkContext
If you need a SparkContext from your JavaSparkContext you can use the static method:
JavaSparkContext.toSparkContext(yourJavaSparkContext)
So you have to modify your code from
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc, path).toJavaRDD();
to
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(
        JavaSparkContext.toSparkContext(sc),
        path).toJavaRDD();
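As a side note, JavaSparkContext also exposes the wrapped SparkContext through its sc() method, so the following is equivalent:
JavaSparkContext sc = new JavaSparkContext(conf);
// sc.sc() returns the underlying org.apache.spark.SparkContext
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc.sc(), path).toJavaRDD();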
