I am trying to run an algorithm in Apache Spark. I am getting the error
"A master URL must be set in your configuration" even though I set the master in the configuration:
SparkSession spark = SparkSession.builder().appName("Sp_LogistcRegression").config("spark.master", "local").getOrCreate();
This is the code I am working with:
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.mllib.util.MLUtils;
public class Sp_LogistcRegression {
public void trainLogisticregression(String path, String model_path) throws IOException {
//SparkConf conf = new SparkConf().setAppName("Linear Regression Example");
// JavaSparkContext sc = new JavaSparkContext(conf);
SparkSession spark = SparkSession.builder().appName("Sp_LogistcRegression").config("spark.master", "local").getOrCreate();
Dataset<Row> training = spark.read().option("header","true").csv(path);
System.out.print(training.count());
LogisticRegression lr = new LogisticRegression().setMaxIter(10).setRegParam(0.3);
// Fit the model
LogisticRegressionModel lrModel = lr.fit(training);
lrModel.save(model_path);
spark.close();
}
}
This is my test case:
import java.io.File;
import java.io.IOException;
import org.junit.Test;
public class Sp_LogistcRegressionTest {
Sp_LogistcRegression spl = new Sp_LogistcRegression();
@Test
public void test() throws IOException {
String filename = "datas/seg-large.csv";
ClassLoader classLoader = getClass().getClassLoader();
File file1 = new File(classLoader.getResource(filename).getFile());
spl.trainLogisticregression(file1.getAbsolutePath(), "/tmp");
}
}
Why am I getting this error? I checked the solutions in
Spark - Error "A master URL must be set in your configuration" when submitting an app
but they do not work.
Any clues?
Your
SparkSession spark = SparkSession.builder().appName("Sp_LogistcRegression").config("spark.master", "local").getOrCreate();
should be
SparkSession spark = SparkSession.builder().appName("Sp_LogistcRegression").master("local").getOrCreate();
Or, when you run the application, pass the master to spark-submit:
spark-submit --class mainClass --master local yourJarFile
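For context on the master values that show up in snippets like these ("local", "local[4]", "local[*]"): in Spark's local mode, the bracketed part selects the number of worker threads. The sketch below is a hypothetical helper, not part of the Spark API. The class name LocalMaster and the method threads are made up for illustration, and it only interprets the plain local forms. Anything else, including cluster URLs, returns -1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not a Spark class): interprets plain local master URLs. */
class LocalMaster {
    // Matches "local[N]" or "local[*]"; other forms are deliberately not handled here.
    private static final Pattern LOCAL_N = Pattern.compile("local\\[(\\*|\\d+)\\]");

    /** Returns the worker thread count implied by a local master URL, or -1 for any other value. */
    static int threads(String master) {
        if ("local".equals(master)) {
            return 1; // "local" means a single worker thread
        }
        Matcher m = LOCAL_N.matcher(master);
        if (!m.matches()) {
            return -1; // not a plain local master (for example spark://host:7077)
        }
        String n = m.group(1);
        // "local[*]" means one thread per logical core on this machine
        return "*".equals(n) ? Runtime.getRuntime().availableProcessors() : Integer.parseInt(n);
    }

    public static void main(String[] args) {
        for (String master : new String[] {"local", "local[4]", "local[*]", "spark://host:7077"}) {
            System.out.println(master + " -> " + threads(master));
        }
    }
}
```

For a unit test like the one in the question, "local" or "local[*]" is usually what you want, since no cluster is involved.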
I'm following this example of creating a YARN application with the Java API:
https://github.com/hortonworks/simple-yarn-app
It works fine, but the log exists only during execution; after the application finishes, the log is gone.
How can I capture the log from code? Or is there an option I can enable?
You can fetch the logs with LogCLIHelpers by application ID after the application has finished:
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.exceptions.YarnException;
import org.apache.hadoop.yarn.logaggregation.LogCLIHelpers;
import java.io.IOException;
import java.io.PrintStream;
public static void getLogs(YarnConfiguration conf, YarnClientApplication app) throws IOException, YarnException {
ApplicationSubmissionContext appContext =
app.getApplicationSubmissionContext();
ApplicationId appId = appContext.getApplicationId();
LogCLIHelpers logCLIHelpers = new LogCLIHelpers();
logCLIHelpers.setConf(conf);
FileSystem fs = FileSystem.get(conf);
Path logFile = new Path("/path/to/log/file.log");
// Write the logs to the file created on the configured FileSystem (fails if the file already exists)
try (PrintStream printStream = new PrintStream(fs.create(logFile, false))) {
logCLIHelpers.dumpAllContainersLogs(appId, UserGroupInformation.getCurrentUser().getShortUserName(), printStream);
}
}
I'm new to Spark. I tried the code below.
package hdd.models;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.SparkSession;
/*
* Analysis of the data using Spark SQL
*
*/
public class HrtDisDataAnalyze {
public HrtDisDataAnalyze() {
}
public static void main(String[] args) {
SparkConfAndCtxBuilder ctxBuilder = new SparkConfAndCtxBuilder();
JavaSparkContext jctx = ctxBuilder.loadSimpleSparkContext("Heart Disease Data Analysis App", "local");
JavaRDD<String> rows = jctx.textFile("file:///C:/Users/harpr/workspace/HrtDisDetection/src/resources/full_data_cleaned.csv");
String schemaString = "age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal num";
List<StructField> fields = new ArrayList<>();
for (String fieldName : schemaString.split(" ")) {
fields.add(DataTypes.createStructField(fieldName, DataTypes.StringType, true));
}
StructType schema = DataTypes.createStructType(fields);
JavaRDD<Row> rowRdd = rows.map(new Function<String, Row>() {
@Override
public Row call(String record) throws Exception {
String[] fields = record.split(",");
return RowFactory.create(fields[0],fields[1],fields[2],fields[3],fields[4],fields[5],fields[6],fields[7],fields[8],fields[9],fields[10],fields[11],fields[12],fields[13]);
}
});
SparkSession sparkSession = SparkSession.builder().config("spark.serializer", "org.apache.spark.serializer.KryoSerializer").config("spark.kryo.registrator", "org.datasyslab.geospark.serde.GeoSparkKryoRegistrator").master("local[*]").appName("testGeoSpark").getOrCreate();
Dataset<Row> df = sparkSession.read().csv("usr/local/eclipse1/eclipse/hrtdisdetection/src/resources/cleveland_data_raw.csv");
df.createOrReplaceTempView("heartDisData");
}
}
The following error occurs at the SparkSession line:
"The type org.apache.spark.sql.SparkSession$Builder cannot be resolved. It is indirectly referenced from required .class files"
Note: I'm using spark-2.1.0 with Scala 2.10. I tried the above code in Java with Eclipse Neon.
There is no need for a separate context builder.
Just create the SparkSession at the beginning and get the Spark context from the session.
SparkSession sparkSession = SparkSession.builder().config("spark.serializer", "org.apache.spark.serializer.KryoSerializer").config("spark.kryo.registrator", "org.datasyslab.geospark.serde.GeoSparkKryoRegistrator").master("local[*]").appName("testGeoSpark").getOrCreate();
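// From Java, wrap the session's SparkContext so the single-argument textFile is available
new JavaSparkContext(sparkSession.sparkContext()).textFile(yourFileOrURL);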
I added the JAR file that provides SparkSession to the project, and the error cleared.
https://jar-download.com/?search_box=org.apache.spark%20spark.sql
I am trying to read a JSON file using Spark in Java. The changes I tried were:
SparkConf conf = new SparkConf().setAppName("Search").setMaster("local[*]");
DataFrame df = sqlContext.read().json("../Users/pshah/Desktop/sample.json/*");
Code:
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
public class ParseData {
public static void main(String args[]){
SparkConf conf = new SparkConf().setAppName("Search").setMaster("local");
JavaSparkContext sc= new JavaSparkContext(conf);
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
// Create the DataFrame
DataFrame df = sqlContext.read().json("/Users/pshah/Desktop/sample.json");
// Show the content of the DataFrame
df.show();
}
}
Error:
Exception in thread "main" java.io.IOException: No input paths specified in job
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:198)
I wrote the same code and ran into the same problem. I had put the people.json file under the project directory src/main/resources. The cause was that the program could not find the file. After I copied people.json to the program's working directory, the program worked.
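To see where a relative path actually points before handing it to Spark, a small check like the following can help. This is only a sketch using the standard library. The class name InputPathCheck and the method requireExisting are made up for illustration, and it assumes the input is a plain local file rather than an HDFS location or a glob.

```java
import java.io.FileNotFoundException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: fails early, with a readable message, when a local input file is missing. */
class InputPathCheck {
    /** Resolves a possibly relative path against the working directory and verifies that it exists. */
    static Path requireExisting(String location) throws FileNotFoundException {
        Path absolute = Paths.get(location).toAbsolutePath().normalize();
        if (!Files.exists(absolute)) {
            throw new FileNotFoundException("Input not found: " + absolute
                    + " (working directory: " + Paths.get("").toAbsolutePath() + ")");
        }
        return absolute;
    }

    public static void main(String[] args) throws Exception {
        // "people.json" is the file name from the answer above; pass another path as the first argument.
        String location = args.length > 0 ? args[0] : "people.json";
        System.out.println("Would read: " + requireExisting(location));
    }
}
```

The message includes the working directory, which is usually the project root when running from an IDE, not src/main/resources.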
I'm having considerable difficulty translating the Spark examples to runnable code (as evidenced by my previous question here).
The answers provided there helped me with that particular example, but now I am trying to experiment with the Multilayer Perceptron example and straight away I am encountering errors.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel;
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier;
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;
import org.apache.spark.ml.param.ParamMap;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
// Load training data
public class SimpleANN {
public static void main(String[] args) {
String path = "file:/usr/local/share/spark-1.5.0/data/mllib/sample_multiclass_classification_data.txt";
SparkConf conf = new SparkConf().setAppName("Simple ANN");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc, path).toJavaRDD();
...
...
}
}
I get the following error:
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project simple-ann: Compilation failure
[ERROR] /Users/robo/study/spark/ann/src/main/java/SimpleANN.java:[23,61] incompatible types: org.apache.spark.api.java.JavaSparkContext cannot be converted to org.apache.spark.SparkContext
If you need a SparkContext from your JavaSparkContext, you can use the static method:
JavaSparkContext.toSparkContext(yourJavaSparkContextBean)
So you have to change your code from
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc, path).toJavaRDD();
to
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(
JavaSparkContext.toSparkContext(sc),
path).toJavaRDD();
I am trying to use HBase as a data source for Spark. The first step is to create an RDD from an HBase table. Since Spark works with Hadoop input formats, I found a way to read all rows into an RDD: http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase But how do we create an RDD for a range scan?
All suggestions are welcome.
Here is an example of using Scan in Spark:
import java.io.{DataOutputStream, ByteArrayOutputStream}
import java.lang.String
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Base64
def convertScanToString(scan: Scan): String = {
val out: ByteArrayOutputStream = new ByteArrayOutputStream
val dos: DataOutputStream = new DataOutputStream(out)
scan.write(dos)
Base64.encodeBytes(out.toByteArray)
}
val conf = HBaseConfiguration.create()
val scan = new Scan()
scan.setCaching(500)
scan.setCacheBlocks(false)
conf.set(TableInputFormat.INPUT_TABLE, "table_name")
conf.set(TableInputFormat.SCAN, convertScanToString(scan))
val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
rdd.count
You need to add the related libraries to the Spark classpath and make sure they are compatible with your Spark version. Tip: you can run the hbase classpath command to find them.
You can set the configuration below:
val conf = HBaseConfiguration.create() // all other HBase parameters still need to be set
conf.set(TableInputFormat.SCAN_ROW_START, "row2")
conf.set(TableInputFormat.SCAN_ROW_STOP, "stoprowkey")
This loads into the RDD only the records whose row keys fall within that range.
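To make the range semantics concrete: an HBase scan includes the start row and excludes the stop row, and row keys are ordered byte by byte. The sketch below imitates that ordering in plain Java so you can predict which keys a range would cover. It does not call HBase; the class name RowRange and its methods are made up for illustration, and it assumes the row keys are UTF-8 strings.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: predicts which row keys fall in a half-open range [start, stop). */
class RowRange {
    /** Compares two byte arrays lexicographically, treating each byte as unsigned. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length; // a shorter key sorts before a longer key with the same prefix
    }

    /** True if start <= key < stop under the byte ordering above. */
    static boolean contains(String start, String stop, String key) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        return compare(start.getBytes(StandardCharsets.UTF_8), k) <= 0
                && compare(k, stop.getBytes(StandardCharsets.UTF_8)) < 0;
    }

    /** Keeps only the keys inside the range, preserving their input order. */
    static List<String> select(String start, String stop, List<String> keys) {
        List<String> selected = new ArrayList<>();
        for (String key : keys) {
            if (contains(start, stop, key)) {
                selected.add(key);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("row1", "row2", "row25", "row3", "stoprowkey", "zzz");
        // Same start and stop values as in the configuration above
        System.out.println(select("row2", "stoprowkey", keys)); // prints [row2, row25, row3]
    }
}
```

Note that "stoprowkey" itself is not returned, because the stop row is exclusive.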
Here is a Java example with TableMapReduceUtil.convertScanToString(Scan scan):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import java.io.IOException;
public class HbaseScan {
public static void main(String ... args) throws IOException, InterruptedException {
// Spark conf
SparkConf sparkConf = new SparkConf().setMaster("local[4]").setAppName("My App");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
// Hbase conf
Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, "big_table_name");
// Create scan
Scan scan = new Scan();
scan.setCaching(500);
scan.setCacheBlocks(false);
scan.setStartRow(Bytes.toBytes("a"));
scan.setStopRow(Bytes.toBytes("d"));
// Submit scan into hbase conf
conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan));
// Get RDD
JavaPairRDD<ImmutableBytesWritable, Result> source = jsc
.newAPIHadoopRDD(conf, TableInputFormat.class,
ImmutableBytesWritable.class, Result.class);
// Process RDD
System.out.println(source.count());
}
}