I'm new to Spark-related work. I tried the code below.
package hdd.models;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.SparkSession;
/*
* Analysis of the data using Spark SQL
*
*/
public class HrtDisDataAnalyze {
public HrtDisDataAnalyze() {
}
public static void main(String[] args) {
SparkConfAndCtxBuilder ctxBuilder = new SparkConfAndCtxBuilder();
JavaSparkContext jctx = ctxBuilder.loadSimpleSparkContext("Heart Disease Data Analysis App", "local");
JavaRDD<String> rows = jctx.textFile("file:///C:/Users/harpr/workspace/HrtDisDetection/src/resources/full_data_cleaned.csv");
String schemaString = "age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal num";
List<StructField> fields = new ArrayList<>();
for (String fieldName : schemaString.split(" ")) {
fields.add(DataTypes.createStructField(fieldName, DataTypes.StringType, true));
}
StructType schema = DataTypes.createStructType(fields);
JavaRDD<Row> rowRdd = rows.map(new Function<String, Row>() {
@Override
public Row call(String record) throws Exception {
String[] fields = record.split(",");
return RowFactory.create(fields[0],fields[1],fields[2],fields[3],fields[4],fields[5],fields[6],fields[7],fields[8],fields[9],fields[10],fields[11],fields[12],fields[13]);
}
});
SparkSession sparkSession = SparkSession.builder()
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.kryo.registrator", "org.datasyslab.geospark.serde.GeoSparkKryoRegistrator")
        .master("local[*]")
        .appName("testGeoSpark")
        .getOrCreate();
Dataset<Row> df = sparkSession.read().csv("usr/local/eclipse1/eclipse/hrtdisdetection/src/resources/cleveland_data_raw.csv");
df.createOrReplaceTempView("heartDisData");
}
}
The following error occurs at the SparkSession line:
"The type org.apache.spark.sql.SparkSession$Builder cannot be resolved. It is indirectly referenced from required .class files"
Note: I'm using Spark 2.1.0 with Scala 2.10. I tried the above code in Java with Eclipse Neon.
There is no point in using your SparkConfAndCtxBuilder.
Just create the SparkSession at the beginning and get the Spark context from the session:
SparkSession sparkSession = SparkSession.builder()
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.kryo.registrator", "org.datasyslab.geospark.serde.GeoSparkKryoRegistrator")
        .master("local[*]")
        .appName("testGeoSpark")
        .getOrCreate();
sparkSession.sparkContext().textFile(yourFileOrURL);
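With the session in place, the schema and row RDD that the question already builds can be turned into a DataFrame and queried. A minimal sketch, assuming the rowRdd and schema from the question's code and the standard createDataFrame API:
Dataset<Row> df = sparkSession.createDataFrame(rowRdd, schema);
df.createOrReplaceTempView("heartDisData");
sparkSession.sql("SELECT * FROM heartDisData LIMIT 10").show(); // quick sanity check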
I added the jar file for SparkSession and the error cleared:
https://jar-download.com/?search_box=org.apache.spark%20spark.sql
package com.evampsaanga.imran.testing;
import org.apache.spark.SparkConf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
public class ReadingCsv {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("Text File Data Load")
        .setMaster("local")
        .set("spark.driver.host", "localhost")
        .set("spark.testing.memory", "2147480000");
SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
Dataset<Row> df = spark.read()
.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
.option("sep", ",")
.option("inferSchema", true)
.option("header", true)
.load("E:/CarsData.csv");
df.write().parquet("test.parquet");
}
}
I used df.write().parquet() but I am getting this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Multiple sources found for parquet (org.apache.spark.sql.execution.datasources.v2.parquet.ParquetDataSourceV2, org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat), please specify the fully qualified class name.;
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:702)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:728)
at org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:948)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:285)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:269)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:829)
at com.evampsaanga.imran.testing.ReadingCsv.main(ReadingCsv.java:23)
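The error message itself points at one workaround: name the parquet source by its fully qualified class name instead of the short name "parquet". A sketch of that; it does not remove the underlying classpath conflict, which typically comes from mixing Spark SQL artifacts or from a fat jar whose META-INF/services entries were not merged:
// hedged workaround: pick one of the two implementations listed in the error explicitly
df.write()
    .format("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat")
    .save("test.parquet");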
I followed the instructions in Structured Streaming + Kafka and built a program that receives data streams sent from Kafka as input. When I receive the data stream, I want to pass it to a SparkSession variable to do some query work with Spark SQL, so I extended the ForeachWriter class as follows:
package stream;
import java.io.FileNotFoundException;
import java.io.PrintWriter;
import org.apache.spark.sql.ForeachWriter;
import org.apache.spark.sql.SparkSession;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;
import dataservices.OrderDataServices;
import models.SuccessEvent;
public class MapEventWriter extends ForeachWriter<String>{
private SparkSession spark;
public MapEventWriter(SparkSession spark) {
this.spark = spark;
}
private static final long serialVersionUID = 1L;
@Override
public void close(Throwable errorOrNull) {
// TODO Auto-generated method stub
}
@Override
public boolean open(long partitionId, long epochId) {
// TODO Auto-generated method stub
return true;
}
@Override
public void process(String input) {
OrderDataServices services = new OrderDataServices(this.spark);
}
}
However, if I use the spark variable in the process function, the program gives an error. The SparkSession is passed in as follows:
package demo;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.TimeoutException;
import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.json.simple.parser.ParseException;
import dataservices.OrderDataServices;
import models.MapperEvent;
import models.OrderEvent;
import models.SuccessEvent;
import stream.MapEventWriter;
import stream.MapEventWriter1;
public class Demo {
public static void main(String[] args) throws TimeoutException, StreamingQueryException, ParseException, IOException {
try (SparkSession spark = SparkSession.builder().appName("Read kafka").getOrCreate()) {
Dataset<String> data = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "tiki-1")
.load()
.selectExpr("CAST(value AS STRING)")
.as(Encoders.STRING());
MapEventWriter eventWriter = new MapEventWriter(spark);
StreamingQuery query = data
.writeStream()
.foreach(eventWriter)
.start();
query.awaitTermination();
}
}
}
The error is a NullPointerException at the point where spark is called, i.e. the spark variable is not initialized.
I hope someone can help me; I would really appreciate it.
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:151)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:149)
at org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:998)
at org.apache.spark.sql.SparkSession.read(SparkSession.scala:655)
at dataservices.OrderDataServices.<init>(OrderDataServices.java:18)
at stream.MapEventWriter.process(MapEventWriter.java:38)
at stream.MapEventWriter.process(MapEventWriter.java:15)
"do some query work with Spark SQL"
You wouldn't use a ForeachWriter for that. The writer is serialized and run on the executors, where the SparkSession cannot be used (it only lives on the driver), which is why process fails with the NullPointerException. Keep the query work on the Dataset itself:
.selectExpr("CAST(value AS STRING)")
.as(Encoders.STRING()); // or parse your JSON here using a schema
data.select(...) // or move this to a method / class that takes the Dataset as a parameter
// await termination
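Spelled out a little more, the idea is to do the transformations on the streaming Dataset on the driver instead of handing the SparkSession to a ForeachWriter. A rough sketch; the JSON schema, column names, and console sink are illustrative assumptions, not part of the original code:
// extra imports assumed: org.apache.spark.sql.Row, org.apache.spark.sql.functions,
// org.apache.spark.sql.types.DataTypes, org.apache.spark.sql.types.StructType
StructType orderSchema = new StructType()
        .add("orderId", DataTypes.StringType)   // hypothetical fields
        .add("status", DataTypes.StringType);
Dataset<Row> orders = data.toDF("json")
        .select(functions.from_json(functions.col("json"), orderSchema).as("order"))
        .select("order.*");
StreamingQuery query = orders
        .writeStream()
        .format("console")                      // any sink works; console is just for illustration
        .start();
query.awaitTermination();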
I am trying to run an algorithm in Apache Spark. I am getting the
"A master URL must be set in your configuration" error even though I set the configuration:
SparkSession spark = SparkSession.builder().appName("Sp_LogistcRegression").config("spark.master", "local").getOrCreate();
This is the code I am working with:
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.mllib.util.MLUtils;
public class Sp_LogistcRegression {
public void trainLogisticregression(String path, String model_path) throws IOException {
//SparkConf conf = new SparkConf().setAppName("Linear Regression Example");
// JavaSparkContext sc = new JavaSparkContext(conf);
SparkSession spark = SparkSession.builder().appName("Sp_LogistcRegression").config("spark.master", "local").getOrCreate();
Dataset<Row> training = spark.read().option("header","true").csv(path);
System.out.print(training.count());
LogisticRegression lr = new LogisticRegression().setMaxIter(10).setRegParam(0.3);
// Fit the model
LogisticRegressionModel lrModel = lr.fit(training);
lrModel.save(model_path);
spark.close();
}
}
This is my test case:
import java.io.File;
import org.junit.Test;
public class Sp_LogistcRegressionTest {
Sp_LogistcRegression spl =new Sp_LogistcRegression ();
@Test
public void test() {
String filename = "datas/seg-large.csv";
ClassLoader classLoader = getClass().getClassLoader();
File file1 = new File(classLoader.getResource(filename).getFile());
spl.trainLogisticregression(file1.getAbsolutePath(), "/tmp");
}
}
Why am I getting this error? I checked the solutions here:
Spark - Error "A master URL must be set in your configuration" when submitting an app
It doesn't work. Any clues?
Your
SparkSession spark = SparkSession.builder().appName("Sp_LogistcRegression").config("spark.master", "local").getOrCreate();
should be
SparkSession spark = SparkSession.builder().appName("Sp_LogistcRegression").master("local").getOrCreate();
Or, when you run, you need to:
spark-submit --class mainClass --master local yourJarFile
I have two HBase tables, 'hbaseTable' and 'hbaseTable1', and a Hive table 'hiveTable'.
My query looks like:
'insert overwrite hiveTable select col1, h2.col2, col3 from hbaseTable h1,hbaseTable2 h2 where h1.col=h2.col2';
I need to do an inner join on the HBase tables and bring the data into Hive. We are using Hive with Java, which gives very poor performance, so we are planning to change the approach and use Spark, i.e. Spark with Java.
How do I connect to HBase from my Java code using Spark?
My Spark code should then do the join on the HBase tables and bring the data into Hive as in the query above.
Please provide sample code.
If you are using Spark to load the HBase data, why load it into Hive at all?
You can use Spark SQL, which is similar to Hive (and therefore SQL), and query the data without using Hive.
For example:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import scala.Tuple2;
import java.util.Arrays;
public class SparkHbaseHive {
public static void main(String[] args) {
Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, "test");
JavaSparkContext jsc = new JavaSparkContext(new SparkConf().setAppName("Spark-Hbase").setMaster("local[3]"));
JavaPairRDD<ImmutableBytesWritable, Result> source = jsc
.newAPIHadoopRDD(conf, TableInputFormat.class,
ImmutableBytesWritable.class, Result.class);
SQLContext sqlContext = new SQLContext(jsc);
JavaRDD<Table1Bean> rowJavaRDD =
source.map((Function<Tuple2<ImmutableBytesWritable, Result>, Table1Bean>) object -> {
Table1Bean table1Bean = new Table1Bean();
table1Bean.setRowKey(Bytes.toString(object._1().get()));
table1Bean.setColumn1(Bytes.toString(object._2().getValue(Bytes.toBytes("colfam1"), Bytes.toBytes("col1"))));
return table1Bean;
});
DataFrame df = sqlContext.createDataFrame(rowJavaRDD, Table1Bean.class);
//similarly create df2
//use df.join() and then register as joinedtable or register two tables and join
//execute sql queries
//Example of sql query on df
df.registerTempTable("table1");
Arrays.stream(sqlContext.sql("select * from table1").collect()).forEach(row -> System.out.println(row.getString(0) + "," + row.getString(1)));
}
}
public class Table1Bean {
private String rowKey;
private String column1;
public String getRowKey() {
return rowKey;
}
public void setRowKey(String rowKey) {
this.rowKey = rowKey;
}
public String getColumn1() {
return column1;
}
public void setColumn1(String column1) {
this.column1 = column1;
}
}
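To mirror the join from the question, register the second DataFrame the same way and join in SQL. A sketch only; df2 and the column names are assumptions based on Table1Bean, not taken from the real tables:
// assuming df2 was created from the second HBase table exactly like df above
df.registerTempTable("table1");
df2.registerTempTable("table2");
DataFrame joined = sqlContext.sql(
        "select t1.rowKey, t1.column1, t2.column1 from table1 t1 join table2 t2 on t1.rowKey = t2.rowKey");
joined.show();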
If you need to use Hive for some reason, use HiveContext to read from Hive and persist the data using saveAsTable.
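A rough sketch of that variant, with the DataFrames created from the HiveContext; the snapshot table name is made up for illustration:
// extra imports assumed: org.apache.spark.sql.hive.HiveContext, org.apache.spark.sql.SaveMode
HiveContext hiveContext = new HiveContext(jsc.sc());
DataFrame hiveDf = hiveContext.table("hiveTable");                        // read the existing Hive table
DataFrame fromHbase = hiveContext.createDataFrame(rowJavaRDD, Table1Bean.class);
fromHbase.write().mode(SaveMode.Overwrite).saveAsTable("hbase_snapshot"); // persist back to Hive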
Let me know in case of doubts.
I have a CSV file. I load it into the program using SQLContext, which loads it into a DataFrame. Now I want to store this CSV data in a MongoDB collection, but I am not able to convert it into a JavaPairRDD. Please help.
My code is:
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.bson.BSONObject;
import org.apache.spark.api.java.JavaPairRDD;
import com.mongodb.hadoop.MongoOutputFormat;
public class CSVReader {
public static void main(String args[]){
SparkConf conf = new SparkConf().setAppName("sparkConnection").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);
/* To load a csv file frol given location*/
DataFrame df = sqlContext.read()
.format("com.databricks.spark.csv")
.option("inferSchema", "true") // Automatically infers the schema
.option("header", "true") // To include the headers in the DataFrame
.load("D:/SparkFiles/abc.csv");
}
}
You clearly haven't researched enough. If you had, you would know that a DataFrame is nothing but a combination of a schema and an RDD.
Assuming your posted code works fine, you can get the RDD from the DataFrame as df.rdd() (or df.javaRDD() for a JavaRDD<Row> in Java).
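If the goal is still a JavaPairRDD for MongoOutputFormat, here is a rough, untested sketch of the usual mongo-hadoop pattern; the URI, database, and collection names are placeholders, and BasicBSONObject comes from org.bson:
// extra imports assumed: org.bson.BasicBSONObject, scala.Tuple2
JavaPairRDD<Object, BSONObject> pairs = df.javaRDD().mapToPair(row -> {
    BSONObject doc = new BasicBSONObject();
    for (String field : row.schema().fieldNames()) {
        doc.put(field, row.getAs(field));          // copy each column into the BSON document
    }
    return new Tuple2<>(null, doc);                // null key: let MongoDB generate the _id
});
Configuration outputConfig = new Configuration();
outputConfig.set("mongo.output.uri", "mongodb://localhost:27017/mydb.mycollection");
pairs.saveAsNewAPIHadoopFile("file:///not-used", Object.class, BSONObject.class,
        MongoOutputFormat.class, outputConfig);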