I have written code to read a CSV file and print it to the console using Spark Structured Streaming. The code is below:
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.*;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.types.StructType;
import com.cybernetix.models.BaseDataModel;
public class ReadCSVJob {
static List<BaseDataModel> bdmList=new ArrayList<BaseDataModel>();
public static void main(String args[]) {
SparkSession spark = SparkSession
.builder()
.config("spark.eventLog.enabled", "false")
.config("spark.driver.memory", "2g")
.config("spark.executor.memory", "2g")
.appName("StructuredStreamingAverage")
.master("local")
.getOrCreate();
StructType userSchema = new StructType();
userSchema.add("name", "string");
userSchema.add("status", "String");
userSchema.add("u_startDate", "String");
userSchema.add("u_lastlogin", "string");
userSchema.add("u_firstName", "string");
userSchema.add("u_lastName", "string");
userSchema.add("u_phone","string");
userSchema.add("u_email", "string")
;
Dataset<Row> dataset = spark.
readStream().
schema(userSchema)
.csv("D:\\user\\sdata\\user-2019-10-03_20.csv");
dataset.writeStream()
.format("console")
.option("truncate","false")
.start();
}
}
In this code, the line userSchema.add("name", "string"); causes the program to terminate. Below is the log trace.
ANTLR Tool version 4.7 used for code generation does not match the current runtime version 4.5.3
ANTLR Runtime version 4.7 used for parser compilation does not match the current runtime version 4.5.3
Exception in thread "main" java.lang.ExceptionInInitializerError
    at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:84)
    at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseDataType(ParseDriver.scala:39)
    at org.apache.spark.sql.types.StructType.add(StructType.scala:213)
    at com.cybernetix.sparks.jobs.ReadCSVJob.main(ReadCSVJob.java:45)
Caused by: java.lang.UnsupportedOperationException: java.io.InvalidClassException: org.antlr.v4.runtime.atn.ATN; Could not deserialize ATN with UUID 59627784-3be5-417a-b9eb-8131a7286089 (expected aadb8d7e-aeef-4415-ad2b-8204d6cf042e or a legacy UUID).
    at org.antlr.v4.runtime.atn.ATNDeserializer.deserialize(ATNDeserializer.java:153)
    at org.apache.spark.sql.catalyst.parser.SqlBaseLexer.<clinit>(SqlBaseLexer.java:1175)
    ... 4 more
Caused by: java.io.InvalidClassException: org.antlr.v4.runtime.atn.ATN; Could not deserialize ATN with UUID 59627784-3be5-417a-b9eb-8131a7286089 (expected aadb8d7e-aeef-4415-ad2b-8204d6cf042e or a legacy UUID).
    ... 6 more
I have added the ANTLR Maven dependency to the pom.xml file, but I am still facing the same issue.
<!-- https://mvnrepository.com/artifact/org.antlr/antlr4 -->
<dependency>
<groupId>org.antlr</groupId>
<artifactId>antlr4</artifactId>
<version>4.7</version>
</dependency>
I am not sure why, after adding the antlr dependency, the Maven dependency list still shows antlr-runtime-4.5.3.jar.
Can anyone tell me what I am doing wrong here?
Update your artifactId to antlr4-runtime, then clean and rebuild.
The dependency should look like this:
<dependency>
<groupId>org.antlr</groupId>
<artifactId>antlr4-runtime</artifactId>
<version>4.7</version>
</dependency>
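If the error persists after a clean build, it can help to confirm which ANTLR runtime actually wins on the classpath. A minimal check (purely illustrative; it just prints the version constant shipped with whichever antlr4-runtime jar the JVM loads) is:
import org.antlr.v4.runtime.RuntimeMetaData;

public class AntlrVersionCheck {
    public static void main(String[] args) {
        // Prints e.g. "4.5.3" if the old runtime still shadows the new one.
        System.out.println(RuntimeMetaData.VERSION);
    }
}
Running mvn dependency:tree -Dincludes=org.antlr also shows which dependency is pulling in the 4.5.3 runtime transitively.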
I'm trying to upgrade dependencies for a java application that uses com.vividsolutions.jts. I have removed all the references to this library from pom.xml and replaced them by the ones from org.locationtech.jts.
I have updated all the imports to use the org.locationtech version. However, in my function I am still getting an error about a com.vividsolutions class that cannot be found.
import org.locationtech.spatial4j.context.jts.JtsSpatialContext;
import org.locationtech.jts.geom.Coordinate;
import org.locationtech.jts.geom.GeometryFactory;
import org.locationtech.jts.geom.LinearRing;
import org.locationtech.spatial4j.shape.jts.JtsGeometry;
// ... other stuff
public static void myFunc(Coordinate[] coordinates) {
    GeometryFactory gf = new GeometryFactory();
    LinearRing linear = gf.createLinearRing(coordinates);
    JtsGeometry poly = new JtsGeometry(gf.createPolygon(linear), JtsSpatialContext.GEO, true, true);
}
Here's the error that I get for the last line of the above code: [ERROR] cannot access com.vividsolutions.jts.geom.Geometry [ERROR] class file for com.vividsolutions.jts.geom.Geometry not found
I'm clearly importing JtsGeometry from the new library at org.locationtech, however, it's still thinking the old library should be used.
The old library isn't in the dependency tree or the code anymore, as the followings don't return anything:
mvn dependency:tree | grep vivid
rg vivid
Any idea what I'm missing here or how I should troubleshoot this?
I'm not too sure what was wrong with the vividsolutions library inclusion. However, I was able to resolve my issue by including both of these in pom.xml:
<dependency>
<groupId>org.locationtech.jts</groupId>
<artifactId>jts-core</artifactId>
<version>1.18.2</version>
</dependency>
<dependency>
<groupId>org.locationtech.spatial4j</groupId>
<artifactId>spatial4j</artifactId>
<version>0.8</version>
</dependency>
Initially I didn't have the second dependency in the pom.xml file.
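As a general troubleshooting trick for this kind of conflict, once the project compiles again you can print which jar a class is actually resolved from at run time (illustrative only; substitute any class you suspect):
// Prints the jar (or class directory) that JtsGeometry was loaded from, which
// makes it obvious whether an old spatial4j build is still on the classpath.
System.out.println(org.locationtech.spatial4j.shape.jts.JtsGeometry.class
        .getProtectionDomain().getCodeSource().getLocation());
The original error most likely came from an older spatial4j (0.6 or earlier), which was still compiled against com.vividsolutions JTS; pinning spatial4j 0.8 as above avoids that.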
I have the below code snippet:
JSONArray processNodes = new JSONObject(customCon.geteOutput())
.getJSONArray("process-node");
processNodes.forEach(item -> {JSONObject node = (JSONObject) item;});
I added the dependency in pom.xml as:
<dependency>
<groupId>org.json</groupId>
<artifactId>json</artifactId>
<version>20160810</version>
</dependency>
But at runtime it fails with java.lang.NoSuchMethodError: org.json.JSONArray.forEach(Ljava/util/function/Consumer;)
Any idea why I am having this error?
There is a mismatch between the version of JSONArray that you compiled against and the one that you are using at runtime. That causes the error.
According to the javadoc for the 20160810 version in your POM file, there is a forEach method on org.json.JSONArray that is defined by the Iterable interface.
However it is clear from the exception that the version of JSONArray that you are using at runtime does not have that method.
Note that the method is not present in the Android version (see https://developer.android.com/reference/org/json/JSONArray) and won't be present in versions prior to Java 8 ... because Java 7 Iterable doesn't have a forEach method (javadoc).
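If you cannot control which org.json jar ends up on the runtime classpath, a plain index loop avoids Iterable.forEach entirely. A rough sketch based on the snippet in the question (customCon.geteOutput() is taken from there):
JSONArray processNodes = new JSONObject(customCon.geteOutput())
        .getJSONArray("process-node");
// length() and getJSONObject(int) exist in every org.json variant, including
// the Android one, so this works regardless of which jar is loaded at runtime.
for (int i = 0; i < processNodes.length(); i++) {
    JSONObject node = processNodes.getJSONObject(i);
    // ... use node ...
}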
Check for the correct import statements. Your code seems fine. I did a quick check in Eclipse and it works fine:
import java.util.List;

import org.json.JSONArray;
import org.json.JSONObject;

public class Test {
    public static void main(String[] args) {
        JSONObject jo = new JSONObject();
        jo.put("red", new JSONArray(List.of(1, 2, 3, 4, 5)));
        jo.put("blue", "green");
        jo.getJSONArray("red").forEach(item -> { String var = item.toString(); });
    }
}
This is because, at run time, the jar is picked up from Spark's own jars folder. To override this, specify the jar with the --jars option of spark-submit and also add confs like this:
--conf spark.driver.extraClassPath=json-20200518.jar
--conf spark.executor.extraClassPath=json-20200518.jar
https://hadoopsters.com/2019/05/08/how-to-override-a-spark-dependency-in-client-or-cluster-mode/
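Put together, the spark-submit invocation might look roughly like this (the class name and paths are placeholders, not taken from the question):
spark-submit \
  --class com.example.MyJob \
  --jars /path/to/json-20200518.jar \
  --conf spark.driver.extraClassPath=json-20200518.jar \
  --conf spark.executor.extraClassPath=json-20200518.jar \
  /path/to/my-application.jar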
I'm running Spark 2.4.3 in standalone mode in Ubuntu. I am using Maven to create the JAR file. Below is the code I'm trying to run which is intended to stream data from Twitter.
Once Spark is started, the Spark master will be at 127.0.1.1:7077.
The java version being used is 1.8.
package SparkTwitter.SparkJavaTwitter;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.twitter.TwitterUtils;
import scala.Tuple2;
import twitter4j.Status;
import twitter4j.auth.Authorization;
import twitter4j.auth.OAuthAuthorization;
import twitter4j.conf.Configuration;
import twitter4j.conf.ConfigurationBuilder;
import com.google.common.collect.Iterables;
public class TwitterStream {
public static void main(String[] args) {
// Prepare the spark configuration by setting application name and master node "local" i.e. embedded mode
final SparkConf sparkConf = new SparkConf().setAppName("Twitter Data Processing").setMaster("local[2]");
// Create Streaming context using spark configuration and duration for which messages will be batched and fed to Spark Core
final JavaStreamingContext streamingContext = new JavaStreamingContext(sparkConf, Duration.apply(10000));
// Prepare configuration for Twitter authentication and authorization
final Configuration conf = new ConfigurationBuilder().setDebugEnabled(false)
.setOAuthConsumerKey("customer key")
.setOAuthConsumerSecret("customer key secret")
.setOAuthAccessToken("Access token")
.setOAuthAccessTokenSecret("Access token secret")
.build();
// Create Twitter authorization object by passing prepared configuration containing consumer and access keys and tokens
final Authorization twitterAuth = new OAuthAuthorization(conf);
// Create a data stream using streaming context and Twitter authorization
final JavaReceiverInputDStream<Status> inputDStream = TwitterUtils.createStream(streamingContext, twitterAuth, new String[]{});
// Create a new stream by filtering the non english tweets from earlier streams
final JavaDStream<Status> enTweetsDStream = inputDStream.filter((status) -> "en".equalsIgnoreCase(status.getLang()));
// Convert stream to pair stream with key as user screen name and value as tweet text
final JavaPairDStream<String, String> userTweetsStream =
enTweetsDStream.mapToPair(
(status) -> new Tuple2<String, String>(status.getUser().getScreenName(), status.getText())
);
// Group the tweets for each user
final JavaPairDStream<String, Iterable<String>> tweetsReducedByUser = userTweetsStream.groupByKey();
// Create a new pair stream by replacing iterable of tweets in older pair stream to number of tweets
final JavaPairDStream<String, Integer> tweetsMappedByUser = tweetsReducedByUser.mapToPair(
userTweets -> new Tuple2<String, Integer>(userTweets._1, Iterables.size(userTweets._2))
);
// Iterate over the stream's RDDs and print each element on console
tweetsMappedByUser.foreachRDD((VoidFunction<JavaPairRDD<String, Integer>>)pairRDD -> {
pairRDD.foreach(new VoidFunction<Tuple2<String,Integer>>() {
@Override
public void call(Tuple2<String, Integer> t) throws Exception {
System.out.println(t._1() + "," + t._2());
}
});
});
// Triggers the start of processing. Nothing happens if streaming context is not started
streamingContext.start();
// Keeps the processing live by halting here unless terminated manually
//streamingContext.awaitTermination();
}
}
pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>SparkTwitter</groupId>
<artifactId>SparkJavaTwitter</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<name>SparkJavaTwitter</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>2.4.3</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.12</artifactId>
<version>2.4.3</version>
<scope>provided</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-twitter -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-twitter_2.11</artifactId>
<version>1.6.3</version>
</dependency>
</dependencies>
</project>
To execute the code I'm using the following command
./bin/spark-submit --class SparkTwitter.SparkJavaTwitter.TwitterStream /home/hadoop/eclipse-workspace/SparkJavaTwitter/target/SparkJavaTwitter-0.0.1-SNAPSHOT.jar
Below is the output I'm getting.
19/11/10 22:17:58 WARN Utils: Your hostname, hadoop-VirtualBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
19/11/10 22:17:58 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
19/11/10 22:17:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Warning: Failed to load SparkTwitter.SparkJavaTwitter.TwitterStream: twitter4j/auth/Authorization
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
I've been running a word count program the same way and it works fine. When I build the JAR it builds successfully as well. Do I have to specify any more parameters while running the JAR?
I've faced a similar problem and found that you need to give the jars directly to spark-submit. What I do is point to the directory where the jars used to build the project are stored, using the --jars "<path-to-jars>/*" option of spark-submit.
Perhaps this is not the best option, but it works...
Also, when updating versions beware that the jars in that folder must also be updated.
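For this particular job, the command might look something like the following (the lib directory is a placeholder for wherever your build collects its dependency jars; only the application jar path is taken from the question):
./bin/spark-submit \
  --class SparkTwitter.SparkJavaTwitter.TwitterStream \
  --jars "/home/hadoop/eclipse-workspace/SparkJavaTwitter/lib/*" \
  /home/hadoop/eclipse-workspace/SparkJavaTwitter/target/SparkJavaTwitter-0.0.1-SNAPSHOT.jar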
I want to save a random forest regression model in PMML from R, and load it in Spark (Scala or Java). Unfortunately I have issues in the second step.
A minimal example of saving a random forest regression model as PMML in R is provided below.
When I try to load this model from Scala or Java using jpmml (see code below), I get the following error:
Exception in thread "main" java.lang.IllegalArgumentException: http://www.dmg.org/PMML-4_3
I can overcome this error by editing the XML file: the xmlns attribute of the PMML tag contains the URL that appears in the error message. If I remove the URL completely or change 4_3 to 4_2, this error disappears. However, a new error message appears:
Exception in thread "main" org.jpmml.evaluator.UnsupportedFeatureException (at or around line 19): MiningModel
Do you have any suggestions or ideas on how to solve this specific error or, more generally, how to load in Scala a PMML file created in R?
Thank you!
Update: The problem, as answered by @user1808924, was the version of the jpmml library. The code quoted below now works fine. The correct libraries should be loaded, for example from the Maven Central repository:
<dependency>
<groupId>org.jpmml</groupId>
<artifactId>pmml-evaluator</artifactId>
<version>1.3.6</version>
</dependency>
<dependency>
<groupId>org.jpmml</groupId>
<artifactId>pmml-model</artifactId>
<version>1.3.7</version>
</dependency>
<dependency>
<groupId>org.jpmml</groupId>
<artifactId>pmml-spark</artifactId>
<version>1.0-SNAPSHOT</version>
</dependency>
Minimal example of saving a random forest regression model as PMML in R:
library(randomForest)
library(r2pmml)
data(mtcars)
MPGmodel.rf <- randomForest(mpg~., mtcars, ntree=5, do.trace=1)
# with package "r2pmml", convert model to pmml version 4.3 and save to xml:
r2pmml(MPGmodel.rf, "MPGmodel-r2pmml.pmml")
Loading the model in Scala:
import java.io.File
import org.jpmml.evaluator.Evaluator
import org.jpmml.spark.EvaluatorUtil
val fileNamePmml = "MPGmodel-r2pmml.pmml"
val pmmlFile = new File(fileNamePmml)
// the "UnsupportedFeature MiningModel" error appears here:
val myEvaluator: Evaluator = EvaluatorUtil.createEvaluator(pmmlFile)
I've also tried to load the model using Java, with identical error messages:
import org.dmg.pmml.PMML;
import org.jpmml.evaluator.ModelEvaluator;
import org.jpmml.evaluator.ModelEvaluatorFactory;
import java.io.*;
import java.util.Scanner;
import java.io.ByteArrayInputStream;
String fileNamePmml = "MPGmodel-r2pmml.pmml";
File pmmlFile = new File(fileNamePmml);
// the pmml file is successfully loaded as a string:
String pmmlString = null;
pmmlString = new Scanner(pmmlFile).useDelimiter("FILEFINISHESHERE").next();
// a PMML object is successfully created from the pmml string:
PMML myPmml = null;
try(InputStream is = new ByteArrayInputStream(pmmlString.getBytes())){
myPmml = org.jpmml.model.PMMLUtil.unmarshal(is);
}
// the "UnsupportedFeature MiningModel" error appears here:
ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();
ModelEvaluator<?> modelEvaluator = modelEvaluatorFactory.newModelEvaluator(myPmml);
You're using a legacy JPMML library, which was discontinued 3+ years ago. Naturally, it doesn't support new PMML features (such as PMML 4.2 and 4.3 schemas) that have been added since then.
Simply upgrade to the JPMML-Evaluator library. As a bonus, your code will be much shorter and cleaner.
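For reference, with a recent JPMML-Evaluator release (1.4 or newer; this illustrates the newer builder API rather than the exact code from the question), loading reduces to something like:
import java.io.File;
import org.jpmml.evaluator.Evaluator;
import org.jpmml.evaluator.LoadingModelEvaluatorBuilder;

// Builds an Evaluator directly from the PMML file in one step.
Evaluator evaluator = new LoadingModelEvaluatorBuilder()
        .load(new File("MPGmodel-r2pmml.pmml"))
        .build();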
You could use PMML4S to load the PMML model in Scala, for example:
import org.pmml4s.model.Model
val model = Model.fromFile("MPGmodel-r2pmml.pmml")
val result = model.predict(data)
The input data could be a map, a list of pairs of keys and values, an array, json, or PMML4S's Series.
I just want to do some 2D matrix operations using JavaRDD and looked at this link: https://spark.apache.org/docs/latest/mllib-data-types.html. I tried exactly the same sample code given there, but Eclipse doesn't seem to recognize mllib in the first place. Here is my code snippet (same as in the above link):
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Matrices;
import org.apache.spark.mllib.linalg.QRDecomposition;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;
JavaRDD<Vector> rows = ... // a JavaRDD of local vectors
// Create a RowMatrix from an JavaRDD<Vector>.
RowMatrix mat = new RowMatrix(rows.rdd());
// Get its size.
long m = mat.numRows();
long n = mat.numCols();
// QR decomposition
QRDecomposition<RowMatrix, Matrix> result = mat.tallSkinnyQR(true);
I am using Spark 2.0.2. Where am I going wrong? Do we need any maven dependency? I checked my spark home directory, and I have the mllib directory and mllib-local directory in my spark directory.
Check your pom.xml to see if there is a spark-mllib dependency. If not, get the right version from here: https://mvnrepository.com/artifact/org.apache.spark/spark-mllib_2.11
At the time of writing, the latest version is:
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>2.1.0</version>
</dependency>
Also make sure that the spark-mllib dependency in your pom.xml is not declared with runtime scope; otherwise it will not be on the compile classpath and the mllib imports will not resolve.
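Once the dependency resolves, a minimal sketch of the RowMatrix example might look like this (the vector values are made up purely for illustration):
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.QRDecomposition;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;

public class RowMatrixExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RowMatrixExample").setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // A tiny JavaRDD of local vectors standing in for real data.
        JavaRDD<Vector> rows = jsc.parallelize(Arrays.asList(
                Vectors.dense(1.0, 2.0),
                Vectors.dense(3.0, 4.0),
                Vectors.dense(5.0, 6.0)));

        // Create a RowMatrix from the JavaRDD<Vector> and inspect its size.
        RowMatrix mat = new RowMatrix(rows.rdd());
        System.out.println(mat.numRows() + " x " + mat.numCols());

        // QR decomposition of the (tall and skinny) matrix.
        QRDecomposition<RowMatrix, Matrix> result = mat.tallSkinnyQR(true);
        System.out.println(result.R());

        jsc.stop();
    }
}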