I am trying out a simple movie recommendation machine learning program in Spark.
Spark version: 2.1.1
Java version: Java 8
Scala version: Scala code runner version 2.11.7
Environment: Windows 7
I am running these commands to start the master and the worker:
//start master
spark-class org.apache.spark.deploy.master.Master
//start worker
spark-class org.apache.spark.deploy.worker.Worker spark://valid ip:7077
I am using a very simple movie recommendation example from here: http://blogs.quovantis.com/recommendation-engine-using-apache-spark/
I have updated the code to:
SparkConf conf = new SparkConf().setAppName("Collaborative Filtering Example").setMaster("spark://valid ip:7077");
conf.setJars(new String[] {"C:\\Spark2.1.1\\spark-2.1.1-bin-hadoop2.7\\jars\\spark-mllib_2.11-2.1.1.jar"});
I cannot run this through IntelliJ.
Running mvn clean install and copying the jar to the folder does not work either.
The command I used to run it:
bin\spark-submit --verbose –-jars jars\spark-mllib_2.11-2.1.1.jar –-class “com.abc.enterprise.RecommendationEngine” –-master spark://valid ip:7077 C:\Spark2.1.1\spark-2.1.1-bin-hadoop2.7\spark-mllib-example\spark-poc-1.0-SNAPSHOT.jar C:\Spark2.1.1\spark-2.1.1-bin-hadoop2.7\spark-mllib-example\ratings.csv C:\Spark2.1.1\spark-2.1.1-bin-hadoop2.7\spark-mllib-example\movies.csv 10
The error I see is:
C:\Spark2.1.1\spark-2.1.1-bin-hadoop2.7>bin\spark-submit --verbose --class "com.sandc.enterprise.RecommendationEngine" --master spark://10.64.98.101:7077 C:\Spark2.1.1\spark-2.1.1-
bin-hadoop2.7\spark-mllib-example\spark-poc-1.0-SNAPSHOT.jar C:\Spark2.1.1\spark-2.1.1-bin-hadoop2.7\spark-mllib-example\ratings.csv C:\Spark2.1.1\spark-2.1.1-bin-hadoop2.7\spark-m
llib-example\movies.csv 10
Using properties file: C:\Spark2.1.1\spark-2.1.1-bin-hadoop2.7\bin\..\conf\spark-defaults.conf
Adding default property: spark.serializer=org.apache.spark.serializer.KryoSerializer
Adding default property: spark.executor.extraJavaOptions=-XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
Adding default property: spark.eventLog.enabled=true
Adding default property: spark.driver.memory=5g
Adding default property: spark.master=spark://valid ip:7077
Error: Cannot load main class from JAR file:/C:/Spark2.1.1/spark-2.1.1-bin-hadoop2.7/û-class
Run with --help for usage help or --verbose for debug output
If I add the --jars option, it gives the error:
Error: Cannot load main class from JAR file:/C:/Spark2.1.1/spark-2.1.1-bin-hadoop2.7/û-jars
Any ideas how I can submit this job to Spark?
Is your jar built correctly?
Also, you don't need to add double quotes around the --class option value.
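For what it's worth, the û in the error output (û-class, û-jars) suggests the dashes in –-class and –-jars were typed as non-ASCII characters. A sketch of the same submit, assuming the same paths as in the question, with plain ASCII -- and no smart quotes:
REM same paths as in the question; only the dashes and quotes differ
bin\spark-submit --verbose --jars jars\spark-mllib_2.11-2.1.1.jar --class com.abc.enterprise.RecommendationEngine --master spark://valid ip:7077 C:\Spark2.1.1\spark-2.1.1-bin-hadoop2.7\spark-mllib-example\spark-poc-1.0-SNAPSHOT.jar C:\Spark2.1.1\spark-2.1.1-bin-hadoop2.7\spark-mllib-example\ratings.csv C:\Spark2.1.1\spark-2.1.1-bin-hadoop2.7\spark-mllib-example\movies.csv 10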
I'm setting up GeoSpark Python, and after installing all the prerequisites, I'm running a very basic code example to test it.
from pyspark.sql import SparkSession
from geo_pyspark.register import GeoSparkRegistrator
spark = SparkSession.builder.\
    getOrCreate()
GeoSparkRegistrator.registerAll(spark)
df = spark.sql("""SELECT st_GeomFromWKT('POINT(6.0 52.0)') as geom""")
df.show()
I tried running it with python3 basic.py and spark-submit basic.py, both of which give me this error:
Traceback (most recent call last):
File "/home/jessica/Downloads/geo_pyspark/basic.py", line 8, in <module>
GeoSparkRegistrator.registerAll(spark)
File "/home/jessica/Downloads/geo_pyspark/geo_pyspark/register/geo_registrator.py", line 22, in registerAll
cls.register(spark)
File "/home/jessica/Downloads/geo_pyspark/geo_pyspark/register/geo_registrator.py", line 27, in register
spark._jvm. \
TypeError: 'JavaPackage' object is not callable
I'm using Java 8, Python 3, and Apache Spark 2.4; my JAVA_HOME is set correctly, and I'm running Linux Mint 19. My SPARK_HOME is also set:
$ printenv SPARK_HOME
/home/jessica/spark/
How can I fix this?
The jars for GeoSpark are not correctly registered with your Spark session. There are a few ways around this, ranging from a tad inconvenient to pretty seamless. For example, if, when you call spark-submit, you specify:
--jars jar1.jar,jar2.jar,jar3.jar
then the problem will go away. You can also provide a similar option to pyspark if that's your poison.
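For example, a minimal sketch of the pyspark equivalent (the jar names are placeholders, as above):
pyspark --jars jar1.jar,jar2.jar,jar3.jar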
If, like me, you don't really want to be doing this every time you boot (and setting this as a .conf() in Jupyter will get tiresome), then instead you can go into $SPARK_HOME/conf/spark-defaults.conf and set:
spark.jars jar1.jar,jar2.jar,jar3.jar
which will then be loaded when you create a Spark instance. If you've not used the conf file before, it'll be there as spark-defaults.conf.template.
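A quick sketch of creating the conf file from the template, assuming a standard Spark layout:
# copy the template, then add the spark.jars line to the new file
cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf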
Of course, when I say jar1.jar..., what I really mean is something along the lines of:
/jars/geo_wrapper_2.11-0.3.0.jar,/jars/geospark-1.2.0.jar,/jars/geospark-sql_2.3-1.2.0.jar,/jars/geospark-viz_2.3-1.2.0.jar
but that's up to you to get the right ones from the geo_pyspark package.
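If you'd rather do it in code (the .conf() route mentioned above), a minimal sketch, assuming the jars sit under /jars as in the paths above:

from pyspark.sql import SparkSession
from geo_pyspark.register import GeoSparkRegistrator

# spark.jars must be set before the session is created
jars = "/jars/geo_wrapper_2.11-0.3.0.jar,/jars/geospark-1.2.0.jar,/jars/geospark-sql_2.3-1.2.0.jar,/jars/geospark-viz_2.3-1.2.0.jar"
spark = SparkSession.builder.config("spark.jars", jars).getOrCreate()
GeoSparkRegistrator.registerAll(spark)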
If you are using EMR:
You need to set your cluster config JSON to:
[
  {
    "classification": "spark-defaults",
    "properties": {
      "spark.jars": "/jars/geo_wrapper_2.11-0.3.0.jar,/jars/geospark-1.2.0.jar,/jars/geospark-sql_2.3-1.2.0.jar,/jars/geospark-viz_2.3-1.2.0.jar"
    },
    "configurations": []
  }
]
and also upload your jars as part of your bootstrap. You can do this from Maven, but I just threw them on an S3 bucket:
#!/bin/bash
sudo mkdir /jars
sudo aws s3 cp s3://geospark-test-ds/bootstrap/geo_wrapper_2.11-0.3.0.jar /jars/
sudo aws s3 cp s3://geospark-test-ds/bootstrap/geospark-1.2.0.jar /jars/
sudo aws s3 cp s3://geospark-test-ds/bootstrap/geospark-sql_2.3-1.2.0.jar /jars/
sudo aws s3 cp s3://geospark-test-ds/bootstrap/geospark-viz_2.3-1.2.0.jar /jars/
If you are using an EMR Notebook
You need a magic cell at the top of your notebook:
%%configure -f
{
  "jars": [
    "s3://geospark-test-ds/bootstrap/geo_wrapper_2.11-0.3.0.jar",
    "s3://geospark-test-ds/bootstrap/geospark-1.2.0.jar",
    "s3://geospark-test-ds/bootstrap/geospark-sql_2.3-1.2.0.jar",
    "s3://geospark-test-ds/bootstrap/geospark-viz_2.3-1.2.0.jar"
  ]
}
I was seeing a similar kind of issue with the SparkMeasure jars on a Windows 10 machine:
self.stagemetrics = self.sc._jvm.ch.cern.sparkmeasure.StageMetrics(self.sparksession._jsparkSession)
TypeError: 'JavaPackage' object is not callable
So what I did was:
Went to SPARK_HOME, launched the PySpark shell, and pulled in the required jar with --packages:
bin/pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.16
Grabbed that jar (ch.cern.sparkmeasure_spark-measure_2.12-0.16.jar) and copied it into the jars folder of SPARK_HOME.
Reran the script, and it now worked without the above error.
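As a quick sanity check (not part of the original steps), once the jar is in SPARK_HOME/jars the JVM class should be constructible from a plain PySpark session; a minimal sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# if the spark-measure jar is visible, this returns a JavaObject instead of raising
# "TypeError: 'JavaPackage' object is not callable"
metrics = spark._jvm.ch.cern.sparkmeasure.StageMetrics(spark._jsparkSession)
print(metrics)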
I was trying to configure WebLogic 12.2.1.3 on a Linux account.
After going through the Oracle documentation, I understood that we set up WebLogic by running the config.sh script located in the /opt/weblogic12213/wlserver_12.2.1.3/installation/oracle_common/common/bin folder.
But this script is giving an error:
Error: Could not find or load main class
com.oracle.cie.wizard.WizardController
Below is the last command the setup script executes, and it is the one giving the error:
/usr/java/jdk1.8.0_192/bin/java -Dpython.cachedir=/tmp/cachedir
-Xms32m -Xmx1024m -Dweblogic.alternateTypesDirectory=/opt/weblogic12213/wlserver_12.2.1.3/installation/wlserver/../oracle_common/modules/oracle.oamprovider,/opt/weblogic12213/wlserver_12.2.1.3/installation/wlserver/../oracle_common/modules/oracle.jps
com.oracle.cie.wizard.WizardController -target=config-oneclick
The scripts provided by WebLogic don't set the CLASSPATH correctly.
com.oracle.cie.wizard.WizardController lives in com.oracle.cie.wizard_7.8.2.0.jar.
Steps done:
1. Copied the config_internal.sh script to a local path (because I didn't have root permission).
2. Appended this jar to the CLASSPATH variable like this:
CLASSPATH="${FMWCONFIG_CLASSPATH}${CLASSPATHSEP}${DERBY_CLASSPATH}:/opt/weblogic12213/wlserver_12.2.1.3/installation/oracle_common/modules/com.oracle.cie.wizard_7.8.2.0.jar"
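Put together, a rough sketch of the workaround, assuming config_internal.sh sits next to config.sh in the same bin folder and that a writable local path such as ~/wls exists (both assumptions, not from the original answer):

#!/bin/bash
# copy the internal script somewhere writable (hypothetical local path)
mkdir -p ~/wls
cp /opt/weblogic12213/wlserver_12.2.1.3/installation/oracle_common/common/bin/config_internal.sh ~/wls/
# edit ~/wls/config_internal.sh and extend the CLASSPATH assignment with the wizard jar:
# CLASSPATH="${FMWCONFIG_CLASSPATH}${CLASSPATHSEP}${DERBY_CLASSPATH}:/opt/weblogic12213/wlserver_12.2.1.3/installation/oracle_common/modules/com.oracle.cie.wizard_7.8.2.0.jar"
# then run the copied script
bash ~/wls/config_internal.sh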
I am running my Spark job in YARN client mode, and I need the guava jar on both the driver and executor classpaths.
So I am running the below spark-submit command:
spark-submit --class "com.bk.App" --master yarn --deploy-mode client --executor-cores 2 --driver-memory 1G --driver-cores 1 --driver-class-path /home/my_account/spark-jars/guava-19.0.jar --conf spark.executor.extraClassPath=/home/my_account/spark-jars/guava-19.0.jar maprfs:///user/my_account/jobs/spark-jobs.jar parma1 parma2
which gives me the below exception:
Downloading maprfs:///user/my_account/jobs/spark-jobs.jar to /tmp/tmp732578642370806645/user/my_account/jobs/spark-jobs.jar.
2018-10-29 19:37:52,2025 ERROR JniCommon fs/client/fileclient/cc/jni_MapRClient.cc:566 Thread: 6832 Client initialization failed due to mismatch of libraries. Please make sure that the java library version matches the native build version 5.0.0.32987.GA and native patch version $Id: mapr-version: 5.0.0.32987.GA 40889:3056362e419b $
Exception in thread "main" java.io.IOException: Could not create FileClient
at com.mapr.fs.MapRFileSystem.lookupClient(MapRFileSystem.java:593)
at com.mapr.fs.MapRFileSystem.lookupClient(MapRFileSystem.java:654)
at com.mapr.fs.MapRFileSystem.getMapRFileStatus(MapRFileSystem.java:1310)
at com.mapr.fs.MapRFileSystem.getFileStatus(MapRFileSystem.java:942)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:345)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:297)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2066)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2035)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2011)
at org.apache.spark.deploy.SparkSubmit$.downloadFile(SparkSubmit.scala:874)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$1.apply(SparkSubmit.scala:316)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$1.apply(SparkSubmit.scala:316)
at scala.Option.map(Option.scala:146)
at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:316)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I even tried putting my guava jar in an HDFS location and adding the extra classpath using hdfs:// and even maprfs:// in my spark-submit. I tried giving local:// too. All end up in the same exception above.
Note: the job works absolutely fine if no driver and executor extra classpath jars are given.
Any suggestions? Am I using the classpath params in the wrong way?
I'm really new to Maven and Storm, so I'm trying to follow the instructions in https://github.com/apache/storm/tree/master/examples/storm-starter. My current path is /home/luc/theTest/storm/examples/storm-starter. Inside the target folder there is a storm-starter-2.0.0-SNAPSHOT.jar file. I'm getting stuck when running
storm jar target/storm-starter-*.jar org.apache.storm.starter.ExclamationTopology -local
I get this error
Running: /usr/lib/jvm/java-8-openjdk-amd64/bin/java -client -Ddaemon.name= -Dstorm.options= -Dstorm.home=/home/luc/stormTest/apache-storm-1.1.1 -Dstorm.log.dir=/home/luc/stormTest/apache-storm-1.1.1/logs -Djava.library.path=/usr/local/lib:/opt/local/lib:/usr/lib -Dstorm.conf.file= -cp /home/luc/stormTest/apache-storm-1.1.1/lib/servlet-api-2.5.jar:/home/luc/stormTest/apache-storm-1.1.1/lib/slf4j-api-1.7.21.jar:/home/luc/stormTest/apache-storm-1.1.1/lib/objenesis-2.1.jar:/home/luc/stormTest/apache-storm-1.1.1/lib/kryo-3.0.3.jar:/home/luc/stormTest/apache-storm-1.1.1/lib/log4j-core-2.8.2.jar:/home/luc/stormTest/apache-storm-1.1.1/lib/log4j-over-slf4j-1.6.6.jar:/home/luc/stormTest/apache-storm-1.1.1/lib/storm-core-1.1.1.jar:/home/luc/stormTest/apache-storm-1.1.1/lib/log4j-slf4j-impl-2.8.2.jar:/home/luc/stormTest/apache-storm-1.1.1/lib/minlog-1.3.0.jar:/home/luc/stormTest/apache-storm-1.1.1/lib/log4j-api-2.8.2.jar:/home/luc/stormTest/apache-storm-1.1.1/lib/clojure-1.7.0.jar:/home/luc/stormTest/apache-storm-1.1.1/lib/ring-cors-0.1.5.jar:/home/luc/stormTest/apache-storm-1.1.1/lib/asm-5.0.3.jar:/home/luc/stormTest/apache-storm-1.1.1/lib/reflectasm-1.10.1.jar:/home/luc/stormTest/apache-storm-1.1.1/lib/disruptor-3.3.2.jar:/home/luc/stormTest/apache-storm-1.1.1/lib/storm-rename-hack-1.1.1.jar:target/storm-starter-2.0.0-SNAPSHOT.jar:/home/luc/stormTest/apache-storm-1.1.1/conf:/home/luc/stormTest/apache-storm-1.1.1/bin -Dstorm.jar=target/storm-starter-2.0.0-SNAPSHOT.jar -Dstorm.dependency.jars= -Dstorm.dependency.artifacts={} org.apache.storm.starter.ExclamationTopology -local
Error: Could not find or load main class org.apache.storm.starter.ExclamationTopology
Am I doing something wrong? I'm also a bit confused about whether I have to run nimbus and the supervisor first. I tried with and without them, and neither worked anyway. I've been searching the web, but nothing works. Not sure what else to try.
This is usually caused by inconsistent versions of storm-client and storm-starter. Try following these steps to get the example working.
download the latest release from http://storm.apache.org/downloads.html
in this example, we will use version 1.1.1
extract this to a folder; let's call it ${STORM_HOME}
cd into ${STORM_HOME}/examples/storm-starter
execute mvn package -DskipTests=true
this should build the storm-starter jar in target folder
${STORM_HOME}/examples/storm-starter/target/storm-starter-1.1.1.jar
run the example from ${STORM_HOME} directory:
./bin/storm jar examples/storm-starter/target/storm-starter-1.1.1.jar org.apache.storm.starter.ExclamationTopology
don't add the -local flag, since it seems ExclamationTopology is only deployed to a LocalCluster if no args are passed. You can check the source code here: ${STORM_HOME}/examples/storm-starter/src/jvm/org/apache/storm/starter/ExclamationTopology.java. The full sequence is sketched below.
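Put together, a sketch of the whole sequence, assuming the 1.1.1 release has been downloaded and extracted to ~/apache-storm-1.1.1 (that location is just an example):

# build the starter jar and submit the example topology
export STORM_HOME=~/apache-storm-1.1.1
cd ${STORM_HOME}/examples/storm-starter
mvn package -DskipTests=true     # produces target/storm-starter-1.1.1.jar
cd ${STORM_HOME}
./bin/storm jar examples/storm-starter/target/storm-starter-1.1.1.jar org.apache.storm.starter.ExclamationTopology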
When I run HiveRead.java from the IntelliJ IDE, it runs successfully and I get the result. Then I created a jar file (it's a Maven project) and tried to run it from the command line, which gave me
ClassLoaderResolver for class "" gave error on creation : {1}
Then I looked at SO answers and found I had to add the datanucleus jars, so I did something like this:
java -jar /home/saurab/sparkProjects/spark_hive/target/myJar-jar-with-dependencies.jar --jars jars/datanucleus-api-jdo-3.2.6.jar,jars/datanucleus-core-3.2.10.jar,jars/datanucleus-rdbms-3.2.9.jar,/home/saurab/hadoopec/hive/lib/mysql-connector-java-5.1.38.jar
Then I got this error:
org.datanucleus.exceptions.NucleusUserException: Persistence process has been specified to use a ClassLoaderResolver of name "datanucleus" yet this has not been found by the DataNucleus plugin mechanism. Please check your CLASSPATH and plugin specification.
Somewhere I found I should use spark-submit. So I did this:
./bin/spark-submit --class HiveRead --master yarn --jars jars/datanucleus-api-jdo-3.2.6.jar,jars/datanucleus-core-3.2.10.jar,jars/datanucleus-rdbms-3.2.9.jar,/home/saurab/hadoopec/hive/lib/mysql-connector-java-5.1.38.jar --files /home/saurab/hadoopec/spark/conf/hive-site.xml /home/saurab/sparkProjects/spark_hive/target/myJar-jar-with-dependencies.jar
Now I get a new type of error:
Table or view not found: `bigmart`.`o_sales`;
HELP ME !! :)
I have copied my hive-site.xml to /spark/conf and started the Hive metastore service (hiveserver2 --service metastore).
Here is the HiveRead.java code if anyone is interested.
The Spark session is not able to read the Hive directory.
Provide the hive-site.xml file path with the spark-submit command as below.
For Hortonworks, the file path is /usr/hdp/current/spark2-client/conf/hive-site.xml.
Pass it as --files /usr/hdp/current/spark2-client/conf/hive-site.xml in the spark-submit command.
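Combined with the command from the question, a sketch (only the hive-site.xml path is swapped for the Hortonworks one; the other paths come from the question):

./bin/spark-submit --class HiveRead --master yarn --jars jars/datanucleus-api-jdo-3.2.6.jar,jars/datanucleus-core-3.2.10.jar,jars/datanucleus-rdbms-3.2.9.jar,/home/saurab/hadoopec/hive/lib/mysql-connector-java-5.1.38.jar --files /usr/hdp/current/spark2-client/conf/hive-site.xml /home/saurab/sparkProjects/spark_hive/target/myJar-jar-with-dependencies.jar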