Unable to initialize Spark context using Java

I am trying a simple word count program using Spark, but it fails when I try to initialize the Spark context.
Below is my code:
SparkConf conf = new SparkConf(true)
        .setAppName("WordCount")
        .setMaster("spark://192.168.0.104:7077");
JavaSparkContext sc = new JavaSparkContext(conf);
A few things I wanted to clarify: I am using Spark version 2.1.1, my Java code runs on Windows 10, and my server is running in a VM (VirtualBox).
I have disabled the firewall in the VM and can access the URL http://192.168.0.104:8080/ from Windows.
However, I am getting the stack trace below when running the code:
17/08/06 18:44:15 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.0.103:4040
17/08/06 18:44:15 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://192.168.0.104:7077...
17/08/06 18:44:15 INFO TransportClientFactory: Successfully created connection to /192.168.0.104:7077 after 41 ms (0 ms spent in bootstraps)
17/08/06 18:44:15 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master 192.168.0.104:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
Can someone help?

A bit late, but for those running into this now: this can be caused by the Maven versions used for Spark Core and Spark SQL not being compatible with the Spark version used on the server. At this moment Spark 2.4.4 appears to be compatible with the following Maven setup:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.3.4</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.3.4</version>
</dependency>
Incompatibility issues can be diagnosed by viewing the logs of the Spark master node. They should mention something along the lines of local class incompatible: stream classdesc serialVersionUID ...
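To confirm which version your client jars actually carry, a small diagnostic sketch (not from the original question; it assumes the client jars are on the classpath and uses a local master so no cluster is needed) prints the driver-side Spark version, which you can then compare with the version shown on the master's web UI at port 8080:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class VersionCheck {
    public static void main(String[] args) {
        // A local master is enough: we only want the version bundled with the client jars.
        SparkConf conf = new SparkConf().setAppName("VersionCheck").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            System.out.println("Client-side Spark version: " + sc.version());
        }
    }
}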
I hope this is still of some use to someone!

You need to import some Spark classes into your program. Add the following lines:
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.SparkConf;
SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
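For completeness, here is a minimal word-count sketch built on those classes. It is only an illustration: the input path input.txt and the local[*] master are assumptions, not part of the original question.
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        // Using a local master sidesteps cluster connectivity while testing the job logic.
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("input.txt"); // assumed sample input file
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey((a, b) -> a + b);
            counts.collect().forEach(t -> System.out.println(t._1() + ": " + t._2()));
        }
    }
}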

Related

DB driver is not found when running in docker swarm

I have a Spring Boot application built with Maven. I can run the app successfully on my local machine, but when I run its image in the local Docker swarm with docker stack deploy --compose-file docker-compose.yml compose I get the following error: Caused by: java.lang.IllegalStateException: Cannot load driver class: org.postgresql.Driver
I've checked env.getPropertySources():
compose_service#debian| spring.datasource.driver-class-name=org.postgresql.Driver
compose_service#debian| spring.datasource.url=jdbc:postgresql://localhost:5432/service
compose_service#debian| spring.jpa.properties.hibernate.dialect=org.hibernate.dialect.PostgreSQLDialect
These props work fine when running locally.
I have checked that the built jar contains the PostgreSQL library; the Maven dependency in my project is:
<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
    <version>42.3.1</version>
</dependency>
I recently ran the app with docker-compose up and that worked as well, so it seems to be a problem specific to running in swarm. Any ideas?
It turned out I shouldn't add secrets to Docker swarm with echo,
which appends \n to each string (that's why my driver class name wasn't valid).
Instead, I should use printf:
printf "org.postgresql.Driver" | docker secret create db-driver -
Hope it will save time for someone
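If you cannot control how the secret was created, a defensive alternative is to trim the value when reading it yourself. This is only a sketch: the secret name db-driver and the /run/secrets path convention are taken from the example above, and the class name printed is just illustrative.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SecretReader {
    // Reads a Docker secret and strips the trailing newline that echo adds.
    static String readSecret(String name) throws IOException {
        byte[] raw = Files.readAllBytes(Paths.get("/run/secrets/" + name));
        return new String(raw, StandardCharsets.UTF_8).trim();
    }

    public static void main(String[] args) throws IOException {
        String driver = readSecret("db-driver"); // secret created in the example above
        System.out.println("driver-class-name=[" + driver + "]"); // brackets reveal stray whitespace
    }
}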

Unsupported partitioner with Amazon Keyspaces (for Apache Cassandra)

I have a Java Spring app and I'm using Amazon Keyspaces (for Apache Cassandra). I'm using the sigv4 plugin (version 4.0.2) and the Cassandra java-driver-core (version 4.4.0), and I have followed the official documentation on how to connect my Java app with MCS. The app connects just fine, but I'm getting a weird warning at startup:
WARN 1 --- [ s0-admin-0] .o.d.i.c.m.t.DefaultTokenFactoryRegistry : [s0] Unsupported partitioner 'com.amazonaws.cassandra.DefaultPartitioner', token map will be empty.
Everything looks good, but after a few minutes that warning comes back and my queries start to fail. This is what the logs look like after a few minutes:
WARN 1 --- [ s0-admin-0] .o.d.i.c.m.t.DefaultTokenFactoryRegistry : [s0] Unsupported partitioner 'com.amazonaws.cassandra.DefaultPartitioner', token map will be empty.
WARN 1 --- [ s0-io-1] c.d.o.d.i.c.m.SchemaAgreementChecker : [s0] Unknown peer xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx, excluding from schema agreement check
WARN 1 --- [ s0-io-0] c.d.o.d.i.c.control.ControlConnection : [s0] Unexpected error while refreshing schema after a successful reconnection, keeping previous version (CompletionException: com.datastax.oss.driver.api.core.connection.ClosedConnectionException: Channel was force-closed)
WARN 1 --- [ s0-io-1] c.d.o.d.i.c.m.DefaultTopologyMonitor : [s0] Control node ec2-x-xx-xxx-xx.us-east-2.compute.amazonaws.com/x.xx.xxx.xxx:xxxx has an entry for itself in system.peers: this entry will be ignored. This is likely due to a misconfiguration; please verify your rpc_address configuration in cassandra.yaml on all nodes in your cluster.
I have debugged a little, and it looks like that partitioner comes from the actual node metadata, so I don't really know if there's a way to fix it.
I've seen a similar question asked here recently, but no solution has been posted yet. Any ideas? Thanks so much in advance.
These are all warnings, not errors. Your connection should work just fine. They are logged because Amazon Keyspaces differs slightly from an actual Cassandra cluster. Try setting these options to get rid of the noise:
datastax-java-driver.advanced {
    metadata {
        schema.enabled = false
        token-map.enabled = false
    }
    connection.warn-on-init-error = false
}
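If you prefer not to maintain a separate driver config file, the same options can also be set programmatically when building the session. This is only a sketch assuming java-driver 4.x: the contact point, port, and datacenter are placeholders, and the SigV4 authentication and SSL setup required by Keyspaces (see the plugin docs) are omitted for brevity.
import java.net.InetSocketAddress;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
import com.datastax.oss.driver.api.core.config.DriverConfigLoader;

public class KeyspacesSession {
    public static void main(String[] args) {
        // Mirrors the config overrides shown above, but in code.
        DriverConfigLoader loader = DriverConfigLoader.programmaticBuilder()
                .withBoolean(DefaultDriverOption.METADATA_SCHEMA_ENABLED, false)
                .withBoolean(DefaultDriverOption.METADATA_TOKEN_MAP_ENABLED, false)
                .withBoolean(DefaultDriverOption.CONNECTION_WARN_INIT_ERROR, false)
                .build();

        try (CqlSession session = CqlSession.builder()
                .withConfigLoader(loader)
                .addContactPoint(new InetSocketAddress("cassandra.us-east-2.amazonaws.com", 9142)) // placeholder endpoint
                .withLocalDatacenter("us-east-2") // placeholder datacenter
                .build()) {
            // SigV4 auth and SSL still need to be configured for a real Keyspaces connection.
            System.out.println("Connected as session: " + session.getName());
        }
    }
}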
I had the same problem.
The problems above occurred when using Spring Boot 2.3.x, because Spring Boot 2.3.x brings in:
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-cassandra-reactive</artifactId>
</dependency>
OR
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-cassandra</artifactId>
</dependency>
With a Maven/Gradle build, these starters pull in datastax-java-driver-core 4.6.1, and I think this is another reason Amazon Keyspaces is not supported.
Back to the AWS library aws-sigv4-auth-cassandra-java-driver-plugin 4.0.2: with a Maven/Gradle build it pulls in datastax-java-driver-core 4.4.0.
So I am starting to think that Amazon Keyspaces may not support datastax-java-driver-core versions greater than 4.4.0.
Okay, it has taken a while, but if you want a Spring Boot 2 application to work, try the following steps.
In pom.xml:
remove aws-sigv4-auth-cassandra-java-driver-plugin
downgrade Spring Boot from 2.3.x to 2.2.9
add one of the dependencies below:
spring-boot-starter-data-cassandra-reactive
OR
spring-boot-starter-data-cassandra
Create an Amazon digital certificate and download it.
If you use IntelliJ IDEA, go to Edit Configurations -> VM Options and add:
-Djavax.net.ssl.trustStore=path_to_file/cassandra_truststore.jks
-Djavax.net.ssl.trustStorePassword=my_password
Reference: https://docs.aws.amazon.com/keyspaces/latest/devguide/using_java_driver.html
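If you would rather not depend on IDE run configurations, an alternative sketch is to set the same system properties in code before the first SSL connection is opened; the truststore path, password, and application class here are placeholders, not values from the answer above.
public class TrustStoreBootstrap {
    public static void main(String[] args) {
        // Equivalent to the -Djavax.net.ssl.* VM options above; must run before any SSL handshake.
        System.setProperty("javax.net.ssl.trustStore", "path_to_file/cassandra_truststore.jks");
        System.setProperty("javax.net.ssl.trustStorePassword", "my_password");

        // Then start the application, e.g.:
        // org.springframework.boot.SpringApplication.run(MyApplication.class, args);
    }
}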
In application-dev.yml, add the config below:
spring:
  data:
    cassandra:
      contact-points:
        - "cassandra.ap-southeast-1.amazonaws.com"
      port: 9142
      ssl: true
      username: "cassandra-username"
      password: "cassandra-password"
      keyspace-name: keyspace-name
      request:
        consistency: local_quorum
Run the test program.
Pass. Works for me.
Tech Stack
Spring Boot WebFlux 2.2.9.RELEASE
Cassandra Reactive
JDK 13
Cassandra Database with Amazon Keyspaces
Maven 3.6.3
Have fun with programming.

Failing to set up Zookeeper cluster for Pulsar

I am trying to set up a Zookeeper cluster for Pulsar. I am following the instructions here, but I keep failing.
In my setup, I have two nodes that should be part of the cluster. Since I need to deploy bookies to the same nodes, I executed
$ PULSAR_EXTRA_OPTS="-Dstats_server_port=8001" bin/pulsar-daemon start zookeeper
to start ZooKeeper. Afterwards, I am trying to initialize the cluster using this command:
bin/pulsar initialize-cluster-metadata \
--cluster pulsar-cluster-1 \
--zookeeper 10.100.100.77:2181 \
--configuration-store 10.100.100.77:2181 \
--web-service-url http://10.100.100.77:8080 \
--broker-service-url pulsar://10.100.100.77:6650 \
But I keep getting this error:
17:12:24.146 [main-SendThread(10.100.100.77:2181)] INFO org.apache.zookeeper.ClientCnxn - Socket error occurred: 10.100.100.77/10.100.100.77:2181: Verbindungsaufbau abgelehnt
17:12:25.251 [main-SendThread(10.100.100.77:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server 10.100.100.77/10.100.100.77:2181. Will not attempt to authenticate using SASL (unknown error)
I read here that I need to have an odd number of nodes, so I added a virtual machine on one of the nodes. When I start ZooKeeper on it, it doesn't print an error message, but shows:
$ PULSAR_EXTRA_OPTS="-Dstats_server_port=8001" bin/pulsar-daemon start zookeeper
doing start zookeeper ...
starting zookeeper, logging to /home/host1/apache-pulsar-2.4.0/logs/pulsar-zookeeper-host1-VirtualBox.log
OpenJDK 64-Bit Server VM warning: Option AggressiveOpts was deprecated in version 11.0 and will likely be removed in a future release.
[AppClassLoader#27c170f0] info AspectJ Weaver Version 1.9.2 built on Wednesday Oct 24, 2018 at 15:43:33 GMT
[AppClassLoader#27c170f0] info register classloader jdk.internal.loader.ClassLoaders$AppClassLoader#27c170f0
[AppClassLoader#27c170f0] info using configuration file:/home/host1/apache-pulsar-2.4.0/lib/org.apache.pulsar-pulsar-zookeeper-utils-2.4.0.jar!/META-INF/aop.xml
[AppClassLoader#27c170f0] info using configuration file:/home/host1/apache-pulsar-2.4.0/lib/org.apache.pulsar-pulsar-zookeeper-2.4.0.jar!/META-INF/aop.xml
[AppClassLoader#27c170f0] info register aspect org.apache.pulsar.zookeeper.SerializeUtilsAspect
[AppClassLoader#27c170f0] info register aspect org.apache.pulsar.broker.zookeeper.aspectj.ClientCnxnAspect
However, the ZooKeeper service is not started, even though the setup is very similar to its host's, and I can't figure out why.
Any ideas how I could proceed from here? Thanks in advance!
The first error you posted seems to indicate that the connection to 10.100.100.77:2181 is refused ("Verbindungsaufbau abgelehnt" is German for "connection refused"), and therefore the ZK server isn't listening on that server and port. You should first confirm that ZK is up and running and check the ZK log for any errors.
HTH
I found the solution. The original error was indeed caused by not having an odd number of nodes. The third (virtual) node wouldn't start because of a misplaced ZooKeeper data directory. Once the third server started, the initialization also completed successfully.

Not able to connect to Spark cluster via sparklyr package when my custom package method is invoked via OpenCPU

I have created an R package that makes use of sparklyr capabilities within a dummy hello function. My package does something very simple: it connects to a Spark cluster, prints the Spark version, and disconnects. The package is cleaned and built successfully and runs successfully from R and RStudio.
# Connect to Spark cluster
spark_conn <- sparklyr::spark_connect(master = "spark://elenipc.home:7077", spark_home = '/home/eleni/spark-2.2.0-bin-hadoop2.7/')
# Print the version of Spark
sv<- sparklyr::spark_version(spark_conn)
print(sv)
# Disconnect from Spark
sparklyr::spark_disconnect(spark_conn)
It is very important for me to be able to execute the hello function from the OpenCPU REST API. (I have used the OpenCPU API to execute many other custom packages.)
When invoking the OpenCPU API like this:
curl http://localhost/ocpu/user/rstudio/library/myFirstBigDataPackage/R/hello/print -X POST
I get the following response:
Failed while connecting to sparklyr to port (8880) for sessionid (89615): Gateway in port (8880) did not respond.
Path: /home/eleni/spark-2.2.0-bin-hadoop2.7/bin/spark-submit
Parameters: --class, sparklyr.Shell, '/home/rstudio/R/x86_64-pc-linux-gnu-library/3.4/sparklyr/java/sparklyr-2.2-2.11.jar', 8880, 89615
Log: /tmp/ocpu-temp/file26b165c92166_spark.log
---- Output Log ----
Error occurred during initialization of VM
Could not allocate metaspace: 1073741824 bytes
---- Error Log ----
In call:
force(code)
Of course, allocating more memory to both the Java and Spark executors does not resolve the issue. Permission issues have also been ruled out, as I already configured the etc/apparmor.d/opencpu.d/custom file to give OpenCPU rwx privileges on Spark. It seems to be a connectivity issue that I don't know how to tackle. During method invocation via the OpenCPU API, the Spark logs don't even print anything.
For your info, my environment configuration is as follows:
java version "1.8.0_65"
R version 3.4.1
RStudio version 1.0.153
spark-2.2.0-bin-hadoop2.7
opencpu 1.5 (compatible with my Ubuntu 14.04.3 LTS)
Thank you very much for your support and time!

Query dataframe using spark-sql from remote client

I have a client application from which I need to remotely execute queries on Spark using spark-sql. I am able to do it from spark-shell, but how can I execute them remotely from my Scala-based client application?
I have tried the following code:
val conf = new SparkConf()
  .set("spark.shuffle.blockTransferService", "nio")
  .setMaster("spark://master:port")
  .setAppName("Query Fire")
  .set("spark.hadoop.validateOutputSpecs", "true")
  .set("spark.local.dir", "/tmp/spark-temp")
  .set("spark.driver.memory", "4G")
  .set("spark.executor.memory", "4G")
val spark = SparkContext.getOrCreate(conf)
I tried the default port 7077, but it is not open. I have a Cloudera-based Spark installation, which does not appear to be running Spark standalone.
The error I get when I try to use the YARN ResourceManager port 8042 is the following:
16/09/16 20:14:36 WARN TransportChannelHandler: Exception in connection from /192.168.0.171:8042
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
Is there any way to get around this and remotely call spark-sql via a JDBC client, like we can do for Hive queries?
