Unable to fetch data from Cassandra using Spark (Java)

I am new to Cassandra and Spark and am trying to fetch data from the DB using Spark.
I am using Java for this purpose.
The problem is that no exceptions are thrown and no errors occur, but I am still not able to get the data. Find my code below:
SparkConf sparkConf = new SparkConf();
sparkConf.setAppName("Spark-Cassandra Integration");
sparkConf.setMaster("local[4]");
sparkConf.set("spark.cassandra.connection.host", "stagingHost22");
sparkConf.set("spark.cassandra.connection.port", "9042");
sparkConf.set("spark.cassandra.connection.timeout_ms", "5000");
sparkConf.set("spark.cassandra.read.timeout_ms", "200000");
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
String keySpaceName = "testKeySpace";
String tableName = "testTable";
CassandraJavaRDD<CassandraRow> cassandraRDD = CassandraJavaUtil.javaFunctions(javaSparkContext).cassandraTable(keySpaceName, tableName);
final ArrayList dataList = new ArrayList();
JavaRDD<String> userRDD = cassandraRDD.map(new Function<CassandraRow, String>() {
    private static final long serialVersionUID = -165799649937652815L;
    public String call(CassandraRow row) throws Exception {
        System.out.println("Inside RDD call");
        dataList.add(row);
        return "test";
    }
});
System.out.println("data Size -" + dataList.size());
The Cassandra and Spark Maven dependencies are:
<dependency>
<groupId>com.datastax.cassandra</groupId>
<artifactId>cassandra-driver-core</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>com.datastax.cassandra</groupId>
<artifactId>cassandra-driver-mapping</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>com.datastax.cassandra</groupId>
<artifactId>cassandra-driver-extras</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>com.sparkjava</groupId>
<artifactId>spark-core</artifactId>
<version>2.5.4</version>
</dependency>
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector_2.10</artifactId>
<version>2.0.0-M3</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.4.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.3.0</version>
</dependency>
It is certain that the stagingHost22 host has the Cassandra data, with keyspace testKeySpace and table testTable. Find the query output below:
cqlsh:testKeySpace> select count(*) from testTable;
count
34
(1 rows)
Can anybody please suggest what I am missing here?
Thanks in advance.
Warm regards,
Vibhav

Your current code does not perform any Spark action. Therefore no data is loaded.
See the Spark documentation to understand the difference between transformations and actions in Spark:
http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations
Furthermore, adding CassandraRows to an ArrayList isn't something that is usually necessary when using the Cassandra connector. I would suggest implementing a simple select first (following the Spark-Cassandra-Connector documentation; a minimal sketch follows below the links). Once this is working you can extend the code as needed.
Check the following links for samples of how to load data using the connector:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md
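For example, here is a minimal sketch of such a select (not the only way to do it; the column name read below is a placeholder, not one from the original table). The key point is that count() and collect() are actions, so they trigger the actual read, and the rows come back as return values on the driver rather than through a driver-side ArrayList mutated inside map():
CassandraJavaRDD<CassandraRow> cassandraRDD = CassandraJavaUtil.javaFunctions(javaSparkContext)
        .cassandraTable(keySpaceName, tableName);
// count() and collect() are actions, so they force the RDD to be evaluated
long rowCount = cassandraRDD.count();
List<CassandraRow> rows = cassandraRDD.collect();
System.out.println("row count = " + rowCount);
for (CassandraRow row : rows) {
    // "someColumn" is a placeholder for one of the actual column names in testTable
    System.out.println(row.getString("someColumn"));
}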

Related

Spark job works when running locally but not when running in standalone mode

I have a simple piece of Spark code that works fine when running locally; however, when I try to run it on a Spark standalone cluster with Docker, it strangely fails.
I can confirm that the integration with the master and the worker is working.
In the code below I show where the error is raised.
JavaRDD<Row> rddwithoutMap = dataFrame.javaRDD();
JavaRDD<Row> rddwithMap = dataFrame.javaRDD()
.map((Function<Row, Row>) row -> row);
long count = rddwithoutMap.count(); //here is fine
long countBeforeMap = rddwithMap.count(); // here I get the error
After the map, I can't call any Spark action.
The error: Caused by: java.lang.ClassNotFoundException: com.apssouza.lambda.MyApp$1
Note: I am using a lambda in the map to make the code more readable, but I am also not able to use a lambda when running the standalone version.
Caused by: java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.fun$1 of type org.apache.spark.api.java.function.Function in instance of org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1
Docker image: bde2020/spark-master:2.3.2-hadoop2.7
Local Spark version: 2.4.0
Spark dependency version: spark-core_2.11, version 2.3.2
public class MyApp {
public static void main(String[] args) throws IOException, URISyntaxException {
// String sparkMasterUrl = "local[*]";
// String csvFile = "/Users/apssouza/Projetos/java/lambda-arch/data/spark/input/localhost.csv";
String sparkMasterUrl = "spark://spark-master:7077";
String csvFile = "hdfs://namenode:8020/user/lambda/localhost.csv";
SparkConf sparkConf = new SparkConf()
.setAppName("Lambda-demo")
.setMaster(sparkMasterUrl);
// .setJars(/path/to/my/jar); I even tried to set the jar
JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
SQLContext sqlContext = new SQLContext(sparkContext);
Dataset<Row> dataFrame = sqlContext.read()
.format("csv")
.option("header", "true")
.load(csvFile);
JavaRDD<Row> rddwithoutMap = dataFrame.javaRDD();
JavaRDD<Row> rddwithMap = dataFrame.javaRDD()
.map((Function<Row, Row>) row -> row);
long count = rddwithoutMap.count();
long countBeforeMap = rddwithMap.count();
}
}
<?xml version="1.0" encoding="UTF-8"?>
<project>
<modelVersion>4.0.0</modelVersion>
<groupId>com.apssouza.lambda</groupId>
<artifactId>lambda-arch</artifactId>
<version>1.0-SNAPSHOT</version>
<name>lambda-arch</name>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
</properties>
<dependencies>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>2.9.7</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.3.2</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.3.2</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.6</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.module</groupId>
<artifactId>jackson-module-scala_2.11</artifactId>
<version>2.9.7</version>
</dependency>
</dependencies>
</project>
Note: if I uncomment the first two lines, everything works perfectly.
The problem was that I was not packaging my program before running it, so the Spark cluster was getting an outdated version of my app. This is weird because I run it through my IDE (IntelliJ) and it should package the jar before running it. Anyway, running mvn package before hitting the run button solved the issue.
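For reference, a hedged sketch of making the packaged jar explicit on the SparkConf so the executors never run stale classes (the jar path is an assumption derived from the artifactId and version in the pom.xml above, not something from the original post):
// Run "mvn package" first; the path below is assumed from the pom.xml coordinates.
SparkConf sparkConf = new SparkConf()
        .setAppName("Lambda-demo")
        .setMaster("spark://spark-master:7077")
        .setJars(new String[]{"target/lambda-arch-1.0-SNAPSHOT.jar"});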

Java Elasticsearch Bool Geo Query

I'm trying to issue an Elasticsearch query using the Java API from my application, but for some reason I keep getting the following error:
java.lang.NoClassDefFoundError:
org/apache/lucene/search/spans/SpanBoostQuery at
org.elasticsearch.index.query.QueryBuilders.boolQuery(QueryBuilders.java:301)
Below are the current dependencies I have in my pom.xml:
<dependency>
<groupId>org.elasticsearch.client</groupId>
<artifactId>transport</artifactId>
<version>5.4.2</version>
</dependency>
<dependency>
<groupId>org.locationtech.spatial4j</groupId>
<artifactId>spatial4j</artifactId>
<version>0.6</version>
</dependency>
<dependency>
<groupId>com.vividsolutions</groupId>
<artifactId>jts</artifactId>
<version>1.13</version>
<exclusions>
<exclusion>
<groupId>xerces</groupId>
<artifactId>xercesImpl</artifactId>
</exclusion>
</exclusions>
</dependency>
The code:
double lon = -115.14029016987968;
double lat = 36.17206351151878;
QueryBuilder fullq = boolQuery()
.must(matchAllQuery())
.filter(geoShapeQuery(
"geometry",
ShapeBuilders.newCircleBuilder().center(lon, lat).radius(10, DistanceUnit.METERS)).relation(ShapeRelation.INTERSECTS));
TransportClient client = new PreBuiltTransportClient(Settings.EMPTY)
.addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("localhost"), 9300));
SearchRequestBuilder finalQuery = client.prepareSearch("speedlimit").setTypes("speedlimit")
.setQuery(fullq);
SearchResponse searchResponse = finalQuery.execute().actionGet();
SearchHits searchHits = searchResponse.getHits();
if (searchHits.getTotalHits() > 0) {
String strSpeed = JsonPath.read(searchResponse.toString(), "$.hits.hits[0]._source.properties.TITLE");
int speed = Integer.parseInt(strSpeed.substring(0, 2));
}
else if (searchHits.getTotalHits() <= 0){
System.out.println("nothing");
}
This is the query I'm trying to run. I've followed the ES docs but can't get any further. Has anyone tried to run a query like this, or am I going down the wrong route? I'm tempted to just abandon the Java API and go back to making HTTP calls from Java, but I thought I would try their Java API first. Any tips appreciated, thanks.
This error was resolved for me after I removed an older dependency related to org.apache.lucene. We need to make sure all org.apache.lucene dependencies are at a version recent enough to contain SpanBoostQuery.
I commented out the dependency below and it worked:
<!--<dependency>-->
<!--<groupId>org.apache.lucene</groupId>-->
<!--<artifactId>lucene-spellchecker</artifactId>-->
<!--<version>3.6.2</version>-->
<!--</dependency>-->
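To track down which older Lucene artifact is being pulled in, listing the Lucene entries in the dependency tree can help (a hedged example; it only inspects the build, it does not change it):
mvn dependency:tree -Dincludes=org.apache.lucene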

Kafka stream not working in Spark job

I wrote code to get data from the "topicTest1" Kafka queue. I am not able to print data from the consumer. The errors that occurred are mentioned below.
Below is my code to consume the data:
public static void main(String[] args) throws Exception {
// StreamingExamples.setStreamingLogLevels();
SparkConf sparkConf = new SparkConf().setAppName("JavaKafkaWordCount").setMaster("local[*]");
// Create the context with 2 seconds batch size
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(100));
int numThreads = Integer.parseInt("3");
Map<String, Integer> topicMap = new HashMap<>();
String[] topics = "topicTest1".split(",");
for (String topic : topics) {
topicMap.put(topic, numThreads);
}
JavaPairReceiverInputDStream<String, String> messages = KafkaUtils.createStream(jssc, "9.98.171.226:9092", "1",
topicMap);
messages.print();
jssc.start();
jssc.awaitTermination();
}
Using the following dependencies:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.6.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.10</artifactId>
<version>1.6.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>1.6.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-twitter_2.11</artifactId>
<version>1.6.1</version>
</dependency>
Below is the error I got:
Exception in thread "dispatcher-event-loop-0" java.lang.NoSuchMethodError: scala/Predef$.$conforms()Lscala/Predef$$less$colon$less; (loaded from file:/C:/Users/Administrator/.m2/repository/org/scala-lang/scala-library/2.10.5/scala-library-2.10.5.jar by sun.misc.Launcher$AppClassLoader#4b69b358) called from class org.apache.spark.streaming.scheduler.ReceiverSchedulingPolicy (loaded from file:/C:/Users/Administrator/.m2/repository/org/apache/spark/spark-streaming_2.11/1.6.2/spark-streaming_2.11-1.6.2.jar by sun.misc.Launcher$AppClassLoader#4b69b358).
at org.apache.spark.streaming.scheduler.ReceiverSchedulingPolicy.scheduleReceivers(ReceiverSchedulingPolicy.scala:138)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receive$1.applyOrElse(ReceiverTracker.scala:450)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)16/11/14 13:38:00 INFO ForEachDStream: metadataCleanupDelay = -1
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:785)
Another error:
Exception in thread "JobGenerator" java.lang.NoSuchMethodError: scala/Predef$.$conforms()Lscala/Predef$$less$colon$less; (loaded from file:/C:/Users/Administrator/.m2/repository/org/scala-lang/scala-library/2.10.5/scala-library-2.10.5.jar by sun.misc.Launcher$AppClassLoader#4b69b358) called from class org.apache.spark.streaming.scheduler.ReceivedBlockTracker (loaded from file:/C:/Users/Administrator/.m2/repository/org/apache/spark/spark-streaming_2.11/1.6.2/spark-streaming_2.11-1.6.2.jar by sun.misc.Launcher$AppClassLoader#4b69b358).
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.allocateBlocksToBatch(ReceivedBlockTracker.scala:114)
at org.apache.spark.streaming.scheduler.ReceiverTracker.allocateBlocksToBatch(ReceiverTracker.scala:203)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:246)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:246)
at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:181)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:87)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:86)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Make sure that you use the correct versions. Let's say you use the following Maven dependency:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.10</artifactId>
<version>1.6.1</version>
</dependency>
So the artifact equals: spark-streaming-kafka_2.10
Now check if you use the correct Kafka version:
cd /KAFKA_HOME/libs
Now find kafka_YOUR-VERSION-sources.jar.
In case you have kafka_2.10-0xxxx-sources.jar you are fine! :)
If you use different versions, just change the Maven dependencies OR download the correct Kafka version.
After that, check your Spark version. Make sure you use the correct versions:
groupId: org.apache.spark
artifactId: spark-core_2.xx
version: xxx
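Note that the stack traces above also show scala-library 2.10.5 being loaded next to spark-streaming_2.11, i.e. mixed Scala versions. A hedged sketch of the Spark dependencies with every artifact on the same Scala suffix and version (here _2.10 and 1.6.1, matching spark-core in the question) would be:
<!-- sketch: all Spark artifacts share the same Scala suffix (_2.10) and Spark version -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.6.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.6.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.10</artifactId>
<version>1.6.1</version>
</dependency>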

LongComparator does not work in Google Cloud Bigtable with HBase API

I'm trying to build some filters to filter data from Bigtable. I'm using the bigtable-hbase driver and the HBase drivers. Here are my dependencies from pom.xml:
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-common</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-protocol</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>com.google.cloud.bigtable</groupId>
<artifactId>bigtable-hbase</artifactId>
<version>${bigtable.version}</version>
</dependency>
I'm filtering data like this:
Filter filterName = new SingleColumnValueFilter(Bytes.toBytes("FName"), Bytes.toBytes("FName"),
CompareFilter.CompareOp.EQUAL, new RegexStringComparator("JOHN"));
FilterList filters = new FilterList();
filters.addFilter(filterName);
Scan scan1 = new Scan();
scan1.setFilter(filters);
This works OK. But then I add the following to the previous code:
Filter filterSalary = new SingleColumnValueFilter(Bytes.toBytes("Salary"), Bytes.toBytes("Salary"),
CompareFilter.CompareOp.GREATER_OR_EQUAL, new LongComparator(100000));
filters.addFilter(filterSalary);
and it gives me this exception:
Exception in thread "main" com.google.cloud.bigtable.hbase.adapters.filters.UnsupportedFilterException: Unsupported filters encountered: FilterSupportStatus{isSupported=false, reason='ValueFilter must have either a BinaryComparator with any compareOp or a RegexStringComparator with an EQUAL compareOp. Found (LongComparator, GREATER_OR_EQUAL)'}
at com.google.cloud.bigtable.hbase.adapters.filters.FilterAdapter.throwIfUnsupportedFilter(FilterAdapter.java:144)
at com.google.cloud.bigtable.hbase.adapters.ScanAdapter.throwIfUnsupportedScan(ScanAdapter.java:55)
at com.google.cloud.bigtable.hbase.adapters.ScanAdapter.adapt(ScanAdapter.java:91)
at com.google.cloud.bigtable.hbase.adapters.ScanAdapter.adapt(ScanAdapter.java:43)
at com.google.cloud.bigtable.hbase.BigtableTable.getScanner(BigtableTable.java:247)
So my question is: how do I filter on a long data type? Is it an HBase issue or Bigtable-specific?
I found this: How do you use a custom comparator with SingleColumnValueFilter on HBase? but I can't load my own jars onto the server, so it is not applicable in my case.
SingleColumnValueFilter supports the following comparators:
BinaryComparator
BinaryPrefixComparator
RegexStringComparator.
See this link for an up-to-date list:
https://cloud.google.com/bigtable/docs/hbase-differences
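For a numeric threshold like the salary filter above, one workaround is to compare the raw byte encoding with a BinaryComparator. This is a sketch, assuming the Salary column was written with Bytes.toBytes(long); byte-wise comparison of that big-endian encoding matches numeric ordering only for non-negative values:
// LongComparator is not supported, so compare against the 8-byte big-endian
// encoding produced by Bytes.toBytes(long) using a BinaryComparator.
// Ordering is only correct if all stored salaries are non-negative.
Filter filterSalary = new SingleColumnValueFilter(Bytes.toBytes("Salary"), Bytes.toBytes("Salary"),
        CompareFilter.CompareOp.GREATER_OR_EQUAL, new BinaryComparator(Bytes.toBytes(100000L)));
filters.addFilter(filterSalary);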

Unable to query using Neo4j Rest API - Error reading as JSON ''

Not sure what I am doing wrong, but with my setup even a basic Cypher query using the Neo4j REST API is not working. I get a java.lang.RuntimeException: Error reading as JSON ''
My setup:
<dependency>
<groupId>org.neo4j</groupId>
<artifactId>neo4j-rest-graphdb</artifactId>
<version>2.0.0-M06</version>
</dependency>
<dependency>
<groupId>org.neo4j</groupId>
<artifactId>neo4j</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>org.neo4j.app</groupId>
<artifactId>neo4j-server</artifactId>
<version>2.0.0</version>
</dependency>
private GraphDatabaseService graphDb;
private RestCypherQueryEngine queryEngine;
System.setProperty("org.neo4j.rest.batch_transaction", "true");
graphDb = new RestGraphDatabase( "http://localhost:7474/db/data/" );
queryEngine = new RestCypherQueryEngine(((RestGraphDatabase)graphDb).getRestAPI());
StringBuilder query = new StringBuilder();
query.append("match (u { id:'").append(id).append("' }) return u");
QueryResult<Map<String,Object>> result = queryEngine.query(query.toString(), null);
//the above statement throws the runtime exception with message "Error reading as JSON '' "
I just released 2.0.0 of java-rest-binding, so please give that a try.
M06 shouldn't really work with Neo4j 2.0.0 final.
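In Maven terms that would be a version bump of the neo4j-rest-graphdb dependency from the question (a sketch; the coordinates are assumed to stay the same, only the version changes):
<dependency>
<groupId>org.neo4j</groupId>
<artifactId>neo4j-rest-graphdb</artifactId>
<version>2.0.0</version>
</dependency>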
