Spark Sql query fails - java

Using Sparks 2/java/Cassanda2.2
Trying to run a simple sparks sql query, it errors:
Tried as below, + variations like "'LAX'", and '=' instead of '=='.
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`LAX`' given input columns: [transdate, origin]; line 1 pos 42;
'Project ['origin]
+- 'Filter (origin#1 = 'LAX)
+- SubqueryAlias origins
+- LogicalRDD [transdate#0, origin#1]
JavaRDD<TransByDate> originDateRDD = javaFunctions(sc).cassandraTable("trans", "trans_by_date", CassandraJavaUtil.mapRowTo(TransByDate.class)).select(CassandraJavaUtil.column("origin"), CassandraJavaUtil.column("trans_date").as("transdate"));
long cnt1= originDateRDD.count();
System.out.println("sqlLike originDateRDD.count: "+cnt1); --> 406000
Dataset<Row> originDF = sparks.createDataFrame(originDateRDD, TransByDate.class);
originDF.createOrReplaceTempView("origins");
Dataset<Row> originlike = sparks.sql("SELECT origin FROM origins WHERE origin =="+ "LAX");
I have enabled Hive Support (if that helps)
Thanks

Put the column value inside single quote. Your query should look like below.
Dataset<Row> originlike = spark.sql("SELECT origin FROM origins WHERE origin == "+"'LAX'");
You can refer Querying Cassandra data using Spark SQL in Java for more detail.
Like query should be like below.
Dataset<Row> originlike = spark.sql("SELECT origin FROM origins WHERE origin like 'LA%'");

Hive is not the problem, here is the line that is your problem:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`LAX`' given input columns: [transdate, origin]; line 1 pos 42;
What this is saying is that among the column names, none are named LAX. The scala DSL asks for === when matching a value that is a key within a column, perahps something similar would be more ideal, something like a origins.filter($"origin === "LAX")

Related

Spark java filter isin method or something else?

I have around 2 billions of rows in my cassandra database which I filter with the isin method based on an experimentlist with 4827 Strings, as shown below. However, I noticed that after the distinct command I have only 4774 unique rows. Any ideas why 53 are missing? Does the isin method has a threshold/limitations? I have double and triple checked the experimentlist, it does have 4827 Strings, and also the other 53 strings do exist in the database as I can query them with cqlsh. Any help much appreciated!
Dataset<Row> df1 = sp.read().format("org.apache.spark.sql.cassandra")
.options(new HashMap<String, String>() {
{
put("keyspace", "mdb");
put("table", "experiment");
}
})
.load().select(col("experimentid")).filter(col("experimentid").isin(experimentlist.toArray()));
List<String> tmplist=df1.distinct().as(Encoders.STRING()).collectAsList();
System.out.println("tmplist "+tmplist.size());
Regarding the actual question about "missing data" - there could be problems when your cluster has missing writes, and repair isn't done regularly. Spark Cassandra Connector (SCC) reads data with consistency level LOCAL_ONE, and may hit nodes without all data. You can try to set consistency level to LOCAL_QUORUM (via --conf spark.cassandra.input.consistency.level=LOCAL_QUORUM), for example, and repeat the experiment, although it's better to make sure that data is repaired.
Another problem that you have is that you're using the .isin function - it's translating into a query SELECT ... FROM table WHERE partition_key IN (list). See the execution plan:
scala> import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.cassandra._
scala> val data = spark.read.cassandraFormat("m1", "test").load()
data: org.apache.spark.sql.DataFrame = [id: int, m: map<int,string>]
scala> data.filter($"id".isin(Seq(1,2,3,4):_*)).explain
== Physical Plan ==
*Scan org.apache.spark.sql.cassandra.CassandraSourceRelation [id#169,m#170] PushedFilters: [*In(id, [1,2,3,4])], ReadSchema: struct<id:int,m:map<int,string>>
This query is very inefficient, and put an additional load to the node that performs query. In the SCC 2.5.0, there are some optimizations around that, but it's better to use so-called "Direct Join" that was also introduced in the SCC 2.5.0, so SCC will perform requests to specific partition keys in parallel - that's more effective and put the less load to the nodes. You can use it as following (the only difference that I have it as "DSE Direct Join", while in OSS SCC it's printed as "Cassandra Direct Join"):
scala> val toJoin = Seq(1,2,3,4).toDF("id")
toJoin: org.apache.spark.sql.DataFrame = [id: int]
scala> val joined = toJoin.join(data, data("id") === toJoin("id"))
joined: org.apache.spark.sql.DataFrame = [id: int, id: int ... 1 more field]
scala> joined.explain
== Physical Plan ==
DSE Direct Join [id = id#189] test.m1 - Reading (id, m) Pushed {}
+- LocalTableScan [id#189]
This direct join optimization needs to be explicitly enabled as described in the documentation.

How to Convert DataSet<Row> to DataSet of JSON messages to write to Kafka?

I use Spark 2.1.1.
I have the following DataSet<Row> ds1;
name | ratio | count // column names
"hello" | 1.56 | 34
(ds1.isStreaming gives true)
and I am trying to generate DataSet<String> ds2. other words when I write to a kafka sink I want to write something like this
{"name": "hello", "ratio": 1.56, "count": 34}
I have tried something like this df2.toJSON().writeStream().foreach(new KafkaSink()).start() but then it gives the following error
Queries with streaming sources must be executed with writeStream.start()
There are to_json and json_tuple however I am not sure how to leverage them here ?
I tried the following using json_tuple() function
Dataset<String> df4 = df3.select(json_tuple(new Column("result"), " name", "ratio", "count")).as(Encoders.STRING());
and I get the following error:
cannot resolve 'result' given input columns: [name, ratio, count];;
tl;dr Use struct function followed by to_json (as toJSON was broken for streaming datasets due to SPARK-17029 that got fixed just 20 days ago).
Quoting the scaladoc of struct:
struct(colName: String, colNames: String*): Column Creates a new struct column that composes multiple input columns.
Given you use Java API you have 4 different variants of struct function, too:
public static Column struct(Column... cols) Creates a new struct column.
With to_json function your case is covered:
public static Column to_json(Column e) Converts a column containing a StructType into a JSON string with the specified schema.
The following is a Scala code (translating it to Java is your home exercise):
val ds1 = Seq(("hello", 1.56, 34)).toDF("name", "ratio", "count")
val recordCol = to_json(struct("name", "ratio", "count")) as "record"
scala> ds1.select(recordCol).show(truncate = false)
+----------------------------------------+
|record |
+----------------------------------------+
|{"name":"hello","ratio":1.56,"count":34}|
+----------------------------------------+
I've also given your solution a try (with Spark 2.3.0-SNAPSHOT built today) and it seems it works perfectly.
val fromKafka = spark.
readStream.
format("kafka").
option("subscribe", "topic1").
option("kafka.bootstrap.servers", "localhost:9092").
load.
select('value cast "string")
fromKafka.
toJSON. // <-- JSON conversion
writeStream.
format("console"). // using console sink
start
format("kafka") was added in SPARK-19719 and is not available in 2.1.0.

Data type mismatch while transforming data in spark dataset

I created a parquet-structure from a csv file using spark:
Dataset<Row> df = park.read().format("com.databricks.spark.csv").option("inferSchema", "true")
.option("header", "true").load("sample.csv");
df.write().parquet("sample.parquet");
I'm reading the parquet-structure and I'm trying to transform the data in a dataset:
Dataset<org.apache.spark.sql.Row> df = spark.read().parquet("sample.parquet");
df.createOrReplaceTempView("tmpview");
Dataset<Row> namesDF = spark.sql("SELECT *, md5(station_id) as hashkey FROM tmpview");
Unfortunately I get a data type mismatch error. Do I have to explicitly assign data types?
17/04/12 09:21:52 INFO SparkSqlParser: Parsing command: SELECT *,
md5(station_id) as hashkey FROM tmpview Exception in thread "main"
org.apache.spark.sql.AnalysisException: cannot resolve
'md5(tmpview.station_id)' due to data type mismatch: argument 1
requires binary type, however, 'tmpview.station_id' is of int type.;
line 1 pos 10; 'Project [station_id#0, bikes_available#1,
docks_available#2, time#3, md5(station_id#0) AS hashkey#16]
+- SubqueryAlias tmpview, tmpview +- Relation[station_id#0,bikes_available#1,docks_available#2,time#3]
parquet
Yes, as per Spark documentation, md5 function works only on binary (text/string) columns so you need to cast station_id into string before applying md5. In Spark SQL, you can chain both md5 and cast together, e.g.:
Dataset<Row> namesDF = spark.sql("SELECT *, md5(cast(station_id as string)) as hashkey FROM tmpview");
Or you can create a new column in dataframe and apply md5 on it, e.g.:
val newDf = df.withColumn("station_id_str", df.col("station_id").cast(StringType))
newDf.createOrReplaceTempView("tmpview");
Dataset<Row> namesDF = spark.sql("SELECT *, md5(station_id_str) as hashkey FROM tmpview");
In case if you are experiencing this error on Databricks (Azure Databricks Notebooks)
which run on 3.x clusters following snippet might work for you.
md5(cast(concat_ws('some string' + 'some string') + """)as BINARY))

How to remove duplicate columns after a JOIN in Pig?

Let's say I JOIN two relations like:
-- part looks like:
-- 1,5.3
-- 2,4.9
-- 3,4.9
-- original looks like:
-- 1,Anju,3.6,IT,A,1.6,0.3
-- 2,Remya,3.3,EEE,B,1.6,0.3
-- 3,akhila,3.3,IT,C,1.3,0.3
jnd = JOIN part BY $0, original BY $0;
The output will be:
1,5.3,1,Anju,3.6,IT,A,1.6,0.3
2,4.9,2,Remya,3.3,EEE,B,1.6,0.3
3,4.9,3,akhila,3.3,IT,C,1.3,0.3
Notice that $0 is shown twice in each tuple. EG:
1,5.3,1,Anju,3.6,IT,A,1.6,0.3
^ ^
|-----|
I can remove the duplicate key manually by doing:
jnd = foreach jnd generate $0,$1,$3,$4 ..;
Is there a way to remove this dynamically? Like remove(the duplicate key joiner).
Have faced the same kind of issue while working on Data Set Joining and other data processing techniques where in output the column names get repeated.
So was working on UDF which will remove the duplicates column by using schema name of that field and retaining the first unique column occurrence data.
Pre-Requisite:
Name of all the fields should be present
You need to download this UDF file and make it jar so as to use it.
UDF file location from GitHub :
GitHub UDF Java File Location
We will take the above question as example.
--Data Set A contains this data
-- 1,5.3
-- 2,4.9
-- 3,4.9
--Data Set B contains this data
-- 1,Anju,3.6,IT,A,1.6,0.3
-- 2,Remya,3.3,EEE,B,1.6,0.3
-- 3,Akhila,3.3,IT,C,1.3,0.3
PIG Script:
REGISTER /home/user/
DSA = LOAD '/home/user/DSALOC' AS (ROLLNO:int,CGPA:float);
DSB = LOAD '/home/user/DSBLOC' AS (ROLLNO:int,NAME:chararray,SUB1:float,BRANCH:chararray,GRADE:chararray,SUB2:float);
JOINOP = JOIN DSA BY ROLLNO,DSB BY ROLLNO;
We will get column name after joining as
DSA::ROLLNO:int,DSA::CGPA:float,DSB::ROLLNO:int,DSB::NAME:chararray,DSB::SUB1:float,DSB::BRANCH:chararray,DSB::GRADE:chararray,DSB::SUB2:float
For making it to
DSA::ROLLNO:int,DSA::CGPA:float,DSB::NAME:chararray,DSB::SUB1:float,DSB::BRANCH:chararray,DSB::GRADE:chararray,DSB::SUB2:float
DSB::ROLLNO:int is removed.
We need to use the UDF as
JOINOP_NODUPLICATES = FOREACH JOINOP GENERATE FLATTEN(org.imagine.REMOVEDUPLICATECOLUMNS(*));
Where org.imagine.REMOVEDUPLICATECOLUMNS is the UDF.
This UDF removes duplicate columns by using Name in schema.So DSA::ROLLNO:int is retained and DSB::ROLLNO:int is removed from the dataset.

How to write selected columns to Kafka topic?

I am using spark-sql-2.4.1v with java 1.8.
and kafka versions spark-sql-kafka-0-10_2.11_2.4.3 and kafka-clients_0.10.0.0
StreamingQuery queryComapanyRecords =
comapanyRecords
.writeStream()
.format("kafka")
.option("kafka.bootstrap.servers",KAFKA_BROKER)
.option("topic", "in_topic")
.option("auto.create.topics.enable", "false")
.option("key.serializer","org.apache.kafka.common.serialization.StringDeserializer")
.option("value.serializer", "com.spgmi.ca.prescore.serde.MessageRecordSerDe")
.option("checkpointLocation", "/app/chkpnt/" )
.outputMode("append")
.start();
queryLinkingMessageRecords.awaitTermination();
Giving error :
Caused by: org.apache.spark.sql.AnalysisException: Required attribute 'value' not found;
at org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$6.apply(KafkaWriter.scala:71)
at org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$6.apply(KafkaWriter.scala:71)
at scala.Option.getOrElse(Option.scala:121)
I tried to fix as below, but unable to send the value i.e. which is a java bean in my case.
StreamingQuery queryComapanyRecords =
comapanyRecords.selectExpr("CAST(company_id AS STRING) AS key", "to_json(struct(\"company_id\",\"fiscal_year\",\"fiscal_quarter\")) AS value")
.writeStream()
.format("kafka")
.option("kafka.bootstrap.servers",KAFKA_BROKER)
.option("topic", "in_topic")
.start();
So is there anyway in java how to handle/send this value( i.e. Java
bean as record) ??.
Kafka data source requires a specific schema for reading (loading) and writing (saving) datasets.
Quoting the official documentation (highlighting the most important field / column):
Each row in the source has the following schema:
...
value binary
...
In other words, you have Kafka records in the value column when reading from a Kafka topic and you have to make your data to save to a Kafka topic available in the value column as well.
In other words, whatever is or is going to be in Kafka is in the value column. The value column is where you "store" business records (the data).
On to your question:
How to write selected columns to Kafka topic?
You should "pack" the selected columns together so they can all together be part of the value column. to_json standard function is a good fit so the selected columns are going to be a JSON message.
Example
Let me give you an example.
Don't forget to start a Spark application or spark-shell with the Kafka data source. Mind the versions of Scala (2.11 or 2.12) and Spark (e.g. 2.4.4).
spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4
Let's start by creating a sample dataset. Any multiple-field dataset would work.
val ns = Seq((0, "zero")).toDF("id", "name")
scala> ns.show
+---+----+
| id|name|
+---+----+
| 0|zero|
+---+----+
If we tried to write the dataset to a Kafka topic, it would error out due to value column missing. That's what you faced initially.
scala> ns.write.format("kafka").option("topic", "in_topic").save
org.apache.spark.sql.AnalysisException: Required attribute 'value' not found;
at org.apache.spark.sql.kafka010.KafkaWriter$.$anonfun$validateQuery$6(KafkaWriter.scala:71)
at scala.Option.getOrElse(Option.scala:138)
...
You have to come up with a way to "pack" multiple fields (columns) together and make it available as value column. struct and to_json standard functions will do it.
val vs = ns.withColumn("value", to_json(struct("id", "name")))
scala> vs.show(truncate = false)
+---+----+----------------------+
|id |name|value |
+---+----+----------------------+
|0 |zero|{"id":0,"name":"zero"}|
+---+----+----------------------+
Saving to a Kafka topic should now be a breeze.
vs.write.format("kafka").option("topic", "in_topic").save

Categories