scala.MatchError in DataFrames - Java

I have a Spark (version 1.3.1) application in which I am trying to convert a Java bean RDD (JavaRDD<Message>) into a DataFrame. The bean has many fields with different data types (Integer, String, List, Map, Double).
But when I execute my code:
messages.foreachRDD(new Function2<JavaRDD<Message>, Time, Void>() {
    @Override
    public Void call(JavaRDD<Message> arg0, Time arg1) throws Exception {
        SQLContext sqlContext = SparkConnection.getSqlContext();
        DataFrame df = sqlContext.createDataFrame(arg0, Message.class);
        df.registerTempTable("messages");
        return null;
    }
});
I got this error:
15/06/12 17:27:40 INFO JobScheduler: Starting job streaming job 1434110260000 ms.0 from job set of time 1434110260000 ms
15/06/12 17:27:40 ERROR JobScheduler: Error running job streaming job 1434110260000 ms.1
scala.MatchError: interface java.util.List (of class java.lang.Class)
at org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1193)
at org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1192)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.sql.SQLContext.getSchema(SQLContext.scala:1192)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:437)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:465)

If Message has fields of complex types like List and the error message points to a List match error, then that is the issue. Also, if you look at the source code, you can see that List is not handled in the match.
But besides digging around in the source code, this is also very clearly stated in the documentation under the Java tab:
Currently, Spark SQL does not support JavaBeans that contain nested or contain complex types such as Lists or Arrays.
You may want to switch to Scala as it seems to be supported there:
Case classes can also be nested or contain complex types such as Sequences or Arrays. This RDD can be implicitly converted to a DataFrame and then be registered as a table.
So the solution is either to use Scala or remove the List from your JavaBean.
As a last resort you can take a look at SQLUserDefinedType to define how that List should be persisted, maybe it's possible to hack it together.
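If staying on Java, a minimal sketch of the "remove the List" route could look like this (FlatMessage, getId() and getTags() are illustrative assumptions rather than the asker's actual fields; it also assumes Java 8 for the lambda and String.join):
// Hypothetical bean that only uses JavaBean-friendly types (no List/Map fields).
public static class FlatMessage implements java.io.Serializable {
    private String id;
    private String tags;  // the former List<String> joined into a single String
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getTags() { return tags; }
    public void setTags(String tags) { this.tags = tags; }
}

// Inside the foreachRDD callback, flatten Message into FlatMessage before createDataFrame:
JavaRDD<FlatMessage> flat = arg0.map(m -> {
    FlatMessage f = new FlatMessage();
    f.setId(m.getId());                        // assumes Message#getId()
    f.setTags(String.join(",", m.getTags()));  // assumes Message#getTags() returns List<String>
    return f;
});
DataFrame df = sqlContext.createDataFrame(flat, FlatMessage.class);
df.registerTempTable("messages");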

I resolved this problem by updating my Spark version from 1.3.1 to 1.4.0. Now it works fine.

Related

Substitute ints into Dataflow via Cloudbuild yaml

I've got a streaming Dataflow pipeline, written in Java with Beam 2.35. It commits data to BigQuery via the StorageWriteApi. Initially the code looks like:
BigQueryIO.writeTableRows()
.withTimePartitioning(/* some column */)
.withClustering(/* another column */)
.withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
.withTriggeringFrequency(Duration.standardSeconds(30))
.withNumStorageWriteApiStreams(20) // want to make this dynamic
This code runs in different environments, e.g. Dev and Prod. When I deploy in Dev I want 2 StorageWriteApiStreams, in Prod I want 20, and I'm trying to pass/resolve these values at the moment I deploy with Cloud Build.
The cloudbuild-dev.yaml looks like:
steps:
  - lots-of-steps
    args:
      --numStorageWriteApiStreams=${_NUM_STORAGEWRITEAPI_STREAMS}
substitutions:
  _PROJECT: dev-project
  _NUM_STORAGEWRITEAPI_STREAMS: '2'
I expose the substitution in the job code with an interface:
ValueProvider<String> getNumStorageWriteApiStreams();
void setNumStorageWriteApiStreams(ValueProvider<String> numStorageWriteApiStreams);
I then refactor the writeTableRows() call to invoke getNumStorageWriteApiStreams():
BigQueryIO.writeTableRows()
.withTimePartitioning(/* some column */)
.withClustering(/* another column */)
.withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
.withTriggeringFrequency(Duration.standardSeconds(30))
.withNumStorageWriteApiStreams(Integer.parseInt(String.valueOf(options.getNumStorageWriteApiStreams())))
Now it's dynamic but I get a build failure on account of java.lang.IllegalArgumentException: methods with same signature getNumStorageWriteApiStreams() but incompatible return types: [class java.lang.Integer, interface org.apache.beam.sdk.options.ValueProvider]
My understanding was that Integer.parseInt returns an int, which I want so I can pass it to withNumStorageWriteApiStreams() which requires an int.
I'd appreciate any help I can get here, thanks.
Turns out BigQueryOptions.java already has a method getNumStorageWriteApiStreams() that returns an Integer. I was unknowingly trying to redeclare it with a different return type, oops.
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryOptions.java#L95-L98
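Since the option already exists on BigQueryOptions with an Integer return type, the pipeline can read it from there directly instead of redeclaring it. A rough sketch (only getNumStorageWriteApiStreams() and withNumStorageWriteApiStreams() come from the question; the surrounding class and wiring are assumptions):
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.joda.time.Duration;

public class WriteSketch {
    public static void main(String[] args) {
        // --numStorageWriteApiStreams=2 (Dev) or =20 (Prod) is parsed into the
        // built-in option, so no custom getter/setter pair needs to be declared.
        BigQueryOptions options = PipelineOptionsFactory.fromArgs(args)
                .withValidation()
                .as(BigQueryOptions.class);

        BigQueryIO.Write<TableRow> write = BigQueryIO.writeTableRows()
                .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
                .withTriggeringFrequency(Duration.standardSeconds(30))
                // Integer auto-unboxes to the int the builder expects
                .withNumStorageWriteApiStreams(options.getNumStorageWriteApiStreams());
    }
}
The Cloud Build substitution stays exactly as in the yaml above; only the redeclared interface method has to go.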

Kafka streams creating a simple materialized view

I have events coming in to Kafka with a bunch of non-unique String fields and an event timestamp. I want to create a materialized view of these events so that I can query them. For example:
Display all of the events
Display all of the events where field1 = some string
Display all of the events that match multiple fields
Display the events between 2 dates
All of the examples that I have seen have an aggregation, a join or some other transformative operation on the stream. I cannot find a single simple example of creating a view on a set of events. I don't want to perform any operations, I just want to be able to query the original events coming into the stream.
I am using Spring Kafka so an example with Spring Kafka would be ideal.
I am able to get messages into Kafka and to consume them. However, I have not been able to create a materialized view.
I have the following code which filters the events (not really what I wanted, I want all events, but I just wanted to see if I could get a materialized view):
@StreamListener
public void process(@Input("input") KTable<String, MyMessage> myMessages) {
    keyValueStore = interactiveQueryService.getQueryableStore(ALL_MESSAGES, QueryableStoreTypes.keyValueStore());
    myMessages.filter((key, value) -> (value.getKey() != null));
    Materialized.<String, MyMessage, KeyValueStore<Bytes, byte[]>>as(ALL_MESSAGES)
        .withKeySerde(Serdes.String())
        .withValueSerde(new MyMessageSerde());
}
This is throwing an exception:
java.lang.ClassCastException: [B cannot be cast to MyMessage
at org.apache.kafka.streams.kstream.internals.KTableFilter.computeValue(KTableFilter.java:57)
at org.apache.kafka.streams.kstream.internals.KTableFilter.access$300(KTableFilter.java:25)
at org.apache.kafka.streams.kstream.internals.KTableFilter$KTableFilterProcessor.process(KTableFilter.java:79)
at org.apache.kafka.streams.kstream.internals.KTableFilter$KTableFilterProcessor.process(KTableFilter.java:63)
at org.apache.kafka.streams.processor.internals.ProcessorNode$1.run(ProcessorNode.java:50)
at org.apache.kafka.streams.processor.internals.ProcessorNode.runAndMeasureLatency(ProcessorNode.java:244)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:133)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:143)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:126)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:90)
at org.apache.kafka.streams.kstream.internals.ForwardingCacheFlushListener.apply(ForwardingCacheFlushListener.java:42)
at org.apache.kafka.streams.state.internals.CachingKeyValueStore.putAndMaybeForward(CachingKeyValueStore.java:101)
at org.apache.kafka.streams.state.internals.CachingKeyValueStore.access$000(CachingKeyValueStore.java:38)
at org.apache.kafka.streams.state.internals.CachingKeyValueStore$1.apply(CachingKeyValueStore.java:83)
at org.apache.kafka.streams.state.internals.NamedCache.flush(NamedCache.java:141)
at org.apache.kafka.streams.state.internals.NamedCache.flush(NamedCache.java:99)
at org.apache.kafka.streams.state.internals.ThreadCache.flush(ThreadCache.java:125)
at org.apache.kafka.streams.state.internals.CachingKeyValueStore.flush(CachingKeyValueStore.java:123)
at org.apache.kafka.streams.state.internals.InnerMeteredKeyValueStore.flush(InnerMeteredKeyValueStore.java:284)
at org.apache.kafka.streams.state.internals.MeteredKeyValueBytesStore.flush(MeteredKeyValueBytesStore.java:149)
at org.apache.kafka.streams.processor.internals.ProcessorStateManager.flush(ProcessorStateManager.java:239)
... 21 more
I don't understand why, because I set the valueSerde of the store to MyMessageSerde, which knows how to serialize/deserialize MyMessage to and from a byte array.
Update
I changed the code to the following:
myMessages.filter((key,value) -> (value.getKey() != null));
and added the following to my application.yml:
spring.cloud.stream.kafka.streams.bindings.input:
  consumer:
    materializedAs: all-messages
    key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
    value-deserializer: MyMessageDeserializer
Now I get the following stack trace:
Exception in thread "raven-a43f181b-ccb6-4d9b-a8fd-9fe96542c210-StreamThread-1" org.apache.kafka.streams.errors.ProcessorStateException: task [0_3] Failed to flush state store all-messages
at org.apache.kafka.streams.processor.internals.ProcessorStateManager.flush(ProcessorStateManager.java:242)
at org.apache.kafka.streams.processor.internals.AbstractTask.flushState(AbstractTask.java:202)
at org.apache.kafka.streams.processor.internals.StreamTask.flushState(StreamTask.java:420)
at org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:394)
at org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:382)
at org.apache.kafka.streams.processor.internals.AssignedTasks$1.apply(AssignedTasks.java:67)
at org.apache.kafka.streams.processor.internals.AssignedTasks.applyToRunningTasks(AssignedTasks.java:362)
at org.apache.kafka.streams.processor.internals.AssignedTasks.commit(AssignedTasks.java:352)
at org.apache.kafka.streams.processor.internals.TaskManager.commitAll(TaskManager.java:401)
at org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:1042)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:845)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:767)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:736)
Caused by: java.lang.ClassCastException: [B cannot be cast to MyMessage
at org.apache.kafka.streams.kstream.internals.KTableFilter.computeValue(KTableFilter.java:57)
at org.apache.kafka.streams.kstream.internals.KTableFilter.access$300(KTableFilter.java:25)
at org.apache.kafka.streams.kstream.internals.KTableFilter$KTableFilterProcessor.process(KTableFilter.java:79)
at org.apache.kafka.streams.kstream.internals.KTableFilter$KTableFilterProcessor.process(KTableFilter.java:63)
at org.apache.kafka.streams.processor.internals.ProcessorNode$1.run(ProcessorNode.java:50)
at org.apache.kafka.streams.processor.internals.ProcessorNode.runAndMeasureLatency(ProcessorNode.java:244)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:133)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:143)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:126)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:90)
at org.apache.kafka.streams.kstream.internals.ForwardingCacheFlushListener.apply(ForwardingCacheFlushListener.java:42)
at org.apache.kafka.streams.state.internals.CachingKeyValueStore.putAndMaybeForward(CachingKeyValueStore.java:101)
at org.apache.kafka.streams.state.internals.CachingKeyValueStore.access$000(CachingKeyValueStore.java:38)
at org.apache.kafka.streams.state.internals.CachingKeyValueStore$1.apply(CachingKeyValueStore.java:83)
at org.apache.kafka.streams.state.internals.NamedCache.flush(NamedCache.java:141)
at org.apache.kafka.streams.state.internals.NamedCache.flush(NamedCache.java:99)
at org.apache.kafka.streams.state.internals.ThreadCache.flush(ThreadCache.java:125)
at org.apache.kafka.streams.state.internals.CachingKeyValueStore.flush(CachingKeyValueStore.java:123)
at org.apache.kafka.streams.state.internals.InnerMeteredKeyValueStore.flush(InnerMeteredKeyValueStore.java:284)
at org.apache.kafka.streams.state.internals.MeteredKeyValueBytesStore.flush(MeteredKeyValueBytesStore.java:149)
at org.apache.kafka.streams.processor.internals.ProcessorStateManager.flush(ProcessorStateManager.java:239)
... 12 more
The types of queries you want are not easily supported. Note that there are no secondary indexes; only regular key-based lookups and ranges are supported.
If you know all queries upfront, you might be able to re-group the data into derived KTables that have the query attribute as the key. Note that keys must be unique, and hence, if a query attribute contains non-unique data, you would need to use some Collection type as the value:
KTable originalTable = builder.table(...);
KTable keyedByFieldATable = originalTable.groupBy(/* select field A */).aggregate(/* the aggregation returns a list or similar of entries for the key */);
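Fleshing out that pseudocode in Java (topic name, field1, MyMessageSerde and the list serde are placeholders; this uses the recent Grouped/Consumed API, older Kafka Streams versions use Serialized instead of Grouped):
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class RekeyTopology {
    public static void buildRekeyedTable(StreamsBuilder builder) {
        KTable<String, MyMessage> originalTable =
                builder.table("messages", Consumed.with(Serdes.String(), new MyMessageSerde()));

        // Re-key by field1; all messages sharing that value are collected into one list.
        KTable<String, List<MyMessage>> byField1 = originalTable
                .groupBy((key, value) -> KeyValue.pair(value.getField1(), value),
                         Grouped.with(Serdes.String(), new MyMessageSerde()))
                .aggregate(
                        ArrayList::new,                                          // initializer
                        (field1, msg, agg) -> { agg.add(msg); return agg; },     // adder
                        (field1, msg, agg) -> { agg.remove(msg); return agg; },  // subtractor, relies on equals()
                        Materialized.<String, List<MyMessage>, KeyValueStore<Bytes, byte[]>>as("messages-by-field1")
                                .withKeySerde(Serdes.String())
                                .withValueSerde(new MyMessageListSerde()));      // hypothetical serde for List<MyMessage>
    }
}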
Note that you duplicate your storage requirement each time you re-key the original table.
As an alternative, you can do full table scans over the original table and evaluate your filter conditions when you use the returned iterator.
It's a space vs CPU tradeoff. Maybe Kafka Streams is not the right tool for your problem.
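Since the question asks for a Spring Kafka flavored example, here is a sketch of that full-scan alternative through Spring's InteractiveQueryService (the "all-messages" store name comes from the materializedAs configuration below, and getField1() is a placeholder accessor on MyMessage):
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;
import org.springframework.cloud.stream.binder.kafka.streams.InteractiveQueryService;

public class MessageQueryService {

    private final InteractiveQueryService interactiveQueryService;

    public MessageQueryService(InteractiveQueryService interactiveQueryService) {
        this.interactiveQueryService = interactiveQueryService;
    }

    // Full scan over the materialized store; filter conditions are evaluated client-side.
    public List<MyMessage> findByField1(String wanted) {
        ReadOnlyKeyValueStore<String, MyMessage> store =
                interactiveQueryService.getQueryableStore(
                        "all-messages", QueryableStoreTypes.<String, MyMessage>keyValueStore());

        List<MyMessage> matches = new ArrayList<>();
        try (KeyValueIterator<String, MyMessage> it = store.all()) {
            while (it.hasNext()) {
                KeyValue<String, MyMessage> entry = it.next();
                if (wanted.equals(entry.value.getField1())) {
                    matches.add(entry.value);
                }
            }
        }
        return matches;
    }
}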
I was able to create the materialized view as follows:
Configuration in application.yml
spring.cloud.stream.kafka.streams.bindings.input:
  consumer:
    materializedAs: all-messages
    keySerde: org.apache.kafka.common.serialization.Serdes$StringSerde
    valueSerde: com.me.MyMessageSerde
  producer:
    keySerde: org.apache.kafka.common.serialization.Serdes$StringSerde
    valueSerde: com.me.MyMessageSerde
This sets up the correct serializers and the materialized view.
The following code creates the KTable which materializes the view using the above configuration.
@StreamListener
public void process(@Input("input") KTable<String, MyMessage> myMessages) {
}

Weird error while parsing JSON in Apache Spark

Trying to parse a JSON document and Spark gives me an error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().;
at org.apache.spark.sql.execution.datasources.json.JsonFileFormat.buildReader(JsonFileFormat.scala:120)
...
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2545)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2759)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:255)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:292)
at org.apache.spark.sql.Dataset.show(Dataset.scala:746)
at org.apache.spark.sql.Dataset.show(Dataset.scala:705)
at xxx.MyClass.xxx(MyClass.java:25)
I already tried to open the JSON doc in several online editors and it's valid.
This is my code:
Dataset<Row> df = spark.read()
.format("json")
.load("file.json");
df.show(3); // this is line 25
I am using Java 8 and Spark 2.4.
The _corrupt_record column is where Spark stores malformed records when it tries to ingest them. That could be a hint.
Spark also processes two types of JSON documents: JSON Lines and normal JSON (in earlier versions Spark could only do JSON Lines). You can find more in this Manning article.
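For illustration (made-up records, not the asker's file), the two shapes look like this:
JSON Lines - one complete object per line, readable without extra options:
{"id": 1, "name": "a"}
{"id": 2, "name": "b"}
Standard JSON - a single document spanning several lines, which needs the multiline option:
[
  {"id": 1, "name": "a"},
  {"id": 2, "name": "b"}
]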
You can try the multiline option, as in:
Dataset<Row> df = spark.read()
.format("json")
.option("multiline", true)
.load("file.json");
to see if it helps. If not, share your JSON doc (if you can).
Set the multiline option to true. If it does not work, share your JSON.

Spark reduceByKey function seems not to work with a single key

I have 5 rows of records in MySQL, like:
sku:001 seller:A stock:UK margin:10
sku:002 seller:B stock:US margin:5
sku:001 seller:A stock:UK margin:10
sku:001 seller:A stock:UK margin:3
sku:001 seller:A stock:UK margin:7
And I've read these rows into Spark and transformed them into
JavaPairRDD<Tuple3<String,String,String>, Map>(<sku,seller,stock>, Map<margin,xxx>).
It seems to work fine up to this point.
However, when I use the reduceByKey function to sum the margins into a structure like:
JavaPairRDD<Tuple3<String,String,String>, Map>(<sku,seller,stock>, Map<marginSummary, xxx>).
the final result has 2 elements:
JavaPairRDD<Tuple3<String,String,String>, Map>(<sku,seller,stock>, Map<margin,xxx>).
JavaPairRDD<Tuple3<String,String,String>, Map>(<sku,seller,stock>, Map<marginSummary, xxx>).
It seems like row 2 didn't enter the reduceByKey function body. I was wondering why.
It is the expected outcome. func is called only when values for a single key are merged. If there is only one value for a key, there is no reason to call it.
Unfortunately it looks like you have a bigger problem, which can be inferred from your question. You are trying to change the type of the value in reduceByKey. In general it shouldn't even compile, as reduceByKey takes Function2<V,V,V> - the input and output types have to be identical.
If you want to change a type, you should use either combineByKey
public <C> JavaPairRDD<K,C> combineByKey(Function<V,C> createCombiner,
Function2<C,V,C> mergeValue,
Function2<C,C,C> mergeCombiners)
or aggregateByKey
public <U> JavaPairRDD<K,U> aggregateByKey(U zeroValue,
Function2<U,V,U> seqFunc,
Function2<U,U,U> combFunc)
Both can change the types and fix your current problem. Please refer to the Java test suite for examples: 1 and 2.
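A self-contained sketch of the aggregateByKey route, using toy data shaped like the question's pairs (the map key "margin" and the local-mode setup are assumptions):
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;
import scala.Tuple3;

public class MarginSum {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("margin-sum").setMaster("local[*]"));

        // Toy versions of the (<sku,seller,stock>, Map<margin,...>) pairs from the question
        Map<String, Double> m1 = new HashMap<>();
        m1.put("margin", 10.0);
        Map<String, Double> m2 = new HashMap<>();
        m2.put("margin", 5.0);

        JavaPairRDD<Tuple3<String, String, String>, Map<String, Double>> rows =
                sc.parallelizePairs(Arrays.asList(
                        new Tuple2<>(new Tuple3<>("001", "A", "UK"), m1),
                        new Tuple2<>(new Tuple3<>("002", "B", "US"), m2),
                        new Tuple2<>(new Tuple3<>("001", "A", "UK"), m1)));

        // aggregateByKey lets the output type (Double) differ from the input type (Map)
        JavaPairRDD<Tuple3<String, String, String>, Double> summed =
                rows.aggregateByKey(
                        0.0,                                       // zero value
                        (acc, value) -> acc + value.get("margin"), // fold one value into the accumulator
                        (a, b) -> a + b);                          // merge two accumulators

        summed.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));
        sc.stop();
    }
}
Keys with a single row (like sku 002 above) still show up in the result; they simply pass through the seqFunc once without the combiner ever being called.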

Solr sorting by custom function query

I'm running into some issues developing a custom function query using Solr 3.6.2.
My goal is to be able to implement a custom sorting technique.
I have a field called daily_prices_str; it is a single-valued str field.
Example:
<str name="daily_prices_str">
2014-05-01:130 2014-05-02:130 2014-05-03:130 2014-05-04:130 2014-05-05:130 2014-05-06:130 2014-05-07:130 2014-05-08:130 2014-05-09:130 2014-05-10:130 2014-05-11:130 2014-05-12:130 2014-05-13:130 2014-05-14:130 2014-05-15:130 2014-05-16:130 2014-05-17:130 2014-05-18:130 2014-05-19:130 2014-05-20:130 2014-05-21:130 2014-05-22:130 2014-05-23:130 2014-05-24:130 2014-05-25:130 2014-05-26:130 2014-05-27:130 2014-05-28:130 2014-05-29:130 2014-05-30:130 2014-05-31:130 2014-06-01:130 2014-06-02:130 2014-06-03:130 2014-06-04:130 2014-06-05:130 2014-06-06:130 2014-06-07:130 2014-06-08:130 2014-06-09:130 2014-06-10:130 2014-06-11:130 2014-06-12:130 2014-06-13:130 2014-06-14:130 2014-06-15:130 2014-06-16:130 2014-06-17:130 2014-06-18:130 2014-06-19:130 2014-06-20:130 2014-06-21:130 2014-06-22:130 2014-06-23:130 2014-06-24:130 2014-06-25:130 2014-06-26:130 2014-06-27:130 2014-06-28:130 2014-06-29:130 2014-06-30:130 2014-07-01:130 2014-07-02:130 2014-07-03:130 2014-07-04:130 2014-07-05:130 2014-07-06:130 2014-07-07:130 2014-07-08:130 2014-07-09:130 2014-07-10:130 2014-07-11:130 2014-07-12:130 2014-07-13:130 2014-07-14:130 2014-07-15:130 2014-07-16:130 2014-07-17:130 2014-07-18:130 2014-07-19:170 2014-07-20:170 2014-07-21:170 2014-07-22:170 2014-07-23:170 2014-07-24:170 2014-07-25:170 2014-07-26:170 2014-07-27:170 2014-07-28:170 2014-07-29:170 2014-07-30:170 2014-07-31:170 2014-08-01:170 2014-08-02:170 2014-08-03:170 2014-08-04:170 2014-08-05:170 2014-08-06:170 2014-08-07:170 2014-08-08:170 2014-08-09:170 2014-08-10:170 2014-08-11:170 2014-08-12:170 2014-08-13:170 2014-08-14:170 2014-08-15:170 2014-08-16:170 2014-08-17:170 2014-08-18:170 2014-08-19:170 2014-08-20:170 2014-08-21:170 2014-08-22:170 2014-08-23:170 2014-08-24:170 2014-08-25:170 2014-08-26:170 2014-08-27:170 2014-08-28:170 2014-08-29:170 2014-08-30:170
</str>
As you can see the structure of the string is date:price.
Basically, I would like to parse the string to get the price for a particular period and sort by that price.
I’ve already developed the java plugin for the custom function query and I’m at the point where my code compiles, runs, executes, etc. Solr is happy with my code.
Example:
price(daily_prices_str,2015-01-01,2015-01-03)
If I run this query I can see the correct price in the score field:
/select?price=price(daily_prices_str,2015-01-01,2015-01-03)&q={!func}$price
One of the problems is that I cannot sort by function result.
If I run this query:
/select?price=price(daily_prices_str,2015-01-01,2015-01-03)&q={!func}$price&sort=$price+asc
I get a 404 saying that "sort param could not be parsed as a query, and is not a field that exists in the index: $price"
But it works with a workaround:
/select?price=sum(0,price(daily_prices_str,2015-01-01,2015-01-03))&q={!func}$price&sort=$price+asc
The main problem is that I cannot filter by range:
/select?price=sum(0,price(daily_prices_str,2015-1-1,2015-1-3))&q={!frange l=100 u=400}$price
Maybe I'm going about this totally incorrectly?
Instead of passing the newly created "price" parameter to "sort", can you pass the function with its arguments directly, like so?
q=*:*&sort=price(daily_prices_str,2015-01-01,2015-01-03) ...
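For the range-filter part, the same inlining idea can be tried inside an frange filter query; a hedged example, assuming the custom price() function accepts the zero-padded dates used in the working queries above:
/select?q=*:*&fq={!frange l=100 u=400}price(daily_prices_str,2015-01-01,2015-01-03)&sort=price(daily_prices_str,2015-01-01,2015-01-03) asc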
