Substitute ints into Dataflow via Cloudbuild yaml - java

I've got a streaming Dataflow pipeline, written in Java with BEAM 2.35. It commits data to BigQuery via StorageWriteApi. Initially the code looks like
BigQueryIO.writeTableRows()
.withTimePartitioning(/* some column */)
.withClustering(/* another column */)
.withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
.withTriggeringFrequency(Duration.standardSeconds(30))
.withNumStorageWriteApiStreams(20) // want to make this dynamic
This code runs in different environments, e.g. Dev and Prod. When I deploy to Dev I want 2 StorageWriteApiStreams, in Prod I want 20, and I'm trying to pass/resolve these values at deploy time with Cloud Build.
The cloudbuild-dev.yaml looks like
steps:
- lots-of-steps
  args:
  - --numStorageWriteApiStreams=${_NUM_STORAGEWRITEAPI_STREAMS}
substitutions:
  _PROJECT: dev-project
  _NUM_STORAGEWRITEAPI_STREAMS: '2'
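(As an aside, the _NUM_STORAGEWRITEAPI_STREAMS substitution can also be overridden when the build is submitted manually; a sketch, assuming the file above is passed via --config:)
gcloud builds submit --config=cloudbuild-dev.yaml --substitutions=_NUM_STORAGEWRITEAPI_STREAMS=2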
I expose the substitution in the job code with an interface
ValueProvider<String> getNumStorageWriteApiStreams();
void setNumStorageWriteApiStreams(ValueProvider<String> numStorageWriteApiStreams);
I then refactor the writeTableRows() call to invoke getNumStorageWriteApiStreams()
BigQueryIO.writeTableRows()
.withTimePartitioning(/* some column */)
.withClustering(/* another column */)
.withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
.withTriggeringFrequency(Duration.standardSeconds(30))
.withNumStorageWriteApiStreams(Integer.parseInt(String.valueOf(options.getNumStorageWriteApiStreams())))
Now it's dynamic but I get a build failure on account of java.lang.IllegalArgumentException: methods with same signature getNumStorageWriteApiStreams() but incompatible return types: [class java.lang.Integer, interface org.apache.beam.sdk.options.ValueProvider]
My understanding was that Integer.parseInt returns an int, which I want so I can pass it to withNumStorageWriteApiStreams() which requires an int.
I'd appreciate any help I can get here, thanks.

Turns out BigQueryOptions.java already has a method getNumStorageWriteApiStreams() that returns an Integer. I was unknowingly trying to redeclare it with a different return type, oops.
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryOptions.java#L95-L98
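So the fix is simply to drop the custom ValueProvider<String> option and let Beam parse the flag itself; a minimal sketch (partitioning/clustering omitted, and args here is assumed to be the pipeline arguments Cloud Build passes in):
BigQueryOptions options = PipelineOptionsFactory.fromArgs(args)
    .withValidation()
    .as(BigQueryOptions.class);

BigQueryIO.writeTableRows()
    .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
    .withTriggeringFrequency(Duration.standardSeconds(30))
    // already an Integer, parsed from --numStorageWriteApiStreams=2|20 by Beam
    .withNumStorageWriteApiStreams(options.getNumStorageWriteApiStreams());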

Related

Spark reduceByKey function seems not working with single one key

I have 5 rows of records in MySQL, like
sku:001 seller:A stock:UK margin:10
sku:002 seller:B stock:US margin:5
sku:001 seller:A stock:UK margin:10
sku:001 seller:A stock:UK margin:3
sku:001 seller:A stock:UK margin:7
And I've read these rows into Spark and transformed them into
JavaPairRDD<Tuple3<String,String,String>, Map>(<sku,seller,stock>, Map<margin,xxx>).
Seems like it works fine until now.
However, when I use the reduceByKey function to sum the margins into a structure like:
JavaPairRDD<Tuple3<String,String,String>, Map>(<sku,seller,stock>, Map<marginSummary, xxx>).
the final result has 2 elements:
JavaPairRDD<Tuple3<String,String,String>, Map>(<sku,seller,stock>, Map<margin,xxx>).
JavaPairRDD<Tuple3<String,String,String>, Map>(<sku,seller,stock>, Map<marginSummary, xxx>).
It seems like row 2 didn't enter the reduceByKey function body. I was wondering why?
It is the expected outcome. func is called only when the objects for a single key are merged. If there is only one value for a key, there is no reason to call it.
Unfortunately it looks like you have a bigger problem, which can be inferred from your question. You are trying to change the type of the value in reduceByKey. In general it shouldn't even compile, as reduceByKey takes a Function2<V,V,V> - the input and output types have to be identical.
If you want to change a type, you should use either combineByKey
public <C> JavaPairRDD<K,C> combineByKey(Function<V,C> createCombiner,
Function2<C,V,C> mergeValue,
Function2<C,C,C> mergeCombiners)
or aggregateByKey
public <U> JavaPairRDD<K,U> aggregateByKey(U zeroValue,
Function2<U,V,U> seqFunc,
Function2<U,U,U> combFunc)
Both can change the types and fix your current problem. Please refer to the Java test suite for examples: 1 and 2.
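For example, a hedged Java sketch of the aggregateByKey route (the starting JavaPairRDD and the "margin" map key come from the question, everything else is assumed): the per-key Map values are folded into an Integer sum, which reduceByKey cannot do because its input and output types must match.
// rows: JavaPairRDD<Tuple3<String,String,String>, Map<String,Integer>> built as in the question
JavaPairRDD<Tuple3<String, String, String>, Integer> summed =
    rows.aggregateByKey(
        0,                                                      // zeroValue (U = Integer)
        (acc, value) -> acc + value.getOrDefault("margin", 0),  // seqFunc: (U, V) -> U
        (a, b) -> a + b);                                       // combFunc: (U, U) -> U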

MongoDB: Query has implicit limit(256)?

I've created (in code) a default collection in MongoDB and am querying it, and have discovered that while the code will return all the data when I run it locally, it won't when I query it on a deployment server. It returns a maximum of 256 records.
Notes:
This is not a capped collection.
Locally, I'm running 3.2.5, the remote MongoDB version is 2.4.12
I am not using the limit parameter. When I use it, I can limit both the local and deployment server, but the deployment server will still never return more than 256 records.
The amount of data being fetched from the server is <500K. Nothing huge.
The code is in Clojure, using Monger, which itself just calls the Java com.mongodb stuff.
I can pull in more than 256 records from the remote server using Robomongo though I'm not sure how it does this, as I cannot connect to the remote from the command line (auth failed using the same credentials, so I'm guessing version incompatibility there).
Any help is appreciated.
UPDATE: Found the thing that triggers the problem: when I sort the output, it drops to 256 records, but only when I pull from Mongo 2.4! I don't know if this is MongoDB itself, the MongoDB Java driver, or Monger, but here is the code that illustrates the issue, as simple as I could make it:
(ns mdbtest.core
  (:require [monger.core :as mg]
            [monger.query :as mq]))

(defn get-list []
  (let [coll (mq/with-collection
               (mg/get-db
                 (mg/connect {:host "old-mongo"}) "mydb") "saves"
               (mq/sort (array-map :createdDate -1)))] ;;<<==remove sort
    coll))
You need to specify a bigger batch size; the default is 256 records.
Here's an example from my own code:
=> (count (with-db (q/find {:keywords "lisa"})
                   (q/sort {:datetime 1})))
256
=> (count (with-db (q/find {:keywords "lisa"})
                   (q/sort {:datetime 1})
                   (q/batch-size 1000)))
688
See more info here: http://clojuremongodb.info/articles/querying.html#setting_batch_size
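Since Monger ultimately calls the com.mongodb Java driver, the same knob exists there directly; a rough sketch with the legacy driver API (host, db and collection names taken from the question, auth omitted):
MongoClient client = new MongoClient("old-mongo");
DBCollection saves = client.getDB("mydb").getCollection("saves");

// Sorted query as in the question, but with an explicit batch size
// (the same setting the q/batch-size call above controls in Monger).
DBCursor cursor = saves.find()
        .sort(new BasicDBObject("createdDate", -1))
        .batchSize(1000);
while (cursor.hasNext()) {
    DBObject doc = cursor.next();
    // process doc ...
}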

scala.MatchError: in Dataframes

I have one Spark (version 1.3.1) application, in which I am trying to convert a Java bean RDD, JavaRDD<Message>, into a DataFrame; Message has many fields with different data types (Integer, String, List, Map, Double).
But when I execute my code,
messages.foreachRDD(new Function2<JavaRDD<Message>, Time, Void>() {
    @Override
    public Void call(JavaRDD<Message> arg0, Time arg1) throws Exception {
        SQLContext sqlContext = SparkConnection.getSqlContext();
        DataFrame df = sqlContext.createDataFrame(arg0, Message.class);
        df.registerTempTable("messages");
        return null;
    }
});
I got this error
15/06/12 17:27:40 INFO JobScheduler: Starting job streaming job 1434110260000 ms.0 from job set of time 1434110260000 ms
15/06/12 17:27:40 ERROR JobScheduler: Error running job streaming job 1434110260000 ms.1
scala.MatchError: interface java.util.List (of class java.lang.Class)
at org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1193)
at org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1192)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.sql.SQLContext.getSchema(SQLContext.scala:1192)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:437)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:465)
If Message has fields like List and the error message points to a List match error, then that is the issue. Also, if you look at the source code you can see that List is not in the match.
But besides digging around in the source code, this is also very clearly stated in the documentation under the Java tab:
Currently, Spark SQL does not support JavaBeans that contain nested or contain complex types such as Lists or Arrays.
You may want to switch to Scala as it seems to be supported there:
Case classes can also be nested or contain complex types such as Sequences or Arrays. This RDD can be implicitly converted to a DataFrame and then be registered as a table.
So the solution is either to use Scala or remove the List from your JavaBean.
As a last resort you can take a look at SQLUserDefinedType to define how that List should be persisted, maybe it's possible to hack it together.
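If removing the List is acceptable, a hedged sketch of that route (the field names on Message are assumed) is to map each Message to a flat bean whose fields are all simple types before calling createDataFrame:
// Flat bean with only simple-typed fields, so Spark 1.3 can derive a schema for it
public class FlatMessage implements java.io.Serializable {
    private String id;
    private String tagsCsv; // the original List<String> joined into a single String
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getTagsCsv() { return tagsCsv; }
    public void setTagsCsv(String tagsCsv) { this.tagsCsv = tagsCsv; }
}

JavaRDD<FlatMessage> flat = arg0.map(m -> {
    FlatMessage f = new FlatMessage();
    f.setId(m.getId());                          // assumed getter on Message
    f.setTagsCsv(String.join(",", m.getTags())); // assumed List<String> getter
    return f;
});
DataFrame df = sqlContext.createDataFrame(flat, FlatMessage.class);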
I resolved this problem by updating my Spark version from 1.3.1 to 1.4.0. Now it works fine.

Solr sorting by custom function query

I'm running into some issues developing a custom function query using Solr 3.6.2.
My goal is to be able to implement a custom sorting technique.
I have a field called daily_prices_str, it is a single value str.
Example:
<str name="daily_prices_str">
2014-05-01:130 2014-05-02:130 2014-05-03:130 2014-05-04:130 2014-05-05:130 2014-05-06:130 2014-05-07:130 2014-05-08:130 2014-05-09:130 2014-05-10:130 2014-05-11:130 2014-05-12:130 2014-05-13:130 2014-05-14:130 2014-05-15:130 2014-05-16:130 2014-05-17:130 2014-05-18:130 2014-05-19:130 2014-05-20:130 2014-05-21:130 2014-05-22:130 2014-05-23:130 2014-05-24:130 2014-05-25:130 2014-05-26:130 2014-05-27:130 2014-05-28:130 2014-05-29:130 2014-05-30:130 2014-05-31:130 2014-06-01:130 2014-06-02:130 2014-06-03:130 2014-06-04:130 2014-06-05:130 2014-06-06:130 2014-06-07:130 2014-06-08:130 2014-06-09:130 2014-06-10:130 2014-06-11:130 2014-06-12:130 2014-06-13:130 2014-06-14:130 2014-06-15:130 2014-06-16:130 2014-06-17:130 2014-06-18:130 2014-06-19:130 2014-06-20:130 2014-06-21:130 2014-06-22:130 2014-06-23:130 2014-06-24:130 2014-06-25:130 2014-06-26:130 2014-06-27:130 2014-06-28:130 2014-06-29:130 2014-06-30:130 2014-07-01:130 2014-07-02:130 2014-07-03:130 2014-07-04:130 2014-07-05:130 2014-07-06:130 2014-07-07:130 2014-07-08:130 2014-07-09:130 2014-07-10:130 2014-07-11:130 2014-07-12:130 2014-07-13:130 2014-07-14:130 2014-07-15:130 2014-07-16:130 2014-07-17:130 2014-07-18:130 2014-07-19:170 2014-07-20:170 2014-07-21:170 2014-07-22:170 2014-07-23:170 2014-07-24:170 2014-07-25:170 2014-07-26:170 2014-07-27:170 2014-07-28:170 2014-07-29:170 2014-07-30:170 2014-07-31:170 2014-08-01:170 2014-08-02:170 2014-08-03:170 2014-08-04:170 2014-08-05:170 2014-08-06:170 2014-08-07:170 2014-08-08:170 2014-08-09:170 2014-08-10:170 2014-08-11:170 2014-08-12:170 2014-08-13:170 2014-08-14:170 2014-08-15:170 2014-08-16:170 2014-08-17:170 2014-08-18:170 2014-08-19:170 2014-08-20:170 2014-08-21:170 2014-08-22:170 2014-08-23:170 2014-08-24:170 2014-08-25:170 2014-08-26:170 2014-08-27:170 2014-08-28:170 2014-08-29:170 2014-08-30:170
</str>
As you can see the structure of the string is date:price.
Basically, I would like to parse the string to get the price for a particular period and sort by that price.
I’ve already developed the java plugin for the custom function query and I’m at the point where my code compiles, runs, executes, etc. Solr is happy with my code.
Example:
price(daily_prices_str,2015-01-01,2015-01-03)
If I run this query I can see the correct price in the score field:
/select?price=price(daily_prices_str,2015-01-01,2015-01-03)&q={!func}$price
One of the problems is that I cannot sort by function result.
If I run this query:
/select?price=price(daily_prices_str,2015-01-01,2015-01-03)&q={!func}$price&sort=$price+asc
I get a 404 saying that "sort param could not be parsed as a query, and is not a field that exists in the index: $price"
But it works with a workaround:
/select?price=sum(0,price(daily_prices_str,2015-01-01,2015-01-03))&q={!func}$price&sort=$price+asc
The main problem is that I cannot filter by range:
/select?price=sum(0,price(daily_prices_str,2015-1-1,2015-1-3))&q={!frange l=100 u=400}$price
Maybe I'm going about this totally incorrectly?
Instead of passing the newly created "price" parameter to "sort", can you pass the function with the data itself, like so?
q=*:*&sort=price(daily_prices_str,2015-01-01,2015-01-03) ...
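For the range-filter part, one more thing worth trying (an untested sketch reusing the same function) is to move the function query into an fq with frange and sort on the function directly:
/select?q=*:*&fq={!frange l=100 u=400}price(daily_prices_str,2015-01-01,2015-01-03)&sort=price(daily_prices_str,2015-01-01,2015-01-03)+asc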

Drools Expert output object in Scala

I'm a novice in both Scala and Drools Expert, and need some help getting information out of a Drools session. I've successfully set up some Scala classes that get manipulated by Drools rules. Now I want to create an object to store a set of output facts for processing outside of Drools. Here's what I've got.
I've got a simple object that stores a numeric result (generated in the RHS of a rule), along with a comment string:
import scala.collection.mutable.MutableList

class TestResults {
  val results = new MutableList[(Float, String)]()

  def add(cost: Float, comment: String) {
    results += Tuple2(cost, comment)
  }
}
In the DRL file, I have the following:
import my.domain.app.TestResults

global TestResults results

rule "always"
    dialect "mvel"
    when
        //
    then
        System.out.println("75 (fixed)")
        results.add(75, "fixed")
end
When I run the code that includes this, I get the following error:
org.drools.runtime.rule.ConsequenceException: rule: always
at org.drools.runtime.rule.impl.DefaultConsequenceExceptionHandler.handleException(DefaultConsequenceExceptionHandler.java:39)
...
Caused by: [Error: null pointer or function not found: add]
[Near : {... results.add(75, "fixed"); ....}]
^
[Line: 2, Column: 9]
at org.mvel2.optimizers.impl.refl.ReflectiveAccessorOptimizer.getMethod(ReflectiveAccessorOptimizer.java:997)
This looks to me like there's something goofy with my definition of the TestResults object in Scala, such that the Java that Drools compiles down to can't quite see it. Type mismatch, perhaps? I can't figure it out. Any suggestions? Thank you!
You need to initialize your results global variable before executing your session. You can initialize it using:
knowledgeSession.setGlobal("results", new TestResults())
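Roughly, the ordering looks like this (a sketch against the Drools 5 API; the knowledge base and the fact being inserted are assumed):
StatefulKnowledgeSession ksession = kbase.newStatefulKnowledgeSession();
TestResults results = new TestResults();
ksession.setGlobal("results", results);  // must be set before any rule fires
ksession.insert(someFact);               // assumed fact object
ksession.fireAllRules();
// results now holds the (cost, comment) tuples added by the rules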
Try
import my.domain.app.TestResults

global TestResults results

rule "always"
    dialect "mvel"
    when
        //
    then
        System.out.println("75 (fixed)")
        results().add(75.0f, "fixed")
end
My guess is that the types don't line up and the error message is poor. (75 is an Int, wants a Float)
That's right. Also, try adding a condition to your rule (the when part) so it makes more sense.
Condition evaluation is the most important feature of rule engines; writing rules without conditions doesn't make much sense.
Cheers
