I am trying to understand how the new functional model of Spring Cloud Stream works and how the configuration actually works under the hood.
One of the properties I am unable to figure out is spring.cloud.stream.source.
What does this property actually signify?
I could not understand the documentation:
Note that preceding example does not have any source functions defined
(e.g., Supplier bean) leaving the framework with no trigger to create
source bindings, which would be typical for cases where configuration
contains function beans. So to trigger the creation of source binding
we use spring.cloud.stream.source property where you can declare the
name of your sources. The provided name will be used as a trigger to
create a source binding.
What if I did not need a Supplier?
What exactly is a source binding and why is it important?
What if I only wanted to produce to a messaging topic? Would I still need this property?
I also could not understand how it is used in the sample here.
Spring Cloud Stream looks for java.util.function.Function<?, ?>, Consumer<?>, and Supplier<?> beans and creates bindings for them.
In the supplier case, the framework polls the supplier (each second by default) and sends the resulting data.
For example
@Bean
public Supplier<String> output() {
    return () -> "foo";
}
spring.cloud.stream.bindings.output-out-0.destination=bar
will send foo to destination bar each second.
But what if you don't need a polled source and instead want to configure a binding to which you can send arbitrary data? Enter spring.cloud.stream.source.
spring.cloud.stream.source=output
spring.cloud.stream.bindings.output-out-0.destination=bar
allows you to send arbitrary data via the StreamBridge:
bridge.send("output-out-0", "test");
In other words, it allows you to configure one or more output bindings that you can use with the StreamBridge; otherwise, when you send to the bridge, the binding is created dynamically.
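For completeness, here is a minimal sketch of sending through the StreamBridge on demand, e.g. from a REST endpoint (the controller and endpoint names are mine, not part of the original answer; the binding name and destination come from the two properties above):

import org.springframework.cloud.stream.function.StreamBridge;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class SendController {

    private final StreamBridge bridge;

    public SendController(StreamBridge bridge) {
        this.bridge = bridge;
    }

    @PostMapping("/send")
    public void send(@RequestBody String payload) {
        // goes to the destination configured for output-out-0 (e.g. "bar")
        bridge.send("output-out-0", payload);
    }
}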
I have multiple databases, all containing the same table Data. I want to read from them, feed all Data elements into the MyBean method @Handler public Data updateData(Data data), and write the output of the method back.
from("jpa://Data?persistenceUnit=persUnit1").to("direct:collector");
from("jpa://Data?persistenceUnit=persUnit2").to("direct:collector");
from("jpa://Data?persistenceUnit=persUnit3").to("direct:collector");
...
from("direct:collector").bean(new MyBean()).to("jpa://Data?persistenceUnit=destinationUnit");
However, within the bean I need to know which source the Data element came from (e.g. the name of the persistence unit) for validation. What's the best way to do this?
You could set a header
from("jpa://Data?persistenceUnit=persUnit1")
.setHeader("dataSource", constant("dataSource1"))
.to("direct:collector");
from("jpa://Data?persistenceUnit=persUnit2")
.setHeader("dataSource", constant("dataSource2"))
.to("direct:collector");
The exchange object provides the necessary information about the "from" endpoint of the current exchange:
https://www.javadoc.io/doc/org.apache.camel/camel-api/latest/org/apache/camel/Exchange.html#getFromRouteId()
It's quite common to put an id on routes (and also on endpoints):
from("jpa://Data?persistenceUnit=persUnit1")
.routeId("ComingFromRouteA")
.to("direct:collector");
This way, you can tell where the exchange came from, using:
exchange.getFromRouteId()
I'm trying to write to BigTable through generic Dataflow code. By generic I mean it must be able to write to any BigTable table provided as a parameter at runtime, using a ValueProvider.
The code does not show any errors, but when I try to create a template from it, I see the error message below:
Exception in thread "main" java.lang.IllegalStateException: Value only available at runtime, but accessed from a non-runtime context: RuntimeValueProvider{propertyName=bigTableInstanceId, default=null}
This is strange, as the ability to supply ValueProviders is supposed to be supported for exactly this purpose.
Below is the code I am using to write to BigTable:
results.get(btSuccessTag).apply("Write to BigTable",
    CloudBigtableIO.writeToTable(new CloudBigtableTableConfiguration.Builder()
        .withProjectId(options.getProject())
        .withInstanceId(options.getBigTableInstanceId())
        .withTableId(options.getBigTableTable())
        .build()));
The interface defining the ValueProviders is:
public interface BTPipelineOptions extends DataflowPipelineOptions {

    @Required
    @Description("BigTable Instance Id")
    ValueProvider<String> getBigTableInstanceId();
    void setBigTableInstanceId(ValueProvider<String> bigTableInstanceId);

    @Required
    @Description("BigTable Table Destination")
    ValueProvider<String> getBigTableTable();
    void setBigTableTable(ValueProvider<String> bigTableTable);

    @Required
    @Description("BT error file path")
    ValueProvider<String> getBTErrorFilePath();
    void setBTErrorFilePath(ValueProvider<String> btErrorFilePath);
}
Please let me know if I'm missing something here.
Unfortunately, it seems that the CloudBigtableIO parameters have not been updated to be modifiable by templates via ValueProvider, whereas BigtableIO is compatible with ValueProviders.
In order for Dataflow templates to be able to modify a parameter at launch time, the library transforms it uses (i.e. the sources and sinks) must first be updated to use ValueProviders for those parameters all the way down into the library code, at the point where the parameter is used. See more details about ValueProvider here.
However, there are example template pipelines which work with BigtableIO instead of CloudBigtableIO. See AvroToBigtable. So I think you have a few options:
Update your custom pipeline, using one of the Bigtable template examples as a guide. Be sure to use BigtableIO instead of CloudBigtableIO (see the sketch after this list).
Update CloudBigtableIO to use ValueProviders all the way through, until the parameter is used. See creating_templates, and an example of proper ValueProvider usage in BigtableIO. Contribute it to Apache Beam's GitHub, or extend/modify the class locally.
See if the existing Bigtable template pipelines fit your needs. You can launch them from the Dataflow UI.
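As a rough sketch of the first option (not a drop-in replacement: org.apache.beam.sdk.io.gcp.bigtable.BigtableIO.write() expects KV<ByteString, Iterable<Mutation>> elements using the Bigtable v2 Mutation model, not the HBase Mutations that CloudBigtableIO takes, so the upstream transform has to be adjusted), the write step could look something like this:

// Assumed: `mutations` is a PCollection<KV<ByteString, Iterable<Mutation>>> produced upstream.
// BigtableIO accepts ValueProvider<String> for the instance and table ids, so the options
// from BTPipelineOptions can be passed straight through and resolved at template launch time.
mutations.apply("Write to BigTable",
    BigtableIO.write()
        .withProjectId(options.getProject())
        .withInstanceId(options.getBigTableInstanceId())
        .withTableId(options.getBigTableTable()));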
I hope this works for you. Let me know if I explained this well or if I overlooked something.
I'm building a topology and want to use KStream.process() to write some intermediate values to a database. This step doesn't change the nature of the data and is completely stateless.
Adding a Processor requires creating a ProcessorSupplier and passing this instance to the KStream.process() function along with the name of a state store. This is what I don't understand.
How to add a StateStore object to a topology since it requires a StateStoreSupplier?
Failing to add said StateStore gives this error when the application is started:
Exception in thread "main" org.apache.kafka.streams.errors.TopologyBuilderException: Invalid topology building: StateStore my-state-store is not added yet.
Why is it necessary for a processor to have a state store? It seems that this could well be optional for processors that are stateless and don't maintain state.
Process all elements in this stream, one element at a time, by applying a Processor.
Here's a simple example on how to use state stores, taken from the Confluent Platform documentation on Kafka Streams.
Step 1: Defining the StateStore/StateStoreSupplier:
StateStoreSupplier countStore = Stores.create("Counts")
    .withKeys(Serdes.String())
    .withValues(Serdes.Long())
    .persistent()
    .build();
I don't see a way to add a StateStore object to my topology. It requires a StateStoreSupplier as well though.
Step 2: Adding the state store to your topology.
Option A - When using the Processor API:
TopologyBuilder builder = new TopologyBuilder();

// add the source processor node that takes Kafka topic "source-topic" as input
builder.addSource("Source", "source-topic")
    .addProcessor("Process", () -> new WordCountProcessor(), "Source")
    // add the countStore associated with the WordCountProcessor processor
    .addStateStore(countStore, "Process")
    .addSink("Sink", "sink-topic", "Process");
Option B - When using the Kafka Streams DSL:
Here you need to call KStreamBuilder#addStateStore(...) to add the state store to your processor topology. Then, when calling methods such as KStream#process() or KStream#transform(), you must also pass in the name of that state store -- otherwise your application will fail at runtime.
Using KStream#transform() as an example:
KStreamBuilder builder = new KStreamBuilder();

// Add the countStore that will be used within the Transformer[Supplier]
// that we pass into `transform()` below.
builder.addStateStore(countStore);

KStream<byte[], String> input = builder.stream("source-topic");
KStream<String, Long> transformed =
    input.transform(/* your TransformerSupplier */, countStore.name());
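(Since WordCountProcessor is referenced in Option A but not shown, here is a hedged sketch of one plausible implementation against this older Processor API, using the "Counts" store defined in Step 1.)

import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

public class WordCountProcessor extends AbstractProcessor<String, String> {

    private KeyValueStore<String, Long> counts;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        super.init(context);
        // look up the store by the name it was registered with ("Counts")
        counts = (KeyValueStore<String, Long>) context.getStateStore("Counts");
    }

    @Override
    public void process(String key, String line) {
        for (String word : line.toLowerCase().split("\\W+")) {
            Long oldCount = counts.get(word);
            counts.put(word, oldCount == null ? 1L : oldCount + 1);
        }
    }
}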
Why is it necessary for a processor to have a state store? It seems that this could well be optional for processors that are stateless and don't maintain state.
You are right -- you don't need a state store if your processor does not maintain state.
When you do use the DSL with a state store, you simply call KStreamBuilder#addStateStore(...) to add the state store to your processor topology and then reference it by name in process()/transform(). For a stateless processor you can skip the store entirely, as in the sketch below.
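A minimal sketch of that stateless case (the database-writing logic is just a placeholder): no store is added to the builder, and the varargs stateStoreNames argument of process() is simply left empty.

KStreamBuilder builder = new KStreamBuilder();
KStream<String, String> input = builder.stream("source-topic");

// no addStateStore() call and no store names passed to process()
input.process(() -> new AbstractProcessor<String, String>() {
    @Override
    public void process(String key, String value) {
        // stateless side effect: write the intermediate value to your database here
    }
});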
I have an application where I am using Apache Spark 1.4.1 with a standalone cluster. The code in this application has evolved and is quite complicated (far more than the few lines of code we see in most Apache Spark examples), with lots of method calls from one class to another.
I am trying to add code so that when a problem with the data is encountered (while processing it on the cluster nodes), an external application is notified. The connection details for the external application are kept in a config file. I want to pass the connection details to the cluster nodes somehow, but passing them as parameters to each method that runs on the nodes (whether as plain parameters or as a broadcast variable) does not work for my application, because it would mean every method has to pass them along, and we have lots of chained method calls (method A calls B, B calls C, ..., Y calls Z), unlike most Apache Spark examples where we see only one or two method calls.
I am trying to work around this problem: is there a way to pass data to the nodes besides method parameters and broadcast variables? For example, I was looking at setting a system property pointing to the config file (using System.setProperty) on all nodes, so that I could read the connection details on the fly and keep the code isolated in a single block, but I've had no luck so far.
Actually, after some hours of investigation I found a way that really suits my needs. There are two Spark properties (one for the driver, one for the executors) that can be used for passing parameters which can then be read using System.getProperty():
spark.executor.extraJavaOptions
spark.driver.extraJavaOptions
Using them is simpler than the approach suggested in the post above, and you can easily make your application switch configuration from one environment to another (e.g. QA/DEV vs. PROD) when you have all the environments set up in your project.
They can be set in the SparkConf object when you're initializing the SparkContext.
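A minimal sketch, assuming what you want to pass is the path to your connection config file (the property name and path are illustrative, not from the original post). Note that in client mode the driver JVM has already started by this point, so spark.driver.extraJavaOptions may have to be passed via spark-submit instead of SparkConf:

SparkConf conf = new SparkConf()
    .setAppName("my-app")
    .set("spark.driver.extraJavaOptions", "-Dmyapp.config.path=/etc/myapp/connection.conf")
    .set("spark.executor.extraJavaOptions", "-Dmyapp.config.path=/etc/myapp/connection.conf");

JavaSparkContext sc = new JavaSparkContext(conf);

// anywhere in code that runs on the driver or on an executor:
String configPath = System.getProperty("myapp.config.path");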
The post that helped me a lot in figuring out the solution is: http://progexc.blogspot.co.uk/2014/12/spark-configuration-mess-solved.html
The properties you provide as part of --properties-file are loaded at runtime and are available only on the driver, not on any of the executors. But you can always make them available to the executors.
Simple hack:
private static String getPropertyString(String key, Boolean mandatory) {
    String value = sparkConf.get(key, null);
    if (mandatory && value == null) {
        value = sparkConf.getenv(key);
        if (value == null)
            shutDown(key); // or whatever action you would like to take
    }
    // propagate the value to the executors as an environment variable
    if (value != null && sparkConf.getenv(key) == null)
        sparkConf.setExecutorEnv(key, value);
    return value;
}
The first time your driver starts, it will find all the properties provided via the properties file in the SparkConf. As soon as it does, check whether each key is already present in the environment; if not, set those values for the executors using setExecutorEnv in your program.
It's tough to tell whether your code is running on the driver or on an executor, so check whether the property exists in the SparkConf, and if not, check it against the environment using getenv(key).
I suggest the following solution:
Put the configuration in a database.
Put the database connection details in a JOCL (Java Object Configuration Language) file and have this file available on the classpath of each executor.
Make a singleton class that reads the DB connection details from the JOCL file, connects to the database, extracts the configuration info, and exposes it via getter methods (see the sketch after this list).
Import the class into the context where you have your Spark calls and use it to access the configuration from within them.
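A hedged sketch of such a singleton (class, file, and key names are illustrative; a plain properties file stands in for JOCL here to keep the example short):

import java.io.InputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public final class AppConfig {

    private static volatile AppConfig instance;
    private final Properties config = new Properties();

    private AppConfig() {
        try (InputStream in = AppConfig.class.getResourceAsStream("/db-connection.properties")) {
            Properties conn = new Properties();
            conn.load(in);
            // pull the whole configuration table into memory once per JVM (i.e. once per executor)
            try (Connection c = DriverManager.getConnection(
                     conn.getProperty("url"), conn.getProperty("user"), conn.getProperty("password"));
                 Statement s = c.createStatement();
                 ResultSet rs = s.executeQuery("SELECT key, value FROM configuration")) {
                while (rs.next()) {
                    config.setProperty(rs.getString("key"), rs.getString("value"));
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException("Could not load configuration from database", e);
        }
    }

    public static AppConfig getInstance() {
        if (instance == null) {
            synchronized (AppConfig.class) {
                if (instance == null) {
                    instance = new AppConfig();
                }
            }
        }
        return instance;
    }

    public String get(String key) {
        return config.getProperty(key);
    }
}

Code running on an executor can then call something like AppConfig.getInstance().get("notification.url") without the value being threaded through every method signature.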
I am presented with the following use case.
I am receiving a Message<Foo> object on my input channel, where the Foo object has 2 properties:
public class Foo {
...
public String getSourcePathString();
public String getTargetPathString();
...
}
sourcePathString is a String which denotes where the source file is located, while targetPathString denotes where the file should be copied to.
Now, I do know how to use a file:outbound-channel-adapter to copy the file to a custom target location via a FileNameGenerator; however, I am not sure how to provide the location to read the file from to a file:inbound-channel-adapter, or how to trigger the reading only when a message is received.
What I have so far is a custom service activator where I perform the copying in my own bean; however, I'd like to try to do it with Spring Integration components.
So, is there a way to implement triggerable file copying in Spring Integration with already present components?
You cannot currently change the input directory dynamically on the inbound channel adapter.
The upcoming 4.2 release has dynamic pollers which would allow this.
However, it seems the adapter is not really suitable for your use case - it is a polled adapter, whereas you want to fetch the file on demand.
You could minimize your user code by configuring a FileReadingMessageSource, setting the directory, and calling receive() to get the file.
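A minimal sketch of that idea (bean, channel, and header names are mine; depending on your Spring Integration version the message source may need to be initialized as a bean before receive() is called):

import java.io.File;

import org.springframework.integration.annotation.ServiceActivator;
import org.springframework.integration.file.FileReadingMessageSource;
import org.springframework.integration.support.MessageBuilder;
import org.springframework.messaging.Message;

public class CopyService {

    @ServiceActivator(inputChannel = "fooChannel", outputChannel = "toFileAdapter")
    public Message<File> fetch(Message<Foo> message) {
        Foo foo = message.getPayload();

        FileReadingMessageSource source = new FileReadingMessageSource();
        source.setDirectory(new File(foo.getSourcePathString()));

        Message<File> fileMessage = source.receive();
        if (fileMessage == null) {
            return null; // nothing to copy yet
        }

        // carry the target location along as a header so a downstream
        // file:outbound-channel-adapter can use directory-expression="headers['targetDir']"
        return MessageBuilder.withPayload(fileMessage.getPayload())
                .setHeader("targetDir", foo.getTargetPathString())
                .build();
    }
}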