I have an application that uses Apache Spark 1.4.1 on a standalone cluster. The code has evolved and is now quite complicated (far more than the few lines we see in most Apache Spark examples), with lots of method calls from one class to another.
I am trying to add code that, when it encounters a problem with the data (while processing it on the cluster nodes), notifies an external application. The connection details for the external application are kept in a config file. I want to pass the connection details to the cluster nodes somehow, but passing them to every method that runs on the nodes (as parameters or as a broadcast variable) is not acceptable for my application: it would mean each and every method has to pass them along, and we have lots of chained method calls (method A calls B, B calls C ... Y calls Z), unlike most Apache Spark examples where there are only one or two method calls.
I am trying to work around this problem: is there a way to pass data to the nodes besides method parameters and broadcast variables? For example, I was looking at setting a system property pointing to the config file (using System.setProperty) on all nodes, so that I could read the connection details on the fly and keep the code isolated in a single block, but I have had no luck so far.
Actually, after some hours of investigation I found a way that really suits my needs. There are two Spark properties (one for the driver, one for the executors) that can be used to pass parameters that can then be read using System.getProperty():
spark.executor.extraJavaOptions
spark.driver.extraJavaOptions
Using them is simpler than the approach suggested in the post above, and you can easily make your application switch configuration from one environment to another (e.g. QA/DEV vs PROD) when you have all environments set up in your project.
They can be set in the SparkConf object when you're initializing the SparkContext.
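A minimal sketch of what that could look like; the notifier.config property name and the config path are made up for the example:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ConfigViaJavaOptions {
    public static void main(String[] args) {
        // Pass the path of the connection-details file to the driver and executor JVMs
        // as a -D system property. (In client mode the driver option usually has to be
        // supplied at launch, e.g. via spark-submit, since the driver JVM is already running.)
        SparkConf conf = new SparkConf()
                .setAppName("notifier-app")
                .set("spark.driver.extraJavaOptions", "-Dnotifier.config=/etc/myapp/notifier.conf")
                .set("spark.executor.extraJavaOptions", "-Dnotifier.config=/etc/myapp/notifier.conf");
        JavaSparkContext sc = new JavaSparkContext(conf);

        sc.parallelize(Arrays.asList(1, 2, 3)).foreach(x -> {
            // Runs on an executor: the property is available without threading it
            // through the chain of method calls.
            String configPath = System.getProperty("notifier.config");
            // ... read connection details from configPath and notify the external app ...
        });

        sc.stop();
    }
}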
The post that helped me a lot in figuring out the solution is: http://progexc.blogspot.co.uk/2014/12/spark-configuration-mess-solved.html
The properties you provide as part of --properties-file will be loaded at runtime and will be available only in the driver, not on any of the executors. But you can always make them available to the executors.
Simple hack:
private static String getPropertyString(String key, boolean mandatory) {
    // First look in the driver's SparkConf (populated from --properties-file).
    String value = sparkConf.get(key, null);
    if (mandatory && value == null) {
        // On an executor the SparkConf won't have it, so fall back to the executor environment.
        value = sparkConf.getenv(key);
        if (value == null)
            shutDown(key); // Or whatever action you would like to take
    }
    // On the driver: push the value into the executor environment so executors can read it.
    if (value != null && sparkConf.getenv(key) == null)
        sparkConf.setExecutorEnv(key, value);
    return value;
}
When your driver first starts, it will find all the properties provided in the properties file in the SparkConf. As soon as it finds them, check whether each key is already present in the environment; if not, set those values for the executors using setExecutorEnv in your program.
It's tough to distinguish whether your code is running in the driver or in an executor, so check whether the property exists in the SparkConf and, if not, check it against the environment using getenv(key).
I suggest the following solution:
Put the configuration in a database.
Put the database connection details in a JOCL (Java Object Configuration Language) file and have this file available on the class path of each executor.
Make a singleton class that reads the DB connection details from the JOCL file, connects to the database, extracts the configuration info and exposes it via getter methods.
Import the class into the context where you make your Spark calls and use it to access the configuration from within them (see the sketch below).
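Since JOCL parsing is project specific, here is a rough sketch of the singleton idea (step 3) using a plain properties file on the classpath instead; the class name, file name and keys are illustrative:

import java.io.InputStream;
import java.util.Properties;

public final class AppConfig {

    private static final AppConfig INSTANCE = new AppConfig();

    private final Properties props = new Properties();

    private AppConfig() {
        // db-connection.properties must be on the classpath of the driver and of every executor.
        try (InputStream in = AppConfig.class.getResourceAsStream("/db-connection.properties")) {
            props.load(in);
            // In the real setup you would connect to the database here and pull the
            // rest of the configuration into fields exposed by getters like the one below.
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static AppConfig getInstance() {
        return INSTANCE;
    }

    public String getNotifierUrl() {
        return props.getProperty("notifier.url");
    }
}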
I am trying to understand how the new functional model of Spring Cloud Stream works and how the configuration actually works under the hood.
One of the properties I am unable to figure out is spring.cloud.stream.source.
What does this property actually signify?
I could not understand the documentation:
Note that preceding example does not have any source functions defined (e.g., Supplier bean) leaving the framework with no trigger to create source bindings, which would be typical for cases where configuration contains function beans. So to trigger the creation of source binding we use spring.cloud.stream.source property where you can declare the name of your sources. The provided name will be used as a trigger to create a source binding.
What if I did not need a Supplier?
What exactly is a source binding and why is it important?
What if I only wanted to produce to a messaging topic? Would I still need this property?
I also could not understand how it is used in the sample here.
Spring Cloud Stream looks for java.util.function Function<?, ?>, Consumer<?>, and Supplier<?> beans and creates bindings for them.
In the supplier case, the framework polls the supplier (each second by default) and sends the resulting data.
For example
@Bean
public Supplier<String> output() {
    return () -> "foo";
}
spring.cloud.stream.bindings.output-out-0.destination=bar
will send foo to destination bar each second.
But what if you don't need a polled source and instead want to configure a binding to which you can send arbitrary data? Enter spring.cloud.stream.source.
spring.cloud.stream.source=output
spring.cloud.stream.bindings.output-out-0.destination=bar
allows you to send arbitrary data to the stream bridge
bridge.send("output-out-0", "test");
In other words, it allows you to configure one or more output bindings that you can use with the StreamBridge; otherwise, when you send to the bridge, the binding is created dynamically.
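For completeness, a rough sketch of the StreamBridge side in application code; the controller and endpoint are made up for the example:

import org.springframework.cloud.stream.function.StreamBridge;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class PublishController {

    private final StreamBridge bridge;

    public PublishController(StreamBridge bridge) {
        this.bridge = bridge;
    }

    @PostMapping("/publish")
    public void publish(@RequestBody String payload) {
        // "output-out-0" is the binding created via spring.cloud.stream.source=output;
        // its destination comes from spring.cloud.stream.bindings.output-out-0.destination.
        bridge.send("output-out-0", payload);
    }
}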
I am presented with the following use case.
I am receiving a Message<Foo> object on my input channel, where the Foo object has two properties:
public class Foo {
...
public String getSourcePathString();
public String getTargetPathString();
...
}
sourcePathString is a String which denotes where the source file is located, while targetPathString is the place where the file should be copied to.
Now, I know how to use file:outbound-channel-adapter to copy the file to a custom target location via a FileNameGenerator; however, I am not sure how to provide the location to read the file from to file:inbound-channel-adapter, and how to trigger the reading only when the message is received.
What I have so far is a custom service activator where I perform the copying in my own bean; however, I'd like to try to use Spring Integration components for it.
So, is there a way to implement triggerable file copying in Spring Integration with already present components?
You cannot currently change the input directory dynamically on the inbound channel adapter.
The upcoming 4.2 release has dynamic pollers which would allow this.
However, it seems the adapter is not really suitable for your use case - it is a polled adapter, whereas you want to fetch the file on demand.
You could minimize your user code by configuring a FileReadingMessageSource, setting the directory, and calling receive() to get the file.
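A rough sketch of that approach, assuming the Foo payload from the question and that getSourcePathString() returns the directory the file sits in (the class name and wiring are illustrative):

import java.io.File;
import org.springframework.integration.file.FileReadingMessageSource;
import org.springframework.messaging.Message;

public class FileFetcher {

    // Called (e.g. from a service activator) when the Message<Foo> arrives.
    public File fetch(Foo foo) {
        FileReadingMessageSource source = new FileReadingMessageSource();
        // Assumes the directory already exists; when the source is defined as a
        // Spring bean the container takes care of initialization instead.
        source.setDirectory(new File(foo.getSourcePathString()));

        Message<File> message = source.receive(); // fetch the file on demand
        return message != null ? message.getPayload() : null;
    }
}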
I am using Neo4j for storing nodes and need to access the Neo4j database across classes which should all be able to connect concurrently to the database.
I currently use
public void setUp()
{
    //deleteFileOrDirectory(new File(FILESYSTEM_DB));
    graphDb = new GraphDatabaseFactory().newEmbeddedDatabase(FILESYSTEM_DB);
    indexManager = graphDb.index();
    index = indexManager.forNodes("indexNodes");
    registerShutdownHook();
}
to create the database and connect to it; however, the next time another class tries to run a similar method (or another instance of the same class calls the same setUp() method) I get a quite reasonable
"Error Obtaining Lock (org.neo4j.kernel.StoreLockException)".
How can I check whether the database is already running and, if it is not, call newEmbeddedDatabase(FILESYSTEM_DB), but otherwise connect to the running instance?
Make sure the variables graphDb and the others are not local variables but fields of an instance of some class, e.g. Neo4jConnection. Then create a single instance of that class (a singleton), run setUp() once, and use that connection whenever you need to access the database. How you manage that singleton depends on your environment (do you use Spring?). The simplest way is to have a static variable referring to that singleton. Read https://stackoverflow.com/questions/2832297 and other discussions tagged java+singleton.
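A minimal sketch of that idea, reusing the setup code from the question (the class name, path constant and accessors are illustrative):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
import org.neo4j.graphdb.index.Index;

public final class Neo4jConnection {

    private static final String FILESYSTEM_DB = "path/to/db"; // use the same path as in setUp()

    private static final Neo4jConnection INSTANCE = new Neo4jConnection();

    private final GraphDatabaseService graphDb;
    private final Index<Node> index;

    private Neo4jConnection() {
        // Only this one place ever opens the embedded database, so the store lock
        // is acquired exactly once per JVM.
        graphDb = new GraphDatabaseFactory().newEmbeddedDatabase(FILESYSTEM_DB);
        index = graphDb.index().forNodes("indexNodes");
        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() {
                graphDb.shutdown();
            }
        });
    }

    public static Neo4jConnection getInstance() {
        return INSTANCE;
    }

    public GraphDatabaseService getGraphDb() {
        return graphDb;
    }

    public Index<Node> getIndex() {
        return index;
    }
}

Every class that needs the database then calls Neo4jConnection.getInstance() instead of opening its own embedded database, which avoids the StoreLockException.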
I'm using a third party software library with a log prototype like this:
runtime.getInstance().log(int logtype, String moduleName, String logtext);
I have a utility library that I want to be library independent, but I also want to be able to log things to the software package from my own classes. This is fine and good, as the text messages are pretty universal, things like "you've passed bad data!" and "blah blah was successful!" Additionally, I've already wrapped the software vendor's logging functionality, so I'm not even worried about conforming to some random API.
What I am worried about (and why I'm writing this post) is that there are going to be various different modules throughout my system. So the problem looks like this:
ModuleFoo extends com.thirdpartyvendor.BaseModule
ModuleBar extends com.thirdpartyvendor.BaseModule
ModuleFoo ---contains instance of---> IndependentDataStructure ---tries to write a log entry to my WrappedLogger ---> but data structure doesn't have a reference to ModuleFoo.
ModuleBar ---contains instance of---> IndependentDataStructure ---tries to write a log entry to my WrappedLogger ---> but data structure doesn't have a reference to ModuleBar.
Currently my system passes a String moduleName field around, which quite frankly makes me sick... but I want the log entries to tell me which module they came from! How can the logger know whether the IndependentDataStructure instance is working with ModuleFoo rather than ModuleBar (or some other module) without IndependentDataStructure containing a reference to a BaseModule (or a String moduleName)?
Logging APIs such as Log4J and SLF4J have the concept of a diagnostic context: a way to store bits of contextual information in a ThreadLocal map that the log message formatters can access to decorate the messages. A typical use is putting the name of the currently authenticated user into log messages in a web application (using a servlet filter to store the username in the MDC for each request). Would you be able to use a similar concept in your system?
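If the third-party logger cannot do this for you, a tiny home-grown equivalent can live alongside your wrapped logger. A rough sketch of the idea (ModuleContext and its method names are made up for the example):

// Thread-local "diagnostic context" holding the current module name, so that
// IndependentDataStructure never needs a reference to its owning module.
public final class ModuleContext {

    private static final ThreadLocal<String> CURRENT_MODULE = new ThreadLocal<String>();

    private ModuleContext() {
    }

    public static void set(String moduleName) {
        CURRENT_MODULE.set(moduleName);
    }

    public static String get() {
        String name = CURRENT_MODULE.get();
        return name != null ? name : "unknown";
    }

    public static void clear() {
        CURRENT_MODULE.remove();
    }
}

ModuleFoo would call ModuleContext.set("ModuleFoo") before handing work to the IndependentDataStructure (and clear() in a finally block afterwards), and the wrapped logger would pass ModuleContext.get() as the moduleName argument to the vendor's log call.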
runtime.getInstance().log(logtype,
                          this.getClass().getSimpleName(),
                          logtext);
If I've got it right...
EDIT: and for an automated (but somewhat slow) way to get the calling method's name:
Thread.currentThread().getStackTrace()[level].getMethodName();
(where level is an integer specifying how many stack frames the log request has passed through before reaching this call)
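If the lookup lives inside the wrapped logger itself, the same trick can recover the calling class as well as the method. A rough sketch under that assumption (WrappedLogger, LOG_INFO and the frame index are illustrative; runtime is the vendor class from the question):

public class WrappedLogger {

    private static final int LOG_INFO = 1; // placeholder for the vendor's log-type constant

    public void info(String logtext) {
        // [0] = Thread.getStackTrace, [1] = this method, [2] = the code that called the logger
        StackTraceElement caller = Thread.currentThread().getStackTrace()[2];
        String moduleName = caller.getClassName() + "." + caller.getMethodName();
        runtime.getInstance().log(LOG_INFO, moduleName, logtext);
    }
}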
final RuntimeMXBean remoteRuntime =
ManagementFactory.newPlatformMXBeanProxy(
serverConnection,
ManagementFactory.RUNTIME_MXBEAN_NAME,
RuntimeMXBean.class);
Where serverConnection is basically just a connection to the JMX server (an MBeanServerConnection).
What is basically going on is that this piece of code works fine. Let me explain:
The first call of this code goes to server A; I then scrape some data from it and store it in an XML file. Using this information, I start up a new server B.
Then, wanting to verify B, I scrape B to compare the metadata. But when I run it I get the exception
java.lang.IllegalArgumentException: java.lang:type=Runtime is not an instance of interface java.lang.management.RuntimeMXBean
at java.lang.management.ManagementFactory.newPlatformMXBeanProxy(ManagementFactory.java:617)
But I'm not sure what changes here, since the parameters that are giving me problems are managed by the ManagementFactory class, which I don't have control over.
The problem was with my own MBeanServer implementation.
I had it returning false from the isInstanceOf() method if the passed-in ObjectName resolved to a null object. It turned out that this happened for all the Runtime classes, so after reading http://tim.oreilly.com/pub/a/onjava/2005/01/26/classloading.html (under the Class Loader section) I went with the fact that my ClassLoader implementation was incorrect and was loading these classes incorrectly.
The workaround was just to return true from isInstanceOf() for these Runtime classes.
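A rough sketch of what that workaround could look like inside the custom MBeanServer implementation (the delegate field and the domain check are illustrative):

import javax.management.InstanceNotFoundException;
import javax.management.ObjectName;

// Inside the custom MBeanServer implementation:
@Override
public boolean isInstanceOf(ObjectName name, String className) throws InstanceNotFoundException {
    // The platform MXBeans in the java.lang domain (Runtime, Memory, ...) resolved to a
    // null object here, which made this method return false and broke newPlatformMXBeanProxy.
    if ("java.lang".equals(name.getDomain())) {
        return true;
    }
    return delegate.isInstanceOf(name, className); // delegate: the real/wrapped MBeanServer
}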