Databricks Spark notebook re-using Scala objects between runs?

I have written an Azure Databricks Scala notebook (based on a JAR library), and I run it using a Databricks job once every hour.
In the code, I use the Application Insights Java SDK for log tracing, and initialize a GUID that marks the "RunId". I do this in a Scala 'object' constructor:
object AppInsightsTracer
{
  TelemetryConfiguration.getActive().setInstrumentationKey("...")
  val tracer = new TelemetryClient()
  val properties = new java.util.HashMap[String, String]()
  properties.put("RunId", java.util.UUID.randomUUID.toString)

  def trackEvent(name: String)
  {
    tracer.trackEvent(name, properties, null)
  }
}
The notebook itself simply calls the code in the JAR:
import com.mypackage._
Flow.go()
I expect to have a different "RunId" every hour. The weird behavior I am seeing is that for all runs, I get exactly the same "RunId" in the logs!
As if the Scala object constructor code is run exactly once, and is re-used between notebook runs...
Do Spark/Databricks notebooks retain context between runs? If so how can this be avoided?

A notebook (Jupyter or Databricks) spawns a Spark session (think of it as a process) and keeps it alive until it either dies or you restart it explicitly. The object is a singleton, so it is initialized once and stays the same for all cell executions of the notebook.

You start with a new context every time you refresh the notebook.
I would recommend saving your RunId to a file on disk, reading that file on every notebook run, and then incrementing the RunId in the file.
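A minimal sketch of that file-based approach (the DBFS path and class name below are illustrative assumptions, not part of the original answer) could look like this; the key point is that nextRunId() is called from a method that runs on every job invocation, not from the object constructor, which only runs once per JVM:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class RunIdStore {
    // Hypothetical location; any path that survives between runs will do.
    private static final Path RUN_ID_FILE = Paths.get("/dbfs/tmp/run_id.txt");

    // Read the previous RunId (0 if the file does not exist yet), increment it,
    // write it back, and return the value to use for the current run.
    public static synchronized long nextRunId() throws IOException {
        long previous = 0L;
        if (Files.exists(RUN_ID_FILE)) {
            String content = new String(Files.readAllBytes(RUN_ID_FILE), StandardCharsets.UTF_8);
            previous = Long.parseLong(content.trim());
        }
        long next = previous + 1;
        Files.write(RUN_ID_FILE, Long.toString(next).getBytes(StandardCharsets.UTF_8));
        return next;
    }
}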

Related

Google Cloud Dataflow: Submitted job is executing but using old code

I'm writing a Dataflow pipeline that should do 3 things:
Reading .csv files from GCP Storage
Parsing the data to BigQuery compatible TableRows
Writing the data to a BigQuery table
Up until now this has all worked like a charm. And it still does, but when I change the source and destination variables, nothing changes. The job that actually runs is an old one, not the recently changed (and committed) code. Somehow, when I run the code from Eclipse using the BlockingDataflowPipelineRunner, the code itself is not uploaded but an older version is used.
There is normally nothing wrong with the code, but to be as complete as possible:
public class BatchPipeline {
  public static void main(String[] args) {
    String source = "gs://sourcebucket/*.csv";
    String destination = "projectID:datasetID.testing1";

    // Creation of the pipeline with default arguments
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

    PCollection<String> line = p.apply(TextIO.Read.named("ReadFromCloudStorage")
        .from(source));

    @SuppressWarnings("serial")
    PCollection<TableRow> tablerows = line.apply(ParDo.named("ParsingCSVLines").of(new DoFn<String, TableRow>() {
      @Override
      public void processElement(ProcessContext c) {
        // processing code goes here
      }
    }));

    // Defining the BigQuery table schema
    List<TableFieldSchema> fields = new ArrayList<>();
    fields.add(new TableFieldSchema().setName("datetime").setType("TIMESTAMP").setMode("REQUIRED"));
    fields.add(new TableFieldSchema().setName("consumption").setType("FLOAT").setMode("REQUIRED"));
    fields.add(new TableFieldSchema().setName("meterID").setType("STRING").setMode("REQUIRED"));
    TableSchema schema = new TableSchema().setFields(fields);
    String table = destination;

    tablerows.apply(BigQueryIO.Write
        .named("BigQueryWrite")
        .to(table)
        .withSchema(schema)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withoutValidation());

    // Runs the pipeline
    p.run();
  }
}
This problem arose because I've just changed laptops and had to reconfigure everything. I'm working on a clean Ubuntu 16.04 LTS OS with all the dependencies for GCP development installed (normally). Everything seems to be configured quite well, since I'm able to start a job (which shouldn't be possible if my config were wrong, right?). I'm using Eclipse Neon, btw.
So where could the problem lie? It seems to me that there is a problem uploading the code, but I've made sure that my cloud git repo is up-to-date and the staging bucket has been cleaned up ...
**** UPDATE ****
I never found out what exactly was going wrong, but when I checked the creation dates of the files in my deployed jar, I indeed saw that they were never really updated. The jar file itself, however, had a recent timestamp, which made me overlook that problem completely (rookie mistake).
I eventually got it all working again by simply creating a new Dataflow project in Eclipse and copying my .java files from the broken project into the new one. Everything worked like a charm from then on.
Once you submit a Dataflow job, you can check which artifacts were part of the job specification by inspecting the files that were staged, which are available via DataflowPipelineWorkerPoolOptions#getFilesToStage. The code snippet below gives a small sample of how to get this information.
PipelineOptions myOptions = ...
myOptions.setRunner(DataflowPipelineRunner.class);
Pipeline p = Pipeline.create(myOptions);
// Build up your pipeline and run it.
p.apply(...)
p.run();
// At this point in time, the files which were staged by the
// DataflowPipelineRunner will have been populated into the
// DataflowPipelineWorkerPoolOptions#getFilesToStage
List<String> stagedFiles = myOptions.as(DataflowPipelineWorkerPoolOptions.class).getFilesToStage();
for (String stagedFile : stagedFiles) {
  System.out.println(stagedFile);
}
The above code should print out something like:
/my/path/to/file/dataflow.jar
/another/path/to/file/myapplication.jar
/a/path/to/file/alibrary.jar
It is likely that the resources that are part of the job you are uploading are out of date in some way and contain your old code. Look through all the directories and jar files in the staging list, find all instances of BatchPipeline, and verify their age. Jar files can be extracted using the jar tool or any zip file reader. Alternatively, use javap or any other class file inspector to validate that the BatchPipeline class file lines up with the expected changes you have made.
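As a hedged illustration of that check (not part of the original answer, and the class name is arbitrary), a few lines of plain Java can list the class entries in one of the staged jars and print their timestamps, so a stale copy of BatchPipeline is easy to spot:

import java.io.IOException;
import java.util.Date;
import java.util.jar.JarFile;

public class StagedJarInspector {
    public static void main(String[] args) throws IOException {
        // args[0] is one of the paths printed from getFilesToStage() above
        try (JarFile jar = new JarFile(args[0])) {
            jar.stream()
               .filter(entry -> entry.getName().endsWith(".class"))
               .filter(entry -> entry.getName().contains("BatchPipeline"))
               .forEach(entry -> System.out.println(
                       entry.getName() + " last modified: " + new Date(entry.getTime())));
        }
    }
}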

How to get vm creation time from the machine's properties

I'm using vijava (5.1) to fetch data from a vCenter about virtual machines.
For that matter I'm using a filter with some properties (for example, guest.hostName, runtime.powerState etc.).
I need to get the creation time for these virtual machines and from what I saw, this info is available in the event logs of the vCenter.
Is there a way to get this info part of the virtual machine's properties?
I searched for this info using the vSphere Client and didn't find it - so I guess the only place is the event logs - but just to be sure, is that the only way?
Thanks
It is hard to get the creation time of a virtual machine using the vijava API. However, you can get the other information below from VirtualMachineConfigInfo.
changeVersion : The changeVersion is a unique identifier for a given version of the configuration. Each change to the configuration updates this value. This is typically implemented as an ever increasing count or a time-stamp. However, a client should always treat this as an opaque string.
modified : Last time a virtual machine's configuration was modified.
Folder rootFolder = serviceInstance.getRootFolder();
InventoryNavigator inventoryNavigator = new InventoryNavigator(rootFolder);
VirtualMachine vm = (VirtualMachine) inventoryNavigator.searchManagedEntity(VirtualMachine.class.getSimpleName(), vmName);
VirtualMachineConfigInfo vmConfig = vm.getConfig();
System.out.println(vmConfig.getChangeVersion());
[Screenshot: the information available on the VirtualMachineConfigInfo object]
Unless you set the creation time as an extra config property, the event log is the only way I know of. If you want to go the extra-config route, I created a sample that shows how to use them; it is part of the pyvmomi-community-samples project.
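For completeness, here is a rough vijava sketch of that extra-config route (the key name is an arbitrary choice, not a vSphere convention, so treat this as an assumption rather than a recipe): stamp the VM once when you provision it, and later read the value back from config.extraConfig instead of the event log.

import com.vmware.vim25.OptionValue;
import com.vmware.vim25.VirtualMachineConfigSpec;
import com.vmware.vim25.mo.Task;
import com.vmware.vim25.mo.VirtualMachine;

public class CreationTimeStamper {
    // Call this once, right after the VM has been created or cloned.
    public static void stampCreationTime(VirtualMachine vm) throws Exception {
        OptionValue created = new OptionValue();
        created.setKey("guestinfo.myapp.creationTime"); // hypothetical key name
        created.setValue(String.valueOf(System.currentTimeMillis()));

        VirtualMachineConfigSpec spec = new VirtualMachineConfigSpec();
        spec.setExtraConfig(new OptionValue[] { created });

        Task task = vm.reconfigVM_Task(spec);
        task.waitForTask(); // block until the reconfigure completes
    }
}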

Separating Yourkit sessions

I have some segment of code I want to profile on many different inputs (~1000) so it doesn't make sense to manually run each test and save the results. I'm using yourkit in combination with Eclipse to profile. Is there any way to create "new sessions" for profiling? I want to be able to separate each run so that would make the most sense.
You don't really need to create "sessions" for each test. Instead, you have to capture a snapshot of the profiling data at the end of each test, and clear the profiling data before running the next test.
Using the yourkit API, you can do so in a manner similar to:
public void profile(String host, int port, List<InputData> inputDataSet) {
  Map<InputData, String> pathMap = new HashMap<InputData, String>(); // If you want to save the location of each snapshot

  // Init profiling data collection
  com.yourkit.api.Controller controller = new Controller(host, port);
  controller.startCPUSampling(/* with your settings */);
  controller.startAllocationRecording(/* with your settings */);
  // controller.startXXX with whatever data you want to collect

  for (InputData input : inputDataSet) {
    // Run your test
    runTest(input);

    // Save profiling data
    String path = controller.captureSnapshot(/* with or without memory dump */);
    pathMap.put(input, path);

    // Clear yourkit profiling data
    controller.clearAllocationData();
    controller.clearCPUData();
    // controller.clearXXX with whatever data you are collecting
  }
}
I don't think you need to stop collecting, capture a snapshot, clear the data, and restart collecting; you can just capture and clear the data, but please double-check.
Once the tests are run, you can open the snapshots in yourkit and analyze the profiling data.
Unfortunately it's not clear how you run your tests. Does each test run in its own JVM process, or do you run all tests in a loop inside a single JVM?
If you run each test in its own JVM, then you need to: 1) run the JVM with the profiler agent, i.e. use the -agentpath option (the details are here: http://www.yourkit.com/docs/java/help/agent.jsp); 2) specify what you are profiling on JVM startup (agent options "sampling", "tracing", etc.); 3) capture a snapshot file on JVM exit (the "onexit" agent option).
The full list of options: http://www.yourkit.com/docs/java/help/startup_options.jsp
If you run all tests inside a single JVM, you can use the profiler API http://www.yourkit.com/docs/java/help/api.jsp to start profiling before a test starts and capture a snapshot after it finishes. You need to use the com.yourkit.api.Controller class.

How to initialize an application's java object from command line

I have an ear application (myApp) that runs on a Websphere Application Server (WAS). I have a jar (myJar) that is loaded into the classpath of myApp when the WAS server is started. myJar has a class (MyInitClass) that reads from a db and loads a set of data (myData) into memory. This data gets read many times by myApp. The point is to get myData into memory to prevent doing a db call every time this data is used. This part works great!
The solution I am trying to provide is a manual initialization of MyInitClass. myData gets changed from time to time and I would like to be able to reinitialize MyInitClass from a command line so I don't have to restart the application. Is this possible?
myApp calls a class (MyClass) that has something like this:
public static MyInitClass initClass;

public boolean doStuff()
{
    if (initClass == null)
    {
        initClass = new MyInitClass();
        // this method loads the data into the initClass.myData array
        initClass.dataInitializer();
    }
    // else: no need to reload initClass.myData
    return true;
}
I have created code similar to this in another class (MyManualInit):
public static void main(String[] args)
{
    MyClass.initClass = new MyInitClass();
    MyClass.initClass.dataInitializer();
}
When I run MyManualInit from command line it prints all the same debug info that gets printed during the initialization from myApp. But myApp does not recognize that MyInitClass has been reinitialized. I have printed System.out.println(System.getProperty("java.home")) from both processes to validate that I am using the same JRE to run both. Am I doing something obviously wrong here or does it just not work like that? I assumed if I ran MyManualInit on the same JRE it would work. MyClass, MyInitClass and MyManualInit are all in myJar.
Please let me know if you need more info.
You are mixing things up. WebSphere runs in one instance of the JVM, and your command-line program instantiates a new one; objects do not communicate between different JVMs (at least not without some effort: bringing up sockets, etc.).
Actually, your code does a lazy initialization of your initClass object, and that should be enough without any command-line interaction. Why is it not enough for you?
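If the goal is specifically to trigger the reload from outside without restarting the application, one option (not mentioned in the answers above, and only a hedged sketch with hypothetical names) is to expose the re-initialization inside the server JVM, for example as a standard JMX MBean that can then be invoked from jconsole or any other JMX client:

interface DataReloaderMBean {
    void reload();
}

public class DataReloader implements DataReloaderMBean {
    @Override
    public void reload() {
        // Runs inside the WAS JVM, so myApp sees the new data immediately.
        MyClass.initClass = new MyInitClass();
        MyClass.initClass.dataInitializer();
    }

    // Register once at application startup, e.g. from a ServletContextListener.
    public static void register() throws Exception {
        java.lang.management.ManagementFactory.getPlatformMBeanServer().registerMBean(
                new DataReloader(), new javax.management.ObjectName("myApp:type=DataReloader"));
    }
}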

JRuby: Calling Java Code From A Rack App And Keeping It In Memory

I currently know Java and Ruby, but have never used JRuby. I want to use some RAM- and computation-intensive Java code inside a Rack (sinatra) web application. In particular, this Java code loads about 200MB of data into RAM, and provides methods for doing various calculations that use this in-memory data.
I know it is possible to call Java code from Ruby in JRuby, but in my case there is an additional requirement: This Java code would need to be loaded once, kept in memory, and kept available as a shared resource for the sinatra code (which is being triggered by multiple web requests) to call out to.
Questions
Is a setup like this even possible?
What would I need to do to accomplish it? I am not even sure if this is a JRuby question per se, or something that would need to be configured in the web server. I have experience with Passenger and Unicorn/nginx, but not with Java servers, so if this does involve configuration of a Java server such as Tomcat, any info about that would help.
I am really not sure where to even start looking, or if there is a better way to be approaching this problem, so any and all recommendations or relevant links are appreciated.
Yes, a setup like this is possible (see below about Deployment), and to accomplish it I would suggest using a Singleton
Singletons in Jruby
With reference to the question "best/most elegant way to share objects between a stack of rack mounted apps/middlewares?", I agree with Colin Surprenant's answer, namely the singleton-as-module pattern, which I prefer over using the singleton mixin
Example
I post here some test code you can use as a proof of concept:
JRuby sinatra side:
#file: sample_app.rb
require 'sinatra/base'
require 'java' #https://github.com/jruby/jruby/wiki/CallingJavaFromJRuby
java_import org.rondadev.samples.StatefulCalculator # import your Java class here

# singleton-as-module, loaded once and kept in memory
module App
  module Global
    extend self

    def calc
      @calc ||= StatefulCalculator.new
    end
  end
end

# you could call a method to load data in the stateful Java object
App::Global.calc.turn_on

class Sample < Sinatra::Base
  get '/' do
    "Welcome, calculator register: #{App::Global.calc.display}"
  end

  get '/add_one' do
    "added one to calculator register, new value: #{App::Global.calc.add(1)}"
  end
end
You can start it in Tomcat with trinidad, or simply with rackup config.ru, but you need:
#file: config.ru
root = File.dirname(__FILE__) # => "."
require File.join( root, 'sample_app' ) # => true
run Sample # ..in sample_app.rb ..class Sample < Sinatra::Base
Something about the Java side:
package org.rondadev.samples;

public class StatefulCalculator {
    private StatelessCalculator calculator;

    double register = 0;

    public double add(double a) {
        register = calculator.add(register, a);
        return register;
    }

    public double display() {
        return register;
    }

    public void clean() {
        register = 0;
    }

    public void turnOff() {
        calculator = null;
        System.out.println("[StatefulCalculator] Good bye ! ");
    }

    public void turnOn() {
        calculator = new StatelessCalculator();
        System.out.println("[StatefulCalculator] Welcome !");
    }
}
Please note that the register here is only a double, but in your real scenario it can be a big data structure.
Deployment
You can deploy using Mongrel, Thin (experimental), Webrick (but who would do that?), and even Java-centric application containers like GlassFish, Tomcat, or JBoss (source: JRuby deployments).
TorqueBox is built on the JBoss Application Server; JBoss AS includes high-performance clustering, caching and messaging functionality.
trinidad is a RubyGem that allows you to run any Rack-based app wrapped within an embedded Apache Tomcat container.
Thread synchronization
Sinatra will use the Mutex#synchronize method to place a lock on every request to avoid race conditions among threads. If your Sinatra app is multithreaded and not thread safe, or any gem you use is not thread safe, you would want to set :lock, true so that only one request is processed at a given time. Otherwise, by default lock is false, which means synchronize will yield to the block directly.
source: https://github.com/zhengjia/sinatra-explained/blob/master/app/tutorial_2/tutorial_2.md
Here are some instructions for how to deploy a sinatra app to Tomcat.
The Java code can be loaded once and reused if you keep a reference to the Java instances you have loaded. You can keep the reference in a global variable in Ruby.
One thing to be aware of is that the Java library you are using may not be thread safe. If you are running your Ruby code in Tomcat, multiple requests can execute concurrently, and those requests may all access your shared Java library. If your library is not thread safe, you will have to use some sort of synchronization to prevent multiple threads from accessing it at the same time.
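As a hedged sketch of that last point (the class name is illustrative, and whether you need it depends on the library), the synchronization can live in a thin Java wrapper so that concurrent Rack requests serialize their access to the shared instance:

package org.rondadev.samples;

public class SynchronizedCalculator {
    private final StatefulCalculator delegate = new StatefulCalculator();

    public SynchronizedCalculator() {
        delegate.turnOn(); // the stateful calculator must be switched on before use
    }

    // Every method that touches shared state is synchronized on this wrapper,
    // so only one request thread uses the underlying calculator at a time.
    public synchronized double add(double a) {
        return delegate.add(a);
    }

    public synchronized double display() {
        return delegate.display();
    }
}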
