Hello people of the Earth!
I'm using Airflow to schedule and run Spark tasks.
All I have found so far are Python DAGs that Airflow can manage.
DAG example:
spark_count_lines.py
import logging
from airflow import DAG
from airflow.operators import PythonOperator
from datetime import datetime
args = {
    'owner': 'airflow',
    'start_date': datetime(2016, 4, 17),
    'provide_context': True
}

dag = DAG(
    'spark_count_lines',
    start_date=datetime(2016, 4, 17),
    schedule_interval='@hourly',
    default_args=args
)

def run_spark(**kwargs):
    import pyspark
    sc = pyspark.SparkContext()
    df = sc.textFile('file:///opt/spark/current/examples/src/main/resources/people.txt')
    logging.info('Number of lines in people.txt = {0}'.format(df.count()))
    sc.stop()

t_main = PythonOperator(
    task_id='call_spark',
    dag=dag,
    python_callable=run_spark
)
The problem is that I'm not good at Python and I have some tasks written in Java. My question is: how do I run a Spark Java jar in a Python DAG? Or maybe there is another way to do it? I found spark-submit: http://spark.apache.org/docs/latest/submitting-applications.html
But I don't know how to connect everything together. Maybe someone has used it before and has a working example. Thank you for your time!
You should be able to use BashOperator. Keeping the rest of your code as is, import the required class and system packages:
from airflow.operators.bash_operator import BashOperator
import os
import sys
set the required paths:
os.environ['SPARK_HOME'] = '/path/to/spark/root'
sys.path.append(os.path.join(os.environ['SPARK_HOME'], 'bin'))
and add operator:
spark_task = BashOperator(
    task_id='spark_java',
    bash_command='spark-submit --class {{ params.class }} {{ params.jar }}',
    params={'class': 'MainClassName', 'jar': '/path/to/your.jar'},
    dag=dag
)
You can easily extend this to provide additional arguments using Jinja templates.
You can of course adjust this for a non-Spark scenario by replacing bash_command with a template suitable for your case, for example:
bash_command = 'java -jar {{ params.jar }}'
and adjusting params.
As of version 1.8 (released today), Airflow has:
SparkSqlOperator - https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/spark_sql_operator.py ;
SparkSQLHook code - https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/spark_sql_hook.py
SparkSubmitOperator - https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/spark_submit_operator.py
SparkSubmitHook code - https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/spark_submit_hook.py
Notice that these two new Spark operators/hooks are in the "contrib" branch as of version 1.8, so they are not (well) documented.
So you can use SparkSubmitOperator to submit your Java code for Spark execution.
Here is an example of SparkSubmitOperator usage for Spark 2.3.1 on Kubernetes (a Minikube instance):
"""
Code that goes along with the Airflow located at:
http://airflow.readthedocs.org/en/latest/tutorial.html
"""
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
from airflow.models import Variable
from datetime import datetime, timedelta
default_args = {
    'owner': 'user@mail.com',
    'depends_on_past': False,
    'start_date': datetime(2018, 7, 27),
    'email': ['user@mail.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    'end_date': datetime(2018, 7, 29),
}

dag = DAG(
    'tutorial_spark_operator', default_args=default_args, schedule_interval=timedelta(1))

t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

print_path_env_task = BashOperator(
    task_id='print_path_env',
    bash_command='echo $PATH',
    dag=dag)

spark_submit_task = SparkSubmitOperator(
    task_id='spark_submit_job',
    conn_id='spark_default',
    java_class='com.ibm.cdopoc.DataLoaderDB2COS',
    application='local:///opt/spark/examples/jars/cppmpoc-dl-0.1.jar',
    total_executor_cores='1',
    executor_cores='1',
    executor_memory='2g',
    num_executors='2',
    name='airflowspark-DataLoaderDB2COS',
    verbose=True,
    driver_memory='1g',
    conf={
        'spark.DB_URL': 'jdbc:db2://dashdb-dal13.services.dal.bluemix.net:50001/BLUDB:sslConnection=true;',
        'spark.DB_USER': Variable.get("CEDP_DB2_WoC_User"),
        'spark.DB_PASSWORD': Variable.get("CEDP_DB2_WoC_Password"),
        'spark.DB_DRIVER': 'com.ibm.db2.jcc.DB2Driver',
        'spark.DB_TABLE': 'MKT_ATBTN.MERGE_STREAM_2000_REST_API',
        'spark.COS_API_KEY': Variable.get("COS_API_KEY"),
        'spark.COS_SERVICE_ID': Variable.get("COS_SERVICE_ID"),
        'spark.COS_ENDPOINT': 's3-api.us-geo.objectstorage.softlayer.net',
        'spark.COS_BUCKET': 'data-ingestion-poc',
        'spark.COS_OUTPUT_FILENAME': 'cedp-dummy-table-cos2',
        'spark.kubernetes.container.image': 'ctipka/spark:spark-docker',
        'spark.kubernetes.authenticate.driver.serviceAccountName': 'spark'
    },
    dag=dag,
)
t1.set_upstream(print_path_env_task)
spark_submit_task.set_upstream(t1)
The code uses values stored in Airflow Variables.
Also, you need to create a new Spark connection, or edit the existing 'spark_default' connection, with the
extra dictionary {"queue":"root.default", "deploy-mode":"cluster", "spark-home":"", "spark-binary":"spark-submit", "namespace":"default"}.
Go to Admin -> Connections -> Create in the Airflow UI. Create a new SSH connection by providing host = IP address, port = 22, and extra as {"key_file": "/path/to/pem/file", "no_host_key_check": true}.
This host should be the Spark cluster master from which you can submit Spark jobs. Next, you need to create a DAG with an SSHOperator. The following is a template for this.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.ssh_operator import SSHOperator

with DAG(dag_id='ssh-dag-id',
         start_date=datetime(2018, 1, 1),  # a DAG needs a start_date; adjust as needed
         schedule_interval="05 12 * * *",
         catchup=False) as dag:

    spark_job = ("spark-submit --class fully.qualified.class.name "
                 "--master yarn "
                 "--deploy-mode client "
                 "--driver-memory 6G "
                 "--executor-memory 6G "
                 "--num-executors 6 "
                 "/path/to/your-spark.jar")

    ssh_run_query = SSHOperator(
        task_id="random_task_id",
        ssh_conn_id="name_of_connection_you_just_created",
        command=spark_job,
        get_pty=True,
        dag=dag)

    ssh_run_query
That's it. You also get the complete logs for this Spark job in Airflow.
Related
I am building a command-line Java application and I have a problem with parsing the command line parameters with Apache Commons CLI.
I am trying to cover a scenario where I need two mutually exclusive command-line parameter groups, with both long (--abc) and short (-a) arguments.
Use case 1
short params: -d oracle -j jdbc:oracle:thin:@//host:port/databa
same but with long params: -dialect oracle -jdbcUrl jdbc:oracle:thin:@//host:port/databa
Use case 2:
short params: -d oracle -h host -p 1521 -s database -U user -P pwd
same but with long params: -dialect oracle -host host -port 1521 -sid database -user user -password pwd
So I created two OptionGroup with the proper Option items:
OptionGroup jdbcUrlGroup = new OptionGroup();
jdbcUrlGroup.setRequired(true);
jdbcUrlGroup.addOption(jdbcUrl);
second group:
OptionGroup customConfigurationGroup = new OptionGroup();
customConfigurationGroup.setRequired(true);
customConfigurationGroup.addOption(host);
customConfigurationGroup.addOption(port);
customConfigurationGroup.addOption(sid);
customConfigurationGroup.addOption(user);
customConfigurationGroup.addOption(password);
Then I build the Options object this way:
Options options = new Options();
options.addOptionGroup(jdbcUrlGroup);
options.addOptionGroup(customConfigurationGroup);
options.addOption(dialect);
But this does not work, because it expects both groups to be defined.
This is how the dialect Option is defined:
Option dialect = Option
.builder("d")
.longOpt("dialect")
.required(false)
.hasArg()
.argName("DIALECT")
.desc("supported SQL dialects: oracle. Default value: oracle")
.build();
The other mandatory Option definitions look similar except this one property:
.required(true)
Result:
-d oracle: Missing required options: [-j ...], [-h ..., -p ..., -s ..., -U ..., -P ...]
-d oracle -jdbcUrl xxx: Missing required option: [-h ..., -p ..., -s ..., -U ..., -P ...]
-d oracle -h yyy: Missing required option: [-j ...]
But what I want is the following: if the JDBC URL is provided, then the host, port, etc. params are not needed, and vice versa.
I think it is time to forget Apache Commons CLI and treat it as a deprecated library. If you have only a few command-line arguments you can still use it, but otherwise it is better not to. The project was updated recently (17 February 2019), yet many features are still missing from it, and the library is a little painful to work with.
The picocli project looks like a better candidate for parsing command-line parameters. It is quite an intuitive library, easy to use, and it has nice, comprehensive documentation as well. I think a middling tool with perfect documentation is better than a shiny project without any documentation.
Anyway, picocli is a very nice library with excellent documentation, so I give it a double plus-plus :)
This is how I covered my use cases with picocli:
import picocli.CommandLine;
import picocli.CommandLine.ArgGroup;
import picocli.CommandLine.Command;
import picocli.CommandLine.Option;
import picocli.CommandLine.Parameters;

@Command(name = "SqlRunner",
        sortOptions = false,
        usageHelpWidth = 100,
        description = "SQL command line tool. It executes the given SQL and show the result on the standard output.\n",
        parameterListHeading = "General options:\n",
        footerHeading = "\nPlease report issues at arnold.somogyi@gmail.com.",
        footer = "\nDocumentation, source code: https://github.com/zappee/sql-runner.git")
public class SqlRunner implements Runnable {

    /**
     * Definition of the general command line options.
     */
    @Option(names = {"-?", "--help"}, usageHelp = true, description = "Display this help and exit.")
    private boolean help;

    @Option(names = {"-d", "--dialect"}, defaultValue = "oracle", showDefaultValue = CommandLine.Help.Visibility.ALWAYS, description = "Supported SQL dialects: oracle.")
    private static String dialect;

    @ArgGroup(exclusive = true, multiplicity = "1", heading = "\nProvide a JDBC URL:\n")
    MainArgGroup mainArgGroup;

    /**
     * Two exclusive parameter groups:
     *   (1) JDBC URL parameter
     *   (2) Custom connection parameters
     */
    static class MainArgGroup {
        /**
         * JDBC URL option (only one parameter).
         */
        @Option(names = {"-j", "--jdbcUrl"}, arity = "1", description = "JDBC URL, example: jdbc:oracle:<drivertype>:@//<host>:<port>/<database>.")
        private static String jdbcUrl;

        /**
         * Custom connection parameter group.
         */
        @ArgGroup(exclusive = false, multiplicity = "1", heading = "\nCustom configuration:\n")
        CustomConfigurationGroup customConfigurationGroup;
    }

    /**
     * Definition of the SQL which will be executed.
     */
    @Parameters(index = "0", arity = "1", description = "SQL to be executed. Example: 'select 1 from dual'")
    String sql;

    /**
     * Custom connection parameters.
     */
    static class CustomConfigurationGroup {
        @Option(names = {"-h", "--host"}, required = true, description = "Name of the database server.")
        private static String host;

        @Option(names = {"-p", "--port"}, required = true, description = "Number of the port where the server listens for requests.")
        private static String port;

        @Option(names = {"-s", "--sid"}, required = true, description = "Name of the particular database on the server. Also known as the SID in Oracle terminology.")
        private static String sid;

        @Option(names = {"-U", "--user"}, required = true, description = "Name for the login.")
        private static String user;

        @Option(names = {"-P", "--password"}, required = true, description = "Password for the connecting user.")
        private static String password;
    }

    /**
     * The entry point of the executable JAR.
     *
     * @param args command line parameters
     */
    public static void main(String[] args) {
        CommandLine cmd = new CommandLine(new SqlRunner());
        int exitCode = cmd.execute(args);
        System.exit(exitCode);
    }

    /**
     * Called by picocli after the command line has been parsed.
     */
    @Override
    public void run() {
        int exitCode = 0; //executeMyStaff();
        System.exit(exitCode);
    }
}
And this is what the generated help looks like:
$ java -jar target/sql-runner-1.0-shaded.jar --help
Usage: SqlRunner [-?] [-d=<dialect>] (-j=<jdbcUrl> | (-h=<host> -p=<port> -s=<sid> -U=<user>
-P=<password>)) <sql>
SQL command line tool. It executes the given SQL and show the result on the standard output.
General settings:
<sql> SQL to be executed. Example: 'select 1 from dual'
-?, --help Display this help and exit.
-d, --dialect=<dialect> Supported SQL dialects: oracle.
Default: oracle
Custom configuration:
-h, --host=<host> Name of the database server.
-p, --port=<port> Number of the port where the server listens for requests.
-s, --sid=<sid> Name of the particular database on the server. Also known as the SID in
Oracle terminology.
-U, --user=<user> Name for the login.
-P, --password=<password> Password for the connecting user.
Provide a JDBC URL:
-j, --jdbcUrl=<jdbcUrl> JDBC URL, example: jdbc:oracle:<drivertype>:@//<host>:<port>/<database>.
Please report issues at arnold.somogyi@gmail.com.
Documentation, source code: https://github.com/zappee/sql-runner.git
This looks much better than the Apache Commons CLI generated help.
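For completeness, here is a small sketch of how the two exclusive use cases can be exercised programmatically. It assumes the SqlRunner class above is on the classpath and a picocli 4.x version (where execute() and @ArgGroup exist, as used above); the connection values are placeholders:
import picocli.CommandLine;

public class SqlRunnerArgsDemo {
    public static void main(String[] args) {
        // Use case 1: JDBC URL only; the host/port/sid/user/password group is not required.
        new CommandLine(new SqlRunner()).parseArgs(
                "-d", "oracle",
                "-j", "jdbc:oracle:thin:@//host:1521/db",
                "select 1 from dual");

        // Use case 2: custom connection parameters instead of a JDBC URL.
        new CommandLine(new SqlRunner()).parseArgs(
                "-d", "oracle",
                "-h", "host", "-p", "1521", "-s", "db",
                "-U", "user", "-P", "pwd",
                "select 1 from dual");

        // Supplying -j together with -h/-p/... would make parseArgs throw a
        // MutuallyExclusiveArgsException, because the two groups are exclusive.
        System.out.println("Both argument styles parsed successfully.");
    }
}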
I have developed a TensorFlow model with Python on Linux, based on the tutorial here: "http://cv-tricks.com/tensorflow-tutorial/training-convolutional-neural-network-for-image-classification/". I trained and saved the model using "tf.train.Saver". I am able to deploy the model in a Linux environment and perform prediction successfully. Now I need to be able to load this saved model in Java on Windows. Through extensive research online I have read that it does not work with "tf.train.Saver" and that I have to change my code to use "Serving" to be able to load a saved TF model in Java. Therefore, I followed the tutorial here:
"https://github.com/tensorflow/serving/blob/master/tensorflow_serving/example/mnist_saved_model.py"
and changed my code. However, I have an error with "tf.FixedLenFeature" where it is asking me to use "FixedLenSequenceFeature". Here is the complete error message:
"ValueError: First dimension of shape for feature x unknown. Consider using FixedLenSequenceFeature."
which is happening here:
feature_configs = {'x': tf.FixedLenFeature(shape=[None, img_size,img_size,num_channels], dtype=tf.float32),}
I am not sure this is the right path to take, since I have batches of images of size [batchsize*128*128*3] and should not be using the sequence feature! It would be great if someone could clear this up for me and answer these questions:
1- Do I have to change my code from "tf.train.Saver" to "Serving" to be able to load the saved model and deploy it in Java?
2- If the answer to the above question is yes, how can I feed the data correctly and solve the aforementioned error?
3- Is there any example of how to DEPLOY the model that was saved using "Serving"?
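For context, on the Java side what I am hoping to end up with is something along the lines of the sketch below, based on the TensorFlow Java API. The export path and the tensor names "x" and "y_pred" are taken from my code further down; I have not verified that this works yet:
import org.tensorflow.SavedModelBundle;
import org.tensorflow.Tensor;

public class LoadSavedModel {
    public static void main(String[] args) {
        // Load the SavedModel exported by the Python code below (tag "serve").
        try (SavedModelBundle bundle = SavedModelBundle.load("./SavedModel/1", "serve")) {
            float[][][][] image = new float[1][128][128][3]; // one dummy 128x128x3 image
            try (Tensor<?> input = Tensor.create(image);
                 Tensor<?> scores = bundle.session().runner()
                         .feed("x", input)
                         .fetch("y_pred")
                         .run()
                         .get(0)) {
                float[][] probabilities = new float[1][3];
                scores.copyTo(probabilities);
                System.out.println(java.util.Arrays.toString(probabilities[0]));
            }
        }
    }
}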
Here is my training code that throws the error:
import dataset
import tensorflow as tf
import time
from datetime import timedelta
import math
import random
import numpy as np
import os
#Adding Seed so that random initialization is consistent
from numpy.random import seed
seed(1)
from tensorflow import set_random_seed
set_random_seed(2)
batch_size = 32
#Prepare input data
classes = ['class1','class2','class3']
num_classes = len(classes)
# 20% of the data will automatically be used for validation
validation_size = 0.2
img_size = 128
num_channels = 3
train_path='/home/user1/Downloads/Expression/Augmented/Data/Train'
# We shall load all the training and validation images and labels into memory using openCV and use that during training
data = dataset.read_train_sets(train_path, img_size, classes, validation_size=validation_size)
print("Complete reading input data. Will Now print a snippet of it")
print("Number of files in Training-set:\t\t{}".format(len(data.train.labels)))
print("Number of files in Validation-set:\t{}".format(len(data.valid.labels)))
session = tf.Session()
serialized_tf_example = tf.placeholder(tf.string, name='tf_example')
feature_configs = {'x': tf.FixedLenFeature(shape=[None, img_size,img_size,num_channels], dtype=tf.float32),}
tf_example = tf.parse_example(serialized_tf_example, feature_configs)
x = tf.identity(tf_example['x'], name='x') # use tf.identity() to assign name
# x = tf.placeholder(tf.float32, shape=[None, img_size,img_size,num_channels], name='x')
## labels
y_true = tf.placeholder(tf.float32, shape=[None, num_classes], name='y_true')
y_true_cls = tf.argmax(y_true, dimension=1)
##Network graph params
filter_size_conv1 = 3
num_filters_conv1 = 32
filter_size_conv2 = 3
num_filters_conv2 = 32
filter_size_conv3 = 3
num_filters_conv3 = 64
fc_layer_size = 128
def create_weights(shape):
    return tf.Variable(tf.truncated_normal(shape, stddev=0.05))

def create_biases(size):
    return tf.Variable(tf.constant(0.05, shape=[size]))

def create_convolutional_layer(input,
                               num_input_channels,
                               conv_filter_size,
                               num_filters):
    ## We shall define the weights that will be trained using create_weights function.
    weights = create_weights(shape=[conv_filter_size, conv_filter_size, num_input_channels, num_filters])
    ## We create biases using the create_biases function. These are also trained.
    biases = create_biases(num_filters)
    ## Creating the convolutional layer
    layer = tf.nn.conv2d(input=input,
                         filter=weights,
                         strides=[1, 1, 1, 1],
                         padding='SAME')
    layer += biases
    ## We shall be using max-pooling.
    layer = tf.nn.max_pool(value=layer,
                           ksize=[1, 2, 2, 1],
                           strides=[1, 2, 2, 1],
                           padding='SAME')
    ## Output of pooling is fed to Relu which is the activation function for us.
    layer = tf.nn.relu(layer)
    return layer

def create_flatten_layer(layer):
    #We know that the shape of the layer will be [batch_size img_size img_size num_channels]
    # But let's get it from the previous layer.
    layer_shape = layer.get_shape()
    ## Number of features will be img_height * img_width* num_channels. But we shall calculate it in place of hard-coding it.
    num_features = layer_shape[1:4].num_elements()
    ## Now, we Flatten the layer so we shall have to reshape to num_features
    layer = tf.reshape(layer, [-1, num_features])
    return layer

def create_fc_layer(input,
                    num_inputs,
                    num_outputs,
                    use_relu=True):
    #Let's define trainable weights and biases.
    weights = create_weights(shape=[num_inputs, num_outputs])
    biases = create_biases(num_outputs)
    # Fully connected layer takes input x and produces wx+b.Since, these are matrices, we use matmul function in Tensorflow
    layer = tf.matmul(input, weights) + biases
    if use_relu:
        layer = tf.nn.relu(layer)
    return layer
layer_conv1 = create_convolutional_layer(input=x,
num_input_channels=num_channels,
conv_filter_size=filter_size_conv1,
num_filters=num_filters_conv1)
layer_conv2 = create_convolutional_layer(input=layer_conv1,
num_input_channels=num_filters_conv1,
conv_filter_size=filter_size_conv2,
num_filters=num_filters_conv2)
layer_conv3= create_convolutional_layer(input=layer_conv2,
num_input_channels=num_filters_conv2,
conv_filter_size=filter_size_conv3,
num_filters=num_filters_conv3)
layer_flat = create_flatten_layer(layer_conv3)
layer_fc1 = create_fc_layer(input=layer_flat,
num_inputs=layer_flat.get_shape()[1:4].num_elements(),
num_outputs=fc_layer_size,
use_relu=True)
layer_fc2 = create_fc_layer(input=layer_fc1,
num_inputs=fc_layer_size,
num_outputs=num_classes,
use_relu=False)
y_pred = tf.nn.softmax(layer_fc2,name='y_pred')
y_pred_cls = tf.argmax(y_pred, dimension=1)
values, indices = tf.nn.top_k(y_pred, 3)
table = tf.contrib.lookup.index_to_string_table_from_tensor(
tf.constant([str(i) for i in xrange(3)]))
prediction_classes = table.lookup(tf.to_int64(indices))
session.run(tf.global_variables_initializer())
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=layer_fc2,
labels=y_true)
cost = tf.reduce_mean(cross_entropy)
optimizer = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(cost)
correct_prediction = tf.equal(y_pred_cls, y_true_cls)
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
session.run(tf.global_variables_initializer())
def show_progress(epoch, feed_dict_train, feed_dict_validate, val_loss):
    acc = session.run(accuracy, feed_dict=feed_dict_train)
    val_acc = session.run(accuracy, feed_dict=feed_dict_validate)
    msg = "Training Epoch {0} --- Training Accuracy: {1:>6.1%}, Validation Accuracy: {2:>6.1%}, Validation Loss: {3:.3f}"
    print(msg.format(epoch + 1, acc, val_acc, val_loss))

total_iterations = 0
# saver = tf.train.Saver()

def train(num_iteration):
    global total_iterations
    for i in range(total_iterations,
                   total_iterations + num_iteration):
        x_batch, y_true_batch, _, cls_batch = data.train.next_batch(batch_size)
        x_valid_batch, y_valid_batch, _, valid_cls_batch = data.valid.next_batch(batch_size)
        feed_dict_tr = {x: x_batch,
                        y_true: y_true_batch}
        feed_dict_val = {x: x_valid_batch,
                         y_true: y_valid_batch}
        session.run(optimizer, feed_dict=feed_dict_tr)
        if i % int(data.train.num_examples/batch_size) == 0:
            print(i)
            val_loss = session.run(cost, feed_dict=feed_dict_val)
            epoch = int(i / int(data.train.num_examples/batch_size))
            show_progress(epoch, feed_dict_tr, feed_dict_val, val_loss)
            print("Saving the model Now!")
            # saver.save(session, save_path_full, global_step=i)
    total_iterations += num_iteration
train(num_iteration=10000)#3000
# Export model
# WARNING(break-tutorial-inline-code): The following code snippet is
# in-lined in tutorials, please update tutorial documents accordingly
# whenever code changes.
export_path_base = './SavedModel/'
export_path = os.path.join(
tf.compat.as_bytes(export_path_base),
tf.compat.as_bytes(str(1)))
print 'Exporting trained model to', export_path
builder = tf.saved_model.builder.SavedModelBuilder(export_path)
# Build the signature_def_map.
classification_inputs = tf.saved_model.utils.build_tensor_info(
serialized_tf_example)
classification_outputs_classes = tf.saved_model.utils.build_tensor_info(
prediction_classes)
classification_outputs_scores = tf.saved_model.utils.build_tensor_info(values)
classification_signature = (
tf.saved_model.signature_def_utils.build_signature_def(
inputs={
tf.saved_model.signature_constants.CLASSIFY_INPUTS:
classification_inputs
},
outputs={
tf.saved_model.signature_constants.CLASSIFY_OUTPUT_CLASSES:
classification_outputs_classes,
tf.saved_model.signature_constants.CLASSIFY_OUTPUT_SCORES:
classification_outputs_scores
},
method_name=tf.saved_model.signature_constants.CLASSIFY_METHOD_NAME))
tensor_info_x = tf.saved_model.utils.build_tensor_info(x)
tensor_info_y = tf.saved_model.utils.build_tensor_info(y_pred)
prediction_signature = (
tf.saved_model.signature_def_utils.build_signature_def(
inputs={'images': tensor_info_x},
outputs={'scores': tensor_info_y},
method_name=tf.saved_model.signature_constants.PREDICT_METHOD_NAME))
legacy_init_op = tf.group(tf.tables_initializer(), name='legacy_init_op')
builder.add_meta_graph_and_variables(
session, [tf.saved_model.tag_constants.SERVING],
signature_def_map={
'predict_images':
prediction_signature,
tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
classification_signature,
},
legacy_init_op=legacy_init_op)
builder.save()
print 'Done exporting!'
We are using Apache Spark with the mongo-spark library (for connecting to MongoDB) and the spark-redshift library (for connecting to the Amazon Redshift DWH).
We are experiencing very bad performance for our job.
So I am hoping to get some help understanding whether we are doing anything wrong in our program, or whether this is what we can expect with the infrastructure we have used.
We are running our job with the Mesos resource manager on 4 AWS EC2 nodes, each node with the following configuration:
RAM: 16GB, CPU cores: 4, SSD: 200GB
We have 3 tables in Redshift cluster:
TABLE_NAME SCHEMA NUMBER_OF_ROWS
table1 (table1Id, table2FkId, table3FkId, ...) 50M
table2 (table2Id, phonenumber, email,...) 700M
table3 (table3Id, ...) 2K
And in MongoDB we have a collection with 35 million documents, with a sample document as below (all emails and phone numbers are unique here, no duplication):
{
    "_id": "19ac0487-a75f-49d9-928e-c300e0ac7c7c",
    "idKeys": {
        "email": [
            "a@gmail.com",
            "b@gmail.com"
        ],
        "phonenumber": [
            "1111111111",
            "2222222222"
        ]
    },
    "flag": false,
    ...
    ...
    ...
}
We are filtering and flattening this (see the mongo-spark aggregation pipeline in the code at the end) with the mongo-spark connector into the following format, as we need to JOIN the data from Redshift and Mongo ON an email OR phonenumber match (the other available option is array_contains() in Spark SQL, which is a bit slow):
{"_id": "19ac0487-a75f-49d9-928e-c300e0ac7c7c", "email": "a#gmail.com", "phonenumber": null},
{"_id": "19ac0487-a75f-49d9-928e-c300e0ac7c7c","email": "b#gmail.com","phonenumber": null},
{"_id": "19ac0487-a75f-49d9-928e-c300e0ac7c7c","email": null,"phonenumber": "1111111111"},
{"_id": "19ac0487-a75f-49d9-928e-c300e0ac7c7c","email": null,"phonenumber": "22222222222"}
Spark computation steps (please refer the code below to understand these steps better):
First, we are loading all the data from the 3 Redshift tables into table1Dataset, table2Dataset and table3Dataset respectively, using the spark-redshift connector.
Joining these 3 tables with Spark SQL and creating a new Dataset, redshiftJoinedDataset (this operation on its own finishes in 6 hours).
Loading the MongoDB data into mongoDataset using the mongo-spark connector.
Joining mongoDataset and redshiftJoinedDataset (here is the bottleneck, as we need to join over 50 million rows from Redshift with over 100 million flattened rows from MongoDB).
Note: mongo-spark also seems to have some internal issue with its aggregation pipeline execution, which might be making it very slow.
Then we are doing some aggregation and grouping the data on finalId.
Here is the code for the steps mentioned above:
import com.mongodb.spark.MongoSpark;
import com.mongodb.spark.rdd.api.java.JavaMongoRDD;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;
import org.bson.Document;
import java.util.Arrays;

public class SparkMongoRedshiftTest {

    private static SparkSession sparkSession;
    private static SparkContext sparkContext;
    private static SQLContext sqlContext;

    public static void main(String[] args) {
        sparkSession = SparkSession.builder().appName("redshift-spark-test").getOrCreate();
        sparkContext = sparkSession.sparkContext();
        sqlContext = new SQLContext(sparkContext);

        Dataset table1Dataset = executeRedshiftQuery("(SELECT table1Id,table2FkId,table3FkId FROM table1)");
        table1Dataset.createOrReplaceTempView("table1Dataset");

        Dataset table2Dataset = executeRedshiftQuery("(SELECT table2Id,phonenumber,email FROM table2)");
        table2Dataset.createOrReplaceTempView("table2Dataset");

        Dataset table3Dataset = executeRedshiftQuery("(SELECT table3Id FROM table3)");
        table3Dataset.createOrReplaceTempView("table3Dataset");

        Dataset redshiftJoinedDataset = sqlContext.sql(" SELECT a.*,b.*,c.*" +
                " FROM table1Dataset a " +
                " LEFT JOIN table2Dataset b ON a.table2FkId = b.table2Id" +
                " LEFT JOIN table3Dataset c ON a.table3FkId = c.table3Id");
        redshiftJoinedDataset.createOrReplaceTempView("redshiftJoinedDataset");

        JavaMongoRDD<Document> userIdentityRDD = MongoSpark.load(getJavaSparkContext());
        Dataset mongoDataset = userIdentityRDD.withPipeline(
                Arrays.asList(
                        Document.parse("{$match: {flag: false}}"),
                        Document.parse("{ $unwind: { path: \"$idKeys.email\" } }"),
                        Document.parse("{$group: {_id: \"$_id\",emailArr: {$push: {email: \"$idKeys.email\",phonenumber: {$ifNull: [\"$description\", null]}}},\"idKeys\": {$first: \"$idKeys\"}}}"),
                        Document.parse("{$unwind: \"$idKeys.phonenumber\"}"),
                        Document.parse("{$group: {_id: \"$_id\",phoneArr: {$push: {phonenumber: \"$idKeys.phonenumber\",email: {$ifNull: [\"$description\", null]}}},\"emailArr\": {$first: \"$emailArr\"}}}"),
                        Document.parse("{$project: {_id: 1,value: {$setUnion: [\"$emailArr\", \"$phoneArr\"]}}}"),
                        Document.parse("{$unwind: \"$value\"}"),
                        Document.parse("{$project: {email: \"$value.email\",phonenumber: \"$value.phonenumber\"}}")
                )).toDF();
        mongoDataset.createOrReplaceTempView("mongoDataset");

        Dataset joinRedshiftAndMongoDataset = sqlContext.sql(" SELECT a.* , b._id AS finalId " +
                " FROM redshiftJoinedDataset AS a INNER JOIN mongoDataset AS b " +
                " ON b.email = a.email OR b.phonenumber = a.phonenumber");

        //aggregating joinRedshiftAndMongoDataset
        //then storing to mysql
    }

    private static Dataset executeRedshiftQuery(String query) {
        return sqlContext.read()
                .format("com.databricks.spark.redshift")
                .option("url", "jdbc://...")
                .option("query", query)
                .option("aws_iam_role", "...")
                .option("tempdir", "s3a://...")
                .load();
    }

    public static JavaSparkContext getJavaSparkContext() {
        sparkContext.conf().set("spark.mongodb.input.uri", "");
        sparkContext.conf().set("spark.sql.crossJoin.enabled", "true");
        return new JavaSparkContext(sparkContext);
    }
}
The estimated time to finish this job on the above-mentioned infrastructure is over 2 months.
So to summarize the joins quantitatively:
RedshiftDataWithMongoDataJoin => (RedshiftDataJoin) INNER_JOIN (MongoData)
=> (50M LEFT_JOIN 700M LEFT_JOIN 2K) INNER_JOIN (~100M)
=> (50M) INNER_JOIN (~100M)
Any help with this will be appreciated.
So after a lot of investigation we came to know that 90% of the data in table2 had either email or phonenumber null, and I had missed handling joins on null values in the query.
That was the main cause of this performance bottleneck.
After fixing this problem, the job now runs within 2 hours.
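For reference, the fix boiled down to keeping null keys out of the join. A minimal sketch of what that can look like, reusing the sqlContext and the temp views registered in the code above (the exact predicate is an illustration, not the verbatim production query):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

public class NullSafeJoinSketch {
    // Drop rows whose join keys are both null before the expensive join,
    // and guard each OR branch so that null keys can never match.
    static Dataset<Row> joinWithoutNullKeys(SQLContext sqlContext) {
        sqlContext.sql(
                "SELECT * FROM redshiftJoinedDataset " +
                "WHERE email IS NOT NULL OR phonenumber IS NOT NULL")
                .createOrReplaceTempView("redshiftJoinedFiltered");

        return sqlContext.sql(
                "SELECT a.*, b._id AS finalId " +
                "FROM redshiftJoinedFiltered a INNER JOIN mongoDataset b " +
                "ON (a.email IS NOT NULL AND b.email = a.email) " +
                "OR (a.phonenumber IS NOT NULL AND b.phonenumber = a.phonenumber)");
    }
}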
So there are no issues with spark-redshift or mongo-spark; they are performing exceptionally well :)
I have gone through the documentation, but it is still very confusing to me how to get data from Swift.
I configured Swift on one Linux machine. Using the command below I am able to get the container list:
swift -A https://acc.objectstorage.softlayer.net/auth/v1.0/ -U username -K passwordkey list
I have seen many blogs for Bluemix (https://console.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index-gentopic1.html#genTopProcId2) and have written the code below:
sc.textFile("swift://container.myacct/file.xml")
I am looking to integrate this in Java Spark. Where do I need to configure the object storage credentials in Java code? Is there any sample code or blog?
This notebook illustrates a number of ways to load data using the Scala language. Scala runs on the JVM. Java and Scala classes can be freely mixed, no matter whether they reside in different projects or in the same one. Looking at the mechanics of how Scala code interacts with OpenStack Swift object storage should help guide you to craft a Java equivalent.
From the above notebook, here are some steps illustrating how to configure and extract data from an OpenStack Swift Object Storage instance with the Stocator library, using the Scala language. The Swift URL decomposes into:
swift2d://container.myacct/filename.extension
where swift2d is the Stocator protocol, container is the name of the container, myacct is the namespace, and filename.extension is the object storage filename.
Imports
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import scala.util.control.NonFatal
import play.api.libs.json.Json

val sqlctx = new SQLContext(sc)
val scplain = sqlctx.sparkContext
Sample Creds
// @hidden_cell
var credentials = scala.collection.mutable.HashMap[String, String](
"auth_url"->"https://identity.open.softlayer.com",
"project"->"object_storage_3xxxxxx3_xxxx_xxxx_xxxx_xxxxxxxxxxxx",
"project_id"->"6xxxxxxxxxx04fxxxxxxxxxx6xxxxxx7",
"region"->"dallas",
"user_id"->"cxxxxxxxxxxaxxxxxxxxxx1xxxxxxxxx",
"domain_id"->"cxxxxxxxxxxaxxyyyyyyxx1xxxxxxxxx",
"domain_name"->"853255",
"username"->"Admin_cxxxxxxxxxxaxxxxxxxxxx1xxxxxxxxx",
"password"->"""&M7372!FAKE""",
"container"->"notebooks",
"tenantId"->"undefined",
"filename"->"file.xml"
)
Helper Method
def setRemoteObjectStorageConfig(name: String, sc: SparkContext, dsConfiguration: String): Boolean = {
    try {
        val result = scala.util.parsing.json.JSON.parseFull(dsConfiguration)
        result match {
            case Some(e: Map[String, String]) => {
                val prefix = "fs.swift2d.service." + name
                val hconf = sc.hadoopConfiguration
                hconf.set("fs.swift2d.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem")
                hconf.set(prefix + ".auth.url", e("auth_url") + "/v3/auth/tokens")
                hconf.set(prefix + ".tenant", e("project_id"))
                hconf.set(prefix + ".username", e("user_id"))
                hconf.set(prefix + ".password", e("password"))
                hconf.set(prefix + ".auth.method", "keystoneV3")
                hconf.set(prefix + ".region", e("region"))
                hconf.setBoolean(prefix + ".public", true)
                println("Successfully modified sparkcontext object with remote Object Storage Credentials using datasource name " + name)
                println("")
                return true
            }
            case None => println("Failed.")
                return false
        }
    }
    catch {
        case NonFatal(exc) => println(exc)
            return false
    }
}
Load the Data
val setObjStor = setRemoteObjectStorageConfig("sparksql", scplain, Json.toJson(credentials.toMap).toString)
val data_rdd = scplain.textFile("swift2d://notebooks.sparksql/" + credentials("filename"))
data_rdd.take(5)
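Since the original question asked for Java, here is a rough Java translation of the Scala configuration above (a sketch, not tested against Bluemix; the datasource name "sparksql", the placeholder credentials and the file name are assumptions to replace with your own values):
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SwiftReadSketch {
    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext("local[*]", "swift2d-read");

        // Same properties the Scala helper sets, for a datasource named "sparksql".
        Configuration hconf = jsc.hadoopConfiguration();
        String prefix = "fs.swift2d.service.sparksql";
        hconf.set("fs.swift2d.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem");
        hconf.set(prefix + ".auth.url", "https://identity.open.softlayer.com/v3/auth/tokens");
        hconf.set(prefix + ".tenant", "<project_id>");
        hconf.set(prefix + ".username", "<user_id>");
        hconf.set(prefix + ".password", "<password>");
        hconf.set(prefix + ".auth.method", "keystoneV3");
        hconf.set(prefix + ".region", "dallas");
        hconf.setBoolean(prefix + ".public", true);

        // swift2d://<container>.<datasource name>/<object name>
        JavaRDD<String> data = jsc.textFile("swift2d://notebooks.sparksql/file.xml");
        data.take(5).forEach(System.out::println);
    }
}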
I have a project in Java. This project has a class com.xyz.api.base.models.mongo.Member.
I want to import this Java project into a Scala project in order to use the Member class.
However, I get this error (the library is already downloaded into the Scala project's dependencies):
java.lang.RuntimeException: java.lang.ClassNotFoundException: models.mongo.Member
The strange thing is that there is no compilation error. The error above only happens at runtime. Furthermore, the error message does not mention com.xyz.api.base as the base package of models.mongo.Member.
My code:
import com.redmart.api.base.models.mongo.Member
import com.redmart.api.base.utils.RedisCacheImpl
import redis.RedisClient
object Redis extends App {
    implicit val akkaSystem = akka.actor.ActorSystem()

    val host: String = "127.0.0.1"
    val port: Int = 6379
    val db: Int = 0
    val timeout: Long = 10000L
    val key = "a2IxSE5kdW9HRHZUe"

    var redisCacheImpl: RedisCacheImpl = _

    try {
        RedisCacheImpl.configRedis(host, port, db, timeout)
        redisCacheImpl = RedisCacheImpl.getInstance()
        val obj = redisCacheImpl.get(key)
        val member = obj.asInstanceOf[Member]
        println(s"member id ${member.getMemberId}")
    }
Thank you for your help.
In this case, Spring Boot version 1.2.3.RELEASE uses mongo-java-driver 2.12.5. For more details, go through this documentation: Link