How to compute summary statistic on Cassandra table with Spark DataFrame? - java

I'm trying to get the min, max, and mean of some Cassandra/Spark data, but I need to do it with Java.
import org.apache.spark.sql.DataFrame;
import static org.apache.spark.sql.functions.*;
DataFrame df = sqlContext.read()
        .format("org.apache.spark.sql.cassandra")
        .option("table", "someTable")
        .option("keyspace", "someKeyspace")
        .load();

df.groupBy(col("keyColumn"))
        .agg(min("valueColumn"), max("valueColumn"), avg("valueColumn"))
        .show();
EDITED to show the working version above: make sure to put quotation marks around someTable and someKeyspace.

Just import your data as a DataFrame and apply the required aggregations:
import org.apache.spark.sql.DataFrame;
import static org.apache.spark.sql.functions.*;
DataFrame df = sqlContext.read()
        .format("org.apache.spark.sql.cassandra")
        .option("table", someTable)
        .option("keyspace", someKeyspace)
        .load();

df.groupBy(col("keyColumn"))
        .agg(min("valueColumn"), max("valueColumn"), avg("valueColumn"))
        .show();
where someTable and someKeyspace hold the table name and keyspace name, respectively.
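For completeness, here is a minimal sketch of the surrounding setup (Spark 1.x-era Java API, since the snippet uses SQLContext and DataFrame); the app name, connection host, and variable values are assumptions for illustration:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SQLContext;

// Point the Cassandra connector at your node(s); the host value here is an assumption.
SparkConf conf = new SparkConf()
        .setAppName("cassandra-summary-stats")
        .set("spark.cassandra.connection.host", "127.0.0.1");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);

// Placeholders for your keyspace and table names.
String someKeyspace = "someKeyspace";
String someTable = "someTable";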

I suggest checking out https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector-demos, which contains demos in both Scala and the equivalent Java.
You can also check out http://spark.apache.org/documentation.html, which has tons of examples that you can flip between Scala, Java, and Python versions of.
I'm almost 100% certain that between those two links, you'll find exactly what you're looking for.
If there's anything you're having trouble with after that, feel free to update your question with a more specific error/problem.

In general, you can compile the Scala file:
$ scalac Main.scala
and then inspect the Java-level view of the generated Main.class:
$ javap Main
More info is available at the following URL:
http://alvinalexander.com/scala/scala-class-to-decompiled-java-source-code-classes
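For example, assuming Main.scala defines something like object Main { def main(args: Array[String]): Unit = println("hi") }, javap Main prints roughly the Java-level signatures of the compiled class:
Compiled from "Main.scala"
public class Main {
  public static void main(java.lang.String[]);
}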

Related

How to load a custom transformer in Spark 2.4

I'm trying to create a custom transformer in Spark 2.4.0. Saving it works fine. However, when I try to load it, I get the following error:
java.lang.NoSuchMethodException: TestTransformer.<init>(java.lang.String)
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.getConstructor(Class.java:1825)
at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:496)
at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:380)
at TestTransformer$.load(<console>:40)
... 31 elided
This suggests to me that it can't find my transformer's constructor, which doesn't really make sense to me.
MCVE:
import org.apache.spark.sql.{Dataset, DataFrame}
import org.apache.spark.sql.types.{StructType}
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
class TestTransformer(override val uid: String) extends Transformer with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("TestTransformer"))

  override def transform(df: Dataset[_]): DataFrame = {
    val columns = df.columns
    df.select(columns.head, columns.tail: _*)
  }

  override def transformSchema(schema: StructType): StructType = {
    schema
  }

  override def copy(extra: ParamMap): TestTransformer = defaultCopy[TestTransformer](extra)
}

object TestTransformer extends DefaultParamsReadable[TestTransformer] {
  override def load(path: String): TestTransformer = super.load(path)
}
val transformer = new TestTransformer("test")
transformer.write.overwrite().save("test_transformer")
TestTransformer.load("test_transformer")
Running this (I'm using a Jupyter notebook) leads to the above error. I've tried compiling and running it as a .jar file, with no difference.
What puzzles me is that the equivalent PySpark code works fine:
from pyspark.sql import SparkSession, DataFrame
from pyspark.ml import Transformer
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
class TestTransformer(Transformer, DefaultParamsWritable, DefaultParamsReadable):
    def transform(self, df: DataFrame) -> DataFrame:
        return df

TestTransformer().save('test_transformer')
TestTransformer.load('test_transformer')
How can I make a custom Spark transformer that can be saved and loaded?
I can reproduce your problem in spark-shell.
While trying to find the source of the problem, I looked into the DefaultParamsReadable and DefaultParamsReader sources and saw that they use Java reflection.
https://github.com/apache/spark/blob/v2.4.0/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala
lines 495-496
val instance =
cls.getConstructor(classOf[String]).newInstance(metadata.uid).asInstanceOf[Params]
I think Scala REPLs and Java reflection aren't good friends.
If you run this snippet (after yours):
new TestTransformer().getClass.getConstructors
you'll get the following output:
res1: Array[java.lang.reflect.Constructor[_]] = Array(public TestTransformer($iw), public TestTransformer($iw,java.lang.String))
It is true! TestTransformer.<init>(java.lang.String) doesn't exist.
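To make the error concrete, the reflective lookup that DefaultParamsReader performs boils down to something like this Java snippet (the class name and uid refer to the example above and are only illustrative):
// The loader looks up a one-argument (String) constructor by reflection.
Class<?> cls = Class.forName("TestTransformer");
// Compiled normally, this finds TestTransformer(String).
// Inside the REPL the generated constructor is TestTransformer($iw, String),
// so this call throws java.lang.NoSuchMethodException.
Object instance = cls.getConstructor(String.class).newInstance("some-uid");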
I found two workarounds:
1. Compiling your code with sbt and creating a jar, then including it in spark-shell with :require, worked for me. (You mentioned you tried a jar; I don't know how you did it, though.)
2. Pasting the code into spark-shell with :paste -raw also worked fine. I suppose -raw prevents the REPL from doing shenanigans to your classes.
See: https://docs.scala-lang.org/overviews/repl/overview.html
I'm not sure how you can adapt either of these to Jupyter, but I hope this info is useful to you.
Note: I actually used spark-shell in Spark 2.4.1.

run python sklearn classifier from java

I trained an SVC classifier in Python using scikit-learn and other libraries. I did it by building a pipeline (sklearn).
I am able to dump the trained model to a pickle file, and I made another Python script that loads the pickle file and takes input from the command line to do the prediction. I am able to call this Python script from Java and it's working fine.
The only issue is that it takes a lot of time, as I have the nltk, numpy, and pandas libraries imported in the Python script, required for preprocessing the input argument. I am calling this Python script multiple times and that's increasing the time.
How can I work around this issue?
That's how my pipeline looks:
pipeline = Pipeline([
    # Use FeatureUnion to combine the features from dataset
    ('union', FeatureUnion(
        transformer_list=[
            # Pipeline for getting POS
            ('ngrams', Pipeline([
                ('selector', ItemSelector(key='Sentence')),
                ('vect', CountVectorizer(analyzer='word')),
                ('tfidf', TfidfTransformer()),
            ])),
        ],
        # weight components in FeatureUnion
        transformer_weights={
            'ngrams': 0.7,
        },
    )),
    # Use a SVC classifier on the combined features
    ('clf', LinearSVC()),
])
Here's an example of setting up a simple Flask REST API for serving a scikit-learn model.
import sys
import os
import time
import traceback

from flask import Flask, request, jsonify
from sklearn.externals import joblib

app = Flask(__name__)

model_directory = 'model'
model_file_name = '%s/model.pkl' % model_directory

# These will be populated at training time
clf = None


@app.route('/predict', methods=['POST'])
def predict():
    if clf:
        try:
            json_ = request.json
            # query = get the payload from the json and feed it to your model
            prediction = list(clf.predict(query))
            return jsonify({'prediction': prediction})
        except Exception as e:
            return jsonify({'error': str(e), 'trace': traceback.format_exc()})
    else:
        return 'no model here'


if __name__ == '__main__':
    try:
        port = int(sys.argv[1])
    except Exception:
        port = 80

    clf = joblib.load(model_file_name)
    print('model loaded')
    app.run(host='0.0.0.0', port=port, debug=True)
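Since the question is about calling this from Java, below is a minimal sketch of a Java client posting JSON to the /predict route above using java.net.HttpURLConnection. The host, port, and payload shape are assumptions; adapt them to whatever your predict() handler actually reads from request.json.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class PredictClient {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:80/predict");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);

        // Hypothetical payload; match it to the features your pipeline expects.
        String payload = "{\"Sentence\": \"some text to classify\"}";
        try (OutputStream os = conn.getOutputStream()) {
            os.write(payload.getBytes(StandardCharsets.UTF_8));
        }

        // Read back the JSON response, e.g. {"prediction": [...]}
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
Because the Flask process stays up between requests, nltk, numpy, and pandas are imported only once, which addresses the start-up cost you are seeing.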

How to import LibSVM into my Java code

In Java programming, we first add weka.jar to our classpath; then we can call any of WEKA's classification or clustering algorithms in the form of the following code:
import weka.classifiers.trees.RandomForest;
...
RandomForest rf = new RandomForest(); // RandomForest object
But unfortunately, we cannot import the LibSVM algorithm this way, because there is no such class in weka.jar.
So, my question is: how do I import LibSVM into my Java code? Any help would be appreciated :)
Firstly, I'd like to say there are many ways to solve this problem. The solution I describe here is quite simple, but the other answers on Stack Overflow are not described in much detail and cost me a lot of time to verify. So I'm happy to share it with all WEKA beginners :)
a) Download the LibSVM.jar from Maven Repository Center. Note that this LibSVM.jar is different from the libsvm.jar developed by Chih-Chung Chang and Chih-Jen Lin;
b) Add the LibSVM.jar to the classpath of our Java project;
c) Call the classifier LibSVM when you need, see the following Java code.
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LibSVM; // contained in LibSVM.jar
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import java.util.Random;

String path = "file/train.arff";
Instances train = DataSource.read(path);                // load the dataset
train.setClassIndex(train.numAttributes() - 1);         // set the class index to the last attribute
LibSVM svm = new LibSVM();                              // create the SVM classifier
svm.buildClassifier(train);                             // train it
Evaluation eval = new Evaluation(train);
eval.crossValidateModel(svm, train, 10, new Random(1)); // 10-fold cross-validation
See: https://weka.wikispaces.com/LibSVM
Use Weka's package manager to install LibSVM. Supposing "weka.jar" is in your current folder, then run this:
java -cp weka.jar weka.core.WekaPackageManager -install-package LibSVM
During the installation, it shows:
[DefaultPackageManager] Tmp file: /tmp/LibSVM1.0.107382715397815864641.zip
[DefaultPackageManager] Installing: Description.props
[DefaultPackageManager] Installing: LibSVM.jar
[DefaultPackageManager] Installing: build_package.xml
...
You can see that "LibSVM.jar" is installed somewhere. In my case, it is at:
/home/john/wekafiles/packages/LibSVM/LibSVM.jar
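If you go the package-manager route instead of adding the jar to your project by hand, one option is to load the installed packages at runtime and instantiate the classifier by name. This is a minimal sketch assuming a Weka 3.7/3.8-style API and the LibSVM package installed as above:
import weka.classifiers.AbstractClassifier;
import weka.classifiers.Classifier;
import weka.core.WekaPackageManager;

public class LibSVMFromPackage {
    public static void main(String[] args) throws Exception {
        // Make classes from installed Weka packages (such as LibSVM) visible to this JVM.
        WekaPackageManager.loadPackages(false);
        // Instantiate the classifier by its fully qualified name with default options.
        Classifier svm = AbstractClassifier.forName("weka.classifiers.functions.LibSVM", new String[0]);
        System.out.println("Loaded: " + svm.getClass().getName());
    }
}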

Custom recommender jobs using apache mahout 0.11.2 over hadoop

I am a newbie to Apache Mahout. I am using Apache Mahout 0.11.2. To give it a try, I created a Java class called SampleReccommender.java, as shown below.
package f;
import java.io.File;
import java.io.IOException;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.ThresholdUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.UserBasedRecommender;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;
import java.util.List;
public class SampleReccommender {

    public static void main(String args[]) {
        try {
            DataModel datamodel = new FileDataModel(new File(args[0]));

            // Creating the UserSimilarity object.
            UserSimilarity usersimilarity = new PearsonCorrelationSimilarity(datamodel);

            // Creating the UserNeighborhood object.
            UserNeighborhood userneighborhood = new ThresholdUserNeighborhood(1.0, usersimilarity, datamodel);

            // Create the UserBasedRecommender.
            UserBasedRecommender recommender = new GenericUserBasedRecommender(datamodel, userneighborhood, usersimilarity);

            List recommendations = (List) recommender.recommend(2, 3);
            System.out.println(recommendations.size());
            for (int i = 0; i < recommendations.size(); i++) {
                System.out.println(recommendations.get(i));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
I managed to run the same code from the command line as:
java -cp n.jar f.SampleReccommender n_lib/wishlistdata.txt
Now, from what I read on the internet and in the book "Mahout in Action", I understood that the same code can be run on Hadoop with the following commands.
First I need to include my SampleReccommender class in the existing apache-mahout-distribution-0.11.2/mahout-mr-0.11.2-job.jar, so I followed this procedure:
jar uf /Users/rohitjain/Documents/apache-mahout-distribution-0.11.2/mahout-mr-0.11.2-job.jar samplerecommender.jar
Then I tried running the Mahout job with the following command:
bin/hadoop jar /Users/rohitjain/Documents/apache-mahout-distribution-0.11.2/mahout-mr-0.11.2-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -i /input/wishlistdata.txt -o /output/ --recommenderClassName \ f.SampleRecommender
But it gives me an error:
Unexpected --recommenderClassName while processing Job-Specific Options:
I tried the above command based on the syntax given in the "Mahout in Action" book, which is as follows:
hadoop jar mahout-core-0.5-job.jar \ org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob \ -Dmapred.input.dir=input/ua.base.hadoop \ -Dmapred.output.dir=output \ --recommenderClassName \ org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender
Am I doing anything wrong? Also, can the same code I used for the standalone implementation be used for recommender jobs, or does it require an altogether different implementation?
Mahout in Action is out of date and the code you are using is being deprecated.
These days Mahout runs on more modern compute platforms like Spark. For the latest Mahout recommender you can start with the command-line interface to spark-itemsimilarity and integrate it with Solr or Elasticsearch, or you can pick up a fully integrated end-to-end solution linked below:
Building a recommender with Mahout: http://mahout.apache.org/users/algorithms/recommender-overview.html
Mahout spark-itemsimilarity: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
Universal Recommender from ActionML: https://github.com/actionml/template-scala-parallel-universal-recommendation
The UR is built on PredictionIO ML Framework here: https://github.com/actionml/PredictionIO

Import a class in Scripting java (javax.script)

I want to import a class that I made in my project into my script.
I did this, but it doesn't work:
function doFunction() {
    //Objectif Mensuel
    importPackage(java.lang);
    importClass(KPDataModel.KPData.KPItem); //ERROR HERE, this is my class that I want to import

    KPItem kpItem = kpItemList.get(0);
    System.out.println(kpItem.CellList.get(2).Value);
    System.out.println("-------");

    var proposedMediationSum = Integer.parseInt(kpItemList.get(0).CellList.get(2).Value);
    var refusedMediationSum = Integer.parseInt(kpItemList.get(0).CellList.get(3).Value)
    var totalMediation = proposedMediationSum + refusedMediationSum;
    kpItemList.get(0).CellList.get(4).Value = totalMediation;
}
Well, thanks a lot, I found that the problem comes from the import.
This is what it says on the Oracle website:
The Packages global variable can be used to access Java packages. Examples: Packages.java.util.Vector, Packages.javax.swing.JFrame. Please note that "java" is a shortcut for "Packages.java". There are equivalent shortcuts for the javax, org, edu, com, and net prefixes, so practically all JDK platform classes can be accessed without the "Packages" prefix.
So, to import my class I used: importClass(Packages.KPDataModel.KPData.KPItem);
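For completeness, here is a minimal Java sketch of evaluating such a script through javax.script (assuming Java 8's Nashorn engine). KPDataModel.KPData.KPItem is specific to the asker's project, so the script imports java.util.ArrayList instead, just to show the importClass/Packages mechanics:
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class ScriptImportDemo {
    public static void main(String[] args) throws Exception {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("JavaScript");

        // On Nashorn, importClass/importPackage come from the Mozilla compatibility script.
        String script =
            "load('nashorn:mozilla_compat.js');\n"
            + "importClass(Packages.java.util.ArrayList);\n" // same pattern as importClass(Packages.KPDataModel.KPData.KPItem)
            + "var list = new ArrayList();\n"
            + "list.add('hello');\n"
            + "print(list.get(0));";

        engine.eval(script);
    }
}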
