Why is my Spark driver program so slow?

Why is my Spark driver program so slow? - java

My problem: I have a model engine that takes a list of parameter configuration, and evaluates a double value that corresponds to the metric associated to that configuration. I have six parameters, and each of them can vary according to a list. I want to find by brute force the best parameter configuration considering the combination that will produce the higher value for the output metric. Since I'm learning Spark, I realized that with the cartesian product operation I can easily generate the combinations, and split the RDD to be processed in parallel. So, I came up with this driver program:
public static void main(String[] args) {
String scriptName = "model.mry";
String scriptStr = null;
try {
scriptStr = new String(Files.readAllBytes(Paths.get(scriptName)));
} catch (IOException ex) {
Logger.getLogger(BruteForceDriver.class.getName()).log(Level.SEVERE, null, ex);
System.exit(1);
}
final String script = scriptStr;
SparkConf conf = new SparkConf()
.setAppName("wordCount")
.setSparkHome("/home/danilo/bin/spark-2.2.0-bin-hadoop2.7")
.setJars(new String[]{"/home/danilo/NetBeansProjects/SparkHello1/target/SparkHello1-1.0.jar",
"/home/danilo/.m2/repository/org/modcs/mercury/4.7/mercury-4.7.jar"})
.setMaster("spark://danilo-desktop:7077");
String baseDir = "/home/danilo/NetBeansProjects/SimulationOptimization/workspace/";
JavaSparkContext sc = new JavaSparkContext(conf);
final int NUM_SERVICES = 6;
final int QTD = 3;
JavaRDD<Service>[] providers = new JavaRDD[NUM_SERVICES];
for (int i = 1; i <= NUM_SERVICES; i++) {
providers[i - 1] = sc.textFile(baseDir + "provider"
+ i
+ ".mat")
.filter((t1) -> !t1.contains("#") && !t1.trim().isEmpty())
.map(Service.createParser("" + i))
.zipWithIndex().filter((t1) -> {
return t1._2 < QTD;
}).keys();
}
JavaPairRDD c = null;
JavaRDD<Service> p = providers[0];
for (int i = 1; i < NUM_SERVICES; i++) {
if (c == null) {
c = p.cartesian(providers[i]);
} else {
c = c.cartesian(providers[i]);
}
}
JavaRDD<List<Service>> cartesian = c.map(new FlattenTuple<>());
final Broadcast<ModelEvaluator> model = sc.broadcast(new ModelEvaluator(script));
JavaPairRDD<Double, List<Service>> results = cartesian.mapToPair(
(t) -> {
try {
double val = model.value().evaluateModel(t);
System.out.println(val);
return new Tuple2<>(val, t);
} catch (Exception ex) {
return null;
}
}
);
results.sortByKey().collect().forEach((t) -> {
System.out.println(t._1 + ", " + t._2);
});
sc.close();
}
The "QTD" variable allows me to control the size the interval which each parameter will vary. For QTD = 3, I'll have 3^6 = 729 combinations. The problem is that it is taking so long to compute all those combinations. I wrote a implementations using only normal Java threads, and the runtime is about 40 seconds. Using my Spark driver program, the runtime more than 6 minutes. Why is my Spark program so slow compared to the plain Java multi-thread program?
Edit:
I put:
results = results.cache();
before sorting the results and now the runtime is 2.5 minutes.
Edit 2:
I created a RDD with the cartesian product of the parameters by hand instead of using the operation provided by the framework. Now my runtime is 1'25''. It does make sense now, since there is some overhead to start the driver and move the jars to the workers.

Related

How to use parallel processing in most efficient and elegant way in java

I have different sources of data from which I want to request in parallel (since each of this request is an http call and may be pretty time consuming). But I'm going to use only 1 response from these requests. So I kind of prioritize them. If the first response is invalid I'm going to check the second one. If it's also invalid I want to use the third, etc.
But I want to stop processing and return the result as soon as I receive the first correct response.
To simulate the problem I created the following code, where I'm trying to use java parallel streaming. But the problem is that I receive final results only after processing all requests.
public class ParallelExecution {
private static Supplier<Optional<Integer>> testMethod(String strInt) {
return () -> {
Optional<Integer> result = Optional.empty();
try {
result = Optional.of(Integer.valueOf(strInt));
System.out.printf("converted string %s to int %d\n",
strInt,
result.orElse(null));
} catch (NumberFormatException ex) {
System.out.printf("CANNOT CONVERT %s to int\n", strInt);
}
try {
int randomValue = result.orElse(10000);
TimeUnit.MILLISECONDS.sleep(randomValue);
System.out.printf("converted string %s to int %d in %d milliseconds\n",
strInt,
result.orElse(null), randomValue);
} catch (InterruptedException e) {
e.printStackTrace();
}
return result;
};
}
public static void main(String[] args) {
Instant start = Instant.now();
System.out.println("Starting program: " + start.toString());
List<Supplier<Optional<Integer>>> listOfFunctions = new ArrayList();
for (String arg: args) {
listOfFunctions.add(testMethod(arg));
}
Integer value = listOfFunctions.parallelStream()
.map(function -> function.get())
.filter(optValue -> optValue.isPresent()).map(val-> {
System.out.println("************** VAL: " + val);
return val;
}).findFirst().orElse(null).get();
Instant end = Instant.now();
Long diff = end.toEpochMilli() - start.toEpochMilli();
System.out.println("final value:" + value + ", worked during " + diff + "ms");
}
}
So when I execute the program using the following command:
$java ParallelExecution dfafj 34 1341 4656 dfad 245df 5767
I want to get the result "34" as soon as possible (around after 34 milliseconds) but in fact, I'm waiting for more than 10 seconds.
Could you help to find the most efficient solution for this problem?

ExecutorService#invokeAny looks like a good option.
List<Callable<Optional<Integer>>> tasks = listOfFunctions
.stream()
.<Callable<Optional<Integer>>>map(f -> f::get)
.collect(Collectors.toList());
ExecutorService service = Executors.newCachedThreadPool();
Optional<Integer> value = service.invokeAny(tasks);
service.shutdown();
I converted your List<Supplier<Optional<Integer>>> into a List<Callable<Optional<Integer>>> to be able to pass it in invokeAny. You may build Callables initially. Then, I created an ExecutorService and submitted the tasks.
The result of the first successfully executed task will be returned as soon as that result is returned from a task. Other tasks will end up interrupted.
You also may want to look into CompletionService.
List<Callable<Optional<Integer>>> tasks = Arrays
.stream(args)
.<Callable<Optional<Integer>>>map(arg -> () -> testMethod(arg).get())
.collect(Collectors.toList());
final ExecutorService underlyingService = Executors.newCachedThreadPool();
final ExecutorCompletionService<Optional<Integer>> service = new ExecutorCompletionService<>(underlyingService);
tasks.forEach(service::submit);
Optional<Integer> value = service.take().get();
underlyingService.shutdownNow();

You can use a queue to put your results in:
private static void testMethod(String strInt, BlockingQueue<Integer> queue) {
// your code, but instead of returning anything:
result.ifPresent(queue::add);
}
and then call it with
for (String s : args) {
CompletableFuture.runAsync(() -> testMethod(s, queue));
}
Integer result = queue.take();
Note that this will only handle the first result, as in your sample.

I have tried it using competableFutures and anyOf method. It will return when any one of the future is completed. Now, key to stop other tasks is to provide your own executor service to the completableFuture(s) and shutting it down when required.
public static void main(String[] args) {
Instant start = Instant.now();
System.out.println("Starting program: " + start.toString());
CompletableFuture<Optional<Integer>> completableFutures[] = new CompletableFuture[args.length];
ExecutorService es = Executors.newFixedThreadPool(args.length,r -> {
Thread t = new Thread(r);
t.setDaemon(false);
return t;
});
for (int i = 0;i < args.length; i++) {
completableFutures[i] = CompletableFuture.supplyAsync(testMethod(args[i]),es);
}
CompletableFuture.anyOf(completableFutures).
thenAccept(res-> {
System.out.println("Result - " + res + ", Time Taken : " + (Instant.now().toEpochMilli()-start.toEpochMilli()));
es.shutdownNow();
});
}
PS :It will throw interrupted exceptions that you can ignore in try catch block and not print the stack trace.Also, your thread pool size ideally should be same as length of args array.

Fixing GC Overhead Limit Exceeded Without Increasing Heap Size

I am working on a Java program that will take data from a Sybase database and, using UCanAccess, import it into a Microsoft Access Database. However, I am currently running into a problem, receiving the error “java.lang.OutOfMemoryError: GC overhead limit exceeded”.
To put the situation into context, I am attempting to import approximately 1.3 million records into the Access Database. The program currently encounters the error after approximately 800,000 of these records have been imported, about ten minutes at run time, and long after the ResultSet has been retrieved from the Sybase Database.
I have attempted to modify the heap size, but that causes the program to slow down significantly. Note that this is an ad hoc program to be run multiple times as needed, so the run time should be in the order of minutes or possibly hours, whereas increasing the heap size, based on my observations, would increase the run time to the order of days.
For reference, the error occurs in the main method, during the subroutine called getRecords (the exact line of code that this occurs on varies on a run-by-run basis). I have included the code to the program below, with some minor changes to parts of the code, such as the exact query I am using and the username and password to the access database, so as not to reveal sensitive information.
Is there anything that I can change in the code of my program to ease the load on the garbage collector without increasing the run time beyond a few hours?
EDIT: It appears that I was mistaken as to the default max heap size of Java. When I thought I was increasing the heap size by setting it to 512m, I was unintentionally cutting the heap size in half. When I set the heap size to 2048m instead, I got a java heap space error. I would still like to solve the problem without modifying the heap size, if possible.
EDIT 2: Apparently, I was misled as to a number of records I needed to process. It is double the size I originally thought it was, which indicates that I need to drastically change my approach. Going to go ahead and accept an answer, because that answer did result in large improvements.
getRecords method:
public static void getRecords(SybaseDatabase sdb, AccessDatabase adb)
{
ArrayList<Record> records = new ArrayList<Record>();
StringBuffer sql = new StringBuffer();
Record currentRecord = null;
try{
Statement sybStat = sdb.connection.createStatement();
PreparedStatement resetADB = adb.connection.prepareStatement("DELETE FROM Table");
PreparedStatement accStat = adb.connection.prepareStatement("INSERT INTO Table (A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)");
sql.append(query);//query is a placeholder, as I cannot give out the actual query to the database. I have confirmed that the query itself gives the ResultSet that I am looking for
ResultSet rs = sybStat.executeQuery(sql.toString());
resetADB.executeUpdate();
boolean nextWatch = true;
Integer i = 1;
Record r = new Record();
while(nextWatch)
{
for (int j = 0; j < 1000 && nextWatch; j++)
{
nextWatch = rs.next();
r.setColumn(i, 0);
r.setColumn(rs.getString("B"), 1);
r.setColumn(rs.getString("C"), 2);
r.setColumn(rs.getString("D"), 3);
r.setColumn(rs.getString("E"), 4);
r.setColumn(rs.getString("F"), 5);
r.setColumn(rs.getString("G"), 6);
r.setColumn(rs.getString("H"), 7);
r.setColumn(rs.getString("I"), 8);
r.setColumn(rs.getString("J"), 9);
r.setColumn(rs.getString("K"), 10);
r.setColumn(rs.getInt("L"), 11);
r.setColumn(rs.getString("M"), 12);
r.setColumn(rs.getString("N"), 13);
r.setColumn(rs.getString("O"), 14);
r.setColumn(rs.getString("P"), 15);
records.add(r);
i++;
}
for(int k = 0; k < records.size(); k++)
{
currentRecord = records.get(k);
for(int m = 0; m < currentRecord.getNumOfColumns(); m++)
{
if (currentRecord.getColumn(m) instanceof String)
{
accStat.setString(m + 1, "\"" + currentRecord.getColumn(m) + "\"");
}
else
{
accStat.setInt(m + 1, Integer.parseInt(currentRecord.getColumn(m).toString()));
}
}
accStat.addBatch();
}
accStat.executeBatch();
accStat.clearBatch();
records.clear();
}
adb.connection.commit();
}
catch(Exception e){
e.printStackTrace();
}
finally{
}
}
}
Full code:
import java.util.*;
import java.sql.*;
import com.sybase.jdbc2.jdbc.SybDriver;//This is an external file that is used to connect to the Sybase database. I will not include the full code here for the sake of space but will provide it upon request.
public class SybaseToAccess {
public static void main(String[] args){
String accessDBPath = "C:/Users/me/Desktop/Database21.accdb";//This is a placeholder, as I cannot give out the exact file path. However, I have confirmed that it points to the correct file on the system.
String sybaseDBPath = "{sybServerName}:{sybServerPort}/{sybDatabase}";//See above comment
try{
AccessDatabase adb = new AccessDatabase(accessDBPath);
SybaseDatabase sdb = new SybaseDatabase(sybaseDBPath, "user", "password");
getRecords(sdb, adb);
}
catch(Exception e){
e.printStackTrace();
}
finally{
}
}
public static void getRecords(SybaseDatabase sdb, AccessDatabase adb)
{
ArrayList<Record> records = new ArrayList<Record>();
StringBuffer sql = new StringBuffer();
Record currentRecord = null;
try{
Statement sybStat = sdb.connection.createStatement();
PreparedStatement resetADB = adb.connection.prepareStatement("DELETE FROM Table");
PreparedStatement accStat = adb.connection.prepareStatement("INSERT INTO Table (A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)");
sql.append(query);//query is a placeholder, as I cannot give out the actual query to the database. I have confirmed that the query itself gives the ResultSet that I am looking for
ResultSet rs = sybStat.executeQuery(sql.toString());
resetADB.executeUpdate();
boolean nextWatch = true;
Integer i = 1;
Record r = new Record();
while(nextWatch)
{
for (int j = 0; j < 1000 && nextWatch; j++)
{
nextWatch = rs.next();
r.setColumn(i, 0);
r.setColumn(rs.getString("B"), 1);
r.setColumn(rs.getString("C"), 2);
r.setColumn(rs.getString("D"), 3);
r.setColumn(rs.getString("E"), 4);
r.setColumn(rs.getString("F"), 5);
r.setColumn(rs.getString("G"), 6);
r.setColumn(rs.getString("H"), 7);
r.setColumn(rs.getString("I"), 8);
r.setColumn(rs.getString("J"), 9);
r.setColumn(rs.getString("K"), 10);
r.setColumn(rs.getInt("L"), 11);
r.setColumn(rs.getString("M"), 12);
r.setColumn(rs.getString("N"), 13);
r.setColumn(rs.getString("O"), 14);
r.setColumn(rs.getString("P"), 15);
records.add(r);
i++;
}
for(int k = 0; k < records.size(); k++)
{
currentRecord = records.get(k);
for(int m = 0; m < currentRecord.getNumOfColumns(); m++)
{
if (currentRecord.getColumn(m) instanceof String)
{
accStat.setString(m + 1, "\"" + currentRecord.getColumn(m) + "\"");
}
else
{
accStat.setInt(m + 1, Integer.parseInt(currentRecord.getColumn(m).toString()));
}
}
accStat.addBatch();
}
accStat.executeBatch();
accStat.clearBatch();
records.clear();
}
adb.connection.commit();
}
catch(Exception e){
e.printStackTrace();
}
finally{
}
}
}
class AccessDatabase{
public Connection connection = null;
public AccessDatabase(String filePath)
throws Exception
{
String dbString = null;
dbString = "jdbc:ucanaccess://" + filePath;
connection = DriverManager.getConnection(dbString);
connection.setAutoCommit(false);
}
}
class Record{
ArrayList<Object> columns;
public
Record(){
columns = new ArrayList<Object>();
columns.add("Placeholder1");
columns.add("Placeholder2");
columns.add("Placeholder3");
columns.add("Placeholder4");
columns.add("Placeholder5");
columns.add("Placeholder6");
columns.add("Placeholder7");
columns.add("Placeholder8");
columns.add("Placeholder9");
columns.add("Placeholder10");
columns.add("Placeholder11");
columns.add("Placeholder12");
columns.add("Placeholder13");
columns.add("Placeholder14");
columns.add("Placeholder15");
columns.add("Placeholder16");
}
<T> void setColumn(T input, int colNum){
columns.set(colNum, input);
}
Object getColumn(int colNum){
return columns.get(colNum);
}
int getNumOfColumns()
{
return columns.size();
}
}
class SybaseDatabase{
public Connection connection;
#SuppressWarnings("deprecation")
public SybaseDatabase(String filePath, String Username, String Password)
throws Exception
{
SybDriver driver;
try
{
driver = (SybDriver)Class.forName("com.sybase.jdbc2.jdbc.SybDriver").newInstance();
driver.setVersion(SybDriver.VERSION_6);
DriverManager.registerDriver(driver);
}
catch (Exception e)
{
e.printStackTrace(System.err);
}
connection = DriverManager.getConnection("jdbc:sybase:Tds:" + filePath, Username, Password);
}
}

If you want to use less memory, you should process less lines in same time but reuse all objects you can reuse (like the PreparedStatement)
First : You use an ArrayList<> in Record with a fixed size. You can just use an array Record[] for that. The principle of ArrayList is to have an array with a dynamic size which you don't need here
Second : don't load all the data from database before handle it, load a few part of data and process it, and continue.
You can do that by extracting the part of your code processing some rows and changing your query by limiting the number of returned rows.
Now, you load 1000 rows (from index 0 to 999), you process and commit them. Then you load 1000 rows (from index 1000 to 1999), you process and commit them. And then you continue. Between each pack of rows, don't keep any reference on precessed data (like on records) to avoid them to be kept in memory (like that they will be garbage-collected when necessary).
If you still have not enought memory, i guess you kept a reference on some objects which are not garbage collected due to that, causing a memory leak problem : your program need more and more memory when processing each data. You can use some tools like the jvisualvm (provided within java) to investigate the use of the memory

How to measure the time Spark needs to run an action on partitioned RDD?

I wrote a small Spark application which should measure the time that Spark needs to run an action on a partitioned RDD (combineByKey function to sum a value).
My problem is, that the first iteration seems to work correctly (calculated duration ~25 ms), but the next ones take much less time (~5 ms). It seems to me, that Spark persists the data without any request to do so!? Can I avoid that programmatically?
I have to know the duration that Spark needs to calculate a new RDD (without any caching / persisting of earlier iterations) --> I think the duration should always be about 20-25 ms!
To ensure the recalculation I moved the SparkContext generation into the for-loops, but this didn't bring any changes...
Thanks for your advices!
Here my code which seems to persist any data:
public static void main(String[] args) {
switchOffLogging();
// jetzt
try {
// Setup: Read out parameters & initialize SparkContext
String path = args[0];
SparkConf conf = new SparkConf(true);
JavaSparkContext sc;
// Create output file & writer
System.out.println("\npar.\tCount\tinput.p\tcons.p\tTime");
// The RDDs used for the benchmark
JavaRDD<String> input = null;
JavaPairRDD<Integer, String> pairRDD = null;
JavaPairRDD<Integer, String> partitionedRDD = null;
JavaPairRDD<Integer, Float> consumptionRDD = null;
// Do the tasks iterative (10 times the same benchmark for testing)
for (int i = 0; i < 10; i++) {
boolean partitioning = true;
int partitionsCount = 8;
sc = new JavaSparkContext(conf);
setS3credentials(sc, path);
input = sc.textFile(path);
pairRDD = mapToPair(input);
partitionedRDD = partition(pairRDD, partitioning, partitionsCount);
// Measure the duration
long duration = System.currentTimeMillis();
// Do the relevant function
consumptionRDD = partitionedRDD.combineByKey(createCombiner, mergeValue, mergeCombiners);
duration = System.currentTimeMillis() - duration;
// So some action to invoke the calculation
System.out.println(consumptionRDD.collect().size());
// Print the results
System.out.println("\n" + partitioning + "\t" + partitionsCount + "\t" + input.partitions().size() + "\t" + consumptionRDD.partitions().size() + "\t" + duration + " ms");
input = null;
pairRDD = null;
partitionedRDD = null;
consumptionRDD = null;
sc.close();
sc.stop();
}
} catch (Exception e) {
e.printStackTrace();
System.out.println(e.getMessage());
}
}
Some helper functions (should not be the problem):
private static void switchOffLogging() {
Logger.getLogger("org").setLevel(Level.OFF);
Logger.getLogger("akka").setLevel(Level.OFF);
}
private static void setS3credentials(JavaSparkContext sc, String path) {
if (path.startsWith("s3n://")) {
Configuration hadoopConf = sc.hadoopConfiguration();
hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
hadoopConf.set("fs.s3n.awsAccessKeyId", "mycredentials");
hadoopConf.set("fs.s3n.awsSecretAccessKey", "mycredentials");
}
}
// Initial element
private static Function<String, Float> createCombiner = new Function<String, Float>() {
public Float call(String dataSet) throws Exception {
String[] data = dataSet.split(",");
float value = Float.valueOf(data[2]);
return value;
}
};
// merging function for a new dataset
private static Function2<Float, String, Float> mergeValue = new Function2<Float, String, Float>() {
public Float call(Float sumYet, String dataSet) throws Exception {
String[] data = dataSet.split(",");
float value = Float.valueOf(data[2]);
sumYet += value;
return sumYet;
}
};
// function to sum the consumption
private static Function2<Float, Float, Float> mergeCombiners = new Function2<Float, Float, Float>() {
public Float call(Float a, Float b) throws Exception {
a += b;
return a;
}
};
private static JavaPairRDD<Integer, String> partition(JavaPairRDD<Integer, String> pairRDD, boolean partitioning, int partitionsCount) {
if (partitioning) {
return pairRDD.partitionBy(new HashPartitioner(partitionsCount));
} else {
return pairRDD;
}
}
private static JavaPairRDD<Integer, String> mapToPair(JavaRDD<String> input) {
return input.mapToPair(new PairFunction<String, Integer, String>() {
public Tuple2<Integer, String> call(String debsDataSet) throws Exception {
String[] data = debsDataSet.split(",");
int houseId = Integer.valueOf(data[6]);
return new Tuple2<Integer, String>(houseId, debsDataSet);
}
});
}
And finally the output of the Spark console:
part. Count input.p cons.p Time
true 8 6 8 20 ms
true 8 6 8 23 ms
true 8 6 8 7 ms // Too less!!!
true 8 6 8 21 ms
true 8 6 8 13 ms
true 8 6 8 6 ms // Too less!!!
true 8 6 8 5 ms // Too less!!!
true 8 6 8 6 ms // Too less!!!
true 8 6 8 4 ms // Too less!!!
true 8 6 8 7 ms // Too less!!!

I found a solution for me now: I wrote a separate class which calls the spark-submit command on a new process. This can be done in a loop, so every benchmark is started in a new thread and sparkContext is also separated per process. So garbage collection is done and everything works fine!
String submitCommand = "/root/spark/bin/spark-submit " + submitParams + " -- class partitioning.PartitionExample /root/partitioning.jar " + javaFlags;
Process p = Runtime.getRuntime().exec(submitCommand);
BufferedReader reader;
String line;
System.out.println(p.waitFor());
reader = new BufferedReader(new InputStreamReader(p.getInputStream()));
while ((line = reader.readLine())!= null) {
System.out.println(line);
}

If the shuffle output is small enough, then the Spark shuffle files will write to the OS buffer cache as fsync is not explicitly called...this means that, as long as there is room, your data will remain in memory.
If a cold performance test is truly necessary then you can try something like this attempt to flush the disk, but that is going to slow down the in-between each test. Could you just spin the context up and down? That might solve your need.

set Exampleset to ExampleSet2SimilarityExampleSet Rapidminer Operator with java

I create a process that measure similarity. I use ExampleSet2SimilarityExampleSetoperator and I want to add in input an example set which generated in my code.
The first idea is to write my example set in local repository. Then I will use retrieve operator to read example set and connect the output of operator retrieve to input of operator ExampleSet2SimilarityExampleSet.
So I wonder if I can avoid read/write.
public class Test {
public static void main(String[] args) throws OperatorCreationException, OperatorException{
Test t = new Test();
t.createprocess();
}
public void createprocess() throws OperatorCreationException, OperatorException{
RapidMiner.init();
ExampleSet exampleset = getExampleset();
Operator silimarityOperator = OperatorService.createOperator(ExampleSet2SimilarityExampleSet.class);
// silimarityOperator.setParameter("measure_type", "NumericalMeasures");
// silimarityOperator.setParameter("numerical_measure", "CosineSimilarity");
Process process = new Process();
process.getRootOperator().getSubprocess(0).addOperator(silimarityOperator);
process.getRootOperator().getSubprocess(0).getInnerSources().getPortByIndex(0).connectTo( silimarityOperator.getInputPorts().getPortByName("input"));
// run the process with new IOContainer using the created exampleSet
IOContainer run = process.run(new IOContainer(exampleset));
System.out.println(run.toString());
}
public ExampleSet getExampleset() {
// construct attribute set
Attribute[] attributes = new Attribute[3];
attributes[0] = AttributeFactory.createAttribute("Topic1", Ontology.STRING);
attributes[1] = AttributeFactory.createAttribute("Topic2", Ontology.STRING);
attributes[2] = AttributeFactory.createAttribute("Topic3", Ontology.STRING);
MemoryExampleTable table = new MemoryExampleTable(attributes);
DataRowFactory ROW_FACTORY = new DataRowFactory(0);
Double[] strings = new Double[3];
double a = 0;
for (int i = 0; i < 3; i++) {
a++;
strings[i] = a;
// make and add row
DataRow row = ROW_FACTORY.create(strings, attributes);
table.addDataRow(row);
}
ExampleSet exampleSet = table.createExampleSet();
return exampleSet;
}
}

I found solution. I have error at connection of operator. It is not exist port input, so I get port 0.
process.getRootOperator().getSubprocess(0).getInnerSources().getPortByIndex(0).connectTo(
silimarityOperator.getInputPorts().getPortByIndex(0));
silimarityOperator.getOutputPorts().getPortByIndex(0).connectTo(
process.getRootOperator().getSubprocess(0).getInnerSinks().getPortByIndex(0));

LensKit: LensKit demo is not reading my data file

When I run the LensKit demo program I get this error:
[main] ERROR org.grouplens.lenskit.data.dao.DelimitedTextRatingCursor - C:\Users\sean\Desktop\ml-100k\u - Copy.data:4: invalid input, skipping line
I reworked the ML 100k data set so that it only holds this line although I dont see how this would effect it:
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244
Here is the code I am using too:
public class HelloLenskit implements Runnable {
public static void main(String[] args) {
HelloLenskit hello = new HelloLenskit(args);
try {
hello.run();
} catch (RuntimeException e) {
System.err.println(e.getMessage());
System.exit(1);
}
}
private String delimiter = "\t";
private File inputFile = new File("C:\\Users\\sean\\Desktop\\ml-100k\\u - Copy.data");
private List<Long> users;
public HelloLenskit(String[] args) {
int nextArg = 0;
boolean done = false;
while (!done && nextArg < args.length) {
String arg = args[nextArg];
if (arg.equals("-e")) {
delimiter = args[nextArg + 1];
nextArg += 2;
} else if (arg.startsWith("-")) {
throw new RuntimeException("unknown option: " + arg);
} else {
inputFile = new File(arg);
nextArg += 1;
done = true;
}
}
users = new ArrayList<Long>(args.length - nextArg);
for (; nextArg < args.length; nextArg++) {
users.add(Long.parseLong(args[nextArg]));
}
}
public void run() {
// We first need to configure the data access.
// We will use a simple delimited file; you can use something else like
// a database (see JDBCRatingDAO).
EventDAO base = new SimpleFileRatingDAO(inputFile, "\t");
// Reading directly from CSV files is slow, so we'll cache it in memory.
// You can use SoftFactory here to allow ratings to be expunged and re-read
// as memory limits demand. If you're using a database, just use it directly.
EventDAO dao = new EventCollectionDAO(Cursors.makeList(base.streamEvents()));
// Second step is to create the LensKit configuration...
LenskitConfiguration config = new LenskitConfiguration();
// ... configure the data source
config.bind(EventDAO.class).to(dao);
// ... and configure the item scorer. The bind and set methods
// are what you use to do that. Here, we want an item-item scorer.
config.bind(ItemScorer.class)
.to(ItemItemScorer.class);
// let's use personalized mean rating as the baseline/fallback predictor.
// 2-step process:
// First, use the user mean rating as the baseline scorer
config.bind(BaselineScorer.class, ItemScorer.class)
.to(UserMeanItemScorer.class);
// Second, use the item mean rating as the base for user means
config.bind(UserMeanBaseline.class, ItemScorer.class)
.to(ItemMeanRatingItemScorer.class);
// and normalize ratings by baseline prior to computing similarities
config.bind(UserVectorNormalizer.class)
.to(BaselineSubtractingUserVectorNormalizer.class);
// There are more parameters, roles, and components that can be set. See the
// JavaDoc for each recommender algorithm for more information.
// Now that we have a factory, build a recommender from the configuration
// and data source. This will compute the similarity matrix and return a recommender
// that uses it.
Recommender rec = null;
try {
rec = LenskitRecommender.build(config);
} catch (RecommenderBuildException e) {
throw new RuntimeException("recommender build failed", e);
}
// we want to recommend items
ItemRecommender irec = rec.getItemRecommender();
assert irec != null; // not null because we configured one
// for users
for (long user: users) {
// get 10 recommendation for the user
List<ScoredId> recs = irec.recommend(user, 10);
System.out.format("Recommendations for %d:\n", user);
for (ScoredId item: recs) {
System.out.format("\t%d\n", item.getId());
}
}
}
}
I am really lost on this one and would appreciate any help. Thanks for your time.

The last line of your input file only contains one field. Each input file line needs to contain 3 or 4 fields.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Why is my Spark driver program so slow? - java

Related

How to use parallel processing in most efficient and elegant way in java

Fixing GC Overhead Limit Exceeded Without Increasing Heap Size

How to measure the time Spark needs to run an action on partitioned RDD?

set Exampleset to ExampleSet2SimilarityExampleSet Rapidminer Operator with java

LensKit: LensKit demo is not reading my data file

Categories

Resources