Java - Encog 3.2 - RPROP network not updating weights - java

I researched a lot of questions and examples but I can't seem to find out what's wrong with my RPROP NN. It's also the first time I use Encog so I'm wondering if it's something I'm doing wrong.
I am trying to train the network to recognize a cat by feeding it images (50x50), then converting it to grayscale and feeding the network an input double[][] along with a target double[][]. I noticed that the error is constantly at 4.0, so I performed a dumpWeights() with every training iteration to see what's going on. I noticed that the weights were constantly zero. I then went back to the basics to see if I'm doing things right so I modified it for an XOR problem:
//////////First created the network:
BasicNetwork network = new BasicNetwork();
network.addLayer(new BasicLayer(null, true, 2));
network.addLayer(new BasicLayer(new ActivationBiPolar(), true, 2));
network.addLayer(new BasicLayer(new ActivationBiPolar(), false, 1));
network.getStructure().finalizeStructure();
network.reset();
//////Then created my data set and target vector (ideal vector) and fed it to a new RPROP training class:
final double targetVector[][] = { { -1 }, { 1.0 }, { 1.0 }, { -1 } };
final double inputData[][] = { { -1, -1 }, { 1.0, -1 },{ -1, 1.0 }, { 1.0, 1.0 } };
MLDataSet trainingSet = new BasicMLDataSet(inputData, targetVector);
final ResilientPropagation train = new ResilientPropagation(network, trainingSet);
///////train network
int epoch = 1;
do{
train.iteration();
System.out.println("Epoch #" + epoch + " Error : " + train.getError()) ;
epoch++;
System.out.println(network.dumpWeights());
}while(train.getError() > 0.01) ;
train.finishTraining();
System.out.println("End of training");
I get the following output, notice the lines of 0.0 as a result of the network.dumpWeights() method:
Epoch #132636 Error : 2.0
0,0,0,0,0,0,0,0,0
Epoch #132637 Error : 2.0
0,0,0,0,0,0,0,0,0
Epoch #132638 Error : 2.0
0,0,0,0,0,0,0,0,0
Epoch #132639 Error : 2.0
0,0,0,0,0,0,0,0,0
Epoch #132640 Error : 2.0
... and so on.
Anything obvious you can see that I'm doing wrong here? I also tried a 2-3-1 architecture as the XORHelloWorld.java example implemented.
Any help would be greatly appreciated.

Try switching your hidden layer to a TANH activation function, such as this:
network.addLayer(new BasicLayer(null, true, 2));
network.addLayer(new BasicLayer(new ActivationTANH(), true, 2));
network.addLayer(new BasicLayer(new ActivationBiPolar(), false, 1));
With this change, I can get your example above to converge. I think it will work better than Sigmoid, if you are using -1 to 1 as the input. It is okay to is a linear activation function (i.e. ActivationBiPolar as the output activation function) but you need something such as sigmoid/tanh as the hidden. Something that does not just return 1.0 as the derivative, like the linear functions do.

Related

Iterative GraphFrames AggregateMessages hitting memory limits

I'm using GraphFrame's aggregateMessages capability to build a custom clustering algorithm. I tested this algorithm on a small sample dataset (~100 items) and verified that it works. But when I run this on my real dataset of 50k items, I am getting OOM errors after ~10 iterations. Interestingly, the first few iterations are processed in a couple mins and mem is the normal range. It's after iteration 6 that mem usage creeps to ~30GB and eventually bombs. I am running this on a 2 node cluster 16cores with 32GB.
Since this is an iterative algorithm and the fact that the mem after each iteration only increases, I wonder if I need to release memory somehow. I added the unpersist blocks at the end of the the loop but that hasnt helped.
Are there any other efficiencies I could use? Are there best practices around using GraphFrames in an iterative setting?
Another thing I've noticed is that on the spark UI of the executor page, the used "storage memory" for ~300MB, but the spark process is infact taking ~30GB. Not sure if this is a memory leak!
while ( true ) {
System.out.println("["+new Date()+"] Running " + i);
Dataset<Row> lastRoutesDs = groups;
Dataset<Row> groupUnwind = groups.withColumn("id", explode(col("routeItems")));
GraphFrame gf = new GraphFrame(groupUnwind, edgesDs);
Dataset<Row> lvl1 = gf.aggregateMessages()
.sendToSrc(when(
callUDF("contains_in_array_str", AggregateMessages.dst().getField("routeItems"),
AggregateMessages.src().getField("id")).equalTo(false),
struct(AggregateMessages.dst().getField("routeItems").as("routeItems"),
AggregateMessages.dst().getField("routeScores").as("routeScores"),
AggregateMessages.dst().getField("grpId").as("grpId"),
AggregateMessages.dst().getField("grpScore").as("grpScore"),
AggregateMessages.edge().getField("score").as("edgeScore"))))
.agg(collect_set(AggregateMessages.msg()).as("incomings"))
.withColumn("inItem", explode(col("incomings")))
.groupBy("id", "inItem.grpId")
.agg(first("inItem.routeItems").as("routeItems"), first("inItem.routeScores").as("routeScores"),
first("inItem.grpScore").as("grpScore"), collect_list("inItem.edgeScore").as("inScores"))
.groupBy("grpId")
.agg(bestRouteAgg.apply(col("routeItems"), col("routeScores"), col("inScores"), col("grpScore"),
col("id"), col("grpScore")).as("best"))
.withColumn("newScore", callUDF("calcRouteScores", expr("size(best.routeItems)+1"),
col("best.routeScores"), col("best.inScores")))
.withColumn("edgeCount", expr("size(best.routeScores)"))
.persist(StorageLevel.MEMORY_AND_DISK());
lvl1
.filter("newScore > " + groupMaxScore)
.withColumn("itr", lit(i))
.select("grpId", "best.routeItems","best.routeScores", "best.grpScore", "edgeCount", "itr")
.write()
.mode(SaveMode.Append)
.json(workspaceDir + "clusters-rank-collect");
if (lvl1.count() == 0) {
System.out.println("****** End reached " + i);
break;
}
Dataset<Row> newGroups = lvl1.filter("newScore <= " + groupMaxScore)
.withColumn("routeItems_new",
callUDF("merge2Array", col("best.routeItems"), array(col("best.newNode"))))
.withColumn("routeScores_new",
callUDF("merge2ArrayDouble", col("best.routeScores"), col("best.inScores")))
.select(col("grpId"), col("routeItems_new").as("routeItems"),
col("routeScores_new").as("routeScores"), col("newScore").as("grpScore"));
if (i > 0 && (i % 2) == 0) {
newGroups = newGroups
.checkpoint();
}
newGroups = newGroups
.persist(StorageLevel.DISK_ONLY());
System.out.println( newGroups.count() );
groups.unpersist();
lastRoutesDs.unpersist();
groupUnwind.unpersist();
lvl1.unpersist();
groups = newGroups;
i++;
}

Loading saved Tensorflow model in Java

I have developed a Tensorflow model with python in Linux based on the tutorial here: "http://cv-tricks.com/tensorflow-tutorial/training-convolutional-neural-network-for-image-classification/". I trained and saved the model using "tf.train.Saver". I am able to deploy the model in Linux environment and perform prediction successfully. Now I need to be able to load this saved model in JAVA on WINDOWS. Through extensive research online I have read that it does not work with "tf.train.Saver" and I have to change my code to use "Serving" to be able to load a saved TF model in java! Therefore, I followed the tutorial here:
"https://github.com/tensorflow/serving/blob/master/tensorflow_serving/example/mnist_saved_model.py
" and changed my code. However, I have an error with "tf.FixedLenFeature" where it is asking me to use "FixedLenSequenceFeature". Here is the complete error message:
"ValueError: First dimension of shape for feature x unknown. Consider using FixedLenSequenceFeature."
which is happening here:
feature_configs = {'x': tf.FixedLenFeature(shape=[None, img_size,img_size,num_channels], dtype=tf.float32),}
I am not sure this is the right path to go since I have batch of images of size [batchsize*128*128*3] and should not be using the sequence feature! It would be great if someone could clear this out for me and answer these questions:
1- Do I have to change my code from "tf.train.Saver" to "serving" to be able to load the saved model and deploy it in JAVA?
2- If the answer to the above question is yes, how can I feed the data correctly and solve the aforementioned ERROR?
3- Is there any example of how to DEPLOY the model that was saved using "serving"?
Here is my training code that throws the error:
import dataset
import tensorflow as tf
import time
from datetime import timedelta
import math
import random
import numpy as np
import os
#Adding Seed so that random initialization is consistent
from numpy.random import seed
seed(1)
from tensorflow import set_random_seed
set_random_seed(2)
batch_size = 32
#Prepare input data
classes = ['class1','class2','class3']
num_classes = len(classes)
# 20% of the data will automatically be used for validation
validation_size = 0.2
img_size = 128
num_channels = 3
train_path='/home/user1/Downloads/Expression/Augmented/Data/Train'
# We shall load all the training and validation images and labels into memory using openCV and use that during training
data = dataset.read_train_sets(train_path, img_size, classes, validation_size=validation_size)
print("Complete reading input data. Will Now print a snippet of it")
print("Number of files in Training-set:\t\t{}".format(len(data.train.labels)))
print("Number of files in Validation-set:\t{}".format(len(data.valid.labels)))
session = tf.Session()
serialized_tf_example = tf.placeholder(tf.string, name='tf_example')
feature_configs = {'x': tf.FixedLenFeature(shape=[None, img_size,img_size,num_channels], dtype=tf.float32),}
tf_example = tf.parse_example(serialized_tf_example, feature_configs)
x = tf.identity(tf_example['x'], name='x') # use tf.identity() to assign name
# x = tf.placeholder(tf.float32, shape=[None, img_size,img_size,num_channels], name='x')
## labels
y_true = tf.placeholder(tf.float32, shape=[None, num_classes], name='y_true')
y_true_cls = tf.argmax(y_true, dimension=1)
##Network graph params
filter_size_conv1 = 3
num_filters_conv1 = 32
filter_size_conv2 = 3
num_filters_conv2 = 32
filter_size_conv3 = 3
num_filters_conv3 = 64
fc_layer_size = 128
def create_weights(shape):
return tf.Variable(tf.truncated_normal(shape, stddev=0.05))
def create_biases(size):
return tf.Variable(tf.constant(0.05, shape=[size]))
def create_convolutional_layer(input,
num_input_channels,
conv_filter_size,
num_filters):
## We shall define the weights that will be trained using create_weights function.
weights = create_weights(shape=[conv_filter_size, conv_filter_size, num_input_channels, num_filters])
## We create biases using the create_biases function. These are also trained.
biases = create_biases(num_filters)
## Creating the convolutional layer
layer = tf.nn.conv2d(input=input,
filter=weights,
strides=[1, 1, 1, 1],
padding='SAME')
layer += biases
## We shall be using max-pooling.
layer = tf.nn.max_pool(value=layer,
ksize=[1, 2, 2, 1],
strides=[1, 2, 2, 1],
padding='SAME')
## Output of pooling is fed to Relu which is the activation function for us.
layer = tf.nn.relu(layer)
return layer
def create_flatten_layer(layer):
#We know that the shape of the layer will be [batch_size img_size img_size num_channels]
# But let's get it from the previous layer.
layer_shape = layer.get_shape()
## Number of features will be img_height * img_width* num_channels. But we shall calculate it in place of hard-coding it.
num_features = layer_shape[1:4].num_elements()
## Now, we Flatten the layer so we shall have to reshape to num_features
layer = tf.reshape(layer, [-1, num_features])
return layer
def create_fc_layer(input,
num_inputs,
num_outputs,
use_relu=True):
#Let's define trainable weights and biases.
weights = create_weights(shape=[num_inputs, num_outputs])
biases = create_biases(num_outputs)
# Fully connected layer takes input x and produces wx+b.Since, these are matrices, we use matmul function in Tensorflow
layer = tf.matmul(input, weights) + biases
if use_relu:
layer = tf.nn.relu(layer)
return layer
layer_conv1 = create_convolutional_layer(input=x,
num_input_channels=num_channels,
conv_filter_size=filter_size_conv1,
num_filters=num_filters_conv1)
layer_conv2 = create_convolutional_layer(input=layer_conv1,
num_input_channels=num_filters_conv1,
conv_filter_size=filter_size_conv2,
num_filters=num_filters_conv2)
layer_conv3= create_convolutional_layer(input=layer_conv2,
num_input_channels=num_filters_conv2,
conv_filter_size=filter_size_conv3,
num_filters=num_filters_conv3)
layer_flat = create_flatten_layer(layer_conv3)
layer_fc1 = create_fc_layer(input=layer_flat,
num_inputs=layer_flat.get_shape()[1:4].num_elements(),
num_outputs=fc_layer_size,
use_relu=True)
layer_fc2 = create_fc_layer(input=layer_fc1,
num_inputs=fc_layer_size,
num_outputs=num_classes,
use_relu=False)
y_pred = tf.nn.softmax(layer_fc2,name='y_pred')
y_pred_cls = tf.argmax(y_pred, dimension=1)
values, indices = tf.nn.top_k(y_pred, 3)
table = tf.contrib.lookup.index_to_string_table_from_tensor(
tf.constant([str(i) for i in xrange(3)]))
prediction_classes = table.lookup(tf.to_int64(indices))
session.run(tf.global_variables_initializer())
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=layer_fc2,
labels=y_true)
cost = tf.reduce_mean(cross_entropy)
optimizer = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(cost)
correct_prediction = tf.equal(y_pred_cls, y_true_cls)
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
session.run(tf.global_variables_initializer())
def show_progress(epoch, feed_dict_train, feed_dict_validate, val_loss):
acc = session.run(accuracy, feed_dict=feed_dict_train)
val_acc = session.run(accuracy, feed_dict=feed_dict_validate)
msg = "Training Epoch {0} --- Training Accuracy: {1:>6.1%}, Validation Accuracy: {2:>6.1%}, Validation Loss: {3:.3f}"
print(msg.format(epoch + 1, acc, val_acc, val_loss))
total_iterations = 0
# saver = tf.train.Saver()
def train(num_iteration):
global total_iterations
for i in range(total_iterations,
total_iterations + num_iteration):
x_batch, y_true_batch, _, cls_batch = data.train.next_batch(batch_size)
x_valid_batch, y_valid_batch, _, valid_cls_batch = data.valid.next_batch(batch_size)
feed_dict_tr = {x: x_batch,
y_true: y_true_batch}
feed_dict_val = {x: x_valid_batch,
y_true: y_valid_batch}
session.run(optimizer, feed_dict=feed_dict_tr)
if i % int(data.train.num_examples/batch_size) == 0:
print(i)
val_loss = session.run(cost, feed_dict=feed_dict_val)
epoch = int(i / int(data.train.num_examples/batch_size))
show_progress(epoch, feed_dict_tr, feed_dict_val, val_loss)
print("Saving the model Now!")
# saver.save(session, save_path_full, global_step=i)
total_iterations += num_iteration
train(num_iteration=10000)#3000
# Export model
# WARNING(break-tutorial-inline-code): The following code snippet is
# in-lined in tutorials, please update tutorial documents accordingly
# whenever code changes.
export_path_base = './SavedModel/'
export_path = os.path.join(
tf.compat.as_bytes(export_path_base),
tf.compat.as_bytes(str(1)))
print 'Exporting trained model to', export_path
builder = tf.saved_model.builder.SavedModelBuilder(export_path)
# Build the signature_def_map.
classification_inputs = tf.saved_model.utils.build_tensor_info(
serialized_tf_example)
classification_outputs_classes = tf.saved_model.utils.build_tensor_info(
prediction_classes)
classification_outputs_scores = tf.saved_model.utils.build_tensor_info(values)
classification_signature = (
tf.saved_model.signature_def_utils.build_signature_def(
inputs={
tf.saved_model.signature_constants.CLASSIFY_INPUTS:
classification_inputs
},
outputs={
tf.saved_model.signature_constants.CLASSIFY_OUTPUT_CLASSES:
classification_outputs_classes,
tf.saved_model.signature_constants.CLASSIFY_OUTPUT_SCORES:
classification_outputs_scores
},
method_name=tf.saved_model.signature_constants.CLASSIFY_METHOD_NAME))
tensor_info_x = tf.saved_model.utils.build_tensor_info(x)
tensor_info_y = tf.saved_model.utils.build_tensor_info(y_pred)
prediction_signature = (
tf.saved_model.signature_def_utils.build_signature_def(
inputs={'images': tensor_info_x},
outputs={'scores': tensor_info_y},
method_name=tf.saved_model.signature_constants.PREDICT_METHOD_NAME))
legacy_init_op = tf.group(tf.tables_initializer(), name='legacy_init_op')
builder.add_meta_graph_and_variables(
sess, [tf.saved_model.tag_constants.SERVING],
signature_def_map={
'predict_images':
prediction_signature,
tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
classification_signature,
},
legacy_init_op=legacy_init_op)
builder.save()
print 'Done exporting!'

OPENCV crash when saving to file trained machine learning data (like SVM or ANN)

I have built a simple project in android studio, and included OpenCV in order to train either an SVM (support vector machine) or an ANN (artificial neural network). Everything seems to go well, including data creation, training and inspection of trained data, except for saving. Whenever I save a opencv ml-object (like ann.save(...) or svm.save(...)), android studio crashes.
SVM -
When I extract supportvectors using the line
classifier.getSupportVectors()
the numbers seem sain. However, the app crashes when I move past a breakpoint placed at
classifier.save("C:\\foo\\trentsvm.txt");
In logCat I dig up the following feedback:
07-04 14:36:10.939 25258-25258/com.example.tbrandsa.opencvtest A/libc:
Fatal signal 11 (SIGSEGV), code 2, fault addr 0x7f755f53f0 in tid
25258 (ndsa.opencvtest) [ 07-04 14:36:10.942 439: 439 W/]
debuggerd: handling request: pid=25258 uid=10227 gid=10227 tid=25258
I get a similar error if i instead try to save an artificial neural network (ANN), see update far below.
I have tried saving the file as XML and txt, and as "C:\trentsvm.someformat", and as "trentsvm.someformat". I also get the same error in my Eclipse java project. High pain, no gain. Full code is below. Could you help?
PS: I use OpenCv version 3.2.0. and Android Studio 2.3.2
// I based this code on stuff i found online. Not sure if all is as important or good.
// Purpose: multilabel classification - digit recognition for android app.
// Create data and labels for a digit recognition algorithm
int numTargets = 10; // (0-9 => 10 types of labels)
int totalSamples = 100; // Could have been number of images of digits
int totalIndicators = 10; // Could have been number of properites per digit image.
Mat labels = new Mat(totalSamples,1, CvType.CV_16S);
Mat data = new Mat(totalSamples, totalIndicators,CvType.CV_16S);
// Fill with dummy values:
for (int s = 0; s<totalSamples; s++)
{
int someLabel = s%numTargets;
labels.put(s,0, (double)someLabel);
for (int m = 0; m<totalIndicators; m++)
{
int someDataValue = (s%numTargets)*totalIndicators + m;
data.put(s, m, (double)someDataValue);
}
}
data.convertTo(data, CvType.CV_32F);
labels.convertTo(labels, CvType.CV_32S);
SVM classifier = SVM.create();
TermCriteria criteria = new TermCriteria(TermCriteria.EPS + TermCriteria.MAX_ITER,100,0.1);
classifier.setKernel(SVM.LINEAR);
classifier.setType(SVM.C_SVC); //We choose here the type CvSVM::C_SVC that can be used for n-class classification (n >= 2).
classifier.setGamma(0.5);
classifier.setNu(0.5);
classifier.setC(1);
classifier.setTermCriteria(criteria);
classifier.train(data, Ml.ROW_SAMPLE, labels);
// Check how trained SVM predicts the training data
Mat estimates = new Mat(totalSamples, 1, CvType.CV_32F);
classifier.predict(data, estimates, StatModel.RAW_OUTPUT);
for (int i = 0; i<totalSamples; i++)
{
double l = labels.get(i, 0)[0];
double e = estimates.get(i, 0)[0];
System.out.print("\n fact: "+l+", estimat: "+e);
}
Mat suppV = classifier.getSupportVectors();
try {
if (classifier.isTrained()){
// It crashes at the next line!
classifier.save("C:\\foo\\trentsvm.txt");
}
}
catch (Exception e)
{
}
Update july 5th: As suggested by ZdaR, I tried the to use an in-phone adress, but it did not solve the problem.
String address = Environment.getExternalStorageDirectory().getPath()+"/trentsvm.xml";
// address now has value "storage/emulated/0/trentsvm.xml"
classifier.save(address);
In logcat:
07-05 14:50:12.420 11743-11743/com.example.tbrandsa.opencv2 A/libc:
Fatal signal 11 (SIGSEGV), code 2, fault addr 0x7d517f1990 in tid
11743 (brandsa.opencv2)
[ 07-05 14:50:12.424 3134: 3134 W/ ] debuggerd: handling
request: pid=11743 uid=10319 gid=10319 tid=11743
Update july 6th:
When I run the same script in eclipse and use a debugger (JUnit 4, VM arguments: -Djava.library.path=C:\Users\tbrandsa\Downloads\opencv\build\java\x64;src\test\jniLibs, ) debugging on the pc without device, the caught exception "e" says the following
cause= Exception,
detailMessage= "Unknown Exception" ,
stackTrace=> StackTraceElement[0] ,
suppressedExeptions= Collections$UnmodifiableRandomAccessList,
Update july 13th:
I just tried with an artificial neural network (ANN) too, and it crashes when trying to save.
Error:
Fatal signal 11 (SIGSEGV), code 1, fault addr 0x15a57e688000c in tid
8507 (brandsa.opencv2) debuggerd: handling request: pid=8507
uid=10319 gid=10319 tid=8507
Code:
// Mat data is of size 100*20*CV_32FC1,
// Mat labels is of size 100*1*CV_32FC1
// layerSizes is of size 3*1*CV_8UC1
int[] hiddenLayers = {10};
Mat layerSizes = new Mat(2 + hiddenLayers.length,1,CvType.CV_8U);
layerSizes.put(0, 0, data.width());
for (int l = 0; l< hiddenLayers.length; l++){
layerSizes.put(1 + l, 0,hiddenLayers[l]);}
layerSizes.put(1 + hiddenLayers.length, 0,labels.width());
ANN_MLP ann = ANN_MLP.create();
ann.setLayerSizes(layerSizes);
ann.setActivationFunction(ANN_MLP.SIGMOID_SYM);
ann.train(data, Ml.ROW_SAMPLE , labels);
ann.save("/storage/emulated/0/Pictures/no.rema.priceagent.test/trentann.xml");

Neuroph cannot train set

I've been trying so hard to train a network but I cannot do it. Neuroph Studio doesn't help at all, it always return null when training.
Then I tried this code in a Java app :
// create new perceptron network
NeuralNetwork neuralNetwork = new Perceptron(2, 1);
// create training set
DataSet trainingSet = new DataSet(2, 1);
// add training data to training set (logical OR function)
trainingSet.addRow(new DataSetRow(new double[]{0, 0}, new double[]{0.5d}));
trainingSet.addRow(new DataSetRow(new double[]{0, 1}, new double[]{1}));
trainingSet.addRow(new DataSetRow(new double[]{1, 0}, new double[]{1}));
trainingSet.addRow(new DataSetRow(new double[]{1, 1}, new double[]{1}));
// learn the training set
neuralNetwork.learn(trainingSet);
// save the trained network into file
neuralNetwork.save("or_perceptron.nnet");
// load the saved network
neuralNetwork = NeuralNetwork.createFromFile("or_perceptron.nnet");
// set network input
neuralNetwork.setInput(1, 1);
// calculate network
neuralNetwork.calculate();
// get network output
double[] networkOutput = neuralNetwork.getOutput();
for (double res : networkOutput) {
System.out.println(res);
}
This works, but I want to train something like this:
Input: 0.3 , 0.5
Output : 0.2
It keeps training forever, what is wrong with neuroph, or it doesn't work at all ?
At the end the only thing that worked was to load the training set from external files. Maybe there is another solution, but that was the only thing that worked for me at the end.

How to overcome SVMWithSGD that throws ArrayIndexOutOfBoundsException for index bigger that 5000?

In order to detect visitors demographics based on their behavior I used SVM algorithm from SPARK MLlib:
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc.sc(), "labels.txt").toJavaRDD();
JavaRDD<LabeledPoint> training = data.sample(false, 0.6, 11L);
training.cache();
JavaRDD<LabeledPoint> test = data.subtract(training);
// Run training algorithm to build the model.
int numIterations = 100;
final SVMModel model = SVMWithSGD.train(training.rdd(), numIterations);
// Clear the default threshold.
model.clearThreshold();
JavaRDD<Tuple2<Object, Object>> scoreAndLabels = test.map(new SVMTestMapper(model));
Unfortunately final SVMModel model = SVMWithSGD.train(training.rdd(), numIterations); throws ArrayIndexOutOfBoundsException :
Caused by: java.lang.ArrayIndexOutOfBoundsException: 4857
labels.txt is a txt file composed from:
Visitor criteria(is male) | List[siteId: access number]
1 27349:1 23478:1 35752:1 9704:2 27896:1 30050:2 30018:1
1 36214:1 26378:1 26606:1 26850:1 17968:2
1 21870:1 41294:1 37388:1 38626:1 10711:1 28392:1 20749:1
1 29328:1 34370:1 19727:1 29542:1 37621:1 20588:1 42426:1 30050:6 28666:1 23190:3 7882:1 35387:1 6637:1 32131:1 23453:1
I tried with a lot of data and algorithms and as seen it gives an error for site Ids bigger than 5000.
Is there any solution to overcome it or there is another library for this issue? Or because the data is matrix is too sparse should use SVD?

Categories