Stanford Topic Modeling Toolbox: Exception - java

I'm trying to work with the Stanford Topic Modeling Toolbox. I downloaded the "tmt-0.4.0.jar" file and tried the examples.
Example 0 and 1 worked fine, but trying example 2 (no code-changes), I receive the following exception:
[cell] loading pubmed-oa-subset.csv.term-counts.cache.70108071.gz
[Concurrent] 32 permits
Exception in thread "Thread-3" java.lang.ArrayIndexOutOfBoundsException: -1
    at scalanlp.stage.text.TermCounts$class.getDF(TermFilters.scala:64)
    at scalanlp.stage.text.TermCounts$$anon$2.getDF(TermFilters.scala:84)
    at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:390)
    at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:388)
    at scala.collection.Iterator$class.foreach(Iterator.scala:660)
    at scala.collection.Iterator$$anon$22.foreach(Iterator.scala:382)
    at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:390)
    at $$anonfun$map$2.apply(Concurrent.scala:100)
Why do I receive this exception, and how can this be fixed?
Thanks a lot for your help!
PS: The code is the same as in example 2 of the website:
// Stanford TMT Example 2 - Learning an LDA model
// tells Scala where to find the TMT classes
import scalanlp.stage._;
import scalanlp.stage.text._;
import scalanlp.text.tokenize._;
import edu.stanford.nlp.tmt.stage._;
import edu.stanford.nlp.tmt.model.lda._;
import edu.stanford.nlp.tmt.model.llda._;
val source = CSVFile("pubmed-oa-subset.csv") ~> IDColumn(1);
val tokenizer = {
  SimpleEnglishTokenizer() ~>            // tokenize on space and punctuation
  CaseFolder() ~>                        // lowercase everything
  WordsAndNumbersOnlyFilter() ~>         // ignore non-words and non-numbers
  MinimumLengthFilter(3)                 // take terms with >=3 characters
}
val text = {
  source ~>                              // read from the source file
  Column(4) ~>                           // select column containing text
  TokenizeWith(tokenizer) ~>             // tokenize with tokenizer above
  TermCounter() ~>                       // collect counts (needed below)
  TermMinimumDocumentCountFilter(4) ~>   // filter terms in <4 docs
  TermDynamicStopListFilter(30) ~>       // filter out 30 most common terms
  DocumentMinimumLengthFilter(5)         // take only docs with >=5 terms
}
// turn the text into a dataset ready to be used with LDA
val dataset = LDADataset(text);
// define the model parameters
val params = LDAModelParams(numTopics = 30, dataset = dataset,
topicSmoothing = 0.01, termSmoothing = 0.01);
// Name of the output model folder to generate
val modelPath = file("lda-"+dataset.signature+"-"+params.signature);
// Trains the model: the model (and intermediate models) are written to the
// output folder. If a partially trained model with the same dataset and
// parameters exists in that folder, training will be resumed.
TrainCVB0LDA(params, dataset, output=modelPath, maxIterations=1000);
// To use the Gibbs sampler for inference, instead use
// TrainGibbsLDA(params, dataset, output=modelPath, maxIterations=1500);

The author of the tool has posted the answer:
This usually happens when you have a stale .cache file - unfortunately the
error message isn't particularly useful. Try deleting cache in the run
folder and running again.
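As a rough illustration of that advice, here is a minimal Python sketch that lists (and optionally deletes) cache files in the run folder. The helper name remove_stale_caches and the glob pattern are assumptions based only on the cache file name in the log above ("pubmed-oa-subset.csv.term-counts.cache.70108071.gz"); adjust the pattern if your cache files are named differently.

```python
from pathlib import Path


def remove_stale_caches(run_dir, dry_run=True):
    """Find TMT-style cache files in run_dir; delete them unless dry_run is set."""
    # Assumption: cache files look like "<source>.term-counts.cache.<id>.gz",
    # as in the log line of the question.
    matches = sorted(Path(run_dir).glob("*.cache.*.gz"))
    for path in matches:
        if not dry_run:
            path.unlink()
    # Return the file names so the caller can review what was (or would be) removed.
    return [p.name for p in matches]
```

Calling remove_stale_caches(".") first shows what would be removed; passing dry_run=False actually deletes the files, after which the toolbox should rebuild the cache on the next run.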


Optimize Random Forest parameters in weka?

I am trying to optimize random forest parameters using Weka. The Java class is as follows:
package pkg10foldcrossvalidation;

import java.io.BufferedReader;
import java.io.FileReader;

import weka.core.*;
import weka.classifiers.meta.*;
import weka.classifiers.trees.RandomForest;

public class RF_Optimizer {
    public static void main(String[] args) throws Exception {
        // load data
        BufferedReader reader = new BufferedReader(new FileReader("C:\\Prediction Results on the testing set\\Dataset.arff"));
        Instances data = new Instances(reader);
        reader.close();
        data.setClassIndex(data.numAttributes() - 1);
        // setup classifier
        CVParameterSelection ps = new CVParameterSelection();
        ps.setClassifier(new RandomForest());
        ps.setNumFolds(10); // using 10-fold CV
        ps.addCVParameter("C 0.1 0.5 5");
        // build and output best options
        ps.buildClassifier(data);
        System.out.println(Utils.joinOptions(ps.getBestClassifierOptions()));
    }
}
But I have difficulty understanding which parameters should replace the "C", and how the range of each one should be determined. Also, is it possible to call .addCVParameter several times to optimize several parameters at the same time?
I searched for YouTube or website tutorials that explain how to change random forest parameters in Java, but found nothing.
Thank you
I think what you are describing (-C) is a cross-validation parameter, not a RandomForest parameter.
Can't you just use the Explorer GUI, open a sample dataset such as glass.arff, and then right-click on the bold RandomForest string at the top of the window, then from the context menu choose "copy configuration to clipboard", and then paste that string into your java code?
After doing this right now, I've copied this string to the clipboard:
weka.classifiers.trees.RandomForest -P 100 -I 100 -num-slots 1 -K 0 -M 1.0 -V 0.001 -S 1
These are the default parameters for Weka's RandomForest learner. I really can't tell what these parameters mean, which of them are most suitable for optimization, or which range of values to use. Most likely a very important parameter is numIterations, the -I parameter. You could vary it from 100, 200, ... to 1000, plot numIterations against accuracy, and check whether the curve has already flattened out.
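To make the range question more concrete, here is a small Python sketch that expands a parameter string of the kind passed to addCVParameter into the values it would cover. It assumes the string has the form "<option> <lower> <upper> <steps>" and that the values are evenly spaced with both bounds included; that reading is an assumption to check against the Weka documentation, and expand_cv_parameter is a hypothetical helper, not part of Weka.

```python
def expand_cv_parameter(spec):
    """Expand a spec such as "I 100 1000 10" into (option, list of values)."""
    option, lower, upper, steps = spec.split()
    lower, upper, steps = float(lower), float(upper), int(steps)
    if steps < 2:
        return option, [lower]
    # Assumption: values are evenly spaced and both bounds are included.
    increment = (upper - lower) / (steps - 1)
    # Round to hide floating-point noise in the computed grid points.
    return option, [round(lower + i * increment, 10) for i in range(steps)]


print(expand_cv_parameter("I 100 1000 10"))  # the -I sweep suggested above
print(expand_cv_parameter("C 0.1 0.5 5"))    # the string from the question
```

Under that assumption, "I 100 1000 10" would correspond to the 100, 200, ..., 1000 sweep over numIterations suggested above.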

H2O : NullPointerException error while building ensemble model using deep learning grid

I am trying to build a stacked ensemble model to predict merchant churn using R (version 3.3.3) and deep learning in h2o. The response variable is binary. At the moment I am trying to run the code that builds a stacked ensemble model from the top 5 models found by the grid search. However, when the code is run, I get a java.lang.NullPointerException with the following output:
at hex.StackedEnsembleModel.checkAndInheritModelProperties(
at hex.ensemble.StackedEnsemble$StackedEnsembleDriver.computeImpl(
at hex.ModelBuilder$Driver.compute2(
at water.H2O$H2OCountedCompleter.compute(
at jsr166y.CountedCompleter.exec(
at jsr166y.ForkJoinTask.doExec(
at jsr166y.ForkJoinPool$WorkQueue.runTask(
at jsr166y.ForkJoinPool.runWorker(
Below is the code that I've used to do the hyper-parameter grid search and build the ensemble model:
hyper_params <- list(
rho = c(0.9,0.95,0.99,0.999),
search_criteria <- list(
strategy = "RandomDiscrete",
max_runtime_secs = 3600,
max_models = 100,
dl_ensemble_grid <- h2o.grid(
hyper_params = hyper_params,
search_criteria = search_criteria,
grid_id = "final_grid_ensemble_dl",
training_frame = h2o.rbind(train, valid, test),
keep_cross_validation_predictions = TRUE,
keep_cross_validation_fold_assignment = TRUE,
max_runtime_secs = 3600,
seed = 1234,
DLsortedGridEnsemble_logloss <- h2o.getGrid("final_grid_ensemble_dl",sort_by="logloss",decreasing=FALSE)
ensemble <- h2o.stackedEnsemble(x = predictors,
y = response,
training_frame = h2o.rbind(train,valid,test),
base_models = list(
Note: what I have realised so far is that h2o.stackedEnsemble function works when there's only one base model and it gives the Java error as soon as there's two or more base models.
I would really appreciate if I could get some feedback as to how this could be resolved.
The error refers to a line of the code that checks that the training_frame in the base models and the training_frame in h2o.stackedEnsemble() have the same checksum. I think the problem is caused because you dynamically created the training frame, rather than defining it explicitly (even though that should work since it's the same data in the end). So, rather than setting training_frame = h2o.rbind(train, valid, test) in the grid and ensemble functions, set the following at the top of your code:
df <- h2o.rbind(train, valid, test)
And then set training_frame = df in the grid and ensemble functions.
As a side note, you may get better DL models if you use a validation frame (for early stopping), rather than using all your data for the training frame. Also, if you want to use all the models in your grid (might lead to better performance, but not always), you can set base_models = DLsortedGridEnsemble_logloss@model_ids in the h2o.stackedEnsemble() function.

How to use a TensorFlow LinearClassifier in Java

In Python I've trained a TensorFlow LinearClassifier and saved it like:
model = tf.contrib.learn.LinearClassifier(feature_columns=columns)
model.fit(..., steps=100)
model.export_savedmodel(export_dir, parsing_serving_input_fn)
By using the TensorFlow Java API I am able to load this model in Java using:
model = SavedModelBundle.load(export_dir, "serve");
It seems I should be able to run the graph using something like
model.session().runner().feed(???, ???).fetch(???, ???).run()
but what variable names/data should I feed to/fetch from the graph to provide it features and to fetch the probabilities of the classes? The Java documentation is lacking this information as far as I can see.
The names of the nodes to feed would depend on what parsing_serving_input_fn does, in particular they should be the names of the Tensor objects that are returned by parsing_serving_input_fn. The names of the nodes to fetch would depend on what you're predicting (arguments to model.predict() if using your model from Python).
That said, the TensorFlow saved model format does include the "signature" of the model (i.e., the names of all Tensors that can be fed or fetched) as metadata that can provide hints.
From Python you can load the saved model and list out its signature using something like:
with tf.Session() as sess:
    md = tf.saved_model.loader.load(sess, ['serve'], export_dir)
    sig = md.signature_def[tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY]
    print(sig)
Which will print something like:
inputs {
  key: "inputs"
  value {
    name: "input_example_tensor:0"
    dtype: DT_STRING
    tensor_shape {
      dim {
        size: -1
      }
    }
  }
}
outputs {
  key: "scores"
  value {
    name: "linear/binary_logistic_head/predictions/probabilities:0"
    dtype: DT_FLOAT
    tensor_shape {
      dim {
        size: -1
      }
      dim {
        size: 2
      }
    }
  }
}
method_name: "tensorflow/serving/classify"
Suggesting that what you want to do in Java is:
Tensor t = /* Tensor object to be fed */;
model.session().runner().feed("input_example_tensor", t).fetch("linear/binary_logistic_head/predictions/probabilities").run();
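Note how the names in the signature carry a ":0" suffix while the Java call uses the bare operation name. As a small illustration of that mapping, here is a Python sketch that splits a signature tensor name into an operation name and an output index. The helper split_tensor_name is hypothetical, and the assumption is that a bare operation name passed to feed()/fetch() refers to output index 0.

```python
def split_tensor_name(name):
    """Split "op_name:index" into (op_name, index); the index defaults to 0."""
    op_name, sep, index = name.rpartition(":")
    if not sep:
        # No ":<index>" suffix: treat the whole string as the operation name.
        return name, 0
    return op_name, int(index)


print(split_tensor_name("input_example_tensor:0"))
print(split_tensor_name("linear/binary_logistic_head/predictions/probabilities:0"))
```

Both names from the signature above have index 0, which is why the bare operation names are enough in the Java call.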
You can also extract this information purely within Java if your program includes the generated Java code for TensorFlow protocol buffers (packaged in the org.tensorflow:proto artifact) using something like this:
// Same as tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY
// in Python. Perhaps this should be an exported constant in TensorFlow's Java API.
final String DEFAULT_SERVING_SIGNATURE_DEF_KEY = "serving_default";
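final SignatureDef sig =
    MetaGraphDef.parseFrom(model.metaGraphDef())
        .getSignatureDefMap()
        .get(DEFAULT_SERVING_SIGNATURE_DEF_KEY);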
You will have to add:
import org.tensorflow.framework.MetaGraphDef;
import org.tensorflow.framework.SignatureDef;
Since the Java API and the saved-model-format are somewhat new, there is much room for improvement in the documentation.
Hope that helps.

Apache Mahout not giving any recommendation

I am trying to use Mahout for recommendations but am getting none.
My dataset :
Code :
DataModel datamodel = new FileDataModel(new File("dataset.csv"));
// Create the UserSimilarity object.
UserSimilarity usersimilarity = new PearsonCorrelationSimilarity(datamodel);
// Create the UserNeighborhood object.
UserNeighborhood userneighborhood = new ThresholdUserNeighborhood(0.1, usersimilarity, datamodel);
// Create the UserBasedRecommender.
UserBasedRecommender recommender = new GenericUserBasedRecommender(datamodel, userneighborhood, usersimilarity);
// Ask for 1 recommendation for user 0.
List<RecommendedItem> recommendations = recommender.recommend(0, 1);
for (RecommendedItem recommendation : recommendations) {
    System.out.println(recommendation);
}
I am using Mahout version : 0.13.0
Ideally, it should recommend item_id = 101 to user_id = 0: since user = 0 and user = 1 have item 102 in common, it should recommend item_id = 101 to user_id = 0.
Logs :
18:08:11.669 [main] INFO - Creating FileDataModel for file dataset.csv
18:08:11.700 [main] INFO - Reading file info...
18:08:11.702 [main] INFO - Read lines: 3
18:08:11.722 [main] INFO - Processed 2 users
18:08:11.738 [main] DEBUG - Recommending items for user ID '0'
The Hadoop MapReduce code in Mahout is being deprecated. The new recommender code starts with @rawkintrevo's examples. If you are a Scala programmer, follow them.
Most engineers would like a system that works with no modification. The Mahout algorithm is encapsulated in The Universal Recommender, built on top of Apache PredictionIO. It has a server to accept events like the ones in your example, internal event storage, and a query server for results. There are numerous improvements over the old MapReduce code, including using real-time user behavior to make recommendations. Neither the new Mahout nor the old included servers for input and query; the Universal Recommender has REST endpoints for both.
Given that the code you are using will be deprecated, I strongly suggest that you dive into the Mahout code (@rawkintrevo's example) or look at The Universal Recommender, which is an entire end-to-end system.
Install PredictionIO with a "single machine" setup here or, to really shortcut setup, use our prepackaged AWS AMI here. It includes PIO and The Universal Recommender pre-installed.
Add the UR Template here
A Java SDK for sending events to the recommender here
Once you have this setup you deal with config, REST or Java SDK and the PIO CLI. No Scala coding required.
I have three examples that are based on version 0.13.0 (and Scala, which is required for Samsara, the R-Like Scala DSL Mahout utilizes v0.10+)
The first example is a very slow walkthrough, meant as a companion to Pat Ferrel's blog post/slide deck found here.
The second example is a little more "real" in that it utilizes SimilarityAnalysis.cooccurrencesIDSs(...), which is the proper interface for the CCO algorithm.
Here we use 'real' data. The MovieLens data set doesn't have enough going on to showcase CCO's multi-modal power (the ability to recommend on multiple user behaviors). Here we load 'real' data and generate recommendations.
I know you specifically asked for Java; however, Apache Mahout isn't geared for Java at the moment. In theory you could import Scala into your Java, or maybe wrap the functions in another, more Java-friendly function... I've heard rumors late at night (or possibly in a dream) that some grad students somewhere were working on a Java API, but it's not in the trunk at the moment, nor is there a PR, nor is there a bullet in the roadmap.
Hope the above provides some insight.
The most trivial example for Stack Overflow (you can run this interactively in the Mahout Spark shell by typing $MAHOUT_HOME/bin/mahout spark-shell, assuming SPARK_HOME, JAVA_HOME and MAHOUT_HOME are set):
val inputRDD = sc.parallelize(Array( ("u1", "purchase", "iphone"),
("u4","category-browse","tablets")) )
import org.apache.mahout.math.indexeddataset.{IndexedDataset, BiDictionary}
import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark
val purchasesIDS = IndexedDatasetSpark.apply(inputRDD.filter(_._2 == "purchase").map(o => (o._1, o._3)))(sc)
val browseIDS = IndexedDatasetSpark.apply(inputRDD.filter(_._2 == "category-browse").map(o => (o._1, o._3)))(sc)
import org.apache.mahout.math.cf.SimilarityAnalysis
val llrDrmList = SimilarityAnalysis.cooccurrencesIDSs(Array(purchasesIDS, browseIDS),
  randomSeed = 1234,
  maxInterestingItemsPerThing = 3,
  maxNumInteractions = 4)
val llrAtA = llrDrmList(0).matrix.collect
IndexedDatasetSpark.apply( requires an RDD[(String, String)], where the first string is the 'row' (e.g. users) and the second string is the 'behavior'. For the 'buy matrix' the columns would be 'products', but this could also be a 'gender' matrix with two columns (male/female).
Then you pass an array of IndexedDatasets to SimilarityAnalysis.cooccurrencesIDSs(
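To show the data shaping on its own, here is a plain-Python sketch (no Spark or Mahout involved) of what the filter/map steps above do: splitting (user, action, item) triples into one list of (user, item) pairs per action. The helper split_by_action and the third event row are hypothetical and only for illustration.

```python
from collections import defaultdict


def split_by_action(events):
    """Group (user, action, item) triples into {action: [(user, item), ...]}."""
    pairs = defaultdict(list)
    for user, action, item in events:
        pairs[action].append((user, item))
    return dict(pairs)


events = [
    ("u1", "purchase", "iphone"),          # row from the example above
    ("u4", "category-browse", "tablets"),  # row from the example above
    ("u2", "purchase", "ipad"),            # hypothetical extra row
]
by_action = split_by_action(events)
print(by_action["purchase"])
print(by_action["category-browse"])
```

Each per-action list plays the role of one RDD[(String, String)] in the Scala code, with the primary behavior (purchases) first.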

How to correlate similar messages using NLP

I have a couple of tweets that need to be processed. I am trying to find messages that express some harm to a person. How do I go about achieving this via NLP?
I bought my son a toy gun
I shot my neighbor with a gun
I don't like this gun
I would love to own this gun
This gun is a very good buy
Feel like shooting myself with a gun
In the above sentences, the 2nd and 6th are the ones I would like to find.
If the problem is restricted only to guns and shooting, then you could use a dependency parser (like the Stanford Parser) to find verbs and their (prepositional) objects, starting with the verb and tracing its dependants in the parse tree. For example, in both 2 and 6 these would be "shoot, with, gun".
Then you can use a list of (near) synonyms for "shoot" ("kill", "murder", "wound", etc) and "gun" ("weapon", "rifle", etc) to check if they occur in this pattern (verb - preposition - noun) in each sentence.
There will be other ways to express the same idea, e.g. "I bought a gun to shoot my neighbor", where the dependency relation is different, and you'd need to detect these types of dependencies too.
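Here is a minimal Python sketch of the verb - preposition - noun check described above. It is only an approximation: it works on token order over POS-tagged words rather than on a real dependency parse, and the word lists and the tiny LEMMAS table are hypothetical stand-ins for a proper synonym list and lemmatizer.

```python
HARM_VERBS = {"shoot", "kill", "murder", "wound"}
WEAPON_NOUNS = {"gun", "weapon", "rifle"}
# Hypothetical hand-written lemma table; a real system would use a lemmatizer.
LEMMAS = {"shot": "shoot", "shooting": "shoot", "killed": "kill"}


def matches_harm_pattern(tagged):
    """True if a harm verb is later followed by a preposition and a weapon noun.

    `tagged` is a list of (POS tag, word) pairs for one sentence.
    """
    for i, (pos, word) in enumerate(tagged):
        lemma = LEMMAS.get(word.lower(), word.lower())
        if not (pos.startswith("VB") and lemma in HARM_VERBS):
            continue
        rest = tagged[i + 1:]
        for j, (pos2, _) in enumerate(rest):
            # "IN" is the Penn Treebank tag for prepositions.
            if pos2 == "IN" and any(
                p.startswith("NN") and w.lower() in WEAPON_NOUNS
                for p, w in rest[j + 1:]
            ):
                return True
    return False


sentence = [("PRP", "I"), ("VBD", "shot"), ("PRP$", "my"), ("NN", "neighbor"),
            ("IN", "with"), ("DT", "a"), ("NN", "gun"), (".", ".")]
print(matches_harm_pattern(sentence))
```

A sentence such as "I bought a gun to shoot my neighbor" would not match this simple ordering, which is the limitation noted above.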
All of vpekar's suggestions are good. Here is some Python code that will at least parse the sentences and check whether they contain verbs from a user-defined set of harm words. Note: most 'harm words' probably have multiple senses, many of which could have nothing to do with harm. This approach does not attempt to disambiguate word sense.
(This code assumes you have NLTK and Stanford CoreNLP)
import os
import subprocess
from xml.dom import minidom
from nltk.corpus import wordnet as wn

def StanfordCoreNLP_Plain(inFile):
    # Create the startup info so the java program runs in the background (for Windows computers)
    startupinfo = None
    if os.name == 'nt':
        startupinfo = subprocess.STARTUPINFO()
        startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
    # Execute the Stanford parser from the command line
    cmd = ['java', '-Xmx1g','-cp', 'stanford-corenlp-1.3.5.jar;stanford-corenlp-1.3.5-models.jar;xom.jar;joda-time.jar', 'edu.stanford.nlp.pipeline.StanfordCoreNLP', '-annotators', 'tokenize,ssplit,pos', '-file', inFile]
    output = subprocess.Popen(cmd, stdout=subprocess.PIPE, startupinfo=startupinfo).communicate()
    # CoreNLP writes its result to "<input file name>.xml" in the working directory
    outFile = open(inFile[(str(inFile).rfind('\\'))+1:] + '.xml')
    xmldoc = minidom.parse(outFile)
    itemlist = xmldoc.getElementsByTagName('sentence')
    Document = []
    # Get the data out of the xml document and into python lists
    for item in itemlist:
        SentNum = item.getAttribute('id')
        sentList = []
        tokens = item.getElementsByTagName('token')
        for d in tokens:
            word = d.getElementsByTagName('word')[0].firstChild.data
            pos = d.getElementsByTagName('POS')[0].firstChild.data
            sentList.append([str(pos.strip()), str(word.strip())])
        Document.append(sentList)
    return Document

def FindHarmSentence(Document):
    # Loop through sentences in the document. Look for verbs in the Harm Words Set.
    VerbTags = ['VBN', 'VB', 'VBZ', 'VBD', 'VBG', 'VBP', 'V']
    HarmWords = ("shoot", "kill")
    ReturnSentences = []
    for Sentence in Document:
        for word in Sentence:
            if word[0] in VerbTags:
                try:
                    # Reduce the verb to its WordNet base form, e.g. "shot" -> "shoot"
                    wordRoot = wn.morphy(word[1], wn.VERB)
                    if wordRoot in HarmWords:
                        print("This message could indicate harm:", str(Sentence))
                        ReturnSentences.append(Sentence)
                except Exception:
                    pass
    return ReturnSentences

# Assuming your input is a string, we need to put the strings in some file.
Sentences = "I bought my son a toy gun. I shot my neighbor with a gun. I don't like this gun. I would love to own this gun. This gun is a very good buy. Feel like shooting myself with a gun."
ProcessFile = "ProcFile.txt"
OpenProcessFile = open(ProcessFile, 'w')
OpenProcessFile.write(Sentences)
OpenProcessFile.close()
# Sentence split, tokenize, and part of speech tag the data using Stanford Core NLP
Document = StanfordCoreNLP_Plain(ProcessFile)
# Find sentences in the document with harm words
HarmSentences = FindHarmSentence(Document)
This outputs the following:
This message could indicate harm: [['PRP', 'I'], ['VBD', 'shot'], ['PRP$', 'my'], ['NN', 'neighbor'], ['IN', 'with'], ['DT', 'a'], ['NN', 'gun'], ['.', '.']]
This message could indicate harm: [['NNP', 'Feel'], ['IN', 'like'], ['VBG', 'shooting'], ['PRP', 'myself'], ['IN', 'with'], ['DT', 'a'], ['NN', 'gun'], ['.', '.']]
I would have a look at SenticNet
It provides an open source knowledge base and parser that assigns emotional value to text fragments. Using the library, you could train it to recognize statements that you're interested in.
