Training OpenNLP document classification - java

I'm trying to use OpenNLP to classify invoices. Based on its description, each invoice will be grouped into one of two classes. I have built a training file with 20K descriptions and tagged each one with the correct class.
The training data looks like (first column is a code, that I use as class, and the second column is the invoice description):
85171231 IPHONE 5S CINZA ESPACIAL 16GB (ME432BZA)
85171231 Galaxy S6 SM-G920I
85171231 motorola - MOTO G5 XT1672
00000000 MOTONETA ITALIKA AT110
00000000 CJ BOX UNIBOX MOLA 138X57X188 VINHO
Using DocumentCategorizer from OpenNLP, I achieved 98.5% accuracy. But, trying to improve it, I took the wrongly categorized documents and used them to expand the training data.
For instance, when I first ran it, "MOTONETA ITALIKA AT110" was classified as "85171231". That's OK, since in the first run "MOTONETA ITALIKA AT110" wasn't in the training data. So, I taught the classifier explicitly by putting "MOTONETA ITALIKA AT110" in the training file tagged as "00000000".
But, running it again, OpenNLP insists on classifying it as "85171231", even though the training data now contains an explicit mapping to "00000000".
So my question is: am I teaching OpenNLP right? How do I improve its accuracy?
The code that I'm using is:
MarkableFileInputStreamFactory dataIn = new MarkableFileInputStreamFactory(new File("data.train"));
ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, StandardCharsets.UTF_8);
ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ITERATIONS_PARAM, "100");
params.put(TrainingParameters.CUTOFF_PARAM, "0");
DoccatModel model = DocumentCategorizerME.train("pt", sampleStream, params, new DoccatFactory());
DocumentCategorizer doccat = new DocumentCategorizerME(model);
double[] aProbs = doccat.categorize("MOTONETA ITALIKA AT110".replaceAll("[^A-Za-z0-9 ]", " ").split(" "));
doccat.getBestCategory(aProbs);

By default, DocumentCategorizer uses bag-of-words features. This means that the sequence of terms is not taken into account.
If any term of MOTONETA ITALIKA AT110 occurs with high frequency in the group 85171231, the classifier will be inclined to choose that group.
You have a few alternatives:
You can add more variants of MOTONETA ITALIKA AT110 to the group 00000000;
You can change the feature generator.
The second option would be to change the creation of your model, like this:
int minNgramSize = 2;
int maxNgramSize = 3;
DoccatFactory customFactory = new DoccatFactory(
new FeatureGenerator[]{
new BagOfWordsFeatureGenerator(),
new NGramFeatureGenerator(minNgramSize, maxNgramSize)
}
);
DoccatModel model = DocumentCategorizerME.train("pt", sampleStream, params, customFactory);
You can play with the feature generator by removing the BagOfWordsFeatureGenerator and changing the min and max ngram size.
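For instance, a minimal sketch of such a variant (reusing the sampleStream and params from above) that drops the bag-of-words features and relies on n-grams alone:
DoccatFactory ngramOnlyFactory = new DoccatFactory(
    new FeatureGenerator[]{ new NGramFeatureGenerator(2, 3) }); // bigrams and trigrams only
DoccatModel ngramModel = DocumentCategorizerME.train("pt", sampleStream, params, ngramOnlyFactory);
Retraining and re-checking the accuracy on a held-out set is the quickest way to see whether n-grams help for these short descriptions.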

Related

How can I manage availability of warehouses?

I'm working on a tool about humanitarian logistics. In this model I have some lorries which pick up items to support people affected by an earthquake and, after picking them up, go to the earthquake epicenter to drop these items off. I need to manage the availability of these warehouses: for example, if a warehouse has 5 items available and a lorry has a transport capacity of 2, the availability has to become 3 for that warehouse. Obviously I need to do this for every warehouse in my supply chain. I've added (as you can see in the pic that I've uploaded) a parameter (availability) to the warehouse class (named Magazzini).
This is the algorithm that manages lorry movement, in which I need to code the command that changes the availability.
List <Magazzini> subsetlist = findAll(main.magazzinis, w->w.capacita>0);
List <Magazzini> sortmag = new ArrayList<Magazzini>();
List <Double> distance = new ArrayList<Double>();
sortmag = subsetlist;
System.out.println(sortmag);
for (Magazzini m : subsetlist)
{
m.distance = distanceTo(m);
}
sortmag = sortAscending(sortmag, p-> p.distance);
//main.magazzinis.cap = main.magazzinis.cap - 2;
moveTo(sortmag.get(0));
System.out.println(sortmag);
partenza = time();
I wrote a possible command to do it (the commented-out line above), but it doesn't work. How can I fix it?
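For what it's worth, a minimal sketch of that line, assuming the warehouse parameter is really called availability and the transport capacity is fixed at 2 (both taken from the description above, not from the actual model):
Magazzini nearest = sortmag.get(0);              // nearest warehouse after sorting by distance
nearest.availability = nearest.availability - 2; // reserve 2 items: e.g. 5 available -> 3 available
moveTo(nearest);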

DeepLearning4j NN for prediction function doesn't converge

I'm trying to do a simple prediction in DL4j (going to use it later for a large dataset with n features), but no matter what I do my network just doesn't want to learn and behaves very weirdly. Of course I studied all the tutorials and followed the same steps shown in the dl4j repo, but somehow it doesn't work for me.
For dummy feature data I use double[][] features, where val = linspace(-10,10)... and x = Math.sqrt(Math.abs(val)) * val;
my labels are double[] labels, where y = Math.sin(val) / val
DataSetIterator dataset_train_iter = getTrainingData(x_features, y_outputs_train, batchSize, rnd);
DataSetIterator dataset_test_iter = getTrainingData(x_features_test, y_outputs_test, batchSize, rnd);
// Normalize data, including labels (fitLabel=true)
NormalizerMinMaxScaler normalizer = new NormalizerMinMaxScaler(0, 1);
normalizer.fitLabel(false);
normalizer.fit(dataset_train_iter);
normalizer.fit(dataset_test_iter);
// Use the .transform function only if you are working with a small dataset and no iterator
normalizer.transform(dataset_train_iter.next());
normalizer.transform(dataset_test_iter.next());
dataset_train_iter.setPreProcessor(normalizer);
dataset_test_iter.setPreProcessor(normalizer);
//DataSet setNormal = dataset.next();
//Create the network
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
.seed(seed)
.optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
.weightInit(WeightInit.XAVIER)
//.miniBatch(true)
//.l2(1e-4)
//.activation(Activation.TANH)
.updater(new Nesterovs(0.1,0.3))
.list()
.layer(new DenseLayer.Builder().nIn(numInputs).nOut(20).activation(Activation.TANH)
.build())
.layer(new DenseLayer.Builder().nIn(20).nOut(10).activation(Activation.TANH)
.build())
.layer( new DenseLayer.Builder().nIn(10).nOut(6).activation(Activation.TANH)
.build())
.layer(new OutputLayer.Builder(LossFunctions.LossFunction.MSE)
.activation(Activation.IDENTITY)
.nIn(6).nOut(1).build())
.build();
//Train and fit network
final MultiLayerNetwork net = new MultiLayerNetwork(conf);
net.init();
net.setListeners(new ScoreIterationListener(100));
//Train the network on the full data set, and evaluate it periodically
final INDArray[] networkPredictions = new INDArray[nEpochs / plotFrequency];
for (int i = 0; i < nEpochs; i++) {
//fit() already performs backpropagation; see the release notes:
// https://deeplearning4j.konduit.ai/release-notes/1.0.0-beta3
net.fit(dataset_train_iter);
dataset_train_iter.reset();
if((i+1) % plotFrequency == 0) networkPredictions[i/ plotFrequency] = net.output(x_features, false);
}
// evaluate and plot
dataset_test_iter.reset();
dataset_train_iter.reset();
INDArray predicted = net.output(dataset_test_iter, false);
System.out.println("PREDICTED ARRAY " + predicted);
INDArray output_train = net.output(dataset_train_iter, false);
//Revert data back to original values for plotting
// normalizer.revertLabels(predicted);
normalizer.revertLabels(output_train);
normalizer.revertLabels(predicted);
PlotUtil.plot(om, y_outputs_train, networkPredictions);
My output then looks very weird (see the picture below), even when I use miniBatch (1, 20, 100 samples/batch), change the number of epochs, or add hidden nodes and hidden layers (I tried adding 1000 nodes and 5 layers). The network either outputs very stochastic values or one constant y. I just can't figure out what is going wrong here. Why doesn't the network even come close to the training function?
Another question: what does iter.reset() do exactly? Does it move the iterator back to the first batch of the DataSetIterator?
A pretty common problem when people do toy problems like this is dl4j's assumption of minibatches (which fits 99% of problems). You aren't actually doing minibatch learning (which defeats the point of using an iterator, which is meant to iterate through slices of a dataset, not an in-memory small dataset). A small recommendation is to just use the normal DataSet API (which is what's returned from dataset.next()).
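A minimal sketch of that suggestion, assuming the whole toy dataset fits in memory (variable names taken from the question):
// Materialize the full toy dataset once instead of going through the iterator
DataSet trainData = dataset_train_iter.next();
for (int i = 0; i < nEpochs; i++) {
    net.fit(trainData);   // fit directly on the DataSet; no iterator, no reset() needed
}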
Ensure you turn off the minibatch penalty dl4j assigns to all losses with .miniBatch(false); you can see that configuration here:
https://github.com/eclipse/deeplearning4j/blob/master/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/conf/NeuralNetConfiguration.java#L434
A unit test testing this behavior can be found here:
https://github.com/eclipse/deeplearning4j/blob/b4047006ac8175df295c2f3c008e7601437ea4dc/deeplearning4j/deeplearning4j-core/src/test/java/org/deeplearning4j/gradientcheck/GradientCheckTests.java#L94
For posterity, here is the relevant configuration:
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder().miniBatch(false)
.dataType(DataType.DOUBLE)
.optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT).updater(new NoOp())
.list()
.layer(0,
new DenseLayer.Builder().nIn(4).nOut(3)
.dist(new NormalDistribution(0, 1))
.activation(Activation.TANH)
.build())
.layer(1, new OutputLayer.Builder(LossFunction.MCXENT)
.activation(Activation.SOFTMAX).nIn(3).nOut(3).build())
.build();
You'll notice two things: minibatch is set to false, and the data type is configured as double. You are welcome to try both for your problem.
To save memory, dl4j also defaults to float as the data type.
This is a reasonable assumption when working on larger problems, but may not work well for toy problems.
For reference, you can find the application of the minibatch math here:
https://github.com/eclipse/deeplearning4j/blob/fc735d30023981ebbb0fafa55ea9520ec44292e0/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/updater/BaseMultiLayerUpdater.java#L332
This affects the gradient updates.
The score penalty can be found in the output layer:
https://github.com/eclipse/deeplearning4j/blob/master/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/layers/BaseOutputLayer.java#L84
Essentially, both of these automatically apply a minibatch-size penalty to the updates for your dataset, which shows up in both the loss and the gradient updates.
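As a minimal sketch, here is how the two settings could be applied to the configuration from the question (everything else left unchanged):
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
    .miniBatch(false)             // disable the minibatch loss/gradient scaling for this toy problem
    .dataType(DataType.DOUBLE)    // double precision instead of the float default
    .seed(seed)
    .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
    .weightInit(WeightInit.XAVIER)
    .updater(new Nesterovs(0.1, 0.3))
    .list()
    .layer(new DenseLayer.Builder().nIn(numInputs).nOut(20).activation(Activation.TANH).build())
    .layer(new DenseLayer.Builder().nIn(20).nOut(10).activation(Activation.TANH).build())
    .layer(new DenseLayer.Builder().nIn(10).nOut(6).activation(Activation.TANH).build())
    .layer(new OutputLayer.Builder(LossFunctions.LossFunction.MSE)
        .activation(Activation.IDENTITY).nIn(6).nOut(1).build())
    .build();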

How to correctly make TF-IDF vectors of sentences in Apache Spark with Java?

I have this code:
public class TfIdfExample {
public static void main(String[] args){
JavaSparkContext sc = SparkSingleton.getContext();
SparkSession spark = SparkSession.builder()
.config("spark.sql.warehouse.dir", "spark-warehouse")
.getOrCreate();
JavaRDD<List<String>> documents = sc.parallelize(Arrays.asList(
Arrays.asList("this is a sentence".split(" ")),
Arrays.asList("this is another sentence".split(" ")),
Arrays.asList("this is still a sentence".split(" "))), 2);
HashingTF hashingTF = new HashingTF();
documents.cache();
JavaRDD<Vector> featurizedData = hashingTF.transform(documents);
// alternatively, CountVectorizer can also be used to get term frequency vectors
IDF idf = new IDF();
IDFModel idfModel = idf.fit(featurizedData);
featurizedData.cache();
JavaRDD<Vector> tfidfs = idfModel.transform(featurizedData);
System.out.println(tfidfs.collect());
KMeansProcessor kMeansProcessor = new KMeansProcessor();
JavaPairRDD<Vector,Integer> result = kMeansProcessor.Process(tfidfs);
result.collect().forEach(System.out::println);
}
}
I need to get Vectors for k-means, but I am getting odd Vectors:
[(1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0]),
(1048576,[455491,540177,736740,894973],[0.6931471805599453,0.0,0.0,0.0]),
(1048576,[489554,540177,560488,736740,894973],[0.28768207245178085,0.0,0.6931471805599453,0.0,0.0])]
and after k-means runs I get this:
((1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0]),1)
((1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0]),0)
((1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0]),1)
((1048576,[455491,540177,736740,894973],[0.6931471805599453,0.0,0.0,0.0]),1)
((1048576,[489554,540177,560488,736740,894973],[0.28768207245178085,0.0,0.6931471805599453,0.0,0.0]),1)
((1048576,[455491,540177,736740,894973],[0.6931471805599453,0.0,0.0,0.0]),0)
((1048576,[455491,540177,736740,894973],[0.6931471805599453,0.0,0.0,0.0]),1)
((1048576,[489554,540177,560488,736740,894973],[0.28768207245178085,0.0,0.6931471805599453,0.0,0.0]),0)
((1048576,[489554,540177,560488,736740,894973],[0.28768207245178085,0.0,0.6931471805599453,0.0,0.0]),1)
But I don't think it is working correctly, because the TF-IDF vectors should look different.
I think MLlib has ready-made methods for this, but I tested the documentation examples and didn't get what I need, and I haven't found a custom solution for Spark. Maybe somebody has worked with it and can tell me what I am doing wrong? Maybe I am not using the MLlib functionality correctly?
What you are getting after TF-IDF is a SparseVector.
To understand the values better, let me start with TF vectors:
(1048576,[489554,540177,736740,894973],[1.0,1.0,1.0,1.0])
(1048576,[455491,540177,736740,894973],[1.0,1.0,1.0,1.0])
(1048576,[489554,540177,560488,736740,894973],[1.0,1.0,1.0,1.0,1.0])
For instance, the TF vector corresponding to the first sentence is a 1048576-component (= 2^20) vector, with 4 non-zero values at indices 489554, 540177, 736740 and 894973; all other values are zeros and are therefore not stored in the sparse vector representation.
The dimensionality of the feature vectors is equal to the number of buckets you hash into: 1048576 = 2^20 buckets in your case.
For a corpus of this size, you should consider reducing the number of buckets:
HashingTF hashingTF = new HashingTF(32);
Powers of 2 are recommended to minimize the number of hash collisions.
Next, you apply IDF weights:
(1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0])
(1048576,[455491,540177,736740,894973],[0.6931471805599453,0.0,0.0,0.0])
(1048576,[489554,540177,560488,736740,894973],[0.28768207245178085,0.0,0.6931471805599453,0.0,0.0])
If we look at the first sentence again, we get 3 zeros, which is expected: the terms "this", "is", and "sentence" appear in every document of the corpus, so by the definition of IDF they are equal to zero.
Why are the zero values still in the (sparse) vector? Because in the current implementation the size of the vector is kept the same and only the values are multiplied by the IDF.
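For reference, the non-zero values can be reproduced from the smoothed IDF formula documented for MLlib, idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing term t. With m = 3 here, a term appearing in 2 of the 3 documents (like "a") gets log(4/3) ≈ 0.2877, a term appearing in a single document (like "another" or "still") gets log(4/2) ≈ 0.6931, and a term appearing in all 3 documents gets log(4/4) = 0; these are exactly the values shown above.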

Train recurrent neural net in deeplearning4j with data that is generated during runtime

I'm new to the deeplearning4j library, but I've got some experience with neural networks in general.
I'm trying to train a recurrent neural network (an LSTM in particular) which is supposed to detect beats in music in real time. All examples for using recurrent neural nets with deeplearning4j that I've found so far use a reader which reads the training data from a file. As I want to record music in real time via a microphone, I can't read a pregenerated file, so the data which is fed into the neural network is generated in real time by my application.
This is the code that I'm using to generate my network:
NeuralNetConfiguration.ListBuilder builder = new NeuralNetConfiguration.Builder()
.optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT).iterations(1)
.learningRate(0.1)
.rmsDecay(0.95)
.regularization(true)
.l2(0.001)
.weightInit(WeightInit.XAVIER)
.updater(Updater.RMSPROP)
.list();
int nextIn = hiddenLayers.length > 0 ? hiddenLayers[0] : numOutputs;
builder = builder.layer(0, new GravesLSTM.Builder().nIn(numInputs).nOut(nextIn).activation("softsign").build());
for(int i = 0; i < hiddenLayers.length - 1; i++){
nextIn = hiddenLayers[i + 1];
builder = builder.layer(i + 1, new GravesLSTM.Builder().nIn(hiddenLayers[i]).nOut(nextIn).activation("softsign").build());
}
builder = builder.layer(hiddenLayers.length, new RnnOutputLayer.Builder(LossFunctions.LossFunction.MCXENT).nIn(nextIn).nOut(numOutputs).activation("softsign").build());
MultiLayerConfiguration conf = builder.backpropType(BackpropType.TruncatedBPTT).tBPTTForwardLength(DEFAULT_RECURRENCE_DEPTH).tBPTTBackwardLength(DEFAULT_RECURRENCE_DEPTH)
.pretrain(false).backprop(true)
.build();
net = new MultiLayerNetwork(conf);
net.init();
In this case I'm using about 700 inputs (which is mostly FFT-data of the recorded audio), 1 output (which is supposed to output a number between 0 [no beat] and 1 [beat]) and my hiddenLayers array consists of the ints {50, 25, 10}.
For getting the output of the network I'm using this code:
double[] output = new double[]{net.rnnTimeStep(Nd4j.create(netInputData)).getDouble(0)};
where netInputData is the data I want to input into the network as a one-dimensional double array.
I'm relatively sure that this code is working fine, since I get some output for an untrained network which looks something like this when I plot it.
However, once I try to train a network (even if I train it just for a short time, which should alter the weights of the network just a little bit, so that the output should be very similar to the untrained network), I get an output which looks like a constant.
This is the code which I'm using to train the network:
for(int timestep = 0; timestep < trainingData.length - DEFAULT_RECURRENCE_DEPTH; timestep++){
INDArray inputDataArray = Nd4j.create(new int[]{1, numInputs, DEFAULT_RECURRENCE_DEPTH},'f');
for(int inputPos = 0; inputPos < trainingData[timestep].length; inputPos++)
for(int inputTimeWindowPos = 0; inputTimeWindowPos < DEFAULT_RECURRENCE_DEPTH; inputTimeWindowPos++)
inputDataArray.putScalar(new int[]{0, inputPos, inputTimeWindowPos}, trainingData[timestep + inputTimeWindowPos][inputPos]);
INDArray desiredOutputDataArray = Nd4j.create(new int[]{1, numOutputs, DEFAULT_RECURRENCE_DEPTH},'f');
for(int outputPos = 0; outputPos < desiredOutputData[timestep].length; outputPos++)
for(int inputTimeWindowPos = 0; inputTimeWindowPos < DEFAULT_RECURRENCE_DEPTH; inputTimeWindowPos++)
desiredOutputDataArray.putScalar(new int[]{0, outputPos, inputTimeWindowPos}, desiredOutputData[timestep + inputTimeWindowPos][outputPos]);
net.fit(new DataSet(inputDataArray, desiredOutputDataArray));
}
Once again, I've got my data for the input and for the desired output as a double array. This time the two arrays are two-dimensional. The first index represents the time (where index 0 is the first audio data of the recorded audio) and the second index represents the input (or respectively the desired output) for this time step.
Given the output shown after training the network, I tend to think that there must be something wrong with the code I use for creating the INDArrays from my data. Am I missing some important step for initializing these arrays, or did I mess up the order in which I need to put my data into them?
Thank you for any help in advance.
I'm not sure, but perhaps 99.99% of your training examples are 0, with only an occasional 1 exactly where the beat occurs. This might be too imbalanced to learn. Good luck.

Creating dense matrix using org.javatuples.Pair and HashMap is too slow

I have a dense symmetric matrix of size about 30000 X 30000 that contains distances between strings. Since the distance is symmetric, the upper triangle of the matrix is stored in a tab-separated 3-column file of the form
stringA<tab>stringB<tab>distance
I am using HashMap and org.javatuples.Pair to create a map to quickly look up distances for given pairs of strings, as follows:
import org.javatuples.Pair;
HashMap<Pair<String,String>,Double> pairScores = new HashMap<Pair<String,String>,Double>();
BufferedReader bufferedReader = new BufferedReader(new FileReader("data.txt"));
String line = null;
while((line = bufferedReader.readLine()) != null) {
String [] parts = line.split("\t");
String d1 = parts[0];
String d2 = parts[1];
Double score = Double.parseDouble(parts[2]);
Pair<String,String> p12 = new Pair<String,String>(d1,d2);
Pair<String,String> p21 = new Pair<String,String>(d2,d1);
pairScores.put(p12, score);
pairScores.put(p21, score);
}
data.txt is very big (~400M lines) and the process eventually slows down to a crawl with most time being spent in java.util.HashMap.put.
I don't think there should be (m)any hash code collisions on the pairs, but I might be wrong. How can I verify this? Is it enough to simply look at how unique p12.hashCode() and p21.hashCode() are?
If there are no collisions, what else could be causing the slowdown?
Is there a better way to construct this matrix for quick lookup?
I am now using Guava's Table<Integer, Integer, Double> after realizing that my strings are unique enough that I could use their hashes, instead of the strings themselves, as keys, to reduce memory requirements. The creation of the table runs in reasonable time; however, there are issues with serializing and deserializing the resulting objects: I ran into out-of-memory errors even after the move from String to Integer. It seems to be working now that I have decided not to store both the a-b and b-a pairs, but I might be balancing on the edge of what my machine can handle.
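For reference, a minimal sketch of that approach, assuming Guava is on the classpath and that String.hashCode() is collision-free enough for this corpus (an assumption, not a guarantee); the canonical min/max key ordering is one way to store each pair only once while keeping lookups symmetric:
import com.google.common.collect.HashBasedTable;
import com.google.common.collect.Table;
Table<Integer, Integer, Double> scores = HashBasedTable.create();
BufferedReader reader = new BufferedReader(new FileReader("data.txt"));
String line;
while ((line = reader.readLine()) != null) {
    String[] parts = line.split("\t");
    int h1 = parts[0].hashCode();
    int h2 = parts[1].hashCode();
    // store only one orientation; order the keys so that lookups can do the same
    scores.put(Math.min(h1, h2), Math.max(h1, h2), Double.parseDouble(parts[2]));
}
// symmetric lookup:
int a = "stringA".hashCode(), b = "stringB".hashCode();
Double distance = scores.get(Math.min(a, b), Math.max(a, b));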
