The base problem is trying to use a custom data model to create a DataSetIterator to be used in a deeplearning4j network.
The data model I am trying to work with is a Java class that holds a bunch of doubles created from quotes on a specific stock, such as timestamp, open, close, high, low, volume, technical indicator 1, technical indicator 2, etc.
I query an internet source (example, plus several other indicators from the same site) which provides JSON strings that I convert into my data model for easier access and to store in an SQLite database.
Now I have a List of these data models that I would like to use to train an LSTM network, each double being a feature. Per the Deeplearning4j documentation and several examples, the way to use training data is to use the ETL processes described here to create a DataSetIterator which is then used by the network.
I don't see a clean way to convert my data model using any of the provided RecordReaders without first converting them to some other format, such as a CSV or other file. I would like to avoid this because it would use up a lot of resources. It seems like there would be a better way to do this simple case. Is there a better approach that I am just missing?
Ethan!
First of all, Deeplearning4j uses ND4J as its backend, so your data will eventually have to be converted into INDArray objects in order to be used in your model. If your training data is two arrays of doubles, inputsArray and desiredOutputsArray, you can do the following:
INDArray inputs = Nd4j.create(inputsArray, new int[]{numSamples, inputDim});
INDArray desiredOutputs = Nd4j.create(desiredOutputsArray, new int[]{numSamples, outputDim});
And then you can train your model using those vectors directly:
for (int epoch = 0; epoch < nEpochs; epoch++)
    model.fit(inputs, desiredOutputs);
Alternatively you can create a DataSet object and use it for training:
DataSet ds = new DataSet(inputs, desiredOutputs);
for (int epoch = 0; epoch < nEpochs; epoch++)
    model.fit(ds);
But creating a custom iterator is the safest approach, especially with larger datasets, since it gives you more control over your data and keeps things organized.
In your DataSetIterator implementation you pass in your data, and in the implementation of the next() method you return a DataSet object comprising the next batch of your training data. It would look like this:
public class MyCustomIterator implements DataSetIterator {
    private INDArray inputs, desiredOutputs;
    private int itPosition = 0; // the iterator position in the set.

    public MyCustomIterator(float[] inputsArray,
                            float[] desiredOutputsArray,
                            int numSamples,
                            int inputDim,
                            int outputDim) {
        inputs = Nd4j.create(inputsArray, new int[]{numSamples, inputDim});
        desiredOutputs = Nd4j.create(desiredOutputsArray, new int[]{numSamples, outputDim});
    }

    public DataSet next(int num) {
        // get a view containing the next num samples and desired outs.
        INDArray dsInput = inputs.get(
            NDArrayIndex.interval(itPosition, itPosition + num),
            NDArrayIndex.all());
        INDArray dsDesired = desiredOutputs.get(
            NDArrayIndex.interval(itPosition, itPosition + num),
            NDArrayIndex.all());

        itPosition += num;

        return new DataSet(dsInput, dsDesired);
    }

    // implement the remaining DataSetIterator interface methods...
}
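For reference, a few of the remaining methods might be filled in along these lines inside the same class (a sketch only: the exact set of required methods varies by DL4J version, and the batchSize field used here is something you would add and set yourself):
    @Override
    public boolean hasNext() {
        return itPosition < inputs.rows();
    }

    @Override
    public DataSet next() {
        return next(batchSize); // batchSize: a field you would add and initialize in the constructor
    }

    @Override
    public void reset() {
        itPosition = 0;
    }

    @Override
    public int inputColumns() {
        return inputs.columns();
    }

    @Override
    public int totalOutcomes() {
        return desiredOutputs.columns();
    }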
The NDArrayIndex methods you see above are used to access parts of an INDArray. Now you can use it for training:
MyCustomIterator it = new MyCustomIterator(
    inputsArray,
    desiredOutputsArray,
    numSamples,
    inputDim,
    outputDim);

for (int epoch = 0; epoch < nEpochs; epoch++)
    model.fit(it);
This example will be particularly useful to you, since it implements an LSTM network and has a custom iterator implementation (which can serve as a guide for implementing the remaining methods). Also, for more information on NDArray, this is helpful: it gives detailed information on creating, modifying and accessing parts of an NDArray.
deeplearning4j creator here.
You should not, in any but very special settings, create a data set iterator yourself. You should be using DataVec. We cover this in numerous places, ranging from our DataVec page to our examples:
https://deeplearning4j.konduit.ai/datavec/overview
https://github.com/eclipse/deeplearning4j-examples
DataVec is our dedicated library for doing data transformations. You create custom record readers for your use case. For legacy reasons, Deeplearning4j has a few "special" iterators for certain datasets; many of those came before DataVec existed. We built DataVec as a way of pre-processing data.
Now you use the RecordReaderDataSetIterator and SequenceRecordReaderDataSetIterator (see our Javadoc for more information) and their multi-dataset equivalents.
If you do this, you don't have to worry about masking, thread safety, or anything else that involves fast loading of data.
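As a sketch of what that can look like for in-memory data such as the quote objects in the question, without writing any CSV files: flatten each record to Writables, wrap them in DataVec's CollectionRecordReader (org.datavec.api.records.reader.impl.collection), and hand that to a RecordReaderDataSetIterator (org.deeplearning4j.datasets.datavec). The variable names and the regression-style constructor below are assumptions; for sequence/LSTM training the analogues would be CollectionSequenceRecordReader and SequenceRecordReaderDataSetIterator:
// 'quotesAsDoubleArrays', 'batchSize' and 'labelIndex' are placeholders for your
// own data and settings; each double[] holds one quote's features.
List<List<Writable>> records = new ArrayList<>();
for (double[] quote : quotesAsDoubleArrays) {
    List<Writable> record = new ArrayList<>();
    for (double value : quote) {
        record.add(new DoubleWritable(value));
    }
    records.add(record);
}

CollectionRecordReader reader = new CollectionRecordReader(records);
// Regression-style constructor: label column labelIndex..labelIndex, regression = true.
DataSetIterator iterator = new RecordReaderDataSetIterator(reader, batchSize, labelIndex, labelIndex, true);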
As an aside, I would love to know where you got the idea to create your own iterator; we now have it right in our readme not to do that. If there's another place you were looking that isn't obvious, we would love to fix that.
Edit:
I've updated the links to the new pages. This post is very old now.
Please see the new links here:
https://deeplearning4j.konduit.ai/datavec/overview
https://github.com/eclipse/deeplearning4j-examples
Related
How to fetch the layer input and output size in dl4j?
For example, something like this:
MultiLayerNetwork network = model.init();
for (Layer layer : network.getLayers()) {
    int[] outputShape = layer.shape();
}
It is a bit more involved than that because DL4J supports layers which are more complex than simple dense or fully connected layers.
If you want to have that information to print it out, it is probably easier to use
String summary = model.summary();
If you want to do something with that information, you can take a look at the implementation of the summary method itself.
https://github.com/eclipse/deeplearning4j/blob/master/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/multilayer/MultiLayerNetwork.java#L3636-L3757
In particular, lines 3679 to 3699, as they are all about getting the input and output sizes of layers.
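For a rough idea of what such code can look like for the common feed-forward style layers (treat the exact getters as assumptions to verify against your DL4J version's Javadoc; the relevant classes are org.deeplearning4j.nn.api.Layer, org.deeplearning4j.nn.conf.layers.FeedForwardLayer and org.deeplearning4j.nn.multilayer.MultiLayerNetwork):
// Sketch: dense/feed-forward style layer configurations expose nIn/nOut directly;
// other layer types need their own handling, which is what summary() does internally.
static void printLayerSizes(MultiLayerNetwork network) {
    for (Layer layer : network.getLayers()) {
        org.deeplearning4j.nn.conf.layers.Layer confLayer = layer.conf().getLayer();
        if (confLayer instanceof FeedForwardLayer) {
            FeedForwardLayer ff = (FeedForwardLayer) confLayer;
            System.out.println(ff.getLayerName() + ": nIn=" + ff.getNIn() + ", nOut=" + ff.getNOut());
        }
    }
}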
I need to modify a file. We've already written a reasonably complex component to build sets of indexes describing where interesting things are in this file, but now I need to edit this file using that set of indexes and that's proving difficult.
Specifically, my dream API is something like this
//if you'll let me use kotlin for a second, assume we have a simple tuple class
data class IdentifiedCharacterSubsequence(val indexOfFirstChar: Int, val existingContent: String)
//given these two structures
List<IdentifiedCharacterSubsequence> interestingSpotsInFile = scanFileAsPerExistingBusinessLogic(file, businessObjects);
Map<IdentifiedCharacterSubsequence, String> newContentByPreviousContentsLocation = generateNewValues(interestingSpotsInFile, moreBusinessObjects);
//I want something like this:
try (MutableFile mutableFile = new com.maybeGoogle.orApache.MutableFile(file)) {
    for (IdentifiedCharacterSubsequence seqToReplace : interestingSpotsInFile) {
        String newContent = newContentByPreviousContentsLocation.get(seqToReplace);
        mutableFile.replace(seqToReplace.indexOfFirstChar, seqToReplace.existingContent.length(), newContent);
        // very similar to the StringBuilder interface
        // 'enqueues' data changes in memory, doesn't actually modify the file until flush is called...
    }
    mutableFile.flush();
    // ...at which point a single write-pass is made.
    // assumption: changes will touch many small regions of text (instead of large portions of text)
    // -> buffering makes sense
}
Some notes:
I can't use RandomAccessFile because my changes are not in-place (the length of newContent may be longer or shorter than that of seq.existingContent).
The files are often many megabytes big, thus simply reading the whole thing into memory and modifying it as an array is not appropriate.
Does something like this exist, or am I reduced to writing my own implementation using BufferedWriters and the like? It seems like such an obvious evolution from io.Streams for a language which typically emphasizes index-based behaviour heavily, but I can't find an existing implementation.
Lastly: I have very little domain experience with files and encoding schemes, so I have taken no effort to address the 'two-index' characters described in questions like this: Java charAt used with characters that have two code units. Any help on this front is much appreciated. Is this perhaps why I'm having trouble finding an implementation like this, because indexes in UTF-8 encoded files are so pesky and bug-prone?
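For what it's worth, a single write-pass along the lines the dream API describes can be sketched with plain JDK I/O: stream the original file, copy the untouched spans, and substitute the queued replacements into a temporary file that then replaces the original. This assumes the replacements are sorted by index and non-overlapping, that indexes count UTF-16 code units (so it shares the surrogate-pair caveat above), and all names below are invented:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Hypothetical replacement descriptor mirroring IdentifiedCharacterSubsequence above.
final class Replacement {
    final int indexOfFirstChar;   // char (UTF-16 code unit) offset in the file
    final String existingContent; // text currently at that offset
    final String newContent;      // text to write instead

    Replacement(int indexOfFirstChar, String existingContent, String newContent) {
        this.indexOfFirstChar = indexOfFirstChar;
        this.existingContent = existingContent;
        this.newContent = newContent;
    }
}

final class SinglePassRewriter {
    // Replacements must be sorted by indexOfFirstChar and must not overlap.
    static void rewrite(Path source, List<Replacement> replacements) throws IOException {
        Path tmp = Files.createTempFile(source.getParent(), source.getFileName().toString(), ".tmp");
        try (BufferedReader in = Files.newBufferedReader(source, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            long pos = 0; // chars consumed from the input so far
            char[] buf = new char[8192];
            for (Replacement r : replacements) {
                // Copy the untouched span before this replacement.
                long toCopy = r.indexOfFirstChar - pos;
                while (toCopy > 0) {
                    int n = in.read(buf, 0, (int) Math.min(buf.length, toCopy));
                    if (n < 0) throw new IOException("Replacement index past end of file");
                    out.write(buf, 0, n);
                    toCopy -= n;
                }
                // Skip the old content in the input and write the new content instead.
                long toSkip = r.existingContent.length();
                while (toSkip > 0) {
                    long skipped = in.skip(toSkip);
                    if (skipped <= 0) throw new IOException("Could not skip replaced content");
                    toSkip -= skipped;
                }
                out.write(r.newContent);
                pos = r.indexOfFirstChar + r.existingContent.length();
            }
            // Copy the remainder of the file unchanged.
            int n;
            while ((n = in.read(buf)) >= 0) {
                out.write(buf, 0, n);
            }
        }
        Files.move(tmp, source, StandardCopyOption.REPLACE_EXISTING);
    }
}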
I am trying to create a custom map format for my own little 2D RPG, so my question is really how to manage reading and creating a custom map format properly and flexibly. First off, I am writing my code in Java. The idea was to have a class called 'TileMap'. This class defines a 2-dimensional integer array where all my entities are stored (I'm using an entity system to realize my game). I also want to save and parse some information about the size of the map before the actual reading process happens. The map file should look much like this:
#This is a test map
width=4
height=3
layercount=1
tilesize=32
[1;0;0;0]
[23;1;0;0]
[5;0;1;0]
where layercount is the number of layers the z-dimension offers, and tilesize is the size of every tile in pixels. Entities are defined in between the brackets; the pattern goes: [entity_id;x_pos;y_pos;z_pos]. I already wrote the code to parse a file like this, but it's not very flexible: put one tiny whitespace in front of the square brackets and the map can't load. I just need a few helpful tips to do this in a flexible way. Can anybody help me out?
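A sketch of a more forgiving, line-based parser for the format above: trim each line and match it with regular expressions so stray whitespace no longer breaks loading. The class and method names below are just placeholders:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A whitespace-tolerant reader for the map format shown above.
class TileMapParser {

    private static final Pattern PROPERTY = Pattern.compile("\\s*(\\w+)\\s*=\\s*(\\d+)\\s*");
    private static final Pattern ENTITY =
            Pattern.compile("\\s*\\[\\s*(\\d+)\\s*;\\s*(\\d+)\\s*;\\s*(\\d+)\\s*;\\s*(\\d+)\\s*]\\s*");

    static void parse(Reader source) throws IOException {
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("#")) {
                    continue; // skip blank lines and comments
                }
                Matcher property = PROPERTY.matcher(line);
                Matcher entity = ENTITY.matcher(line);
                if (property.matches()) {
                    String key = property.group(1);                // e.g. "width"
                    int value = Integer.parseInt(property.group(2));
                    // store the map property (width, height, layercount, tilesize)...
                } else if (entity.matches()) {
                    int id = Integer.parseInt(entity.group(1));
                    int x = Integer.parseInt(entity.group(2));
                    int y = Integer.parseInt(entity.group(3));
                    int z = Integer.parseInt(entity.group(4));
                    // place entity 'id' into the tile array at (x, y, z)...
                } else {
                    throw new IOException("Unrecognised line: " + line);
                }
            }
        }
    }
}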
I think there may be 3 different ways to solve that:
First, you can use a Map of Maps: Map<Serializable, Map<String, Object>>, where Serializable is your entity_id and the inner map holds the attributes you need, like ("width", 4), ("height", 3):
public static final String WIDTH = "WIDTH";
public static final String HEIGHT = "HEIGHT";
...
Map<String,Object> mapProperties = new HashMap<String,Object>();
mapProperties.put(WIDTH, 4);
mapProperties.put(HEIGHT, 3);
....
Map<Serializable,Map<String,Object>> map = new HashMap<Serializable,Map<String,Object>>();
map.put(myEntity.getId(), mapProperties);
Second way could be like this: http://java.dzone.com/articles/hashmap-%E2%80%93-single-key-and
Third way could be like this: Java Tuple Without Creating Multiple Type Parameters
I am calling the Google Protocol Buffers Java API from Matlab. This works pretty well, but I have hit a big performance bottleneck. The bulk of the data are returned as objects of type:
java.util.Collections$UnmodifiableRandomAccessList
They actually contain a list of floats. I need to convert this to a Matlab matrix. The best approach I have found so far is to call:
cell2mat(cell(Q.toArray()))
However, that one line is a huge performance bottleneck in the code.
Note I am aware of the FarSounder Matlab parser generators for the Google Protocol Buffers; unfortunately these are very slow. See below for some rough benchmark speeds for my problem (YMMV). Higher is better.
Farsounder Matlab: 0.03
Pure Python: 1
Java API called from Matlab (parsing and extracting metadata only): 10
Java API called from Matlab (parsing and extracting both metadata and data): 0.25
If it wasn't for the overhead of converting the java.util.Collections$UnmodifiableRandomAccessList to a Matlab matrix, the approach of calling the Java API from Matlab would look quite promising.
Is there a better way of converting this Java object into a Matlab matrix?
Bear in mind that the method returning this type is in automatically generated code.
You might be best off writing a tiny piece of extra Java code, like so:
import java.util.List;
import java.util.ListIterator;

class Helper {
    public static float[] toFloatArray(List l) {
        float retValue[] = new float[l.size()];
        ListIterator iterator = l.listIterator();
        for (int idx = 0; idx < retValue.length; ++idx) {
            // List had better contain float values,
            // or else the following line will ClassCastException.
            retValue[idx] = (float) iterator.next();
        }
        return retValue;
    }
}
with which I see:
>> j = java.util.LinkedList;
>> for idx = 1:1e5, j.add(single(idx)); end
>> tic, out = Helper.toFloatArray(j); toc
Elapsed time is 0.006553 seconds.
>> tic, cell2mat(cell(j.toArray)); toc
Elapsed time is 0.305973 seconds.
In my experience, the most performant solution is to write a little set of Java helpers that convert the lists to plain arrays of primitive types.
These map well onto Matlab matrices.
If the above, for example, gives an array of java.lang.Float, the helper could look like this:
public static float[] toFloats(Float[] floats) {
    float[] rv = new float[floats.length];
    for (int i = 0; i < floats.length; i++) rv[i] = (float) floats[i];
    return rv;
}
In Matlab, cell2mat(cell(Q.toArray())) would hence become:
some.package.toFloats(Q.toArray());
Obviously you could modify the helper function to directly take your list as well, avoiding the need for the toArray() call (does this actually make a copy?).
I am trying to use Weka for feature selection using PCA algorithm.
My original feature space contains ~9000 attributes, in 2700 samples.
I tried to reduce dimensionality of the data using the following code:
AttributeSelection selector = new AttributeSelection();
PrincipalComponents pca = new PrincipalComponents();
Ranker ranker = new Ranker();
selector.setEvaluator(pca);
selector.setSearch(ranker);
Instances instances = SamplesManager.asWekaInstances(trainSet);
try {
    selector.SelectAttributes(instances);
    return SamplesManager.asSamplesList(selector.reduceDimensionality(instances));
} catch (Exception e) {
    ...
}
However, it did not finish running within 12 hours. It is stuck in the method selector.SelectAttributes(instances).
My questions are:
Is such a long computation time expected for Weka's PCA, or am I using PCA incorrectly?
If the long run time is expected:
How can I tune the PCA algorithm to run much faster? Can you suggest an alternative (plus example code showing how to use it)?
If it is not:
What am I doing wrong? How should I invoke PCA using weka and get my reduced dimensionality?
Update: The comments confirm my suspicion that it is taking much more time than expected.
I'd like to know: how can I run PCA in Java, using Weka or an alternative library?
Added a bounty for this one.
After digging into the WEKA code, the bottleneck is creating the covariance matrix and then calculating the eigenvectors of that matrix. Even trying to switch to a sparse matrix implementation (I used COLT's SparseDoubleMatrix2D) did not help.
The solution I came up with was to first reduce the dimensionality using a fast method (I used information gain ranking and filtering based on document frequency), and then use PCA on the reduced feature set to reduce it further.
The code is more complex, but it essentially comes down to this:
Ranker ranker = new Ranker();
InfoGainAttributeEval ig = new InfoGainAttributeEval();
Instances instances = SamplesManager.asWekaInstances(trainSet);
ig.buildEvaluator(instances);
int[] firstAttributes = ranker.search(ig, instances);
int[] candidates = Arrays.copyOfRange(firstAttributes, 0, FIRST_SIZE_REDUCTION);
instances = reduceDimensions(instances, candidates);

PrincipalComponents pca = new PrincipalComponents();
pca.setVarianceCovered(var);
ranker = new Ranker();
ranker.setNumToSelect(numFeatures);
AttributeSelection selection = new AttributeSelection();
selection.setEvaluator(pca);
selection.setSearch(ranker);
selection.SelectAttributes(instances);
instances = selection.reduceDimensionality(instances);
However, this method scored worse than using greedy information gain with a ranker when I cross-validated for estimated accuracy.
It looks like you're using the default configuration for the PCA which, judging by the long runtime, is likely doing way too much work for your purposes.
Take a look at the options for PrincipalComponents.
I'm not sure if -D means it will normalize the data for you or if you have to do it yourself. You want your data to be normalized (centered about the mean), though, so I would do this manually first.
-R sets the amount of variance you want accounted for. The default is 0.95. The correlation in your data might not be that strong, so try setting it lower, to something like 0.8.
-A sets the maximum number of attributes to include. I presume the default is all of them. Again, you could try setting it to something lower.
I suggest first starting out with very lax settings (e.g. -R=0.1 and -A=2) and then working your way up to acceptable results.
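If those flags map onto setters the way they do in recent Weka releases (an assumption worth checking against your version's Javadoc), the same tuning can be done programmatically on the pca and selector objects from the question's code, roughly:
pca.setVarianceCovered(0.8);      // corresponds to -R
pca.setMaximumAttributeNames(2);  // corresponds to -A
selector.setEvaluator(pca);
selector.setSearch(ranker);
selector.SelectAttributes(instances);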
Best
For the construction of your covariance matrix, you can use the following formula, which is also used by Matlab. It is faster than the Apache library.
Here Matrix is an m x n matrix (m --> #databaseFaces).
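Presumably the formula referred to is the standard sample-covariance computation that Matlab's cov applies, which in matrix form is roughly:
C = (X - mean(X))' * (X - mean(X)) / (m - 1)
where X is the m x n data matrix (one row per face) and mean(X) is the row vector of its column means.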