I am calling the Google Protocol Buffers Java API from Matlab. This works pretty well, but I have hit a big performance bottleneck. The bulk of the data are returned as objects of type:
java.util.Collections$UnmodifiableRandomAccessList
They actually contain a list of floats. I need to convert this to a Matlab matrix. The best approach I have found so far is to call:
cell2mat(cell(Q.toArray()))
However, that one line is a huge performance bottleneck in the code.
Note I am aware of the FarSounder Matlab parser generators for the Google Protocol Buffers, unfortunately these are very slow. See below for some rough benchmark speeds for my problem (YMMV). High is good.
Farsounder Matlab: 0.03
Pure Python: 1
Java API called from Matlab (parsing and extracting metadata only): 10
Java API called from Matlab (parsing and extracting both metadata and data): 0.25
If it wasn't for the overhead of converting the java.util.Collections$UnmodifiableRandomAccessList
to a Matlab matrix, then the approach of calling the Java API from Matlab would look quite promising.
Is there a better way of converting this Java object into a Matlab matrix?
Bear in mind that the method returning this type is in automatically generated code.
You might be best writing a tiny piece of extra java code, like so:
import java.util.List;
import java.util.ListIterator;
class Helper {
public static float[] toFloatArray(List l) {
float retValue[] = new float[l.size()];
ListIterator iterator = l.listIterator();
for (int idx = 0; idx < retValue.length; ++idx ){
// List had better contain float values,
// or else the following line will ClassCastException.
retValue[idx] = (float) iterator.next();
}
return retValue;
}
}
with which I see:
>> j = java.util.LinkedList;
>> for idx = 1:1e5, j.add(single(idx)); end
>> tic, out = Helper.toFloatArray(j); toc
Elapsed time is 0.006553 seconds.
>> tic, cell2mat(cell(j.toArray)); toc
Elapsed time is 0.305973 seconds.
In my experience, the most performant solution is write a little set of java helpers, that converts the lists to plain arrays of primitive types.
These are well mapped to matrices by matlab.
If the above e.g. gives a an array of java.lang.Floats, the helper could look like this:
public static float[] toFloats(Float[] floats) {
float[] rv = new float[floats.length];
for (int i=0; i < floats.length; i++) rv[i] = (float) floats[i];
return rv;
}
In matlab cell2mat(cell(Q.toArray())) hence would become:
some.package.toFloats(Q.toArray());
Obviously you could modify the helper function to directly take your list as well, avoiding the need for the toArray() call (does this actually make a copy?).
Related
The base problem is trying to use a custom data model to create a DataSetIterator to be used in a deeplearning4j network.
The data model I am trying to work with is a java class that holds a bunch of doubles, created from quotes on a specific stock, such as timestamp, open, close, high, low, volume, technical indicator 1, technical indicator 2, etc.
I query an internet source, example, (also several other indicators from the same site) which provide json strings that I convert into my data model for easier access and to store in an sqlite database.
Now I have a List of these data models that I would like to use to train an LSTM network, each double being a feature. Per the Deeplearning4j documentation and several examples, the way to use training data is to use the ETL processes described here to create a DataSetIterator which is then used by the network.
I don't see a clean way to convert my data model using any of the provided RecordReaders without first converting them to some other format, such as a CSV or other file. I would like to avoid this because it would use up a lot of resources. It seems like there would be a better way to do this simple case. Is there a better approach that I am just missing?
Ethan!
First of all, Deeplearning4j uses ND4j as backend, so your data will have to eventually be converted into INDArray objects in order to be used in your model. If your trianing data is two array of doubles, inputsArray and desiredOutputsArray, you can do the following:
INDArray inputs = Nd4j.create(inputsArray, new int[]{numSamples, inputDim});
INDArray desiredOutputs = Nd4j.create(desiredOutputsArray, new int[]{numSamples, outputDim});
And then you can train your model using those vectors directly:
for (int epoch = 0; epoch < nEpochs; epoch++)
model.fit(inputs, desiredOutputs);
Alternatively you can create a DataSet object and used it for training:
DataSet ds = new DataSet(inputs, desiredOutputs);
for (int epoch = 0; epoch < nEpochs; epoch++)
model.fit(ds);
But creating a custom iterator is the safest approach, specially in larger sets since it gives you more control over your data and keep things organized.
In your DataSetIterator implementation you must pass your data and in the implementation of the next() method you should return a DataSet object comprising the next batch of your training data. It would look like this:
public class MyCustomIterator implements DataSetIterator {
private INDArray inputs, desiredOutputs;
private int itPosition = 0; // the iterator position in the set.
public MyCustomIterator(float[] inputsArray,
float[] desiredOutputsArray,
int numSamples,
int inputDim,
int outputDim) {
inputs = Nd4j.create(inputsArray, new int[]{numSamples, inputDim});
desiredOutputs = Nd4j.create(desiredOutputsArray, new int[]{numSamples, outputDim});
}
public DataSet next(int num) {
// get a view containing the next num samples and desired outs.
INDArray dsInput = inputs.get(
NDArrayIndex.interval(itPosition, itPosition + num),
NDArrayIndex.all());
INDArray dsDesired = desiredOutputs.get(
NDArrayIndex.interval(itPosition, itPosition + num),
NDArrayIndex.all());
itPosition += num;
return new DataSet(dsInput, dsDesired);
}
// implement the remaining virtual methods...
}
The NDArrayIndex methods you see above are used to access parts of a INDArray. Then now you can use it for training:
MyCustomIterator it = new MyCustomIterator(
inputs,
desiredOutputs,
numSamples,
inputDim,
outputDim);
for (int epoch = 0; epoch < nEpochs; epoch++)
model.fit(it);
This example will be particularly useful to you, since it implements a LSTM network and it has a custom iterator implementation (which can be a guide for implementing the remaining methods). Also, for more information on NDArray, this is helpful. It gives detailed information on creating, modifying and accessing parts of an NDArray.
deeplearning4j creator here.
You should not in any but all very special setting create a data set iterator. You should be using datavec. We cover this in numerous places ranging from our data vec page to our examples:
https://deeplearning4j.konduit.ai/datavec/overview
https://github.com/eclipse/deeplearning4j-examples
Datavec is our dedicated library for doing data transformations. You create custom record readers for your use case. Deeplearning4j for legacy reasons has a few "special" iterators for certain datasets. Many of those came before datavec existed. We built datavec as a way of pre processing data.
Now you use the RecordReaderDataSetIterator, SequenceRecordReaderDataSetIterator (see our javadoc for more information) and their multi dataset equivalents.
If you do this, you don't have to worry about masking, thread safety, or anything else that involves fast loading of data.
As an aside, I would love to know where you are getting the idea to create your own iterator, we now have it right in our readme not to do that. If there's another place you were looking that is not obvious, we would love to fix that.
Edit:
I've updated the links to the new pages. This post is very old now.
Please see the new links here:
https://deeplearning4j.konduit.ai/datavec/overview
https://github.com/eclipse/deeplearning4j-examples
I am trying to use Weka for feature selection using PCA algorithm.
My original feature space contains ~9000 attributes, in 2700 samples.
I tried to reduce dimensionality of the data using the following code:
AttributeSelection selector = new AttributeSelection();
PrincipalComponents pca = new PrincipalComponents();
Ranker ranker = new Ranker();
selector.setEvaluator(pca);
selector.setSearch(ranker);
Instances instances = SamplesManager.asWekaInstances(trainSet);
try {
selector.SelectAttributes(instances);
return SamplesManager.asSamplesList(selector.reduceDimensionality(instances));
} catch (Exception e ) {
...
}
However, It did not finish to run within 12 hours. It is stuck in the method selector.SelectAttributes(instances);.
My questions are:
Is so long computation time expected for weka's PCA? Or am I using PCA wrongly?
If the long run time is expected:
How can I tune the PCA algorithm to run much faster? Can you suggest an alternative? (+ example code how to use it)?
If it is not:
What am I doing wrong? How should I invoke PCA using weka and get my reduced dimensionality?
Update: The comments confirms my suspicion that it is taking much more time than expected.
I'd like to know: How can I get PCA in java - using weka or an alternative library.
Added a bounty for this one.
After deepening in the WEKA code, the bottle neck is creating the covariance matrix, and then calculating the eigenvectors for this matrix. Even trying to switch to sparsed matrix implementation (I used COLT's SparseDoubleMatrix2D) did not help.
The solution I came up with was first reduce the dimensionality using a first fast method (I used information gain ranker, and filtering based on document frequencey), and then use PCA on the reduced dimensionality to reduce it farther.
The code is more complex, but it essentially comes down to this:
Ranker ranker = new Ranker();
InfoGainAttributeEval ig = new InfoGainAttributeEval();
Instances instances = SamplesManager.asWekaInstances(trainSet);
ig.buildEvaluator(instances);
firstAttributes = ranker.search(ig,instances);
candidates = Arrays.copyOfRange(firstAttributes, 0, FIRST_SIZE_REDUCTION);
instances = reduceDimenstions(instances, candidates)
PrincipalComponents pca = new PrincipalComponents();
pca.setVarianceCovered(var);
ranker = new Ranker();
ranker.setNumToSelect(numFeatures);
selection = new AttributeSelection();
selection.setEvaluator(pca);
selection.setSearch(ranker);
selection.SelectAttributes(instances );
instances = selection.reduceDimensionality(wekaInstances);
However, this method scored worse then using a greedy information gain and a ranker, when I cross-validated for estimated accuracy.
It looks like you're using the default configuration for the PCA, which judging by the long runtime, it is likely that it is doing way too much work for your purposes.
Take a look at the options for PrincipalComponents.
I'm not sure if -D means they will normalize it for you or if you have to do it yourself. You want your data to be normalized (centered about the mean) though, so I would do this yourself manually first.
-R sets the amount of variance you want accounted for. Default is 0.95. The correlation in your data might not be good so try setting it lower to something like 0.8.
-A sets the maximum number of attributes to include. I presume the default is all of them. Again, you should try setting it to something lower.
I suggest first starting out with very lax settings (e.g. -R=0.1 and -A=2) then working your way up to acceptable results.
Best
for the construction of your covariance matrix, you can use the following formula which is also used by matlab. It is faster then the apache library.
Whereby Matrix is an m x n matrix. (m --> #databaseFaces)
I just ported my code from MATLAB to Java, and I need the eigen decomposition of a matrix, specifically I only need the first k values not the full decomposition.
However in JAMA, the eigen-decomposition class computes the full eigen decomposition. I tried to modify it, but it throws some errors. Is there another similar library?
In MATLAB, the function in question is eigs(k,A)
So it's just returning the array of all the eigenvalues. You want to return an array with just the first k values of the array. There are many ways to do this in Java. One is to convert the array to an ArrayList, get a subList of that list, and convert back to an array.
double[] mySubArray = new double[k];
for (int i=0; i < k; i++) {
subArray[i] = myFullArray[i];
}
By the way, this is the library he is referring to: http://math.nist.gov/javanumerics/jama/doc/
In the case you cannot find any existing codes, I guess you should refer to this thesis or maybe this paper.
Maybe you can try another package named EigenDecomposition in http://commons.apache.org/proper/commons-math/javadocs/api-3.6/org/apache/commons/math3/linear/EigenDecomposition.html, there are some methods like getImagEigenvalue(int i), you can get the i-th eigenvalue by this.
I'm currently porting a program written in Python to Java and have run into some problems. I'm porting a part of the program at the time and for testing purposes I'm using JPype to make it compatible with the new java classes.
EDIT: Just to makes things more clear, the class I'm currently working on provides data to the rest of the Python program.
So, in my java class I have some float and byte values in ArrayLists,
ArrayList<ArrayList<Float>> dataFloat = new ArrayList<ArrayList<Float>>();
ArrayList<ArrayList<Byte>> dataByte = new ArrayList<ArrayList<Byte>>();
Then with the use of JPype I am able to get these into my Python environment which now has the type
<class 'jpype._jclass.java.util.ArrayList'> .
Now I wanted to simply convert these to numpy arrays in Python,
numpy.array(dataFloat) .
Which seemed to work at first as it looked nice when it was printed out,
[[1.0 2.0 3.0]
[80.0 127.0 127.0]
[255.0 255.0 255.0]] .
However, it did not work with the rest of the program because it demands that the values are of the type float. Looking further into the problem I found that these "float" values that I have are in fact
<class 'jpype._jclass.java.lang.Float'>
and not the regular Python float that I wanted. Compared to a regular numpy float array,
>>> b = array([[1.1, 2.1, 3.1], [4.1, 5.1, 6.1], [7.1, 8.1, 9.1]])
>>> type((b[0])[0])
<type 'numpy.float64'>
which has the desired float type.
To be able to run it with the rest of the Python program I had to convert the array per element with the java Float.floatValue(),
arr = numpy.array(dataFloat)
a = array([])
for j in range(len(arr)):
b = array([])
if array_equal(a,[]):
for i in arr.get(j):
a = append(a, i.floatValue())
else:
for i in arr.get(j):
b = append(b, i.floatValue())
a = vstack((a, b))
And this of course takes a lot of time, especially when there are thousands of elements.
Does anyone know this can be done in an efficient way? Simply put, I get a lot java.lang.Float values from JPype that need to be converted to regular Python float values.
I tried JPype some time ago and had some issues with type conversions too. Maybe you can speedup your code using http://cython.org/, there are some methods to speed up access to numpy data structures: http://docs.cython.org/src/tutorial/numpy.html
One further comment: vstack and hstack can copy with arbitrary lists/tuples. So you could rewrite your code like (untested)
arr = numpy.array(dataFloat)
a = []
for j in range(len(arr)):
b = array([])
for i in arr.get(j):
b = append(b, i.floatValue())
a.append(b)
a = vstack(a)
Further you could optimise speed if you avoid the calls to append(). It is faster to allocate your array b with a fixed size N, eg using zero(), and then filling in the values:
arr = numpy.array(dataFloat)
N = ....
a = []
for j in range(len(arr)):
b = zeros((N,))
for k in range(N):
i =arr.get(j)[k]:
b[k] = i.floatValue()
a.append(b)
a = vstack(a)
Which only works if you know N.
I find myself needing to generate a checksum for a string of data, for consistency purposes. The broad idea is that the client can regenerate the checksum based on the payload it recieves and thus detect any corruption that took place in transit. I am vaguely aware that there are all kinds of mathematical principles behind this kind of thing, and that it's very easy for subtle errors to make the whole algorithm ineffective if you try to roll it yourself.
So I'm looking for advice on a hashing/checksum algorithm with the following criteria:
It will be generated by Javascript, so needs to be relatively light computationally.
The validation will be done by Java (though I cannot see this actually being an issue).
It will take textual input (URL-encoded Unicode, which I believe is ASCII) of a moderate length; typically around 200-300 characters and in all cases below 2000.
The output should be ASCII text as well, and the shorter it can be the better.
I'm primarily interested in something lightweight rather than getting the absolute smallest potential for collisions possible. Would I be naive to imagine that an eight-character hash would be suitable for this? I should also clarify that it's not the end of the world if corruption isn't picked up at the validation stage (and I do realise that this will not be 100% reliable), though the rest of my code is markedly less efficient for every corrupt entry that slips through.
Edit - thanks to all that contributed. I went with the Adler32 option and given that it was natively supported in Java, extremely easy to implement in Javascript, fast to calculate at both ends and have an 8-byte output it was exactly right for my requirements.
(Note that I realise that the network transport is unlikely to be responsible for any corruption errors and won't be folding my arms on this issue just yet; however adding the checksum validation removes one point of failure and means we can focus on other areas should this reoccur.)
CRC32 is not too hard to implement in any language, it is good enough to detect simple data corruption and when implemted in a good fashion, it is very fast. However you can also try Adler32, which is almost equally good as CRC32, but it's even easier to implement (and about equally fast).
Adler32 in the Wikipedia
CRC32 JavaScript implementation sample
Either of these two (or maybe even both) are available in Java right out of the box.
Are aware that both TCP and UDP (and IP, and Ethernet, and...) already provide checksum protection to data in transit?
Unless you're doing something really weird, if you're seeing corruption, something is very wrong. I suggest starting with a memory tester.
Also, you receive strong data integrity protection if you use SSL/TLS.
Javascript implementation of MD4, MD5 and SHA1. BSD license.
Other people have mentioned CRC32 already, but here's a link to the W3C implementation of CRC-32 for PNG, as one of the few well-known, reputable sites with a reference CRC implementation.
(A few years back I tried to find a well-known site with a CRC algorithm or at least one that cited the source for its algorithm, & was almost tearing my hair out until I found the PNG page.)
[UPDATE 30/5/2013: The link to the old JS CRC32 implementation died, so I've now linked to a different one.]
Google CRC32: fast, and much lighter weight than MD5 et al. There is a Javascript implementation here.
In my search for a JavaScript implementation of a good checksum algorithm I came across this question. Andrzej Doyle rightfully chose Adler32 as the checksum, as it is indeed easy to implement and has some excellent properties. DroidOS then provided an actual implementation in JavaScript, which demonstrated the simplicity.
However, the algorithm can be further improved upon as detailed in the Wikipedia page and as implemented below. The trick is that you need not determine the modulo in each step. Rather, you can defer this to the end. This considerably increases the speed of the implementation, up to 6x faster on Chrome and Safari. In addition, this optimalisation does not affect the readability of the code making it a win-win. As such, it definitely fits in well with the original question as to having an algorithm / implementation that is computationally light.
function adler32(data) {
var MOD_ADLER = 65521;
var a = 1, b = 0;
var len = data.length;
for (var i = 0; i < len; i++) {
a += data.charCodeAt(i);
b += a;
}
a %= MOD_ADLER;
b %= MOD_ADLER;
return (b << 16) | a;
}
edit: imaya created a jsperf comparison a while back showing the difference in speed when running the simple version, as detailed by DroidOS, compared to an optimised version that defers the modulo operation. I have added the above implementation under the name full-length to the jsperf page showing that the above implementation is about 25% faster than the one from imaya and about 570% faster than the simple implementation (tests run on Chrome 30): http://jsperf.com/adler-32-simple-vs-optimized/6
edit2: please don't forget that, when working on large files, you will eventually hit the limit of your JavaScript implementation in terms of the a and b variables. As such, when working with a large data source, you should perform intermediate modulo operations as to ensure that you do not exceed the maximum value of the integer that you can reliably store.
Use SHA-1 JS implementation. It's not as slow as you think (Firefox 3.0 on Core 2 Duo 2.4Ghz hashes over 100KB per second).
Here's a relatively simple one I've 'invented' - there's no mathematical research behind it but it's extremely fast and works in practice. I've also included the Java equivalent that tests the algorithm and shows that there's less than 1 in 10,000,000 chance of failure (it takes a minute or two to run).
JavaScript
function getCrc(s) {
var result = 0;
for(var i = 0; i < s.length; i++) {
var c = s.charCodeAt(i);
result = (result << 1) ^ c;
}
return result;
}
Java
package test;
import java.util.*;
public class SimpleCrc {
public static void main(String[] args) {
final Random randomGenerator = new Random();
int lastCrc = -1;
int dupes = 0;
for(int i = 0; i < 10000000; i++) {
final StringBuilder sb = new StringBuilder();
for(int j = 0; j < 1000; j++) {
final char c = (char)(randomGenerator.nextInt(128 - 32) + 32);
sb.append(c);
}
final int crc = crc(sb.toString());
if(lastCrc == crc) {
dupes++;
}
lastCrc = crc;
}
System.out.println("Dupes: " + dupes);
}
public static int crc(String string) {
int result = 0;
for(final char c : string.toCharArray()) {
result = (result << 1) ^ c;
}
return result;
}
}
This is a rather old thread but I suspect it is still viewed quite often so - if all you need is a short but reliable piece of code to generate a checksum the Adler32 bit algorithm has to be your choice. Here is the JavaScript code
function adler32(data)
{
var MOD_ADLER = 65521;
var a = 1, b = 0;
for (var i = 0;i < data.length;i++)
{
a = (a + data.charCodeAt(i)) % MOD_ADLER;
b = (b + a) % MOD_ADLER;
}
var adler = a | (b << 16);
return adler;
}
The corresponding fiddle demonsrating the algorithm in action is here.