After training the model in Python and loading it in Java for making predictions, how can I create sparse tensors for categorical inputs?
I could successfully create tensors for numeric values like this:
Tensor x =
    Tensor.create(
        new long[] {2, 4},
        FloatBuffer.wrap(
            new float[] {
                6.4f, 3.2f, 4.5f, 1.5f,
                5.8f, 3.1f, 5.0f, 1.7f
            }));
But for categorical data we need sparse tensors; how can we create them?
Here is my input_fn():
def input_fn(df):
    # Creates a dictionary mapping from each continuous feature column name (k) to
    # the values of that column stored in a constant Tensor.
    continuous_cols = {k: tf.constant(df[k].values)
                       for k in CONTINUOUS_COLUMNS}
    # Creates a dictionary mapping from each categorical feature column name (k)
    # to the values of that column stored in a tf.SparseTensor.
    categorical_cols = {k: tf.SparseTensor(
        indices=[[i, 0] for i in range(df[k].size)],
        values=df[k].values,
        dense_shape=[df[k].size, 1])
        for k in CATEGORICAL_COLUMNS}
    # Merges the two dictionaries into one.
    feature_cols = dict(continuous_cols.items() + categorical_cols.items())
    # Converts the label column into a constant Tensor.
    label = tf.constant(df[LABEL_COLUMN].values)
    # Returns the feature columns and the label.
    return feature_cols, label
So what if I have input like the one below:
age workclass fnlwgt education education_num marital_status occupation race LABEL(Income_bracket)
39 State-gov 77516 Bachelors 13 Never-married Adm-clerical White 3
How can I create tensors for the continuous and categorical values and merge them to be provided as the input to TensorFlow in Java?
The code for training the model in Python is here - https://gist.github.com/gaganmalhotra/cd6a5898b9caf9005a05c8831a9b9153
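In the TensorFlow Java API, a SparseTensor is not a single object but three dense tensors (indices, values, dense_shape) that are fed separately. Below is a minimal sketch of building those three pieces for a single string column such as workclass, assuming the org.tensorflow 1.x Java API used above; the placeholder names they would be fed to depend on how the graph was exported, so this is an illustration rather than a complete answer:
import java.nio.charset.StandardCharsets;
import org.tensorflow.Tensor;

// A tf.SparseTensor is fed from Java as three dense tensors:
//   indices -> int64 tensor of shape [nnz, 2]
//   values  -> here a string tensor of shape [nnz]
//   shape   -> int64 tensor of shape [2] (the dense_shape [numRows, 1])
byte[][] workclass = {
    "State-gov".getBytes(StandardCharsets.UTF_8),
    "Private".getBytes(StandardCharsets.UTF_8)
};

Tensor indices = Tensor.create(new long[][] {{0, 0}, {1, 0}});
Tensor values = Tensor.create(workclass);            // STRING tensor of shape [2]
Tensor denseShape = Tensor.create(new long[] {2, 1});

// These three tensors are then passed to session.runner().feed(...) using the
// placeholder names the exported graph exposes for this sparse feature.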
I am trying to find some code that will easily break down a binary string. I'm not even sure I'm asking this question correctly, but I want to get the value of each "active bit". For example, if I have a binary string of 100000001, I would like to return the values 256, 1 in an array. I'm trying to figure this out so I can use a lookup table in SQL which has an integer column and a text column. The integer column will be used to determine which text values will be written to a new table. So the value "Text1" at 1 and "Text2" at 256 would both be written to the new table, but the number submitted to get those values would be 257.
I know I'm rambling, but I would input a value, 257, and convert it to a binary string of 100000001. Now I want some code to break that binary string into two values... 1 and 256. Am I making any sense?
You don't need to convert to a binary string if you use Integer.highestOneBit. You can loop through the one bits, filling an array of size Integer.bitCount(num) with each call to Integer.highestOneBit. Afterwards, you can XOR with the value of the highest bit to remove it from the number.
public static int[] getOneBits(int num) {
    int[] oneBits = new int[Integer.bitCount(num)];
    for (int i = 0; i < oneBits.length; i++) {
        oneBits[i] = Integer.highestOneBit(num);
        num ^= oneBits[i];
    }
    return oneBits;
}
This will produce an array, where all of the values are powers of 2 in descending order, where the sum of all the elements will be the original number. For example, 257 will produce [256, 1], and 127 will produce [64, 32, 16, 8, 4, 2, 1].
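For example, a quick usage sketch:
int[] bits = getOneBits(257);
System.out.println(java.util.Arrays.toString(bits)); // prints [256, 1]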
I have started using the EJML library for representing matrices. I will use SimpleMatrix. There are two important things I need that I did not find. Perhaps somebody can help me identify whether the following operations are possible and, if yes, how they can be done:
Is it possible to convert a matrix back to a 1D double array (double[]) or 2D double array (double[][]) without just looping through all elements which would be very inefficient? I did not find a method for that. For example, Jeigen library provides a conversion to a 1D array (but I don't know how this is internally done).
Is it possible to delete a row or column?
By the way, does somebody know how EJML compares to Jeigen for large matrices in terms of runtime? EJML provides much more functionality and is much better documented, but I'm a bit concerned about its runtime.
The underlying array of a SimpleMatrix (it's always 1-dimensional to keep all elements in the same area of RAM) can be retrieved by first getting the underlying DenseMatrix64F and then getting the public data field of the D1Matrix64F base class:
// where matrix is a SimpleMatrix
double[] data = matrix.getMatrix().data;
I don't see a straightforward way to delete arbitrary rows or columns. One workaround is to use extractMatrix (it copies the underlying double[]) to get two parts of the original matrix and then combine them into a new matrix. E.g., to delete the middle column of this 2x3 matrix:
SimpleMatrix fullMatrix = new SimpleMatrix(new double[][]{{2, 3, 4}, {7, 8, 9}});
SimpleMatrix a = fullMatrix.extractMatrix(0, 2, 0, 1);
SimpleMatrix b = fullMatrix.extractMatrix(0, 2, 2, 3);
SimpleMatrix matrix = a.combine(0, 1, b);
Or to delete specifically the first column you can simply do:
SimpleMatrix matrix = fullMatrix.extractMatrix(0, 2, 1, 3);
Or to delete specifically the last column you can simply do this (it doesn't delete or copy the underlying data[]):
matrix.getMatrix().setNumCols(matrix.numCols() - 1);
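Deleting a row works the same way with extractMatrix and combine. A sketch, assuming a 3x3 SimpleMatrix named fullMatrix3x3 (the name is introduced here just for illustration) and removing its middle row:
// Take the rows above and below the middle row, then stitch them back together.
SimpleMatrix top = fullMatrix3x3.extractMatrix(0, 1, 0, 3);
SimpleMatrix bottom = fullMatrix3x3.extractMatrix(2, 3, 0, 3);
SimpleMatrix withoutMiddleRow = top.combine(1, 0, bottom); // 2x3 result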
I will refer to this answer for benchmarks / performance of various Java matrix libraries. The performance of EJML is excellent for small matrices, but for sizes of, say, 100 or more it doesn't compete well with libraries backed by native C/C++ code (like Jeigen). As always, your mileage may vary.
Manos' answer to the first question did not work for me. This is what I did instead:
public double[][] matrix2Array(SimpleMatrix matrix) {
    double[][] array = new double[matrix.numRows()][matrix.numCols()];
    for (int r = 0; r < matrix.numRows(); r++) {
        for (int c = 0; c < matrix.numCols(); c++) {
            array[r][c] = matrix.get(r, c);
        }
    }
    return array;
}
I don't know how it compares in performance to other methods, but it works fast enough for what I needed it for.
I am using the .NET implementation of liblinear in my C# code via the following NuGet package:
https://www.nuget.org/packages/Liblinear/
But in the readme file of liblinear, the format for x is:
struct problem describes the problem:
struct problem
{
    int l, n;
    int *y;
    struct feature_node **x;
    double bias;
};
where `l` is the number of training data. If bias >= 0, we assume
that one additional feature is added to the end of each data
instance. `n` is the number of feature (including the bias feature
if bias >= 0). `y` is an array containing the target values. (integers
in classification, real numbers in regression) And `x` is an array
of pointers, each of which points to a sparse representation (array
of feature_node) of one training vector.
For example, if we have the following training data:
LABEL ATTR1 ATTR2 ATTR3 ATTR4 ATTR5
----- ----- ----- ----- ----- -----
1 0 0.1 0.2 0 0
2 0 0.1 0.3 -1.2 0
1 0.4 0 0 0 0
2 0 0.1 0 1.4 0.5
3 -0.1 -0.2 0.1 1.1 0.1
and bias = 1, then the components of problem are:
l = 5
n = 6
y -> 1 2 1 2 3
x -> [ ] -> (2,0.1) (3,0.2) (6,1) (-1,?)
[ ] -> (2,0.1) (3,0.3) (4,-1.2) (6,1) (-1,?)
[ ] -> (1,0.4) (6,1) (-1,?)
[ ] -> (2,0.1) (4,1.4) (5,0.5) (6,1) (-1,?)
[ ] -> (1,-0.1) (2,-0.2) (3,0.1) (4,1.1) (5,0.1) (6,1) (-1,?)
But in the example implementation at this gist:
https://gist.github.com/hodzanassredin/6682771
problem.x <- [|
    [|new FeatureNode(1,0.); new FeatureNode(2,1.)|]
    [|new FeatureNode(1,2.); new FeatureNode(2,0.)|]
|] // feature nodes
problem.y <- [|1.;2.|] // target values
which means his data set is:
1 0 1
2 2 0
So he is not storing the nodes in liblinear's sparse format. Does anyone know the correct format of x for the liblinear implementation?
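For comparison, here is a rough sketch of how the README's sparse layout looks in the Java port de.bwaldvogel:liblinear (assuming a recent version where Problem.y is a double[]); this only illustrates the Java side and may not match the .NET package's API:
import de.bwaldvogel.liblinear.FeatureNode;
import de.bwaldvogel.liblinear.Problem;

// Sparse encoding of the first two README rows (bias = 1, so a constant
// feature (6, 1) is appended; indices are 1-based and must be ascending).
Problem problem = new Problem();
problem.l = 2;        // number of training instances
problem.n = 6;        // number of features, including the bias feature
problem.bias = 1;
problem.y = new double[] {1, 2};
problem.x = new FeatureNode[][] {
    // LABEL 1: ATTR2=0.1, ATTR3=0.2
    { new FeatureNode(2, 0.1), new FeatureNode(3, 0.2), new FeatureNode(6, 1) },
    // LABEL 2: ATTR2=0.1, ATTR3=0.3, ATTR4=-1.2
    { new FeatureNode(2, 0.1), new FeatureNode(3, 0.3), new FeatureNode(4, -1.2), new FeatureNode(6, 1) },
};
// Zero-valued features are simply omitted; the (-1,?) terminator from the
// C struct is not needed in the Java port.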
Though it doesn't address exactly the library you mentioned, I can offer you an alternative. The
Accord.NET Framework has recently incorporated all of LIBLINEAR's algorithms in its machine learning
namespaces. It is also available through NuGet.
In this library, the direct syntax to create a linear support vector machine from in-memory data is
// Create a simple binary AND
// classification problem:
double[][] problem =
{
    //             a  b  a AND b
    new double[] { 0, 0, 0 },
    new double[] { 0, 1, 0 },
    new double[] { 1, 0, 0 },
    new double[] { 1, 1, 1 },
};
// Get the two first columns as the problem
// inputs and the last column as the output
// input columns
double[][] inputs = problem.GetColumns(0, 1);
// output column
int[] outputs = problem.GetColumn(2).ToInt32();
// However, SVMs expect the output value to be
// either -1 or +1. As such, we have to convert
// it so the vector contains { -1, -1, -1, +1 }:
//
outputs = outputs.Apply(x => x == 0 ? -1 : 1);
After the problem is created, one can learn a linear SVM using
// Create a new linear-SVM for two inputs (a and b)
SupportVectorMachine svm = new SupportVectorMachine(inputs: 2);
// Create a L2-regularized L2-loss support vector classification algorithm
var teacher = new LinearDualCoordinateDescent(svm, inputs, outputs)
{
    Loss = Loss.L2,
    Complexity = 1000,
    Tolerance = 1e-5
};
// Learn the machine
double error = teacher.Run(computeError: true);
// Compute the machine's answers for the learned inputs
int[] answers = inputs.Apply(x => Math.Sign(svm.Compute(x)));
This assumes, however, that your data is already in-memory. If you wish to load your data from the
disk, from a file in libsvm sparse format, you can use the framework's SparseReader class.
An example on how to use it can be found below:
// Suppose we are going to read a sparse sample file containing
// samples which have an actual dimension of 4. Since the samples
// are in a sparse format, each entry in the file will probably
// have a much smaller number of elements.
//
int sampleSize = 4;
// Create a new Sparse Sample Reader to read any given file,
// passing the correct dense sample size in the constructor
//
SparseReader reader = new SparseReader(file, Encoding.Default, sampleSize);
// Declare a vector to obtain the label
// of each of the samples in the file
//
int[] labels = null;
// Declare a vector to obtain the description (or comments)
// about each of the samples in the file, if present.
//
string[] descriptions = null;
// Read the sparse samples and store them in a dense vector array
double[][] samples = reader.ReadToEnd(out labels, out descriptions);
Afterwards, one can use the samples and labels vectors as the inputs and outputs of the problem,
respectively.
I hope it helps.
Disclaimer: I am the author of this library. I am answering this question in the sincere hope it
can be useful for the OP, since not long ago I also faced the same problems. If a moderator thinks
this looks like spam, feel free to delete. However, I am only posting this because I think it might
help others. I even came across this question by mistake while searching for existing C#
implementations of LIBSVM, not LIBLINEAR.
I'm using the Trickl-Cluster project to cluster my data set and Colt to store the data objects in matrices.
After executing this code
import cern.colt.matrix.DoubleMatrix2D;
import cern.colt.matrix.impl.DenseDoubleMatrix2D;
import com.trickl.cluster.KMeans;
DoubleMatrix2D dm1 = new DenseDoubleMatrix2D(3, 3);
dm1.setQuick(0, 0, 5.9);
dm1.setQuick(0, 1, 1.6);
dm1.setQuick(0, 2, 18.0);
dm1.setQuick(1, 0, 2.0);
dm1.setQuick(1, 1, 3.5);
dm1.setQuick(1, 2, 20.3);
dm1.setQuick(2, 0, 11.5);
dm1.setQuick(2, 1, 100.5);
dm1.setQuick(2, 2, 6.5);
System.out.println(dm1);

KMeans km = new KMeans();
km.cluster(dm1, 1);
DoubleMatrix2D dm11 = km.getPartition();
System.out.println(dm11);
DoubleMatrix2D dm111 = km.getMeans();
System.out.println(dm111);
I got the following output:
3 x 3 matrix
5.9 1.6 18
2 3.5 20.3
11.5 100.5 6.5
3 x 1 matrix
1
1
1
3 x 1 matrix
6.466667
35.2
14.933333
Following the algorithm's steps, it's strange that one expects 1 cluster and gets 3 means.
The documentation is not so clear about that specific point .
This is the definition of the cluster method according to the project's Javadoc:
void cluster(cern.colt.matrix.DoubleMatrix2D data, int clusters)
So, logically speaking, the int clusters parameter represents the number of expected clusters after k-means terminates.
Do you have any idea about the relation between the outputs of the KMeans class in this project and the results expected from the k-means algorithm?
This is one 3-dimensional mean. If you put in three-dimensional data, you get out three-dimensional means.
Note that running k-means with k=1 is absolutely nonsensical, as it will simply compute the mean of the data set:
(5.9+2+11.5) / 3 = 6.466667
(1.6+3.5+100.5) / 3 = 35.2
(18+20.3+6.5) / 3 = 14.933333
The result is obviously correct.
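To see this from the matrix in the question, here is a small sketch using Colt (dm1 is the 3x3 matrix built above); with k = 1 the single "cluster mean" is just the per-column average:
import cern.colt.matrix.DoubleMatrix2D;

// Each column of dm1 is one dimension; average it over the three rows (data points).
for (int c = 0; c < dm1.columns(); c++) {
    double mean = dm1.viewColumn(c).zSum() / dm1.rows();
    System.out.println("dimension " + c + " mean = " + mean);
}
// prints 6.466667, 35.2, 14.933333, matching km.getMeans()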
I developed a Java program to calculate cosine similarity on the basis of TF*IDF. It worked very well, but there is one problem... :(
For example:
If I have the following two matrices and want to calculate cosine similarity, it does not work because the rows are not the same length:
doc 1
1 2 3
4 5 6
doc 2
1 2 3 4 5 6
7 8 5 2 4 9
If the rows and columns are the same length then my program works very well, but it does not if they are not the same length.
Any tips ???
I'm not sure of your implementation, but the cosine similarity of two vectors is equal to the normalized dot product of those vectors.
The dot product of two matrices can be expressed as a . b = a^T b. As a result, if the matrices have different lengths you can't take the dot product to compute the cosine.
Now, in a standard TF*IDF approach the entries in your matrix should be indexed by (term, document); as a result, any term not appearing in a document should appear as a zero in your matrix.
Now the way you have it set up seems to suggest there are two different matrices for your two documents. I'm not sure if this is your intent, but it seems incorrect.
On the other hand if one of your matrices is supposed to be your query, then it should be a vector and not a matrix, so that the transpose produces the correct result.
A full explanation of TF*IDF follows:
Ok, in a classic TF*IDF you construct a term-document matrix a. Each value in matrix a is characterized as a(i,j), where i is the term and j is the document. This value is a combination of local, global and normalized weights (although if you normalize your documents, the normalized weight should be 1). Thus a(i,j) = f(i,j) * D / d(i), where f(i,j) is the frequency of term i in doc j, D is the number of documents in the collection, and d(i) is the number of documents containing term i.
Your query is a vector of terms designated as b. Each entry b(i,q) in your query refers to term i of query q: b(i,q) = f(i,q), where f(i,q) is the frequency of term i in query q. In this case each query is a vector, and multiple queries form a matrix.
We can then calculate the unit vectors of each so that when we take the dot product it will produce the correct cosine. To achieve the unit vector we divide both the matrix a and the query b by their Frobenius norm.
Finally we can compute the cosine similarity by taking the transpose of the vector b for a given query, i.e., one query (or vector) per calculation. This is denoted as b^T a. The final result is a vector with the score for each document, where a higher score denotes a higher document rank.
Simple Java cosine similarity:
import java.util.Map;
import java.util.Set;

import com.google.common.collect.Sets;

static double cosine_similarity(Map<String, Double> v1, Map<String, Double> v2) {
    // Keep only the terms that appear in both vectors; only they contribute to the dot product.
    Set<String> both = Sets.newHashSet(v1.keySet());
    both.retainAll(v2.keySet());
    double scalar = 0, norm1 = 0, norm2 = 0;
    for (String k : both) scalar += v1.get(k) * v2.get(k);
    for (String k : v1.keySet()) norm1 += v1.get(k) * v1.get(k);
    for (String k : v2.keySet()) norm2 += v2.get(k) * v2.get(k);
    return scalar / Math.sqrt(norm1 * norm2);
}
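A quick usage sketch with made-up term weights (the maps and values below are purely illustrative):
import java.util.HashMap;
import java.util.Map;

Map<String, Double> doc1 = new HashMap<>();
doc1.put("cat", 0.5);
doc1.put("dog", 1.2);
doc1.put("fish", 0.3);

Map<String, Double> doc2 = new HashMap<>();
doc2.put("cat", 0.4);
doc2.put("bird", 0.9);

// Only "cat" appears in both maps, so it alone contributes to the dot product.
double sim = cosine_similarity(doc1, doc2);
System.out.println(sim);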