Liblinear usage format - java

I am using the .NET implementation of liblinear in my C# code via the following NuGet package:
https://www.nuget.org/packages/Liblinear/
But in the README file of liblinear, the format for x is:
struct problem describes the problem:

struct problem
{
    int l, n;
    int *y;
    struct feature_node **x;
    double bias;
};
where `l` is the number of training data. If bias >= 0, we assume
that one additional feature is added to the end of each data
instance. `n` is the number of features (including the bias feature
if bias >= 0). `y` is an array containing the target values (integers
in classification, real numbers in regression), and `x` is an array
of pointers, each of which points to a sparse representation (array
of feature_node) of one training vector.
For example, if we have the following training data:
LABEL   ATTR1   ATTR2   ATTR3   ATTR4   ATTR5
-----   -----   -----   -----   -----   -----
  1       0      0.1     0.2      0       0
  2       0      0.1     0.3    -1.2      0
  1      0.4      0       0       0       0
  2       0      0.1      0      1.4     0.5
  3     -0.1    -0.2     0.1     1.1     0.1
and bias = 1, then the components of problem are:
l = 5
n = 6
y -> 1 2 1 2 3
x -> [ ] -> (2,0.1) (3,0.2) (6,1) (-1,?)
     [ ] -> (2,0.1) (3,0.3) (4,-1.2) (6,1) (-1,?)
     [ ] -> (1,0.4) (6,1) (-1,?)
     [ ] -> (2,0.1) (4,1.4) (5,0.5) (6,1) (-1,?)
     [ ] -> (1,-0.1) (2,-0.2) (3,0.1) (4,1.1) (5,0.1) (6,1) (-1,?)
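In code, that sparse representation would look roughly like this (a sketch against the liblinear-java API, whose FeatureNode/Problem names match the snippet from the gist below; the (-1,?) end marker of the C struct does not appear in the gist's snippet either):
import de.bwaldvogel.liblinear.Feature;
import de.bwaldvogel.liblinear.FeatureNode;
import de.bwaldvogel.liblinear.Problem;

Problem problem = new Problem();
problem.l = 5;        // number of training instances
problem.n = 6;        // number of features, including the bias feature
problem.bias = 1;     // appended to every instance as feature index 6
problem.y = new double[] { 1, 2, 1, 2, 3 };
problem.x = new Feature[][] {
    // only non-zero features are stored; indices are 1-based and ascending
    { new FeatureNode(2, 0.1), new FeatureNode(3, 0.2), new FeatureNode(6, 1) },
    { new FeatureNode(2, 0.1), new FeatureNode(3, 0.3), new FeatureNode(4, -1.2), new FeatureNode(6, 1) },
    { new FeatureNode(1, 0.4), new FeatureNode(6, 1) },
    { new FeatureNode(2, 0.1), new FeatureNode(4, 1.4), new FeatureNode(5, 0.5), new FeatureNode(6, 1) },
    { new FeatureNode(1, -0.1), new FeatureNode(2, -0.2), new FeatureNode(3, 0.1),
      new FeatureNode(4, 1.1), new FeatureNode(5, 0.1), new FeatureNode(6, 1) }
};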
But in the example showing a Java implementation:
https://gist.github.com/hodzanassredin/6682771
problem.x <- [|
[|new FeatureNode(1,0.); new FeatureNode(2,1.)|]
[|new FeatureNode(1,2.); new FeatureNode(2,0.)|]
|]// feature nodes
problem.y <- [|1.;2.|] // target values
which means his data set is:
1 0 1
2 2 0
So he is not storing the nodes in liblinear's sparse format. Does anyone know the correct format for x for this liblinear implementation?

Though it doesn't exactly address the library you mentioned, I can offer you an alternative. The
Accord.NET Framework has recently incorporated all of LIBLINEAR's algorithms in its machine learning
namespaces. It is also available through NuGet.
In this library, the direct syntax to create a linear support vector machine from in-memory data is
// Create a simple binary AND
// classification problem:
double[][] problem =
{
// a b a + b
new double[] { 0, 0, 0 },
new double[] { 0, 1, 0 },
new double[] { 1, 0, 0 },
new double[] { 1, 1, 1 },
};
// Get the two first columns as the problem
// inputs and the last column as the output
// input columns
double[][] inputs = problem.GetColumns(0, 1);
// output column
int[] outputs = problem.GetColumn(2).ToInt32();
// However, SVMs expect the output value to be
// either -1 or +1. As such, we have to convert
// it so the vector contains { -1, -1, -1, +1 }:
//
outputs = outputs.Apply(x => x == 0 ? -1 : 1);
After the problem is created, one can learn a linear SVM using
// Create a new linear-SVM for two inputs (a and b)
SupportVectorMachine svm = new SupportVectorMachine(inputs: 2);
// Create a L2-regularized L2-loss support vector classification
var teacher = new LinearDualCoordinateDescent(svm, inputs, outputs)
{
Loss = Loss.L2,
Complexity = 1000,
Tolerance = 1e-5
};
// Learn the machine
double error = teacher.Run(computeError: true);
// Compute the machine's answers for the learned inputs
int[] answers = inputs.Apply(x => Math.Sign(svm.Compute(x)));
This assumes, however, that your data is already in-memory. If you wish to load your data from the
disk, from a file in libsvm sparse format, you can use the framework's SparseReader class.
An example on how to use it can be found below:
// Suppose we are going to read a sparse sample file containing
// samples which have an actual dimension of 4. Since the samples
// are in a sparse format, each entry in the file will probably
// have a much smaller number of elements.
//
int sampleSize = 4;
// Create a new Sparse Sample Reader to read any given file,
// passing the correct dense sample size in the constructor
//
SparseReader reader = new SparseReader(file, Encoding.Default, sampleSize);
// Declare a vector to obtain the label
// of each of the samples in the file
//
int[] labels = null;
// Declare a vector to obtain the description (or comments)
// about each of the samples in the file, if present.
//
string[] descriptions = null;
// Read the sparse samples and store them in a dense vector array
double[][] samples = reader.ReadToEnd(out labels, out descriptions);
Afterwards, one can use the samples and labels vectors as the inputs and outputs of the problem,
respectively.
I hope it helps.
Disclaimer: I am the author of this library. I am answering this question in the sincere hope it
can be useful for the OP, since not long ago I also faced the same problems. If a moderator thinks
this looks like spam, feel free to delete. However, I am only posting this because I think it might
help others. I even came across this question by mistake while searching for existing C#
implementations of LIBSVM, not LIBLINEAR.

How to provide strings as inputs and outputs in encog XOR function?

I need to make an Encog program in Java with the XOR function that must have string words with definitions as inputs, but BasicMLDataSet can only receive doubles. Here is the sample code that I am using:
/**
* The input necessary for XOR.
*/
public static double XOR_INPUT[][] = { { 0.0, 0.0 }, { 1.0, 0.0 },
{ 0.0, 1.0 }, { 1.0, 1.0 } };
/**
* The ideal data necessary for XOR.
*/
public static double XOR_IDEAL[][] = { { 0.0 }, { 1.0 }, { 1.0 }, { 0.0 } };
And here is the class that receives XOR_INPUT and XOR_IDEAL:
MLDataSet trainingSet = new BasicMLDataSet(XOR_INPUT, XOR_IDEAL);
The code is from the Encog XOR example.
Is there any way that I can accomplish training with strings, or parse them somehow and then convert them back to strings before writing them to the console?
I have found a workaround for this. Since I can only provide double values between 0 and 1 as inputs, and since I haven't found any function in Encog that can natively normalize a string to double values, I have made my own function. I take the ASCII value of every letter in the word and then simply divide 90 by that ASCII value to get a value between 0 and 1. Keep in mind that this only works for lowercase letters; the function can easily be extended to support uppercase letters as well. Here is the function:
// Converts every letter in the string to ASCII and normalizes it (90 / asciiValue).
// The second parameter ("najveci") is the fixed length of the output array.
public static double[] toAscii(String s, int najveci) {
    double[] ascii = new double[najveci];
    try {
        byte[] bytes = s.getBytes("US-ASCII");
        for (int i = 0; i < bytes.length; i++) {
            ascii[i] = 90.0 / bytes[i];
        }
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
    return ascii;
}
For the ideal word output I'm using a similar solution. I also normalize each letter in the word, but then I take the average of those values. Later, I denormalize those values to get the strings back and check how well the model has trained.
You can view full code here.
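The denormalization step described above could look roughly like this (a hypothetical helper sketched from the description, not part of the posted code):
// Hypothetical inverse of toAscii(): since value = 90.0 / asciiCode,
// each character is recovered as asciiCode = 90.0 / value.
public static String fromAscii(double[] normalized) {
    StringBuilder sb = new StringBuilder();
    for (double v : normalized) {
        if (v == 0.0) {
            continue; // unused slots in the fixed-size array
        }
        sb.append((char) Math.round(90.0 / v));
    }
    return sb.toString();
}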
You can use Encog's EncogAnalyst and AnalystWizard to normalize your data. This posting by @JeffHeaton (the author of Encog) shows an example using .csv files.
These classes can normalize both numeric and "nominal" data (e.g., the strings you want to use). You will likely want to use the "Equilateral" normalization for these strings, as this will avoid some training issues with neural networks, as sketched below.
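For illustration only, a minimal sketch of equilateral encoding, assuming Encog's org.encog.mathutil.Equilateral class:
import org.encog.mathutil.Equilateral;

// Three nominal categories (e.g. three distinct words), encoded into the
// range -0.9 .. 0.9 to avoid extreme training targets.
Equilateral eq = new Equilateral(3, 0.9, -0.9);

// Each category becomes a vector with (categories - 1) components.
double[] word0 = eq.encode(0);
double[] word1 = eq.encode(1);

// A network output vector is mapped back to the nearest category index.
int predicted = eq.decode(word0); // 0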
You might also want to check out this tutorial on Encog on PluralSight which has an entire section on Normalization.
Here is an example from the Encog documentation that shows how to normalize a field using code (without a .csv file):
var fuelStats = new NormalizedField(NormalizationAction.Normalize, "fuel", 200, 0, -0.9, 0.9);
For the above example the range is normalized to -0.9 to 0.9. This is very similar to normalizing between -1 and 1, but less extreme, and can produce better results at times. It is also known that the acceptable range for fuel is between 0 and 200. Now that the field object has been created, it is easy to normalize values. Here the value 100 is normalized into the variable n:
double n = fuelStats.Normalize(100);
To denormalize n back to the original fuel value, use the following code:
double f = fuelStats.Denormalize(n);

Delta training rule for perceptron training

I'm trying to train a perceptron for the AND boolean function using the delta training rule. But even after convergence it is wrongly classifying the inputs (1 input, actually). Could you please tell me where I am wrong: http://ideone.com/CDgTQE
This is the training function used:
public void trianWithDelta(Example[] examples){
    for(int i = 0; i < 1000; ++i){
        dw1 = 0;
        dw2 = 0;
        for(Example ex : examples){
            double o = computeOutput(ex);
            double t = ex.o;
            dw1 = dw1 + n * (t - o) * ex.x1;
            dw2 = dw2 + n * (t - o) * ex.x2;
        }
        w1 += dw1;
        w2 += dw2;
    }
}
The training examples (boolean AND):
Example[] examples = new Example[]{
new Example(-1, -1, -1),
new Example(-1 , 1, -1),
new Example( 1, -1, -1),
new Example( 1, 1, 1)
};
Results :
w1 : 0.49999999999999994 w2 : 0.5000000000000002
Tests using the training examples after training :
-1
1 (incorrect)
-1
1
Your code is actually correct; the problem lies in your understanding of what can and cannot be learned with a perceptron that has no bias.
If you do not have a bias, then learning AND is nearly impossible, because:
there is exactly one separating direction for your data, realized by the line y = -x; in your code this means w1 = w2, and even the slightest difference between their values (such as 1e-20) will break the classifier
your classifier actually answers three values (as you use the sign function): -1, 0, 1, while it is impossible to separate AND without a bias in such a setting, as you need to answer -1 when the activation is 0.
Try to draw the correct separator on a piece of paper: you will notice that without a bias your line has to cross (0,0), thus it has to be y = -x, and consequently for (-1,1) and (1,-1) the activation is 0.
Both problems can be solved by just adding a bias node (and this is what you should do); a sketch is shown below.
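For illustration, the same delta-rule loop with a bias weight added might look like this (a sketch reusing the question's variable names; w0 is new, and computeOutput would need to include the w0 * 1.0 term as well):
public void trainWithDeltaAndBias(Example[] examples){
    for(int i = 0; i < 1000; ++i){
        double dw0 = 0, dw1 = 0, dw2 = 0;
        for(Example ex : examples){
            double o = computeOutput(ex); // must now add w0 * 1.0 to the activation
            double t = ex.o;
            dw0 += n * (t - o);           // the bias "input" is always 1
            dw1 += n * (t - o) * ex.x1;
            dw2 += n * (t - o) * ex.x2;
        }
        w0 += dw0;
        w1 += dw1;
        w2 += dw2;
    }
}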
You can also change the definition of AND "a bit", for example by encoding "false" as -2:
Example[] examples = new Example[]{
new Example(-2, -2, -2),
new Example(-2 , 1, -2),
new Example( 1, -2, -2),
new Example( 1, 1, 1)
};
And running your code then behaves as expected:
Trained weights : 0.6363636363636364 0.6363636363636364
-1
-1
-1
1

String to binary?

I have a very odd situation,
I'm writing a filter engine for another program, and that program has what are called "save areas". Each of those save areas is numbered 0 through 32 (why there are 33 of them, I don't know). They are turned on or off via a binary string,
1 = save area 0 on
10 = save area 1 on, save area 0 off
100 = save area 2 on, save areas 1 and 0 off.
and so on.
I have another program passing in what save areas it needs, but it does so with decimal representations and underscores - 1_2_3 for save areas 1, 2, and 3 for instance.
I would need to convert that example to 1110.
What I came up with is that I can build a string as follows:
I break it up (using split) into savePart[i]. I then iterate through savePart[i] and build strings:
String saveString = padRight("0b1",Integer.parseInt(savePart[i]));
That'll give me a string that reads "0b1000000" in the case of save area 6, for instance.
Is there a way to read that string as if it were a binary number instead? Because if I were to say:
long saveBinary = 0b1000000L
that would totally work.
or, is there a smarter way to be doing this?
long saveBinary = Long.parseLong(saveString, 2);
Note that you'll have to leave off the 0b prefix.
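For example, the digit string can be built directly from the underscore-separated input and parsed with radix 2 (a sketch assuming the input format from the question):
String input = "1_2_3";
char[] digits = new char[33];                 // save areas 0..32
java.util.Arrays.fill(digits, '0');
for (String part : input.split("_")) {
    int area = Integer.parseInt(part);
    digits[digits.length - 1 - area] = '1';   // area 0 is the rightmost digit
}
String saveString = new String(digits);       // "...0001110"
long saveBinary = Long.parseLong(saveString, 2); // 14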
This will do it:
String input = "1_2_3";
long areaBits = 0;
for (String numTxt : input.split("_")) {
    areaBits |= 1L << Integer.parseInt(numTxt);
}
System.out.printf("\"%s\" -> %d (decimal) = %<x (hex) = %s (binary)%n",
input, areaBits, Long.toBinaryString(areaBits));
Output:
"1_2_3" -> 14 (decimal) = e (hex) = 1110 (binary)
Just take each number in the string and treat it as an exponent. Accumulate the total for each exponent found and you will get your answer without the need to remove prefixes or suffixes.
// will contain our answer
long answer = 0;
String[] buckets = givenData.split("_");     // array of each bucket wanted, the exponents
for (int x = 0; x < buckets.length; x++) {   // iterate through all exponents found
    long tmpLong = Long.parseLong(buckets[x]);   // get the exponent
    answer += (long) Math.pow(10, tmpLong);      // add 10^exponent to our running total
                                                 // (^ is XOR in Java, so Math.pow is needed;
                                                 //  note a long only holds 18 decimal digits)
}
answer will now contain our answer as a decimal number whose digits read like the binary string, e.g. 1011010101.
In your example, the given data was 1_2_3. The array will contain {"1", "2", "3"}
We iterate through that array...
10^1 + 10^2 + 10^3 = 10 + 100 + 1000 = 1110
I believe this is also why your numbers are 0 - 32: 10^0 = 1, so you can still fill the 0 bucket when 0 appears in the input.

Saving a binary STL file in Java

I am trying to save some data as an STL file for use on a 3D printer. The STL file has two forms: ASCII and Binary. The ASCII format is relatively easy to understand and create but most 3D printing services require it to be in binary format.
The information about STL Binary is explained on the Wikipedia page here: http://en.wikipedia.org/wiki/STL_(file_format)
I know that I will require the data to be in a byte array but I have no idea how to go about interpreting the information from Wikipedia and creating the byte array. This is what I would like help with.
The code I have so far simply saves an empty byte array:
byte[] bytes = null;
FileOutputStream stream = new FileOutputStream("test.stl");
try {
    stream.write(bytes);
} finally {
    stream.close();
}
If you start a new project on an up-to-date Java version, you need not bother with OutputStreams. Use Channels and ByteBuffers instead.
try(FileChannel ch = new RandomAccessFile("test.stl", "rw").getChannel())
{
    ByteBuffer bb = ByteBuffer.allocate(10000).order(ByteOrder.LITTLE_ENDIAN);
    // ...
    // e.g. store a vertex:
    bb.putFloat(0.0f).putFloat(1.0f).putFloat(42);
    bb.flip();
    ch.write(bb);
    bb.clear();
    // ...
}
This is the only API that provides the required little-endian support. Then match the data types:
UINT8 means unsigned byte,
UINT32 means unsigned int,
REAL32 means float,
UINT16 means unsigned short,
REAL32[3] means three floats (i.e. an array)
You don’t have to worry about the unsigned nature of the data types as long as you don’t exceed the max values of the corresponding signed Java types.
This shouldn't be so ambiguous. The spec says:
UINT8[80] – Header
UINT32 – Number of triangles
foreach triangle
REAL32[3] – Normal vector
REAL32[3] – Vertex 1
REAL32[3] – Vertex 2
REAL32[3] – Vertex 3
UINT16 – Attribute byte count
end
That means a total file size of: 80 + 4 + Number of Triangles * ( 4 * 3 * 4 + 2 ).
So for example, 100 triangles ( 84+100*50 ), produces a 5084 byte file.
You can optimize the following code, but it is functional as written.
Open the file and write the header:
RandomAccessFile raf = new RandomAccessFile( fileName, "rw" );
raf.setLength( 0L );
FileChannel ch = raf.getChannel();
ByteBuffer bb = ByteBuffer.allocate( 1024 ).order( ByteOrder.LITTLE_ENDIAN );
byte titleByte[] = new byte[ 80 ];
System.arraycopy( title.getBytes(), 0, titleByte, 0, title.length() );
bb.put( titleByte );
bb.putInt( nofTriangles ); // Number of triangles
bb.flip(); // prep for writing
ch.write( bb );
In this code, the point vertices and triangle indices are in arrays like this:
Vector3 vertices[ index ]
int indices[ index ][ triangle point number ]
Write the point data:
for ( int i = 0; i < nofIndices; i++ ) // triangles
{
    bb.clear();
    Vector3 normal = getNormal( indices[ i ][ 0 ], indices[ i ][ 1 ], indices[ i ][ 2 ] );
    bb.putFloat( normal.x );
    bb.putFloat( normal.y );
    bb.putFloat( normal.z );
    for ( int j = 0; j < 3; j++ ) // the three vertices of the triangle
    {
        bb.putFloat( vertices[ indices[ i ][ j ] ].x );
        bb.putFloat( vertices[ indices[ i ][ j ] ].y );
        bb.putFloat( vertices[ indices[ i ][ j ] ].z );
    }
    bb.putShort( ( short ) 0 ); // attribute byte count
    bb.flip();
    ch.write( bb );
}
close the file:
ch.close();
Get the normals:
Vector3 getNormal( int ind1, int ind2, int ind3 )
{
    Vector3 p1 = vertices[ ind1 ];
    Vector3 p2 = vertices[ ind2 ];
    Vector3 p3 = vertices[ ind3 ];
    return p1.cpy().sub( p2 ).crs( p2.x - p3.x, p2.y - p3.y, p2.z - p3.z ).nor();
}
See also:
Vector3
You should generate this file in ASCII and use an ASCII to Binary STL converter.
If you are unable to answer this question yourself, it is probably far easier to do it in ASCII first.
http://www.thingiverse.com/thing:39655
Since your question was based on writing a file to send to a 3D printer, I suggest you ditch the STL format file and use an OBJ format file instead. It is much simpler to compose, and it produces a much smaller file. There isn't a binary flavor of OBJ, but it's still a pretty compact file as you will see.
The (abbreviated) spec says:
List all the geometric vertex coordinates as a "v", followed by x, y, z values, like:
v 123.45 234.56 345.67
Then list all the triangles as "f", followed by indices in CCW order, like:
f 1 2 3
Indices start with 1.
Use a # character to start a comment line. Don't append comments anywhere else in a line.
Blank lines are ok.
There's a whole bunch of other things it supports, like normals, and textures. But if all you want to do is write your geometry to file to import into a 3D printer then OBJ is actually preferred, and this simple content is valid and adequate.
Here is an example of a perfectly valid file composing a 1 unit cube, as imported successfully into Microsoft 3D Viewer (included in Windows 10), Autodesk Meshmixer (free download), and PrusaSlicer (free download):
# vertices
v 0 0 0
v 0 1 0
v 1 1 0
v 1 0 0
v 0 0 1
v 0 1 1
v 1 1 1
v 1 0 1
# triangle indices
f 1 3 4
f 1 2 3
f 1 6 2
f 1 5 6
f 1 8 5
f 1 4 8
f 3 7 8
f 3 8 4
f 3 6 7
f 2 6 3
f 5 8 7
f 5 7 6
If you've got data in multiple meshes, you ought to coalesce the vertices to eliminate duplicate points. But because the file is plain text, you can use a PrintWriter object and its println() methods to write the whole thing, as sketched below.
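A minimal sketch of such an OBJ writer (the vertex and face arrays here are hypothetical placeholders, not tied to any particular mesh class):
import java.io.IOException;
import java.io.PrintWriter;

public class ObjWriter {
    // vertices: one {x, y, z} triple per point; faces: 1-based vertex indices in CCW order
    public static void write(String fileName, double[][] vertices, int[][] faces)
            throws IOException {
        try (PrintWriter out = new PrintWriter(fileName)) {
            out.println("# vertices");
            for (double[] v : vertices) {
                out.println("v " + v[0] + " " + v[1] + " " + v[2]);
            }
            out.println("# triangle indices");
            for (int[] f : faces) {
                out.println("f " + f[0] + " " + f[1] + " " + f[2]);
            }
        }
    }
}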

Java K-means implementation with unexpected output

I'm using the Trickl-Cluster project to cluster my data set
and Colt to store the data objects in matrices.
After executing this code
import cern.colt.matrix.DoubleMatrix2D;
import cern.colt.matrix.impl.DenseDoubleMatrix2D;
import com.trickl.cluster.KMeans;
DoubleMatrix2D dm1 = new DenseDoubleMatrix2D(3, 3);
dm1.setQuick(0, 0, 5.9);
dm1.setQuick(0, 1, 1.6);
dm1.setQuick(0, 2, 18.0);
dm1.setQuick(1, 0, 2.0);
dm1.setQuick(1, 1, 3.5);
dm1.setQuick(1, 2, 20.3);
dm1.setQuick(2, 0, 11.5);
dm1.setQuick(2, 1, 100.5);
dm1.setQuick(2, 2,6.5);
System.out.println (dm1);
KMeans km = new KMeans();
km.cluster(dm1 ,1);
DoubleMatrix2D dm11 = km.getPartition();
System.out.println (dm11);
DoubleMatrix2D dm111 = km.getMeans();
System.out.println (dm111);
I got the following output:
3 x 3 matrix
5.9 1.6 18
2 3.5 20.3
11.5 100.5 6.5
3 x 1 matrix
1
1
1
3 x 1 matrix
6.466667
35.2
14.933333
Following the algorithm's steps, it is strange to expect 1 cluster and get 3 means.
The documentation is not very clear about that specific point.
This is the definition of the method Cluster according to the java doc of the project
void cluster(cern.colt.matrix.DoubleMatrix2D data, int clusters)
So, logically speaking, the int clusters parameter represents the number of clusters expected after K-means terminates.
Do you have any idea about the relation between the output of the KMeans class in this project and the results expected from the K-means algorithm?
This is one 3-dimensional mean. If you put in three-dimensional data, you get out three-dimensional means.
Note that running k-means with k=1 is absolutely nonsensical, as it will simply compute the mean of the data set:
(5.9+2+11.5) / 3 = 6.466667
(1.6+3.5+100.5) / 3 = 35.2
(18+20.3+6.5) / 3 = 14.933333
The result is obviously correct.
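For a quick sanity check, the same per-column means can be computed in plain Java (no clustering library involved):
double[][] data = {
    {  5.9,   1.6, 18.0 },
    {  2.0,   3.5, 20.3 },
    { 11.5, 100.5,  6.5 }
};
double[] mean = new double[3];
for (double[] row : data) {
    for (int j = 0; j < 3; j++) {
        mean[j] += row[j];
    }
}
for (int j = 0; j < 3; j++) {
    mean[j] /= data.length;
}
// Prints approximately [6.4667, 35.2, 14.9333] -- a single 3-dimensional mean
System.out.println(java.util.Arrays.toString(mean));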
