I developed a Java program to calculate cosine similarity on the basis of TF*IDF. It works very well, but there is one problem... :(
for example:
If I have the following two matrices and I want to calculate cosine similarity, it does not work, as the rows are not the same length:
doc 1
1 2 3
4 5 6
doc 2
1 2 3 4 5 6
7 8 5 2 4 9
If the rows and columns are the same length then my program works very well, but it fails when they differ.
Any tips?
I'm not sure of your implementation, but the cosine distance of two vectors is equal to the normalized dot product of those vectors.
The dot product of two vectors can be expressed as a · b = a^T b. As a result, if the vectors have different lengths you can't take the dot product to compute the cosine.
Now, in a standard TF*IDF approach, the matrix should be indexed by (term, document); as a result, any term not appearing in a given document should appear as a zero in your matrix.
Now the way you have it set up seems to suggest there are two different matrices for your two documents. I'm not sure if this is your intent, but it seems incorrect.
On the other hand if one of your matrices is supposed to be your query, then it should be a vector and not a matrix, so that the transpose produces the correct result.
A full explanation of TF*IDF follows:
Ok, in a classic TF*IDF you construct a term-document matrix a. Each value in matrix a is written a_ij, where i is the term and j is the document. This value is a combination of local, global and normalization weights (although if you normalize your documents, the normalization weight should be 1). Thus a_ij = f_ij * D/d_i, where f_ij is the frequency of word i in document j, D is the total number of documents, and d_i is the number of documents with term i in them.
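For example (the numbers are invented for illustration): if term i appears 3 times in document j, there are 10 documents in total, and the term appears in 2 of them, then a_ij = 3 * 10/2 = 15.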
Your query is a vector of terms designated as b. Each entry b_iq refers to term i in query q: b_iq = f_iq, where f_iq is the frequency of term i in query q. In this case each query is a vector, and multiple queries form a matrix.
We can then calculate the unit vectors of each so that when we take the dot product it produces the correct cosine. To obtain unit vectors, we divide both the matrix a and the query b by their Frobenius norms.
Finally we can compute the cosine distance by taking the transpose of the vector b for a given query, i.e. one query (or vector) per calculation. This is denoted as b^T a. The final result is a vector with a score for each document, where a higher score denotes a higher document rank.
Simple Java cosine similarity:
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

static double cosine_similarity(Map<String, Double> v1, Map<String, Double> v2) {
    // Intersection of the two key sets: only terms present in both vectors
    // contribute to the dot product (all other products are zero). Note
    // retainAll (intersection), not removeAll, which would compute the set
    // difference and throw a NullPointerException below.
    Set<String> both = new HashSet<>(v1.keySet());
    both.retainAll(v2.keySet());
    double scalar = 0, norm1 = 0, norm2 = 0;
    for (String k : both) scalar += v1.get(k) * v2.get(k);
    for (String k : v1.keySet()) norm1 += v1.get(k) * v1.get(k);
    for (String k : v2.keySet()) norm2 += v2.get(k) * v2.get(k);
    return scalar / Math.sqrt(norm1 * norm2);
}
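For example, a hypothetical call (the maps stand in for the two documents' TF*IDF weight vectors, keyed by term); vectors of different sizes are handled naturally, because a term missing from one map simply contributes zero to the dot product:

Map<String, Double> doc1 = new HashMap<>();
doc1.put("apple", 1.0);
doc1.put("banana", 2.0);
doc1.put("cherry", 3.0);

Map<String, Double> doc2 = new HashMap<>();
doc2.put("apple", 1.0);
doc2.put("banana", 2.0);
doc2.put("cherry", 3.0);
doc2.put("date", 4.0);  // extra term, absent from doc1

double sim = cosine_similarity(doc1, doc2);  // ~0.68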
I am hitting a wall coming up with an equation for this simple question and need a different perspective on the algorithm. I have a number x and I want to distribute it to n elements in a greedy manner.
For x=9, n=3
[1,2,3],[4,5,6],[7,8,9] OR [3,3,3]
For x=10, n=3
[1,2,3,4],[5,6,7],[8,9,10] OR [4,3,3]
For x=11, n=3
[1,2,3,4],[5,6,7,8],[9,10,11] OR [4,4,3]
For x=12, n=3
[1,2,3,4],[5,6,7,8],[9,10,11,12] OR [4,4,4]
As far as I understand, you need to get an array like [4,4,3]. So use integer division and the modulo operation:
smallvalue = x / n;          // integer division
largecount = x % n;          // number of larger values
smallcount = n - largecount; // number of smaller values
Now fill the array with largecount copies of smallvalue + 1 and then with smallcount copies of smallvalue.
If you need the result as ranges, like [1,2,3,4],[5,6,7,8],[9,10,11], use the same information to generate them, as in the sketch below.
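A minimal Java sketch of both outputs (the method names are mine; assumes java.util.List/ArrayList imports):

// Group sizes, e.g. x=11, n=3 -> [4, 4, 3].
static int[] sizes(int x, int n) {
    int smallvalue = x / n;   // integer division
    int largecount = x % n;   // number of larger groups
    int[] result = new int[n];
    for (int i = 0; i < n; i++) {
        result[i] = smallvalue + (i < largecount ? 1 : 0);
    }
    return result;
}

// Ranges, e.g. x=11, n=3 -> [1,2,3,4], [5,6,7,8], [9,10,11].
static List<List<Integer>> ranges(int x, int n) {
    List<List<Integer>> groups = new ArrayList<>();
    int next = 1;
    for (int size : sizes(x, n)) {
        List<Integer> group = new ArrayList<>();
        for (int i = 0; i < size; i++) group.add(next++);
        groups.add(group);
    }
    return groups;
}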
What's the difference between Matrix.times() and Matrix.arrayTimes() in JAMA (a Java library for matrix calculations)?
If I have a d-dimensional vector x and a k-dimensional vector z and I want to compute xz^T (x times z transpose), should I use Matrix.times or Matrix.arrayTimes?
How can I calculate this multiplication using JAMA?
arrayTimes is simply element-by-element multiplication:
C[i][j] = A[i][j] * B[i][j];
(the matrices are treated as grids of corresponding individual numbers)
while times is matrix multiplication, where each element of the product is the dot product of the corresponding row of A and column of B.
The dimensions must match as per what you want to achieve.
Given your problem of x z^T, the only viable solution is to turn these into d x 1 and k x 1 matrices respectively and perform x.times(z.transpose()). The result will be a matrix of dimensions d x k.
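A short sketch with JAMA (the vector values are made up for illustration):

import Jama.Matrix;

public class OuterProduct {
    public static void main(String[] args) {
        // d = 3, k = 2: store the vectors as d x 1 and k x 1 matrices.
        Matrix x = new Matrix(new double[][] { {1}, {2}, {3} }); // 3 x 1
        Matrix z = new Matrix(new double[][] { {4}, {5} });      // 2 x 1

        // x z^T: (3 x 1) times (1 x 2) gives a 3 x 2 matrix.
        Matrix outer = x.times(z.transpose());
        outer.print(6, 1);

        // arrayTimes, by contrast, requires both operands to have the
        // same dimensions and multiplies element by element.
    }
}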
I want to categorise digits which are represented in a 64-dimensional space, giving an 8x8 pixel character image. Each attribute is an integer from 0 to 16. I have 20 rows of 64 values, plus one value at the end of each row which determines the category. The category was previously determined by UCI, but I want to know how they got each particular category for each row. They say they used Euclidean distance to determine the category.
My question is: how do I apply Euclidean distance to 64 values? I tried to use the following formula (Pythagorean theorem), Math.sqrt(Math.pow(x2-x1, 2) + Math.pow(y2-y1, 2)), within a row, but the result was too big and I do not know what it represents. For example, for the first row I obtained 1612, whose square root is about 40.15.
This is my code for the process:
public static void main(String[] args) {
    int[] row = new int[64];
    for (int z = 0; z < 64; z++) {
        row[z] = digits[0][z]; // get the first row and store it
    }
    double result = 0;
    for (int z = 0; z < 64; z += 2) {
        double distance = Math.pow(row[z] - row[z + 1], 2);
        result = result + distance; // add squared difference each time
        System.out.print(result + ", ");
    }
}
The first row of digits is this:
0,0,5,13,9,1,0,0,0,0,13,15,10,15,5,0,0,3,15,2,0,11,8,0,0,4,12,0,0,8,8,0,0,5,8,0,0,9,8,0,0,4,11,0,1,12,7,0,0,2,14,5,10,12,0,0,0,0,6,13,10,0,0,0,0
I am not sure if this makes sense but if something is not clear please do ask.
Thanks in advance.
My question is how do I apply Euclidean distance to 64 values?
You do not. Distance is a measure between two objects, each of which can have 64 values, but you need two objects. In particular, Euclidean distance is defined as
dist(x, y) = ||x-y||_2 = sqrt[ SUM_{i=1}^d (x_i - y_i)^2 ]
where d is the number of dimensions, and x_i means ith dimension of x.
So they say they used Euclidean distance to determine the category.
They said more than that, as the distance itself does not define anything besides... distance. A category, on the other hand, is an abstract object, which might be defined by some characteristic point (centroid); you then assign the category whose centroid is closest (in terms of the given distance).
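For example, a minimal sketch of the comparison the formula describes, between two of the 64-value rows (the method name is mine; the trailing category value in column 64 is excluded):

static double euclidean(int[] a, int[] b) {
    double sum = 0;
    for (int i = 0; i < 64; i++) {   // 64 attributes; skip the category column
        double diff = a[i] - b[i];
        sum += diff * diff;
    }
    return Math.sqrt(sum);
}

// Usage: the distance between the first two rows of the data set.
// double d = euclidean(digits[0], digits[1]);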
I am trying to solve this question: https://www.hackerrank.com/challenges/journey-to-the-moon i.e. a problem of finding the connected components of a graph. What I have is a list of vertices (from 0 to N-1), and each line in the standard input gives me a pair of vertices that are connected by an edge (so if I have 1, 3 it means that vertex 1 and vertex 3 are in one connected component). My question is: what is the best way to store the input, i.e. how do I represent my graph? My idea is to use an ArrayList of ArrayLists: each position in the outer list stores another list of adjacent vertices. This is the code:
public static List<ArrayList<Integer>> graph;
and then in the main() method:
graph = new ArrayList<ArrayList<Integer>>(N);
for (int j = 0; j < N; j++) {
    graph.add(new ArrayList<Integer>());
}
// then for each line in the standard input I fill the corresponding values in the array:
for (int j = 0; j < I; j++) {
    String[] line2 = br.readLine().split(" ");
    int a = Integer.parseInt(line2[0]);
    int b = Integer.parseInt(line2[1]);
    graph.get(a-1).add(b);
    graph.get(b-1).add(a);
}
I'm pretty sure that for solving the question I have to put vertex a at position b-1 and then vertex b at position a-1, so this should not change. But what I am looking for is a better way to represent the graph.
Using Java's collections (ArrayList, for example) adds a lot of memory overhead: each Integer object takes at least 12 bytes, in addition to the 4 bytes required for storing the int.
Just use a huge single int array (let's call it edgeArray) which represents the adjacency matrix, and enter a 1 when a cell corresponds to an edge: e.g., if nodes k and m are seen in the input, then cell (k, m) will have 1, else 0. In row-major order, that cell has index k * N + m, i.e. edgeArray[k * N + m] = 1. (You can choose either row-major or column-major order.) But this int array will be very sparse, and it's trivial to implement a sparse array: just keep an array of the non-zero indices, in sorted order. It should be in sorted order so that you can binary search. The number of elements will be on the order of the number of edges.
Of course, while you are building the adjacency matrix, you won't yet know how many edges there are, so you won't be able to allocate the array up front. Just use a hash set instead. Don't use Java's HashSet, which is very inefficient; look at IntOpenHashSet from fastutil. If you are not allowed to use libraries, implement something similar to it.
Let us say that the IntOpenHashSet variable you will be using is called adjacencyMatrix. So if you see 3 and 2, and there are 10^6 nodes in total (N = 10^6), then you will just do
adjacencyMatrix.add(3 * 1000000 + 2);
Once you have processed all the input, you can build the sparse adjacency matrix described above:
final int[] edgeArray = adjacencyMatrix.toIntArray(new int[adjacencyMatrix.size()]);
java.util.Arrays.sort(edgeArray);
Given a node, finding all adjacent nodes:
So if you need all the nodes connected to node p, you would binary search for the first value that is greater than or equal to p * N (O(log(number of edges))). Then you just traverse the array until you hit a value greater than or equal to (p + 1) * N; all the values you encounter on the way are nodes connected to p, as in the sketch below.
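A rough sketch of that lookup (the names are mine; I use long for the encoded entries because k * N + m overflows an int once N is large, and assume java.util imports):

// All neighbours of node p, given edgeArray sorted ascending and
// entries encoded as k * N + m for the edge (k, m).
static List<Integer> neighbours(long[] edgeArray, int p, long N) {
    List<Integer> result = new ArrayList<>();
    int i = Arrays.binarySearch(edgeArray, p * N);
    if (i < 0) i = -i - 1;  // no exact match: start at the insertion point
    for (; i < edgeArray.length && edgeArray[i] < (p + 1) * N; i++) {
        result.add((int) (edgeArray[i] % N));  // decode the column index m
    }
    return result;
}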
Comparing it with the approach you mentioned in your question:
It uses O(N*b) space, where N is the number of nodes and b is the branching factor; this is lower-bounded by the number of edges.
For the approach I mentioned, the space complexity is just O(E). In fact, it's exactly E integers, plus the header of the int array.
I used var graph = new Dictionary<long, List<long>>();
See here for complete solution in c# - https://gist.github.com/newton3/a4a7b4e6249d708622c1bd5ea6e4a338
PS - 2 years late, but just in case someone stumbles onto this.
Let's say that you have an arbitrarily large two-dimensional array with an even number of items in it. Let's also assume for clarity that you can only choose between two things to put at a given index in the array. How would you go about putting a random choice at a given index while ensuring that, once the array is filled, you have an even split between the two choices?
If there are any answers with code, Java is preferred but other languages are fine as well.
You could basically think about it the opposite way. Rather than deciding, for a given index, which value to put in it, you could select n/2 elements from the array and place the first value in them, then place the second value in the other n/2.
A 2-D array A[M,N] can be mapped to a vector V[M*N] (you can use row-major or column-major order to do the mapping).
Start with a vector V[M*N]. Fill its first half with the first choice, and the second half of the array with the second choice object. Run a Fisher-Yates shuffle, and convert the shuffled array to a 2-D array. The array is now filled with elements that are evenly split among the two choices, and the choices at each particular index are random.
The code below creates a List<T> the size of the area of the matrix, and fills it half with the first choice (space[0]) and half with the second (space[1]). Afterward, it applies a shuffle (namely Fisher-Yates, via Collections.shuffle) and fills the matrix with these values.
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

static <T> void fill(final T[][] matrix, final T... space) {
    final int w = matrix.length;
    final int h = matrix[0].length;
    final int area = w * h;
    final List<T> sample = new ArrayList<T>(area);
    final int half = area >> 1;
    // Half the sample is space[0], the other half space[1].
    sample.addAll(Collections.nCopies(half, space[0]));
    sample.addAll(Collections.nCopies(half, space[1]));
    Collections.shuffle(sample); // Fisher-Yates shuffle
    final Iterator<T> cursor = sample.iterator();
    for (int x = w - 1; x >= 0; --x) {
        final T[] column = matrix[x];
        for (int y = h - 1; y >= 0; --y) {
            column[y] = cursor.next();
        }
    }
}
Pseudo-code:
int trues_remaining = size / 2;
int falses_remaining = size / 2;
while (trues_remaining + falses_remaining > 0)
{
    bool b;
    if (trues_remaining > 0)
    {
        if (falses_remaining > 0)
            b = getRandomBool();
        else
            b = true;
    }
    else
        b = false;
    array.push(b);
    // Count down whichever choice was used, so the loop terminates
    // and the final array has an exact 50/50 split.
    if (b) trues_remaining--; else falses_remaining--;
}
Doesn't really scale to more than two values, though. How about:
assoc_array = { 1 = 4, 2 = 4, 3 = 4, 4 = 4 };
while (!assoc_array.isEmpty())
{
    int index = rand(assoc_array.getNumberOfKeys());
    int n = assoc_array.getKeyAtIndex(index);
    array.push(n);
    assoc_array[n]--;
    if (assoc_array[n] <= 0) assoc_array.deleteKey(n);
}
EDIT: just noticed you asked for a two-dimensional array. Well, it should be easy to adapt this approach to n dimensions.
EDIT2: from your comment above, "school yard pick" is a great name for this.
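A direct Java transcription of that second pseudo-code, hardwired to four values with a quota of 4 each (the names and the use of java.util.Random are my choices):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// "School yard pick": repeatedly pick a random key that still has quota
// left, emit it, and delete the key once its quota is used up.
static List<Integer> schoolYardPick(Random rng) {
    int[] quota = {4, 4, 4, 4};                 // quotas for values 1..4
    List<Integer> keys = new ArrayList<>(List.of(1, 2, 3, 4));
    List<Integer> out = new ArrayList<>();
    while (!keys.isEmpty()) {
        int index = rng.nextInt(keys.size());
        int n = keys.get(index);
        out.add(n);
        if (--quota[n - 1] <= 0) keys.remove(index);
    }
    return out;
}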
It doesn't sound like your requirements for randomness are very strict, but I thought I'd contribute some more thoughts for anyone who may benefit from them.
You're basically asking for a pseudorandom binary sequence, and the most popular one I know of is the maximum length sequence (m-sequence). This uses an n-bit register with linear feedback (a linear feedback shift register, or LFSR) to define a periodic series of 1's and 0's that has a perfectly flat frequency spectrum; at least, it is perfectly flat within certain bounds, determined by the sequence's period (2^n - 1 bits).
What does that mean? Basically it means that the sequence is guaranteed to be maximally random across all shifts (and therefore frequencies) if its full length is used. When compared to an equal length sequence of numbers generated from a random number generator, it will contain MORE randomness per length than your typical randomly generated sequence.
It is for this reason that it is used to determine impulse functions in white noise analysis of systems, especially when experiment time is valuable and higher order cross effects are less important. Because the sequence is random relative to all shifts of itself, its auto-correlation is a perfect delta function (aside from qualifiers indicated above) so the stimulus does not contaminate the cross correlation between stimulus and response.
I don't really know what your application for this matrix is, but if it simply needs to "appear" random then this would do that very effectively. In terms of the balance of 1's vs. 0's, the sequence is guaranteed to have exactly one more 1 than 0. Therefore if you're trying to create a grid of 2^n elements, you are guaranteed the correct balance by tacking a 0 onto the end.
So an m-sequence is more random than anything you'll generate using a random number generator and it has a defined number of 0's and 1's. However, it doesn't allow for unqualified generation of 2d matrices of arbitrary size - only those where the total number of elements in the grid is a power of 2.
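For illustration, a tiny sketch of a 4-bit maximal-length LFSR with taps at bits 4 and 3 (a standard maximal tap choice; my pick, not something from the answer above). It emits one full period of 15 bits, containing eight 1's and seven 0's:

// One period (2^4 - 1 = 15 bits) of a 4-bit m-sequence.
static int[] mSequence() {
    int state = 0b0001;      // any non-zero seed works
    int[] seq = new int[15];
    for (int i = 0; i < 15; i++) {
        seq[i] = state & 1;  // output the low bit of the register
        int feedback = ((state >> 3) ^ (state >> 2)) & 1;  // taps: bits 4 and 3
        state = ((state << 1) | feedback) & 0b1111;        // shift, keep 4 bits
    }
    return seq;
}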