How to Implement K-Means Clustering Algorithm for MFCC Features? - java

I got the features of some sound variables with MFCC Algorithm. I want to cluster them with K-Means. I have 70 frames and every frame has 9 cepstral coefficients for one voice sample. It means that I have something like a 70*9 size matrix.
Let's assume that A, B and C are the voice records so
A is:
List<List<Double>> -> 70*9 array (I can use Vector instead of List)
and also B and C has same lengths too.
I don't want to cluster each frame, I want to cluster each frame block(at my example one group has 70 frames).
How can I implement it with K-Means at Java?

Here's where your knowledge of the problem domain becomes crucial. You might just use a distance between the 70*9 matrices but you can probably better. I don't know the particular features you mention, but some generic examples might be average, standard deviation of the 70 values per feature. You're basically looking to reduce the num of dimensions, both to improve speed but also to make the measure robust against sImple transformations, like offsetting all values by one step

K-Means has some pretty tough assumptions on your data. I'm not convinced that your data is appropriate to run k-means on it.
K-means is designed for Euclidean distance, and there might be a more appropriate distance measure for your data.
K-means needs to be able to compute sensible means, which may not be appropriate on your data
Many distance functions (and algorithms!) don't work well at 70*9 dimensions ("curse of dimensionality")
You need to know k beforehand.
Side note: keep away from Java generics for primitive type such as Double. It kills performance. Use double[][].

Related

Clustering of images to evaluate diversity (Weka?)

Within a university course I have some features of images (as text files). I have to rank those images according to their diversity.#
The idea I have in mind is to feed a k-means classifier with the images and then compute the euclidian-distance from the images within a cluster to the cluster's centroïd. Then do a rotation between clusters and take always the (next) closest image to the centroïd. I.e., return closest to centroïd 1, then closest to centroïd 2, then 3.... then second closest to centroïd 1, 2, 3 and so on.
First question: would this be a clever approach? Or am I on the wrong path?
Second question: I'm a bit confused. I thought I'd feed the data to Weka and it'd tell me "hey, if I were you, I'd split this data into 7 clusters", or something like that. I mean, that it'd be able to give me some information about the clusters I need. Instead, to use simplekmeans I'm supposed to know a priori how many clusters I'll use... how could I possibly know that?
One example of what I mean: let's say I have 3 mono-color images: light-blue, blue, red.
I thought Weka would notice that the 2 blues are similar and cluster them together.
Btw I'm kind of new to Weka (as you might have seen) so if you could provide some information on which functions I miggt want to use (and why :P) I'd be grateful!
Thank you!
Simple K-means - is an algorithm where you have to specify a number of the possible clusters in the data set.
If you don't know how many clusters there might be, it's better to get different algorithm or find out a number of the clusters.
You can use X-means -there you don't need to specify k parameter. (http://weka.sourceforge.net/doc.packages/XMeans/weka/clusterers/XMeans.html)
X-Means is K-Means extended by an Improve-Structure part In this part of the algorithm the centers are attempted to be split in its region. The decision between the children of each center and itself is done comparing the BIC-values of the two structures.
or you can observe a cut point chart based on AHC - hierarchical clustering algorithm (https://en.wikipedia.org/wiki/Hierarchical_clustering)
and then deduct a number of the clusters

Document or Text Clustering using EM algorithm for GMM, how to do?

I am trying to make a project of Document Clustering (in Java). There can be maximum 1 million documents and I want to make unsupervised cluster. To do, I am trying to implement EM algorithm with Gaussian Mixture Model.
But, I am not sure how to make the document vector.
I am thinking something like this, first I will calculate TF/IDF for each word in the document(after removing stop words and stemming done).
Then I will normalize each vector. At this stage, the question arises, how shall I represent a vector by a point? Is it possible?
I have learned about EM algorithm from this (https://www.youtube.com/watch?v=iQoXFmbXRJA) video where 1-D points are used for GMM and to be used in EM.
Can any one explain how to convert a vector in a 1-D point to implement EM for GMM?
If my approach is wrong, can you explain how to do the whole thing in simple words? Sorry for my long question. Thanks for your help!
If you're going to be clustering that many documents, you might consider K-Medoids as well, it creates the initial centroids using randomization (basically). As for representing the vectors as a point, In my experience that is really sketchy. What I have done in the past is store term vectors in a SortedMap, remove irrelevant terms however you want, normalize the vectors into sparse representations, then you can use something like Cosine Similarity, or Euclidean distance (inverted) to gauge similarity. I have used JavaML, Weka, and rolled my own unsupervised clustering. The KMedoids in JavaML is pretty good, you will have to reduce your vectors to double[] data structures (normalized of course) and use their dataset object.
HTH
I would start with something simpler than EM for GMM. If you know the number of clusters in advance, use K-Means. Otherwise, use Mean Shift.
If you must learn a GMM, then note that it can work with an N-D feature vector. If you must must reduce the features to a single dimension, you can use PCA (or some other data dimensionality reduction) algorithm to do that.
In any case, you can find implementations of these algorithms on the net and don't have to implement them yourself, which would slow down your project.

Quadtree with HashMap

I am considering using a HashMap as the backing structure for a QuadTree. I believe I can use Morton sequencing to uniquely identify each square of my area of interest. I know that my QuadTree will have a height of at most 16. From my calculations, that would be lead to a matrix of 65,536 x 65,536 which should give me at most 4,294,967,296 cells. Does anyone know if that is too many elements for a HashMap? I could always write up a QuadTree using a Tree but I thought that I could get better performance with a HashMap.
Morton sequence of height 1 == (2x2) == 4
Morton sequence of height 2 == (4x4) == 16
Morton sequence of height 3 == (8x8) == 64
Morton Sequencing example for a tree of max height 3.
Here is what I know:
I will get data in lat/lon over a know rectangular area.
The data will not completely cover the whole area and will likely be
consolidated into chunks somewhere in that area. (worse case is data in all 4,294,967,296 cells)
The resolution of the data ends up breaking down the area into 65k by 65k rectangle.
I also know that I will likely get 10 to 1 queries to insert/update of
the data.
Hashmap is not a good idea.
There is a better solution, used in navigation systems:
Assign each Quadtree cell a letter: A (Left,upper), B(right, upper) , C and D.
Now you can adress each quad cell via a String:
ABACE: this identifies the cell in level 5. (A->B->A->C->E)
Search internet for details on that specific Quadtree coding.
Dont forgett: You decide the sub division rule (when to subdivide a cell into smaller ones), and that decides how many cells you get. The number you give is far to high.
It is only an theroetical calculation which reminds me 1:1 on Google Maps Quad tree.
Further it is import to know which type of Quadtree you need for your Application:
Point Quadtree, Region Quadtree (bounbding box), Line Quadtree.
If you know any existing Quadtree implementation in java. please post a comment, or edit this answer.
Further you cannot implement a one for all solution.
You have to know aproxmetly how many elements you will suport.
The theroretical maximum , which is not equal to the expected maximum, is not a good approach.
You have to know that because you must decide whether to store that in main memory, or on disk, this also influences the structure of the quadtree. The "ABCD" solution is suitable
for dynamic loading from disk.
The google approach stores images in the quadtree, this is different from points you want to store, so i doubt that your calculation is realistic.
If you want to store all streets of all countries in the world, you can estimate that
number because the number of points are known (Either OpenStreetMap, TomTom (Teelatlas), or (Nokia Maps) Navteq.
If you realized that you have to store the quadtree on disk, then proably the size is open, and limited by only the disk space.
I think that implementing a Quad Tree as a Tree will give you better results. Actually implementing such a big database in a HashMap is a bad idea anyways. Because if you have a lot of collisions, the performance of a HashMap decreases badly.
And apparently you know exactly how much data you have. In that case, a HashMap is totally redundant. A HashMap is meant for when you do not know how much data there is. But in this case, you know that every node of the tree has four elements. So why even bother using a HashMap.?
Also, your table is apparently at least 4GB large. On most systems, that just barely fits in your memory. And since there is also Java VM overhead, why do you store this in memory? It would be better to find a datastructure that works well on disks. One such datastructure for spatial data (which I assume you are having, since you are using a quad tree), is an R-Tree.
Whoa, we're getting a number of concepts here all at once. First of all, what are you trying to reach? Store a quad tree? A matrix of cells? Hash lookups?
If you want a quad tree, why use a hash map? You know there could be at most 4 child nodes to each node. A hash map is useful for an arbitrary number of key-value mappings where quick lookup is necessary. If you're only going to have 4, a hash might not even be important. Also, while you can nest maps, it's a bit unwieldy. You're better off using some data structure or writing your own.
Also, what are you trying to reach with the quad tree? Quickly looking up a cell in the matrix? Some coordinate mapping function might serve you much better there.
Finally, I'm not so much worried about that amount of nodes in a hash map, as I am by the amount purely on its own. 65536² cells would end up being 4 GiB of memory even at one byte per cell.
I think it would be best to pedal all the way back to the question "what is my goal with this data", then find out which data structures could help you with that (keepign requirements such as lookups in mind) while managing to fit it in memory.
Definitely use directly linked nodes for both space and speed reasons.
With data this big I'd avoid Java altogether. You'll be constantly at the mercy of the garbage collector. Go for a language closer to the metal: C or C++, Pascal/Delphi, Ada, etc.
Put the four child pointers in an array so that you can refer to leaves as packed arrays of 2-bit indices (a nice reason to use Ada, which will let you define such things with no bit fiddling at all). I guess this is Morton sequencing. I did not know that term.
This method of indexing children in itself is a reason to avoid Java. Including a child array in a node class instance will cost you a pointer plus an array size field: 8 or 16 bytes per node that aren't needed in some other languages. With 4 billion cells, that's a lot.
In fact you should do the math. If you use implicit leaf cells, you still have 1 billion nodes to represent. If you use 32-bit indices to reference them (to save memory vice 64-bit pointers), the minimum is 16 bytes per node. Say node attributes are a mere 4 bytes. Then you have 20 Gigabytes just for a full tree even with none of the Java overhead.
Better have a good budget for RAM.
It is true that most typical quad-trees will simply use nodes with four child node pointers and traverse that, without any mention of hashmaps. However, it is also possible to write an efficient quadtree-like spatial indexing method that stores all its nodes in a big hashmap.
The benefit is that by using the Morton sequence (or another similarly generated value) as the key, you become able to retrieve nodes at any level with only one pointer dereference.
In "traditional" quadtree implementations we get cache misses due to repeated pointer dereferencing while looking up nodes, and this becomes the main bottleneck. So provided that the cost of encoding the coordinate space and getting a hash is lower than the cost of dereferencing the node pointers along the search path, such an implementation could be faster. Particularly if the map is very deep (having sparse locations requiring high precision).
You don't really need the Morton sequence, and you hardly need to think of it as a quadtree when doing this. A very simple example implementation:
In order to retrieve a quad of some level, use { x, y, level } as the hashmap key, where x and y are quantized to that level. You only need to include the level in the key if you are storing several levels in the same map.
Whether this is still a quadtree is up for discussion, but the functionality is the same.

Array of Structs are always faster than Structs of arrays?

I was wondering if the data layout Structs of Arrays (SoA) is always faster than an Array of Structs (AoS) or Array of Pointers (AoP) for problems with inputs that only fits in RAM programmed in C/JAVA.
Some days ago I was improving the performance of a Molecular Dynamic algorithm (in C), summarizing in this algorithm it is calculated the force interaction among particles based on their force and position.
Original the particles were represented by a struct containing 9 different doubles, 3 for particles forces (Fx,Fy,Fz) , 3 for positions and 3 for velocity. The algorithm had an array containing pointers to all the particles (AoP). I decided to change the layout from AoP to SoA to improve the cache use.
So, now I have a Struct with 9 array where each array stores forces, velocity and positions (x,y,z) of each particle. Each particle is accessed by it own array index.
I had a gain in performance (for an input that only fits in RAM) of about 1.9x, so I was wondering if typically changing from AoP or AoS to SoA it will always performance better, and if not in which types of algorithms this do not occurs.
Much depends of how useful all fields are. If you have a data structure where using one fields means you are likely to use all of them, then an array of struct is more efficient as it keeps together all the things you are likely to need.
Say you have time series data where you only need a small selection of the possible fields you have. You might have all sorts of data about an event or point in time, but you only need say 3-5 of them. In this case a structure of arrays is more efficient because a) you don't need to cache the fields you don't use b) you often access values in order i.e. caching a field, its next value and its next is useful.
For this reason, time-series information is often stored as a collection of columns.
This will depend on how exactly you access the data.
Try to imagine, what exactly happens in the hardware when you access your data, in either SoA or AoS.
To reason about your question, you must consider following things -
If the cache is absent, the performance should be the same, assuming that memory access latency is equal for all the elements of the data.
Now with the cache, if you access consecutive address locations, definitely you will get performance improvement. This is exactly valid in your case. When you have AoS, The locations are not consecutive in the memory, so you must lose some performance there.
You must be accessing in for loops your data like for(int i=0;i<1000000;i++) Fx[i] = 0. So if the data is huge in quantity, you will easily see the small performance benefits. If your data was small, this would not matter much.
Finally, you also don't know about the DRAM that you are using. It will have some benefits when you access consecutive data. For example to understand why it is like that you can refer to wiki.

Java Program using 4D array [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
I'm a first year computer engineering student and I'm quite new here. I have been learning Java for the past three and a half months, and C++ for six months before that. My knowledge of Java is limited to defining and using own methods, absolute basics of object-oriented programming like use of static data members and member visibility.
This afternoon, my computer programming prof taught us about multi-dimensional arrays in Java. About multi-dimensional arrays being simply arrays of arrays and so on. He mentioned that in nominal, educational programming, arrays beyond 2 dimensions are almost never used. Even 3D arrays are used only where absolutely essential, like carrying out scientific functions. This leaves next to zero use for 4D arrays as using them shows that "you're using the wrong datatype" in my prof's words.
However, I'd like to write a program in which the use of a 4D array, of any data type, primitive or otherwise, is justified. The program must not be as trivial as printing the elements of the array.
I have no idea where to begin, this is why I am posting this here. I'd like your suggestions. Relevant problem statements, algorithms, and bits and pieces of code are also welcome.
Thank you.
Edit: Forgot to mention, I have absolutely no idea about working with GUIs in Java, so please do not post ideas that implement GUIs.
Ideas:
- Matrix multiplication and it's applications like finding shortest path in graphs
- Solving of systems of equations
- Cryptography -- many cryptoprotocols represent data or keys or theirs internal structures in a form of matrices.
- Any algo on graphs represented as matrices
I must have been having some kind of fixation on matrices, sorry :)
For 4D arrays one obvious thing I can think of is the representation of 3D environment changing in time, so 4th dimension represents time scale. Or any representation of 3D which have additional associated property placed in 4th dimension of array.
You could create a Sodoku hypercube with 4 dimensions and stores the numbers the user enters into a 4dimensional int array.
One use could be applying dynamic programming to a function that takes 4 integer parameters f(int x,int y,int z,int w). To avoid calling this expensive function over and over again, you can cache the results in a 4D array, results[x][y][z][w]=f(x,y,z,w);.
Now you just have to find an expensive integer function with arity of 4, oh, and a need for calculating it often...
Just to back him up,..your prof is quite right. I'm afraid I might be physically violent to anyone using a 4D+ array in production code.
It's kinda cool to be able to go into greater than 3 dimensions as an educational exercise but for real work it makes things way too complicated because we don't really have much comprehension of structures with greater than 3 dimensions.
The reason it's difficult to come up with a practical use for 4D+ arrays is because there is (almost) nothing that complicated in the real world to model.
You could look into modelling something like a tesseract , which is (in layman's terms ) a 4D cube or as Victor suggests use the 4th dimension to model constant time.
HTH
There are many possible uses. As others have said, you can model a hypercube or something that makes use of a hypercube as well as modeling a change over time. However, there are many other possible uses.
For example, one of the theoretical simulation models of our universe uses 11th dimensional physics. You can write a program to model what these assumed physics would look like. The user would only be able to see a 3-dimensional space which definitely limits usability, but the 4th dimensional coordinate could act like the changing of a channel allowing the user to change their perspective. If a 4th dimensional explosion occurs, for example, you might even want a 5th dimensional array so that you can model what it looks like in each connected 3-dimensional space as well as how it looks in each frame of time.
To take a step away from the scientific, think about an MMORPG. Today many of those games uses "instanced" locations which means that a copy of a given zone is created exclusively for the use of a given group of players so to prevent lag. If this "instanced" concept was given a 4th dimensional coordinate and it allows players to shift their position across instances it could effectively allow all server worlds to be merged together while allowing the players a great deal of control over where they go while decreasing cost.
Of course, your question wants to know about ideas without using a GUI. That's a bit more difficult because you are working in a 2D environment. One real application would be Calculus. We have 3D graphing calculators, but for higher dimensions you pretty much have to do it by hand. A pogram that aims to solve these calculations for you might not be able to properly display the information, but you can certainly calculate it. Also, when hologaphic interfaces become a widespread reality it may be possible to represent a hypercube graph in 3D making such a program useful.
You might be able to write a text based board game where the position of pieces is represented with text. You can add dimensions and game rules to use them.
The simplest idea I could give you is a save state system. At each interval the program in memory is copied and stored into a file. It's coordinate is it's position in time. At face value you may not need a 4D array to handle this, but suppose the program you were saving states of used a 3D array. You could set it up to represent each saved state as a position in time that you can make use of and then view the change in time.
I'm not sure what specifically you could do with this, because I just started thinking about it. But you could possibly use a 4D array for some sort of basic physics simulation, like modeling a projectile flight involving some wind values and what not. That just came to mind because the term 4D always brings to mind that the "position" of any object is 4 values, with time as the 4th.
Being a physics student we have only 3 dimension of space but we have a 4th dimension which is time. So thinking in that way we can think of an array of any dimension(1D or 2D or 3D) whose values differ with time or an array which keeps the record of every array whose values changed with time.
It seems to be quite known to us. For example the "ATTENDANCE REGISTER" which we usually have in our classroom.
This is my view to it.
That's it.
Enjoy :-)
To give a concrete example for the Ishtar's answer: Four-string alignment. To compute optimal two-string alignment, you write one string along one (say, horizontal) axis of a 2D-array (a matrix!) and the other one along the other array. The array is filled with edit costs, and the optimal alignment of the two strings is the one which produces the lowest cost, associated with a path through the matrix. A common way of finding such a path is by the above mentioned dynamic programming. You can look up 'Levenshtein distance' or 'edit distance' for technical details.
The basic idea can be expanded to any number of strings. For four strings you'd need a four-dimensional array, to write each string along one of the dimensions.
In practice, however, multiple string alignment is not done this way, for at least two reasons:
Lack of flexibility: Why would you need to align exactly four strings??? In computational molecular biology, for example, you might wish to align many strings (think of DNA sequences), and their number is not known in advance, but it is seldom four. You program would be useful for a very limited class of problems.
Computational complexity, in space and time. The requirements are exponential in the number of dimensions, making the approach impractical for most real-world purposes. Besides, most of the entries in such multi-dimensional array would lie on such suboptimal paths, which are never even touched, so that storing them would be simply waste of space.
So, for all practical purposes, I believe your professor was right.

Categories