I'm working on a Java project where I need to match user queries against several engines.
Each engine has a method similarity(Object a, Object b) which returns: +1 if the objects surely match; -1 if the objects surely DON'T match; any float in-between when there's uncertainty.
Example: user searches "Dragon Ball".
Engine 1 returns "Dragon Ball", "Dragon Ball GT", "Dragon Ball Z", and it claims they are DIFFERENT results (similarity = -1), no matter how similar their names look. This engine is accurate, so it has a high "weight" value.
Engine 2 returns 100 different results. Some of them relate to DBZ, others to DBGT, etc. The engine claims they're all "quite similar" (similarity between 0.5 and 1).
The system queries several other engines (10+).
I'm looking for a way to build clusters out of this system. I need to ensure that values with similarity near -1 will likely end up in different clusters, even if many other values are very similar to all of them.
Is there a well-known clustering algorithm to solve this problem? Is there a Java implementation available? Can I build it on my own, perhaps with the help of a support library? I'm good at Java (15+ years experience) but I'm completely new at clustering.
Thank you!
The obvious approach would be to use "1 - similarity" as a distance function, which will thus go from 0 to 2, and then add up the per-engine distances.
Or you could use 1 + similarity and take the product of these values, ... or, or, or, ...
But since you apparently trust the first engine more, you may also want to increase its influence. There is no mathematical solution for this; you have to choose the weights depending on your data and preferences. If you have training data, you can optimize the weights for your approach, and you may even want to discard some rankers if they don't work well or are strongly correlated with others.
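To make the combination concrete, here is a minimal Java sketch of a weighted sum of per-engine distances. The Engine interface, the weight accessor, and the normalization are assumptions for illustration, not a fixed recipe:

import java.util.List;

// Hypothetical engine interface matching the description in the question.
interface Engine {
    double similarity(Object a, Object b); // in [-1, +1]
    double weight();                       // how much you trust this engine
}

final class CombinedDistance {
    // Weighted average of per-engine distances, where distance = 1 - similarity.
    // Result lies in [0, 2]; values near 2 mean the engines agree these don't match.
    static double distance(List<Engine> engines, Object a, Object b) {
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (Engine e : engines) {
            double d = 1.0 - e.similarity(a, b); // 0 = sure match, 2 = sure non-match
            weightedSum += e.weight() * d;
            totalWeight += e.weight();
        }
        return totalWeight == 0.0 ? 0.0 : weightedSum / totalWeight;
    }
}

A distance function like this can then be plugged into any distance-based clustering implementation.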
Hi,
I have a somewhat hypothetical question. As part of a Computational Intelligence course project, we've just written some code implementing a genetic algorithm to find a solution to a sudoku game. Unfortunately it runs very slowly, which limits our ability to perform an adequate number of runs to find the optimal parameters. The question is whether reprogramming the whole thing - the code base is not that big - in Java would be a viable way to boost the speed of the software. We really need something like a 10x performance improvement, and I am doubtful that a Java version would be that much snappier. Any thoughts?
Thanks
=== Update 1 ===
Here is the code of the function that is computationally most expensive. It's a GA fitness function that iterates through the population (different sudoku boards) and computes, for each row and column, how many elements are duplicates. The parameter n is passed in and is currently set to 9. That is, the function computes how many elements of a row, within the range 1 to 9, come up more than once. The higher the number, the lower the fitness of the board, meaning that it is a weak candidate for the next generation.
The profiler reports that the two lines calling intersect in the for loops are causing the poor performance, but we don't know how to really optimize the code. It follows below:
function [fitness, finished, d, threshold] = fitness(population_, n)
    finished = false;
    threshold = false;
    V = ones(n, 1);
    d = zeros(size(population_, 2), 1);
    s = 1:n;
    for z = 1:size(population_, 2)
        board = population_{z};
        t = 0;
        l = 0;
        for i = 1:n
            % values from 1:n missing in each column/row, i.e. the duplicate count
            l = l + n - length(intersect(s, board(:, i)'));
            t = t + n - length(intersect(s, board(i, :)));
        end
        k = sum(abs(board * V - t));
        f = t + l + k / 50;
        if t == 2 && l == 2
            threshold = true;
        end
        if f == 0
            finished = true;
        else
            fitness(z) = 1 / f;
            d(z) = f;
        end
    end
end
=== Update 2 ===
Found a solution here: http://www.mathworks.com/matlabcentral/answers/112771-how-to-optimize-the-following-function
Using histc(V, 1:9), it's much faster :)
This is rather impossible to say without viewing your code, knowing if you use parallelization, etc. Indeed, as MrAzzaman says, profiling is the first thing to do. If you find a single bottleneck, especially if it is loop-heavy, it might be sufficient to write that part in C and connect it to Matlab via MEX.
With genetic algorithms, I'd believe that a 10x speed increase could well be obtained. I do not quite agree with MrAzzaman here - in some cases (for loops, working with dynamic objects) Matlab is much, much slower than C/C++/Java. That is not to say that Matlab is always slow, for it is not, but there are plenty of algorithms where it would be slow.
That is, I'd say that if you don't spend too much time looping over things, don't use objects, and are not limited by Matlab's data structures, you might be OK with Matlab. That said, if I were to write GAs in Java or Matlab, I'd rather pick the former (and I'm using Matlab a lot more than Java these days, so it's not just a matter of habit).
Btw. if you don't want to program it yourself, have a look at JGAP, it's a rather useful Java library for GAs.
OK, the first step is just to write a faster MATLAB function. Save the new languages for later.
I'm going to make the assumption that the board is full of valid guesses: that is, each entry is in [1, 9]. Now, what we're really looking for are duplicate entries in each row/column. To find duplicates, we sort. On a sorted row, if any element is equal to its neighbor, we have a duplicate. In MATLAB, the diff function does sliding pairwise differencing, and a zero in its output means that two neighboring values are equal. Both sort and diff operate on entire matrices, so no need for looping. Here's the code for the columnwise check:
l=sum(sum(diff(sort(board)) == 0));
The rowwise check is exactly the same, just using the transpose. Now let's put that in a test harness to compare results and timing with the previous version:
n = 9;
% Generate a test board: random integers from 1:n
board = randi(n, n);
s = 1:n;
K=1000; % number of iterations to use for timing
% Repeat current code for comparison
tic
for k = 1:K
    t = 0;
    l = 0;
    for i = 1:n
        l = l + n - length(intersect(s, board(:, i)'));
        t = t + n - length(intersect(s, board(i, :)));
    end
end
toc
% New code based on sort/diff for finding repeated values
tic
for k = 1:K
    l2 = sum(sum(diff(sort(board)) == 0));
    t2 = sum(sum(diff(sort(board.')) == 0));
end
toc
% Check that reported values match
disp([l l2])
disp([t t2])
I encourage you to break down the sort/diff/sum code, and build it up on a sample board right at the command line, and try to understand exactly how it works.
On my system, the new code is about 330x faster.
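And if the code does eventually get ported to Java, the same duplicate-counting idea translates directly. Here is a rough, hypothetical Java sketch (using a frequency count per row/column rather than sort/diff); the names are illustrative only and do not come from the original project:

final class SudokuFitness {
    // For an n x n board with entries in 1..n, counts how many row and column
    // entries are repeats - the same quantity the Matlab l and t accumulate.
    static int duplicateCount(int[][] board, int n) {
        int duplicates = 0;
        for (int axis = 0; axis < 2; axis++) {          // 0 = rows, 1 = columns
            for (int i = 0; i < n; i++) {
                int[] counts = new int[n + 1];          // histogram of values 1..n
                for (int j = 0; j < n; j++) {
                    int value = (axis == 0) ? board[i][j] : board[j][i];
                    if (++counts[value] > 1) {
                        duplicates++;                   // every repeat beyond the first
                    }
                }
            }
        }
        return duplicates;
    }
}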
For traditional GA applications for study and research purposes, it is better to use a natively compiled language like C or C++, which is what I used when working with Genetic Programming in the past, and it is really fast.
However, if you are planning to put this inside a more modern type of application that can be deployed in a web container, run on a mobile device, or target different OSes, then Java is your best alternative, as it is platform independent.
Another thing that can be important is concurrency. For example, let us suppose that you want to put your GA on the Internet and you have a growing number of users connected concurrently, all of whom want to solve a different sudoku; Java applications are very good at scaling horizontally and work well with large numbers of concurrent connections.
Another thing that can be good if you migrate to Java is the number of libraries and frameworks you can use; the Java universe is so big that you can find useful tools for almost any kind of application.
Java compiles to bytecode for a virtual machine, but it is important to note that current JVMs perform very well and are able to optimize programs at runtime; for example, they will find which methods are used most heavily and compile them to native code, which means that for some applications a Java program will be almost as fast as a natively compiled C program.
Matlab is a platform that is very useful for engineering training and for math, vector, and matrix based calculations, and also for some control work with Simulink. I used these products during my electrical engineering bachelor's; however, their goal is mainly to be a tool for academic purposes, and I definitely would not go for Matlab if I wanted to build a production application for the real world. It is not scalable, it is expensive to maintain and fine-tune, and there are not a lot of infrastructure providers that will support this kind of technology.
About the complexity of rewriting your code in Java: Matlab and Java syntax is pretty similar, and they live in the same procedural/OO paradigm, so even if you are not using OO in your code it can be rewritten in Java fairly easily. The painful parts will be Matlab's shortcuts for math structures like matrices, and passing functions as parameters.
For the matrix stuff, there are lots of Java libraries, like EJML, that will make your life easier. As for assigning functions to variables and then passing them as parameters to other functions, Java cannot currently do that (Java 8 will, with lambda expressions), but you can get equivalent functionality by using class-based closures; see the sketch below. These will probably be the only slightly painful things you find when migrating.
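A minimal sketch of what that looks like in practice; the FitnessFunction interface and all names here are made up for illustration, not taken from your project:

// A tiny functional interface standing in for "a function you pass around".
interface FitnessFunction {
    double evaluate(int[][] board);
}

class Example {
    // The GA loop only sees the interface, not the concrete implementation.
    static double bestOf(int[][][] population, FitnessFunction f) {
        double best = Double.NEGATIVE_INFINITY;
        for (int[][] board : population) {
            best = Math.max(best, f.evaluate(board));
        }
        return best;
    }

    static void demo(int[][][] population) {
        // Pre-Java 8: an anonymous class acts as the "closure".
        FitnessFunction asAnonymousClass = new FitnessFunction() {
            @Override
            public double evaluate(int[][] board) {
                return -board.length; // placeholder fitness
            }
        };
        bestOf(population, asAnonymousClass);

        // Java 8 and later: the same thing as a lambda expression.
        FitnessFunction asLambda = board -> -board.length;
        bestOf(population, asLambda);
    }
}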
We are using Drools Planner 5.4.0.Final.
We want to profile our java application to understand if we can improve performance.
Is there a way to profile how much time a rule needs to be evaluated?
We use a lot of eval(...) and our "average calculate count per second" is nearly 37. Removing all eval(...), our "average calculate count per second" remains the same.
We already profiled the application and saw that most of the time is spent in doMove ... afterVariableChanged(...).
So we suspect some of our rules are inefficient, but we don't understand where the problem is.
Thanks!
A decent average calculate count per second is higher than 1000 (at least), a good one higher than 5000. Follow these steps in order:
1) First, I strongly recommend upgrading to 6.0.0.CR5. Just follow the upgrade recipe, which will guide you step by step in a few hours. That alone will double your average calculate count (and potentially far more), due to several improvements (selectors, constraint match system, ...).
2) Open up the black box by enabling logging: first DEBUG, then TRACE (see the logging sketch after this list). The logs can show whether the moves are slow (= rules are slow) or the step initialization is slow (= you need JIT selection).
3) Use the stepLimit benchmark technique to find out which rule(s) are slow.
4) Use the benchmarker (if you aren't already) and play with JIT selection, late acceptance, etc. See those topics in the docs.
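For step 2, here is a minimal sketch of turning up the Planner logging programmatically, assuming an SLF4J + Logback setup (the logger name is "org.optaplanner" in 6.x and "org.drools.planner" in 5.x; adjust to whichever you are on). The same levels can of course be configured in logback.xml instead:

import org.slf4j.LoggerFactory;
import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;

class PlannerLogging {
    static void enableVerboseLogging() {
        // The cast is only safe when Logback is the SLF4J binding on the classpath.
        Logger plannerLogger = (Logger) LoggerFactory.getLogger("org.optaplanner");
        plannerLogger.setLevel(Level.DEBUG); // start with DEBUG, switch to TRACE if needed
    }
}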
I implemented textrank in java but it seems pretty slow. Does anyone know about its expected performance?
If it's not expected to be slow, could any of the following be the problem:
1) It didn't seem like there was a way in JGraphT to create an edge and add a weight to it at the same time, so I calculate the weight and, if it's > 0, I add the edge. I later recalculate the weights and set them while looping through the edges (see the sketch after this list). Is that a terrible idea?
2) I'm using JGraphT. Is that a slow library?
3) Anything else I could do to make it faster?
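For reference on point 1, here is a minimal JGraphT sketch of adding an edge and setting its weight in one pass: addEdge returns the new edge object, so the weight can be set immediately rather than in a second loop. The vertex names and the weight value are placeholders, not from the actual project:

import org.jgrapht.graph.DefaultWeightedEdge;
import org.jgrapht.graph.SimpleWeightedGraph;

class WeightedEdgeExample {
    static void demo() {
        SimpleWeightedGraph<String, DefaultWeightedEdge> graph =
                new SimpleWeightedGraph<>(DefaultWeightedEdge.class);
        graph.addVertex("sentence-1");
        graph.addVertex("sentence-2");

        double weight = 0.42; // whatever your similarity computation returns
        if (weight > 0) {
            DefaultWeightedEdge edge = graph.addEdge("sentence-1", "sentence-2");
            graph.setEdgeWeight(edge, weight); // weight set immediately, no second pass
        }
    }
}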
It depends what you mean by "pretty slow". A bit of googling found this paragraph:
"We calculated the total time for RAKE and TextRank (as an average over 100iterations) to extract keywords from the Inspec testing set of 500 abstracts, afterthe abstracts were read from files and loaded in memory. RAKE extracted key-words from the 500 abstracts in 160 milliseconds. TextRank extracted keywordsin 1002 milliseconds, over 6 times the time of RAKE."
(See http://www.scribd.com/doc/51398390/11/Evaluating-ef%EF%AC%81ciency for the context.)
So from this, I infer that a decent TextRank implementation should be capable of extracting keywords from ~500 abstracts in about one second.
I'm trying to do template matching in Java. I used a straightforward algorithm to find the match. Here is the code:
minSAD = VALUE_MAX;

// loop through the search image
for ( int x = 0; x <= S_rows - T_rows; x++ ) {
    for ( int y = 0; y <= S_cols - T_cols; y++ ) {
        SAD = 0.0;

        // loop through the template image
        for ( int i = 0; i < T_rows; i++ ) {
            for ( int j = 0; j < T_cols; j++ ) {
                pixel p_SearchIMG   = S[x+i][y+j];
                pixel p_TemplateIMG = T[i][j];
                SAD += Math.abs( p_SearchIMG.Grey - p_TemplateIMG.Grey );
            }
        }

        // save the best found position (checked inside the y loop)
        if ( minSAD > SAD ) {
            minSAD = SAD;
            position.bestRow = x;
            position.bestCol = y;
            position.bestSAD = SAD;
        }
    }
}
But this is a very slow approach. I tested it with a 768 × 1280 image and a 384 × 640 subimage, and it takes ages.
Does OpenCV perform template matching much faster with its ready-made function cvMatchTemplate()?
You will find OpenCV's cvMatchTemplate() is much, much quicker than the method you have implemented. What you have created is a statistical template matching method. It is the most common and the easiest to implement, but it is extremely slow on large images. Let's look at the basic maths: your image is 768 x 1280, and you loop over every valid placement of the template, which (subtracting the edges where the template no longer fits) gives (768 - 384) x (1280 - 640) = 384 x 640 = 245,760 positions. At each of those positions you loop through every pixel of your template (another 384 x 640 = 245,760 operations). So before you add any maths inside the loop, you already have 245,760 x 245,760 = 60,397,977,600 operations. That is over 60 billion operations just to loop through your image; it is actually surprising how quickly machines can do this.
Remember, however, that it is really 245,760 x (245,760 x the arithmetic inside the inner loop), so there are many more operations than that.
Now, cvMatchTemplate() actually uses a Fourier-analysis-based template matching operation. This works by applying a Fast Fourier Transform (FFT) to the image, in which the signals that make up the pixel intensity changes are decomposed into their corresponding waveforms. The method is hard to explain well, but essentially the image is transformed into a signal representation made of complex numbers. If you wish to understand more, please search on Google for the fast Fourier transform. The same operation is then performed on the template, and the signals that form the template are used to filter out all other signals from your image.
In simple terms, it suppresses all features within the image that do not have the same characteristics as your template. The image is then converted back using an inverse fast Fourier transform to produce an image where high values mean a match and low values mean the opposite. This image is often normalised so that 1 represents a match and values near 0 mean the object is nowhere near.
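Roughly speaking, this is the correlation theorem at work; the "filtering" step above can be written as (a general statement of the trick, not OpenCV's exact internal formula):

    match = IFFT( FFT(image) . conj( FFT(template) ) )

so the per-position sliding loop is replaced by a few transforms and one element-wise multiplication.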
Be warned, though: if the object is not in the image and the result is normalised, false detections will occur, as the highest value calculated will still be treated as a match. I could go on for ages about how the method works and its benefits, or the problems that can occur, but...
The reasons this method is so fast are: 1) OpenCV is highly optimised C++ code. 2) The FFT is easy for your processor to handle, as most have the ability to accelerate this kind of operation in hardware; GPUs, for example, are designed to perform millions of FFT-style operations every second, because these calculations are just as important in high-performance gaming graphics and video encoding. 3) The number of operations required is far less.
In summary, the statistical template matching method is slow and takes ages, whereas OpenCV's FFT-based cvMatchTemplate() is quick and highly optimised.
Statistical template matching will not produce false detections if the object is not there, whereas the OpenCV FFT approach can, unless care is taken in its application.
I hope this gives you a basic understanding and answers your question.
Cheers
Chris
[EDIT]
To further answer your questions:
Hi,
cvMatchTemplate() can work with CCOEFF_NORMED, CCORR_NORMED and SQDIFF_NORMED, as well as the non-normalised versions of these. The page below shows the kind of results you can expect and gives you code to play with (a sketch of the call follows the link).
http://dasl.mem.drexel.edu/~noahKuntz/openCVTut6.html#Step%202
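Since your question is about Java, here is a minimal sketch using the OpenCV Java bindings (org.opencv); the class and constant names are those bindings, the Mats are assumed to be already loaded 8-bit grayscale images, and the native library must have been loaded beforehand:

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.Point;
import org.opencv.imgproc.Imgproc;

class TemplateMatchExample {
    // Returns the top-left corner of the best match of 'template' inside 'image'.
    static Point bestMatch(Mat image, Mat template) {
        Mat result = new Mat();
        // TM_CCOEFF_NORMED: normalised correlation coefficient, 1.0 = perfect match
        Imgproc.matchTemplate(image, template, result, Imgproc.TM_CCOEFF_NORMED);
        Core.MinMaxLocResult mmr = Core.minMaxLoc(result);
        return mmr.maxLoc; // for the SQDIFF variants you would use mmr.minLoc instead
    }
}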
The three methods are well cited and many papers are available through Google Scholar; I have listed a few below. Each one simply uses a different equation to find the correlation between the FFT signals that form the template and the FFT signals present within the image. The correlation coefficient tends to yield better results in my experience and is easier to find references for; the sum of squared differences is another method that can be used, with comparable results. I hope some of these help:
Du-Ming Tsai; Chien-Ta Lin. "Fast normalized cross correlation for defect detection." Pattern Recognition Letters, vol. 24, issue 15, November 2003, pp. 2625-2631.
Kai Briechle; Uwe D. Hanebeck. "Template Matching using Fast Normalised Cross Correlation."
B.H. Friemel; L.N. Bohs; G.E. Trahey. "Relative performance of two-dimensional speckle-tracking techniques: normalized correlation, non-normalized correlation and sum-absolute-difference." Proceedings of the 1995 IEEE Ultrasonics Symposium.
Daniel I. Barnea; Harvey F. Silverman. "A Class of Algorithms for Fast Digital Image Registration." IEEE Transactions on Computers, Feb. 1972.
It is often preferred to use the normalised version of these methods, as anything that equals 1 is a match; however, if no object is present you can get false positives. The method works fast simply due to the way it is implemented on the machine: the operations involved are ideal for the processor architecture, which means each one can be completed in a few clock cycles rather than shifting memory and data around over many clock cycles. Processors have been solving FFT problems for many years now and, as I said, there is built-in hardware to do so. Hardware-based is always faster than software, and the statistical method of template matching is basically software-based. Good reading on the hardware can be found here:
Digital signal processor
Although it is a Wiki page, the references are worth a look, and effectively this is the hardware that performs FFT calculations.
Shousheng He; Mats Torkelson. "A New Approach to Pipeline FFT Processor."
A favourite of mine, as it shows what's happening inside the processor.
Liang Yang; Kewei Zhang; Hongxia Liu; Jin Huang; Shitan Huang. "An Efficient Locally Pipelined FFT Processor."
These papers really show how complex the FFT is when implemented; however, the pipelining of the process is what allows the operation to be performed in a few clock cycles. This is the reason real-time vision-based systems use FPGAs (processors whose architecture you can design to implement a specific task): they can be designed to be extremely parallel, and pipelining is easier to implement.
Although I must mention that, for the FFT of an image, you are actually using FFT2, which is the FFT in the horizontal direction combined with the FFT in the vertical direction, just so there is no confusion when you find references to it. I cannot say I have expert knowledge of how the equations and the FFT are implemented; I have tried to find good guides, yet finding one is very hard, so much so that I haven't yet found one (not one I can understand, at least). One day I may understand them, but for now I have a good understanding of how they work and the kind of results that can be expected.
Other than this, I can't really help you much more. If you want to implement your own version or understand how it works, it's time to hit the library, but I warn you that the OpenCV code is so well optimised you will struggle to improve on its performance. However, who knows, you may figure out a way to get better results. All the best, and good luck.
Chris