I'm working on a sketch search engine that correlates whatever someone's sketching with a picture in the database (the db is just about 40 pictures now). I'm doing this mostly for fun so I'm not that well-versed in computer imaging techniques.
First of all, are there any rules of thumb on how one should create histograms (bin sizes, ranges, etc)? I'm using some histogram code found at http://www.scribd.com/doc/6194304/Histograms (but ported to JavaCV). Sometimes I get good results, sometimes I get bad results, most of the time I get "meh" results. I've been experimenting a TON with bin sizes and ranges and I'm wondering if comparing higher dimensional histograms may be the answer here.
Second of all, it seems that black carries a very strong weight in my current histogram setup (even a single black dot shifts the entire result set). Should this be expected? Or did I screw something up? Example:
And after the dot:
Note how I'm already getting pictures of "earthrise" as "close" matches.
I'm also wondering what methods I should use for blob or feature analysis. I think that stuff like SURF may be overkill because I only want to broadly compare blobs, not accurately map templates. Is there any way I can compare the edges after they've been passed through a Canny filter? (Low complexity if possible):
For example, here, I want the two smiley faces to be at the top because the needle smiley "blob" is more closely related to the smiley face shape than to a bunch of passion fruit or a galaxy.
Phew, long question. If you want to try out the engine for yourself, go to http://skrch.dvt.name/ (shameless plug, I know, I know -- only works in FF/Chrome/Safari). Maybe more experienced computer vision people can make suggestions based on the results. Oh, I'm using the CV_COMP_BHATTACHARYYA distance when comparing histograms (it seemed to give the best results, although chi-square isn't bad either).
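For reference, this is roughly the kind of build-and-compare step I mean, sketched with the plain OpenCV Java bindings (my actual code goes through JavaCV, whose call names differ slightly); the hue/saturation bin counts and ranges here are just placeholders I keep tuning, not recommendations:

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.MatOfFloat;
import org.opencv.core.MatOfInt;
import org.opencv.imgproc.Imgproc;
import java.util.Arrays;

public class HistogramCompare {
    // Build a 2D hue/saturation histogram in HSV space; the bin counts are guesses to tune.
    static Mat hsHistogram(Mat bgr, int hueBins, int satBins) {
        Mat hsv = new Mat();
        Imgproc.cvtColor(bgr, hsv, Imgproc.COLOR_BGR2HSV);
        Mat hist = new Mat();
        Imgproc.calcHist(Arrays.asList(hsv),
                new MatOfInt(0, 1),                  // channels: hue and saturation
                new Mat(),                           // no mask
                hist,
                new MatOfInt(hueBins, satBins),      // bins per channel
                new MatOfFloat(0, 180, 0, 256));     // hue range, saturation range
        Core.normalize(hist, hist, 0, 1, Core.NORM_MINMAX);
        return hist;
    }

    // Bhattacharyya distance: 0 means identical, values near 1 mean very different.
    static double distance(Mat a, Mat b) {
        return Imgproc.compareHist(a, b, Imgproc.CV_COMP_BHATTACHARYYA);
    }
}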
Is there a background?
Is it significant?
Maybe you need to look at whether there is a user-supplied background or not.
Then you "just" need two histograms per db entry: one with the background, one without.
That'll stop earthrise looking like an apple with a dot.
For basic background separation, try a Canny pass, then take the "outside" region and remove it from a copy of the original.
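A rough sketch of that last suggestion, written against the plain OpenCV Java bindings (not the JavaCV wrappers the question uses): run Canny, keep the largest contour as the object, and blank out everything outside it. The thresholds are arbitrary placeholders.

import org.opencv.core.CvType;
import org.opencv.core.Mat;
import org.opencv.core.MatOfPoint;
import org.opencv.core.Scalar;
import org.opencv.imgproc.Imgproc;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BackgroundStrip {
    // Keep only what lies inside the biggest Canny contour; the "outside" becomes black.
    static Mat stripBackground(Mat gray) {
        Mat edges = new Mat();
        Imgproc.Canny(gray, edges, 50, 150);              // thresholds need tuning

        List<MatOfPoint> contours = new ArrayList<>();
        Imgproc.findContours(edges, contours, new Mat(),
                Imgproc.RETR_EXTERNAL, Imgproc.CHAIN_APPROX_SIMPLE);

        // Pick the contour with the largest area as the foreground object.
        MatOfPoint biggest = null;
        double bestArea = 0;
        for (MatOfPoint c : contours) {
            double area = Imgproc.contourArea(c);
            if (area > bestArea) { bestArea = area; biggest = c; }
        }

        Mat mask = Mat.zeros(gray.size(), CvType.CV_8UC1);
        if (biggest != null) {
            Imgproc.drawContours(mask, Arrays.asList(biggest), -1, new Scalar(255), -1); // filled
        }
        Mat foreground = new Mat();
        gray.copyTo(foreground, mask);                    // copies only the masked region
        return foreground;
    }
}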
I wrote an object tracker that tries to detect and follow a moving object in a recorded video. In order to maximize the detection rate, my algorithm uses a bunch of detection & tracking algorithms (cascade, foreground & particle tracker). Each tracking algorithm returns some points of interest that might be part of the object I'm trying to track. Let's assume (for the simplicity of this example) that my object is a rectangle and that the three tracking algorithms returned the points 1, 2 and 3:
Based on the relation / distance of these three points it is possible to calculate the center of gravity (blue X in the image above) of the tracked object. So for each frame I might be able to come up with a good estimate of the center of gravity. However, the object might move from one frame to the next:
In this example I merely rotated the original object. My algorithm will give me three new points of interest: 1', 2' and 3'. I could again calculate the center of gravity based on these three new points, but then I would throw away important information acquired from the previous frame: based on points 1, 2 and 3 I already know something about the relationship of these points, so by combining the information from 1, 2, 3 and 1', 2', 3' I should be able to come up with a better estimate of the center of gravity.
Furthermore, the next frame might yield a fourth data point:
This is what I would like to do (but I don't know how):
Based on the individual points (and their relationships to each other) that are returned from the different tracking algorithms, I want to build up a localization map of the tracked object. Intuitively I feel like I need to come up with A) an identification function that will identify individual points across frames and B) some cost function that will determine how similar tracked points (and the relationships / distances between them) are from frame to frame, but I can't get my head around how to implement this. Alternatively, maybe some kind of map built up from the points would work. But again, I don't know how to approach this.
Any advice (and example code) is highly appreciated!
EDIT1
A simple particle filter would probably work too, but again I don't know how to define the cost function. A particle filter for tracking a certain color is easy to program: for each pixel you calculate the difference between the target color and the pixel color. But how would I do the same for estimating the relationship between tracked points?
EDIT2: Intuitively I feel like Kalman filters could also help with the prediction step. See slides 24-32 of this PDF. Or am I misled?
What I think you're trying to do is essentially build up a state space of features, which can be applied to a filtering process, such as an Extended Kalman Filter. This is a useful framework when you have multiple observations in every frame, and you're trying to estimate or measure something indicated by these observations.
To determine the similarity of the tracked points, you can perform simple template matching from frame to frame for small regions around the points. One way of doing this is to extract an NxN (say, 7x7) region around point a in frame n and point a' in frame n+1, followed by normalised cross correlation between the extracted regions. This will give you a reasonable measure of how similar the patches are. If the patches are not similar, then you've probably lost track of that point.
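A minimal sketch of that patch comparison with the OpenCV Java bindings; n would be the 7 from above, and bounds checking is omitted, so treat it as an illustration rather than drop-in code:

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.Point;
import org.opencv.core.Rect;
import org.opencv.imgproc.Imgproc;

public class PatchSimilarity {
    // Compare an n x n patch around p in frame k against the patch around pPrime in frame k+1
    // using normalised cross-correlation; values near 1.0 mean the patches match well.
    static double patchNcc(Mat prevFrame, Point p, Mat currFrame, Point pPrime, int n) {
        Rect rPrev = new Rect((int) p.x - n / 2, (int) p.y - n / 2, n, n);
        Rect rCurr = new Rect((int) pPrime.x - n / 2, (int) pPrime.y - n / 2, n, n);
        Mat patchPrev = new Mat(prevFrame, rPrev);   // assumes both patches lie inside the frame
        Mat patchCurr = new Mat(currFrame, rCurr);

        Mat result = new Mat();                      // 1x1 result since both patches are n x n
        Imgproc.matchTemplate(patchCurr, patchPrev, result, Imgproc.TM_CCORR_NORMED);
        return Core.minMaxLoc(result).maxVal;
    }
}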
There is an enormous literature on this and related problems starting in the 1980s. Try searching for "optical flow" algorithms. The input to such an algorithm is two successive frames of the same scene. The output is a vector field, one vector per pixel in the second image, giving the direction and speed of movement of the feature at that pixel. This presentation is a pretty nice summary.
A nice thing about optical flow is that many algorithms for it parallelize nicely and map onto your favorite video card GPU, so they can run in real time. Think ESPN overlays.
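To get a feel for what such an algorithm produces, here is a minimal sketch of dense optical flow using OpenCV's Farnebäck implementation from Java; the parameter values are the stock ones from the documentation, not tuned for any particular footage:

import org.opencv.core.Mat;
import org.opencv.video.Video;

public class DenseFlow {
    // Dense optical flow between two consecutive greyscale frames; the result is a
    // two-channel float Mat holding one (dx, dy) vector per pixel of the second frame.
    static Mat farneback(Mat prevGray, Mat nextGray) {
        Mat flow = new Mat();
        Video.calcOpticalFlowFarneback(prevGray, nextGray, flow,
                0.5,   // pyramid scale
                3,     // pyramid levels
                15,    // averaging window size
                3,     // iterations per pyramid level
                5,     // pixel neighbourhood for polynomial expansion
                1.2,   // Gaussian sigma for that expansion
                0);    // flags
        return flow;
    }
}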
In my opinion, in order to identify who is who in each frame, you will have to use more than the points themselves. For example, if you want to know which point went where between two frames (assuming your extracted points are the same), you will have to build vectors or simplexes and then deduce an organisation among your points (such as angle values).
The main problem is that the number of combinations grows with the number of points. If your camera is fixed, you could use the background as a reference to deduce object rotations and translations; I mean, build vectors between background interest points and object points in order to identify them clearly.
Hope that helps you move forward.
I would recommend looking into the divided difference filter (DDF), which is similar to the extended Kalman filter (EKF) but does not require analytic derivatives (Jacobians) of your system model, which you may not have. Basically, the DDF approximates the derivatives used in the EKF with a difference equation. There are plenty of papers online about this, but I do not know whether you have access to them, so I have not linked them here. If you are working from a university or a company that has access to online journals (like IEEE Xplore), just Google "divided difference filter" and check out some of the papers.
EDIT3: MASSIVE CLEANUP, as this was not clearly explained.
Explanation:
I'm trying to build a 2D level out of tiles and entities, where the entities are, for example, trees that can be cut. I need to store data for each entity (how many chops are left, for example). I want entities to have a more dynamic position (doubles) and a more dynamic sprite width and height. My tiles are 32x32 pixels, whereas a tree will not be a single tile but a sprite that is taller than it is wide.
I want objects that are closer to the top of the level to be drawn before the other objects. In this case a character standing behind the tree cannot be rendered inside or in front of the tree. The same applies to other objects of the same kind (like trees).
I think it might be too inefficient to loop through the entities and calculate each entity's position, since there may be a LOT of entities in the level.
As I've done some research I found that certain libraries allow storing both the object and its position in a map (BiMap in Google's Guava).
Questions:
Is this an inefficient approach, and if so, are there changes that can be applied to make the rendering more efficient (what could be optimized)?
Or is this an inefficient way to render the entities altogether, and is there a better way (if so, what other methods are there in Java)?
Or is there something else that I haven't listed?
EDIT2: I looked through the link I've posted in the edit below.
It seems that Google's Guava (I think that's the correct name) has BiMaps. Is there an equivalent in regular Java? Otherwise Google's library will probably solve this for me, but I'd rather not pull in such a huge library for this one interface.
Lastly:
It's very much possible that the answer has been right in front of my nose here on StackOverflow or somewhere else on the internet. I've tried my best searching but found nothing.
If you've got any suggestions for search queries or any relevant links that might be of use to me, I would appreciate it if you'd post them in the comments.
Thanks for taking the time to read through this/helping me ;)
EDIT:
I have looked at: Efficient mapping of game entity positions in Java.
I think it's closely related to this question, but it's just not what I'm looking for. I am going to look through the second answer very closely, since it might solve this for me, but I'm not sure.
SOLUTION
The solution is to keep your entities in an array, ArrayList, or some other structure. Every tick/update, take each entity's Y coordinate and store it in a second array/ArrayList/map of the same size, at the position corresponding to that entity. Then sort by Y with another loop, or with one of the approaches at http://www.leepoint.net/notes-java/data/arrays/70sorting.html (a simpler Comparator-based variant is sketched after the code below).
Then when rendering:
for (int i = 0; i < entityArray.length; i++) {
    entityArray[i].render();   // entities are already sorted by Y at this point
}
Of course you'll render more efficiently by rendering only what's on or near your screen.
But that's basically how one does this in 2D top-view/front-view.
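A slightly cleaner variant of the same idea, sorting the entity list itself with a Comparator instead of keeping a parallel Y array; the Entity interface and its getY()/render() methods are placeholders for whatever your entities actually look like:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class EntityRenderer {
    // Hypothetical entity type: anything that knows its Y position and can draw itself.
    interface Entity {
        double getY();
        void render();
    }

    private final List<Entity> entities = new ArrayList<>();

    // Called once per frame: draw from the top of the level downwards, so entities
    // lower on the screen (closer to the viewer) are painted over the ones behind them.
    void renderAll() {
        entities.sort(Comparator.comparingDouble(Entity::getY));
        for (Entity e : entities) {
            e.render();   // in practice, skip entities outside the camera view here
        }
    }
}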
In my own 2D game attempts I came up with the following solution:
use an enum to specify the different types of objects in the game and give them priorities (sample order: grass, rivers, trees, critters, characters, clouds, birds, GUI)
make all visual objects implement an interface that exposes this DrawPriority enum
use a sorted list implementation with a comparator based on the enum
use the list to draw all elements
That way computing the order is not very expensive, because it is done only on visual-object insertion (which in my case happens while loading a level).
And since you will already be using a comparator, do an x/y comparison when the enum priority values are the same. This should solve your Y-order draw problem (a rough sketch follows).
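A rough sketch of that setup; the enum values, the Visual interface and its methods are illustrative names rather than anything from a library, and if your objects move you would re-sort before drawing instead of only on insertion:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class DrawOrder {
    // Sample priority order, lowest drawn first.
    enum DrawPriority { GRASS, RIVER, TREE, CRITTER, CHARACTER, CLOUD, BIRD, GUI }

    interface Visual {
        DrawPriority getDrawPriority();
        double getY();
        void draw();
    }

    // Priority first; within the same priority, objects higher up the level (smaller Y) come first.
    private static final Comparator<Visual> ORDER =
            Comparator.comparing(Visual::getDrawPriority)
                      .thenComparingDouble(Visual::getY);

    private final List<Visual> visuals = new ArrayList<>();

    void add(Visual v) {          // the sorting cost is paid on insertion (e.g. at level load)
        visuals.add(v);
        visuals.sort(ORDER);
    }

    void drawAll() {
        for (Visual v : visuals) v.draw();
    }
}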
How can I take two images and compare them to see how similar they are?
I'm not talking about comparing two exact images using MD5. The two images that I am comparing will be completely different, and will likely be different sizes at times.
Using Pokemon cards as an example:
I'm going to have scanned HD images of each of the cards. I want the user to be able to take a picture of their Pokemon card with their phone and I want to be able to compare it against my scanned images and then determine which card it is that they took a picture of.
The processing does not have to be done directly on the phone; offloading to a web service is an option. However, note that my knowledge of programming languages is somewhat limited (pretty much to PHP/Java/Android). The server I'm using is my own Ubuntu server, so I do have access to the exec command from PHP if that would help.
At first I figured someone would have done something like this before (comparing two images). I tried using PHP with Imagick, following an example I found that claimed to do what I was trying (utilizing compareImages()), but it didn't work at all. There doesn't seem to be much (if any) documentation on doing something like this, which is why I'm so stuck. All I'm looking for is a push in the right direction.
My second thought was to try using OCR to pull just the title of the card; I would then compare that against a database of titles and display the images tied to that title. So far I've tried phpocr first, which didn't work at all, as it requires monochrome images to my understanding. Next I tried Tesseract directly from the console on my server, and while it did WAY better than phpocr, more than 80% of the characters were still wrong on a scanned image, so a lower-quality image coming from a smartphone would really struggle.
I also tried OpenCV for Android but couldn't get any of the samples working.
Has anyone done anything like this, or at least used something that can accomplish what I'm looking for?
There are two distinct tasks: identifying the area of interest (which can be done with Haar cascades, the same technique used for face detection) and recognising the identified image, which can be done with invariant moment techniques (like Hu moments; they were good enough to count Soviet tanks on satellite images, so they shall be good enough for Pokemon). A nice property of invariant moments is the soft degradation of results on low-quality input: you get a list of probabilities per symbol, e.g. 80% Pikachu and 30% something else.
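For the recognition half, a minimal sketch with the OpenCV Java bindings; Imgproc.matchShapes compares two contours via Hu moments internally, and the constant name varies between OpenCV versions (CV_CONTOURS_MATCH_I1 in 2.4, CONTOURS_MATCH_I1 in 3.x):

import org.opencv.core.Mat;
import org.opencv.imgproc.Imgproc;
import org.opencv.imgproc.Moments;

public class ShapeMatch {
    // The 7 Hu moments of a binary silhouette: invariant to translation, scale and
    // rotation, so they degrade gracefully on low-quality input.
    static Mat huMoments(Mat binarySilhouette) {
        Moments m = Imgproc.moments(binarySilhouette, true);
        Mat hu = new Mat();
        Imgproc.HuMoments(m, hu);
        return hu;
    }

    // Convenience comparison of two contours; a smaller value means more similar shapes.
    static double similarity(Mat contourA, Mat contourB) {
        return Imgproc.matchShapes(contourA, contourB, Imgproc.CV_CONTOURS_MATCH_I1, 0);
    }
}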
We are developing an OCR library based on invariant moments for use on Android here:
https://sourceforge.net/projects/javaocr/
(Pure Java, reasonable speed, and there are Android samples in the demos subdirectory. And here is an app based on javaocr that will recognize a black-on-white phone number and dial it: https://play.google.com/store/apps/details?id=de.pribluda.android.ocrcall&feature=search_result#?t=W251bGwsMSwyLDEsImRlLnByaWJsdWRhLmFuZHJvaWQub2NyY2FsbCJd)
You may also consider some aiming help, so the user positions the symbol to be matched properly (so the first task uses real human intelligence).
You should decide what kind of similarity comparison you need. There are geometric algorithms. They use edge detection and then try to match detected edges in both images. They are probably useful when dealing with different colours of objects with the same shape. And there are algorithms that are more based on colour similarity. They compare what colours are in the image and how they are distributed.
If you are looking for a concrete algorithm, you probably should have a look at the Hough Transform.
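For instance, the classical line-detecting variant looks roughly like this with the OpenCV Java bindings; the Canny thresholds and the vote threshold are placeholders:

import org.opencv.core.Mat;
import org.opencv.imgproc.Imgproc;

public class HoughSketch {
    // Edge-detect, then let edge pixels vote for straight lines;
    // each row of 'lines' holds a (rho, theta) pair describing one detected line.
    static Mat detectLines(Mat gray) {
        Mat edges = new Mat();
        Imgproc.Canny(gray, edges, 50, 150);                      // thresholds are placeholders
        Mat lines = new Mat();
        Imgproc.HoughLines(edges, lines, 1, Math.PI / 180, 100);  // rho step, theta step, votes
        return lines;
    }
}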
I'm trying to find out what effects the Photoshop "Poster edges" filter is composed of. It seems it's a combination of edge detection and posterization, but I haven't been able to duplicate it, not even close, with these so I guess I'm missing something. The image below shows the same image before and after the Poster edges filter:
I've tried performing posterization (and quantization) on the image, along with edge detection using Sobel, but apparently Photoshop is doing something different, as the results don't match: my posterization looks very different, and my edges are very weak compared to the Photoshop filter.
So does anybody know how the Poster Edges filter is implemented, or have any idea what image processing should be done to achieve the latter image from the former?
Not that it really matters, but I'm using Java, and my image filtering code is based for the most part on the filters found here: http://www.jhlabs.com/ip/filters/index.html
Edit: Description of the filter from adobe.com:
Poster Edges Reduces the number of colors in an image (posterizes it) according to the posterization option you set, and finds the edges of the image and draws black lines on them. Large broad areas have simple shading, and fine dark detail is distributed throughout the image.
Regarding the edges: I would assume that Photoshop uses something more sophisticated than a simple derivative filter (like Sobel) for edge detection. There are edge detection algorithms that try to find only "salient" edges, i.e. those that are relevant to human vision, the edges a human artist would draw when doing a line sketch. An old and (rather) simple algorithm that goes in this direction (at least a bit) is the Canny edge detector. You should be able to find an implementation of it. Google "salient edges" for current research literature, but don't expect implementations or nice pseudocode in research papers.
Regarding posterization: Given their talks at SIGGRAPH, the Adobe people are very much into bilateral filtering (please Google it, I can't add any more links), a smoothing technique that preserves important edges. I think if you apply the bilateral filter and posterize afterwards, you should come closer to the look you want (a rough sketch follows). Unfortunately, efficiently implementing the bilateral filter is not trivial.
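A sketch of that bilateral-then-posterize order of operations with the OpenCV Java bindings (the question's code is JH Labs-based, so this only illustrates the pipeline; the filter parameters and level count are placeholders, and this quantisation rounds rather than floors):

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.Scalar;
import org.opencv.imgproc.Imgproc;

public class PosterSmooth {
    // Edge-preserving smoothing followed by a crude per-channel colour quantisation.
    static Mat bilateralPosterize(Mat bgr, int levels) {
        Mat smooth = new Mat();
        Imgproc.bilateralFilter(bgr, smooth, 9, 75, 75);   // d, sigmaColor, sigmaSpace

        double step = 256.0 / levels;
        Mat posterized = new Mat();
        Core.divide(smooth, new Scalar(step, step, step), posterized);    // bucket each channel
        Core.multiply(posterized, new Scalar(step, step, step), posterized);
        return posterized;
    }
}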
Update for anyone still interested in this topic
The Bilateral filter, which I suggested above, is increasingly being replaced by the Guided filter, at least in the computer vision community (the graphics people don't seem to have picked up on the Guided filter yet). The Guided filter achieves similar results but is much easier to implement efficiently: the exact Guided filter algorithm is already fast, while fast Bilateral filtering requires approximations or insanely complicated algorithms.
I suspect you have to do this at several scales in order to filter the edge response (a rough sketch follows the list):
Run your edge detection at several levels (scales) of a Gaussian-smoothed pyramid of the input image, [sigma_min, sigma_max].
Then either sum the edge magnitudes across scales or take the maximum.
Combine with the posterized original image (blend?).
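Roughly, the multi-scale step could look like this in the OpenCV Java bindings; the set of sigmas and the choice of taking the per-pixel maximum (rather than the sum) are assumptions:

import org.opencv.core.Core;
import org.opencv.core.CvType;
import org.opencv.core.Mat;
import org.opencv.core.Size;
import org.opencv.imgproc.Imgproc;

public class MultiScaleEdges {
    // Edge magnitude taken as the per-pixel maximum over several Gaussian scales.
    static Mat edgeResponse(Mat gray, double[] sigmas) {
        Mat best = Mat.zeros(gray.size(), CvType.CV_32F);
        for (double sigma : sigmas) {
            Mat blurred = new Mat();
            Imgproc.GaussianBlur(gray, blurred, new Size(0, 0), sigma);  // kernel size derived from sigma

            Mat dx = new Mat(), dy = new Mat();
            Imgproc.Sobel(blurred, dx, CvType.CV_32F, 1, 0);
            Imgproc.Sobel(blurred, dy, CvType.CV_32F, 0, 1);

            Mat magnitude = new Mat();
            Core.magnitude(dx, dy, magnitude);
            Core.max(best, magnitude, best);               // keep the strongest response per pixel
        }
        return best;
    }
}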
Copy the original image and then apply a PosterizeFilter. Then apply EdgeFilter, GrayscaleFilter and InvertFilter to the copy. Finally multiply the posterized original with the copy. At that point you should have something close to Poster edges.
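A rough sketch of that recipe against the JH Labs filters the question already links to (class names written from memory, so double-check them against the library; the posterization level is left at its default):

import com.jhlabs.image.EdgeFilter;
import com.jhlabs.image.GrayscaleFilter;
import com.jhlabs.image.InvertFilter;
import com.jhlabs.image.PosterizeFilter;
import java.awt.image.BufferedImage;

public class PosterEdges {
    static BufferedImage posterEdges(BufferedImage src) {
        // 1. Posterize a copy of the original.
        BufferedImage poster = new PosterizeFilter().filter(src, null);

        // 2. Edge-detect another copy, convert it to grey and invert it,
        //    so the edges become dark lines on a light background.
        BufferedImage edges = new EdgeFilter().filter(src, null);
        edges = new GrayscaleFilter().filter(edges, null);
        edges = new InvertFilter().filter(edges, null);

        // 3. Multiply the two, which darkens the posterized image along the edges.
        BufferedImage out = new BufferedImage(src.getWidth(), src.getHeight(),
                BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int p = poster.getRGB(x, y);
                int e = edges.getRGB(x, y);
                int r = ((p >> 16 & 0xFF) * (e >> 16 & 0xFF)) / 255;
                int g = ((p >> 8 & 0xFF) * (e >> 8 & 0xFF)) / 255;
                int b = ((p & 0xFF) * (e & 0xFF)) / 255;
                out.setRGB(x, y, (r << 16) | (g << 8) | b);
            }
        }
        return out;
    }
}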
I saw this video, and I am really curious how it was performed. Does anyone have any ideas? My intuition is that he scraped pixels from the screen (one per 'box'), and then fed that into some program to determine the next move.
Is scraping pixel-by-pixel the way to do this, or is there a better way? I am looking to do something similar with either Java or Python.
Thanks
Probably that's the most reliable way. There are ways to inspect what is happening inside a process - looking directly at its internal state and memory - but they are platform-specific and very prone to misbehaving because you're dealing with a slightly different version of something - that includes a different Flash version as well as a different version of the app. Those methods are more often used for "trainers" for exe games, where there's typically only one or two versions of the executable to worry about.
Lots of screenshots, comparing, and figuring out reliable indicator pixels seems the way to go - plus keeping track of what you expect to happen, of course. When the app is running, the bot should work from one screenshot at a time (hopefully ensuring a consistent picture, with no half-updated views) and then test the minimum number of pixels needed using (perhaps) a decision tree.
There are ways to automate construction of efficient decision trees, but it's probably easier to do it manually based on comparing screen shots. In this case, since Tetris normally creates all new pieces at the same position, with a 1:1 relationship between colour and shape, you can probably determine the shape and position of a new piece from a single pixel colour - so "decision tree" is probably the wrong term, really, in this case - though there are other things the bot needs to read from the screen.
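In Java, the screenshot-per-frame approach boils down to java.awt.Robot; the board offsets and cell size below are made-up numbers for illustration:

import java.awt.AWTException;
import java.awt.Rectangle;
import java.awt.Robot;
import java.awt.image.BufferedImage;

public class BoardReader {
    // Made-up geometry: where the playfield sits on screen and how big each cell is.
    private static final int BOARD_X = 100, BOARD_Y = 50, CELL = 16;
    private static final int COLS = 10, ROWS = 20;

    private final Robot robot;

    BoardReader() throws AWTException {
        robot = new Robot();
    }

    // One screenshot per frame avoids reading a half-updated board.
    BufferedImage captureBoard() {
        return robot.createScreenCapture(
                new Rectangle(BOARD_X, BOARD_Y, COLS * CELL, ROWS * CELL));
    }

    // Sample the centre pixel of a cell; in Tetris the colour usually maps 1:1 to the piece.
    int cellColor(BufferedImage boardShot, int col, int row) {
        return boardShot.getRGB(col * CELL + CELL / 2, row * CELL + CELL / 2);
    }
}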
What's more interesting is the logic to actually make gameplay decisions, since that bot clearly isn't just slotting every piece into the most immediately obvious position, but deliberately aiming to create opportunities to clear 3 or 4 rows at a time.
Yes, I think he scanned the pixels. Actually, it should be very simple, because you only need to scan the new shape for each move. With that information you can maintain the grid locally and use it for your AI calculations.