Some months ago, I was asked to teach an undergrad computer vision course at my university. It was my first teaching gig and… anyway, the course was a fairly basic introduction to image processing and artificial intelligence. The final project was to develop a basic Sudoku detector.
I’ll be reviewing this application as a small tutorial. I’ll focus on the general outline, the image processing operations needed and just the basic ideas of approaching the problem. I’ll be giving out some tips here and there. The tutorial is Matlab-based, but the approach should be the same no matter the development platform.
The first part will be used to introduce the problem and present possible solutions. My proposed method is just one of many possible, but it should give you some pointers if you want to craft your own solution.
The application’s main goal is to analyze input images of a Sudoku puzzle and to output a matrix containing only the numbers present in the puzzle. Figure 1 shows the app’s real input and desired output. Input images are not easy to process; these contain some (noisy) information that must be filtered out before attempting to recognize the numerical characters. Checkout the sample input image: illumination is crappy, there are some shapes that do not contribute to the actual puzzle (e.g., those clocks depicting the Sudoku’s difficulty) and there’s a nasty perspective distortion going on there.
Figure 1. Sudoku detector input and output. Checkout the outrageous perspective distortion on the input image.
The heart of the system will deal with character recognition. We will implement a basic minimum distance classifier. The classifier will work with a database (or template) of numbers previously labeled and will receive new (unlabeled) data gathered from test input images. The core concept here is the likeness metric we will employ. Minimum distance from the labeled data to the unlabeled data will be used as an index of similarity, thus, the minimum distance will be (theoretically) identical to the maximum similarity.
Divide and Conquer
In order to get the best classified results, we need to clean the image as much as possible. There are two main sources of noise that we will be focusing on: Symbols outside the “puzzle board” (the rectangle encompassing the puzzle numbers) and the distorted perspective. The last being the most critical of the problems. Let’s focus on this one for a bit.
The puzzle is a 9 x 9 squared grid with a known number of cells. Each cell contains either a “blank” (no number) or a numerical character. The board area is delimited by the four corners of the board. Each cell should be as straight as possible – that is, we do not want any distorted number inside a cell. If the number is distorted, the match operation could yield an incorrect answer. Figure 2 depicts the overall idea.
Figure 2. A distorted input could produce a mismatched result (not good!).
Summarizing: We will remove everything that we don’t need from the image and we will correct its perspective. Perspective correction is somewhat difficult, but will substantially boost the number matching quality, provide some robustness and simplify other filtering operations.
Now that we have some idea of what are we trying to do, let’s get over the first part of the processing pipeline of the application. Have a look at Figure 3.
Figure 3. Basic processing chain for board rectification.
No matter how you implement the perspective correction operation (actually, a projective transformation), you need four input points – in the distorted plane – and four output points – in the transformed plane. The transformation will yield a re-mapping of all input pixels inside the board area. The input points will be, of course, the four corners of the puzzle board. The only restriction that we will establish on the four output points is that they must depict an equilateral polygon.
The tricky part of this process is corner detection. What criteria should we use to locate the four corners of the board? Worry not, as we will cover this problem on the next part of the tutorial. For now, examine the input image and try to come up with a solution. Yeah. Check that image out… see anything that could help? Mmhh… Interesting, the board rectangle seems to be the object with the largest area in the image… that should takes us somewhere, right? LOL!
After image rectification we will attempt to filter out all the shapes that are not part of the target characters. Refer to Figure 4. This task is a little easier, as we already know – or should know – the exact position of the board. A simple crop operation will do. Next, check out the information left after some thresholding and area filtering. We are left with a nice black and white image that contains just the information that we will feed into the classifier. And the numbers seem to be undistorted and pretty clear!
Figure 4. Board cropping and segmentation.
The next step is to design the classifier itself. The theory we will implement is known in artificial intelligence as supervised learning classification. Here, the classifier starts with known metrics attributed to each class. Our example has nine known classes (numbers 1 to 9); we need to compute some metrics (attributes) that will optimally describe (label) each individual class. This is the way we build our attribute database of known classes. Now, imagine the classifier receives new, unlabeled, attributes. As suggested early, a possible solution to classify new information is to compute the distance between the unlabeled data and the already labeled data, and search for the minimum distance that will categorize the new information in a known cluster. The complete classifying procedure is depicted in Figure 5.
Figure 5. Supervised learning classification.
This step is not easy. Typically, more than one attribute is needed to describe each class, thus, attribute vectors are used. These vectors are multivariate, and so, a multivariate distance metric is needed. As we work in a multidimensional space of attributes, there’s also more than one definition of distance that we can apply. Euclidean distance between n points is the easiest of the bunch; some others also weigh in other statistical information such as the dispersion between the data inside each class cluster. We begin to delve into some nice statistical theory. I’ll limit things here to the use of multivariate Euclidean distance, focusing on the mean of each class cluster. The template that will be used in this tutorial is shown in Figure 6.
Figure 6. Classifier train template.
Here’s a summary of the operations needed in the number classification stage:
- Choose the best attributes that optimally segment each class.
- Compute attributes and cluster mean for each know class using a labeled template (Classifier Training).
- Receive new data.
- Compute attributes and cluster mean for the new data.
- Compute distance between each labeled cluster and the unlabeled cluster.
- Assign closest class to the unlabeled cluster (Classifier Testing).
The classification phase is carried out for each number in the input image. Of course, the attributes estimated from the template image may only be computed once. After processing all the numbers, it is fairly straightforward to assemble a numerical matrix containing all output classes. We can estimate the weight and height of the puzzle board; we know there’s a total of 9 x 9 cells. Additionally, we have access to the position of each number on the grid. That’s enough information to index a 9 x 9 matrix with all the detected numbers.
Well, for now we (hopefully!) have at least an outline of our attack plan. The tutorial will cover each of the previous points in detail, along with some example Matlab code. We will discuss the first part: board detection in the next post.