4/25/2010

I have developed a very simple framework for doing reconstruction. The framework consists of feature detection, correspondences, removal of incorrect data, distortion removal, computing the projection matrices, triangulation, and mesh creation. The discussion will involve a simplified problem space. Proofs for the methods are left out of the explanation, as the project's scope is to perform the operations rather than cover the theory behind them.

The rig consists of a freely moving calibrated camera at a fixed focus.

The first step is to capture images. The camera must have the same focus for all images, including the images used for camera calibration. For simplicity, the Microsoft LifeCam NX-6000 web cam was used, as it has a single focal length. A regular compact camera would give better results: compact cameras have a better image sensor (CCD) and lenses than the lower quality image sensor (CMOS) and lenses found in web cams.

A good feature detector creates consistent feature points across different images. Characteristics of a good detector are invariance to changes in scale, noise, illumination, and affine (scale and translation) transformations. Feature point detection is primarily done with the Scale Invariant Feature Transform algorithm, or SIFT. Rather than re-implementing available solutions, the project used the VLFeat library. SIFT is invariant to scale and orientation, robust to affine distortion, and partially invariant to illumination changes [ref].

The VLFeat library also handles image correspondences. An advantage of SIFT is that each feature point has an associated descriptor vector. To compare two features, one only has to look at the Euclidean distance between their vectors. However, this is all taken care of by the library.
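Under the hood, the matching reduces to a nearest-neighbour search over descriptor vectors. Below is a minimal sketch in Python/NumPy (rather than the MATLAB/VLFeat toolchain the project actually uses) with made-up toy descriptors; it also applies Lowe's ratio test to reject ambiguous matches:

```python
import numpy as np

def match_descriptors(d1, d2, ratio=0.8):
    """Match each row of d1 to its nearest row of d2 by Euclidean
    distance, keeping the match only when the nearest neighbour is
    clearly better than the second nearest (Lowe's ratio test)."""
    matches = []
    for i, d in enumerate(d1):
        dists = np.linalg.norm(d2 - d, axis=1)   # distance to every descriptor in d2
        j, k = np.argsort(dists)[:2]             # nearest and second nearest
        if dists[j] < ratio * dists[k]:
            matches.append((i, j))
    return matches

# Toy 4-D "descriptors": rows 0 and 1 of b are noisy copies of a's rows 0 and 1.
rng = np.random.default_rng(0)
a = rng.normal(size=(3, 4))
b = a[[0, 1]] + 0.01 * rng.normal(size=(2, 4))
matches = match_descriptors(a, b)
```

The ratio test is what makes distance-based matching usable in practice: an absolute distance threshold is hard to pick, but a match that is much closer than its runner-up is likely correct.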

Although SIFT is a fairly accurate algorithm, it isn't perfect. Poor quality features can occur in parts of the image where the color data rapidly changes (e.g., a tree's leaves), parts where the color data doesn't change (e.g., a flat solid area), highly saturated or under-saturated areas (e.g., lights, darkness), or highly specular or reflective surfaces (e.g., chrome, mirrors). Additionally, there can be bad correspondences from mismatched feature points.

Random Sample Consensus, or RANSAC, was used to remove bad data. RANSAC is an iterative method to estimate the parameters of a mathematical model (p. 54). The model in this case is the fundamental matrix (see Fundamental Matrix). The basic concept of the algorithm is that there is a set of data consistent with the mathematical model, called the inliers, and a set of data not consistent with the model, called the outliers. RANSAC is non-deterministic in nature because it randomly chooses a set of points to create the model. RANSAC runs a set number of iterations or until it finds a good set of inliers. The set of inliers that minimizes the error is chosen to represent the model. Thus the inliers in this case are good feature points and the outliers are noisy and incorrect data.
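The RANSAC loop itself is generic; only the model-fitting step changes. A sketch in Python/NumPy, fitting a 2D line instead of the fundamental matrix to keep the model step short (the line, tolerance, and data are all illustrative assumptions):

```python
import numpy as np

def ransac_line(points, n_iters=200, inlier_tol=0.05, rng=None):
    """Fit y = m*x + c by RANSAC: repeatedly fit a model to a random
    minimal sample, count inliers, and keep the largest consensus set."""
    rng = rng or np.random.default_rng(0)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        i, j = rng.choice(len(points), size=2, replace=False)
        (x1, y1), (x2, y2) = points[i], points[j]
        if x1 == x2:
            continue                              # vertical sample; skip
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        residuals = np.abs(points[:, 1] - (slope * points[:, 0] + intercept))
        inliers = residuals < inlier_tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit the model on the best consensus set with least squares.
    x, y = points[best_inliers, 0], points[best_inliers, 1]
    m, c = np.polyfit(x, y, 1)
    return m, c, best_inliers

# 20 points on y = 2x + 1 plus 5 gross outliers.
rng = np.random.default_rng(1)
xs = np.linspace(0, 1, 20)
good = np.column_stack([xs, 2 * xs + 1 + 0.01 * rng.normal(size=20)])
bad = rng.uniform(-5, 5, size=(5, 2))
pts = np.vstack([good, bad])
m, c, inl = ransac_line(pts)
```

For the fundamental matrix, the minimal sample is eight correspondences, the model fit is the eight-point algorithm, and the residual is a point-to-epipolar-line distance, but the loop structure is identical.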

Once an accurate model has been established, guided matching can occur. Guided matching usually consists of rectifying the images and comparing them to each other. Image rectification and comparison creates a set of dense point correspondences and thus a better end result. This process can be repeated to get the most out of the images. This has yet to be implemented, however.

For simplicity, handpicked results will be used. cpselect is the best way to get manual sub-pixel accuracy point correspondences in MATLAB. Handpicked correspondences are used so that possible errors from automatic feature detection are eliminated.

Camera calibration was done with the Camera Calibration Toolbox for MATLAB. The camera calibration process consists of taking images of a known calibration object. For the toolbox, the calibration object is a checkerboard pattern. The user holds the calibration object in different positions while taking pictures of it. These images are fed through the toolbox, which outputs the calibration information, better known as the intrinsic parameters.

The intrinsic parameters consist of focal length, principal point, and skew coefficient. Additionally, the toolbox can compute the distortions of the camera. With this information, the camera matrix K is built. The camera matrix allows a metric reconstruction of the scene up to scale. Additionally, knowing the lens distortions of the camera allows removal of distortion in the image correspondences. These variables are:

| Name | Description | Variable | Dimensions |
| --- | --- | --- | --- |
| Focal length | Focal length in pixels | fc | 2x1 vector |
| Principal point | Principal point coordinates | cc | 2x1 vector |
| Skew coefficient | Skew coefficient defining the angle between the x and y pixel axes | alpha_c | 1x1 scalar |
| Distortion coefficients | Image distortion coefficients (radial and tangential) | kc | 5x1 vector |

The calibration matrix is computed (p. 157): $$K = \left(\begin{array}{ccc} a_x & s & x_0 \\ 0 & a_y & y_0 \\ 0 & 0 & 1 \end{array}\right) = \left(\begin{array}{ccc} fc(1) & alpha_c * fc(1) & cc(1) \\ 0 & fc(2) & cc(2) \\ 0 & 0 & 1 \end{array}\right)$$
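Building K from the toolbox's outputs is a direct transcription of the formula above. A sketch in Python/NumPy rather than MATLAB, with hypothetical parameter values for illustration:

```python
import numpy as np

def calibration_matrix(fc, cc, alpha_c):
    """Build K from the toolbox's intrinsic parameters, following
    K = [fx, s*fx, cx; 0, fy, cy; 0, 0, 1] with skew s = alpha_c."""
    return np.array([
        [fc[0], alpha_c * fc[0], cc[0]],
        [0.0,   fc[1],           cc[1]],
        [0.0,   0.0,             1.0  ],
    ])

# Hypothetical values: ~800 px focal length, principal point at image centre,
# zero skew (typical for modern sensors).
K = calibration_matrix(fc=[800.0, 795.0], cc=[320.0, 240.0], alpha_c=0.0)
```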

The basic concept of the fundamental matrix is that it relates one image space to another: it maps a point in the first image to the epipolar line in the second image on which its correspondence must lie. The fundamental matrix encodes the essential matrix, which contains the rotation and translation needed to build the projection matrices. It is computed by taking the Kronecker tensor product for every image correspondence and finding the SVD of the resulting matrix (p. 393).
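That linear (eight-point) estimate can be sketched in Python/NumPy and checked against synthetic cameras. Note that practical implementations also normalize the coordinates first (Hartley normalization), which is skipped here for brevity:

```python
import numpy as np

def fundamental_matrix(x1, x2):
    """Linear (eight-point) estimate of F from N >= 8 correspondences.
    Each pair contributes one row kron(x2, x1) (homogeneous coordinates)
    to a design matrix A; F is the singular vector of A with the
    smallest singular value, then forced to rank 2."""
    A = np.array([np.kron(np.append(p2, 1.0), np.append(p1, 1.0))
                  for p1, p2 in zip(x1, x2)])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)                 # null vector of A, as a matrix
    U, S, Vt = np.linalg.svd(F)
    S[2] = 0.0                               # fundamental matrices have rank 2
    return U @ np.diag(S) @ Vt

# Synthetic check: two cameras a unit baseline apart viewing 20 points.
rng = np.random.default_rng(2)
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X = np.column_stack([rng.uniform(-1, 1, (20, 2)),
                     rng.uniform(4, 8, 20),
                     np.ones(20)])
x1 = (X @ P1.T)[:, :2] / (X @ P1.T)[:, 2:]
x2 = (X @ P2.T)[:, :2] / (X @ P2.T)[:, 2:]
F = fundamental_matrix(x1, x2)
```

With exact synthetic data, the epipolar constraint \(x_2^T F x_1 = 0\) should hold to machine precision for every correspondence.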

In this case, \(K_2\) and \(K_1\) are equal because the same camera with the same setting were used. Thus, to recover the essential matrix [4, p. 257]:
$$E = K_2^TFK_1=K^TFK$$

The estimate must then be projected onto the essential space, the set of valid essential matrices, whose singular values are of the form \((1, 1, 0)\) up to scale:

$$[U_E,S_E,V_E ] = SVD(E)$$
$$E = U_E * \left( \begin{array}{ccc} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{array}\right) * V_E^T$$
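Both steps, forming \(E = K^T F K\) and projecting it onto the essential space, can be sketched in Python/NumPy. Here K is taken as the identity and F is a made-up rank-2 matrix purely for the demo:

```python
import numpy as np

def essential_from_fundamental(F, K):
    """E = K^T F K, then projected onto the essential manifold by
    replacing its singular values with (1, 1, 0)."""
    E = K.T @ F @ K
    U, _, Vt = np.linalg.svd(E)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt

K = np.eye(3)                        # identity for the demo; use the real K in practice
F = np.array([[0.0, 0.0,  0.0],
              [0.0, 0.0, -1.0],
              [0.0, 2.0,  0.0]])     # made-up rank-2 matrix standing in for an estimated F
E = essential_from_fundamental(F, K)
```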

The extrinsic parameters, rotation \(R\) and translation \(t\), can be recovered from the essential matrix \(E\).

$$W = \left( \begin{array}{ccc} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{array}\right)$$
$$Z = \left(\begin{array}{ccc} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{array}\right)$$
$$[U_E,S_E,V_E ] = SVD(E)$$
\(R = U_E W V_E^T\) or \(R = U_E W^T V_E^T\)

\(t = U_E*[0 0 1]^T\) or \(t = -U_E*[0 0 1]^T\)

There are now four possible rotation and translation pairs. (p. 258 - 259)
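The four-way decomposition can be sketched in Python/NumPy, with the determinant sign fixed so both rotation candidates are proper rotations. The demo E corresponds to a pure sideways translation with no rotation, so the identity should appear among the candidates:

```python
import numpy as np

def decompose_essential(E):
    """Return the four candidate (R, t) pairs from E."""
    U, _, Vt = np.linalg.svd(E)
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
    Ra, Rb = U @ W @ Vt, U @ W.T @ Vt
    # Force proper rotations (det = +1); SVD signs are otherwise arbitrary.
    if np.linalg.det(Ra) < 0:
        Ra = -Ra
    if np.linalg.det(Rb) < 0:
        Rb = -Rb
    t = U[:, 2]                              # last column of U, up to sign
    return [(Ra, t), (Ra, -t), (Rb, t), (Rb, -t)]

# Demo: E = [t]_x for t = (1, 0, 0) and R = I (pure sideways translation).
E_demo = np.array([[0.0, 0.0,  0.0],
                   [0.0, 0.0, -1.0],
                   [0.0, 1.0,  0.0]])
pairs = decompose_essential(E_demo)
```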

The correct rotation and translation is the pair that creates (i.e. triangulates) points in front of both cameras. First, the projection matrices have to be computed (see Projection Matrices), giving \(P_1\) and \(P_2\). Next, one image pair has to be triangulated (see Triangulation), creating a 3D point \(X\). The pair for which the depth of \(X\) is positive for both projection matrices is the correct rotation and translation. The depth is computed by (p. 162):
$$X=(X,Y,Z,T)^T, \quad PX=(x,y,w)^T$$
$$P=[ M \mid p_4 ]$$
$$depth(X,P) = \frac{ sign(det(M)) * w }{T * ||m^3||}$$
where \(m^3\) is the third row of \(M\).
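The depth formula translates directly into code. A sketch in Python/NumPy, demonstrated on the canonical camera \(P = [I \mid 0]\) with a made-up example point:

```python
import numpy as np

def depth(X, P):
    """Signed depth of homogeneous point X = (X, Y, Z, T) with respect
    to camera P = [M | p4]; positive means in front of the camera."""
    M = P[:, :3]
    w = (P @ X)[2]                   # third coordinate of the projected point
    T = X[3]
    return np.sign(np.linalg.det(M)) * w / (T * np.linalg.norm(M[2]))

# A point 5 units in front of the canonical camera P = [I | 0].
P = np.hstack([np.eye(3), np.zeros((3, 1))])
X = np.array([0.0, 0.0, 5.0, 1.0])
```

Evaluating this for the triangulated point under each of the four candidate pairs, and keeping the pair where it is positive for both cameras, implements the cheirality check described above.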

Computing the projection matrices is a simple task of putting all the information together. Note that the first camera is placed at the origin with no rotation or translation, so its projection matrix is K times the 3x4 canonical matrix. $$P_1 = K * \left(\begin{array}{cccc} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{array}\right)$$ $$P_2 = K * [R \mid t]$$

In a perfect scenario, the two rays extending from the corresponding points would intersect at a point in space. However, because of slight errors in the point correspondences, the rays usually do not intersect, or intersect at the wrong point in space. The optimal triangulation algorithm moves the points closer to their respective epipolar lines (the intersections of the image planes with the plane through the two camera centers and the 3D point) so the rays intersect at a more accurate point in space [4, p. 318]. The final step of the algorithm is the actual triangulation. The algorithm is performed for every correspondence between the two images, creating a point cloud.
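The optimal algorithm is involved; as a simpler stand-in, the linear (DLT) triangulation step at its core can be sketched in Python/NumPy with exact synthetic data:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation: each image point x = (u, v) and
    camera P contribute the rows u*p3 - p1 and v*p3 - p2 to A, where
    p_i is the i-th row of P; the 3D point is the null vector of A."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X / X[3]                  # dehomogenise so T = 1

# Two cameras a unit baseline apart, both looking down +z (made-up demo).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
Xtrue = np.array([0.2, -0.1, 4.0, 1.0])
x1 = (P1 @ Xtrue)[:2] / (P1 @ Xtrue)[2]
x2 = (P2 @ Xtrue)[:2] / (P2 @ Xtrue)[2]
X = triangulate(P1, P2, x1, x2)
```

With noise-free correspondences, the recovered point matches the true point exactly; the optimal algorithm's correction step matters only once the correspondences carry error.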

The final step is the creation of the 3D model. This can be approached in various ways. For a simple mesh creation, MATLAB's Delaunay function proved to yield good results. Texture for each triangle face can be taken from the corresponding three points in either image. The issue with this method is that the texture is wrong whenever the three points do not lie on the same surface. The better method is to create a dense point cloud that has color information for every vertex. For output, a good model format is OBJ, as it has an easy to follow textual representation and is supported by many different model editors.
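Writing the OBJ itself is straightforward given the textual format. A minimal sketch in Python with made-up vertices; the per-vertex RGB values appended to the `v` lines are an unofficial but widely read extension of the format:

```python
import io

def write_obj(vertices, faces, colors=None):
    """Write a minimal OBJ string: one 'v' line per vertex (optionally
    followed by RGB components) and one 'f' line per triangle."""
    out = io.StringIO()
    for i, (x, y, z) in enumerate(vertices):
        if colors is not None:
            r, g, b = colors[i]
            out.write(f"v {x} {y} {z} {r} {g} {b}\n")
        else:
            out.write(f"v {x} {y} {z}\n")
    for a, b, c in faces:
        out.write(f"f {a + 1} {b + 1} {c + 1}\n")   # OBJ indices start at 1
    return out.getvalue()

# A single made-up triangle.
obj = write_obj([(0, 0, 0), (1, 0, 0), (0, 1, 0)], [(0, 1, 2)])
```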

Additionally, external applications such as MeshLab and Blender have proven useful. MeshLab has many point cloud to mesh construction algorithms under *Filters > Remeshing, Simplification, and Reconstruction*. Noteworthy reconstruction algorithms include *Ball Pivoting Surface Reconstruction* algorithm and *Poisson Reconstruction*. Texturing can be accomplished in Blender by UV mapping using the images.