
CV Lecture 8

What is stereoscopic vision

  • Stereoscopic vision describes the ability of the visual brain to register a sense of three-dimensional shape and form from visual inputs.
  • In current usage, stereoscopic vision often refers specifically to the sense of depth derived from the two eyes.

The fovea centralis (fovea) is a small depression at the centre of the retina. It provides the sharpest vision in the human eye, also called foveal vision. The central fovea contains a high concentration of retinal cells called cone photoreceptors. Cone cells help us see colours and fine details.


What is stereo vision?


  • The horopter was originally defined in geometric terms as the locus of points in space that make the same angle at each eye with the fixation point.
  • In more recent studies on binocular vision, it is taken to be the locus of points in space that have the same disparity as fixation.


Why do we need stereo vision?

  • The world is not flat; we live in a three-dimensional world
  • A camera captures the world and stores it as images
  • Images are two dimensional
  • What is the relation between the world and an image?
  • Can we reconstruct the 3D world starting from multiple images?
  • We can if we can establish a correspondence between different views of a scene

Perspective projection

  • The camera exists in a 3D Cartesian frame
  • The image is a planar object
    • It has its own coordinates
  • The focal length $f$ defines the distance between the image plane and the center of projection $O$
  • The line through $O$ and perpendicular to the image plane is the optical axis
  • The intersection of the optical axis with the image plane is called principal point or image center

    image → 2D, eyes → 3D


  • A point P in the world is projected onto a point p on the image
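
Under the pinhole model, working in the camera reference frame, this projection can be written as (a standard relation, stated here for reference; the version that also includes the world-to-camera transformation appears later in these notes):

$$
x = f\,\frac{X_c}{Z_c}, \qquad y = f\,\frac{Y_c}{Z_c}
$$

where $(X_c, Y_c, Z_c)$ are the coordinates of $P$ in the camera frame and $(x, y)$ are the coordinates of $p$ on the image plane.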

World and camera coordinate frames

A camera reference frame is usually rotated and translated with respect to the world reference frame.

  • A rotation matrix $R$ and a translation vector $t$ describe mathematically the transformation.

Summary of camera parameters

  • Extrinsic camera parameters
    the parameters that define the location and orientation of the camera reference frame with respect to a known world reference frame

  • Intrinsic camera parameters
    the parameters necessary to link the pixel coordinates of an image point with the corresponding coordinates in the camera reference frame

From world to camera coordinate frames

$$
P_c = R(P_w - T) \quad \text{where} \quad R = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix}
$$

$$
X_c = R^T_1(P_w - T), \qquad Y_c = R^T_2(P_w - T), \qquad Z_c = R^T_3(P_w - T)
$$

where $R^T_i$ corresponds to the $i$-th row of the rotation matrix.

From camera to image coordinates

$$
x = f\frac{X_c}{Z_c} = f \frac{R^T_1(P_w - T)}{R^T_3(P_w - T)}, \qquad y = f\frac{Y_c}{Z_c} = f \frac{R^T_2(P_w - T)}{R^T_3(P_w - T)}
$$

![image.png](https://wichaiblog-1316355194.cos.ap-hongkong.myqcloud.com/20250203143921.png)

From image to pixel coordinates

$$
x = -(x_{im} - o_x)\,s_x \quad \text{or} \quad x_{im} = -\frac{x}{s_x} + o_x, \qquad y = -(y_{im} - o_y)\,s_y \quad \text{or} \quad y_{im} = -\frac{y}{s_y} + o_y
$$

where $s_x$ and $s_y$ correspond to the effective size of the pixels in the horizontal and vertical directions (in millimeters).
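
To tie the three steps together, the following is a minimal Python/NumPy sketch (not code from the lecture) of the world → camera → image → pixel chain described above; the rotation, translation, focal length, pixel sizes and principal point in the example call are made-up values used purely for illustration.

```python
# Minimal sketch of the projection chain P_w -> P_c -> (x, y) -> (x_im, y_im).
# All numeric values below are hypothetical examples, not calibration data.
import numpy as np

def world_to_pixel(P_w, R, T, f, s_x, s_y, o_x, o_y):
    # World to camera frame: P_c = R (P_w - T)
    P_c = R @ (P_w - T)
    X_c, Y_c, Z_c = P_c
    # Camera frame to image plane (perspective projection), in millimetres
    x = f * X_c / Z_c
    y = f * Y_c / Z_c
    # Image plane to pixel coordinates
    x_im = -x / s_x + o_x
    y_im = -y / s_y + o_y
    return x_im, y_im

# Example: camera rotated 10 degrees about the y axis and displaced 0.5 m along x,
# focal length 8 mm, 10 micrometre pixels, principal point at (320, 240)
theta = np.deg2rad(10)
R = np.array([[np.cos(theta), 0, -np.sin(theta)],
              [0, 1, 0],
              [np.sin(theta), 0, np.cos(theta)]])
T = np.array([0.5, 0.0, 0.0])
print(world_to_pixel(np.array([0.2, 0.1, 3.0]), R, T,
                     f=8.0, s_x=0.01, s_y=0.01, o_x=320.0, o_y=240.0))
```
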
Properties

  • The projection of a point is not unique: any point in the world on the line OP has the same projection
  • The distance to an object is inversely proportional to its image size

![image.png](https://wichaiblog-1316355194.cos.ap-hongkong.myqcloud.com/20250203143943.png)

![image.png](https://wichaiblog-1316355194.cos.ap-hongkong.myqcloud.com/20250203143955.png)

  • When a line (or surface) is parallel to the image plane, the effect of perspective projection is scaling.
  • When a line (or surface) is not parallel to the image plane, we use the term foreshortening to describe the projective distortion (i.e., the dimension parallel to the optical axis is compressed relative to the frontal dimension).

![image.png](https://wichaiblog-1316355194.cos.ap-hongkong.myqcloud.com/20250203144010.png)

  • Lines in 3D project to lines in 2D
  • Distances and angles are not preserved
  • Parallel lines do not in general project to parallel lines (unless they are parallel to the image plane)

![image.png](https://wichaiblog-1316355194.cos.ap-hongkong.myqcloud.com/20250203144026.png)

Effects of focal length

As $f$ gets smaller, more points project onto the image plane (wide-angle camera)

  • Left image: wide-angle image of an electric guitar
  • Right image: wide-angle portrait of another pooch

![image.png](https://wichaiblog-1316355194.cos.ap-hongkong.myqcloud.com/20250203144044.png)

![image.png](https://wichaiblog-1316355194.cos.ap-hongkong.myqcloud.com/20250203144058.png)

As $f$ gets larger, the field of view becomes smaller (more telescopic)

![image.png](https://wichaiblog-1316355194.cos.ap-hongkong.myqcloud.com/20250203144120.png)

More cameras

![image.png](https://wichaiblog-1316355194.cos.ap-hongkong.myqcloud.com/20250203144141.png)

Problem

  • When two cameras look at the same scene and their views overlap, the shared information (the overlapping region of space) can be used to find corresponding information
  • We can then measure the distance from both cameras
  • Combining such information gives us a 2½-D view of the scene
    • It is called so because we cannot see behind the parts of the image for which we have depth information
    • We do not have a proper 3D reconstruction

Binocular Stereo

  • We use the so-called triangulation principle
  • We then
    • extract features from the left and right image
    • search for correspondences between the extracted features
    • perform the triangulation

![image.png](https://wichaiblog-1316355194.cos.ap-hongkong.myqcloud.com/20250203144222.png)

Stereo algorithm

  • The input to stereo vision is a set of two or more images of the same scene taken at the same time from cameras in different positions.
  • The positions and orientations of the two cameras in the world are assumed to be known; if fixed on a rig, their relative pose can be calculated using a calibration target.
  • The output is a depth map.

![image.png](https://wichaiblog-1316355194.cos.ap-hongkong.myqcloud.com/20250203144257.png)

Stereo

  • A point with coordinates $X$ in the rig reference system projects to point $x_1$ in image 1 and $x_2$ in image 2.
  • These two points are said to correspond to each other.
  • From the differences in the image positions of corresponding points, a computation called triangulation determines the coordinates $X$ of the point in the world, relative to the rig reference system.
  • Triangulation is straightforward geometry, while determining pairs of corresponding points is a hard task, the so-called correspondence problem.

![image.png](https://wichaiblog-1316355194.cos.ap-hongkong.myqcloud.com/20250203144315.png)

What features can we use?

  • We have seen quite a few
    • We could use edge information
    • We could use interest points (SIFT etc.)
    • We could use image intensity patches

![image.png](https://wichaiblog-1316355194.cos.ap-hongkong.myqcloud.com/20250203144340.png)

![image.png](https://wichaiblog-1316355194.cos.ap-hongkong.myqcloud.com/20250203144353.png)

Stereo Matching algorithm

  • We want to match features found in two overlapping views
  • We run our favourite feature detector in both images
  • We then find an algorithm that matches the two sets, to find correspondences
  • We can use RANSAC to distinguish between inliers and outliers
  • Read the next slides for more detail on it

RANdom SAmple Consensus (RANSAC)

    M.A. Fischler and R.C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.

  • It is a general parameter-estimation approach designed to cope with a large proportion of outliers in the input data
  • RANSAC is a resampling technique that generates candidate solutions by using the minimum number of observations (data points) required to estimate the underlying model parameters
  • Unlike greedy algorithms, RANSAC uses the smallest set possible and proceeds to enlarge this set with consistent data points

The Algorithm

Algorithm 1 RANSAC

  1. Select randomly the minimum number of points required to determine the model parameters.
  2. Solve for the parameters of the model.
  3. Determine how many points from the set of all points fit with a predefined tolerance ε.
  4. If the fraction of the number of inliers over the total number of points in the set exceeds a predefined threshold τ, re-estimate the model parameters using all the identified inliers and terminate.
  5. Otherwise, repeat steps 1 through 4 (a maximum of N times).

  • The number of iterations, $N$, is chosen high enough to ensure that the probability $p$ that at least one of the sets of random samples does not include an outlier is sufficiently high (e.g. $p = 0.99$).
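
The following is a minimal sketch of Algorithm 1 (not the original Fischler–Bolles code) applied to a toy problem: fitting a 2D line to points contaminated by outliers. The synthetic data and the values of ε, τ and N are made up for illustration.

```python
# RANSAC sketch: fit y = a*x + b to points with ~30% gross outliers.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, 100)   # inliers scattered around y = 2x + 1
y[:30] += rng.uniform(-20, 20, 30)            # first 30 points become gross outliers

def ransac_line(x, y, eps=0.5, tau=0.6, N=200):
    best_inliers = np.zeros(len(x), dtype=bool)
    for _ in range(N):                                   # step 5: at most N trials
        i, j = rng.choice(len(x), 2, replace=False)      # step 1: minimal sample (2 points)
        if x[i] == x[j]:
            continue
        a = (y[j] - y[i]) / (x[j] - x[i])                # step 2: solve for the model
        b = y[i] - a * x[i]
        inliers = np.abs(y - (a * x + b)) < eps          # step 3: points within tolerance eps
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
        if inliers.sum() / len(x) > tau:                 # step 4: enough support, stop early
            break
    # step 4 (final): re-estimate the model using all identified inliers (least squares)
    a, b = np.polyfit(x[best_inliers], y[best_inliers], 1)
    return a, b, best_inliers

a, b, inliers = ransac_line(x, y)
print("estimated line: y = %.2f x + %.2f, inliers: %d" % (a, b, inliers.sum()))
```
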

Using SIFT

Let us say we used SIFT in both the left and right images

![image.png](https://wichaiblog-1316355194.cos.ap-hongkong.myqcloud.com/20250203144514.png)

Using RANSAC

  • Red points: points without a “good” match in the other image
    • “Good” means the ratio of the distances to the second nearest neighbour and to the first nearest neighbour is large enough (Lowe's ratio test)
  • Blue points: points with a “good” match that is nevertheless wrong
  • Yellow points: correct matches
  • RANSAC must run until it randomly picks 4 yellow points from among the blue and yellow points (the matches estimated to be “good”)

![image.png](https://wichaiblog-1316355194.cos.ap-hongkong.myqcloud.com/20250203144651.png)

Outlier rejection

  • We match only those features that have similar enough matches
  • This can be done exhaustively, searching over all the possible matches
  • A cleverer method is to assume the cameras are not too far apart and that they look roughly at the same scene from similar perspectives
  • This means a feature from one camera must lie within a larger corresponding area in the second camera
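
One possible implementation of this matching pipeline with OpenCV is sketched below (assuming a build of OpenCV with SIFT available; left.png and right.png are placeholder filenames, and the 0.75 ratio and RANSAC threshold are typical but arbitrary choices). RANSAC here fits a homography from 4-point samples, echoing the "4 yellow points" remark above, purely as a way of separating inliers from wrong-but-plausible matches.

```python
# SIFT features + ratio test + RANSAC inlier selection (sketch with placeholder inputs).
import cv2
import numpy as np

left = cv2.imread('left.png', cv2.IMREAD_GRAYSCALE)
right = cv2.imread('right.png', cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(left, None)
kp2, des2 = sift.detectAndCompute(right, None)

# Two nearest neighbours per descriptor, then Lowe's ratio test to keep "good" matches
matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des1, des2, k=2)
good = []
for pair in matches:
    if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
        good.append(pair[0])

pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# RANSAC over the "good" matches; the mask marks the consistent (inlier) correspondences
H, mask = cv2.findHomography(pts1, pts2, cv2.RANSAC, 5.0)
print("good matches:", len(good), "inliers:", int(mask.sum()))
```
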

Correspondence problem: template matching

It is performed by choosing a template window from the first image and using a correlation technique to match it against a region of interest in the second image.

![image.png](https://wichaiblog-1316355194.cos.ap-hongkong.myqcloud.com/20250203144619.png)

Sum of Squared Differences (SSD)

  • It is a measure of match based on pixel-by-pixel intensity differences between images
  • It sums the squared differences between the template pixels and the corresponding image pixels:

$$
SSD(u, v) = \sum_n \sum_m \left[ I(u + m, v + n) - T(m, n) \right]^2
$$

  • where $T(u, v)$ is the template and $I(u, v)$ is the image

Python code

  • In the next couple of slides, we will learn how to implement a disparity map using simple SSD template matching
  • The function takes the stereo pair, the template window size and the maximum offset
  • To start with, we read in a stereo pair, i.e. left and right images

```python
#!/usr/bin/env python
# -----------------------------------------------------------------
# Simple sum of squared differences (SSD) stereo-matching using Numpy
# -----------------------------------------------------------------
# Copyright (c) 2016 David Christian
# Licensed under the MIT License
import numpy as np
from PIL import Image


def stereo_match(left_img, right_img, kernel, max_offset):
    # Load both images and convert them to greyscale
    left_img = Image.open(left_img).convert('L')
    left = np.asarray(left_img)
    right_img = Image.open(right_img).convert('L')
    right = np.asarray(right_img)
    w, h = left_img.size  # assume that both images have the same size

    # Depth (or disparity) map
    depth = np.zeros((h, w), np.uint8)

    kernel_half = int(kernel / 2)
    offset_adjust = 255 / max_offset  # used to map the depth map output to the 0-255 range

    for y in range(kernel_half, h - kernel_half):
        print("\rProcessing.. %d%% complete" % (y / (h - kernel_half) * 100), end="", flush=True)
        for x in range(kernel_half, w - kernel_half):
            best_offset = 0
            prev_ssd = 65534

            for offset in range(max_offset):
                ssd = 0
                # v and u scan the local window around (x, y); a window is needed because
                # the squared difference of two single pixels is not enough to go on
                for v in range(-kernel_half, kernel_half):
                    for u in range(-kernel_half, kernel_half):
                        # iteratively sum the squared differences for this block;
                        # left[] and right[] are uint8 arrays, so converting to int avoids overflow
                        ssd_temp = int(left[y + v, x + u]) - int(right[y + v, (x + u) - offset])
                        ssd += ssd_temp * ssd_temp

                # a smaller SSD means a closer match: keep the best offset for this block
                if ssd < prev_ssd:
                    prev_ssd = ssd
                    best_offset = offset

            # set the depth output for this x,y location to the best match
            depth[y, x] = best_offset * offset_adjust

    # Convert to PIL and save it
    Image.fromarray(depth).save('depth.png')
```

Epipolar geometry

  • When the relative position of a stereo pair is known, a point $x_2$ corresponding to a given point $x_1$ is constrained to lie on a single, known line in image 2, called the epipolar line of point $x_1$
  • This relationship holds for images that are free from distortion

![image.png](https://wichaiblog-1316355194.cos.ap-hongkong.myqcloud.com/20250203144857.png)

  • The world point X and the centers of projection of the two cameras identify a plane in space, the epipolar plane of point X.
  • The figure above shows a triangle of this plane, delimited by the two projection rays and by the baseline of the stereo pair, that is, the line segment that connects the two centers of projection.
  • If the image planes are thought of as extending indefinitely, the baseline intersects the two image planes at two points called the epipoles of the two images.
  • In particular, if the baseline is parallel to an image plane, then the corresponding epipole is a point at infinity.

More on epipolar geometry

  • The epipolar plane intersects the two image planes along the two epipolar lines of point $X$, which by construction pass through the two projection points $x_1$ and $x_2$ and through the two epipoles.
  • Thus, epipolar lines come in corresponding pairs, and the correspondence is established by the single epipolar plane for the given point $X$.
  • For a different world point $X'$, the epipolar plane changes, and so do the image projections of $X'$ and the epipolar lines.
  • However, all epipolar planes contain the baseline.
  • So, the set of epipolar planes forms a pencil of planes, supported by the line through the baseline.

  • Suppose now that we are given the two images and point $x_1$ in image 1
  • We do not know where the corresponding point $x_2$ is in the other image, nor where the world point $X$ is
  • However, the two centers of projection and point $x_1$ identify the epipolar plane, and this in turn determines the epipolar line of point $x_1$ in image 2
  • The point $x_2$ must be somewhere on this line, so this construction reduces the search for $x_2$ from a rectangle (image 2 in its entirety) to a line segment (the part of the epipolar line contained in image 2)

Image Rectification

  • The process of re-sampling pairs of stereo images to produce a pair of matched epipolar projections.
  • The rectification makes the corresponding epipolar lines coincide and be parallel to the x-axis.
  • Consequently, the disparity between the two images is only in the x-direction.
  • A pair of stereo-rectified images is helpful for dense stereo matching algorithms: it restricts the search domain for each match to a line parallel to the x-axis.

Illustration of image rectification

![image.png](https://wichaiblog-1316355194.cos.ap-hongkong.myqcloud.com/20250203145012.png)

  • Left is the original stereo configuration, right the rectified pair
  • The image planes are now co-planar and their x-axis is parallel to the CC' baseline

![image.png](https://wichaiblog-1316355194.cos.ap-hongkong.myqcloud.com/20250203145029.png)

![image.png](https://wichaiblog-1316355194.cos.ap-hongkong.myqcloud.com/20250203145039.png)

Epipolar constraint

![image.png](https://wichaiblog-1316355194.cos.ap-hongkong.myqcloud.com/20250203145055.png)

  • Straight lines in the world map onto straight lines in the images
  • It is always possible to transform the two images of a stereo pair so that epipolar lines coincide with rows in the image arrays
  • This transformation is called rectification

![image.png](https://wichaiblog-1316355194.cos.ap-hongkong.myqcloud.com/20250203145127.png)

![image.png](https://wichaiblog-1316355194.cos.ap-hongkong.myqcloud.com/20250203145134.png)

![image.png](https://wichaiblog-1316355194.cos.ap-hongkong.myqcloud.com/20250203145152.png)
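
With a calibrated rig, rectification can be carried out with OpenCV as in the rough sketch below; every numeric value (intrinsics, distortion, relative pose, image size) and both filenames are placeholders rather than real calibration data.

```python
# Stereo rectification sketch: after remapping, corresponding epipolar lines are image rows.
import cv2
import numpy as np

K1 = K2 = np.array([[800.0, 0, 320.0], [0, 800.0, 240.0], [0, 0, 1.0]])  # hypothetical intrinsics
d1 = d2 = np.zeros(5)                       # assume no lens distortion
R = np.eye(3)                               # relative rotation between the two cameras
T = np.array([0.1, 0.0, 0.0])               # 10 cm baseline along x
size = (640, 480)

R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(K1, d1, K2, d2, size, R, T)

map1x, map1y = cv2.initUndistortRectifyMap(K1, d1, R1, P1, size, cv2.CV_32FC1)
map2x, map2y = cv2.initUndistortRectifyMap(K2, d2, R2, P2, size, cv2.CV_32FC1)

left = cv2.imread('left.png', cv2.IMREAD_GRAYSCALE)     # placeholder filenames
right = cv2.imread('right.png', cv2.IMREAD_GRAYSCALE)
left_rect = cv2.remap(left, map1x, map1y, cv2.INTER_LINEAR)
right_rect = cv2.remap(right, map2x, map2y, cv2.INTER_LINEAR)
```
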

The Essential matrix

  • The Essential matrix uses camera coordinates
  • It maps a point $x$ in one image to the epipolar line $l'$ of that point in the other image:

$$
E\,x = l'
$$

  • The essential matrix establishes constraints between matching image points
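
A short NumPy sketch of this constraint is given below (not from the lecture; the relative pose and the 3D point are made-up values). It uses the standard construction $E = [t]_\times R$ for a relative pose in which $X_2 = R X_1 + t$, and checks that a pair of corresponding normalized image points satisfies $x_2^\top E\, x_1 = 0$, with $E x_1$ giving the epipolar line in image 2.

```python
# Essential matrix from a known relative pose, plus an epipolar-constraint check.
import numpy as np

def skew(t):
    # 3x3 skew-symmetric matrix [t]_x such that [t]_x @ v == np.cross(t, v)
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

# Hypothetical relative pose: small rotation about y, translation along x (the baseline)
theta = np.deg2rad(5)
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])
t = np.array([0.1, 0.0, 0.0])

E = skew(t) @ R                 # Essential matrix

# A point in camera-1 coordinates and the same point in camera-2 coordinates
X1 = np.array([0.3, -0.2, 2.0])
X2 = R @ X1 + t

# Normalized (camera) image coordinates as homogeneous 3-vectors
x1 = X1 / X1[2]
x2 = X2 / X2[2]

l2 = E @ x1                                           # epipolar line of x1 in image 2
print("epipolar constraint x2^T E x1 =", x2 @ l2)     # ~0 up to floating-point error
```
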

The Fundamental Matrix

  • The Fundamental matrix is the equivalent of the Essential matrix when the cameras are uncalibrated: it expresses the same epipolar constraint directly in pixel coordinates.
  • If $x_1$ and $x_2$ is a pair of corresponding points in two views, then they satisfy the scalar equation $x_2^T F x_1 = 0$.
  • The fundamental matrix can be computed from correspondences of imaged scene points alone, without requiring knowledge of the camera internal parameters or relative pose.
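
As a small illustration of this point, the sketch below (not from the lecture) synthesises a consistent set of pixel correspondences from made-up intrinsics, pose and 3D points, then estimates $F$ with OpenCV's eight-point routine using the correspondences alone, and finally verifies that $x_2^\top F x_1 \approx 0$ for every pair.

```python
# Estimating the fundamental matrix from point correspondences alone (synthetic data).
import cv2
import numpy as np

rng = np.random.default_rng(0)
K = np.array([[800.0, 0, 320.0], [0, 800.0, 240.0], [0, 0, 1.0]])   # hypothetical intrinsics
theta = np.deg2rad(4)
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])                   # small rotation about y
t = np.array([0.2, 0.0, 0.0])                                        # baseline along x

X = rng.uniform([-1, -1, 4], [1, 1, 8], size=(12, 3))                # 3D points, camera-1 frame

def project(P, R, t, K):
    Xc = P @ R.T + t                 # transform each point into the camera frame
    x = Xc / Xc[:, 2:3]              # perspective division
    return (x @ K.T)[:, :2]          # pixel coordinates

pts1 = project(X, np.eye(3), np.zeros(3), K).astype(np.float32)
pts2 = project(X, R, t, K).astype(np.float32)

F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_8POINT)

# Every correspondence should satisfy x2^T F x1 ~ 0
x1h = np.hstack([pts1, np.ones((len(pts1), 1))])
x2h = np.hstack([pts2, np.ones((len(pts2), 1))])
print("max |x2^T F x1| =", np.abs(np.einsum('ij,jk,ik->i', x2h, F, x1h)).max())
```
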

Estimating camera poses in stereo pairs

  • We can find the correspondences between the views
    • We assume these are correct
    • This is usually not the case, algorithms exist to rectify the problem
  • Using the estimated correspondences and the epipolar constraint
    • We estimate the Essential matrix E
  • We then retrieve the relative pose (R, t) using E, as sketched below
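
One way to do this with OpenCV is sketched below (a minimal example, not the lecture's code; the intrinsics, relative pose and 3D points are made-up values used only to synthesise correspondences). Note that the translation recovered from E is defined only up to scale.

```python
# Estimate E from correspondences and the epipolar constraint, then recover (R, t).
import cv2
import numpy as np

rng = np.random.default_rng(2)
K = np.array([[800.0, 0, 320.0], [0, 800.0, 240.0], [0, 0, 1.0]])   # hypothetical intrinsics
ang = np.deg2rad(6)
R_true = np.array([[np.cos(ang), 0, np.sin(ang)],
                   [0, 1, 0],
                   [-np.sin(ang), 0, np.cos(ang)]])
t_true = np.array([0.3, 0.0, 0.05])

X = rng.uniform([-1, -1, 4], [1, 1, 9], size=(30, 3))   # 3D points in the first camera frame

def pix(P):                                             # project into pixels with intrinsics K
    x = P / P[:, 2:3]
    return (x @ K.T)[:, :2].astype(np.float32)

pts1 = pix(X)
pts2 = pix(X @ R_true.T + t_true)

E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, prob=0.999, threshold=1.0)
retval, R, t, pose_mask = cv2.recoverPose(E, pts1, pts2, K)
print("recovered R:\n", R)                    # close to R_true
print("recovered t direction:", t.ravel())    # equals t_true only up to scale
```
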

Baseline, Disparity, and Depth

  • Depth is inversely proportional to disparity
    • the shift to the left of an image feature when viewed in the right image
  • The larger the disparity, the closer the object is to the camera baseline
  • The smaller the disparity, the farther the object is from the camera baseline
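
For a rectified stereo pair this inverse relationship can be written explicitly (a standard result, stated here for reference): with focal length $f$, baseline $B$, and disparity $d = x_l - x_r$ between the left- and right-image coordinates of the same point,

$$
Z = \frac{f\,B}{d}
$$

For example, with $f = 800$ pixels, $B = 0.1$ m and $d = 20$ pixels (made-up values), the point lies at $Z = 800 \times 0.1 / 20 = 4$ m.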

Stereo Vision: sparse versus dense

  • Sparse matching
    • This is performed as we have discussed: we extract interest points for both camera views and then match
    • It is called sparse because it has holes: not all points of the scene can be reconstructed as 3D
  • Dense matching
    • This is performed for all pixels
    • We can interpolate across the depth values estimated by sparse matching
    • We use correlation-based stereo
      • use pixel neighbourhood as features
      • compare using pixel correlation
      • obtain depth at every pixel location in image
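
A minimal correlation-based dense-matching sketch using OpenCV's block matcher is shown below (assuming an already rectified greyscale pair; the filenames and parameter values are placeholders).

```python
# Dense (per-pixel) disparity with simple block matching on a rectified pair.
import cv2

left = cv2.imread('left_rect.png', cv2.IMREAD_GRAYSCALE)
right = cv2.imread('right_rect.png', cv2.IMREAD_GRAYSCALE)

# numDisparities must be a multiple of 16; blockSize is the pixel-neighbourhood size
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right)     # fixed-point disparities (scaled by 16)

# Normalize to 0-255 for visualisation and save
vis = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX).astype('uint8')
cv2.imwrite('disparity.png', vis)
```
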

Stereoscopic sensors


Use of Stereo Technology


Ubiquitous stereo vision device


Decision of camera layout by simulation

The camera arrangement is determined using CAD data prepared beforehand.

Head tracking in classroom

  • Motion may be estimated from the position of the head, waist, or feet. …lots of examples

Summary

  • We have taken a more detailed view of camera geometry
    • Extrinsic and intrinsic camera parameters
  • We have learnt about the geometry of a stereo pair of cameras
  • We have seen how to transform a 3D world point onto an image point
  • We have learnt about correlation methods
  • We have seen the RGBD sensor and some applications
This post is licensed under CC BY 4.0 by the author.
