Pinhole Camera Model

Introduction

The pinhole camera model is a mathematical representation of a physical pinhole camera. In the context of computer vision, the pinhole model allows us to formulate the process of transforming points from 3D world-space to digital 2D image-space.


Coordinate systems

There are four coordinate systems that we need to consider when talking about the pinhole camera model.

  • World
    • The real world coordinates of the points we're projecting
  • Camera
    • The coordinates of the same points from the perspective of the camera
  • Image
    • The coordinates of the plane onto which the 3D points are projected. In a digital camera, it's the physical sensor.
  • Pixel
    • The coordinates in pixels on the produced digital image

FIGURE 1: Coordinate systems

To project any 3D point in the real world to a 2D digital image, we apply the following chain of transformations:

World [ U V W ] → Camera [ X Y Z ] → Image [ x y ] → Pixel [ u v ]

FIGURE 2: 3D point to 2D digital space transformation

Doing so for every point of a real-world object will create a digital image of the object.


Model Visualization

The pinhole model can be formally visualized in the following manner:

FIGURE 3: The pinhole camera model

  • P is a singular point taken as an example.
  • O is the center of the camera (the pinhole).
  • C' is the projection of the pinhole (O) onto the image plane.
  • P' is the point P projected onto the image plane.
  • i, j and k are the three coordinate axes.

You can kind of visualize this model in the real world like this (note that the pinhole should actually be infinitely small, not a large circle):

FIGURE 4: Real-world visualization of the pinhole camera model


Camera → Image

We are here: the Camera [ X Y Z ] → Image [ x y ] step of the World → Camera → Image → Pixel chain.

What we want to figure out is how to convert a point $P_c$ in camera space to a point $P_i$ in image space. We need some function to perform this transformation:

$$P_i = f(P_c)$$

You may notice in figure 3 that the points $C'$, $O$, $P'$ and $P$, $O$, $Z$ form similar triangles.

FIGURE 5: Similar triangles in

the pinhole camera model

Using the similar triangle property:

$$\frac{x}{z} = \frac{x'}{f} \implies x' = \frac{fx}{z} \qquad \frac{y}{z} = \frac{y'}{f} \implies y' = \frac{fy}{z}$$

Our conversion function now becomes:

$$P_i = \begin{bmatrix} x' & y' \end{bmatrix}^T = \begin{bmatrix} \frac{fx}{z} & \frac{fy}{z} \end{bmatrix}^T$$
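To make the similar-triangle result concrete, here is a minimal Python sketch of the camera-to-image projection; the function name and sample values are mine, not from any particular library:

```python
import numpy as np

def camera_to_image(P_c, f):
    """Project a camera-space point [x, y, z] onto the image plane
    at focal length f, using the similar-triangle relations above."""
    x, y, z = P_c
    return np.array([f * x / z, f * y / z])

# A point 2000 mm in front of the camera, 100 mm right and 50 mm up,
# seen through a 35 mm focal length (everything in mm):
print(camera_to_image(np.array([100.0, 50.0, 2000.0]), f=35.0))
# -> [1.75  0.875]
```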

Image → Pixel

We are here: the Image [ x y ] → Pixel [ u v ] step of the World → Camera → Image → Pixel chain.

Now we need to convert the point $P_i$ from image space to a point $P_p$ in pixel space.

In figure 1 we can see that the origin of the image-space coordinate system is at the center, while in the pixel-space coordinate system the origin is in the bottom-left corner.

FIGURE 6: Translation between image and pixel space coordinate systems

This offset needs to be accounted for. The translation vector can be represented as:

$$[c_x, c_y]^T$$

Adding it to our transformation function yields:

$$P_p = \begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \frac{fx}{z} + c_x \\ \frac{fy}{z} + c_y \end{bmatrix}$$

Up to now, we've been working in real-world units, like millimeters. However, since we're creating a digital image, the units should be pixels. We can change the units by introducing two new variables, $k$ and $l$. They represent the number of pixels per real-world unit, with units like $\frac{\text{pixel}}{\text{mm}}$, allowing us to cancel out the real-world units.

Adding them to our transformation function yields:

$$P_p = \begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} k\frac{fx}{z} + c_x \\ l\frac{fy}{z} + c_y \end{bmatrix}$$

Substituting $\alpha = kf$ and $\beta = lf$ to tidy things up:

$$P_p = \begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \alpha\frac{x}{z} + c_x \\ \beta\frac{y}{z} + c_y \end{bmatrix}$$
Why do we have two variables for pixel/mm ratio?

Although in our daily lives we mostly encounter square pixels, that's not always the case. Having separate variables for x and y dimensions helps generalize the model to handle all kinds of pixel shapes.
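To see the intrinsics in action, here's a small Python sketch of the full camera-to-pixel conversion; the parameter values (1000 px focal constants, a 640×480 image) are made up for illustration:

```python
import numpy as np

def camera_to_pixel(P_c, alpha, beta, c_x, c_y):
    """Map a camera-space point [x, y, z] to pixel coordinates [u, v].
    alpha = k*f and beta = l*f fold the focal length and the
    pixel/mm ratios into one constant per axis."""
    x, y, z = P_c
    return np.array([alpha * x / z + c_x,
                     beta * y / z + c_y])

# Hypothetical intrinsics: principal point at the image center.
print(camera_to_pixel(np.array([0.1, 0.05, 2.0]),
                      alpha=1000.0, beta=1000.0, c_x=320.0, c_y=240.0))
# -> [370. 265.]
```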


Matrixification[2]

Scientists and computers love matrices, so it would be a good idea to represent our transformation as a matrix. If we do that, a simple matrix multiplication will yield the result.

However, there is one problem. This mapping is non-linear, meaning we cannot represent it as a matrix-vector product. To make it linear, we can make use of homogeneous coordinates.

What makes this mapping non-linear?

A mapping $T : V \to W$ is linear if it satisfies the following two properties:[3]

$$\text{(1)} \quad T(u + v) = T(u) + T(v) \quad \text{for all } u, v \in V,$$
$$\text{(2)} \quad T(\lambda v) = \lambda T(v) \quad \text{for all scalars } \lambda \text{ and all } v \in V.$$

In our case $V$ is the camera space and $W$ is the pixel space. Let's run the $x$ part of our mapping,

$$f(x, z) = \alpha\frac{x}{z} + c_x$$

through the second property, first scaling the input:

$$f(2x, 2z) = \alpha\frac{2x}{2z} + c_x$$

And then scaling the output:

$$2f(x, z) = 2\left(\alpha\frac{x}{z} + c_x\right) = 2\alpha\frac{x}{z} + 2c_x$$

Since

$$\alpha\frac{2x}{2z} + c_x \neq 2\alpha\frac{x}{z} + 2c_x,$$

this mapping is non-linear.
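You can also confirm this numerically; a throwaway Python check with made-up values for $\alpha$ and $c_x$:

```python
def f(x, z, alpha=1000.0, c_x=320.0):
    return alpha * x / z + c_x

x, z = 0.1, 2.0
print(f(2 * x, 2 * z))  # 370.0 -- scaling the input cancels out in x/z
print(2 * f(x, z))      # 740.0 -- scaling the output doubles c_x too
```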

We convert our coordinates to homogeneous coordinates by adding an extra dimension with a value of 1:

$$P_p = \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} \alpha\frac{x}{z} + c_x \\ \beta\frac{y}{z} + c_y \\ 1 \end{bmatrix}$$

Homogeneous coordinates are equal up to scale, so we can multiply the whole vector by $z$ to get rid of the division:

$$P_p = \begin{bmatrix} \alpha x + c_x z \\ \beta y + c_y z \\ z \end{bmatrix}$$

Note that the original 3D point $P$ also becomes homogeneous,

$$P = \begin{bmatrix} x & y & z & 1 \end{bmatrix}^T$$

And now we can represent the transformation $P \to P_p$ with a matrix-vector multiplication,

$$P_p = \begin{bmatrix} \alpha & 0 & c_x & 0 \\ 0 & \beta & c_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}$$

Which can be simplified to:

$$P_p = MP$$
Can't I simply divide each of the values in P by z and avoid homogeneous coordinates?

Yes, you can. The transformation would look like this:

$$P_p = \begin{bmatrix} \alpha & 0 & c_x \\ 0 & \beta & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x/z \\ y/z \\ 1 \end{bmatrix}$$

However, it is admittedly less "clean", and the early division by $z$ gets in the way later, when we chain this matrix with the world-to-camera matrices below.
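In code, the homogeneous version really is just one matrix-vector product plus a final division; a minimal numpy sketch with the same illustrative intrinsics as before:

```python
import numpy as np

# The 3x4 intrinsic matrix M, with hypothetical values.
M = np.array([[1000.0,    0.0, 320.0, 0.0],
              [   0.0, 1000.0, 240.0, 0.0],
              [   0.0,    0.0,   1.0, 0.0]])

P = np.array([0.1, 0.05, 2.0, 1.0])  # homogeneous camera-space point

P_h = M @ P              # homogeneous pixel coordinates [z*u, z*v, z]
u, v = P_h[:2] / P_h[2]  # divide by the last coordinate to get [u, v]
print(u, v)              # -> 370.0 265.0
```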


World → Camera

We are here: the World [ U V W ] → Camera [ X Y Z ] step of the World → Camera → Image → Pixel chain.

So far we've been working with the camera and digital-space coordinate systems. The final piece of the real-world to digital-image pipeline is converting the real-world 3D point $P_w$ to camera space.

To add intuition to this process I propose the following analogy:

Imagine you are a camera, somewhere in the real world. Your friend is standing somewhere else and points at something, telling you to take a picture of it. To get the exact shot your friend wants, you need to move to his position and aim yourself at the target.

This is what we do here. We use a translation matrix to "move" the camera coordinate system to the real-world coordinate system origin, and the rotation matrix to "aim" at the point.

The good news is that both rotation and translation can be achieved through matrix multiplication when using homogeneous coordinates.

Translation matrix $T$:

$$T = \begin{bmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

Rotation matrices (one for each axis):

$$R_x(\theta) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{bmatrix} \quad R_y(\theta) = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix} \quad R_z(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

The rotation matrices can be combined into one general rotation matrix $R$ by multiplying them together.

Using these matrices, we can represent the conversion from the world-space coordinates to camera-space coordinates by matrix multiplication:

$$P_c = R \, T \, P_w$$
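A minimal numpy sketch of this step, using a single rotation about the x-axis for brevity (the values and helper names are illustrative, not canonical):

```python
import numpy as np

def rotation_x(theta):
    """3x3 rotation about the x-axis."""
    return np.array([[1, 0, 0],
                     [0, np.cos(theta), -np.sin(theta)],
                     [0, np.sin(theta),  np.cos(theta)]])

def world_to_camera(P_w, R3, t):
    """Translate the world-space point by t, then rotate by R3,
    mirroring the 'move, then aim' analogy above."""
    T = np.eye(4)
    T[:3, 3] = t
    R = np.eye(4)
    R[:3, :3] = R3
    return R @ T @ np.append(P_w, 1.0)  # homogeneous [X, Y, Z, 1]

# Camera offset by 1 unit along x, rotated 90 degrees about x:
print(world_to_camera(np.array([0.0, 1.0, 0.0]),
                      rotation_x(np.pi / 2), t=np.array([1.0, 0.0, 0.0])))
# -> approximately [1. 0. 1. 1.]
```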

Putting it all together

To convert a given point $P_w$ in world space to a pixel in digital space, we perform the following chain of matrix multiplications:

$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} \alpha & 0 & c_x & 0 \\ 0 & \beta & c_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & r_{14} \\ r_{21} & r_{22} & r_{23} & r_{24} \\ r_{31} & r_{32} & r_{33} & r_{34} \\ r_{41} & r_{42} & r_{43} & r_{44} \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} U \\ V \\ W \\ 1 \end{bmatrix}$$

Or more simply:

$$P_p = M \, R \, T \, P_w$$
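And the whole pipeline chained together in Python; a sketch under the same made-up intrinsics, with an identity rotation and a simple translation:

```python
import numpy as np

def project(P_w, M, R, T):
    """World-space point -> pixel coordinates via P_p = M R T P_w."""
    P_h = M @ R @ T @ np.append(P_w, 1.0)
    return P_h[:2] / P_h[2]  # divide out the homogeneous coordinate

M = np.array([[1000.0,    0.0, 320.0, 0.0],
              [   0.0, 1000.0, 240.0, 0.0],
              [   0.0,    0.0,   1.0, 0.0]])
R = np.eye(4)               # camera already aimed at the scene
T = np.eye(4)
T[:3, 3] = [0.0, 0.0, 2.0]  # world origin sits 2 units ahead of the camera
print(project(np.array([0.1, 0.05, 0.0]), M, R, T))  # -> [370. 265.]
```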

Notes


  1. This article was heavily inspired by, and is a lossy compression of, the CS231A course notes from Stanford and the CMPEN454 notes from PSU
  2. Pronounced with the syllabification of Californication
  3. Formal definition