Pinhole Camera Model

Introduction

The pinhole camera model is a mathematical representation of a physical pinhole camera. In the context of computer vision, the pinhole model allows us to formulate the process of transforming points from 3D world-space to digital 2D image-space.


Coordinate systems

There are four coordinate systems that we need to consider when talking about the pinhole camera model.

  • World
    • The real world coordinates of the points we're projecting
  • Camera
    • The coordinates of the same points from the perspective of the camera
  • Image
    • The coordinates of the plane onto which the 3D points are projected. In a digital camera, it's the physical sensor.
  • Pixel
    • The coordinates in pixels on the produced digital image

FIGURE 1: Coordinate systems

To project any 3D point in the real world to a 2D digital image, we apply the following chain of transformations:

World [ U V W ] → Camera [ X Y Z ] → Image [ x y ] → Pixel [ u v ]

FIGURE 2: 3D point to 2D digital space transformation

Doing so for every point of a real-world object will create a digital image of the object.


Model Visualization

The pinhole model can be formally visualized in the following manner:

FIGURE 3: The pinhole camera model

  • P is a singular point taken as an example.
  • O is the center of the camera (the pinhole).
  • C' is the projection of the pinhole (O) onto the image plane.
  • P' is the point P projected onto the image plane.
  • i, j and k are the three coordinate axes.

You can kind of visualize this model in the real world like this (note that the pinhole should actually be infinitely small, not a large circle):

FIGURE 4: Real-world visualization of the pinhole camera model


Camera → Image

We are here: the Camera [ X Y Z ] → Image [ x y ] step of the World → Camera → Image → Pixel chain.

What we want to figure out is how to convert a point $P_c$ in camera space to a point $P_i$ in image space. We need some function to perform this transformation:

$$P_i = f(P_c)$$

You may notice in figure 3 that the points $C'$, $O$, $P'$ and $P$, $O$, $Z$ form similar triangles.

FIGURE 5: Similar triangles in

the pinhole camera model

Using the similar triangle property:

$$\frac{x}{z} = \frac{x'}{f} \implies x' = \frac{fx}{z} \qquad \frac{y}{z} = \frac{y'}{f} \implies y' = \frac{fy}{z}$$

Our conversion function now becomes:

$$P_i = \begin{bmatrix} x' & y' \end{bmatrix}^T = \begin{bmatrix} \frac{fx}{z} & \frac{fy}{z} \end{bmatrix}^T$$
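To make the similar-triangle result concrete, here is a minimal Python sketch of the camera-to-image projection; the function name and sample values are mine, not from any particular library:

```python
import numpy as np

def camera_to_image(P_c, f):
    """Project a camera-space point [x, y, z] onto the image plane
    at focal length f, using the similar-triangle relations above."""
    x, y, z = P_c
    return np.array([f * x / z, f * y / z])

# A point 2000 mm in front of the camera, 100 mm right and 50 mm up,
# seen through a 35 mm focal length (everything in mm):
print(camera_to_image(np.array([100.0, 50.0, 2000.0]), f=35.0))
# -> [1.75  0.875]
```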

Image → Pixel

We are here: the Image [ x y ] → Pixel [ u v ] step of the World → Camera → Image → Pixel chain.

Now we need to convert the point $P_i$ from image space to a point $P_p$ in pixel space.

In figure 1 we can see that the origin of the image-space coordinate system is at the center, while in the pixel-space coordinate system the origin is in the bottom-left corner.

FIGURE 6: Translation between image and pixel space coordinate systems

This offset needs to be accounted for. The translation vector can be represented as:

$$[c_x, c_y]^T$$

Adding it to our transformation function yields:

$$P_p = \begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \frac{fx}{z} + c_x \\ \frac{fy}{z} + c_y \end{bmatrix}$$

Up to now, we've been working in real-world units, like millimeters. However, since we're creating a digital image, the units should be pixels. We can change the units by introducing two new variables, $k$ and $l$. They represent the number of pixels per real-world unit, with units like $\frac{\text{pixel}}{\text{mm}}$, allowing us to cancel out the real-world units.

Adding them to our transformation function yields:

$$P_p = \begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} k\frac{fx}{z} + c_x \\ l\frac{fy}{z} + c_y \end{bmatrix}$$

Substituting $\alpha = kf$ and $\beta = lf$ to tidy things up:

$$P_p = \begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \alpha\frac{x}{z} + c_x \\ \beta\frac{y}{z} + c_y \end{bmatrix}$$
Why do we have two variables for pixel/mm ratio?

Although in our daily lives we mostly encounter square pixels, that's not always the case. Having separate variables for x and y dimensions helps generalize the model to handle all kinds of pixel shapes.
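To see the intrinsics in action, here's a small Python sketch of the full camera-to-pixel conversion; the parameter values (1000 px focal constants, a 640×480 image) are made up for illustration:

```python
import numpy as np

def camera_to_pixel(P_c, alpha, beta, c_x, c_y):
    """Map a camera-space point [x, y, z] to pixel coordinates [u, v].
    alpha = k*f and beta = l*f fold the focal length and the
    pixel/mm ratios into one constant per axis."""
    x, y, z = P_c
    return np.array([alpha * x / z + c_x,
                     beta * y / z + c_y])

# Hypothetical intrinsics: principal point at the image center.
print(camera_to_pixel(np.array([0.1, 0.05, 2.0]),
                      alpha=1000.0, beta=1000.0, c_x=320.0, c_y=240.0))
# -> [370. 265.]
```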


Matrixification[2]

Scientists and computers love matrices, so it would be a good idea to represent our transformation as a matrix. If we do that, a simple matrix multiplication will yield the result.

However, there is one problem. This mapping is non-linear, meaning we cannot represent it as a matrix-vector product. To make it linear, we can make use of homogeneous coordinates.

What makes this mapping non-linear?

A mapping $T : V \to W$ is linear if it satisfies the following two properties:[3]

$$\text{(1)} \quad T(u + v) = T(u) + T(v) \quad \text{for all } u, v \in V,$$
$$\text{(2)} \quad T(\lambda v) = \lambda T(v) \quad \text{for all scalars } \lambda \text{ and all } v \in V.$$

In our case $V$ is the camera space and $W$ is the pixel space. Let's run the $x$ part of our mapping,

$$f(x, z) = \alpha\frac{x}{z} + c_x$$

through the second property, first scaling the input:

$$f(2x, 2z) = \alpha\frac{2x}{2z} + c_x$$

And then scaling the output:

$$2f(x, z) = 2\left(\alpha\frac{x}{z} + c_x\right) = 2\alpha\frac{x}{z} + 2c_x$$

Since

$$\alpha\frac{2x}{2z} + c_x \neq 2\alpha\frac{x}{z} + 2c_x,$$

this mapping is non-linear.
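You can also confirm this numerically; a throwaway Python check with made-up values for $\alpha$ and $c_x$:

```python
def f(x, z, alpha=1000.0, c_x=320.0):
    return alpha * x / z + c_x

x, z = 0.1, 2.0
print(f(2 * x, 2 * z))  # 370.0 -- scaling the input cancels out in x/z
print(2 * f(x, z))      # 740.0 -- scaling the output doubles c_x too
```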

We convert our coordinates to homogeneous coordinates by adding an extra dimension with a value of 1:

$$P_p = \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} \alpha\frac{x}{z} + c_x \\ \beta\frac{y}{z} + c_y \\ 1 \end{bmatrix}$$

Homogeneous coordinates are equal up to scale, so we can multiply the whole vector by $z$ to get rid of the division:

$$P_p = \begin{bmatrix} \alpha x + c_x z \\ \beta y + c_y z \\ z \end{bmatrix}$$

Note that the original 3D point $P$ also becomes homogeneous,

$$P = \begin{bmatrix} x & y & z & 1 \end{bmatrix}^T$$

And now we can represent the transformation $P \to P_p$ with a matrix-vector multiplication,

$$P_p = \begin{bmatrix} \alpha & 0 & c_x & 0 \\ 0 & \beta & c_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}$$

Which can be simplified to:

$$P_p = MP$$
Can't I simply divide each of the values in P by z and avoid homogeneous coordinates?

Yes, you can. The transformation would look like this:

$$P_p = \begin{bmatrix} \alpha & 0 & c_x \\ 0 & \beta & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x/z \\ y/z \\ 1 \end{bmatrix}$$

However, it is admittedly less "clean", and the early division by $z$ gets in the way later, when we chain this matrix with the world-to-camera matrices below.
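In code, the homogeneous version really is just one matrix-vector product plus a final division; a minimal numpy sketch with the same illustrative intrinsics as before:

```python
import numpy as np

# The 3x4 intrinsic matrix M, with hypothetical values.
M = np.array([[1000.0,    0.0, 320.0, 0.0],
              [   0.0, 1000.0, 240.0, 0.0],
              [   0.0,    0.0,   1.0, 0.0]])

P = np.array([0.1, 0.05, 2.0, 1.0])  # homogeneous camera-space point

P_h = M @ P              # homogeneous pixel coordinates [z*u, z*v, z]
u, v = P_h[:2] / P_h[2]  # divide by the last coordinate to get [u, v]
print(u, v)              # -> 370.0 265.0
```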


World → Camera

We are here: the World [ U V W ] → Camera [ X Y Z ] step of the World → Camera → Image → Pixel chain.

So far we've been working with the camera and digital-space coordinate systems. The final piece of the real-world to digital-image pipeline is converting the real-world 3D point $P_w$ to camera space.

To add intuition to this process I propose the following analogy:

Imagine you are a camera, somewhere in the real world. Your friend is standing somewhere else and points at something, telling you to take a picture of it. To get the exact shot your friend wants, you need to move to his position and aim yourself at the target.

This is what we do here. We use a translation matrix to "move" the camera coordinate system to the real-world coordinate system origin, and the rotation matrix to "aim" at the point.

The good news is that both rotation and translation can be achieved through matrix multiplication when using homogeneous coordinates.

Translation matrix $T$:

$$T = \begin{bmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

Rotation matrices (one for each axis):

$$R_x(\theta) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{bmatrix} \quad R_y(\theta) = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix} \quad R_z(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

The rotation matrices can be combined into one general rotation matrix $R$ by multiplying them together.

Using these matrices, we can represent the conversion from the world-space coordinates to camera-space coordinates by matrix multiplication:

$$P_c = R \, T \, P_w$$
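A minimal numpy sketch of this step, using a single rotation about the x-axis for brevity (the values and helper names are illustrative, not canonical):

```python
import numpy as np

def rotation_x(theta):
    """3x3 rotation about the x-axis."""
    return np.array([[1, 0, 0],
                     [0, np.cos(theta), -np.sin(theta)],
                     [0, np.sin(theta),  np.cos(theta)]])

def world_to_camera(P_w, R3, t):
    """Translate the world-space point by t, then rotate by R3,
    mirroring the 'move, then aim' analogy above."""
    T = np.eye(4)
    T[:3, 3] = t
    R = np.eye(4)
    R[:3, :3] = R3
    return R @ T @ np.append(P_w, 1.0)  # homogeneous [X, Y, Z, 1]

# Camera offset by 1 unit along x, rotated 90 degrees about x:
print(world_to_camera(np.array([0.0, 1.0, 0.0]),
                      rotation_x(np.pi / 2), t=np.array([1.0, 0.0, 0.0])))
# -> approximately [1. 0. 1. 1.]
```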

Putting it all together

To convert a given point $P_w$ in world space to a pixel in digital space, we perform the following chain of matrix multiplications:

$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} \alpha & 0 & c_x & 0 \\ 0 & \beta & c_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & r_{14} \\ r_{21} & r_{22} & r_{23} & r_{24} \\ r_{31} & r_{32} & r_{33} & r_{34} \\ r_{41} & r_{42} & r_{43} & r_{44} \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} U \\ V \\ W \\ 1 \end{bmatrix}$$

Or more simply:

$$P_p = M \, R \, T \, P_w$$
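And the whole pipeline chained together in Python; a sketch under the same made-up intrinsics, with an identity rotation and a simple translation:

```python
import numpy as np

def project(P_w, M, R, T):
    """World-space point -> pixel coordinates via P_p = M R T P_w."""
    P_h = M @ R @ T @ np.append(P_w, 1.0)
    return P_h[:2] / P_h[2]  # divide out the homogeneous coordinate

M = np.array([[1000.0,    0.0, 320.0, 0.0],
              [   0.0, 1000.0, 240.0, 0.0],
              [   0.0,    0.0,   1.0, 0.0]])
R = np.eye(4)               # camera already aimed at the scene
T = np.eye(4)
T[:3, 3] = [0.0, 0.0, 2.0]  # world origin sits 2 units ahead of the camera
print(project(np.array([0.1, 0.05, 0.0]), M, R, T))  # -> [370. 265.]
```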

Notes


  1. This article was heavily inspired by, and is a lossy compression of, the CS231A course notes from Stanford and the CMPEN454 notes from PSU
  2. Pronounced with the syllabification of Californication
  3. Formal definition