Images and Video

Before we can train any model, we first need to understand what an image looks like to a computer.

An image is stored as a grid of pixels. Each pixel encodes the color at that location using three numbers: how much red, how much green, and how much blue. In the common 8-bit format, each of those values ranges from 0 to 255. A pure white pixel is [255, 255, 255]. A pure black pixel is [0, 0, 0]. A pale, light blue pixel might read [204, 229, 255].
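To make the pixel grid concrete, here is a minimal sketch that builds a tiny 2×2 image from the three example pixels above. It assumes NumPy and 8-bit values; the variable names are illustrative only.

```python
import numpy as np

# The three example pixels from the text, as [red, green, blue] triples.
white = [255, 255, 255]
black = [0, 0, 0]
pale_blue = [204, 229, 255]

# A 2x2 image: 2 rows, 2 columns, 3 channels per pixel.
# uint8 holds exactly the integers 0..255.
image = np.array(
    [[white, black],
     [pale_blue, white]],
    dtype=np.uint8,
)

print(image.shape)  # (2, 2, 3)

# Index by [row, column] to read one pixel's three channel values.
row, col = 1, 0
print(image[row, col].tolist())  # [204, 229, 255]
```

Indexing with a row and a column returns the three channel values for that one pixel, which is the same lookup a pixel inspector performs.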

A standard HD image might be 1080 rows by 1920 columns, with three channels per pixel. That means a single image is a tensor of shape [1080, 1920, 3], containing 1080 × 1920 × 3 = 6,220,800 numbers. That is over six million!
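The arithmetic can be checked directly. This is a small sketch assuming NumPy and one byte (uint8) per value; an all-zero array stands in for a real photo.

```python
import numpy as np

# An all-black placeholder with the HD shape from the text:
# 1080 rows, 1920 columns, 3 color channels.
hd_image = np.zeros((1080, 1920, 3), dtype=np.uint8)

print(hd_image.shape)         # (1080, 1920, 3)
print(f"{hd_image.size:,}")   # 6,220,800 numbers in total

# With one byte per value, the byte count equals the value count.
print(hd_image.nbytes == hd_image.size)  # True
```

The total is simply the product of the three shape entries, which is why the element count grows so quickly with resolution.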

Pixel Explorer

Interactive demo: a 16×12 px sample image, stored as a tensor of shape [12, 16, 3] (576 values in total). Inspecting a pixel shows its RGB values. Switching the image to grayscale shows how color collapses to a single luminance value per pixel.
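The grayscale collapse mentioned above can be sketched in a few lines. The weights here are an assumption: they are the widely used Rec. 601 luma coefficients, and the lesson does not say which formula its demo uses. The random image is a stand-in for the 16×12 sample, and `to_grayscale` is a hypothetical helper name.

```python
import numpy as np

# Assumed weights (Rec. 601 luma); other standards use slightly different values.
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114])

def to_grayscale(rgb):
    """Collapse an [H, W, 3] uint8 image to an [H, W] uint8 luminance image."""
    # Weighted sum over the channel axis: (H, W, 3) @ (3,) -> (H, W).
    luminance = rgb.astype(np.float64) @ LUMA_WEIGHTS
    return np.round(luminance).astype(np.uint8)

# Random stand-in for the 12-row, 16-column sample image.
rng = np.random.default_rng(seed=0)
sample = rng.integers(0, 256, size=(12, 16, 3), dtype=np.uint8)

gray = to_grayscale(sample)
print(sample.shape, sample.size)  # (12, 16, 3) 576
print(gray.shape, gray.size)      # (12, 16) 192
```

The channel axis disappears, so the grayscale version holds one third as many numbers as the color version.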

💭Reflection

A grayscale medical X-ray is 512 × 512 pixels. What shape is its tensor representation, and how many total numbers does it contain?

Video Is a Time Series of Images

Videos extend the idea by one dimension. A video is a sequence of image frames over time, so a clip of T frames can be stored as a tensor of shape [T, height, width, 3]. Once you internalize that video is a time series of images, a lot of video modeling stops feeling mysterious. The same convolutional building blocks that work on images also work on video, just with an extra time axis.
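As a rough illustration of "one more dimension", the sketch below stacks a few synthetic frames into a video tensor. It assumes NumPy and a [time, height, width, channels] axis order, which is one common convention but not the only one; the clip size and frame contents are made up.

```python
import numpy as np

num_frames, height, width = 8, 12, 16  # hypothetical tiny clip

# Each frame is an ordinary image tensor of shape [height, width, 3].
# Frame t is filled with the value t so that frames differ over time.
frames = [
    np.full((height, width, 3), t, dtype=np.uint8)
    for t in range(num_frames)
]

# Stacking along a new leading axis adds the time dimension.
video = np.stack(frames, axis=0)
print(video.shape)     # (8, 12, 16, 3)

# Indexing the time axis gives back a single image.
print(video[3].shape)  # (12, 16, 3)

# Frame-to-frame change: cast to a signed type first so the
# subtraction cannot wrap around in uint8.
diff = video[1:].astype(np.int16) - video[:-1].astype(np.int16)
print(diff.shape)      # (7, 12, 16, 3)
```

Because each `video[t]` is just an image, anything that operates on an image can be applied per frame, and operations along the first axis capture change over time.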

One Encoding, Many Applications

The same pixel-grid encoding, a tensor of numbers that for 8-bit images fall between 0 and 255, is the foundation under medical imaging, autonomous driving perception, manufacturing quality control, satellite imagery, and your Instagram filter. Some of these use a single grayscale channel, extra channels, or a wider value range, but a radiologist's MRI viewer and your phone's camera app are reading the same kind of numbers; what's different is the model on top.

Checkpoint

Consider a 64×64 pixel color (RGB) image. How many numbers does it take to represent it?