Images and Video

Before we can train any model, we need to first understand what an image actually looks like to a computer.

An image is stored as a grid of pixels. Each pixel encodes the color at that location using three numbers: how much red, how much green, and how much blue. In the common 8-bit format, each value is an integer from 0 to 255. A pure white pixel is [255, 255, 255]. A pure black pixel is [0, 0, 0]. A pale, light blue might read [204, 229, 255].
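To make the grid concrete, here is a minimal sketch in plain Python that builds a tiny 2×2 image as nested lists and reads one pixel back. The names `image` and `pixel_at` are illustrative and not from any library; real code would usually hold the grid in an array library rather than nested lists.

```python
# A tiny 2x2 RGB image as nested lists: image[row][col] -> [R, G, B].
# The colors are the three examples from the text, plus pure red.
WHITE = [255, 255, 255]
BLACK = [0, 0, 0]
PALE_BLUE = [204, 229, 255]
RED = [255, 0, 0]

image = [
    [WHITE, BLACK],
    [PALE_BLUE, RED],
]


def pixel_at(img, row, col):
    """Return the [R, G, B] triple stored at (row, col)."""
    return img[row][col]


red, green, blue = pixel_at(image, 1, 0)
print(f"row 1, col 0 -> R={red} G={green} B={blue}")
# row 1, col 0 -> R=204 G=229 B=255
```

Note that the row index comes first and the column second, which matches the [rows, columns, channels] ordering used for tensor shapes.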

A standard HD image might be 1080 rows by 1920 columns, with three channels per pixel. That means a single image is a tensor of shape [1080, 1920, 3], containing 1080 × 1920 × 3 = 6,220,800 numbers. That is over six million values for one image!
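The count follows directly from the shape: multiply the dimensions together. A short sketch, where `value_count` is a hypothetical helper name chosen for illustration:

```python
import math


def value_count(shape):
    """Total number of stored values in a tensor of the given shape."""
    return math.prod(shape)


hd_shape = [1080, 1920, 3]  # [rows, columns, channels]
print(value_count(hd_shape))  # 6220800
```

The same helper works for any shape, so it can be reused to check the smaller examples later in this section.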

Pixel Explorer (interactive figure)

A 16×12 px sample image whose pixels can be inspected for their RGB values. Its tensor shape is [12, 16, 3], which is 576 values in total. Toggling to grayscale shows how color collapses to a single luminance value per pixel.
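As a rough illustration of that collapse, the sketch below converts each [R, G, B] triple to one gray value using a weighted sum. The weights are the Rec. 601 luma coefficients, which is an assumption on our part: the page does not say which formula the explorer uses, and other weightings exist. The function names are hypothetical.

```python
def luminance(pixel):
    """Collapse one [R, G, B] pixel to a single gray value in 0..255."""
    r, g, b = pixel
    # Rec. 601 luma weights (an assumption; other weightings exist).
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def to_grayscale(img):
    """Map an image of shape [H, W, 3] to an image of shape [H, W]."""
    return [[luminance(px) for px in row] for row in img]


color = [
    [[255, 255, 255], [0, 0, 0]],
    [[204, 229, 255], [255, 0, 0]],
]
gray = to_grayscale(color)
print(gray)
```

The channel axis disappears in the result: a [2, 2, 3] color image becomes a [2, 2] grid with one number per pixel. Green carries the largest weight because the eye is most sensitive to it.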

💭Reflection

A grayscale medical X-ray is 512 × 512 pixels. What shape is its tensor representation, and how many total numbers does it contain?

Video Is a Time Series of Images

Videos extend the idea by one dimension. A video is a sequence of image frames over time. Once you internalize that — video is a time series of images — a lot of video modeling stops feeling mysterious. The same convolutional building blocks that work on images also work on video, just with an extra time axis.
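A small sketch of what the extra axis does to the shape and the size. It assumes time is the leading axis and a frame rate of 30 frames per second; neither is specified in the text, and `video_shape` is a hypothetical helper.

```python
import math


def video_shape(num_frames, height, width, channels=3):
    """Shape of a video tensor, with time as the leading axis (an assumed layout)."""
    return [num_frames, height, width, channels]


# One second of HD video at an assumed 30 frames per second.
shape = video_shape(num_frames=30, height=1080, width=1920)
print(shape)             # [30, 1080, 1920, 3]
print(math.prod(shape))  # 186624000

# Dropping the time axis gives back an ordinary image shape.
frame_shape = shape[1:]
print(frame_shape)       # [1080, 1920, 3]
```

Under these assumptions, a single second of uncompressed HD video holds about 187 million numbers, thirty times the count for one frame.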

One Encoding, Many Applications

The same basic encoding, a tensor of pixel intensities typically stored as numbers between 0 and 255, is the foundation under medical imaging, autonomous driving perception, manufacturing quality control, satellite imagery, and your Instagram filter. The number of channels varies (a grayscale X-ray has one where an RGB photo has three), but a radiologist's MRI viewer and your phone's camera app are reading the same kind of numbers; what's different is the model on top.

Checkpoint

Consider a 64×64 pixel color (RGB) image. How many numbers does it take to represent it?
