Images and Video
Before we can train any model, we need to first understand what an image actually looks like to a computer.
An image is stored as a grid of pixels. Each pixel encodes the color at that location using three numbers: how much red, how much green, and how much blue. In a standard 8-bit image, each value is an integer from 0 to 255. A pure white pixel is [255, 255, 255]. A pure black pixel is [0, 0, 0]. A pale blue pixel might read [204, 229, 255].
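To make the pixel grid concrete, here is a minimal sketch that stores the three example pixels from the paragraph above as a tiny one-row image. It assumes NumPy is available; the variable name `image` and the 1×3 layout are illustrative choices, not part of the original lesson.

```python
import numpy as np

# A tiny 1x3 "image": one row holding the three example pixels from the text.
# dtype=uint8 stores each channel as an integer in the range 0..255.
image = np.array(
    [[[255, 255, 255],    # pure white
      [0, 0, 0],          # pure black
      [204, 229, 255]]],  # pale blue
    dtype=np.uint8,
)

print(image.shape)  # (rows, columns, channels)

# Index by row, then column, to read one pixel's three channel values.
red, green, blue = (int(v) for v in image[0, 2])
print(red, green, blue)
```

Indexing with `image[row, column]` returns the three channel values for that pixel, which is the same lookup a pixel inspector performs.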
A Full HD image is 1080 rows by 1920 columns, with three channels per pixel. That means a single image is a tensor of shape [1080, 1920, 3], containing 1080 × 1920 × 3 = 6,220,800 numbers: over six million!
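The count is just the product of the shape's dimensions. The short sketch below checks the arithmetic using only the Python standard library; `count_values` is a hypothetical helper name introduced here for illustration.

```python
import math

def count_values(shape):
    """Return how many numbers a tensor of the given shape contains."""
    return math.prod(shape)

hd_shape = (1080, 1920, 3)  # rows, columns, channels
num_values = count_values(hd_shape)
print(num_values)  # 6220800

# At one byte per value (8-bit channels), this is also the uncompressed size in bytes.
print(f"{num_values / 1_000_000:.2f} million values")
```

The same helper works for any shape, so it can be reused to check other image sizes.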
Figure (interactive): a 16×12 px sample image with a pixel inspector that shows the RGB values of any selected pixel. Its tensor shape is [12, 16, 3], 576 values in total. Switching the image to grayscale collapses each pixel's color to a single luminance value.
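The collapse from color to grayscale can be sketched as a weighted sum over the three channels. The weights below are the commonly used Rec. 601 luma coefficients; that choice is an assumption, since the text does not say which formula the demo uses, and `to_grayscale` is a hypothetical helper name. NumPy is assumed to be available.

```python
import numpy as np

# Rec. 601 luma weights for R, G, B (an assumed choice, not confirmed by the source).
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114])

def to_grayscale(rgb):
    """Collapse an [H, W, 3] uint8 image to an [H, W] uint8 luminance image."""
    # Matrix-multiplying by a length-3 vector takes a weighted sum over the channel axis.
    luminance = rgb.astype(np.float64) @ LUMA_WEIGHTS
    return np.round(luminance).astype(np.uint8)

rgb = np.zeros((12, 16, 3), dtype=np.uint8)  # same shape as the sample image
rgb[0, 0] = [204, 229, 255]                  # the pale blue pixel from the text
gray = to_grayscale(rgb)

print(rgb.shape, rgb.size)    # (12, 16, 3) 576
print(gray.shape, gray.size)  # (12, 16) 192
```

The grayscale tensor drops the channel axis entirely, so it holds one third as many values as the color version.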
A grayscale medical X-ray is 512 × 512 pixels. What shape is its tensor representation, and how many total numbers does it contain?
Video Is a Time Series of Images
Videos extend the idea by one dimension. A video is a sequence of image frames over time, so a clip of T frames can be stored as a tensor of shape [T, height, width, 3]. Once you internalize that video is a time series of images, a lot of video modeling stops feeling mysterious. The same convolutional building blocks that work on images also work on video, just with an extra time axis.
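As a rough illustration of the extra time axis, the sketch below builds a blank clip and slices it two ways. The clip size (8 frames of 12×16 RGB) and the axis order [frames, rows, columns, channels] are assumptions made for this example; some deep learning frameworks expect the channel axis in a different position. NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical clip: 8 frames of 12x16 RGB, laid out as [frames, rows, columns, channels].
num_frames, height, width = 8, 12, 16
video = np.zeros((num_frames, height, width, 3), dtype=np.uint8)

# Slicing along the time axis gives back an ordinary image.
first_frame = video[0]
print(video.shape)        # (8, 12, 16, 3)
print(first_frame.shape)  # (12, 16, 3)

# Fixing a pixel location instead gives that pixel's color as a time series.
pixel_over_time = video[:, 5, 7]
print(pixel_over_time.shape)  # (8, 3)
```

Every frame is a regular image tensor, which is why image tooling carries over to video with little change.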
One Encoding, Many Applications
The same basic encoding, a tensor of pixel intensities (0 to 255 in a standard 8-bit image), is the foundation under medical imaging, autonomous driving perception, manufacturing quality control, satellite imagery, and your Instagram filter. The number of channels can vary (a grayscale scan has one channel rather than three), but a radiologist's MRI viewer and your phone's camera app are reading the same kind of numbers; what's different is the model on top.
A color (RGB) image is 64 × 64 pixels. How many numbers does it take to represent it?