A Programmer’s Intuition for Matrix Multiplication

What does matrix multiplication mean? Here are a few common intuitions:

1) Matrix multiplication scales/rotates/skews a geometric plane.

[Figure: matrix intuition]

This is useful when first learning about vectors: vectors go in, new ones come out. Unfortunately, this can lead to an over-reliance on geometric visualization.

If 20 families are coming to your BBQ, how do you estimate the hotdogs you need? (Hrm… 20 families, call it 3 people per family, 2 hotdogs each… about 20 * 3 * 2 = 120 hotdogs.)

You probably don't think "Oh, I need the volume of an invitation-familysize-hunger prism!". With large matrices I don't think about 500-dimensional vectors, just data to be modified.

2) Matrix multiplication composes linear operations.

This is the technically accurate definition: yes, matrix multiplication results in a new matrix that composes the original functions. However, sometimes the matrix being operated on is not a linear operation, but a set of vectors or data points. We need another intuition for what's happening.

I'll put a programmer's viewpoint into the ring:

3) Matrix multiplication is about information flow, converting data to code and back.

[Figure: matrix pour]

I think of linear algebra as "math spreadsheets" (if you're new to linear algebra, read this intro):

  • We store information in various spreadsheets ("matrices")
  • Some of the data are seen as functions to apply, others as data points to use
  • We can swap between the vector and function interpretation as needed

Sometimes I'll think of data as geometric vectors, and sometimes I'll see a matrix as composing functions. But mostly I think about information flowing through a system. (Some purists cringe at reducing beautiful algebraic structures into frumpy spreadsheets; I sleep OK at night.)

Programmer's Intuition: Code is Data is Code

Take your favorite recipe. If you interpret the words as instructions, you'll end up with a pie, muffin, cake, etc.

If you interpret the words as data, the text is prose that can be tweaked:

  • Convert measurements to metric units
  • Swap ingredients due to allergies
  • Adjust for altitude or different equipment

The result is a new recipe, which can be further tweaked, or executed as instructions to make a different pie, muffin, cake, etc. (Compilers treat a program as text, modify it, and eventually output "instructions" — which could be text for another layer.)

That's Linear Algebra. We take raw information like "3 4 5" and treat it as a vector or function, depending on how it's written:

[Figure: operation and data]

By convention, a vertical column is usually a vector, and a horizontal row is typically a function:

  • [3; 4; 5] means x = (3, 4, 5). Here, x is a vector of data (I'm using ; to separate each row).
  • [3 4 5] means f(a, b, c) = 3a + 4b + 5c. This is a function taking three inputs and returning a single result.
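
As a rough sketch of this convention in JavaScript (the helper name rowAsFunction is made up for illustration, not part of any library), the same three numbers can be held as plain data or wrapped into a function:

```javascript
// Hypothetical helper: turn a row like [3 4 5] into the function
// f(a, b, c) = 3a + 4b + 5c.
const rowAsFunction = (row) => (...inputs) => {
  if (inputs.length !== row.length) {
    throw new Error(`expected ${row.length} inputs, got ${inputs.length}`);
  }
  return row.reduce((sum, coeff, i) => sum + coeff * inputs[i], 0);
};

const x = [3, 4, 5];                // the column [3; 4; 5], read as data
const f = rowAsFunction([3, 4, 5]); // the row [3 4 5], read as a function

console.log(f(1, 0, 0)); // 3
console.log(f(1, 1, 1)); // 12
```

The numbers are identical in both cases; only the role we give them changes.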

And the aha! moment: data is code, code is data!

[Figure: code and data are equivalent]

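The row containing a horizontal function could really be three data points (each with a single element). And the vertical column of data could really be three distinct functions, each taking a single parameter.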

Ah. This is getting neat: depending on the desired outcome, we can combine data and code in a different order.

The Matrix Transpose

The matrix transpose swaps rows and columns. Here's what it means in practice.

If x was a column vector with 3 entries ([3; 4; 5]), then x' is:

  • A function taking 3 arguments ([3 4 5])
  • x' can still remain a data vector, but as three separate entries. The transpose "split it up".

Similarly, if f = [3 4 5] is our row vector, then f' can mean:

  • A single data vector, in a vertical column.
  • f' is separated into three functions (each taking a single input).

Let's use this in practice.

When we see x' * x we mean: x' (as a single function) is working on x (a single vector). The result is the dot product (read more). In other words, we've applied the data to itself.

[Figure: x transform linear algebra]

When we see x * x' we mean x (as a set of functions) is working on x' (a set of individual data points). The result is a grid where we've applied each function to each data point. Here, we've mixed the data with itself in every possible permutation.

[Figure: x transpose linear algebra]
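
Here is one way to sketch the two readings side by side in JavaScript. The helper name outerProduct is an assumption for illustration; each entry of the column is treated as a single-input "multiply by me" function:

```javascript
// Hypothetical helper: treat each entry of `column` as a one-input function
// and apply it to each entry of `row` as a separate data point.
const outerProduct = (column, row) =>
  column.map((coeff) => row.map((point) => coeff * point));

const x = [3, 4, 5];

// x * x': three single-input functions, each applied to three data points.
const grid = outerProduct(x, x);
console.log(JSON.stringify(grid)); // [[9,12,15],[12,16,20],[15,20,25]]

// x' * x: one three-input function applied to one data vector.
const dot = x.reduce((sum, coeff, i) => sum + coeff * x[i], 0);
console.log(dot); // 50
```

Same numbers, different order: one arrangement collapses to a single value, the other fans out into a 3 x 3 grid.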

I think of xx as x(x). It's the "function x" working on the "vector x". (This helps compute the covariance matrix, a measure of self-similarity in the data.)

Putting The Intuition To Use

Phew! How does this help us? When we see an equation like this (from the Machine Learning class):

\[ h_{\theta}(x) = \theta^T x \]

I now have an instant feel of what's happening. In the first equation, we're treating $\theta$ (which is normally a set of data parameters) as a function, and passing in $x$ as an argument. This should give us a single value.
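
A minimal sketch of that reading in JavaScript, assuming theta and x are plain arrays of equal length (the helper name hypothesis and the sample numbers are invented for illustration):

```javascript
// Hypothetical helper: turn the parameters theta into the function theta^T,
// which takes a data vector x and returns a single number.
const hypothesis = (theta) => (x) => {
  if (x.length !== theta.length) {
    throw new Error(`expected ${theta.length} inputs, got ${x.length}`);
  }
  return theta.reduce((sum, t, i) => sum + t * x[i], 0);
};

// Made-up numbers, purely for illustration.
const theta = [1, 2, 0.5];
const h = hypothesis(theta);

console.log(h([1, 3, 4])); // 1*1 + 2*3 + 0.5*4 = 9
```

The parameters start life as data, get promoted to a function, and the result is a single value.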

More complex derivations like this:

\[ \theta = (X^T X)^{-1} X^T y \]

can be worked through. In some cases it gets tricky because we store the data as rows (not columns) in the matrix, but now I have much better tools to follow along. You can start estimating when you'll get a single value, or when you'll get a "permutation grid" as a result.
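
To make that estimate concrete, here is the same equation annotated with shapes. This is a sketch under an assumption: X stores m data points as rows with n entries each, and y is a column of m values. The labels m and n are introduced here and are not part of the original equation.

\[ \underbrace{\theta}_{n \times 1} = \big( \underbrace{X^T}_{n \times m} \, \underbrace{X}_{m \times n} \big)^{-1} \, \underbrace{X^T}_{n \times m} \, \underbrace{y}_{m \times 1} \]

Reading the pieces: X^T X is an n x n "permutation grid", its inverse is also n x n, and X^T y is n x 1. The final product is n x 1, one value per parameter.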

Geometric scaling and linear composition have their place, but here I want to think about information. "The information in x is becoming a function, and we're passing itself as the parameter."

Long story short, don't get locked into a single intuition. Multiplication evolved from repeated addition, to scaling (decimals), to rotations (imaginary numbers), to "applying" one number to another (integrals), and so on. Why not the same for matrix multiplication?

Happy math.

Appendix: What about the other combinations?

You may be curious why we can't use the other combinations, like x x or x' x'. Simply put, the parameters don't line up: we'd have functions expecting 3 inputs only being passed a single parameter, or functions expecting single inputs getting passed 3.
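
The "line up" rule can be sketched as a small shape check in JavaScript. The helper productShape is hypothetical, and shapes are written as [rows, columns]:

```javascript
// Hypothetical helper: a product A * B only lines up when A's column count
// (the inputs each function expects) equals B's row count (the entries in
// each data point). Returns the result shape, or null on a mismatch.
const productShape = ([aRows, aCols], [bRows, bCols]) =>
  aCols === bRows ? [aRows, bCols] : null;

const x = [3, 1];  // shape of the column vector [3; 4; 5]
const xT = [1, 3]; // shape of its transpose [3 4 5]

const single = productShape(xT, x); // 1 x 1: a single value
const grid = productShape(x, xT);   // 3 x 3: a grid
const bad1 = productShape(x, x);    // null: 1-input functions handed 3 entries
const bad2 = productShape(xT, xT);  // null: a 3-input function handed 1 entry

console.log(single, grid, bad1, bad2);
```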

Appendix: Javascript Interpretation

The dot product x' * x could be seen as the following javascript command:

((x, y, z) => x*3 + y*4 + z*5)(3, 4, 5)

We define an anonymous function of 3 arguments, and immediately pass it 3 parameters. This returns 50 (the dot product: 3*3 + 4*4 + 5*5 = 50).

The math notation is super-compact, so we can simply write (in Octave/Matlab):

octave:2> [3 4 5] * [3 4 5]'
ans = 50

Remember that [3 4 5] is the function and [3; 4; 5] or [3 4 5]' is how we'd write the data vector.

Appendix: ADEPT Method

This article came about from a TODO in my machine learning class notes, which use the ADEPT Method.

I wanted to explain to myself, in plain English, why we wanted x' x and not the reverse. Now, in plain English: We're treating the information as a function, and passing the same info as the parameter.
