Deep Learning - Mathematical Foundations

2025-07-02 14:27

1 Tensors

arange creates a row vector:

python
import torch

x = torch.arange(12)
x
# tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

shape returns the shape of a tensor (its length along each axis):

python
x.shape
# torch.Size([12])

numel returns the total number of elements in a tensor:

python
x.numel()
# 12

reshape changes the shape of a tensor:

python
X = x.reshape(3, 4)
X
# tensor([[ 0,  1,  2,  3],
#        [ 4,  5,  6,  7],
#        [ 8,  9, 10, 11]])

Passing $-1$ as one of the arguments to reshape makes that dimension be inferred automatically. x.reshape(-1,4), x.reshape(3,-1), and x.reshape(3,4) all give the same result.
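
The inference behind $-1$ is simple arithmetic: the missing dimension is the element count divided by the product of the known dimensions. A minimal pure-Python sketch, where `infer_shape` is a hypothetical helper written for illustration and not a PyTorch function:

```python
from math import prod


def infer_shape(numel, shape):
    """Resolve a single -1 in `shape` so the product of dimensions equals `numel`.

    Hypothetical helper for illustration; it only mimics the arithmetic of reshape.
    """
    shape = tuple(shape)
    if shape.count(-1) > 1:
        raise ValueError("only one dimension can be inferred")
    known = prod(d for d in shape if d != -1)  # product of the given dimensions
    if -1 in shape:
        if known == 0 or numel % known != 0:
            raise ValueError(f"cannot reshape {numel} elements into {shape}")
        return tuple(numel // known if d == -1 else d for d in shape)
    if known != numel:
        raise ValueError(f"cannot reshape {numel} elements into {shape}")
    return shape


print(infer_shape(12, (-1, 4)))  # (3, 4)
print(infer_shape(12, (3, -1)))  # (3, 4)
```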

zeros creates a tensor filled with $0$:

python
torch.zeros((2, 3, 4))
# tensor([[[0., 0., 0., 0.],
#         [0., 0., 0., 0.],
#         [0., 0., 0., 0.]],
#        [[0., 0., 0., 0.],
#         [0., 0., 0., 0.],
#         [0., 0., 0., 0.]]])

ones creates a tensor filled with $1$:

python
torch.ones((2, 3, 4))
# tensor([[[1., 1., 1., 1.],
#         [1., 1., 1., 1.],
#         [1., 1., 1., 1.]],
#        [[1., 1., 1., 1.],
#         [1., 1., 1., 1.],
#         [1., 1., 1., 1.]]])

randn samples each element from a standard Gaussian distribution with mean $0$ and standard deviation $1$ (the values differ on every run):

python
torch.randn(3, 4)
# tensor([[-0.0135,  0.0665,  0.0912,  0.3212],
#        [ 1.4653,  0.1843, -1.6995, -0.3036],
#        [ 1.7646,  1.0450,  0.2457, -0.7732]])

tensor converts a (nested) Python list directly into a tensor:

python
torch.tensor([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
# tensor([[2, 1, 4, 3],
#        [1, 2, 3, 4],
#        [4, 3, 2, 1]])

Addition, subtraction, multiplication, division, exponentiation, and equality comparison between two tensors are all elementwise:

python
x = torch.tensor([1.0, 2, 4, 8])
y = torch.tensor([2, 2, 2, 2])
x + y, x - y, x * y, x / y, x ** y, x == y
# (tensor([ 3.,  4.,  6., 10.]),
#  tensor([-1.,  0.,  2.,  6.]),
#  tensor([ 2.,  4.,  8., 16.]),
#  tensor([0.5000, 1.0000, 2.0000, 4.0000]),
#  tensor([ 1.,  4., 16., 64.]),
#  tensor([False,  True, False, False]))

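If two tensors have different shapes, broadcasting applies: each tensor is replicated along its axes of length $1$ until both shapes match, and the operation is then performed elementwise:

python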
a = torch.arange(3).reshape((3, 1))
b = torch.arange(2).reshape((1, 2))
a, b
# (tensor([[0],
#          [1],
#          [2]]),
#  tensor([[0, 1]]))
a + b
# tensor([[0, 1],
#         [1, 2],
#         [2, 3]])
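
The shape that broadcasting produces can be worked out without touching any data. A rough pure-Python sketch of the usual rule (align shapes from the trailing axis; two lengths are compatible if they are equal or one of them is $1$); `broadcast_shape` is a hypothetical helper, not a PyTorch API:

```python
def broadcast_shape(shape_a, shape_b):
    """Return the shape two operands broadcast to, or raise if they are incompatible."""
    result = []
    # Walk the axes from the last one backwards; a missing leading axis counts as length 1.
    for i in range(1, max(len(shape_a), len(shape_b)) + 1):
        da = shape_a[-i] if i <= len(shape_a) else 1
        db = shape_b[-i] if i <= len(shape_b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError(f"shapes {shape_a} and {shape_b} cannot be broadcast")
        result.append(db if da == 1 else da)
    return tuple(reversed(result))


print(broadcast_shape((3, 1), (1, 2)))  # (3, 2)
print(broadcast_shape((4,), (2, 3, 4)))  # (2, 3, 4)
```

The first call matches the a + b example above: a $(3,1)$ tensor and a $(1,2)$ tensor combine into a $(3,2)$ result.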

exp likewise computes $e^x$ element by element:

python
torch.exp(x)
# tensor([2.7183e+00, 7.3891e+00, 5.4598e+01, 2.9810e+03])

cat concatenates two tensors, where dim specifies the axis along which to concatenate:

python
X = torch.arange(12, dtype=torch.float32).reshape((3,4))
Y = torch.tensor([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
torch.cat((X, Y), dim=0), torch.cat((X, Y), dim=1)
# (tensor([[ 0.,  1.,  2.,  3.],
#          [ 4.,  5.,  6.,  7.],
#          [ 8.,  9., 10., 11.],
#          [ 2.,  1.,  4.,  3.],
#          [ 1.,  2.,  3.,  4.],
#          [ 4.,  3.,  2.,  1.]]),
#  tensor([[ 0.,  1.,  2.,  3.,  2.,  1.,  4.,  3.],
#          [ 4.,  5.,  6.,  7.,  1.,  2.,  3.,  4.],
#          [ 8.,  9., 10., 11.,  4.,  3.,  2.,  1.]]))

sum adds up all elements and produces a single-element tensor:

python
X.sum()
# tensor(66.)

When summing a tensor you can specify which axes to sum over. Summing along an axis removes that axis from the shape:

python
A = torch.arange(20, dtype=torch.float32).reshape(5, 4)
A
# tensor([[ 0.,  1.,  2.,  3.],
#          [ 4.,  5.,  6.,  7.],
#          [ 8.,  9., 10., 11.],
#          [12., 13., 14., 15.],
#          [16., 17., 18., 19.]])
A_sum_axis0 = A.sum(axis=0)
A_sum_axis0, A_sum_axis0.shape
# (tensor([40., 45., 50., 55.]), torch.Size([4]))
A_sum_axis1 = A.sum(axis=1)
A_sum_axis1, A_sum_axis1.shape
# (tensor([ 6., 22., 38., 54., 70.]), torch.Size([5]))

mean computes the average:

python
A.mean(), A.sum() / A.numel()
# (tensor(9.5000), tensor(9.5000))

mean can also be taken along a given axis:

python
A.mean(axis=0), A.sum(axis=0) / A.shape[0]
# (tensor([ 8.,  9., 10., 11.]), tensor([ 8.,  9., 10., 11.]))

Pass keepdims=True to sum to keep the summed axis (with length $1$) instead of dropping it:

python
sum_A = A.sum(axis=1, keepdims=True)
sum_A
# tensor([[ 6.],
#         [22.],
#         [38.],
#         [54.],
#         [70.]])

Combined with broadcasting, this turns the entries along one axis into proportions (here each row is divided by its row sum):

python
A / sum_A
# tensor([[0.0000, 0.1667, 0.3333, 0.5000],
#         [0.1818, 0.2273, 0.2727, 0.3182],
#         [0.2105, 0.2368, 0.2632, 0.2895],
#         [0.2222, 0.2407, 0.2593, 0.2778],
#         [0.2286, 0.2429, 0.2571, 0.2714]])

cumsum computes the cumulative sum along an axis:

python
A.cumsum(axis=0)
# tensor([[ 0.,  1.,  2.,  3.],
#         [ 4.,  6.,  8., 10.],
#         [12., 15., 18., 21.],
#         [24., 28., 32., 36.],
#         [40., 45., 50., 55.]])

[-1] accesses the last element, [x:y] accesses a range (the end index y is exclusive), and [:] accesses all elements:

python
X[-1], X[1:3]
# (tensor([ 8.,  9., 10., 11.]),
#  tensor([[ 4.,  5.,  6.,  7.],
#          [ 8.,  9., 10., 11.]]))
X[1, 2] = 9
X
# tensor([[ 0.,  1.,  2.,  3.],
#         [ 4.,  5.,  9.,  7.],
#         [ 8.,  9., 10., 11.]])
X[0:2, :] = 12
X
# tensor([[12., 12., 12., 12.],
#         [12., 12., 12., 12.],
#         [ 8.,  9., 10., 11.]])

An operation on tensors creates a new instance for the result, which occupies new memory:

python
before = id(Y)
Y = Y + X
id(Y) == before
# False

To reuse the existing memory instead, write an assignment of the form Y[:] = <expression>:

python
Z = torch.zeros_like(Y)  # an all-zero tensor with the same shape as Y
print('id(Z):', id(Z))
Z[:] = X + Y
print('id(Z):', id(Z))
# id(Z): 140327634811696
# id(Z): 140327634811696

Alternatively, use X[:] = X + Y or X += Y:

python
before = id(X)
X += Y
id(X) == before
# True

2 Linear Algebra

2.1 Scalars

Represented by a tensor with a single element:

python
import torch

x = torch.tensor(3.0)
y = torch.tensor(2.0)

x + y, x * y, x / y, x**y
# (tensor(5.), tensor(6.), tensor(1.5000), tensor(9.))

2.2 Vectors

Written in bold: $\mathbf{x}\in\mathbb{R}^n$

A one-dimensional tensor:

python
x = torch.arange(4)
x
# tensor([0, 1, 2, 3])
x[3]
# tensor(3)

By convention, a vector is a column vector by default:

\begin{split}\mathbf{x} =\begin{bmatrix}x_{1} \\x_{2} \\ \vdots \\x_{n}\end{bmatrix},\end{split}

len returns the length of a vector:

python
len(x)
# 4

2.3 Matrices

Written as a bold capital letter: $\mathbf{A} \in \mathbb{R}^{m \times n}$

A two-dimensional tensor.

Transpose: if $\mathbf{B}=\mathbf{A}^\top$, then $b_{ij}=a_{ji}$ for all $i,j$.

In code, .T returns the transpose of a matrix:

python
A = torch.arange(20).reshape(5, 4)
A
# tensor([[ 0,  1,  2,  3],
#         [ 4,  5,  6,  7],
#         [ 8,  9, 10, 11],
#         [12, 13, 14, 15],
#         [16, 17, 18, 19]])
A.T
# tensor([[ 0,  4,  8, 12, 16],
#         [ 1,  5,  9, 13, 17],
#         [ 2,  6, 10, 14, 18],
#         [ 3,  7, 11, 15, 19]])

A symmetric matrix satisfies $\mathbf{A} = \mathbf{A}^\top$
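
As a quick illustration of the symmetry condition, here is a small pure-Python check on nested lists. The helper `transpose` and the sample matrices are made up for this sketch; with tensors, comparing a matrix against its .T elementwise expresses the same idea.

```python
def transpose(matrix):
    """Swap rows and columns, so entry (i, j) of the result is entry (j, i) of the input."""
    return [list(column) for column in zip(*matrix)]


B = [[1, 2, 3],
     [2, 0, 4],
     [3, 4, 5]]
print(transpose(B) == B)  # True: B equals its transpose, so it is symmetric

C = [[0, 1],
     [2, 3]]
print(transpose(C) == C)  # False: c_12 = 1 but c_21 = 2
```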

By convention, each row of a data matrix is the vector for one data example.

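2.4 Dot Product

Given two vectors $\mathbf{x},\mathbf{y}\in\mathbb{R}^d$, their dot product is written $\mathbf{x}^\top\mathbf{y}$ or $\langle\mathbf{x},\mathbf{y}\rangle$ and is computed by multiplying the entries at the same position and summing the results: $\mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{d} x_i y_i$

Use dot to compute it:

python
x = torch.arange(4, dtype=torch.float32)  # dot needs both operands to have the same dtype
y = torch.ones(4, dtype = torch.float32)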
x, y, torch.dot(x, y)
# (tensor([0., 1., 2., 3.]), tensor([1., 1., 1., 1.]), tensor(6.))
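
The definition can also be written out by hand. A minimal pure-Python sketch (the function name `dot` here is just an illustrative helper operating on lists):

```python
def dot(x, y):
    """Dot product of two equal-length sequences: sum of x_i * y_i."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(x_i * y_i for x_i, y_i in zip(x, y))


print(dot([0.0, 1.0, 2.0, 3.0], [1.0, 1.0, 1.0, 1.0]))  # 6.0
```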

2.5 Matrix-Vector Product

A matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ can be written in terms of its row vectors:

\begin{split}\mathbf{A}= \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_m \\ \end{bmatrix},\end{split}

where $\mathbf{a}^\top_{i} \in \mathbb{R}^n$ is the $i$-th row of the matrix, a row vector. Given a vector $\mathbf{x} \in \mathbb{R}^n$, its product with the matrix is:

\begin{split}\mathbf{A}\mathbf{x} = \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_m \\ \end{bmatrix}\mathbf{x} = \begin{bmatrix} \mathbf{a}^\top_{1} \mathbf{x} \\ \mathbf{a}^\top_{2} \mathbf{x} \\ \vdots\\ \mathbf{a}^\top_{m} \mathbf{x}\\ \end{bmatrix}.\end{split}

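In other words, $\mathbf{x}$ is dotted with each row vector of $\mathbf{A}$.

The matrix-vector product therefore maps an $n$-dimensional vector to an $m$-dimensional vector.

Use mv to compute the matrix-vector product:

python
A = torch.arange(20, dtype=torch.float32).reshape(5, 4)  # float matrix, so its dtype matches x
A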
# tensor([[ 0.,  1.,  2.,  3.],
#          [ 4.,  5.,  6.,  7.],
#          [ 8.,  9., 10., 11.],
#          [12., 13., 14., 15.],
#          [16., 17., 18., 19.]])
x
# tensor([0., 1., 2., 3.])
torch.mv(A, x)
# tensor([ 14.,  38.,  62.,  86., 110.])
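
To make the "one dot product per row" reading concrete, here is a small pure-Python sketch on nested lists. `matvec` is an illustrative helper; the data mirrors the $5 \times 4$ example above.

```python
def matvec(A, x):
    """Matrix-vector product: entry i is the dot product of row i of A with x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


A = [[float(4 * i + j) for j in range(4)] for i in range(5)]  # rows 0..3, 4..7, ...
x = [0.0, 1.0, 2.0, 3.0]
print(matvec(A, x))  # [14.0, 38.0, 62.0, 86.0, 110.0]
```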

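2.6 Matrix Multiplication

The product $\mathbf{AB}$ can be understood as taking the matrix-vector product of $\mathbf{A}$ with each column vector of $\mathbf{B}$: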

\begin{split}\mathbf{A}=\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1k} \\ a_{21} & a_{22} & \cdots & a_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nk} \\ \end{bmatrix},\quad \mathbf{B}=\begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1m} \\ b_{21} & b_{22} & \cdots & b_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ b_{k1} & b_{k2} & \cdots & b_{km} \\ \end{bmatrix}.\end{split}

\begin{split}\mathbf{C} = \mathbf{AB} = \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_n \\ \end{bmatrix} \begin{bmatrix} \mathbf{b}_{1} & \mathbf{b}_{2} & \cdots & \mathbf{b}_{m} \\ \end{bmatrix} = \begin{bmatrix} \mathbf{a}^\top_{1} \mathbf{b}_1 & \mathbf{a}^\top_{1}\mathbf{b}_2& \cdots & \mathbf{a}^\top_{1} \mathbf{b}_m \\ \mathbf{a}^\top_{2}\mathbf{b}_1 & \mathbf{a}^\top_{2} \mathbf{b}_2 & \cdots & \mathbf{a}^\top_{2} \mathbf{b}_m \\ \vdots & \vdots & \ddots &\vdots\\ \mathbf{a}^\top_{n} \mathbf{b}_1 & \mathbf{a}^\top_{n}\mathbf{b}_2& \cdots& \mathbf{a}^\top_{n} \mathbf{b}_m \end{bmatrix}.\end{split}

Use mm for matrix multiplication:

python
B = torch.ones(4, 3)
torch.mm(A, B)
# tensor([[ 6.,  6.,  6.],
#         [22., 22., 22.],
#         [38., 38., 38.],
#         [54., 54., 54.],
#         [70., 70., 70.]])
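
The column-by-column view can be sketched directly in pure Python: compute one matrix-vector product per column of the second matrix, then put those results side by side. `matvec` and `matmul` are illustrative helpers, not library functions.

```python
def matvec(A, x):
    """Entry i is the dot product of row i of A with x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def matmul(A, B):
    """Matrix product built from one matrix-vector product per column of B."""
    columns_of_B = list(zip(*B))
    result_columns = [matvec(A, column) for column in columns_of_B]
    # The results are columns of C; transpose them back into rows.
    return [list(row) for row in zip(*result_columns)]


A = [[float(4 * i + j) for j in range(4)] for i in range(5)]
B = [[1.0] * 3 for _ in range(4)]
for row in matmul(A, B):
    print(row)
# [6.0, 6.0, 6.0]
# [22.0, 22.0, 22.0]
# [38.0, 38.0, 38.0]
# [54.0, 54.0, 54.0]
# [70.0, 70.0, 70.0]
```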

2.7 Norms

The $L_p$ norm is:

\|\mathbf{x}\|_p = \left(\sum_{i=1}^n \left|x_i \right|^p \right)^{1/p}.

In particular, the $L_2$ norm is:

\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2},

which can be computed with norm:

python
u = torch.tensor([3.0, -4.0])
torch.norm(u)
# tensor(5.)

There is also the $L_1$ norm:

\|\mathbf{x}\|_1 = \sum_{i=1}^n \left|x_i \right|.

Compute it directly from the definition:

python
torch.abs(u).sum()
# tensor(7.)
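
Both results are special cases of the general $L_p$ formula. A short pure-Python sketch under the assumption $p \ge 1$ (`lp_norm` is a hypothetical helper name):

```python
def lp_norm(x, p):
    """L_p norm: (sum_i |x_i|^p)^(1/p), for p >= 1."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(x_i) ** p for x_i in x) ** (1.0 / p)


u = [3.0, -4.0]
print(lp_norm(u, 2))  # 5.0, the L2 norm
print(lp_norm(u, 1))  # 7.0, the L1 norm
```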

For matrices, there is the Frobenius norm:

\|\mathbf{X}\|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n x_{ij}^2}.

Compute it with norm:

python
torch.norm(torch.ones((4, 9)))
# tensor(6.)
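
The value $6$ follows from the definition: a $4 \times 9$ matrix of ones has $36$ squared entries summing to $36$, and $\sqrt{36} = 6$. A pure-Python sketch of the same computation (`frobenius_norm` is an illustrative helper):

```python
from math import sqrt


def frobenius_norm(X):
    """Square root of the sum of squares of all matrix entries."""
    return sqrt(sum(x_ij ** 2 for row in X for x_ij in row))


X = [[1.0] * 9 for _ in range(4)]  # 4 x 9 matrix of ones
print(frobenius_norm(X))  # 6.0
```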

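3 Differentiation

3.1 Vector -> Scalar Derivatives

That is, the independent variable is a vector and the dependent variable is a scalar: $\frac{\partial y}{\partial\mathbf{x}}$

\begin{split}\mathbf{x}= \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}\end{split}

\frac{\partial y}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y}{\partial x_1}, \frac{\partial y}{\partial x_2}, \dots, \frac{\partial y}{\partial x_n} \end{bmatrix}

For example: $\frac{\partial}{\partial \mathbf{x}} \left(x_1^2+2x_2^2\right)=\begin{bmatrix} 2x_1,4x_2 \end{bmatrix}$

So the derivative is again a vector, but the column vector $\mathbf{x}$ yields a row vector.

Geometrically, the gradient points in the direction in which the dependent variable increases fastest; its negative is the direction of steepest descent.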
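
The worked example $x_1^2 + 2x_2^2$ can be sanity-checked numerically with central finite differences. This is only a rough sketch: `numerical_gradient`, the step size `h`, and the evaluation point are choices made for illustration.

```python
def f(x):
    """The example function y = x_1^2 + 2 * x_2^2."""
    return x[0] ** 2 + 2 * x[1] ** 2


def numerical_gradient(func, x, h=1e-6):
    """Approximate each partial derivative with a central difference."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((func(x_plus) - func(x_minus)) / (2 * h))
    return grad


point = [1.0, 3.0]
approx = numerical_gradient(f, point)
exact = [2 * point[0], 4 * point[1]]  # analytic result [2*x_1, 4*x_2]
print(approx, exact)  # the two agree up to floating-point error
```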

Here $a$ does not depend on $\mathbf{x}$, while $u$, $v$, $\mathbf{u}$, and $\mathbf{v}$ are functions of $\mathbf{x}$.

| $y$ | $a$ | $au$ | $\text{sum}(\mathbf{x})$ | $\lVert \mathbf{x} \rVert^2$ | $u+v$ | $uv$ | $\langle \mathbf{u},\mathbf{v} \rangle$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| $\frac{\partial y}{\partial \mathbf{x}}$ | $\mathbf{0}^\top$ | $a\frac{\partial u}{\partial \mathbf{x}}$ | $\mathbf{1}^\top$ | $2\mathbf{x}^\top$ | $\frac{\partial u}{\partial \mathbf{x}} + \frac{\partial v}{\partial \mathbf{x}}$ | $\frac{\partial u}{\partial \mathbf{x}}v + \frac{\partial v}{\partial \mathbf{x}}u$ | $\mathbf{u}^\top\frac{\partial \mathbf{v}}{\partial \mathbf{x}}+\mathbf{v}^\top\frac{\partial \mathbf{u}}{\partial \mathbf{x}}$ |

3.2 Scalar -> Vector Derivatives

That is, the independent variable is a scalar and the dependent variable is a vector.

\begin{split}\mathbf{y}= \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}\end{split}

Then:

\begin{split} \frac{\partial \mathbf{y}}{\partial x}= \begin{bmatrix} \frac{\partial y_1}{\partial x} \\ \frac{\partial y_2}{\partial x} \\ \vdots \\ \frac{\partial y_m}{\partial x} \end{bmatrix}\end{split}

The result is still a column vector.

3.3 Vector -> Vector Derivatives

That is, both the independent and the dependent variable are vectors.

\begin{split}\mathbf{x}= \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}\end{split}

\begin{split}\mathbf{y}= \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}\end{split}

\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y_1}{\partial \mathbf{x}} \\ \frac{\partial y_2}{\partial \mathbf{x}} \\ \vdots \\ \frac{\partial y_m}{\partial \mathbf{x}} \\ \end{bmatrix} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\ \vdots & \vdots & \ddots &\vdots\\ \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n} \\ \end{bmatrix}

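In numerator layout, the numerator determines the rows and the denominator determines the columns: entry $(i,j)$ is $\frac{\partial y_i}{\partial x_j}$, giving an $m \times n$ matrix. So even if both vectors are written as row vectors, one of them has to be transposed into a column vector before differentiating.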

| $\mathbf{y}$ | $\mathbf{a}$ | $\mathbf{x}$ | $\mathbf{Ax}$ | $\mathbf{x}^\top\mathbf{A}$ | $a\mathbf{u}$ | $\mathbf{Au}$ | $\mathbf{u}+\mathbf{v}$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$ | $\mathbf{0}$ | $\mathbf{I}$ | $\mathbf{A}$ | $\mathbf{A}^\top$ | $a\frac{\partial \mathbf{u}}{\partial \mathbf{x}}$ | $\mathbf{A}\frac{\partial \mathbf{u}}{\partial \mathbf{x}}$ | $\frac{\partial \mathbf{u}}{\partial \mathbf{x}} + \frac{\partial \mathbf{v}}{\partial \mathbf{x}}$ |
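
One entry of this table, $\frac{\partial \mathbf{Ax}}{\partial \mathbf{x}} = \mathbf{A}$, is easy to check numerically. The sketch below builds the Jacobian column by column with central differences; `numerical_jacobian`, the sample matrix, and the evaluation point are all illustrative assumptions.

```python
def numerical_jacobian(func, x, h=1e-6):
    """Approximate the m x n Jacobian: entry (i, j) is d y_i / d x_j."""
    m, n = len(func(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += h
        x_minus[j] -= h
        y_plus, y_minus = func(x_plus), func(x_minus)
        for i in range(m):
            J[i][j] = (y_plus[i] - y_minus[i]) / (2 * h)
    return J


A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]


def linear_map(x):
    """y = A x for the fixed matrix A above."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


J = numerical_jacobian(linear_map, [0.5, -1.0, 2.0])
for row in J:
    print([round(value, 4) for value in row])  # rows of A, up to rounding
```

The numerator $\mathbf{y} \in \mathbb{R}^2$ fixes the number of rows and the denominator $\mathbf{x} \in \mathbb{R}^3$ the number of columns, matching the numerator-layout convention.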

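3.4 Derivatives Involving Matrices

| Independent variable | Dependent variable | Shape of the derivative | Rule |
| --- | --- | --- | --- |
| $(n,k)$ matrix | scalar | $(k,n)$ matrix | transpose the independent variable's shape |
| $(n,k)$ matrix | $(m,1)$ vector | $(m,k,n)$ tensor | transpose the independent variable's shape and put it last |
| $(n,k)$ matrix | $(m,l)$ matrix | $(m,l,k,n)$ tensor | transpose the independent variable's shape and put it last |
| scalar | $(m,l)$ matrix | $(m,l)$ matrix | the dependent variable's shape is unchanged |
| $(n,1)$ vector | $(m,l)$ matrix | $(m,l,n)$ tensor | the dependent variable's shape is unchanged and comes first |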

3.5 Chain Rule

The main thing is to get the shapes right.
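
As an illustration of tracking shapes, consider the following sketch in numerator layout. The intermediate names $a$ and $b$ and the example $z = (\langle \mathbf{x}, \mathbf{w} \rangle - y)^2$ with $\mathbf{x}, \mathbf{w} \in \mathbb{R}^n$ and scalar $y$ are chosen for illustration; the first line assumes scalar $y$, $\mathbf{u} \in \mathbb{R}^k$, and $\mathbf{x} \in \mathbb{R}^n$.

```latex
\begin{aligned}
\underbrace{\frac{\partial y}{\partial \mathbf{x}}}_{(1,n)}
  &= \underbrace{\frac{\partial y}{\partial \mathbf{u}}}_{(1,k)}
     \underbrace{\frac{\partial \mathbf{u}}{\partial \mathbf{x}}}_{(k,n)} \\
a = \langle \mathbf{x}, \mathbf{w} \rangle, \quad b = a - y, \quad z &= b^2 \\
\frac{\partial z}{\partial \mathbf{w}}
  &= \frac{\partial z}{\partial b} \frac{\partial b}{\partial a} \frac{\partial a}{\partial \mathbf{w}}
   = 2b \cdot 1 \cdot \mathbf{x}^\top
   = 2\left(\langle \mathbf{x}, \mathbf{w} \rangle - y\right)\mathbf{x}^\top
\end{aligned}
```

The factors have shapes $(1,1)$, $(1,1)$, and $(1,n)$, so the result is a $(1,n)$ row vector, as expected for the derivative of a scalar with respect to an $n$-dimensional vector.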

3.6 Automatic Differentiation

First, enable gradient tracking with requires_grad_:

python
import torch

x = torch.arange(4.0)
x.requires_grad_(True)  # equivalent to x = torch.arange(4.0, requires_grad=True)
x.grad  # None by default

Then carry on with the computation:

python
y = 2 * torch.dot(x, x)
y
# tensor(28., grad_fn=<MulBackward0>)

To differentiate y with respect to x:

python
y.backward()  # run backpropagation first
x.grad  # the gradient is now populated
x.grad == 4 * x
# tensor([True, True, True, True])

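By default PyTorch accumulates gradients, so the previous values have to be cleared with grad.zero_ before the next backward pass:

python
# Reset the gradient left over from the previous backward pass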
x.grad.zero_()
y = x.sum()
y.backward()
x.grad
# tensor([1., 1., 1., 1.])
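
To see what accumulation means in practice, the following small sketch calls backward twice without clearing the gradient and then once more after zero_. The variable names `first`, `accumulated`, and `after_reset` are only for illustration.

```python
import torch

x = torch.arange(4.0, requires_grad=True)

x.sum().backward()
first = x.grad.clone()        # gradient of sum(x) is all ones

x.sum().backward()            # no zero_() in between: the new gradient is added
accumulated = x.grad.clone()  # ones + ones

x.grad.zero_()                # clear the stored gradient
x.sum().backward()
after_reset = x.grad.clone()  # back to all ones

print(first)        # tensor([1., 1., 1., 1.])
print(accumulated)  # tensor([2., 2., 2., 2.])
print(after_reset)  # tensor([1., 1., 1., 1.])
```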

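Sometimes we want to detach part of a computation from the recorded computational graph, for example:

python
x.grad.zero_()
y = x * x
u = y.detach()  # u takes only the value of y, without the operations that produced it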
z = u * x

z.sum().backward()
x.grad == u  # u is treated as a constant, so the derivative of u * x with respect to x is u
# tensor([True, True, True, True])

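Even when building the computational graph of a function requires Python control flow (for example conditionals, loops, or arbitrary function calls), we can still compute the gradient of the resulting variable:

python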
def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c
a = torch.randn(size=(), requires_grad=True)
d = f(a)
d.backward()
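
One way to check this result: for any given input, f only multiplies a by a constant that the control flow selects, so f(a) = k * a and the gradient should equal d / a. The sketch below repeats f and uses a fixed input of 0.5 instead of a random one (an assumption made here so the run is reproducible).

```python
import torch


def f(a):
    b = a * 2
    while b.norm() < 1000:  # keep doubling until the magnitude reaches 1000
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c


a = torch.tensor(0.5, requires_grad=True)  # fixed input instead of torch.randn
d = f(a)
d.backward()

print(d)                # tensor(1024., grad_fn=<MulBackward0>)
print(a.grad)           # tensor(2048.)
print(a.grad == d / a)  # tensor(True)
```

Starting from 0.5, b is doubled from 1 up to 1024, so the effective factor is 2048 and the gradient matches d / a exactly.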

Last updated: 2025-09-05 13:34

Author: Caiwen