Temporal-Segment-Networks

2025-03-16 约 4434 字预计阅读 9 分钟

https://bing.ee123.net/img/rand?artid=146207437

Temporal Segment Networks

摘要

Temporal Segment Networks 是对传统双流网络的改进，旨在解决其在长时间跨度动作建模和计算效率方面的局限性。传统双流网络通过分别处理RGB图像和光流信息来捕捉空间和时间特征，但其只能建模短时间跨度的动作，且光流计算耗时且计算成本高。TSN 创新性地引入了 分段采样策略 ，将视频均匀分成多个片段，并从每个片段中随机采样一帧进行训练和推理，从而有效地捕捉了长时间跨度的动作信息，同时大幅减少了计算开销。TSN模型采用双流架构，空间流处理RGB图像片段，时间流处理光流图像片段，最后通过分段共识函数融合两个流的特征并进行分类。TSN 在UCF101、HMDB51和Something-Something等多个主流视频动作识别数据集上取得了SOTA的性能，显著提升了动作识别的准确率。此外，TSN 还提出了一系列训练技巧，如：跨模态预训练和数据增强，进一步提升了模型的表现。

Abstract

Temporal Segment Networks is an improvement over the traditional Two-Stream Networks, aiming to address its limitations in modeling long-term temporal dynamics and computational efficiency. The traditional Two-Stream Networks capture spatial and temporal features by separately processing RGB images and optical flow information, but they are only capable of modeling short-term actions, and the computation of optical flow is time-consuming and computationally expensive. TSN innovatively introduces a temporal segment sampling strategy, which divides the video evenly into multiple segments and randomly samples one frame from each segment for training and inference. This effectively captures long-term temporal information while significantly reducing computational overhead. The TSN model adopts a two-stream architecture, where the spatial stream processes RGB image segments and the temporal stream processes optical flow segments. The features from both streams are then fused through a segmental consensus function for classification. TSN achieves SOTA performance on multiple mainstream video action recognition datasets, such as UCF101, HMDB51, and Something-Something, significantly improving the accuracy of action recognition. Additionally, TSN proposes a series of training techniques, such as cross-modality pre-training and data augmentation, further enhancing the model’s performance.

TSN

论文链接：

项目地址：

传统双流网络通过两个独立的网络分支分别处理RGB图像和光流来捕捉视频中的空间和时间特征。然而，这种方法存在以下问题：

短时间建模限制：传统方法只能捕捉短时间跨度的动作信息，难以建模长时间依赖关系；
计算效率低：光流的计算非常耗时，且需要额外的存储空间；
数据利用率低：传统方法通常只从视频中采样少量帧，导致大量视频信息未被充分利用。

TSN 的架构基于传统的双流网络，但通过引入分段采样策略和分段共识函数，显著提升了模型对长时间跨度动作的建模能力。其模型架构如下图所示：

分段采样策略

分段采样策略是 TSN 的核心创新，旨在捕捉视频中的长时间跨度动作信息。具体步骤如下：

视频分段

将输入视频（Video）均匀分成 K 个片段 $\left { S_{1},S_{2},\cdots ,S_{K} \right }$ ，每个片段包含一定数量的连续帧。例如，对于一个长度为 T 的视频，每个片段的长度为 T/K。TSN 按如下方式对一系列片段进行建模：

$TSN(T_{1},T_{2},\cdots ,T_{K})=H(G(F(T_{1};W),F(T_{2};W),\cdots ,F(T_{K};W)))$ (1)

$\left { T_{1},T_{2},\cdots ,T_{K} \right }$ 代表片段序列，每个片段 $T_{K}$ 从它对应的段 $S_{K}$ 中随机采样得到；
$F(T_{K};W)$ 函数代表采用 W 作为参数的卷积网络作用于短片段 $T_{K}$
段共识函数 G 结合多个短片段的类别得分输出以获得他们之间关于类别假设的共识；
基于这个共识，预测函数 H 预测整段视频属于每个行为类别的概率，本文 H 选择了 Softmax函数。

随机采样

从每个片段（Snippet）中随机采样一帧或一组光流帧。这种随机采样策略允许模型在训练和推理过程中捕捉视频的全局时间信息，而不仅仅是局部时间信息。

多模态输入

TSN 通过探索更多的输入模式来提高辨别力，除了像传统双流网络那样，空间流卷积网络操作单一RGB图像，时间流卷积网络将一堆连续的光流场作为输入，本文作者还提出了两种额外的输入模式： RGB差异 （RGB difference）和 扭曲的光流场 （warped optical flow fields）。如下图第2列和第4列所示：

对于空间流，采样的是RGB difference输入Spatial ConvNet

RGB Difference 是一种简单而有效的时间信息表示方法，用于捕捉视频帧之间的变化信息。

实现方式

对于视频中的连续两帧 $I_{t}$ 和 $I_{t+1}$ ，计算它们的像素级差值：

$D_{t}=\left | I_{t+1}-I_{t} \right |$ (2)

$D_{t}$ 是一个与输入帧尺寸相同的图像，表示两帧之间的变化。

为了捕捉更长时间跨度的变化，可以计算多对连续帧的差分，并将它们堆叠在一起。例如，对于一个包含10帧的片段，可以计算9个差分图像，并将它们堆叠成一个多通道输入。将差分图像堆叠后，输入到空间流网络中。由于差分图像是单通道的，可以通过复制通道或与其他模态结合使用。

作用

捕捉运动信息： RGB Difference 能够直接反映帧之间的像素变化，从而捕捉物体的运动信息；

计算效率高： 与光流相比，RGB Difference 的计算成本更低，适合实时应用；

补充光流信息： 在某些情况下，RGB Difference 可以提供光流无法捕捉的局部变化信息。

对于时间流，采样的是warped optical flow fields输入Temporal ConvNet

Warped Optical Flow Fields 是对传统光流的一种改进，旨在减少光流计算中的噪声和误差，同时增强对物体运动的建模能力。

实现方式

首先，使用光流算法，如：Farneback或FlowNet，计算两帧之间的光流场 $F_{t}$ 。

$F_{t}$ 是一个二维向量场，表示每个像素从 $I_{t}$ 到 $I_{t+1}$ 的运动向量。

为了减少光流计算中的噪声和误差，可以使用 图像对齐 技术对光流进行修正。使用光流 $F_{t}$ 将 $I_{t+1}$ 对齐到 $I_{t}$ 的坐标系，生成对齐后的图像 $I_{t+1}^{w}$ 。计算对齐后的图像 $I_{t+1}^{w}$ 与原始图像 $I_{t}$ 之间的残差，公式如下：

$R_{t}=\left | I_{t+1}^{w}-I_{t} \right |$ (3)

将残差 $R_{t}$ 与原始光流场 $F_{t}$ 结合，生成修正后的光流场 $F_{t}^{w}$ ，将修正后的光流场 $F_{t}^{w}$ 堆叠成多通道输入，输入到时间流网络中，用于捕捉运动信息。

作用

减少噪声： 通过对光流进行修正，可以减少光流计算中的噪声和误差；

增强运动建模： 修正后的光流场能够更准确地反映物体的运动信息，尤其是在复杂背景或快速运动的情况下；

提高鲁棒性： Warped Optical Flow Fields 对光照变化和背景干扰具有更强的鲁棒性。

双流网络

TSN 沿用了的设计，包含两个独立的网络分支：空间流和时间流。每个分支都是一个卷积神经网络，如：BN-Inception或ResNet，分别对采样的帧进行特征提取。

分段共识函数

分段共识函数用于对每个片段提取的特征进行聚合，生成全局视频表示。具体步骤如下：

特征提取

对于每个片段，空间流和时间流分别提取特征。假设空间流提取的特征为 $f_{s}$ ，时间流提取的特征为 $f_{t}$ 。将提取的特征分别传入公式（1）中的 $F(T_{K};W)$ 函数，即可得到该特征对应的分类得分。

特征聚合

对 K 个片段的特征进行聚合（Segmental Consensus），生成全局特征表示。如下图所示：

常用的聚合方法包括：平均池化、最大池化、加权平均。

全局特征表示

空间流的全局特征表示为 $F_{s}$ ，时间流的全局特征表示为 $F_{t}$ 。将其输入段共识函数G中，便可得出他们对于类别的共识。其损失函数如下：

$L(y,G)=-\sum_{i=1}^{C}y_{i}(G_{i}-log\sum_{j=1}^{C}expG_{j})$ ，可微

$https://latex.csdn.net/eq?%5Cfrac%7B%5Cpartial%20L%28y%2CG%29%7D%7B%5Cpartial%20W%7D%3D%5Cfrac%7B%5Cpartial%20L%7D%7B%5Cpartial%20G%7D%5Csum_%7Bk%3D1%7D%5E%7BK%7D%5Cfrac%7B%5Cpartial%20G%7D%7B%5Cpartial%20F%28T_%7Bk%7D%29%7D%5Cfrac%7B%5Cpartial%20F%28T_%7Bk%7D%29%7D%7B%5Cpartial%20W%7D$

分类器

将空间流和时间流的全局特征 $F_{s}$ 和 $F_{t}$ 进行简单的拼接或加权求和融合。将融合后的特征输入到全连接层，生成分类得分。对分类得分进行Softmax归一化，得到每个类别的概率分布。最后，使用交叉熵损失函数计算模型预测与真实标签之间的误差。如下图所示：

网络训练

由于行为检测的数据集相对较小，训练时有过拟合的风险，为了缓解这个问题，作者设计了几个训练策略如下：

交叉输入模式预训练

空间网络以RGB图像作为输入：故采用在ImageNet上预训练的模型做初始化。对于其他输入模式（比如：RGB差异和光流场），它们基本上捕捉视频数据的不同视觉方面，并且它们的分布不同于RGB图像的分布。

作者提出了交叉模式预训练技术：利用RGB模型初始化时间网络。首先，通过线性变换将光流场离散到从0到255的区间，这使得光流场的范围和RGB图像相同。然后，修改RGB模型第一个卷积层的权重来处理光流场的输入。具体来说，就是对RGB通道上的权重进行平均，并根据时间网络输入的通道数量复制这个平均值。这一策略对时间网络中降低过拟合非常有效。

正则化技术

在学习过程中，Batch Normalization将估计每个batch内的激活均值和方差，并使用它们将这些激活值转换为标准高斯分布。这一操作虽可以加快训练的收敛速度，但由于要从有限数量的训练样本中对激活分布的偏移量进行估计，也会导致过拟合问题。因此，在用预训练模型初始化后，冻结所有Batch Normalization层的均值和方差参数，但第一个标准化层除外。由于光流的分布和RGB图像的分布不同，第一个卷积层的激活值将有不同的分布，于是，我们需要重新估计的均值和方差，称这种策略为部分BN。与此同时，在BN-Inception的全局pooling层后添加一个额外的dropout层，来进一步降低过拟合的影响。dropout比例设置：空间流卷积网络设置为0.8，时间流卷积网络设置为0.7。

数据增强

数据增强能产生不同的训练样本并且可以防止严重的过拟合。在传统的two-stream中，采用随机裁剪和水平翻转方法增加训练样本。作者采用两个新方法：角裁剪和尺度抖动。

角裁剪：仅从图片的边角或中心提取区域，来避免默认关注图片的中心。
尺度抖动（scale jittering）：将输入图像或者光流场的大小固定为 256×340，裁剪区域的宽和高随机从 {256,224,192,168} 中选择。最终，这些裁剪区域将会被resize到 224×224 用于网络训练。

消融实验

不同输入模式的表现比较，如下图所示：

TSN不同段共识函数的实验结果，如下图所示：

在不同深度卷积网络上的实验结果，如下图所示：

代码

网络模型代码如下：

import argparse
import os
import sys
import math
import cv2
import numpy as np
import multiprocessing
from sklearn.metrics import confusion_matrix

sys.path.append('.')
from pyActionRecog import parse_directory
from pyActionRecog import parse_split_file

from pyActionRecog.utils.video_funcs import default_aggregation_func

parser = argparse.ArgumentParser()
parser.add_argument('dataset', type=str, choices=['ucf101', 'hmdb51'])
parser.add_argument('split', type=int, choices=[1, 2, 3],
                    help='on which split to test the network')
parser.add_argument('modality', type=str, choices=['rgb', 'flow'])
parser.add_argument('frame_path', type=str, help="root directory holding the frames")
parser.add_argument('net_proto', type=str)
parser.add_argument('net_weights', type=str)
parser.add_argument('--rgb_prefix', type=str, help="prefix of RGB frames", default='img_')
parser.add_argument('--flow_x_prefix', type=str, help="prefix of x direction flow images", default='flow_x_')
parser.add_argument('--flow_y_prefix', type=str, help="prefix of y direction flow images", default='flow_y_')
parser.add_argument('--num_frame_per_video', type=int, default=25,
                    help="prefix of y direction flow images")
parser.add_argument('--save_scores', type=str, default=None, help='the filename to save the scores in')
parser.add_argument('--num_worker', type=int, default=1)
parser.add_argument("--caffe_path", type=str, default='./lib/caffe-action/', help='path to the caffe toolbox')
parser.add_argument("--gpus", type=int, nargs='+', default=None, help='specify list of gpu to use')
args = parser.parse_args()

print args

sys.path.append(os.path.join(args.caffe_path, 'python'))
from pyActionRecog.action_caffe import CaffeNet

# build neccessary information
print args.dataset
split_tp = parse_split_file(args.dataset)
f_info = parse_directory(args.frame_path,
                         args.rgb_prefix, args.flow_x_prefix, args.flow_y_prefix)

gpu_list = args.gpus


eval_video_list = split_tp[args.split - 1][1]

score_name = 'fc-action'


def build_net():
    global net
    my_id = multiprocessing.current_process()._identity[0] \
        if args.num_worker > 1 else 1
    if gpu_list is None:
        net = CaffeNet(args.net_proto, args.net_weights, my_id-1)
    else:
        net = CaffeNet(args.net_proto, args.net_weights, gpu_list[my_id - 1])


def eval_video(video):
    global net
    label = video[1]
    vid = video[0]

    video_frame_path = f_info[0][vid]
    if args.modality == 'rgb':
        cnt_indexer = 1
    elif args.modality == 'flow':
        cnt_indexer = 2
    else:
        raise ValueError(args.modality)
    frame_cnt = f_info[cnt_indexer][vid]

    stack_depth = 0
    if args.modality == 'rgb':
        stack_depth = 1
    elif args.modality == 'flow':
        stack_depth = 5

    step = (frame_cnt - stack_depth) / (args.num_frame_per_video-1)
    if step > 0:
        frame_ticks = range(1, min((2 + step * (args.num_frame_per_video-1)), frame_cnt+1), step)
    else:
        frame_ticks = [1] * args.num_frame_per_video

    assert(len(frame_ticks) == args.num_frame_per_video)

    frame_scores = []
    for tick in frame_ticks:
        if args.modality == 'rgb':
            name = '{}{:05d}.jpg'.format(args.rgb_prefix, tick)
            frame = cv2.imread(os.path.join(video_frame_path, name), cv2.IMREAD_COLOR)
            scores = net.predict_single_frame([frame,], score_name, frame_size=(340, 256))
            frame_scores.append(scores)
        if args.modality == 'flow':
            frame_idx = [min(frame_cnt, tick+offset) for offset in xrange(stack_depth)]
            flow_stack = []
            for idx in frame_idx:
                x_name = '{}{:05d}.jpg'.format(args.flow_x_prefix, idx)
                y_name = '{}{:05d}.jpg'.format(args.flow_y_prefix, idx)
                flow_stack.append(cv2.imread(os.path.join(video_frame_path, x_name), cv2.IMREAD_GRAYSCALE))
                flow_stack.append(cv2.imread(os.path.join(video_frame_path, y_name), cv2.IMREAD_GRAYSCALE))
            scores = net.predict_single_flow_stack(flow_stack, score_name, frame_size=(340, 256))
            frame_scores.append(scores)

    print 'video {} done'.format(vid)
    sys.stdin.flush()
    return np.array(frame_scores), label

if args.num_worker > 1:
    pool = multiprocessing.Pool(args.num_worker, initializer=build_net)
    video_scores = pool.map(eval_video, eval_video_list)
else:
    build_net()
    video_scores = map(eval_video, eval_video_list)

video_pred = [np.argmax(default_aggregation_func(x[0])) for x in video_scores]
video_labels = [x[1] for x in video_scores]

cf = confusion_matrix(video_labels, video_pred).astype(float)

cls_cnt = cf.sum(axis=1)
cls_hit = np.diag(cf)

cls_acc = cls_hit/cls_cnt

print cls_acc

print 'Accuracy {:.02f}%'.format(np.mean(cls_acc)*100)

if args.save_scores is not None:
    np.savez(args.save_scores, scores=video_scores, labels=video_labels)

性能评估代码如下：

import argparse
import sys
import numpy as np
sys.path.append('.')

from pyActionRecog.utils.video_funcs import default_aggregation_func
from pyActionRecog.utils.metrics import mean_class_accuracy

parser = argparse.ArgumentParser()
parser.add_argument('score_files', nargs='+', type=str)
parser.add_argument('--score_weights', nargs='+', type=float, default=None)
parser.add_argument('--crop_agg', type=str, choices=['max', 'mean'], default='mean')
args = parser.parse_args()

score_npz_files = [np.load(x) for x in args.score_files]

if args.score_weights is None:
    score_weights = [1] * len(score_npz_files)
else:
    score_weights = args.score_weights
    if len(score_weights) != len(score_npz_files):
        raise ValueError("Only {} weight specifed for a total of {} score files"
                         .format(len(score_weights), len(score_npz_files)))

score_list = [x['scores'][:, 0] for x in score_npz_files]
label_list = [x['labels'] for x in score_npz_files]

# label verification

# score_aggregation
agg_score_list = []
for score_vec in score_list:
    agg_score_vec = [default_aggregation_func(x, normalization=False, crop_agg=getattr(np, args.crop_agg)) for x in score_vec]
    agg_score_list.append(np.array(agg_score_vec))

final_scores = np.zeros_like(agg_score_list[0])
for i, agg_score in enumerate(agg_score_list):
    final_scores += agg_score * score_weights[i]

# accuracy
acc = mean_class_accuracy(final_scores, label_list[0])
print 'Final accuracy {:02f}%'.format(acc * 100)

输入指令：

python tools/eval_net.py ucf101 1 flow output/optical_flow/ models/ucf101/tsn_bn_inception_flow_deploy.prototxt models/ucf101_split_1_tsn_flow_reference_bn_inception.caffemodel --num_worker 1 --save_scores output/flow_score_file

总结

Temporal Segment Networks 是一种用于视频动作识别的深度学习模型，其核心创新在于引入了分段采样策略和多模态特征融合。TSN 将视频均匀分成多个片段，并从每个片段中随机采样一帧进行训练和推理，从而有效地捕捉了长时间跨度的动作信息。模型采用双流网络架构，包括空间流和时间流，分别处理RGB图像和光流信息，通过分段共识函数对片段特征进行聚合，生成全局视频表示。此外，TSN 还支持多模态输入，如：RGB Difference 和 Warped Optical Flow Fields，进一步增强了模型对时间信息的捕捉能力。

TSN 的训练过程采用了跨模态预训练、数据增强和正则化等技巧，显著提升了模型的泛化能力和鲁棒性。其架构设计灵活且高效，能够适应不同的视频动作识别任务。TSN 的成功不仅验证了分段采样策略的有效性，还为视频理解领域的研究提供了重要的实践指导和创新思路，为今后动作识别工作的发展奠定了坚实的基础。