mindvideo.models¶

SpatialAttention¶

class mindvideo.models.SpatialAttention(in_channels: int = 64, out_channels: int = 16)

Initialize spatial attention unit which refine the aggregation step by re-weighting block contributions.

base: nn.Cell

Parameters:

in_channels: The number of channels of the input feature.
out_channels: The number of channels of the output of hidden layers.

Return:

Tensor of shape (1, 1, H, W).

SimilarityNetwork¶

class mindvideo.models.SimilarityNetwork(in_channels=2, out_channels=64, input_size=64, hidden_size=8)

Similarity learning between query and support clips as paired relation descriptors for RelationNetwork.

base: nn.Cell

Parameters:

in_channels (int): Number of channels of the input feature. Default: 2.
out_channels (int): Number of channels of the output feature. Default: 64.
input_size (int): Size of input features. Default: 64.
hidden_size (int): Number of channels in the hidden fc layers. Default: 8.

Return:

Tensor, output tensor.

ARNEmbedding¶

class mindvideo.models.ARNEmbedding(support_num_per_class: int = 1, query_num_per_class: int = 1, class_num: int = 5, is_c3d: bool = True, in_channels: Optional[int] = 3, out_channels: Optional[int] = 64)

Embedding for ARN based on Unit3d-built 4-layer Conv or C3d.

base: nn.Cell

Parameters:

support_num_per_class (int): Number of samples in support set per class. Default: 1.
query_num_per_class (int): Number of samples in query set per class. Default: 1.
class_num (int): Number of classes. Default: 5.
is_c3d (bool): Specifies whether the network uses C3D as embedding for ARN. Default: False.
in_channels: The number of channels of the input feature. Default: 3.
out_channels: The number of channels of the output of hidden layers (only used when is_c3d is set to False). Default: 64.

Return:

Tensor, output 2 tensors.

ARNBackbone¶

class mindvideo.models.ARNBackbone(jigsaw: int = 10, support_num_per_class: int = 1, query_num_per_class: int = 1, class_num: int = 5, seq: int = 16)

ARN architecture.

base: nn.Cell

Parameters:

jigsaw (int): Number of the output dimension for spacial-temporal jigsaw discriminator. Default: 10.
support_num_per_class (int): Number of samples in support set per class. Default: 1.
query_num_per_class (int): Number of samples in query set per class. Default: 1.
class_num (int): Number of classes. Default: 5.

Return:

Tensor, output 2 tensors.

ARNNeck¶

class mindvideo.models.ARNNeck(class_num: int = 5, support_num_per_class: int = 1, sigma: int = 100)

ARN neck architecture.

base: nn.Cell

Parameters:

class_num (int): Number of classes. Default: 5.
support_num_per_class (int): Number of samples in support set per class. Default: 1.
sigma: Controls the slope of PN. Default: 100.

Return:

Tensor, output 2 tensors.

def mindvideo.models.ARNNeck.power_norm(x)

Define the operation of Power Normalization.

Parameters:

x (Tensor): Tensor of shape :math:(C_{in}, C_{in}).

Return:

Tensor of shape: math:(C_{out}, C_{out}).

ARNHead¶

class mindvideo.models.ARNHead(class_num: int = 5, query_num_per_class: int = 1)

ARN head architecture.

base: nn.Cell

Parameters:

class_num (int): Number of classes. Default: 5.
query_num_per_class (int): Number of query samples per class. Default: 1.

Return:

Tensor, output tensors.

ARN¶

class mindvideo.models.ARN(support_num_per_class: int = 1, query_num_per_class: int = 1, class_num: int = 5, is_c3d: bool = False, in_channels: Optional[int] = 3, out_channels: Optional[int] = 64, jigsaw: int = 10, sigma: int = 100)

Constructs a ARN architecture from Few-shot Action Recognition via Permutation-invariant Attention <https://arxiv.org/pdf/2001.03905.pdf>.

base: nn.Cell

Parameters:

support_num_per_class (int): Number of samples in support set per class. Default: 1.
query_num_per_class (int): Number of samples in query set per class. Default: 1.
class_num (int): Number of classes. Default: 5.
is_c3d (bool): Specifies whether the network uses C3D as embendding for ARN. Default: False.
in_channels: The number of channels of the input feature. Default: 3.
out_channels: The number of channels of the output of hidden layers (only used when is_c3d is set to False). Default: 64.
jigsaw (int): Number of the output dimension for spacial-temporal jigsaw discriminator. Default: 10.
sigma: Controls the slope of PN. Default: 100.

Inputs:

x(Tensor): Tensor of shape :math:(E, N, C_{in}, D_{in}, H_{in}, W_{in}).

Return:

Tensor of shape :math:(CLASSES_NUM, CLASSES_{out})

C3D¶

class mindvideo.models.C3D(in_d: int = 16, in_h: int = 112, in_w: int = 112, in_channel: int = 3, kernel_size: Union[int, Tuple[int]] = (3, 3, 3), head_channel: Union[int, Tuple[int]] = (4096, 4096), num_classes: int = 400, keep_prob: Union[float, Tuple[float]] = (0.5, 0.5, 1.0))

Constructs a C3D architecture.

base: nn.Cell

Parameters:

in_d: Depth of input data, it can be considered as frame number of a video. Default: 16.
in_h: Height of input frames. Default: 112.
in_w: Width of input frames. Default: 112.
in_channel(int): Number of channel of input data. Default: 3.
kernel_size(Union[int, Tuple[int]]): Kernel size for every conv3d layer in C3D. Default: (3, 3, 3).
head_channel(Tuple[int]): Hidden size of multi-dense-layer head. Default: [4096, 4096].
num_classes(int): Number of classes, it is the size of classfication score for every sample, i.e. :math:CLASSES_{out}. Default: 400.
keep_prob(Tuple[int]): Probability of dropout for multi-dense-layer head, the number of probabilities equals the number of dense layers.
pretrained(bool): If True, it will create a pretrained model, the pretrained model will be loaded from network. If False, it will create a c3d model with uniform initialization for weight and bias.

Inputs:

x(Tensor): Tensor of shape :math:(N, C_{in}, D_{in}, H_{in}, W_{in}).

Return:

Tensor of shape :math:(N, CLASSES_{out}).

BasicBlock¶

class mindvideo.models.BasicBlock(cin, cout, stride=1, dilation=1)

Basic residual block for dla.

base: nn.Cell

Parameters:

cin(int): Input channel.
cout(int): Output channel.
stride(int): Covolution stride. Default: 1.
dilation(int): The dilation rate to be used for dilated convolution. Default: 1.

Return:

Tensor, the feature after covolution.

Root¶

class mindvideo.models.Root(in_channels, out_channels, kernel_size, residual)

Get HDA node which play as the root of tree in each stage.

base: nn.Cell

Parameters:

cin(int): Input channel.
cout(int):Output channel.
kernel_size(int): Covolution kernel size.
residual(bool): Add residual or not.

Return:

Tensor, HDA node after aggregation.

Tree¶

class mindvideo.models.Tree(levels, block, in_channels, out_channels, stride=1, level_root=False, root_dim=0, root_kernel_size=1, dilation=1, root_residual=False)

Construct the deep aggregation network through recurrent. Each stage can be seen as a tree with multiple children.

base: nn.Cell

Parameters:

levels(list int): Tree height of each stage.
block(Cell): Basic block of the tree.
in_channels(list int): Input channel of each stage.
out_channels(list int): Output channel of each stage.
stride(int): Covolution stride. Default: 1.
level_root(bool): Whether is the root of tree or not. Default: False.
root_dim(int): Input channel of the root node. Default: 0.
root_kernel_size(int): Covolution kernel size at the root. Default: 1.
dilation(int): The dilation rate to be used for dilated convolution. Default: 1.
root_residual(bool): Add residual or not. Default: False.

Return:

Tensor, the root ida node.

DLA34¶

class mindvideo.models.DLA34(levels, channels, block=None, residual_root=False)

Construct the downsampling deep aggregation network.

base: nn.Cell

Parameters:

levels(list int): Tree height of each stage.
channels(list int): Input channel of each stage
block(Cell): Initial basic block. Default: BasicBlock.
residual_root(bool): Add residual or not. Default: False

Return:

tuple of Tensor, the root node of each stage.

DlaDeformConv¶

class mindvideo.models.DlaDeformConv(cin, cout)

Deformable convolution v2 with bn and relu.

base: nn.Cell

Parameters:

cin(int): Input channel
cout(int): Output_channel

Return:

Tensor, results after deformable convolution and activation

IDAUp¶

class mindvideo.models.IDAUp(out, channels, up_f)

IDAUp sample.

base: nn.Cell

Return:

List.

DLAUp¶

class mindvideo.models.DLAUp(startp, channels, scales, in_channels=None)

DLAUp sample.

base: nn.Cell

Return:

List.

DLASegConv¶

class mindvideo.models.DLASegConv(down_ratio: int, last_level: int, out_channel: int = 0, stage_levels: Tuple[int] = (1, 1, 1, 2, 2, 1), stage_channels: Tuple[int] = (16, 32, 64, 128, 256, 512))

The DLA backbone network.

base: nn.Cell

Parameters:

down_ratio(int): The ratio of input and output resolution
last_level(int): The ending stage of the final upsampling
stage_levels(tuple[int]): The tree height of each stage block
stage_channels(tuple[int]): The feature channel of each stage

Return:

Tensor, the feature map extracted by dla network

FairmotDla34¶

class mindvideo.models.FairmotDla34(down_ratio: int, last_level: int, out_channel: int = 0, stage_levels: Tuple[int] = (1, 1, 1, 2, 2, 1), stage_channels: Tuple[int] = (16, 32, 64, 128, 256, 512))

Constructs a Fairmot architecture.

base: nn.Cell

Parameters:

down_ratio(int): Output stride. Currently only supports 4. Default: 4.
last_level(int): Last level of dla layers used for deep layer aggregation(DLA) module. Default: 5.
head_channel(int): Channel of input of second conv2d layer in heads. Default: 256.
head_conv2_ksize(Union[int, Tuple]): Kernel size of second conv2d layer. Default: 1.
hm(int): Number of heatmap channels. Default: 1.
wh(int): Dimension of offset and size output, i.e. position of bbox, it equals 4 if regress left, top, right, bottom of bbox, else 2. Default: 4.
feature_id(int): Dimension of identity embedding. Default: 128.
reg(int): Dimension of local offset. Default: 2.
pretrained(bool): If True, it will create a pretrained model, the pretrained model will be loaded from network. If False, it will create a fairmot model with default initialization. Default: False.

Inputs:

x(Tensor) - Tensor of shape :math:(N, C_{in}, D_{in}, H_{in}, W_{in}).

Return:

Tensor of shape :math:(N, CLASSES_{out}).

Inception3dModule¶

class mindvideo.models.Inception3dModule(in_channels, out_channels)

Inception3dModule definition.

base: nn.Cell

Parameters:

in_channels (int): The number of channels of input frame images.
out_channels (int): The number of channels of output frame images.

Return:

Tensor, output tensor.

InceptionI3d¶

class mindvideo.models.InceptionI3d(in_channels=3)

InceptionI3d architecture.

base: nn.Cell

Parameters:

in_channels (int): The number of channels of input frame images(default 3).

Return:

Tensor, output tensor.

I3dHead¶

class mindvideo.models.I3dHead(in_channels, num_classes=400, dropout_keep_prob=0.5)

I3dHead definition.

base: nn.Cell

Parameters:

in_channels: Input channel.
num_classes (int): The number of classes .
dropout_keep_prob (float): A float value of prob.

Return:

Tensor, output tensor.

I3D¶

class mindvideo.models.I3D(in_channel: int = 3, num_classes: int = 400, keep_prob: float = 0.5, pooling_keep_dim: bool = True, backbone_output_channel=1024)

Constructs a I3D architecture.

base: nn.Cell

Parameters:

in_channel(int): Number of channel of input data. Default: 3.
num_classes(int): Number of classes, it is the size of classfication score for every sample, i.e. :math:CLASSES_{out}. Default: 400.
keep_prob(float): Probability of dropout for multi-dense-layer head, the number of probabilities equals the number of dense layers. Default: 0.5.
pooling_keep_dim: whether to keep dim when pooling. Default: True.
pretrained(bool): If True, it will create a pretrained model, the pretrained model will be loaded from network. If False, it will create a i3d model with uniform initialization for weight and bias. Default: False.

Inputs:

x(Tensor) - Tensor of shape :math:(N, C_{in}, D_{in}, H_{in}, W_{in}).

Return:

Tensor of shape :math:(N, CLASSES_{out}).

NonLocalBlockND¶

class mindvideo.models.NonLocalBlockND(in_channels, inter_channels=None, mode=’embedded’, sub_sample=True, bn_layer=True)

Classification backbone for nonlocal. Implementation of Non-Local Block with 4 different pairwise functions.

base: nn.Cell

Parameters:

in_channels (int): original channel size.
inter_channels (int): channel size inside the block if not specified reduced to half.
mode: 4 mode to choose (gaussian, embedded, dot, and concatenation).
bn_layer: whether to add batch norm.

Inputs:

x(Tensor) - Tensor of shape :math:(N, C_{in}, D_{in}, H_{in}, W_{in}).

Return:

Tensor of shape :math:(N, C_{out}, D_{out}, H_{out}, W_{out}).

NLInflateBlockBase3D¶

class mindvideo.models.NLInflateBlockBase3D(in_channels, inter_channels=None, mode=’embedded’, sub_sample=True, bn_layer=True)

ResNet residual block base definition.

base: ResidualBlockBase3D

Parameters:

in_channel (int): Input channel.
out_channel (int): Output channel.
stride (int): Stride size for the first convolutional layer. Default: 1.
group (int): Group convolutions. Default: 1.
base_width (int): Width of per group. Default: 64.
norm (nn.Cell, optional): Module specifying the normalization layer to use. Default: None.
down_sample (nn.Cell, optional): Downsample structure. Default: None.

Return:

Tensor, output tensor.

NLInflateBlock3D¶

class mindvideo.models.NLInflateBlockBase3D(in_channel: int, out_channel: int, conv12: Optional[nn.Cell] = Inflate3D, group: int = 1, base_width: int = 64, norm: Optional[nn.Cell] = None, down_sample: Optional[nn.Cell] = None, non_local: bool = False, non_local_mode: str = ‘dot’, **kwargs)

ResNet3D residual block definition.

base: ResidualBlock3D

Parameters:

in_channel (int): Input channel.
out_channel (int): Output channel.
stride (int): Stride size for the second convolutional layer. Default: 1.
group (int): Group convolutions. Default: 1.
base_width (int): Width of per group. Default: 64.
norm (nn.Cell, optional): Module specifying the normalization layer to use. Default: None.
down_sample (nn.Cell, optional): Downsample structure. Default: None.

Return:

Tensor, output tensor.

NLInflateResNet3D¶

class mindvideo.models.NLInflateResNet3D(block: Optional[nn.Cell], layer_nums: Tuple[int], stage_channels: Tuple[int] = (64, 128, 256, 512), stage_strides: Tuple[int] = ((1, 1, 1), (1, 2, 2), (1, 2, 2), (1, 2, 2)), down_sample: Optional[nn.Cell] = Unit3D, inflate: Tuple[Tuple[int]] = ((1, 1, 1), (1, 0, 1, 0), (1, 0, 1, 0, 1, 0), (0, 1, 0)), non_local: Tuple[Tuple[int]] = ((0, 0, 0), (0, 1, 0, 1), (0, 1, 0, 1, 0, 1), (0, 0, 0)), **kwargs)

Inflate3D with ResNet3D backbone and non local block.

base: ResNet3D

Parameters:

block (Optional[nn.Cell]): THe block for network.
layer_nums (list): The numbers of block in different layers.
norm (nn.Cell, optional): The module specifying the normalization layer to use. Default: None.
stage_strides: Stride size for ResNet3D convolutional layer.
non_local: Determine whether to apply nonlocal block in this block.

Inputs:

x(Tensor) - Tensor of shape :math:(N, C_{in}, D_{in}, H_{in}, W_{in}).

Return:

Tensor, output tensor.

nonlocal3d¶

class mindvideo.models.nonlocal3d(in_d: int = 32, in_h: int = 224, in_w: int = 224, num_classes: int = 400, keep_prob: float = 0.5, backbone: Optional[nn.Cell] = NLResInflate3D50, avg_pool: Optional[nn.Cell] = AdaptiveAvgPool3D, flatten: Optional[nn.Cell] = nn.Flatten, head: Optional[nn.Cell] = DropoutDense)

nonlocal3d model from Xiaolong Wang. “Non-local Neural Networks.” https://arxiv.org/pdf/1711.07971v3

base: nn.Cell

Parameters:

in_d: Depth of input data, it can be considered as frame number of a video. Default: 32.
in_h: Height of input frames. Default: 224.
in_w: Width of input frames. Default: 224.
num_classes(int): Number of classes, it is the size of classfication score for every sample, i.e. :math:CLASSES_{out}. Default: 400.
pooling_keep_dim: whether to keep dim when pooling. Default: True.
keep_prob(float): Probability of dropout for multi-dense-layer head, the number of probabilities equals the number of dense layers.
pretrained(bool): If True, it will create a pretrained model, the pretrained model will be loaded from network. If False, it will create a nonlocal3d model with uniform initialization for weight and bias.
backbone: Bcxkbone of nonlocal3d.
avg_pool: Avgpooling and flatten.
head: LinearClsHead architecture.

Inputs:

x(Tensor) - Tensor of shape :math:(N, C_{in}, D_{in}, H_{in}, W_{in})..

Return:

Tensor of shape :math:(N, CLASSES_{out}).

Conv2Plus1d¶

class mindvideo.models.Conv2Plus1d(in_channel, mid_channel, out_channel, kernel_size=(3, 3, 3), stride=(1, 1, 1), norm=nn.BatchNorm3d, activation=nn.ReLU)

R(2+1)d conv12 block. It implements spatial-temporal feature extraction in a sperated way.

base: nn.Cell

Parameters:

in_channels (int): The number of channels of input frame images.
out_channels (int): The number of channels of output frame images.
kernel_size (tuple): The size of the spatial-temporal convolutional layer kernels.
stride (Union[int, Tuple[int]]): Stride size for the convolutional layer. Default: 1.
group (int): Splits filter into groups, in_channels and out_channels must be divisible by the number of groups. Default: 1.
norm (Optional[nn.Cell]): Norm layer that will be stacked on top of the convolution layer. Default: nn.BatchNorm3d.
activation (Optional[nn.Cell]): Activation function which will be stacked on top of the normalization layer (if not None), otherwise on top of the conv layer. Default: nn.ReLU.

Return:

Tensor, its channel size is calculated from in_channel, out_channel and kernel_size.

R2Plus1dNet¶

class mindvideo.models.R2Plus1dNet(block: Optional[nn.Cell], layer_nums: Tuple[int], stage_channels: Tuple[int] = (64, 128, 256, 512), stage_strides: Tuple[Tuple[int]] = ((1, 1, 1), (2, 2, 2), (2, 2, 2), (2, 2, 2)), num_classes: int = 400, **kwargs)

Generic R(2+1)d generator.

base: ResNet3D

Parameters:

block (Optional[nn.Cell]): THe block for network.
layer_nums (Tuple[int]): The numbers of block in different layers.
stage_channels (Tuple[int]): Output channel for every res stage. Default: (64, 128, 256, 512).
stage_strides (Tuple[Tuple[int]]): Strides for every res stage.Default:((1, 1, 1), (2, 2, 2), (2, 2, 2), (2, 2, 2)).
conv12 (nn.Cell, optional): Conv1 and conv2 config in resblock. Default: Conv2Plus1D.
base_width (int): The width of per group. Default: 64.
norm (nn.Cell, optional): The module specifying the normalization layer to use. Default: None.
num_classes(int): Number of categories in the action recognition dataset.
keep_prob(float): Dropout probability in classification stage.
kwargs (dict, optional): Key arguments for “make_res_layer” and resblocks.

Return:

Tensor, output tensor.

WindowAttention3D¶

class mindvideo.models.WindowAttention3D(in_channels: int = 96, window_size: int = (8, 7, 7), num_head: int = 3, qkv_bias: Optional[bool] = True, qk_scale: Optional[float] = None, attn_kepp_prob: Optional[float] = 1.0, proj_keep_prob: Optional[float] = 1.0)

Window based multi-head self attention (W-MSA) module with relative position bias. It supports both of shifted and non-shifted window.

base: nn.Cell

Parameters:

in_channels (int): Number of input channels.
window_size (tuple[int]): The depth length, height and width of the window. Default: (8, 7, 7).
num_head (int): Number of attention heads. Default: 3.
qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. Default: True.
qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set. Default: None.
attn_keep_prob (float, optional): Dropout keep ratio of attention weight. Default: 1.0.
proj_keep_prob (float, optional): Dropout keep ratio of output. Deault: 1.0.

Inputs:

x (Tensor) - Tensor of shape (B, N, C).
mask (Tensor) - (0 / - inf) mask with shape of (num_windows, N, N) or None.

Return:

Tensor of shape (B, N, C), which is equal to the input x.

SwinTransformerBlock3D¶

class mindvideo.models.SwinTransformerBlock3D(embed_dim: int = 96, input_size: int = (16, 56, 56), num_head: int = 3, window_size: int = (8, 7, 7), shift_size: int = (4, 3, 3), mlp_ratio: float = 4., qkv_bias: bool = True, qk_scale: Optional[float] = None, keep_prob: float = 1., attn_keep_prob: float = 1., droppath_keep_prob: float = 1., act_layer: nn.Cell = nn.GELU, norm_layer: str = ‘layer_norm’)

A Video Swin Transformer Block. The implementation of this block follows the paper “Video Swin Transformer”.

base: nn.Cell

Parameters:

embed_dim (int): input feature’s embedding dimension, namely, channel number. Default: 96.
input_size (int | tuple(int)): input feature size. Default: (16, 56, 56).
num_head (int): number of attention head of the current Swin3d block. Default: 3.
window_size (int): window size of window attention. Default: (8, 7, 7).
shift_size (tuple[int]): shift size for shifted window attention. Default: (4, 3, 3).
mlp_ratio (float): ratio of mlp hidden dim to embedding dim. Default: 4.0.
qkv_bias (bool): if True, add a learnable bias to query, key,value. Default: True.
qk_scale (float | None, optional): override default qk scale of head_dim ** -0.5 if set True. Default: None.
keep_prob (float): dropout keep probability. Default: 1.0.
attn_keep_prob (float): units keeping probability for attention dropout. Default: 1.0.
droppath_keep_prob (float): path keeping probability for stochastic droppath. Default: 1.0.
act_layer (nn.Cell): activation layer. Default: nn.GELU.
norm_layer (nn.Cell): normalization layer. Default: ‘layer_norm’.

Inputs:

x (Tensor) - Input feature of shape (B, D, H, W, C).
mask_matrix (Tensor) - Attention mask for cyclic shift.

Return:

Tensor of shape (B, D, H, W, C)

PatchMerging¶

class mindvideo.models.PatchMerging(dim: int = 96, norm_layer: str = ‘layer_norm’)

Patch Merging Layer.

base: nn.Cell

Parameters:

dim (int): Number of input channels.
norm_layer (nn.Cell): Normalization layer. Default: nn.LayerNorm

Inputs:

x (Tensor) - Input feature of shape (B, D, H, W, C).

Return:

Tensor of shape (B, D, H/2, W/2, 2*C)

SwinTransformerStage3D¶

class mindvideo.models.SwinTransformerStage3D(embed_dim=96, input_size=(16, 56, 56), depth=2, num_head=3, window_size=(8, 7, 7), mlp_ratio=4., qkv_bias=True, qk_scale=None, keep_prob=1., attn_keep_prob=1., droppath_keep_prob=0.8, norm_layer=’layer_norm’, downsample=PatchMerging)

A basic Swin Transformer layer for one stage.

base: nn.Cell

Parameters:

embed_dim (int): input feature’s embedding dimension, namely, channel number. Default: 96.
input_size (tuple[int]): input feature size. Default. (16, 56, 56).
depth (int): depth of the current Swin3d stage. Default: 2.
num_head (int): number of attention head of the current Swin3d stage. Default: 3.
window_size (int): window size of window attention. Default: (8, 7, 7).
mlp_ratio (float): ratio of mlp hidden dim to embedding dim. Default: 4.0.
qkv_bias (bool): if qkv_bias is True, add a learnable bias into query, key, value matrixes. Default: Truee
qk_scale (float | None, optional): override default qk scale of head_dim ** -0.5 if set. Default: None.
keep_prob (float): dropout keep probability. Default: 1.0.
attn_keep_prob (float): units keeping probability for attention dropout. Default: 1.
droppath_keep_prob (float): path keeping probability for stochastic droppath. Default: 0.8.
norm_layer(string): normalization layer. Default: ‘layer_norm’.
downsample (nn.Cell | None, optional): downsample layer at the end of swin3d stage. Default: PatchMerging.

Inputs:

A video feature of shape (N, D, H, W, C)

Return:

Tensor of shape (N, D, H / 2, W / 2, 2 * C)

PatchEmbed3D¶

class mindvideo.models.PatchEmbed3D(input_size=(16, 224, 224), patch_size=(2, 4, 4), in_channels=3, embed_dim=96, norm_layer=’layer_norm’, patch_norm=True)

Video to Patch Embedding.

base: nn.Cell

Parameters:

input_size (tuple[int]): Input feature size.
patch_size (int): Patch token size. Default: (2,4,4).
in_channels (int): Number of input video channels. Default: 3.
embed_dim (int): Number of linear projection output channels. Default: 96.
norm_layer (nn.Module, optional): Normalization layer. Default: None.
patch_norm (bool): if True, add normalization after patch embedding. Default: True.

Inputs:

An original Video tensor in data format of ‘NCDHW’.

Return:

An embedded tensor in data format of ‘NDHWC’.

SwinTransformer3D¶

class mindvideo.models.SwinTransformer3D(input_size=(16, 56, 56), embed_dim=96, depths=(2, 2, 6, 2), num_heads=(3, 6, 12, 24), window_size=(8, 7, 7), mlp_ratio=4., qkv_bias=True, qk_scale=None, keep_prob=1., attn_keep_prob=1., droppath_keep_prob=0.8, norm_layer=’layer_norm’)

Video Swin Transformer backbone. A mindspore implementation of : Video Swin Transformer http://arxiv.org/abs/2106.13230

base: nn.Cell

Parameters:

input_size (int | tuple(int)): input feature size. Default: (16, 56, 56).
embed_dim (int): input feature’s embedding dimension, namely, channel number. Default: 96.
depths (tuple[int]): depths of each Swin3d stage. Default: (2, 2, 6, 2).
num_heads (tuple[int]): number of attention head of each Swin3d stage. Default: (3, 6, 12, 24).
window_size (int): window size of window attention. Default: (8, 7, 7).
mlp_ratio (float): ratio of mlp hidden dim to embedding dim. Default: 4.0.
qkv_bias (bool): if qkv_bias is True, add a learnable bias into query, key, value matrixes. Default: True.
qk_scale (float | None, optional): override default qk scale of head_dim ** -0.5 if set. Default: None.
keep_prob (float): dropout keep probability. Default: 1.0.
attn_keep_prob (float): units keeping probability for attention dropout. Default: 1.
droppath_keep_prob (float): path keeping probability for stochastic droppath. Default: 0.8.
norm_layer (string): normalization layer. Default: ‘layer_norm’.

Inputs:

x (Tensor) - Tensor of shape ‘NDHWC’.

Return:

Tensor of shape ‘NCDHW’.

Swin3D¶

class mindvideo.models.Swin3D(input_size=(16, 56, 56), embed_dim=96, depths=(2, 2, 6, 2), num_heads=(3, 6, 12, 24), window_size=(8, 7, 7), mlp_ratio=4., qkv_bias=True, qk_scale=None, keep_prob=1., attn_keep_prob=1., droppath_keep_prob=0.8, norm_layer=’layer_norm’)

Constructs a swin3d architecture corresponding to Video Swin Transformer <http://arxiv.org/abs/2106.13230>.

base: nn.Cell

Parameters:

num_classes (int): The number of classification. Default: 400.
patch_size (int): Patch size used by window attention. Default: (2, 4, 4).
window_size (int): Window size used by window attention. Default: (8, 7, 7).
embed_dim (int): Embedding dimension of the featrue generated from patch embedding layer. Default: 96.
depths (int): Depths of each stage in Swin3d Tiny module. Default: (2, 2, 6, 2).
num_heads (int): Numbers of heads of each stage in Swin3d Tiny module. Default: (3, 6, 12, 24).
representation_size (int): Feature dimension of the last layer in backbone. Default: 768.
droppath_keep_prob (float): The drop path keep probability. Default: 0.9.
input_size (int | tuple(int)): Input feature size. Default: (32, 224, 224).
in_channels (int): Input channels. Default: 3.
mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. Default: 4.0.
qkv_bias (bool): If qkv_bias is True, add a learnable bias into query, key, value matrixes. Default: True.
qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set. Default: None.
keep_prob (float): Dropout keep probability. Default: 1.0.
attn_keep_prob (float): Keeping probability for attention dropout. Default: 1.0.
norm_layer (string): Normalization layer. Default: ‘layer_norm’.
patch_norm (bool): If True, add normalization after patch embedding. Default: True.
pooling_keep_dim (bool): Specifies whether to keep dimension shape the same as input feature. Default: False.
head_bias (bool): Specifies whether the head uses a bias vector. Default: True.
head_activation (Union[str, Cell, Primitive]): Activate function applied in the head. Default: None.
head_keep_prob (float): Head’s dropout keeping rate, between [0, 1]. Default: 0.5.

Inputs:

x (Tensor) - Tensor of shape :math:(N, C_{in}, D_{in}, H_{in}, W_{in}).

Return:

Tensor of shape :math:(N, CLASSES_{out})

swin3d_t¶

def mindvideo.models.swin3d_t(num_classes: int = 400, patch_size: int = (2, 4, 4), window_size: int = (8, 7, 7), embed_dim: int = 96, depths: int = (2, 2, 6, 2), num_heads: int = (3, 6, 12, 24), representation_size: int = 768, droppath_keep_prob: float = 0.9)

Video Swin Transformer Tiny (swin3d-T) model.

Parameters:

num_classes (int): Number of categories. patch_size (int): Size of swin3d patch segmentation. window_size (int): Size of swin3d window. embed_dim (int): Dimension output by the patch embedding. depths (int): Depth of each stage. num_heads (int): Number of heads in window attention. representation_size (int): Size of features output at the last layer of backbone. droppath_keep_prob (float): Retetion probability of drop path.

Returns:

swin3d_t: nn.Cell

swin3d_s¶

def mindvideo.models.swin3d_s(num_classes: int = 400, patch_size: int = (2, 4, 4), window_size: int = (8, 7, 7), embed_dim: int = 96, depths: int = (2, 2, 18, 2), num_heads: int = (3, 6, 12, 24), representation_size: int = 768, droppath_keep_prob: float = 0.9)

Video Swin Transformer Small (swin3d-S) model.

Parameters:

num_classes (int): Number of categories. patch_size (int): Size of swin3d patch segmentation. window_size (int): Size of swin3d window. embed_dim (int): Dimension output by the patch embedding. depths (int): Depth of each stage. num_heads (int): Number of heads in window attention. representation_size (int): Size of features output at the last layer of backbone. droppath_keep_prob (float): Retetion probability of drop path.

Returns:

swin3d_s: nn.Cell

swin3d_b¶

def mindvideo.models.swin3d_b(num_classes: int = 400, patch_size: int = (2, 4, 4), window_size: int = (8, 7, 7), embed_dim: int = 128, depths: int = (2, 2, 18, 2), num_heads: int = (4, 8, 16, 32), representation_size: int = 1024, droppath_keep_prob: float = 0.7)

Video Swin Transformer Base (swin3d-B) model.

Parameters:

num_classes (int): Number of categories. patch_size (int): Size of swin3d patch segmentation. window_size (int): Size of swin3d window. embed_dim (int): Dimension output by the patch embedding. depths (int): Depth of each stage. num_heads (int): Number of heads in window attention. representation_size (int): Size of features output at the last layer of backbone. droppath_keep_prob (float): Retetion probability of drop path.

Returns:

swin3d_b: nn.Cell

swin3d_l¶

def mindvideo.models.swin3d_l(num_classes: int = 400, patch_size: int = (2, 4, 4), window_size: int = (8, 7, 7), embed_dim: int = 192, depths: int = (2, 2, 18, 2), num_heads: int = (6, 12, 24, 48), representation_size: int = 1536, droppath_keep_prob: float = 0.9)

Video Swin Transformer Large (swin3d-L) model.

Parameters:

num_classes (int): Number of categories. patch_size (int): Size of swin3d patch segmentation. window_size (int): Size of swin3d window. embed_dim (int): Dimension output by the patch embedding. depths (int): Depth of each stage. num_heads (int): Number of heads in window attention. representation_size (int): Size of features output at the last layer of backbone. droppath_keep_prob (float): Retetion probability of drop path.

Returns:

swin3d_l: nn.Cell

GroupNorm3d¶

class mindvideo.models.GroupNorm3d(num_groups, num_channels, eps=1e-05, affine=True, gamma_init=’ones’, beta_init=’zeros’)

modify from mindspore.nn.GroupNorm, add depth

base: nn.Cell

Parameters:

num_groups (int): Number of groups to be divided along the channel dimension. num_channels (int): Number of channels. eps(float): The value added to the denominator. affine (bool): When set to True, a learnable affine transformation parameter is added to the layer. gamma_init (str): Method of initializing the gamma parameter. beta_init (str): Method of initializing the beta parameter.

Return:

Tensor, output tensor.

VistrCom¶

class mindvideo.models.VistrCom(name: str = ‘ResNet50’, train_embeding: bool = True, num_queries: int = 360, num_pos_feats: int = 64, num_frames: int = 36, temperature: int = 10000, normalize: bool = True, scale: float = None, hidden_dim: int = 384, d_model: int = 384, nhead: int = 8, num_encoder_layers: int = 6, num_decoder_layers: int = 6, dim_feedforward: int = 2048, dropout: int = 0.1, activation: str = “relu”, normalize_before: bool = False, return_intermediate_dec: bool = True, aux_loss: bool = True, num_class: int = 41)

Vistr Architecture.

base: nn.Cell

Parameters:

name (str): The type of ResNet. train_embeding (bool): Whether to train embeding or not. num_queries （int）: Number of instances. num_pos_feats (int): The encoding length of each dimension. num_frames (int)： Number of frames. temperature (int): Coefficient. normalize (bool): Whether to normalize. If True, normalize. scale (float): Coefficient. hidden_dim (int): Dimensions required by the input vector in the encoder. d_model (int): Number of expected features entered by the backbone nhead (int): Number of heads in multi head attention. num_encoder_layers (int): Layer number of encoders. num_decoder_layers (int): Layer number of decoders. dim_feedforward (int): Dimensions of the feedforward network model in backbone dropout (int): Value of dropout. activation(str): Activation function. normalize_before (bool): Whether is normalized or not before. return_intermediate_dec (bool): Whether to return intermediate output aux_loss (bool): Whether to calculate the loss of the middle layer. num_class (int): Number of categories.

Return:

Tensor, output tensor.

BlockX3D¶

class mindvideo.models.BlockX3D(in_channel, out_channel, conv12: Optional[nn.Cell] = Inflate3D, inflate: int = 2, norm: Optional[nn.Cell] = None, down_sample: Optional[nn.Cell] = None, block_idx: int = 0, se_ratio: float = 0.0625, use_swish: bool = True, drop_connect_rate: float = 0.0, bottleneck_factor: float = 2.25, **kwargs)

BlockX3D 3d building block for X3D.

base: ResidualBlock3D

Parameters:

in_channel (int): Input channel.
out_channel (int): Output channel.
conv12(nn.Cell, optional): Block that constructs first two conv layers. It can be Inflate3D, Conv2Plus1D or other custom blocks, this block should construct a layer where the name of output feature channel size is mid_channel for the third conv layers. Default: Inflate3D.
inflate (int): Whether to inflate kernel.
spatial_stride (int): Spatial stride in the conv3d layer. Default: 1.
down_sample (nn.Module | None): DownSample layer. Default: None.
block_idx (int): the id of the block.
se_ratio (float | None): The reduction ratio of squeeze and excitation unit. If set as None, it means not using SE unit. Default: None.
use_swish (bool): Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.
drop_connect_rate (float): dropout rate. If equal to 0.0, perform no dropout.
bottleneck_factor (float): Bottleneck expansion factor for the 3x3x3 conv.

Return:

Tensor, output tensor.

ResNetX3D¶

class mindvideo.models.ResNetX3D(block: Optional[nn.Cell], layer_nums: Tuple[int], stage_channels: Tuple[int], stage_strides: Tuple[Tuple[int]], drop_rates: Tuple[float], down_sample: Optional[nn.Cell] = Unit3D, bottleneck_factor: float = 2.25)

X3D backbone definition.

base: ResNet3D

Parameters:

block (Optional[nn.Cell]): THe block for network.
layer_nums (list): The numbers of block in different layers.
stage_channels (Tuple[int]): Output channel for every res stage.
stage_strides (Tuple[Tuple[int]]): Stride size for ResNet3D convolutional layer.
drop_rates (list): list of the drop rate in different blocks. The basic rate at which blocks are dropped, linearly increases from input to output blocks.
down_sample (Optional[nn.Cell]): Residual block in every resblock, it can transfer the input feature into the same channel of output. Default: Unit3D.
bottleneck_factor (float): Bottleneck expansion factor for the 3x3x3 conv.
fc_init_std (float): The std to initialize the fc layer(s).

Inputs:

x (Tensor) - Tensor of shape :math:(N, C_{in}, D_{in}, H_{in}, W_{in}).

Return:

Tensor, output tensor.

X3DHead¶

class mindvideo.models.X3DHead(pool_size, input_channel, out_channel=2048, num_classes=400, dropout_rate=0.5)

x3d head architecture.

base: nn.Cell

Parameters:

input_channel (int): The number of input channel.
out_channel (int): The number of inner channel. Default: 2048.
num_classes (int): Number of classes. Default: 400.
dropout_rate (float): Dropout keeping rate, between [0, 1]. Default: 0.5.

Return:

Tensor

x3d¶

class mindvideo.models.x3d(block: Type[BlockX3D], depth_factor: float, num_frames: int, train_crop_size: int, num_classes: int, dropout_rate: float, bottleneck_factor: float = 2.25, eval_with_clips: bool = False)

x3d architecture. Christoph Feichtenhofer. “X3D: Expanding Architectures for Efficient Video Recognition.” https://arxiv.org/abs/2004.04730

base: nn.Cell

Parameters:

block (Type[BlockX3D]): The block of X3D.
depth_factor (float): Depth expansion factor.
num_frames (int): The number of frames of the input clip.
train_crop_size (int): The spatial crop size for training.
num_classes (int): the channel dimensions of the output.
dropout_rate (float): dropout rate. If equal to 0.0, perform no dropout.
bottleneck_factor (float): Factor of bottleneck.
eval_with_clips (bool): If evalidate with clips, eval_with_clips is True.

Inputs:

x (Tensor) - Tensor of shape :math:(N, C_{in}, D_{in}, H_{in}, W_{in}).

Return:

Tensor of shape :math:(N, CLASSES_{out})

x3d_m¶

def mindvideo.models.x3d_m(num_classes: int = 400, dropout_rate: float = 0.5, depth_factor: float = 2.2, num_frames: int = 16, train_crop_size: int = 224, eval_with_clips: bool = False)

X3D middle model.

Parameters:

num_classes (int): the channel dimensions of the output.
dropout_rate (float): dropout rate. If equal to 0.0, perform no dropout.
depth_factor (float): Depth expansion factor.
num_frames (int): The number of frames of the input clip.
train_crop_size (int): The spatial crop size for training.

Inputs:

x (Tensor) - Tensor of shape :math:(N, C_{in}, D_{in}, H_{in}, W_{in}).

Return:

Tensor of shape :math:(N, CLASSES_{out})

x3d_s¶

def mindvideo.models.x3d_s(num_classes: int = 400, dropout_rate: float = 0.5, depth_factor: float = 2.2, num_frames: int = 13, train_crop_size: int = 160, eval_with_clips: bool = False)

X3D small model.

Parameters:

num_classes (int): the channel dimensions of the output.
dropout_rate (float): dropout rate. If equal to 0.0, perform no dropout.
depth_factor (float): Depth expansion factor.
num_frames (int): The number of frames of the input clip.
train_crop_size (int): The spatial crop size for training.

Inputs:

x (Tensor) - Tensor of shape :math:(N, C_{in}, D_{in}, H_{in}, W_{in}).

Return:

Tensor of shape :math:(N, CLASSES_{out})

x3d_xs¶

def mindvideo.models.x3d_xs(num_classes: int = 400, dropout_rate: float = 0.5, depth_factor: float = 2.2, num_frames: int = 4, train_crop_size: int = 160, eval_with_clips: bool = False)

X3D x-small model.

Parameters:

num_classes (int): the channel dimensions of the output.
dropout_rate (float): dropout rate. If equal to 0.0, perform no dropout.
depth_factor (float): Depth expansion factor.
num_frames (int): The number of frames of the input clip.
train_crop_size (int): The spatial crop size for training.

Inputs:

x (Tensor) - Tensor of shape :math:(N, C_{in}, D_{in}, H_{in}, W_{in}).

Return:

Tensor of shape :math:(N, CLASSES_{out})

x3d_l¶

def mindvideo.models.x3d_l(num_classes: int = 400, dropout_rate: float = 0.5, depth_factor: float = 5.0, num_frames: int = 16, train_crop_size: int = 312, eval_with_clips: bool = False)

X3D large model.

Parameters:

num_classes (int): the channel dimensions of the output.
dropout_rate (float): dropout rate. If equal to 0.0, perform no dropout.
depth_factor (float): Depth expansion factor.
num_frames (int): The number of frames of the input clip.
train_crop_size (int): The spatial crop size for training.

Inputs:

x (Tensor) - Tensor of shape :math:(N, C_{in}, D_{in}, H_{in}, W_{in}).

Return:

Tensor of shape :math:(N, CLASSES_{out})