## mindvideo.models

### SpatialAttention

> class mindvideo.models.SpatialAttention(in_channels: int = 64, out_channels: int = 16)

Initializes a spatial attention unit, which refines the aggregation step by re-weighting block contributions.

- base: nn.Cell

**Parameters:**

- in_channels: The number of channels of the input feature.
- out_channels: The number of channels of the output of hidden layers.

**Return:**

Tensor of shape (1, 1, H, W).

### SimilarityNetwork

> class mindvideo.models.SimilarityNetwork(in_channels=2, out_channels=64, input_size=64, hidden_size=8)

Learns the similarity between query and support clips, taking paired relation descriptors as input for RelationNetwork.

- base: nn.Cell

**Parameters:**

- in_channels (int): Number of channels of the input feature. Default: 2.
- out_channels (int): Number of channels of the output feature. Default: 64.
- input_size (int): Size of input features. Default: 64.
- hidden_size (int): Number of channels in the hidden fc layers. Default: 8.

**Return:**

Tensor, output tensor.

### ARNEmbedding

> class mindvideo.models.ARNEmbedding(support_num_per_class: int = 1, query_num_per_class: int = 1, class_num: int = 5, is_c3d: bool = True, in_channels: Optional[int] = 3, out_channels: Optional[int] = 64)

Embedding for ARN, based on either a 4-layer Conv network built from Unit3d or on C3D.

- base: nn.Cell

**Parameters:**

- support_num_per_class (int): Number of samples in support set per class. Default: 1.
- query_num_per_class (int): Number of samples in query set per class. Default: 1.
- class_num (int): Number of classes. Default: 5.
- is_c3d (bool): Specifies whether the network uses C3D as embedding for ARN. Default: False.
- in_channels: The number of channels of the input feature. Default: 3.
- out_channels: The number of channels of the output of hidden layers (only used when is_c3d is set to False). Default: 64.

**Return:**

Tensor, output 2 tensors.
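The three episode parameters shared by the ARN classes (`class_num`, `support_num_per_class`, `query_num_per_class`) determine how many clips one few-shot episode contains. The sketch below only does that arithmetic. The helper name `episode_sizes` is hypothetical and is not part of `mindvideo`, and how `ARNEmbedding` actually splits its input tensor is not stated in this reference.

```python
def episode_sizes(class_num=5, support_num_per_class=1, query_num_per_class=1):
    """Return (num_support_clips, num_query_clips) for one few-shot episode.

    Assumption: an episode is `class_num`-way, with the given number of
    support and query clips drawn for every class.
    """
    num_support = class_num * support_num_per_class
    num_query = class_num * query_num_per_class
    return num_support, num_query


# 5-way 1-shot episode with one query clip per class (the documented defaults).
print(episode_sizes())  # (5, 5)

# 5-way 5-shot episode with three query clips per class.
print(episode_sizes(class_num=5, support_num_per_class=5, query_num_per_class=3))  # (25, 15)
```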
### ARNBackbone

> class mindvideo.models.ARNBackbone(jigsaw: int = 10, support_num_per_class: int = 1, query_num_per_class: int = 1, class_num: int = 5, seq: int = 16)

ARN architecture.

- base: nn.Cell

**Parameters:**

- jigsaw (int): Output dimension of the spatial-temporal jigsaw discriminator. Default: 10.
- support_num_per_class (int): Number of samples in support set per class. Default: 1.
- query_num_per_class (int): Number of samples in query set per class. Default: 1.
- class_num (int): Number of classes. Default: 5.

**Return:**

Tensor, output 2 tensors.

### ARNNeck

> class mindvideo.models.ARNNeck(class_num: int = 5, support_num_per_class: int = 1, sigma: int = 100)

ARN neck architecture.

- base: nn.Cell

**Parameters:**

- class_num (int): Number of classes. Default: 5.
- support_num_per_class (int): Number of samples in support set per class. Default: 1.
- sigma: Controls the slope of PN (Power Normalization). Default: 100.

**Return:**

Tensor, output 2 tensors.

> def mindvideo.models.ARNNeck.power_norm(x)

Defines the Power Normalization operation.

**Parameters:**

- x (Tensor): Tensor of shape :math:`(C_{in}, C_{in})`.

**Return:**

Tensor of shape :math:`(C_{out}, C_{out})`.

### ARNHead

> class mindvideo.models.ARNHead(class_num: int = 5, query_num_per_class: int = 1)

ARN head architecture.

- base: nn.Cell

**Parameters:**

- class_num (int): Number of classes. Default: 5.
- query_num_per_class (int): Number of query samples per class. Default: 1.

**Return:**

Tensor, output tensors.

### ARN

> class mindvideo.models.ARN(support_num_per_class: int = 1, query_num_per_class: int = 1, class_num: int = 5, is_c3d: bool = False, in_channels: Optional[int] = 3, out_channels: Optional[int] = 64, jigsaw: int = 10, sigma: int = 100)

Constructs an ARN architecture from `Few-shot Action Recognition via Permutation-invariant Attention`.

- base: nn.Cell

**Parameters:**

- support_num_per_class (int): Number of samples in support set per class. Default: 1.
- query_num_per_class (int): Number of samples in query set per class. Default: 1.
- class_num (int): Number of classes. Default: 5.
- is_c3d (bool): Specifies whether the network uses C3D as embedding for ARN. Default: False.
- in_channels: The number of channels of the input feature. Default: 3.
- out_channels: The number of channels of the output of hidden layers (only used when is_c3d is set to False). Default: 64.
- jigsaw (int): Output dimension of the spatial-temporal jigsaw discriminator. Default: 10.
- sigma: Controls the slope of PN (Power Normalization). Default: 100.

**Inputs:**

- x(Tensor): Tensor of shape :math:`(E, N, C_{in}, D_{in}, H_{in}, W_{in})`.

**Return:**

Tensor of shape :math:`(CLASSES_NUM, CLASSES_{out})`.

### C3D

> class mindvideo.models.C3D(in_d: int = 16, in_h: int = 112, in_w: int = 112, in_channel: int = 3, kernel_size: Union[int, Tuple[int]] = (3, 3, 3), head_channel: Union[int, Tuple[int]] = (4096, 4096), num_classes: int = 400, keep_prob: Union[float, Tuple[float]] = (0.5, 0.5, 1.0))

Constructs a C3D architecture.

- base: nn.Cell

**Parameters:**

- in_d: Depth of input data; it can be considered the number of frames in a video. Default: 16.
- in_h: Height of input frames. Default: 112.
- in_w: Width of input frames. Default: 112.
- in_channel(int): Number of channels of input data. Default: 3.
- kernel_size(Union[int, Tuple[int]]): Kernel size for every conv3d layer in C3D. Default: (3, 3, 3).
- head_channel(Tuple[int]): Hidden size of multi-dense-layer head. Default: (4096, 4096).
- num_classes(int): Number of classes; it is the size of the classification score for every sample, i.e. :math:`CLASSES_{out}`. Default: 400.
- keep_prob(Tuple[float]): Dropout keep probability for the multi-dense-layer head; the number of probabilities equals the number of dense layers. Default: (0.5, 0.5, 1.0).
- pretrained(bool): If `True`, it will create a pretrained model, the pretrained model will be loaded from network.
  If `False`, it will create a C3D model with uniform initialization for weight and bias.

**Inputs:**

- x(Tensor): Tensor of shape :math:`(N, C_{in}, D_{in}, H_{in}, W_{in})`.

**Return:**

Tensor of shape :math:`(N, CLASSES_{out})`.

### BasicBlock

> class mindvideo.models.BasicBlock(cin, cout, stride=1, dilation=1)

Basic residual block for DLA.

- base: nn.Cell

**Parameters:**

- cin(int): Input channel.
- cout(int): Output channel.
- stride(int): Convolution stride. Default: 1.
- dilation(int): The dilation rate to be used for dilated convolution. Default: 1.

**Return:**

Tensor, the feature after convolution.

### Root

> class mindvideo.models.Root(in_channels, out_channels, kernel_size, residual)

Builds the HDA node that serves as the root of the tree in each stage.

- base: nn.Cell

**Parameters:**

- in_channels(int): Input channel.
- out_channels(int): Output channel.
- kernel_size(int): Convolution kernel size.
- residual(bool): Add residual or not.

**Return:**

Tensor, HDA node after aggregation.

### Tree

> class mindvideo.models.Tree(levels, block, in_channels, out_channels, stride=1, level_root=False, root_dim=0, root_kernel_size=1, dilation=1, root_residual=False)

Constructs the deep aggregation network recursively. Each stage can be seen as a tree with multiple children.

- base: nn.Cell

**Parameters:**

- levels(list int): Tree height of each stage.
- block(Cell): Basic block of the tree.
- in_channels(list int): Input channel of each stage.
- out_channels(list int): Output channel of each stage.
- stride(int): Convolution stride. Default: 1.
- level_root(bool): Whether this node is the root of the tree. Default: False.
- root_dim(int): Input channel of the root node. Default: 0.
- root_kernel_size(int): Convolution kernel size at the root. Default: 1.
- dilation(int): The dilation rate to be used for dilated convolution. Default: 1.
- root_residual(bool): Add residual or not. Default: False.

**Return:**

Tensor, the root IDA node.
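The `stride` and `dilation` parameters of `BasicBlock` and `Tree` control the spatial size of the features they produce. The sketch below applies the standard convolution output-size formula. The helper `conv_output_size` is hypothetical, and the 3x3 kernel and "same"-style padding are assumptions, since this reference does not state the kernel size or padding used inside the blocks.

```python
def conv_output_size(size, kernel_size=3, stride=1, dilation=1, padding=None):
    """Output length of a convolution along one spatial axis.

    Assumption: when `padding` is not given, "same"-style padding of
    dilation * (kernel_size - 1) // 2 is used, as is common in residual blocks.
    """
    if padding is None:
        padding = dilation * (kernel_size - 1) // 2
    # A dilated kernel covers dilation * (kernel_size - 1) + 1 input positions.
    effective_kernel = dilation * (kernel_size - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


print(conv_output_size(56))              # 56: stride 1 keeps the resolution
print(conv_output_size(56, stride=2))    # 28: stride 2 halves it
print(conv_output_size(56, dilation=2))  # 56: dilation widens the kernel, not the output
```

Under these assumptions, dilation enlarges the receptive field without changing resolution, while stride is what downsamples between stages.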
### DLA34

> class mindvideo.models.DLA34(levels, channels, block=None, residual_root=False)

Constructs the downsampling deep aggregation network.

- base: nn.Cell

**Parameters:**

- levels(list int): Tree height of each stage.
- channels(list int): Input channel of each stage.
- block(Cell): Initial basic block. Default: BasicBlock.
- residual_root(bool): Add residual or not. Default: False.

**Return:**

Tuple of Tensor, the root node of each stage.

### DlaDeformConv

> class mindvideo.models.DlaDeformConv(cin, cout)

Deformable convolution v2 with batch normalization and ReLU.

- base: nn.Cell

**Parameters:**

- cin(int): Input channel.
- cout(int): Output channel.

**Return:**

Tensor, results after deformable convolution and activation.

### IDAUp

> class mindvideo.models.IDAUp(out, channels, up_f)

IDAUp sample.

- base: nn.Cell

**Return:**

List.

### DLAUp

> class mindvideo.models.DLAUp(startp, channels, scales, in_channels=None)

DLAUp sample.

- base: nn.Cell

**Return:**

List.

### DLASegConv

> class mindvideo.models.DLASegConv(down_ratio: int, last_level: int, out_channel: int = 0, stage_levels: Tuple[int] = (1, 1, 1, 2, 2, 1), stage_channels: Tuple[int] = (16, 32, 64, 128, 256, 512))

The DLA backbone network.

- base: nn.Cell

**Parameters:**

- down_ratio(int): The ratio of input to output resolution.
- last_level(int): The ending stage of the final upsampling.
- stage_levels(tuple[int]): The tree height of each stage block.
- stage_channels(tuple[int]): The feature channel of each stage.

**Return:**

Tensor, the feature map extracted by the DLA network.

### FairmotDla34

> class mindvideo.models.FairmotDla34(down_ratio: int, last_level: int, out_channel: int = 0, stage_levels: Tuple[int] = (1, 1, 1, 2, 2, 1), stage_channels: Tuple[int] = (16, 32, 64, 128, 256, 512))

Constructs a FairMOT architecture.

- base: nn.Cell

**Parameters:**

- down_ratio(int): Output stride. Currently only supports 4. Default: 4.
- last_level(int): Last level of DLA layers used for the deep layer aggregation (DLA) module. Default: 5.
- head_channel(int): Input channel of the second conv2d layer in the heads. Default: 256.
- head_conv2_ksize(Union[int, Tuple]): Kernel size of the second conv2d layer. Default: 1.
- hm(int): Number of heatmap channels. Default: 1.
- wh(int): Dimension of the offset and size output, i.e. the position of the bbox. It equals 4 if the left, top, right and bottom of the bbox are regressed, else 2. Default: 4.
- feature_id(int): Dimension of identity embedding. Default: 128.
- reg(int): Dimension of local offset. Default: 2.
- pretrained(bool): If `True`, it will create a pretrained model, the pretrained model will be loaded from network. If `False`, it will create a FairMOT model with default initialization. Default: False.

**Inputs:**

- x(Tensor) - Tensor of shape :math:`(N, C_{in}, D_{in}, H_{in}, W_{in})`.

**Return:**

Tensor of shape :math:`(N, CLASSES_{out})`.

### Inception3dModule

> class mindvideo.models.Inception3dModule(in_channels, out_channels)

Inception3dModule definition.

- base: nn.Cell

**Parameters:**

- in_channels (int): The number of channels of input frame images.
- out_channels (int): The number of channels of output frame images.

**Return:**

Tensor, output tensor.

### InceptionI3d

> class mindvideo.models.InceptionI3d(in_channels=3)

InceptionI3d architecture.

- base: nn.Cell

**Parameters:**

- in_channels (int): The number of channels of input frame images. Default: 3.

**Return:**

Tensor, output tensor.

### I3dHead

> class mindvideo.models.I3dHead(in_channels, num_classes=400, dropout_keep_prob=0.5)

I3dHead definition.

- base: nn.Cell

**Parameters:**

- in_channels: Input channel.
- num_classes (int): The number of classes. Default: 400.
- dropout_keep_prob (float): Dropout keep probability. Default: 0.5.

**Return:**

Tensor, output tensor.

### I3D

> class mindvideo.models.I3D(in_channel: int = 3, num_classes: int = 400, keep_prob: float = 0.5, pooling_keep_dim: bool = True, backbone_output_channel=1024)

Constructs an I3D architecture.

- base: nn.Cell

**Parameters:**

- in_channel(int): Number of channels of input data.
  Default: 3.
- num_classes(int): Number of classes; it is the size of the classification score for every sample, i.e. :math:`CLASSES_{out}`. Default: 400.
- keep_prob(float): Dropout keep probability for the multi-dense-layer head; the number of probabilities equals the number of dense layers. Default: 0.5.
- pooling_keep_dim: Whether to keep dimensions when pooling. Default: True.
- pretrained(bool): If `True`, it will create a pretrained model, the pretrained model will be loaded from network. If `False`, it will create an I3D model with uniform initialization for weight and bias. Default: False.

**Inputs:**

- x(Tensor) - Tensor of shape :math:`(N, C_{in}, D_{in}, H_{in}, W_{in})`.

**Return:**

Tensor of shape :math:`(N, CLASSES_{out})`.

### NonLocalBlockND

> class mindvideo.models.NonLocalBlockND(in_channels, inter_channels=None, mode='embedded', sub_sample=True, bn_layer=True)

Classification backbone for nonlocal. Implementation of the Non-Local Block with 4 different pairwise functions.

- base: nn.Cell

**Parameters:**

- in_channels (int): Original channel size.
- inter_channels (int): Channel size inside the block. If not specified, it is reduced to half of in_channels.
- mode: One of 4 pairwise functions (gaussian, embedded, dot, and concatenation).
- bn_layer: Whether to add batch norm.

**Inputs:**

- x(Tensor) - Tensor of shape :math:`(N, C_{in}, D_{in}, H_{in}, W_{in})`.

**Return:**

Tensor of shape :math:`(N, C_{out}, D_{out}, H_{out}, W_{out})`.

### NLInflateBlockBase3D

> class mindvideo.models.NLInflateBlockBase3D(in_channels, inter_channels=None, mode='embedded', sub_sample=True, bn_layer=True)

ResNet residual block base definition.

- base: ResidualBlockBase3D

**Parameters:**

- in_channel (int): Input channel.
- out_channel (int): Output channel.
- stride (int): Stride size for the first convolutional layer. Default: 1.
- group (int): Group convolutions. Default: 1.
- base_width (int): Width of per group. Default: 64.
- norm (nn.Cell, optional): Module specifying the normalization layer to use. Default: None.
- down_sample (nn.Cell, optional): Downsample structure. Default: None.

**Return:**

Tensor, output tensor.

### NLInflateBlock3D

> class mindvideo.models.NLInflateBlock3D(in_channel: int, out_channel: int, conv12: Optional[nn.Cell] = Inflate3D, group: int = 1, base_width: int = 64, norm: Optional[nn.Cell] = None, down_sample: Optional[nn.Cell] = None, non_local: bool = False, non_local_mode: str = 'dot', **kwargs)

ResNet3D residual block definition.

- base: ResidualBlock3D

**Parameters:**

- in_channel (int): Input channel.
- out_channel (int): Output channel.
- stride (int): Stride size for the second convolutional layer. Default: 1.
- group (int): Group convolutions. Default: 1.
- base_width (int): Width of per group. Default: 64.
- norm (nn.Cell, optional): Module specifying the normalization layer to use. Default: None.
- down_sample (nn.Cell, optional): Downsample structure. Default: None.

**Return:**

Tensor, output tensor.

### NLInflateResNet3D

> class mindvideo.models.NLInflateResNet3D(block: Optional[nn.Cell], layer_nums: Tuple[int], stage_channels: Tuple[int] = (64, 128, 256, 512), stage_strides: Tuple[int] = ((1, 1, 1), (1, 2, 2), (1, 2, 2), (1, 2, 2)), down_sample: Optional[nn.Cell] = Unit3D, inflate: Tuple[Tuple[int]] = ((1, 1, 1), (1, 0, 1, 0), (1, 0, 1, 0, 1, 0), (0, 1, 0)), non_local: Tuple[Tuple[int]] = ((0, 0, 0), (0, 1, 0, 1), (0, 1, 0, 1, 0, 1), (0, 0, 0)), **kwargs)

Inflate3D with a ResNet3D backbone and non-local blocks.

- base: ResNet3D

**Parameters:**

- block (Optional[nn.Cell]): The block for the network.
- layer_nums (list): The number of blocks in each layer.
- norm (nn.Cell, optional): The module specifying the normalization layer to use. Default: None.
- stage_strides: Stride size for ResNet3D convolutional layer.
- non_local: Determines whether to apply a non-local block in each block.

**Inputs:**

- x(Tensor) - Tensor of shape :math:`(N, C_{in}, D_{in}, H_{in}, W_{in})`.

**Return:**

Tensor, output tensor.
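The default `inflate` and `non_local` arguments of `NLInflateResNet3D` are nested tuples of 0/1 flags. The sketch below lists which blocks are flagged in each. It assumes that each inner tuple describes one stage and each flag one residual block in that stage, which matches the tuple lengths (3, 4, 6, 3) but is not stated explicitly in this reference. The helper `flagged_blocks` is hypothetical.

```python
# Default values copied from the NLInflateResNet3D signature above.
INFLATE = ((1, 1, 1), (1, 0, 1, 0), (1, 0, 1, 0, 1, 0), (0, 1, 0))
NON_LOCAL = ((0, 0, 0), (0, 1, 0, 1), (0, 1, 0, 1, 0, 1), (0, 0, 0))


def flagged_blocks(flags):
    """Return (stage_index, block_index) pairs whose flag is set.

    Assumption: one inner tuple per stage, one flag per residual block.
    """
    return [
        (stage, block)
        for stage, stage_flags in enumerate(flags)
        for block, flag in enumerate(stage_flags)
        if flag
    ]


print(flagged_blocks(NON_LOCAL))     # [(1, 1), (1, 3), (2, 1), (2, 3), (2, 5)]
print(len(flagged_blocks(INFLATE)))  # 9
```

Read this way, the defaults place non-local blocks only in the two middle stages, on every second block.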
### nonlocal3d

> class mindvideo.models.nonlocal3d(in_d: int = 32, in_h: int = 224, in_w: int = 224, num_classes: int = 400, keep_prob: float = 0.5, backbone: Optional[nn.Cell] = NLResInflate3D50, avg_pool: Optional[nn.Cell] = AdaptiveAvgPool3D, flatten: Optional[nn.Cell] = nn.Flatten, head: Optional[nn.Cell] = DropoutDense)

nonlocal3d model from Xiaolong Wang. "Non-local Neural Networks." https://arxiv.org/pdf/1711.07971v3

- base: nn.Cell

**Parameters:**

- in_d: Depth of input data; it can be considered the number of frames in a video. Default: 32.
- in_h: Height of input frames. Default: 224.
- in_w: Width of input frames. Default: 224.
- num_classes(int): Number of classes; it is the size of the classification score for every sample, i.e. :math:`CLASSES_{out}`. Default: 400.
- pooling_keep_dim: Whether to keep dimensions when pooling. Default: True.
- keep_prob(float): Dropout keep probability for the multi-dense-layer head; the number of probabilities equals the number of dense layers.
- pretrained(bool): If `True`, it will create a pretrained model, the pretrained model will be loaded from network. If `False`, it will create a nonlocal3d model with uniform initialization for weight and bias.
- backbone: Backbone of nonlocal3d.
- avg_pool: Average pooling and flatten.
- head: LinearClsHead architecture.

**Inputs:**

- x(Tensor) - Tensor of shape :math:`(N, C_{in}, D_{in}, H_{in}, W_{in})`.

**Return:**

Tensor of shape :math:`(N, CLASSES_{out})`.

### Conv2Plus1d

> class mindvideo.models.Conv2Plus1d(in_channel, mid_channel, out_channel, kernel_size=(3, 3, 3), stride=(1, 1, 1), norm=nn.BatchNorm3d, activation=nn.ReLU)

R(2+1)d conv12 block. It implements spatial-temporal feature extraction in a separated way.

- base: nn.Cell

**Parameters:**

- in_channel (int): The number of channels of input frame images.
- out_channel (int): The number of channels of output frame images.
- kernel_size (tuple): The size of the spatial-temporal convolutional layer kernels.
- stride (Union[int, Tuple[int]]): Stride size for the convolutional layer. Default: 1.
- group (int): Splits filter into groups; in_channels and out_channels must be divisible by the number of groups. Default: 1.
- norm (Optional[nn.Cell]): Norm layer that will be stacked on top of the convolution layer. Default: nn.BatchNorm3d.
- activation (Optional[nn.Cell]): Activation function which will be stacked on top of the normalization layer (if not None), otherwise on top of the conv layer. Default: nn.ReLU.

**Return:**

Tensor, its channel size is calculated from in_channel, out_channel and kernel_size.

### R2Plus1dNet

> class mindvideo.models.R2Plus1dNet(block: Optional[nn.Cell], layer_nums: Tuple[int], stage_channels: Tuple[int] = (64, 128, 256, 512), stage_strides: Tuple[Tuple[int]] = ((1, 1, 1), (2, 2, 2), (2, 2, 2), (2, 2, 2)), num_classes: int = 400, **kwargs)

Generic R(2+1)d generator.

- base: ResNet3D

**Parameters:**

- block (Optional[nn.Cell]): The block for the network.
- layer_nums (Tuple[int]): The number of blocks in each layer.
- stage_channels (Tuple[int]): Output channel for every res stage. Default: (64, 128, 256, 512).
- stage_strides (Tuple[Tuple[int]]): Strides for every res stage. Default: ((1, 1, 1), (2, 2, 2), (2, 2, 2), (2, 2, 2)).
- conv12 (nn.Cell, optional): Conv1 and conv2 config in resblock. Default: Conv2Plus1D.
- base_width (int): The width of per group. Default: 64.
- norm (nn.Cell, optional): The module specifying the normalization layer to use. Default: None.
- num_classes(int): Number of categories in the action recognition dataset.
- keep_prob(float): Dropout probability in classification stage.
- kwargs (dict, optional): Keyword arguments for "make_res_layer" and resblocks.

**Return:**

Tensor, output tensor.
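`Conv2Plus1d` takes a `mid_channel` argument, and its return note says the channel size is derived from `in_channel`, `out_channel` and `kernel_size`. The R(2+1)D paper picks the intermediate width so that the factorized block has roughly as many parameters as a full 3D convolution. The sketch below implements that paper formula. Whether `mindvideo` computes `mid_channel` exactly this way is an assumption, and the helper `r2plus1d_mid_channel` is hypothetical.

```python
def r2plus1d_mid_channel(in_channel, out_channel, kernel_size=(3, 3, 3)):
    """Intermediate channel count between the spatial and temporal convolutions.

    Uses the parameter-matching rule from the R(2+1)D paper:
        M = floor(t * h * w * N_in * N_out / (h * w * N_in + t * N_out))
    where (t, h, w) is the full 3D kernel size.
    """
    t, h, w = kernel_size
    numerator = t * h * w * in_channel * out_channel
    denominator = h * w * in_channel + t * out_channel
    return numerator // denominator


print(r2plus1d_mid_channel(64, 64))   # 144
print(r2plus1d_mid_channel(64, 128))  # 230
```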
### WindowAttention3D

> class mindvideo.models.WindowAttention3D(in_channels: int = 96, window_size: int = (8, 7, 7), num_head: int = 3, qkv_bias: Optional[bool] = True, qk_scale: Optional[float] = None, attn_kepp_prob: Optional[float] = 1.0, proj_keep_prob: Optional[float] = 1.0)

Window-based multi-head self-attention (W-MSA) module with relative position bias. It supports both shifted and non-shifted windows.

- base: nn.Cell

**Parameters:**

- in_channels (int): Number of input channels.
- window_size (tuple[int]): The depth length, height and width of the window. Default: (8, 7, 7).
- num_head (int): Number of attention heads. Default: 3.
- qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. Default: True.
- qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set. Default: None.
- attn_keep_prob (float, optional): Dropout keep ratio of attention weight. Default: 1.0.
- proj_keep_prob (float, optional): Dropout keep ratio of output. Default: 1.0.

**Inputs:**

- `x` (Tensor) - Tensor of shape (B, N, C).
- `mask` (Tensor) - (0 / -inf) mask with shape of (num_windows, N, N) or None.

**Return:**

Tensor of shape (B, N, C), the same shape as the input **x**.

### SwinTransformerBlock3D

> class mindvideo.models.SwinTransformerBlock3D(embed_dim: int = 96, input_size: int = (16, 56, 56), num_head: int = 3, window_size: int = (8, 7, 7), shift_size: int = (4, 3, 3), mlp_ratio: float = 4., qkv_bias: bool = True, qk_scale: Optional[float] = None, keep_prob: float = 1., attn_keep_prob: float = 1., droppath_keep_prob: float = 1., act_layer: nn.Cell = nn.GELU, norm_layer: str = 'layer_norm')

A Video Swin Transformer Block. The implementation of this block follows the paper "Video Swin Transformer".

- base: nn.Cell

**Parameters:**

- embed_dim (int): input feature's embedding dimension, namely, channel number. Default: 96.
- input_size (int | tuple(int)): input feature size. Default: (16, 56, 56).
- num_head (int): number of attention heads of the current Swin3d block. Default: 3.
- window_size (int): window size of window attention. Default: (8, 7, 7).
- shift_size (tuple[int]): shift size for shifted window attention. Default: (4, 3, 3).
- mlp_ratio (float): ratio of mlp hidden dim to embedding dim. Default: 4.0.
- qkv_bias (bool): if True, add a learnable bias to query, key, value. Default: True.
- qk_scale (float | None, optional): override default qk scale of head_dim ** -0.5 if set. Default: None.
- keep_prob (float): dropout keep probability. Default: 1.0.
- attn_keep_prob (float): units keeping probability for attention dropout. Default: 1.0.
- droppath_keep_prob (float): path keeping probability for stochastic droppath. Default: 1.0.
- act_layer (nn.Cell): activation layer. Default: nn.GELU.
- norm_layer (str): normalization layer. Default: 'layer_norm'.

**Inputs:**

- **x** (Tensor) - Input feature of shape (B, D, H, W, C).
- **mask_matrix** (Tensor) - Attention mask for cyclic shift.

**Return:**

Tensor of shape (B, D, H, W, C).

### PatchMerging

> class mindvideo.models.PatchMerging(dim: int = 96, norm_layer: str = 'layer_norm')

Patch Merging Layer.

- base: nn.Cell

**Parameters:**

- dim (int): Number of input channels.
- norm_layer (str): Normalization layer. Default: 'layer_norm'.

**Inputs:**

- **x** (Tensor) - Input feature of shape (B, D, H, W, C).

**Return:**

Tensor of shape (B, D, H/2, W/2, 2*C).

### SwinTransformerStage3D

> class mindvideo.models.SwinTransformerStage3D(embed_dim=96, input_size=(16, 56, 56), depth=2, num_head=3, window_size=(8, 7, 7), mlp_ratio=4., qkv_bias=True, qk_scale=None, keep_prob=1., attn_keep_prob=1., droppath_keep_prob=0.8, norm_layer='layer_norm', downsample=PatchMerging)

A basic Swin Transformer layer for one stage.

- base: nn.Cell

**Parameters:**

- embed_dim (int): input feature's embedding dimension, namely, channel number. Default: 96.
- input_size (tuple[int]): input feature size. Default: (16, 56, 56).
- depth (int): depth of the current Swin3d stage. Default: 2.
- num_head (int): number of attention heads of the current Swin3d stage. Default: 3.
- window_size (int): window size of window attention. Default: (8, 7, 7).
- mlp_ratio (float): ratio of mlp hidden dim to embedding dim. Default: 4.0.
- qkv_bias (bool): if qkv_bias is True, add a learnable bias into query, key, value matrices. Default: True.
- qk_scale (float | None, optional): override default qk scale of head_dim ** -0.5 if set. Default: None.
- keep_prob (float): dropout keep probability. Default: 1.0.
- attn_keep_prob (float): units keeping probability for attention dropout. Default: 1.0.
- droppath_keep_prob (float): path keeping probability for stochastic droppath. Default: 0.8.
- norm_layer (str): normalization layer. Default: 'layer_norm'.
- downsample (nn.Cell | None, optional): downsample layer at the end of swin3d stage. Default: PatchMerging.

**Inputs:**

A video feature of shape (N, D, H, W, C).

**Return:**

Tensor of shape (N, D, H / 2, W / 2, 2 * C).

### PatchEmbed3D

> class mindvideo.models.PatchEmbed3D(input_size=(16, 224, 224), patch_size=(2, 4, 4), in_channels=3, embed_dim=96, norm_layer='layer_norm', patch_norm=True)

Video to Patch Embedding.

- base: nn.Cell

**Parameters:**

- input_size (tuple[int]): Input feature size.
- patch_size (int): Patch token size. Default: (2, 4, 4).
- in_channels (int): Number of input video channels. Default: 3.
- embed_dim (int): Number of linear projection output channels. Default: 96.
- norm_layer (str, optional): Normalization layer. Default: 'layer_norm'.
- patch_norm (bool): If True, add normalization after patch embedding. Default: True.

**Inputs:**

An original video tensor in data format 'NCDHW'.

**Return:**

An embedded tensor in data format 'NDHWC'.
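The shapes documented for `PatchEmbed3D`, `PatchMerging` and the window attention can be checked with plain arithmetic. The sketch below does so under the assumption that every input dimension divides evenly by the patch and window sizes (any padding the real layers may apply is ignored). All three helper names are hypothetical.

```python
def patch_embed_shape(input_size=(16, 224, 224), patch_size=(2, 4, 4), embed_dim=96, batch=1):
    """'NDHWC' shape after cutting a video into non-overlapping patches."""
    if any(s % p for s, p in zip(input_size, patch_size)):
        raise ValueError("this sketch assumes input_size is divisible by patch_size")
    d, h, w = (s // p for s, p in zip(input_size, patch_size))
    return (batch, d, h, w, embed_dim)


def num_windows(feature_size=(16, 56, 56), window_size=(8, 7, 7)):
    """Number of attention windows per sample, assuming exact divisibility."""
    count = 1
    for s, ws in zip(feature_size, window_size):
        count *= s // ws
    return count


def patch_merging_shape(shape):
    """(B, D, H, W, C) -> (B, D, H/2, W/2, 2*C), as documented for PatchMerging."""
    b, d, h, w, c = shape
    return (b, d, h // 2, w // 2, 2 * c)


print(patch_embed_shape())                 # (1, 8, 56, 56, 96)
print(patch_embed_shape((32, 224, 224)))   # (1, 16, 56, 56, 96)
print(num_windows())                       # 128
print(patch_merging_shape((1, 16, 56, 56, 96)))  # (1, 16, 28, 28, 192)
```

A 32-frame 224x224 clip embeds to a (16, 56, 56) feature grid, which is the default `input_size` of `SwinTransformerBlock3D` and `SwinTransformerStage3D`.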
### SwinTransformer3D

> class mindvideo.models.SwinTransformer3D(input_size=(16, 56, 56), embed_dim=96, depths=(2, 2, 6, 2), num_heads=(3, 6, 12, 24), window_size=(8, 7, 7), mlp_ratio=4., qkv_bias=True, qk_scale=None, keep_prob=1., attn_keep_prob=1., droppath_keep_prob=0.8, norm_layer='layer_norm')

Video Swin Transformer backbone. A MindSpore implementation of `Video Swin Transformer`: http://arxiv.org/abs/2106.13230

- base: nn.Cell

**Parameters:**

- input_size (int | tuple(int)): input feature size. Default: (16, 56, 56).
- embed_dim (int): input feature's embedding dimension, namely, channel number. Default: 96.
- depths (tuple[int]): depths of each Swin3d stage. Default: (2, 2, 6, 2).
- num_heads (tuple[int]): number of attention heads of each Swin3d stage. Default: (3, 6, 12, 24).
- window_size (int): window size of window attention. Default: (8, 7, 7).
- mlp_ratio (float): ratio of mlp hidden dim to embedding dim. Default: 4.0.
- qkv_bias (bool): if qkv_bias is True, add a learnable bias into query, key, value matrices. Default: True.
- qk_scale (float | None, optional): override default qk scale of head_dim ** -0.5 if set. Default: None.
- keep_prob (float): dropout keep probability. Default: 1.0.
- attn_keep_prob (float): units keeping probability for attention dropout. Default: 1.0.
- droppath_keep_prob (float): path keeping probability for stochastic droppath. Default: 0.8.
- norm_layer (str): normalization layer. Default: 'layer_norm'.

**Inputs:**

- **x** (Tensor) - Tensor of shape 'NDHWC'.

**Return:**

Tensor of shape 'NCDHW'.

### Swin3D

> class mindvideo.models.Swin3D(input_size=(16, 56, 56), embed_dim=96, depths=(2, 2, 6, 2), num_heads=(3, 6, 12, 24), window_size=(8, 7, 7), mlp_ratio=4., qkv_bias=True, qk_scale=None, keep_prob=1., attn_keep_prob=1., droppath_keep_prob=0.8, norm_layer='layer_norm')

Constructs a swin3d architecture corresponding to `Video Swin Transformer`.
- base: nn.Cell

**Parameters:**

- num_classes (int): The number of classes. Default: 400.
- patch_size (int): Patch size used by patch embedding. Default: (2, 4, 4).
- window_size (int): Window size used by window attention. Default: (8, 7, 7).
- embed_dim (int): Embedding dimension of the feature generated by the patch embedding layer. Default: 96.
- depths (int): Depths of each stage in Swin3d Tiny module. Default: (2, 2, 6, 2).
- num_heads (int): Numbers of heads of each stage in Swin3d Tiny module. Default: (3, 6, 12, 24).
- representation_size (int): Feature dimension of the last layer in backbone. Default: 768.
- droppath_keep_prob (float): The drop path keep probability. Default: 0.9.
- input_size (int | tuple(int)): Input feature size. Default: (32, 224, 224).
- in_channels (int): Input channels. Default: 3.
- mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. Default: 4.0.
- qkv_bias (bool): If qkv_bias is True, add a learnable bias into query, key, value matrices. Default: True.
- qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set. Default: None.
- keep_prob (float): Dropout keep probability. Default: 1.0.
- attn_keep_prob (float): Keeping probability for attention dropout. Default: 1.0.
- norm_layer (str): Normalization layer. Default: 'layer_norm'.
- patch_norm (bool): If True, add normalization after patch embedding. Default: True.
- pooling_keep_dim (bool): Specifies whether to keep dimension shape the same as input feature. Default: False.
- head_bias (bool): Specifies whether the head uses a bias vector. Default: True.
- head_activation (Union[str, Cell, Primitive]): Activation function applied in the head. Default: None.
- head_keep_prob (float): Dropout keep probability of the head, in the range [0, 1]. Default: 0.5.

**Inputs:**

- **x** (Tensor) - Tensor of shape :math:`(N, C_{in}, D_{in}, H_{in}, W_{in})`.
**Return:**

Tensor of shape :math:`(N, CLASSES_{out})`.

### swin3d_t

> def mindvideo.models.swin3d_t(num_classes: int = 400, patch_size: int = (2, 4, 4), window_size: int = (8, 7, 7), embed_dim: int = 96, depths: int = (2, 2, 6, 2), num_heads: int = (3, 6, 12, 24), representation_size: int = 768, droppath_keep_prob: float = 0.9)

Video Swin Transformer Tiny (swin3d-T) model.

**Parameters:**

- num_classes (int): Number of categories.
- patch_size (int): Size of swin3d patch segmentation.
- window_size (int): Size of swin3d window.
- embed_dim (int): Dimension output by the patch embedding.
- depths (int): Depth of each stage.
- num_heads (int): Number of heads in window attention.
- representation_size (int): Size of features output at the last layer of backbone.
- droppath_keep_prob (float): Retention probability of drop path.

**Returns:**

swin3d_t: nn.Cell

### swin3d_s

> def mindvideo.models.swin3d_s(num_classes: int = 400, patch_size: int = (2, 4, 4), window_size: int = (8, 7, 7), embed_dim: int = 96, depths: int = (2, 2, 18, 2), num_heads: int = (3, 6, 12, 24), representation_size: int = 768, droppath_keep_prob: float = 0.9)

Video Swin Transformer Small (swin3d-S) model.

**Parameters:**

- num_classes (int): Number of categories.
- patch_size (int): Size of swin3d patch segmentation.
- window_size (int): Size of swin3d window.
- embed_dim (int): Dimension output by the patch embedding.
- depths (int): Depth of each stage.
- num_heads (int): Number of heads in window attention.
- representation_size (int): Size of features output at the last layer of backbone.
- droppath_keep_prob (float): Retention probability of drop path.

**Returns:**

swin3d_s: nn.Cell

### swin3d_b

> def mindvideo.models.swin3d_b(num_classes: int = 400, patch_size: int = (2, 4, 4), window_size: int = (8, 7, 7), embed_dim: int = 128, depths: int = (2, 2, 18, 2), num_heads: int = (4, 8, 16, 32), representation_size: int = 1024, droppath_keep_prob: float = 0.7)

Video Swin Transformer Base (swin3d-B) model.
**Parameters:**

- num_classes (int): Number of categories.
- patch_size (int): Size of swin3d patch segmentation.
- window_size (int): Size of swin3d window.
- embed_dim (int): Dimension output by the patch embedding.
- depths (int): Depth of each stage.
- num_heads (int): Number of heads in window attention.
- representation_size (int): Size of features output at the last layer of backbone.
- droppath_keep_prob (float): Retention probability of drop path.

**Returns:**

swin3d_b: nn.Cell

### swin3d_l

> def mindvideo.models.swin3d_l(num_classes: int = 400, patch_size: int = (2, 4, 4), window_size: int = (8, 7, 7), embed_dim: int = 192, depths: int = (2, 2, 18, 2), num_heads: int = (6, 12, 24, 48), representation_size: int = 1536, droppath_keep_prob: float = 0.9)

Video Swin Transformer Large (swin3d-L) model.

**Parameters:**

- num_classes (int): Number of categories.
- patch_size (int): Size of swin3d patch segmentation.
- window_size (int): Size of swin3d window.
- embed_dim (int): Dimension output by the patch embedding.
- depths (int): Depth of each stage.
- num_heads (int): Number of heads in window attention.
- representation_size (int): Size of features output at the last layer of backbone.
- droppath_keep_prob (float): Retention probability of drop path.

**Returns:**

swin3d_l: nn.Cell

### GroupNorm3d

> class mindvideo.models.GroupNorm3d(num_groups, num_channels, eps=1e-05, affine=True, gamma_init='ones', beta_init='zeros')

Modified from mindspore.nn.GroupNorm to add a depth dimension.

- base: nn.Cell

**Parameters:**

- num_groups (int): Number of groups to be divided along the channel dimension.
- num_channels (int): Number of channels.
- eps (float): The value added to the denominator.
- affine (bool): When set to True, a learnable affine transformation parameter is added to the layer.
- gamma_init (str): Method of initializing the gamma parameter.
- beta_init (str): Method of initializing the beta parameter.

**Return:**

Tensor, output tensor.
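Group normalization over a 3D feature computes one mean and variance per group of channels, across all depth, height and width positions in that group. The sketch below shows this on plain Python lists for a single sample. It illustrates the general technique only, not `GroupNorm3d`'s actual code. The function name and the flattened input layout are assumptions, and the affine step is omitted because the default `gamma_init='ones'` and `beta_init='zeros'` make it an identity at initialization.

```python
import math


def group_norm_3d(x, num_groups, eps=1e-5):
    """Group-normalize one sample.

    x: list of C channels, each a flat list of its D*H*W values
       (a hypothetical layout chosen to keep the sketch dependency-free).
    """
    num_channels = len(x)
    if num_channels % num_groups:
        raise ValueError("num_channels must be divisible by num_groups")
    per_group = num_channels // num_groups
    out = []
    for g in range(num_groups):
        channels = x[g * per_group:(g + 1) * per_group]
        # Statistics are shared by every value of every channel in the group.
        values = [v for channel in channels for v in channel]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * scale for v in channel] for channel in channels])
    return out


sample = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [10.0, 10.0]]  # C=4, D*H*W=2
normalized = group_norm_3d(sample, num_groups=2)
print(normalized[2])  # [0.0, 0.0]: a constant group normalizes to zeros
```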
### VistrCom

> class mindvideo.models.VistrCom(name: str = 'ResNet50', train_embeding: bool = True, num_queries: int = 360, num_pos_feats: int = 64, num_frames: int = 36, temperature: int = 10000, normalize: bool = True, scale: float = None, hidden_dim: int = 384, d_model: int = 384, nhead: int = 8, num_encoder_layers: int = 6, num_decoder_layers: int = 6, dim_feedforward: int = 2048, dropout: int = 0.1, activation: str = "relu", normalize_before: bool = False, return_intermediate_dec: bool = True, aux_loss: bool = True, num_class: int = 41)

VisTR architecture.

- base: nn.Cell

**Parameters:**

- name (str): The type of ResNet.
- train_embeding (bool): Whether to train the embedding.
- num_queries (int): Number of instances.
- num_pos_feats (int): The encoding length of each dimension.
- num_frames (int): Number of frames.
- temperature (int): Coefficient.
- normalize (bool): Whether to normalize. If True, normalize.
- scale (float): Coefficient.
- hidden_dim (int): Dimensions required by the input vector in the encoder.
- d_model (int): Number of expected features entered by the backbone.
- nhead (int): Number of heads in multi-head attention.
- num_encoder_layers (int): Number of encoder layers.
- num_decoder_layers (int): Number of decoder layers.
- dim_feedforward (int): Dimensions of the feedforward network model in the backbone.
- dropout (float): Value of dropout.
- activation (str): Activation function.
- normalize_before (bool): Whether normalization is applied before (rather than after) each layer.
- return_intermediate_dec (bool): Whether to return intermediate output.
- aux_loss (bool): Whether to calculate the loss of the middle layer.
- num_class (int): Number of categories.

**Return:** Tensor, output tensor.
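The `num_pos_feats`, `temperature`, `normalize`, and `scale` parameters match the knobs of the sine positional encoding used in DETR-style transformers. Assuming VistrCom follows that scheme (this page does not confirm it), the encoding along a single axis can be sketched in plain Python. The function name `sine_position_encoding`, the `eps` argument, and the fallback `scale = 2π` are assumptions borrowed from the DETR convention, not taken from mindvideo.

```python
import math


def sine_position_encoding(length, num_pos_feats=64, temperature=10000,
                           normalize=True, scale=None, eps=1e-6):
    """Return a length x num_pos_feats table of sine/cosine position codes."""
    if scale is None:
        scale = 2 * math.pi  # assumed default, as in DETR
    # Positions count from 1, as a cumulative sum over a full mask would.
    positions = [float(p + 1) for p in range(length)]
    if normalize:
        last = positions[-1]
        positions = [p / (last + eps) * scale for p in positions]

    table = []
    for p in positions:
        row = []
        for i in range(num_pos_feats):
            # Each sin/cos pair shares one frequency, set by the temperature.
            dim_t = temperature ** (2 * (i // 2) / num_pos_feats)
            angle = p / dim_t
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# One code per frame for the documented default of 36 frames.
frame_codes = sine_position_encoding(36)
```

A larger `temperature` stretches the wavelengths of the higher feature indices, while `normalize` and `scale` map the positions onto a fixed range before the sine and cosine are taken.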
### BlockX3D

> class mindvideo.models.BlockX3D(in_channel, out_channel, conv12: Optional[nn.Cell] = Inflate3D, inflate: int = 2, norm: Optional[nn.Cell] = None, down_sample: Optional[nn.Cell] = None, block_idx: int = 0, se_ratio: float = 0.0625, use_swish: bool = True, drop_connect_rate: float = 0.0, bottleneck_factor: float = 2.25, **kwargs)

3D building block for X3D.

- base: ResidualBlock3D

**Parameters:**

- in_channel (int): Input channel.
- out_channel (int): Output channel.
- conv12 (nn.Cell, optional): Block that constructs the first two conv layers. It can be `Inflate3D`, `Conv2Plus1D`, or another custom block; the block should construct a layer whose output feature channel size is named `mid_channel`, for use by the third conv layer. Default: Inflate3D.
- inflate (int): Whether to inflate kernel.
- spatial_stride (int): Spatial stride in the conv3d layer. Default: 1.
- down_sample (nn.Cell | None): DownSample layer. Default: None.
- block_idx (int): The id of the block.
- se_ratio (float | None): The reduction ratio of the squeeze and excitation unit. If set to None, the SE unit is not used. Default: 0.0625.
- use_swish (bool): Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.
- drop_connect_rate (float): Dropout rate. If equal to 0.0, perform no dropout.
- bottleneck_factor (float): Bottleneck expansion factor for the 3x3x3 conv.

**Return:** Tensor, output tensor.

### ResNetX3D

> class mindvideo.models.ResNetX3D(block: Optional[nn.Cell], layer_nums: Tuple[int], stage_channels: Tuple[int], stage_strides: Tuple[Tuple[int]], drop_rates: Tuple[float], down_sample: Optional[nn.Cell] = Unit3D, bottleneck_factor: float = 2.25)

X3D backbone definition.

- base: ResNet3D

**Parameters:**

- block (Optional[nn.Cell]): The block for the network.
- layer_nums (Tuple[int]): The number of blocks in each layer.
- stage_channels (Tuple[int]): Output channel for every res stage.
- stage_strides (Tuple[Tuple[int]]): Stride size for the ResNet3D convolutional layers.
- drop_rates (Tuple[float]): Drop rates for the different blocks. The base rate at which blocks are dropped increases linearly from input to output blocks.
- down_sample (Optional[nn.Cell]): Downsampling block used in every residual block; it transforms the input feature to the same number of channels as the output. Default: Unit3D.
- bottleneck_factor (float): Bottleneck expansion factor for the 3x3x3 conv.
- fc_init_std (float): The std to initialize the fc layer(s).

**Inputs:**

- **x** (Tensor) - Tensor of shape :math:`(N, C_{in}, D_{in}, H_{in}, W_{in})`.

**Return:** Tensor, output tensor.

### X3DHead

> class mindvideo.models.X3DHead(pool_size, input_channel, out_channel=2048, num_classes=400, dropout_rate=0.5)

X3D head architecture.

- base: nn.Cell

**Parameters:**

- input_channel (int): The number of input channels.
- out_channel (int): The number of inner channels. Default: 2048.
- num_classes (int): Number of classes. Default: 400.
- dropout_rate (float): Dropout keeping rate, between [0, 1]. Default: 0.5.

**Return:** Tensor

### x3d

> class mindvideo.models.x3d(block: Type[BlockX3D], depth_factor: float, num_frames: int, train_crop_size: int, num_classes: int, dropout_rate: float, bottleneck_factor: float = 2.25, eval_with_clips: bool = False)

X3D architecture. Christoph Feichtenhofer. "X3D: Expanding Architectures for Efficient Video Recognition." https://arxiv.org/abs/2004.04730

- base: nn.Cell

**Parameters:**

- block (Type[BlockX3D]): The block of X3D.
- depth_factor (float): Depth expansion factor.
- num_frames (int): The number of frames of the input clip.
- train_crop_size (int): The spatial crop size for training.
- num_classes (int): The channel dimensions of the output.
- dropout_rate (float): Dropout rate. If equal to 0.0, perform no dropout.
- bottleneck_factor (float): Factor of bottleneck.
- eval_with_clips (bool): Whether to evaluate with clips. Set to True when evaluating with clips.
**Inputs:**

- **x** (Tensor) - Tensor of shape :math:`(N, C_{in}, D_{in}, H_{in}, W_{in})`.

**Return:** Tensor of shape :math:`(N, CLASSES_{out})`

### x3d_m

> def mindvideo.models.x3d_m(num_classes: int = 400, dropout_rate: float = 0.5, depth_factor: float = 2.2, num_frames: int = 16, train_crop_size: int = 224, eval_with_clips: bool = False)

X3D medium (X3D-M) model.

**Parameters:**

- num_classes (int): the channel dimensions of the output.
- dropout_rate (float): dropout rate. If equal to 0.0, perform no dropout.
- depth_factor (float): Depth expansion factor.
- num_frames (int): The number of frames of the input clip.
- train_crop_size (int): The spatial crop size for training.

**Inputs:**

- **x** (Tensor) - Tensor of shape :math:`(N, C_{in}, D_{in}, H_{in}, W_{in})`.

**Return:** Tensor of shape :math:`(N, CLASSES_{out})`

### x3d_s

> def mindvideo.models.x3d_s(num_classes: int = 400, dropout_rate: float = 0.5, depth_factor: float = 2.2, num_frames: int = 13, train_crop_size: int = 160, eval_with_clips: bool = False)

X3D small model.

**Parameters:**

- num_classes (int): the channel dimensions of the output.
- dropout_rate (float): dropout rate. If equal to 0.0, perform no dropout.
- depth_factor (float): Depth expansion factor.
- num_frames (int): The number of frames of the input clip.
- train_crop_size (int): The spatial crop size for training.

**Inputs:**

- **x** (Tensor) - Tensor of shape :math:`(N, C_{in}, D_{in}, H_{in}, W_{in})`.

**Return:** Tensor of shape :math:`(N, CLASSES_{out})`

### x3d_xs

> def mindvideo.models.x3d_xs(num_classes: int = 400, dropout_rate: float = 0.5, depth_factor: float = 2.2, num_frames: int = 4, train_crop_size: int = 160, eval_with_clips: bool = False)

X3D x-small model.

**Parameters:**

- num_classes (int): the channel dimensions of the output.
- dropout_rate (float): dropout rate. If equal to 0.0, perform no dropout.
- depth_factor (float): Depth expansion factor.
- num_frames (int): The number of frames of the input clip.
- train_crop_size (int): The spatial crop size for training.

**Inputs:**

- **x** (Tensor) - Tensor of shape :math:`(N, C_{in}, D_{in}, H_{in}, W_{in})`.

**Return:** Tensor of shape :math:`(N, CLASSES_{out})`

### x3d_l

> def mindvideo.models.x3d_l(num_classes: int = 400, dropout_rate: float = 0.5, depth_factor: float = 5.0, num_frames: int = 16, train_crop_size: int = 312, eval_with_clips: bool = False)

X3D large model.

**Parameters:**

- num_classes (int): the channel dimensions of the output.
- dropout_rate (float): dropout rate. If equal to 0.0, perform no dropout.
- depth_factor (float): Depth expansion factor.
- num_frames (int): The number of frames of the input clip.
- train_crop_size (int): The spatial crop size for training.

**Inputs:**

- **x** (Tensor) - Tensor of shape :math:`(N, C_{in}, D_{in}, H_{in}, W_{in})`.

**Return:** Tensor of shape :math:`(N, CLASSES_{out})`
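The four x3d_* builders differ only in their default `depth_factor`, `num_frames`, and `train_crop_size`. The sketch below collects those documented defaults and derives the input and output shapes each variant would expect. It assumes that `D_in` equals `num_frames`, that crops are square (`H_in = W_in = train_crop_size`), and that clips are RGB (`C_in = 3`); none of this is stated on this page. `X3D_DEFAULTS` and `expected_shapes` are illustrative names and are not part of mindvideo.

```python
# Defaults copied from the x3d_* signatures documented above.
X3D_DEFAULTS = {
    "x3d_xs": {"depth_factor": 2.2, "num_frames": 4, "train_crop_size": 160},
    "x3d_s": {"depth_factor": 2.2, "num_frames": 13, "train_crop_size": 160},
    "x3d_m": {"depth_factor": 2.2, "num_frames": 16, "train_crop_size": 224},
    "x3d_l": {"depth_factor": 5.0, "num_frames": 16, "train_crop_size": 312},
}


def expected_shapes(variant, batch_size=1, in_channels=3, num_classes=400):
    """Return (input_shape, output_shape) for a variant's default settings.

    Assumes D_in = num_frames and square crops of train_crop_size pixels.
    """
    cfg = X3D_DEFAULTS[variant]
    size = cfg["train_crop_size"]
    input_shape = (batch_size, in_channels, cfg["num_frames"], size, size)
    output_shape = (batch_size, num_classes)
    return input_shape, output_shape


for name in X3D_DEFAULTS:
    print(name, *expected_shapes(name))
```

A table like this is handy for checking that a data pipeline produces clips of the right length and crop size before a model is built.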