mindvideo.models¶
SpatialAttention¶
class mindvideo.models.SpatialAttention(in_channels: int = 64, out_channels: int = 16)
Spatial attention unit that refines the aggregation step by re-weighting block contributions.
base: nn.Cell
Parameters:
in_channels: The number of channels of the input feature.
out_channels: The number of channels of the output of hidden layers.
Return:
Tensor of shape (1, 1, H, W).
SimilarityNetwork¶
class mindvideo.models.SimilarityNetwork(in_channels=2, out_channels=64, input_size=64, hidden_size=8)
Similarity learning between query and support clips as paired relation descriptors for RelationNetwork.
base: nn.Cell
Parameters:
in_channels (int): Number of channels of the input feature. Default: 2.
out_channels (int): Number of channels of the output feature. Default: 64.
input_size (int): Size of input features. Default: 64.
hidden_size (int): Number of channels in the hidden fc layers. Default: 8.
Return:
Tensor, output tensor.
ARNEmbedding¶
class mindvideo.models.ARNEmbedding(support_num_per_class: int = 1, query_num_per_class: int = 1, class_num: int = 5, is_c3d: bool = True, in_channels: Optional[int] = 3, out_channels: Optional[int] = 64)
Embedding for ARN based on Unit3d-built 4-layer Conv or C3d.
base: nn.Cell
Parameters:
support_num_per_class (int): Number of samples in support set per class. Default: 1.
query_num_per_class (int): Number of samples in query set per class. Default: 1.
class_num (int): Number of classes. Default: 5.
is_c3d (bool): Specifies whether the network uses C3D as embedding for ARN. Default: False.
in_channels: The number of channels of the input feature. Default: 3.
out_channels: The number of channels of the output of hidden layers (only used when is_c3d is set to False). Default: 64.
Return:
Two output tensors.
ARNBackbone¶
class mindvideo.models.ARNBackbone(jigsaw: int = 10, support_num_per_class: int = 1, query_num_per_class: int = 1, class_num: int = 5, seq: int = 16)
ARN architecture.
base: nn.Cell
Parameters:
jigsaw (int): Output dimension of the spatial-temporal jigsaw discriminator. Default: 10.
support_num_per_class (int): Number of samples in support set per class. Default: 1.
query_num_per_class (int): Number of samples in query set per class. Default: 1.
class_num (int): Number of classes. Default: 5.
Return:
Two output tensors.
ARNNeck¶
class mindvideo.models.ARNNeck(class_num: int = 5, support_num_per_class: int = 1, sigma: int = 100)
ARN neck architecture.
base: nn.Cell
Parameters:
class_num (int): Number of classes. Default: 5.
support_num_per_class (int): Number of samples in support set per class. Default: 1.
sigma (int): Controls the slope of Power Normalization (PN). Default: 100.
Return:
Two output tensors.
def mindvideo.models.ARNNeck.power_norm(x)
Define the operation of Power Normalization.
Parameters:
x (Tensor): Tensor of shape :math:(C_{in}, C_{in}).
Return:
Tensor of shape :math:(C_{out}, C_{out}).
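As a rough illustration of what a sigma-controlled Power Normalization can look like, the sketch below applies an element-wise sigmoid-style map, 2 * sigmoid(sigma * x) - 1. This particular formula is an assumption chosen for illustration, not a statement of what ARNNeck.power_norm computes; the function name sigmoid_power_norm is hypothetical.

```python
import math


def sigmoid_power_norm(matrix, sigma=100.0):
    """Element-wise 2 * sigmoid(sigma * x) - 1 (assumed PN form, for illustration)."""
    # 2 / (1 + exp(-s)) - 1 equals tanh(s / 2); tanh avoids overflow for large |s|.
    return [[math.tanh(sigma * value / 2.0) for value in row] for row in matrix]


# A small square input, matching the (C_in, C_in) shape convention above.
features = [[0.0, 0.01], [-0.01, 1.0]]
pn_features = sigmoid_power_norm(features, sigma=100.0)
```

The output keeps the input shape, every entry lies in [-1, 1], and a larger sigma makes the curve steeper around zero.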
ARNHead¶
class mindvideo.models.ARNHead(class_num: int = 5, query_num_per_class: int = 1)
ARN head architecture.
base: nn.Cell
Parameters:
class_num (int): Number of classes. Default: 5.
query_num_per_class (int): Number of query samples per class. Default: 1.
Return:
Tensor, output tensors.
ARN¶
class mindvideo.models.ARN(support_num_per_class: int = 1, query_num_per_class: int = 1, class_num: int = 5, is_c3d: bool = False, in_channels: Optional[int] = 3, out_channels: Optional[int] = 64, jigsaw: int = 10, sigma: int = 100)
Constructs an ARN architecture from Few-shot Action Recognition via Permutation-invariant Attention <https://arxiv.org/pdf/2001.03905.pdf>.
base: nn.Cell
Parameters:
support_num_per_class (int): Number of samples in support set per class. Default: 1.
query_num_per_class (int): Number of samples in query set per class. Default: 1.
class_num (int): Number of classes. Default: 5.
is_c3d (bool): Specifies whether the network uses C3D as embedding for ARN. Default: False.
in_channels: The number of channels of the input feature. Default: 3.
out_channels: The number of channels of the output of hidden layers (only used when is_c3d is set to False). Default: 64.
jigsaw (int): Output dimension of the spatial-temporal jigsaw discriminator. Default: 10.
sigma (int): Controls the slope of Power Normalization (PN). Default: 100.
Inputs:
x(Tensor): Tensor of shape :math:(E, N, C_{in}, D_{in}, H_{in}, W_{in}).
Return:
Tensor of shape :math:(CLASSES_NUM, CLASSES_{out})
C3D¶
class mindvideo.models.C3D(in_d: int = 16, in_h: int = 112, in_w: int = 112, in_channel: int = 3, kernel_size: Union[int, Tuple[int]] = (3, 3, 3), head_channel: Union[int, Tuple[int]] = (4096, 4096), num_classes: int = 400, keep_prob: Union[float, Tuple[float]] = (0.5, 0.5, 1.0))
Constructs a C3D architecture.
base: nn.Cell
Parameters:
in_d: Depth of input data, it can be considered as frame number of a video. Default: 16.
in_h: Height of input frames. Default: 112.
in_w: Width of input frames. Default: 112.
in_channel(int): Number of channel of input data. Default: 3.
kernel_size(Union[int, Tuple[int]]): Kernel size for every conv3d layer in C3D. Default: (3, 3, 3).
head_channel(Tuple[int]): Hidden size of multi-dense-layer head. Default: (4096, 4096).
num_classes(int): Number of classes, i.e. the size of the classification score for every sample, :math:CLASSES_{out}. Default: 400.
keep_prob(Tuple[float]): Dropout keep probability for each dense layer of the head; the number of probabilities equals the number of dense layers. Default: (0.5, 0.5, 1.0).
pretrained(bool): If True, it will create a pretrained model loaded from the network. If False, it will create a C3D model with uniform initialization for weight and bias.
Inputs:
x(Tensor): Tensor of shape :math:
(N, C_{in}, D_{in}, H_{in}, W_{in}).
Return:
Tensor of shape :math:(N, CLASSES_{out}).
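To make the relationship between head_channel, num_classes and keep_prob concrete, here is a minimal sketch that pairs each dense layer with its keep probability. It assumes the final classifier counts as a dense layer (which is consistent with the three-entry default) and takes the flattened backbone feature size as a hypothetical flat_features argument; the helper dense_head_plan is not part of mindvideo.

```python
def dense_head_plan(flat_features, head_channel=(4096, 4096), num_classes=400,
                    keep_prob=(0.5, 0.5, 1.0)):
    """Return (in_features, out_features, keep_prob) for each dense layer."""
    # Hidden layers first, then the classifier (assumed to be the last dense layer).
    out_sizes = list(head_channel) + [num_classes]
    if len(keep_prob) != len(out_sizes):
        raise ValueError("keep_prob needs one entry per dense layer")
    in_sizes = [flat_features] + out_sizes[:-1]
    return list(zip(in_sizes, out_sizes, keep_prob))


# flat_features=8192 is an illustrative value, not read from the library.
plan = dense_head_plan(flat_features=8192)
```

With the defaults, the plan has three rows and the last one maps 4096 features to 400 class scores with a keep probability of 1.0, i.e. no dropout.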
BasicBlock¶
class mindvideo.models.BasicBlock(cin, cout, stride=1, dilation=1)
Basic residual block for DLA.
base: nn.Cell
Parameters:
cin(int): Input channel.
cout(int): Output channel.
stride(int): Convolution stride. Default: 1.
dilation(int): The dilation rate to be used for dilated convolution. Default: 1.
Return:
Tensor, the feature after convolution.
Root¶
class mindvideo.models.Root(in_channels, out_channels, kernel_size, residual)
HDA node that acts as the root of the tree in each stage.
base: nn.Cell
Parameters:
in_channels(int): Input channel.
out_channels(int): Output channel.
kernel_size(int): Convolution kernel size.
residual(bool): Add residual or not.
Return:
Tensor, HDA node after aggregation.
Tree¶
class mindvideo.models.Tree(levels, block, in_channels, out_channels, stride=1, level_root=False, root_dim=0, root_kernel_size=1, dilation=1, root_residual=False)
Construct the deep aggregation network recursively. Each stage can be seen as a tree with multiple children.
base: nn.Cell
Parameters:
levels(list int): Tree height of each stage.
block(Cell): Basic block of the tree.
in_channels(list int): Input channel of each stage.
out_channels(list int): Output channel of each stage.
stride(int): Convolution stride. Default: 1.
level_root(bool): Whether is the root of tree or not. Default: False.
root_dim(int): Input channel of the root node. Default: 0.
root_kernel_size(int): Convolution kernel size at the root. Default: 1.
dilation(int): The dilation rate to be used for dilated convolution. Default: 1.
root_residual(bool): Add residual or not. Default: False.
Return:
Tensor, the root IDA node.
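The recursive construction can be pictured with a small counting sketch. It assumes the usual DLA recursion: a tree of height 1 holds two basic blocks joined by one root node, and a taller tree holds two sub-trees that are each one level shorter. The helper count_tree_nodes is hypothetical and only mirrors that assumed structure.

```python
def count_tree_nodes(levels):
    """Return (basic_blocks, root_nodes) for a tree of the given height."""
    if levels < 1:
        raise ValueError("levels must be at least 1")
    if levels == 1:
        # Base case: two basic blocks aggregated by one root node.
        return 2, 1
    # Recursive case: two sub-trees, each one level shorter.
    blocks, roots = count_tree_nodes(levels - 1)
    return 2 * blocks, 2 * roots


for height in (1, 2, 3):
    print(height, count_tree_nodes(height))
```

Under this assumption the number of basic blocks doubles with every extra level, which is why the stage heights stay small.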
DLA34¶
class mindvideo.models.DLA34(levels, channels, block=None, residual_root=False)
Construct the downsampling deep aggregation network.
base: nn.Cell
Parameters:
levels(list int): Tree height of each stage.
channels(list int): Input channel of each stage
block(Cell): Initial basic block. Default: BasicBlock.
residual_root(bool): Add residual or not. Default: False
Return:
tuple of Tensor, the root node of each stage.
DlaDeformConv¶
class mindvideo.models.DlaDeformConv(cin, cout)
Deformable convolution v2 with bn and relu.
base: nn.Cell
Parameters:
cin(int): Input channel.
cout(int): Output channel.
Return:
Tensor, results after deformable convolution and activation.
IDAUp¶
class mindvideo.models.IDAUp(out, channels, up_f)
IDAUp sample.
base: nn.Cell
Return:
List.
DLAUp¶
class mindvideo.models.DLAUp(startp, channels, scales, in_channels=None)
DLAUp sample.
base: nn.Cell
Return:
List.
DLASegConv¶
class mindvideo.models.DLASegConv(down_ratio: int, last_level: int, out_channel: int = 0, stage_levels: Tuple[int] = (1, 1, 1, 2, 2, 1), stage_channels: Tuple[int] = (16, 32, 64, 128, 256, 512))
The DLA backbone network.
base: nn.Cell
Parameters:
down_ratio(int): The ratio of input and output resolution
last_level(int): The ending stage of the final upsampling
stage_levels(tuple[int]): The tree height of each stage block
stage_channels(tuple[int]): The feature channel of each stage
Return:
Tensor, the feature map extracted by dla network
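Since down_ratio is the ratio of input to output resolution, the spatial size of the returned feature map can be estimated as below. This is a sketch under the assumption that the input height and width are divisible by down_ratio; the helper name and the 608 x 1088 input size are illustrative only.

```python
def dla_output_size(height, width, down_ratio=4):
    """Spatial size of the feature map for a given input size and down_ratio."""
    if down_ratio < 1:
        raise ValueError("down_ratio must be positive")
    if height % down_ratio or width % down_ratio:
        raise ValueError("input size should be divisible by down_ratio")
    return height // down_ratio, width // down_ratio


# Illustrative input resolution; not a value taken from the library.
feature_hw = dla_output_size(608, 1088, down_ratio=4)
```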
FairmotDla34¶
class mindvideo.models.FairmotDla34(down_ratio: int, last_level: int, out_channel: int = 0, stage_levels: Tuple[int] = (1, 1, 1, 2, 2, 1), stage_channels: Tuple[int] = (16, 32, 64, 128, 256, 512))
Constructs a Fairmot architecture.
base: nn.Cell
Parameters:
down_ratio(int): Output stride. Currently only supports 4. Default: 4.
last_level(int): Last level of dla layers used for deep layer aggregation(DLA) module. Default: 5.
head_channel(int): Channel of input of second conv2d layer in heads. Default: 256.
head_conv2_ksize(Union[int, Tuple]): Kernel size of second conv2d layer. Default: 1.
hm(int): Number of heatmap channels. Default: 1.
wh(int): Dimension of offset and size output, i.e. position of bbox, it equals 4 if regress left, top, right, bottom of bbox, else 2. Default: 4.
feature_id(int): Dimension of identity embedding. Default: 128.
reg(int): Dimension of local offset. Default: 2.
pretrained(bool): If True, it will create a pretrained model loaded from the network. If False, it will create a fairmot model with default initialization. Default: False.
Inputs:
x(Tensor) - Tensor of shape :math:(N, C_{in}, D_{in}, H_{in}, W_{in}).
Return:
Tensor of shape :math:(N, CLASSES_{out}).
Inception3dModule¶
class mindvideo.models.Inception3dModule(in_channels, out_channels)
Inception3dModule definition.
base: nn.Cell
Parameters:
in_channels (int): The number of channels of input frame images.
out_channels (int): The number of channels of output frame images.
Return:
Tensor, output tensor.
InceptionI3d¶
class mindvideo.models.InceptionI3d(in_channels=3)
InceptionI3d architecture.
base: nn.Cell
Parameters:
in_channels (int): The number of channels of input frame images. Default: 3.
Return:
Tensor, output tensor.
I3dHead¶
class mindvideo.models.I3dHead(in_channels, num_classes=400, dropout_keep_prob=0.5)
I3dHead definition.
base: nn.Cell
Parameters:
in_channels: Input channel.
num_classes (int): The number of classes. Default: 400.
dropout_keep_prob (float): Dropout keep probability. Default: 0.5.
Return:
Tensor, output tensor.
I3D¶
class mindvideo.models.I3D(in_channel: int = 3, num_classes: int = 400, keep_prob: float = 0.5, pooling_keep_dim: bool = True, backbone_output_channel=1024)
Constructs an I3D architecture.
base: nn.Cell
Parameters:
in_channel(int): Number of channel of input data. Default: 3.
num_classes(int): Number of classes, i.e. the size of the classification score for every sample, :math:CLASSES_{out}. Default: 400.
keep_prob(float): Dropout keep probability for the dense-layer head. Default: 0.5.
pooling_keep_dim: Whether to keep dim when pooling. Default: True.
pretrained(bool): If True, it will create a pretrained model loaded from the network. If False, it will create an I3D model with uniform initialization for weight and bias. Default: False.
Inputs:
x(Tensor) - Tensor of shape :math:(N, C_{in}, D_{in}, H_{in}, W_{in}).
Return:
Tensor of shape :math:(N, CLASSES_{out}).
NonLocalBlockND¶
class mindvideo.models.NonLocalBlockND(in_channels, inter_channels=None, mode='embedded', sub_sample=True, bn_layer=True)
Classification backbone for nonlocal. Implementation of Non-Local Block with 4 different pairwise functions.
base: nn.Cell
Parameters:
in_channels (int): Original channel size.
inter_channels (int): Channel size inside the block; if not specified, it is reduced to half of in_channels.
mode (str): Pairwise function to use, one of gaussian, embedded, dot, and concatenation. Default: 'embedded'.
bn_layer (bool): Whether to add batch norm. Default: True.
Inputs:
x(Tensor) - Tensor of shape :math:(N, C_{in}, D_{in}, H_{in}, W_{in}).
Return:
Tensor of shape :math:(N, C_{out}, D_{out}, H_{out}, W_{out}).
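The pairwise functions differ mainly in how the affinity between two positions is scored and normalized. The sketch below illustrates the gaussian and dot variants from the Non-local Neural Networks paper on a handful of flattened positions. It is a simplification, not the library code: the learned embeddings and output projection are replaced by the identity, and the helper names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_weights(positions, mode="gaussian"):
    """Return an N x N matrix of normalized pairwise weights."""
    n = len(positions)
    weights = []
    for x_i in positions:
        scores = [dot(x_i, x_j) for x_j in positions]
        if mode == "gaussian":
            # exp(x_i . x_j) normalized over j, i.e. a softmax (max-shifted for stability).
            peak = max(scores)
            exps = [math.exp(s - peak) for s in scores]
            total = sum(exps)
            weights.append([e / total for e in exps])
        elif mode == "dot":
            # Dot-product variant: normalize by the number of positions.
            weights.append([s / n for s in scores])
        else:
            raise ValueError("this sketch only covers 'gaussian' and 'dot'")
    return weights


def non_local(positions, mode="gaussian"):
    """z_i = x_i + sum_j w_ij * x_j (residual connection, identity projections)."""
    weights = attention_weights(positions, mode)
    dim = len(positions[0])
    output = []
    for x_i, row in zip(positions, weights):
        y_i = [sum(w * x_j[k] for w, x_j in zip(row, positions)) for k in range(dim)]
        output.append([a + b for a, b in zip(x_i, y_i)])
    return output


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gaussian_w = attention_weights(tokens, "gaussian")
dot_w = attention_weights(tokens, "dot")
```

Because of the residual connection, the output keeps the shape of the input, which matches the input and return shapes documented above.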
NLInflateBlockBase3D¶
class mindvideo.models.NLInflateBlockBase3D(in_channels, inter_channels=None, mode='embedded', sub_sample=True, bn_layer=True)
ResNet residual block base definition.
base: ResidualBlockBase3D
Parameters:
in_channel (int): Input channel.
out_channel (int): Output channel.
stride (int): Stride size for the first convolutional layer. Default: 1.
group (int): Group convolutions. Default: 1.
base_width (int): Width of per group. Default: 64.
norm (nn.Cell, optional): Module specifying the normalization layer to use. Default: None.
down_sample (nn.Cell, optional): Downsample structure. Default: None.
Return:
Tensor, output tensor.
NLInflateBlock3D¶
class mindvideo.models.NLInflateBlock3D(in_channel: int, out_channel: int, conv12: Optional[nn.Cell] = Inflate3D, group: int = 1, base_width: int = 64, norm: Optional[nn.Cell] = None, down_sample: Optional[nn.Cell] = None, non_local: bool = False, non_local_mode: str = 'dot', **kwargs)
ResNet3D residual block definition.
base: ResidualBlock3D
Parameters:
in_channel (int): Input channel.
out_channel (int): Output channel.
stride (int): Stride size for the second convolutional layer. Default: 1.
group (int): Group convolutions. Default: 1.
base_width (int): Width of per group. Default: 64.
norm (nn.Cell, optional): Module specifying the normalization layer to use. Default: None.
down_sample (nn.Cell, optional): Downsample structure. Default: None.
Return:
Tensor, output tensor.
NLInflateResNet3D¶
class mindvideo.models.NLInflateResNet3D(block: Optional[nn.Cell], layer_nums: Tuple[int], stage_channels: Tuple[int] = (64, 128, 256, 512), stage_strides: Tuple[int] = ((1, 1, 1), (1, 2, 2), (1, 2, 2), (1, 2, 2)), down_sample: Optional[nn.Cell] = Unit3D, inflate: Tuple[Tuple[int]] = ((1, 1, 1), (1, 0, 1, 0), (1, 0, 1, 0, 1, 0), (0, 1, 0)), non_local: Tuple[Tuple[int]] = ((0, 0, 0), (0, 1, 0, 1), (0, 1, 0, 1, 0, 1), (0, 0, 0)), **kwargs)
Inflate3D with ResNet3D backbone and non local block.
base: ResNet3D
Parameters:
block (Optional[nn.Cell]): The block for network.
layer_nums (list): The numbers of block in different layers.
norm (nn.Cell, optional): The module specifying the normalization layer to use. Default: None.
stage_strides: Stride size for ResNet3D convolutional layer.
non_local: Determine whether to apply nonlocal block in this block.
Inputs:
x(Tensor) - Tensor of shape :math:(N, C_{in}, D_{in}, H_{in}, W_{in}).
Return:
Tensor, output tensor.
nonlocal3d¶
class mindvideo.models.nonlocal3d(in_d: int = 32, in_h: int = 224, in_w: int = 224, num_classes: int = 400, keep_prob: float = 0.5, backbone: Optional[nn.Cell] = NLResInflate3D50, avg_pool: Optional[nn.Cell] = AdaptiveAvgPool3D, flatten: Optional[nn.Cell] = nn.Flatten, head: Optional[nn.Cell] = DropoutDense)
nonlocal3d model from Xiaolong Wang. “Non-local Neural Networks.” https://arxiv.org/pdf/1711.07971v3
base: nn.Cell
Parameters:
in_d: Depth of input data, it can be considered as frame number of a video. Default: 32.
in_h: Height of input frames. Default: 224.
in_w: Width of input frames. Default: 224.
num_classes(int): Number of classes, i.e. the size of the classification score for every sample, :math:CLASSES_{out}. Default: 400.
pooling_keep_dim: Whether to keep dim when pooling. Default: True.
keep_prob(float): Dropout keep probability for the dense-layer head. Default: 0.5.
pretrained(bool): If True, it will create a pretrained model loaded from the network. If False, it will create a nonlocal3d model with uniform initialization for weight and bias.
backbone: Backbone of nonlocal3d.
avg_pool: Avgpooling and flatten.
head: LinearClsHead architecture.
Inputs:
x(Tensor) - Tensor of shape :math:(N, C_{in}, D_{in}, H_{in}, W_{in}).
Return:
Tensor of shape :math:(N, CLASSES_{out}).
Conv2Plus1d¶
class mindvideo.models.Conv2Plus1d(in_channel, mid_channel, out_channel, kernel_size=(3, 3, 3), stride=(1, 1, 1), norm=nn.BatchNorm3d, activation=nn.ReLU)
R(2+1)d conv12 block. It implements spatial-temporal feature extraction in a separated way.
base: nn.Cell
Parameters:
in_channels (int): The number of channels of input frame images.
out_channels (int): The number of channels of output frame images.
kernel_size (tuple): The size of the spatial-temporal convolutional layer kernels.
stride (Union[int, Tuple[int]]): Stride size for the convolutional layer. Default: 1.
group (int): Splits filter into groups, in_channels and out_channels must be divisible by the number of groups. Default: 1.
norm (Optional[nn.Cell]): Norm layer that will be stacked on top of the convolution layer. Default: nn.BatchNorm3d.
activation (Optional[nn.Cell]): Activation function which will be stacked on top of the normalization layer (if not None), otherwise on top of the conv layer. Default: nn.ReLU.
Return:
Tensor, its channel size is calculated from in_channel, out_channel and kernel_size.
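For context on how an intermediate channel count is typically chosen, the R(2+1)D paper picks it so that the factorized block has roughly as many weights as the full 3D convolution it replaces. The sketch below applies that formula. Conv2Plus1d itself takes mid_channel as an explicit argument, so whether mindvideo derives it this way is an assumption; the helper names are hypothetical and biases are ignored.

```python
def r2plus1d_mid_channel(in_channel, out_channel, kernel_size=(3, 3, 3)):
    """Mid channel count that matches the parameter budget of a full 3D conv."""
    t, h, w = kernel_size
    full_3d = t * h * w * in_channel * out_channel
    return full_3d // (h * w * in_channel + t * out_channel)


def factorized_params(in_channel, mid_channel, out_channel, kernel_size=(3, 3, 3)):
    """Weight count of a 1 x h x w conv followed by a t x 1 x 1 conv."""
    t, h, w = kernel_size
    spatial = h * w * in_channel * mid_channel   # 2D spatial convolution
    temporal = t * mid_channel * out_channel     # 1D temporal convolution
    return spatial + temporal


mid = r2plus1d_mid_channel(64, 64)
print(mid, factorized_params(64, mid, 64), 3 * 3 * 3 * 64 * 64)
```

For a 64-to-64 block with a (3, 3, 3) kernel the two parameter counts match exactly; in general the floor division leaves the factorized block slightly smaller.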
R2Plus1dNet¶
class mindvideo.models.R2Plus1dNet(block: Optional[nn.Cell], layer_nums: Tuple[int], stage_channels: Tuple[int] = (64, 128, 256, 512), stage_strides: Tuple[Tuple[int]] = ((1, 1, 1), (2, 2, 2), (2, 2, 2), (2, 2, 2)), num_classes: int = 400, **kwargs)
Generic R(2+1)d generator.
base: ResNet3D
Parameters:
block (Optional[nn.Cell]): The block for network.
layer_nums (Tuple[int]): The numbers of block in different layers.
stage_channels (Tuple[int]): Output channel for every res stage. Default: (64, 128, 256, 512).
stage_strides (Tuple[Tuple[int]]): Strides for every res stage. Default: ((1, 1, 1), (2, 2, 2), (2, 2, 2), (2, 2, 2)).
conv12 (nn.Cell, optional): Conv1 and conv2 config in resblock. Default: Conv2Plus1D.
base_width (int): The width of per group. Default: 64.
norm (nn.Cell, optional): The module specifying the normalization layer to use. Default: None.
num_classes(int): Number of categories in the action recognition dataset.
keep_prob(float): Dropout probability in classification stage.
kwargs (dict, optional): Key arguments for “make_res_layer” and resblocks.
Return:
Tensor, output tensor.
WindowAttention3D¶
class mindvideo.models.WindowAttention3D(in_channels: int = 96, window_size: int = (8, 7, 7), num_head: int = 3, qkv_bias: Optional[bool] = True, qk_scale: Optional[float] = None, attn_kepp_prob: Optional[float] = 1.0, proj_keep_prob: Optional[float] = 1.0)
Window based multi-head self attention (W-MSA) module with relative position bias. It supports both shifted and non-shifted windows.
base: nn.Cell
Parameters:
in_channels (int): Number of input channels.
window_size (tuple[int]): The depth length, height and width of the window. Default: (8, 7, 7).
num_head (int): Number of attention heads. Default: 3.
qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. Default: True.
qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set. Default: None.
attn_keep_prob (float, optional): Dropout keep ratio of attention weight. Default: 1.0.
proj_keep_prob (float, optional): Dropout keep ratio of output. Default: 1.0.
Inputs:
x(Tensor) - Tensor of shape (B, N, C).
mask(Tensor) - (0 / -inf) mask with shape of (num_windows, N, N) or None.
Return:
Tensor of shape (B, N, C), the same shape as the input x.
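The quantities implied by these parameters can be worked out by hand. The sketch below derives the token count N per window, the per-head dimension, the default qk scale of head_dim ** -0.5, and the size of a 3D relative position bias table. The table size follows the Video Swin Transformer formulation and is an assumption about this implementation; the helper is hypothetical.

```python
def window_attention_stats(in_channels=96, window_size=(8, 7, 7), num_head=3,
                           qk_scale=None):
    """Derived sizes for one window attention layer."""
    if in_channels % num_head:
        raise ValueError("in_channels must be divisible by num_head")
    wd, wh, ww = window_size
    head_dim = in_channels // num_head
    return {
        # N in the (B, N, C) input: every position inside one window.
        "tokens_per_window": wd * wh * ww,
        "head_dim": head_dim,
        # qk_scale overrides the default head_dim ** -0.5 only when it is set.
        "scale": qk_scale if qk_scale is not None else head_dim ** -0.5,
        # One bias entry per relative offset along depth, height and width.
        "relative_bias_entries": (2 * wd - 1) * (2 * wh - 1) * (2 * ww - 1),
    }


stats = window_attention_stats()
```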
SwinTransformerBlock3D¶
class mindvideo.models.SwinTransformerBlock3D(embed_dim: int = 96, input_size: int = (16, 56, 56), num_head: int = 3, window_size: int = (8, 7, 7), shift_size: int = (4, 3, 3), mlp_ratio: float = 4., qkv_bias: bool = True, qk_scale: Optional[float] = None, keep_prob: float = 1., attn_keep_prob: float = 1., droppath_keep_prob: float = 1., act_layer: nn.Cell = nn.GELU, norm_layer: str = 'layer_norm')
A Video Swin Transformer Block. The implementation of this block follows the paper “Video Swin Transformer”.
base: nn.Cell
Parameters:
embed_dim (int): input feature’s embedding dimension, namely, channel number. Default: 96.
input_size (int | tuple(int)): input feature size. Default: (16, 56, 56).
num_head (int): number of attention head of the current Swin3d block. Default: 3.
window_size (int): window size of window attention. Default: (8, 7, 7).
shift_size (tuple[int]): shift size for shifted window attention. Default: (4, 3, 3).
mlp_ratio (float): ratio of mlp hidden dim to embedding dim. Default: 4.0.
qkv_bias (bool): if True, add a learnable bias to query, key, value. Default: True.
qk_scale (float | None, optional): override default qk scale of head_dim ** -0.5 if set. Default: None.
keep_prob (float): dropout keep probability. Default: 1.0.
attn_keep_prob (float): units keeping probability for attention dropout. Default: 1.0.
droppath_keep_prob (float): path keeping probability for stochastic droppath. Default: 1.0.
act_layer (nn.Cell): activation layer. Default: nn.GELU.
norm_layer (str): normalization layer. Default: 'layer_norm'.
Inputs:
x (Tensor) - Input feature of shape (B, D, H, W, C).
mask_matrix (Tensor) - Attention mask for cyclic shift.
Return:
Tensor of shape (B, D, H, W, C)
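How input_size, window_size and shift_size interact can be sketched as follows. The clamping rule (shrink the window and disable the shift along any axis where the input is not larger than the window) comes from the reference Video Swin Transformer code and is assumed, not confirmed, for this implementation. Both helpers are hypothetical.

```python
def clamp_window(input_size, window_size, shift_size):
    """Shrink the window and zero the shift on axes the input does not exceed."""
    window, shift = [], []
    for size, win, sh in zip(input_size, window_size, shift_size):
        if size <= win:
            window.append(size)
            shift.append(0)
        else:
            window.append(win)
            shift.append(sh)
    return tuple(window), tuple(shift)


def num_windows(input_size, window_size):
    """Windows per sample, counting partially filled windows after padding."""
    count = 1
    for size, win in zip(input_size, window_size):
        count *= -(-size // win)  # ceiling division
    return count


window, shift = clamp_window((16, 7, 7), (8, 7, 7), (4, 3, 3))
windows_per_sample = num_windows((16, 56, 56), (8, 7, 7))
```

With the default sizes each sample is split into 128 windows, which is the num_windows dimension of the attention mask.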
PatchMerging¶
class mindvideo.models.PatchMerging(dim: int = 96, norm_layer: str = 'layer_norm')
Patch Merging Layer.
base: nn.Cell
Parameters:
dim (int): Number of input channels.
norm_layer (str): Normalization layer. Default: 'layer_norm'.
Inputs:
x (Tensor) - Input feature of shape (B, D, H, W, C).
Return:
Tensor of shape (B, D, H/2, W/2, 2*C)
SwinTransformerStage3D¶
class mindvideo.models.SwinTransformerStage3D(embed_dim=96, input_size=(16, 56, 56), depth=2, num_head=3, window_size=(8, 7, 7), mlp_ratio=4., qkv_bias=True, qk_scale=None, keep_prob=1., attn_keep_prob=1., droppath_keep_prob=0.8, norm_layer='layer_norm', downsample=PatchMerging)
A basic Swin Transformer layer for one stage.
base: nn.Cell
Parameters:
embed_dim (int): input feature’s embedding dimension, namely, channel number. Default: 96.
input_size (tuple[int]): input feature size. Default: (16, 56, 56).
depth (int): depth of the current Swin3d stage. Default: 2.
num_head (int): number of attention head of the current Swin3d stage. Default: 3.
window_size (tuple[int]): window size of window attention. Default: (8, 7, 7).
mlp_ratio (float): ratio of mlp hidden dim to embedding dim. Default: 4.0.
qkv_bias (bool): if qkv_bias is True, add a learnable bias into query, key, value matrices. Default: True.
qk_scale (float | None, optional): override default qk scale of head_dim ** -0.5 if set. Default: None.
keep_prob (float): dropout keep probability. Default: 1.0.
attn_keep_prob (float): units keeping probability for attention dropout. Default: 1.0.
droppath_keep_prob (float): path keeping probability for stochastic droppath. Default: 0.8.
norm_layer(string): normalization layer. Default: 'layer_norm'.
downsample (nn.Cell | None, optional): downsample layer at the end of swin3d stage. Default: PatchMerging.
Inputs:
A video feature of shape (N, D, H, W, C)
Return:
Tensor of shape (N, D, H / 2, W / 2, 2 * C)
PatchEmbed3D¶
class mindvideo.models.PatchEmbed3D(input_size=(16, 224, 224), patch_size=(2, 4, 4), in_channels=3, embed_dim=96, norm_layer='layer_norm', patch_norm=True)
Video to Patch Embedding.
base: nn.Cell
Parameters:
input_size (tuple[int]): Input feature size. Default: (16, 224, 224).
patch_size (tuple[int]): Patch token size. Default: (2, 4, 4).
in_channels (int): Number of input video channels. Default: 3.
embed_dim (int): Number of linear projection output channels. Default: 96.
norm_layer (str, optional): Normalization layer. Default: 'layer_norm'.
patch_norm (bool): if True, add normalization after patch embedding. Default: True.
Inputs:
An original Video tensor in data format of ‘NCDHW’.
Return:
An embedded tensor in data format of ‘NDHWC’.
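The shape change from 'NCDHW' video to 'NDHWC' tokens can be estimated with the sketch below. It assumes non-overlapping patches and that sizes not divisible by the patch size are padded up (hence the ceiling division); the helper patch_embed_shape is hypothetical.

```python
def patch_embed_shape(batch, input_size=(16, 224, 224), patch_size=(2, 4, 4),
                      embed_dim=96):
    """Output shape (N, D', H', W', C) of a non-overlapping 3D patch embedding."""
    # Ceiling division: partially covered patches are assumed to be padded.
    grid = tuple(-(-size // patch) for size, patch in zip(input_size, patch_size))
    return (batch,) + grid + (embed_dim,)


default_shape = patch_embed_shape(batch=2)
clip32_shape = patch_embed_shape(batch=2, input_size=(32, 224, 224))
```

A 32-frame 224 x 224 clip yields a (16, 56, 56) token grid, which matches the default input_size of the Swin3d blocks and stages documented here.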
SwinTransformer3D¶
class mindvideo.models.SwinTransformer3D(input_size=(16, 56, 56), embed_dim=96, depths=(2, 2, 6, 2), num_heads=(3, 6, 12, 24), window_size=(8, 7, 7), mlp_ratio=4., qkv_bias=True, qk_scale=None, keep_prob=1., attn_keep_prob=1., droppath_keep_prob=0.8, norm_layer='layer_norm')
Video Swin Transformer backbone. A MindSpore implementation of Video Swin Transformer: http://arxiv.org/abs/2106.13230
base: nn.Cell
Parameters:
input_size (int | tuple(int)): input feature size. Default: (16, 56, 56).
embed_dim (int): input feature’s embedding dimension, namely, channel number. Default: 96.
depths (tuple[int]): depths of each Swin3d stage. Default: (2, 2, 6, 2).
num_heads (tuple[int]): number of attention head of each Swin3d stage. Default: (3, 6, 12, 24).
window_size (int): window size of window attention. Default: (8, 7, 7).
mlp_ratio (float): ratio of mlp hidden dim to embedding dim. Default: 4.0.
qkv_bias (bool): if qkv_bias is True, add a learnable bias into query, key, value matrices. Default: True.
qk_scale (float | None, optional): override default qk scale of head_dim ** -0.5 if set. Default: None.
keep_prob (float): dropout keep probability. Default: 1.0.
attn_keep_prob (float): units keeping probability for attention dropout. Default: 1.0.
droppath_keep_prob (float): path keeping probability for stochastic droppath. Default: 0.8.
norm_layer (string): normalization layer. Default: 'layer_norm'.
Inputs:
x (Tensor) - Tensor in 'NDHWC' data format.
Return:
Tensor in 'NCDHW' data format.
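To see how the feature shape evolves across stages, the following sketch traces (D, H, W, C) after each stage. It assumes that every stage except the last ends with a patch-merging downsample, which halves H and W and doubles C as documented for SwinTransformerStage3D; the last-stage behavior and the helper name are assumptions.

```python
def swin3d_stage_shapes(input_size=(16, 56, 56), embed_dim=96, depths=(2, 2, 6, 2)):
    """Return the (D, H, W, C) output shape of every stage."""
    d, h, w = input_size
    channels = embed_dim
    shapes = []
    for index in range(len(depths)):
        is_last = index == len(depths) - 1
        if not is_last:
            # Patch merging: halve the spatial size, double the channels.
            h, w, channels = h // 2, w // 2, channels * 2
        shapes.append((d, h, w, channels))
    return shapes


for stage, shape in enumerate(swin3d_stage_shapes(), start=1):
    print(f"stage {stage}: {shape}")
```

Under these assumptions the default configuration ends with 768 channels on a 7 x 7 spatial grid, with the temporal length unchanged.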
Swin3D¶
class mindvideo.models.Swin3D(input_size=(16, 56, 56), embed_dim=96, depths=(2, 2, 6, 2), num_heads=(3, 6, 12, 24), window_size=(8, 7, 7), mlp_ratio=4., qkv_bias=True, qk_scale=None, keep_prob=1., attn_keep_prob=1., droppath_keep_prob=0.8, norm_layer='layer_norm')
Constructs a swin3d architecture corresponding to Video Swin Transformer <http://arxiv.org/abs/2106.13230>.
base: nn.Cell
Parameters:
num_classes (int): The number of classification. Default: 400.
patch_size (int): Patch size used by window attention. Default: (2, 4, 4).
window_size (int): Window size used by window attention. Default: (8, 7, 7).
embed_dim (int): Embedding dimension of the feature generated from patch embedding layer. Default: 96.
depths (int): Depths of each stage in Swin3d Tiny module. Default: (2, 2, 6, 2).
num_heads (int): Numbers of heads of each stage in Swin3d Tiny module. Default: (3, 6, 12, 24).
representation_size (int): Feature dimension of the last layer in backbone. Default: 768.
droppath_keep_prob (float): The drop path keep probability. Default: 0.9.
input_size (int | tuple(int)): Input feature size. Default: (32, 224, 224).
in_channels (int): Input channels. Default: 3.
mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. Default: 4.0.
qkv_bias (bool): If qkv_bias is True, add a learnable bias into query, key, value matrices. Default: True.
qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set. Default: None.
keep_prob (float): Dropout keep probability. Default: 1.0.
attn_keep_prob (float): Keeping probability for attention dropout. Default: 1.0.
norm_layer (string): Normalization layer. Default: 'layer_norm'.
patch_norm (bool): If True, add normalization after patch embedding. Default: True.
pooling_keep_dim (bool): Specifies whether to keep dimension shape the same as input feature. Default: False.
head_bias (bool): Specifies whether the head uses a bias vector. Default: True.
head_activation (Union[str, Cell, Primitive]): Activate function applied in the head. Default: None.
head_keep_prob (float): Head’s dropout keeping rate, between [0, 1]. Default: 0.5.
Inputs:
x (Tensor) - Tensor of shape :math:(N, C_{in}, D_{in}, H_{in}, W_{in}).
Return:
Tensor of shape :math:(N, CLASSES_{out})
swin3d_t¶
def mindvideo.models.swin3d_t(num_classes: int = 400, patch_size: int = (2, 4, 4), window_size: int = (8, 7, 7), embed_dim: int = 96, depths: int = (2, 2, 6, 2), num_heads: int = (3, 6, 12, 24), representation_size: int = 768, droppath_keep_prob: float = 0.9)
Video Swin Transformer Tiny (swin3d-T) model.
Parameters:
num_classes (int): Number of categories.
patch_size (int): Size of swin3d patch segmentation.
window_size (int): Size of swin3d window.
embed_dim (int): Dimension output by the patch embedding.
depths (int): Depth of each stage.
num_heads (int): Number of heads in window attention.
representation_size (int): Size of features output at the last layer of backbone.
droppath_keep_prob (float): Retention probability of drop path.
Returns:
swin3d_t: nn.Cell
swin3d_s¶
def mindvideo.models.swin3d_s(num_classes: int = 400, patch_size: int = (2, 4, 4), window_size: int = (8, 7, 7), embed_dim: int = 96, depths: int = (2, 2, 18, 2), num_heads: int = (3, 6, 12, 24), representation_size: int = 768, droppath_keep_prob: float = 0.9)
Video Swin Transformer Small (swin3d-S) model.
Parameters:
num_classes (int): Number of categories.
patch_size (int): Size of swin3d patch segmentation.
window_size (int): Size of swin3d window.
embed_dim (int): Dimension output by the patch embedding.
depths (int): Depth of each stage.
num_heads (int): Number of heads in window attention.
representation_size (int): Size of features output at the last layer of backbone.
droppath_keep_prob (float): Retention probability of drop path.
Returns:
swin3d_s: nn.Cell
swin3d_b¶
def mindvideo.models.swin3d_b(num_classes: int = 400, patch_size: int = (2, 4, 4), window_size: int = (8, 7, 7), embed_dim: int = 128, depths: int = (2, 2, 18, 2), num_heads: int = (4, 8, 16, 32), representation_size: int = 1024, droppath_keep_prob: float = 0.7)
Video Swin Transformer Base (swin3d-B) model.
Parameters:
num_classes (int): Number of categories.
patch_size (int): Size of swin3d patch segmentation.
window_size (int): Size of swin3d window.
embed_dim (int): Dimension output by the patch embedding.
depths (int): Depth of each stage.
num_heads (int): Number of heads in window attention.
representation_size (int): Size of features output at the last layer of backbone.
droppath_keep_prob (float): Retention probability of drop path.
Returns:
swin3d_b: nn.Cell
swin3d_l¶
def mindvideo.models.swin3d_l(num_classes: int = 400, patch_size: int = (2, 4, 4), window_size: int = (8, 7, 7), embed_dim: int = 192, depths: int = (2, 2, 18, 2), num_heads: int = (6, 12, 24, 48), representation_size: int = 1536, droppath_keep_prob: float = 0.9)
Video Swin Transformer Large (swin3d-L) model.
Parameters:
num_classes (int): Number of categories.
patch_size (int): Size of swin3d patch segmentation.
window_size (int): Size of swin3d window.
embed_dim (int): Dimension output by the patch embedding.
depths (int): Depth of each stage.
num_heads (int): Number of heads in window attention.
representation_size (int): Size of features output at the last layer of backbone.
droppath_keep_prob (float): Retention probability of drop path.
Returns:
swin3d_l: nn.Cell
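The four factory functions differ only in a few defaults. The sketch below collects the values from the signatures above into a plain dictionary and checks two regularities: representation_size equals embed_dim doubled once per stage transition, and every stage has the same per-head width. The dictionary and helper names are illustrative and not part of mindvideo.

```python
# Defaults copied from the swin3d_t / swin3d_s / swin3d_b / swin3d_l signatures.
SWIN3D_VARIANTS = {
    "swin3d_t": {"embed_dim": 96, "depths": (2, 2, 6, 2),
                 "num_heads": (3, 6, 12, 24), "representation_size": 768},
    "swin3d_s": {"embed_dim": 96, "depths": (2, 2, 18, 2),
                 "num_heads": (3, 6, 12, 24), "representation_size": 768},
    "swin3d_b": {"embed_dim": 128, "depths": (2, 2, 18, 2),
                 "num_heads": (4, 8, 16, 32), "representation_size": 1024},
    "swin3d_l": {"embed_dim": 192, "depths": (2, 2, 18, 2),
                 "num_heads": (6, 12, 24, 48), "representation_size": 1536},
}


def final_channels(embed_dim, depths):
    """Channel count after doubling at each of the len(depths) - 1 transitions."""
    return embed_dim * 2 ** (len(depths) - 1)


def head_dims(embed_dim, num_heads):
    """Per-head width in every stage, given that channels double per stage."""
    return [embed_dim * 2 ** i // heads for i, heads in enumerate(num_heads)]


for name, cfg in SWIN3D_VARIANTS.items():
    print(name, final_channels(cfg["embed_dim"], cfg["depths"]),
          head_dims(cfg["embed_dim"], cfg["num_heads"]))
```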
GroupNorm3d¶
class mindvideo.models.GroupNorm3d(num_groups, num_channels, eps=1e-05, affine=True, gamma_init='ones', beta_init='zeros')
Modified from mindspore.nn.GroupNorm to add the depth dimension.
base: nn.Cell
Parameters:
num_groups (int): Number of groups to be divided along the channel dimension.
num_channels (int): Number of channels.
eps(float): The value added to the denominator.
affine (bool): When set to True, a learnable affine transformation parameter is added to the layer.
gamma_init (str): Method of initializing the gamma parameter.
beta_init (str): Method of initializing the beta parameter.
Return:
Tensor, output tensor.
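Group normalization over a 3D volume can be sketched in plain Python as below. Each channel is represented as a flat list of its D * H * W values, and statistics are shared by all channels in a group. The affine step is left out, which corresponds to gamma fixed at ones and beta at zeros. This is an illustration of the technique, not the GroupNorm3d source.

```python
def group_norm_3d(channels, num_groups, eps=1e-5):
    """Normalize C flat channel lists, sharing mean and variance within each group."""
    if len(channels) % num_groups:
        raise ValueError("num_channels must be divisible by num_groups")
    per_group = len(channels) // num_groups
    output = []
    for start in range(0, len(channels), per_group):
        group = channels[start:start + per_group]
        values = [v for channel in group for v in channel]
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        scale = (variance + eps) ** -0.5
        output.extend([[(v - mean) * scale for v in channel] for channel in group])
    return output


# Four channels with two values each, split into two groups of two channels.
volume = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [20.0, 20.0]]
gn_output = group_norm_3d(volume, num_groups=2)
```

After normalization, the values of each group have approximately zero mean and unit variance, regardless of the original scale of the group.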
VistrCom¶
class mindvideo.models.VistrCom(name: str = 'ResNet50', train_embeding: bool = True, num_queries: int = 360, num_pos_feats: int = 64, num_frames: int = 36, temperature: int = 10000, normalize: bool = True, scale: float = None, hidden_dim: int = 384, d_model: int = 384, nhead: int = 8, num_encoder_layers: int = 6, num_decoder_layers: int = 6, dim_feedforward: int = 2048, dropout: int = 0.1, activation: str = "relu", normalize_before: bool = False, return_intermediate_dec: bool = True, aux_loss: bool = True, num_class: int = 41)
Vistr Architecture.
base: nn.Cell
Parameters:
name (str): The type of ResNet.
train_embeding (bool): Whether to train the embedding.
num_queries (int): Number of instances.
num_pos_feats (int): The encoding length of each dimension.
num_frames (int): Number of frames.
temperature (int): Coefficient.
normalize (bool): Whether to normalize. If True, normalize.
scale (float): Coefficient.
hidden_dim (int): Dimensions required by the input vector in the encoder.
d_model (int): Number of expected features entered by the backbone.
nhead (int): Number of heads in multi-head attention.
num_encoder_layers (int): Number of encoder layers.
num_decoder_layers (int): Number of decoder layers.
dim_feedforward (int): Dimensions of the feedforward network model in the backbone.
dropout (int): Value of dropout.
activation (str): Activation function.
normalize_before (bool): Whether normalization is applied before or not.
return_intermediate_dec (bool): Whether to return intermediate output.
aux_loss (bool): Whether to calculate the loss of the middle layer.
num_class (int): Number of categories.
Return:
Tensor, output tensor.
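The num_pos_feats and temperature parameters suggest a sinusoidal position encoding of the kind used in DETR-style models. The sketch below shows that formula for a single scalar coordinate. It is an assumption about how these two parameters are used, not the mindvideo implementation: the helper sine_encoding is hypothetical, and the normalize and scale options are left out.

```python
import math

def sine_encoding(position, num_pos_feats=64, temperature=10000):
    """Encode one scalar position as interleaved sin/cos features."""
    features = []
    for i in range(num_pos_feats):
        # Each sin/cos pair shares a divisor; later pairs oscillate more slowly.
        divisor = temperature ** (2 * (i // 2) / num_pos_feats)
        angle = position / divisor
        features.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return features

enc = sine_encoding(position=0.0)
print(len(enc), enc[:4])
```

With this layout each coordinate contributes num_pos_feats values, which matches the description of num_pos_feats as the encoding length of each dimension.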
BlockX3D¶
class mindvideo.models.BlockX3D(in_channel, out_channel, conv12: Optional[nn.Cell] = Inflate3D, inflate: int = 2, norm: Optional[nn.Cell] = None, down_sample: Optional[nn.Cell] = None, block_idx: int = 0, se_ratio: float = 0.0625, use_swish: bool = True, drop_connect_rate: float = 0.0, bottleneck_factor: float = 2.25, **kwargs)
BlockX3D 3d building block for X3D.
base: ResidualBlock3D
Parameters:
in_channel (int): Input channel.
out_channel (int): Output channel.
conv12 (nn.Cell, optional): Block that constructs the first two conv layers. It can be Inflate3D, Conv2Plus1D, or another custom block. The block should construct a layer whose output feature channel size is named mid_channel, for use by the third conv layer. Default: Inflate3D.
inflate (int): Whether to inflate kernel.
spatial_stride (int): Spatial stride in the conv3d layer. Default: 1.
down_sample (nn.Cell | None): Downsample layer. Default: None.
block_idx (int): The index of the block.
se_ratio (float | None): The reduction ratio of the squeeze and excitation (SE) unit. If set to None, no SE unit is used. Default: 0.0625.
use_swish (bool): Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.
drop_connect_rate (float): Drop connect rate. If equal to 0.0, nothing is dropped.
bottleneck_factor (float): Bottleneck expansion factor for the 3x3x3 conv.
Return:
Tensor, output tensor.
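The interplay of bottleneck_factor and se_ratio can be sketched as plain channel arithmetic. The example assumes that the inner width is int(out_channel * bottleneck_factor) and that the SE width is rounded to a multiple of 8, a convention used by other X3D implementations. Both helpers are hypothetical and may differ from what BlockX3D does internally.

```python
def round_width(width, multiplier, min_width=8, divisor=8):
    """Scale width and round it to a multiple of divisor (assumed convention)."""
    if not multiplier:
        return width
    scaled = width * multiplier
    rounded = max(min_width, int(scaled + divisor / 2) // divisor * divisor)
    # Do not let rounding shrink the width by more than 10%.
    if rounded < 0.9 * scaled:
        rounded += divisor
    return int(rounded)

def block_channels(out_channel, bottleneck_factor=2.25, se_ratio=0.0625):
    """Return (mid_channel, se_channel); se_channel is None when SE is disabled."""
    mid_channel = int(out_channel * bottleneck_factor)
    se_channel = round_width(mid_channel, se_ratio) if se_ratio else None
    return mid_channel, se_channel

print(block_channels(24))                  # (54, 8)
print(block_channels(48, se_ratio=None))   # (108, None)
```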
ResNetX3D¶
class mindvideo.models.ResNetX3D(block: Optional[nn.Cell], layer_nums: Tuple[int], stage_channels: Tuple[int], stage_strides: Tuple[Tuple[int]], drop_rates: Tuple[float], down_sample: Optional[nn.Cell] = Unit3D, bottleneck_factor: float = 2.25)
X3D backbone definition.
base: ResNet3D
Parameters:
block (Optional[nn.Cell]): The block for the network.
layer_nums (Tuple[int]): The number of blocks in each layer.
stage_channels (Tuple[int]): Output channel for every res stage.
stage_strides (Tuple[Tuple[int]]): Stride size for ResNet3D convolutional layer.
drop_rates (Tuple[float]): Drop rate for each block. The base rate at which blocks are dropped increases linearly from input to output blocks.
down_sample (Optional[nn.Cell]): Residual block in every resblock, it can transfer the input feature into the same channel of output. Default: Unit3D.
bottleneck_factor (float): Bottleneck expansion factor for the 3x3x3 conv.
fc_init_std (float): The std to initialize the fc layer(s).
Inputs:
x (Tensor) - Tensor of shape :math:`(N, C_{in}, D_{in}, H_{in}, W_{in})`.
Return:
Tensor, output tensor.
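The description of drop_rates as increasing linearly from input to output blocks can be illustrated with a small schedule generator. This is only a sketch: it assumes one rate per block and a schedule that reaches a chosen maximum at the last block. The helper linear_drop_rates and its max_rate argument are hypothetical, and the exact layout ResNetX3D expects is not confirmed here.

```python
def linear_drop_rates(layer_nums, max_rate):
    """Return one drop rate per block, rising linearly up to max_rate."""
    total_blocks = sum(layer_nums)
    # Block k (1-based) gets the fraction k / total_blocks of the maximum rate.
    return tuple(max_rate * k / total_blocks for k in range(1, total_blocks + 1))

rates = linear_drop_rates(layer_nums=(3, 5, 11, 7), max_rate=0.5)
print(len(rates), rates[0], rates[-1])
```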
X3DHead¶
class mindvideo.models.X3DHead(pool_size, input_channel, out_channel=2048, num_classes=400, dropout_rate=0.5)
X3D head architecture.
base: nn.Cell
Parameters:
input_channel (int): The number of input channel.
out_channel (int): The number of inner channel. Default: 2048.
num_classes (int): Number of classes. Default: 400.
dropout_rate (float): Dropout keeping rate, between [0, 1]. Default: 0.5.
Return:
Tensor
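The pool_size argument is not described above. One plausible reading is that it equals the (D, H, W) size of the feature map that reaches the head. The sketch below estimates that size under the assumption that the backbone downsamples spatially by a total factor of 32 and does not downsample in time, as in the X3D paper. The helper head_pool_size is hypothetical.

```python
import math

def head_pool_size(num_frames, crop_size, spatial_stride=32):
    """Estimate the (D, H, W) feature map size that reaches the head."""
    # Assumption: no temporal downsampling, total spatial stride of 32.
    side = math.ceil(crop_size / spatial_stride)
    return (num_frames, side, side)

print(head_pool_size(16, 224))   # (16, 7, 7)
print(head_pool_size(13, 160))   # (13, 5, 5)
```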
x3d¶
class mindvideo.models.x3d(block: Type[BlockX3D], depth_factor: float, num_frames: int, train_crop_size: int, num_classes: int, dropout_rate: float, bottleneck_factor: float = 2.25, eval_with_clips: bool = False)
X3D architecture. Reference: Christoph Feichtenhofer. “X3D: Expanding Architectures for Efficient Video Recognition.” https://arxiv.org/abs/2004.04730
base: nn.Cell
Parameters:
block (Type[BlockX3D]): The block of X3D.
depth_factor (float): Depth expansion factor.
num_frames (int): The number of frames of the input clip.
train_crop_size (int): The spatial crop size for training.
num_classes (int): The channel dimension of the output, i.e. the number of classes.
dropout_rate (float): Dropout rate. If equal to 0.0, no dropout is performed.
bottleneck_factor (float): Factor of bottleneck.
eval_with_clips (bool): Whether to evaluate with clips. Set to True when evaluating with clips.
Inputs:
x (Tensor) - Tensor of shape :math:`(N, C_{in}, D_{in}, H_{in}, W_{in})`.
Return:
Tensor of shape :math:`(N, CLASSES_{out})`.
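To show what depth_factor does, the sketch below scales a set of base per-stage block counts and rounds up. The base counts (1, 2, 5, 3) come from the X3D paper, and rounding with ceil is an assumption about how mindvideo applies the factor. The helper scale_depths is hypothetical.

```python
import math

BASE_BLOCKS = (1, 2, 5, 3)  # assumed base number of blocks per stage

def scale_depths(depth_factor, base_blocks=BASE_BLOCKS):
    """Scale the per-stage block counts by depth_factor, rounding up."""
    return tuple(math.ceil(depth_factor * n) for n in base_blocks)

print(scale_depths(2.2))   # (3, 5, 11, 7)
print(scale_depths(5.0))   # (5, 10, 25, 15)
```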
x3d_m¶
def mindvideo.models.x3d_m(num_classes: int = 400, dropout_rate: float = 0.5, depth_factor: float = 2.2, num_frames: int = 16, train_crop_size: int = 224, eval_with_clips: bool = False)
X3D medium (X3D-M) model.
Parameters:
num_classes (int): The channel dimension of the output, i.e. the number of classes.
dropout_rate (float): Dropout rate. If equal to 0.0, no dropout is performed.
depth_factor (float): Depth expansion factor.
num_frames (int): The number of frames of the input clip.
train_crop_size (int): The spatial crop size for training.
Inputs:
x (Tensor) - Tensor of shape :math:`(N, C_{in}, D_{in}, H_{in}, W_{in})`.
Return:
Tensor of shape :math:`(N, CLASSES_{out})`.
x3d_s¶
def mindvideo.models.x3d_s(num_classes: int = 400, dropout_rate: float = 0.5, depth_factor: float = 2.2, num_frames: int = 13, train_crop_size: int = 160, eval_with_clips: bool = False)
X3D small model.
Parameters:
num_classes (int): The channel dimension of the output, i.e. the number of classes.
dropout_rate (float): Dropout rate. If equal to 0.0, no dropout is performed.
depth_factor (float): Depth expansion factor.
num_frames (int): The number of frames of the input clip.
train_crop_size (int): The spatial crop size for training.
Inputs:
x (Tensor) - Tensor of shape :math:`(N, C_{in}, D_{in}, H_{in}, W_{in})`.
Return:
Tensor of shape :math:`(N, CLASSES_{out})`.
x3d_xs¶
def mindvideo.models.x3d_xs(num_classes: int = 400, dropout_rate: float = 0.5, depth_factor: float = 2.2, num_frames: int = 4, train_crop_size: int = 160, eval_with_clips: bool = False)
X3D x-small model.
Parameters:
num_classes (int): The channel dimension of the output, i.e. the number of classes.
dropout_rate (float): Dropout rate. If equal to 0.0, no dropout is performed.
depth_factor (float): Depth expansion factor.
num_frames (int): The number of frames of the input clip.
train_crop_size (int): The spatial crop size for training.
Inputs:
x (Tensor) - Tensor of shape :math:`(N, C_{in}, D_{in}, H_{in}, W_{in})`.
Return:
Tensor of shape :math:`(N, CLASSES_{out})`.
x3d_l¶
def mindvideo.models.x3d_l(num_classes: int = 400, dropout_rate: float = 0.5, depth_factor: float = 5.0, num_frames: int = 16, train_crop_size: int = 312, eval_with_clips: bool = False)
X3D large model.
Parameters:
num_classes (int): The channel dimension of the output, i.e. the number of classes.
dropout_rate (float): Dropout rate. If equal to 0.0, no dropout is performed.
depth_factor (float): Depth expansion factor.
num_frames (int): The number of frames of the input clip.
train_crop_size (int): The spatial crop size for training.
Inputs:
x (Tensor) - Tensor of shape :math:`(N, C_{in}, D_{in}, H_{in}, W_{in})`.
Return:
Tensor of shape :math:`(N, CLASSES_{out})`.
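The four X3D builders differ only in their default depth_factor, num_frames and train_crop_size. The sketch below collects the defaults from the signatures above and derives the training input shape each one implies. The X3D_PRESETS table and training_input_shape helper are hypothetical conveniences, and the 3 input channels (RGB) are an assumption.

```python
# Defaults copied from the x3d_xs, x3d_s, x3d_m and x3d_l signatures.
X3D_PRESETS = {
    "x3d_xs": {"depth_factor": 2.2, "num_frames": 4, "train_crop_size": 160},
    "x3d_s": {"depth_factor": 2.2, "num_frames": 13, "train_crop_size": 160},
    "x3d_m": {"depth_factor": 2.2, "num_frames": 16, "train_crop_size": 224},
    "x3d_l": {"depth_factor": 5.0, "num_frames": 16, "train_crop_size": 312},
}

def training_input_shape(name, batch_size=1, in_channels=3):
    """Return the (N, C, D, H, W) shape implied by a preset's defaults."""
    preset = X3D_PRESETS[name]
    side = preset["train_crop_size"]
    # Assumption: RGB input, so C_in defaults to 3.
    return (batch_size, in_channels, preset["num_frames"], side, side)

for name in X3D_PRESETS:
    print(name, training_input_shape(name))
```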