## mindvideo.model.layers

### AdaptiveAvgPool3D

> class mindvideo.model.layers.AdaptiveAvgPool3D(output_size)

Applies 3D adaptive average pooling over an input tensor, which is typically of shape `(N, C, D_{in}, H_{in}, W_{in})`, and produces an output of shape `(N, C, D_{out}, H_{out}, W_{out})`, where `N` is the batch size and `C` is the number of channels.

- base: nn.Cell

**Parameters:**

- output_size (Union[int, tuple[int]]): The target output size of the form D x H x W. Can be a tuple (D, H, W) or a single number D for a cube D x D x D.

**Inputs:**

- x (Tensor): The input Tensor in the form of :math:`(N, C, D_{in}, H_{in}, W_{in})`.

**Return:**

Tensor, the pooled Tensor in the form of :math:`(N, C, D_{out}, H_{out}, W_{out})`.

### AvgPool3D

> class mindvideo.model.layers.AvgPool3D(kernel_size=(1, 1, 1), strides=(1, 1, 1))

Average pooling for 3D features.

- base: nn.Cell

**Parameters:**

- kernel_size (Union[int, tuple[int]]): The size of the kernel window used to take the average value. Default: (1, 1, 1).
- strides (Union[int, tuple[int]]): The distance the kernel moves at each step. Default: (1, 1, 1).

**Inputs:**

- x (Tensor): The input Tensor.

**Return:**

Tensor, the pooled Tensor.

### GlobalAvgPooling3D

> class mindvideo.model.layers.GlobalAvgPooling3D(keep_dims: bool = True)

A global average pooling module for 3D video features.

- base: nn.Cell

**Parameters:**

- keep_dims (bool): Specifies whether to keep the same number of dimensions as the input feature. E.g. `True`. Default: True.

**Return:**

Tensor, output tensor.

### MultiIou

> class mindvideo.model.layers.MultiIou()

Calculates the IoU between predicted boxes and ground-truth boxes.

- base: nn.Cell

**Parameters:**

None

**Inputs:**

- pred_bbox (Tensor): Predicted bbox.
- gt_bbox (Tensor): Ground-truth bbox.

**Return:**

Tensor, IoU of the predicted box and the ground-truth box.
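As a reference for what an IoU value means, the following is a minimal pure-Python sketch for a single pair of boxes. It is not the `MultiIou` implementation: the helper name `box_iou` is hypothetical, and the `[x0, y0, x1, y1]` corner format is an assumption, since the input format of `MultiIou` is not stated here.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes, each given as [x0, y0, x1, y1].

    The corner format is an assumption made for this illustration.
    """
    # Corners of the intersection rectangle.
    ix0 = max(box_a[0], box_b[0])
    iy0 = max(box_a[1], box_b[1])
    ix1 = min(box_a[2], box_b[2])
    iy1 = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


pred_bbox = [0.0, 0.0, 2.0, 2.0]
gt_bbox = [1.0, 1.0, 3.0, 3.0]
# Intersection area is 1 and union area is 4 + 4 - 1 = 7.
iou = box_iou(pred_bbox, gt_bbox)
```

The layer itself operates on tensors of boxes; this scalar version only shows the arithmetic applied to each pair.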
### BoxIou

> class mindvideo.model.layers.BoxIou()

Calculates the IoU between two sets of boxes.

- base: nn.Cell

**Parameters:**

None

**Inputs:**

- boxes1 (Tensor): Boxes in [x0, y0, x1, y1] format.
- boxes2 (Tensor): Boxes in [x0, y0, x1, y1] format.

**Return:**

Tensor

### BoxIou

> class mindvideo.model.layers.BoxIou()

Generalized IoU from https://giou.stanford.edu/. The boxes should be in [x0, y0, x1, y1] format. Returns a [N, M] pairwise matrix, where N = len(boxes1) and M = len(boxes2).

- base: nn.Cell

**Parameters:**

None

**Inputs:**

- boxes1 (Tensor): Boxes in [x0, y0, x1, y1] format.
- boxes2 (Tensor): Boxes in [x0, y0, x1, y1] format.

**Return:**

A [N, M] pairwise matrix, where N = len(boxes1) and M = len(boxes2).

### ConvNormActivation

> class mindvideo.model.layers.ConvNormActivation(in_planes: int, out_planes: int, kernel_size: int = 3, stride: int = 1, groups: int = 1, norm: Optional[nn.Cell] = nn.BatchNorm2d, activation: Optional[nn.Cell] = nn.ReLU, has_bias: bool = False)

Convolution/Depthwise fused with normalization and activation blocks definition.

- base: nn.Cell

**Parameters:**

- in_planes (int): Input channel.
- out_planes (int): Output channel.
- kernel_size (int): Input kernel size. Default: 3.
- stride (int): Stride size for the first convolutional layer. Default: 1.
- groups (int): Channel group. Use 1 for a standard convolution and the number of input channels for a depthwise convolution. Default: 1.
- norm (nn.Cell, optional): Norm layer that will be stacked on top of the convolution layer. Default: nn.BatchNorm2d.
- activation (nn.Cell, optional): Activation function which will be stacked on top of the normalization layer (if not None), otherwise on top of the conv layer. Default: nn.ReLU.

**Return:**

Tensor, output tensor.

### Conv2dNormResAct

> class mindvideo.model.layers.Conv2dNormResAct(in_channels, out_channels, kernel_size, stride, padding, residual=False)

Convolution/Depthwise fused with normalization and activation blocks definition.

- base: nn.Cell

**Parameters:**

- in_channels (int): The channel number of the input tensor of the Conv2d layer.
- out_channels (int): The channel number of the output tensor of the Conv2d layer.
- kernel_size (Union[int, tuple[int]]): Specifies the height and width of the 2D convolution kernel.
- stride (Union[int, tuple[int]]): The movement stride of the 2D convolution kernel.
- padding (Union[int, tuple[int]]): The amount of padding in the height and width directions of the input.
- residual (bool): Whether the input value needs to be added to the output. Default: False.

**Inputs:**

- **x** (Tensor) - Tensor of shape :math:`(N, C_{in}, H_{in}, W_{in})`.

**Return:**

Tensor of shape :math:`(N, C_{out}, H_{out}, W_{out})`.

### Conv2dTransPadBN

> class mindvideo.model.layers.Conv2dTransPadBN(in_channels, out_channels, kernel_size, stride, padding, output_padding=0)

Convolution/Depthwise fused with normalization and activation blocks definition.

- base: nn.Cell

**Parameters:**

- in_channels (int): The channel number of the input tensor of the Conv2d layer.
- out_channels (int): The channel number of the output tensor of the Conv2d layer.
- kernel_size (Union[int, tuple[int]]): Specifies the height and width of the 2D convolution kernel.
- stride (Union[int, tuple[int]]): The movement stride of the 2D convolution kernel.
- padding (Union[int, tuple[int]]): The amount of padding in the height and width directions of the input.
- output_padding (int): The amount of padding applied to the output. Default: 0.

**Inputs:**

- **x** (Tensor) - Tensor of shape :math:`(N, C_{in}, H_{in}, W_{in})`.

**Return:**

Tensor of shape :math:`(N, C_{out}, H_{out}, W_{out})`.

### C3DBackbone

> class mindvideo.model.layers.C3DBackbone(in_channel=3, kernel_size=(3, 3, 3))

C3D backbone. It works when the input data has the shape :math:`(B, C, T, H, W)`.

- base: nn.Cell

**Parameters:**

- in_channel (int): Number of channels of the input data. Default: 3.
- kernel_size (Union[int, Tuple[int]]): Kernel size for every conv3d layer in C3D. Default: (3, 3, 3).

**Return:**

Tensor, infer output tensor.
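To make the role of `stride`, `padding` and `output_padding` in `Conv2dTransPadBN` more concrete, here is a small sketch of the textbook output-size formula for a transposed convolution with dilation 1. Whether `Conv2dTransPadBN` follows exactly this formula is an assumption, and the helper name `conv_transpose_out_size` is hypothetical.

```python
def conv_transpose_out_size(size_in, kernel_size, stride, padding, output_padding=0):
    """Output length along one spatial axis of a transposed convolution.

    Textbook formula for dilation 1; assumed, not confirmed, to match
    the behaviour of Conv2dTransPadBN.
    """
    return (size_in - 1) * stride - 2 * padding + kernel_size + output_padding


# Upsampling a 112-pixel axis with a 3x3 kernel, stride 2 and padding 1.
without_extra = conv_transpose_out_size(112, kernel_size=3, stride=2, padding=1)
with_extra = conv_transpose_out_size(112, kernel_size=3, stride=2, padding=1,
                                     output_padding=1)
```

Under this formula, `output_padding=1` is what turns an odd output length into an exact doubling of the input length for stride 2.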
### DeformConv2d

> class mindvideo.model.layers.DeformConv2d(inc, outc, kernel_size=3, stride=1, pad_mode='same', padding=0, has_bias=False, modulation=True)

Deformable convolution operator.

- base: nn.Cell

**Parameters:**

- inc (int): Input channel.
- outc (int): Output channel.
- kernel_size (int): Convolution window. Default: 3.
- stride (int): The distance the kernel moves at each step. Default: 1.
- padding (int): Implicit padding size on both sides of the input. Default: 0.
- has_bias (bool): Specifies whether the layer uses a bias vector. Default: False.
- modulation (bool): If True, use modulated deformable convolution (Deformable ConvNets v2). Default: True.

**Return:**

Tensor, detection of images (bboxes, score, keypoints and category id of each object).

### _get_offset_base

> def mindvideo.model.layers._get_offset_base(offset_shape, stride)

Get the base position index from the deformable shift of each kernel element.

### _get_feature_by_index

> def mindvideo.model.layers._get_feature_by_index(x, p_h, p_w)

Gather features by the specified index.

### _regenerate_feature_map

> def mindvideo.model.layers._regenerate_feature_map(x_offset)

Get the rescaled feature map, which was enlarged by ks**2 times.

### ProbDropPath3D

> class mindvideo.model.layers.ProbDropPath3D(keep_prob)

Drop path per sample using a fixed probability. The keep_prob parameter is the probability of keeping network units.

- base: nn.Cell

**Parameters:**

- keep_prob (int): Network unit keeping probability.
- ndim (int): Number of dropout features' dimension.

**Inputs:**

Tensor of ndim dimension.

**Return:**

A path-dropped tensor.

### DropoutDense

> class mindvideo.model.layers.DropoutDense(input_channel: int, out_channel: int, has_bias: bool = True, activation: Optional[Union[str, nn.Cell]] = None, keep_prob: float = 1.0)

Dropout + Dense architecture.

- base: nn.Cell

**Parameters:**

- input_channel (int): The number of input channels.
- out_channel (int): The number of output channels.
- has_bias (bool): Specifies whether the layer uses a bias vector. Default: True.
- activation (Union[str, Cell, Primitive]): Activation function applied to the output. E.g. `ReLU`. Default: None.
- keep_prob (float): Dropout keeping rate, between [0, 1]. E.g. keep_prob=0.9 means dropping out 10% of the input. Default: 1.0.

**Return:**

Tensor, output tensor.

### FairMOTSingleHead

> class mindvideo.model.layers.FairMOTSingleHead(in_channel, head_conv=0, classes=100, kernel_size=3, bias_init=Zero())

Simple convolutional head. Two conv2d layers are created if head_conv > 0; otherwise there is only one conv2d layer.

- base: nn.Cell

**Parameters:**

- in_channel (int): Channel size of the input feature.
- head_conv (int): Channel size between the two conv2d layers. There will be only one conv2d layer if head_conv equals 0. Default: 0.
- classes (int): Number of classes, which is the channel size of the output tensor.
- kernel_size (Union[int, tuple]): The kernel size of the first conv2d layer.
- bias_init (Union[Tensor, str, Initializer, numbers.Number]): Bias initialization of the last conv2d layer. The input value is the same as `mindspore.common.initializer.initializer`.

**Return:**

Tensor, the classification result.

### FairMOTMultiHead

> class mindvideo.model.layers.FairMOTMultiHead(heads, in_channel, head_conv=0, kernel_size=3)

FairMOT multi-conv head, a combination of single heads.

- base: nn.Cell

**Parameters:**

- heads (dict): A dict containing the name and output dimension of each head. The name is the key and the output dimension is the value. For FairMOT, it must have 'hm', 'wh', 'id' and 'reg' heads.
- in_channel (int): Channel size of the input feature.
- head_conv (int): Channel size between the two conv2d layers. There will be only one conv2d layer if head_conv equals 0. Default: 0.
- kernel_size (Union[int, tuple]): The kernel size of the first conv2d layer.
- bias_init (Union[Tensor, str, Initializer, numbers.Number]): Bias initialization of the last conv2d layer.
The input value is the same as `mindspore.common.initializer.initializer`.

**Return:**

Tensor, the multi-head classification results.

### FeedForward

> class mindvideo.model.layers.FeedForward(in_features: int, hidden_features: Optional[int] = None, out_features: Optional[int] = None, activation: nn.Cell = nn.GELU, keep_prob: float = 1.0)

Feed Forward layer implementation.

- base: nn.Cell

**Parameters:**

- in_features (int): The dimension of input features.
- hidden_features (int): The dimension of hidden features. Default: None.
- out_features (int): The dimension of output features. Default: None.
- activation (nn.Cell): Activation function which will be stacked on top of the normalization layer (if not None), otherwise on top of the conv layer. Default: nn.GELU.
- keep_prob (float): The keep rate, greater than 0 and less than or equal to 1. Default: 1.0.

**Return:**

Tensor, output tensor.

### Hungarian

> class mindvideo.model.layers.Hungarian(dim)

Given a cost matrix, calculates the assignment with the lowest total cost. This op currently supports only square matrices.

- base: nn.Cell

**Parameters:**

- dim (int): The size of the input square matrix.

**Inputs:**

- x (Tensor): The input cost matrix.

**Returns:**

- Tensor[bool]: The best assignment. There can be multiple solutions.
- Tensor[int32]: The indices of the row assignment.
- Tensor[int32]: The indices of the column assignment.

> def mindvideo.model.layers.Hungarian.create_onehot(idx)

Calculate a one-hot vector from the input index.

**Return:**

Tensor: One-hot vector.

> def mindvideo.model.layers.Hungarian.get_assign(assign_matrix)

Make every row of the assignment matrix have at most one assignment.

**Return:**

Tensor: Assignment matrix.

> def mindvideo.model.layers.Hungarian.try_assign(x)

Try an assignment and, if it succeeds, return the result.

**Return:**

Tensor: The best assignment. There can be multiple solutions.
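To illustrate what "the assignment with the lowest total cost" means for a square cost matrix, the sketch below searches all permutations in pure Python. This is a brute-force reference, not the Hungarian algorithm and not the `Hungarian` cell itself; the function name `best_assignment` is hypothetical, and the three return values only mirror the documented outputs loosely.

```python
from itertools import permutations


def best_assignment(cost):
    """Brute-force minimum-cost assignment for a small square cost matrix.

    Returns (boolean assignment matrix, row indices, column indices).
    Runs in O(n!) time, so it is only suitable for tiny matrices.
    """
    n = len(cost)
    # cols[r] is the column assigned to row r; try every permutation.
    best_cols = min(
        permutations(range(n)),
        key=lambda cols: sum(cost[r][c] for r, c in enumerate(cols)),
    )
    assign = [[c == best_cols[r] for c in range(n)] for r in range(n)]
    return assign, list(range(n)), list(best_cols)


cost_matrix = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
assign, rows, cols = best_assignment(cost_matrix)
# Row 0 takes column 1, row 1 takes column 0, row 2 takes column 2 (total cost 5).
```

When several assignments share the minimum cost, this sketch returns the first one found, which matches the note that there can be multiple solutions.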
### Inflate3D

> class mindvideo.model.layers.Inflate3D(in_channel: int, out_channel: int, mid_channel: int = 0, stride: tuple = (1, 1, 1), kernel_size: tuple = (3, 3, 3), conv2_group: int = 1, norm: Optional[nn.Cell] = nn.BatchNorm3d, activation: List[Optional[Union[nn.Cell, str]]] = (nn.ReLU, None), inflate: int = 1)

Inflate3D block definition.

- base: nn.Cell

**Parameters:**

- in_channel (int): The number of channels of input frame images.
- out_channel (int): The number of channels of output frame images.
- mid_channel (int): The number of channels of inner frame images. Default: 0.
- kernel_size (tuple): The size of the spatial-temporal convolutional layer kernels. Default: (3, 3, 3).
- stride (Union[int, Tuple[int]]): Stride size for the second convolutional layer. Default: (1, 1, 1).
- conv2_group (int): Splits the filter into groups for the second conv layer. in_channels and out_channels must be divisible by the number of groups. Default: 1.
- norm (Optional[nn.Cell]): Norm layer that will be stacked on top of the convolution layer. Default: nn.BatchNorm3d.
- activation (List[Optional[Union[nn.Cell, str]]]): Activation function which will be stacked on top of the normalization layer (if not None), otherwise on top of the conv layer. Default: (nn.ReLU, None).
- inflate (int): Whether to inflate the two conv3d layers with different kernel sizes. Default: 1.

**Return:**

Tensor, output tensor.

### HungarianMatcher

> class mindvideo.model.layers.HungarianMatcher(num_frames: int = 36, cost_class: float = 1, cost_bbox: float = 1, cost_giou: float = 1)

This class computes an assignment between the targets and the predictions of the network. For efficiency reasons, the targets don't include the no_object class. Because of this, in general, there are more predictions than targets. In this case, a 1-to-1 matching of the best predictions is performed, while the others are left unmatched (and thus treated as non-objects).

- base: nn.Cell

**Parameters:**

- num_frames: The number of frames.
- cost_class: The relative weight of the classification error in the matching cost.
- cost_bbox: The relative weight of the L1 error of the bounding box coordinates in the matching cost.
- cost_giou: The relative weight of the GIoU loss of the bounding box in the matching cost.

**Return:**

Tensor, output tensor.

> def mindvideo.model.layers.HungarianMatcher._CxcywhToXyxy(x)

Converts boxes from CxCyWH format to XYXY format.

**Parameters:**

- x (Tensor): Tensor whose last dimension is four.

**Return:**

Tensor, whose last dimension is four.

### MaskHeadSmallConv

> class mindvideo.model.layers.MaskHeadSmallConv(dim, fpn_dims, context_dim)

Simple convolutional head, using group norm. Upsampling is done using an FPN approach.

- base: nn.Cell

**Parameters:**

- dim (int): Size of the embeddings (dimension of the transformer) plus the number of attention heads inside the transformer's attentions.
- fpn_dims (dict): Three dims for the FPN.
- context_dim (int): Size of the embeddings (dimension of the transformer).

**Inputs:**

- x (Tensor): Sequence of encoded features.
- bbox_mask (Tensor): The attention softmax of the bbox.
- fpns (list[Tensor]): Image features without positional encoding.

**Return:**

Tensor.

### MaxPool3D

> class mindvideo.model.layers.MaxPool3D(kernel_size=1, strides=1, pad_mode="VALID", pad_list=0, ceil_mode=None, data_format="NCDHW")

3D max pooling operation. Applies 3D max pooling over an input Tensor, which can be regarded as a composition of 3D planes.

- base: nn.Cell

**Parameters:**

- kernel_size (Union[int, tuple[int]]): The size of the kernel used to take the maximum value. It is either an int that represents the depth, height and width of the kernel, or a tuple of three ints that represent depth, height and width respectively. Default: 1.
- strides (Union[int, tuple[int]]): The distance the kernel moves at each step. It is either an int that is used as the stride along depth, height and width, or a tuple of three ints that represent the depth, height and width of movement respectively.
Default: 1.
- pad_mode (str): The optional value for pad mode, either "same" or "valid", not case sensitive. Default: "valid".
- pad_list (Union[int, tuple[int]]): The pad value to be filled. Default: 0. If `pad_list` is an integer, the paddings of head, tail, top, bottom, left and right are the same, equal to `pad_list`. If `pad_list` is a tuple of six integers, the paddings of head, tail, top, bottom, left and right are equal to pad_list[0], pad_list[1], pad_list[2], pad_list[3], pad_list[4] and pad_list[5] respectively.
- ceil_mode (bool): Whether to use ceil instead of floor to calculate the output shape. Only effective in "pad" mode. When `pad_mode` is "pad" and `ceil_mode` is None, `ceil_mode` will be set to False. Default: None.
- data_format (str): The optional value for data format. Currently only 'NCDHW' is supported. Default: 'NCDHW'.

**Inputs:**

- **x** (Tensor) - Tensor of shape :math:`(N, C, D_{in}, H_{in}, W_{in})`. Data type must be float16 or float32.

**Return:**

Tensor, with shape :math:`(N, C, D_{out}, H_{out}, W_{out})`. Has the same data type as `x`.

### Maxpool3DwithPad

> class mindvideo.model.layers.Maxpool3DwithPad(kernel_size, padding, strides=1, pad_mode='SYMMETRIC')

3D max pooling with padding operation.

- base: nn.Cell

**Parameters:**

- kernel_size (Union[int, tuple[int]]): The size of the kernel used to take the maximum value. It is either an int that represents the depth, height and width of the kernel, or a tuple of three ints that represent depth, height and width respectively. Default: 1.
- padding (Union[int, tuple[int]]): The pad value to be filled. Default: 0. If `padding` is an integer, the paddings of head, tail, top, bottom, left and right are the same, equal to `padding`. If `padding` is a tuple of six integers, the paddings of head, tail, top, bottom, left and right are equal to padding[0], padding[1], padding[2], padding[3], padding[4] and padding[5] respectively.
- strides (Union[int, tuple[int]]): The distance the kernel moves at each step. It is either an int that is used as the stride along depth, height and width, or a tuple of three ints that represent the depth, height and width of movement respectively. Default: 1.
- pad_mode (str): The optional value of pad mode, one of "same", "valid" or "SYMMETRIC". Default: "SYMMETRIC".

**Return:**

Tensor, output tensor.

### MHAttentionMsp

> class mindvideo.model.layers.MHAttentionMsp(query_dim, hidden_dim, num_heads, dropout=0.0, bias=True)

This is a 2D attention module, which only returns the attention softmax (no multiplication by value).

- base: nn.Cell

**Parameters:**

- query_dim (int): The number of channels in the input sequence.
- hidden_dim (int): The number of channels in the output sequence.
- num_heads (int): The number of parallel attention heads.
- dropout (float): The dropout rate. Default: 0.0.
- bias (bool): Whether the Conv layer has a bias parameter. Default: True.

**Return:**

Tensor, output tensor.

### MLP

> class mindvideo.model.layers.MLP(input_dim, hidden_dim, output_dim, num_layers)

Very simple multi-layer perceptron (also called FFN).

- base: nn.Cell

**Parameters:**

- input_dim (int): The number of channels in the input space.
- hidden_dim (int): The number of extra channels.
- output_dim (int): The number of channels in the output space.
- num_layers (int): The number of layers in the MLP.

**Return:**

Tensor, one tensor.

### linear

> def mindvideo.model.layers.linear(input_arr, weight, bias=None)

Applies a linear transformation to the incoming data: :math:`y = xA^T + b`.

**Parameters:**

- Input: :math:`(N, *, in_features)`, where N is the batch size and `*` means any number of additional dimensions.
- Weight: :math:`(out_features, in_features)`
- Bias: :math:`(out_features)`
- Output: :math:`(N, *, out_features)`

**Return:**

Tensor.

### MultiheadAttention

> class mindvideo.model.layers.MultiheadAttention(embed_dim, num_heads, dropout=0.)

Multi-head attention.

- base: nn.Cell

**Parameters:**

- embed_dim (int): Total dimension of the model.
- num_heads (int): The number of parallel attention heads.
- dropout (float): Rate of the Dropout layer applied to attn_output_weights. Default: 0.

**Return:**

Tensor.

### ResidualBlockBase

> class mindvideo.model.layers.ResidualBlockBase(in_channel: int, out_channel: int, stride: int = 1, group: int = 1, base_width: int = 64, norm: Optional[nn.Cell] = None, down_sample: Optional[nn.Cell] = None)

ResNet residual block base definition.

- base: nn.Cell

**Parameters:**

- in_channel (int): Input channel.
- out_channel (int): Output channel.
- stride (int): Stride size for the first convolutional layer. Default: 1.
- group (int): Group convolutions. Default: 1.
- base_width (int): Width of per group. Default: 64.
- norm (nn.Cell, optional): Module specifying the normalization layer to use. Default: None.
- down_sample (nn.Cell, optional): Downsample structure. Default: None.

**Return:**

Tensor, output tensor.

### ResidualBlock

> class mindvideo.model.layers.ResidualBlock(in_channel: int, out_channel: int, stride: int = 1, group: int = 1, base_width: int = 64, norm: Optional[nn.Cell] = None, down_sample: Optional[nn.Cell] = None)

ResNet residual block definition.

- base: nn.Cell

**Parameters:**

- in_channel (int): Input channel.
- out_channel (int): Output channel.
- stride (int): Stride size for the second convolutional layer. Default: 1.
- group (int): Group convolutions. Default: 1.
- base_width (int): Width of per group. Default: 64.
- norm (nn.Cell, optional): Module specifying the normalization layer to use. Default: None.
- down_sample (nn.Cell, optional): Downsample structure. Default: None.

**Return:**

Tensor, output tensor.

### ResNet

> class mindvideo.model.layers.ResNet(block: Type[Union[ResidualBlockBase, ResidualBlock]], layer_nums: List[int], group: int = 1, base_width: int = 64, norm: Optional[nn.Cell] = None)

ResNet architecture.
- base: nn.Cell

**Parameters:**

- block (Type[Union[ResidualBlockBase, ResidualBlock]]): The block for the network.
- layer_nums (list): The number of blocks in the different layers.
- group (int): The number of group convolutions. Default: 1.
- base_width (int): The width of per group. Default: 64.
- norm (nn.Cell, optional): The module specifying the normalization layer to use. Default: None.

**Inputs:**

- **x** (Tensor) - Tensor of shape :math:`(N, C_{in}, H_{in}, W_{in})`.

**Return:**

Tensor of shape :math:`(N, 2048, 7, 7)`

### ResidualBlockBase3D

> class mindvideo.model.layers.ResidualBlockBase3D(in_channel: int, out_channel: int, mid_channel: int = 0, conv12: Optional[nn.Cell] = Inflate3D, group: int = 1, base_width: int = 64, norm: Optional[nn.Cell] = None, down_sample: Optional[nn.Cell] = None, **kwargs)

ResNet3D residual block base definition.

- base: nn.Cell

**Parameters:**

- in_channel (int): Input channel.
- out_channel (int): Output channel.
- conv12 (nn.Cell, optional): Block that constructs the first two conv layers. It can be `Inflate3D`, `Conv2Plus1D` or another custom block. This block should construct a layer whose output feature channel size is named `mid_channel` for the third conv layer. Default: Inflate3D.
- group (int): Group convolutions. Default: 1.
- base_width (int): Width of per group. Default: 64.
- norm (nn.Cell, optional): Module specifying the normalization layer to use. Default: None.
- down_sample (nn.Cell, optional): Downsample structure. Default: None.
- **kwargs (dict, optional): Keyword arguments for "conv12". It can contain "stride", "inflate", etc.

**Return:**

Tensor, output tensor.
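The residual blocks above all take an optional `down_sample` structure. The sketch below shows the usual residual pattern in pure Python, with lists standing in for tensors. It assumes the standard ResNet formulation (shortcut added to the block output, then ReLU) rather than reproducing the MindVideo code; `residual_forward`, `body` and the toy callables are hypothetical names.

```python
def relu(values):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in values]


def residual_forward(x, body, down_sample=None):
    """Standard residual pattern: relu(body(x) + shortcut(x)).

    `body` stands in for the stacked conv/norm layers of a block and
    `down_sample` for the optional shortcut projection. Both are plain
    callables here, which is an illustrative simplification.
    """
    identity = down_sample(x) if down_sample is not None else x
    out = body(x)
    return relu([o + i for o, i in zip(out, identity)])


def double(values):
    return [2.0 * v for v in values]


def halve(values):
    return [0.5 * v for v in values]


# Without down_sample the input itself is used as the shortcut.
plain = residual_forward([1.0, -2.0], body=double)
# With down_sample the shortcut is transformed before the addition.
projected = residual_forward([1.0, -2.0], body=double, down_sample=halve)
```

In a real block, `down_sample` is typically needed when the stride or channel count changes, so that the shortcut matches the shape of the block output.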
### ResidualBlock3D

> class mindvideo.model.layers.ResidualBlock3D(in_channel: int, out_channel: int, mid_channel: int = 0, conv12: Optional[nn.Cell] = Inflate3D, group: int = 1, base_width: int = 64, norm: Optional[nn.Cell] = None, activation: List[Optional[Union[nn.Cell, str]]] = (nn.ReLU, None), down_sample: Optional[nn.Cell] = None, **kwargs)

ResNet3D residual block definition.

- base: nn.Cell

**Parameters:**

- in_channel (int): Input channel.
- out_channel (int): Output channel.
- mid_channel (int): Inner channel.
- conv12 (nn.Cell, optional): Block that constructs the first two conv layers. It can be `Inflate3D`, `Conv2Plus1D` or another custom block. This block should construct a layer whose output feature channel size is named `mid_channel` for the third conv layer. Default: Inflate3D.
- group (int): Group convolutions. Default: 1.
- base_width (int): Width of per group. Default: 64.
- norm (nn.Cell, optional): Module specifying the normalization layer to use. Default: None.
- activation (List[Optional[Union[nn.Cell, str]]]): Activation function which will be stacked on top of the normalization layer (if not None), otherwise on top of the conv layer. Default: (nn.ReLU, None).
- down_sample (nn.Cell, optional): Downsample structure. Default: None.
- **kwargs (dict, optional): Keyword arguments for "conv12". It can contain "stride", "inflate", etc.

**Return:**

Tensor, output tensor.

### ResNet3D

> class mindvideo.model.layers.ResNet3D(block: Optional[nn.Cell], layer_nums: Tuple[int], stage_channels: Tuple[int] = (64, 128, 256, 512), stage_strides: Tuple[Tuple[int]] = ((1, 1, 1), (1, 2, 2), (1, 2, 2), (1, 2, 2)), group: int = 1, base_width: int = 64, norm: Optional[nn.Cell] = None, down_sample: Optional[nn.Cell] = Unit3D, **kwargs)

ResNet3D architecture.

- base: nn.Cell

**Parameters:**

- block (Optional[nn.Cell]): The block for the network.
- layer_nums (Tuple[int]): The number of blocks in the different layers.
- stage_channels (Tuple[int]): Output channel for every res stage. Default: (64, 128, 256, 512).
- stage_strides (Tuple[Tuple[int]]): Strides for every res stage. Default: ((1, 1, 1), (1, 2, 2), (1, 2, 2), (1, 2, 2)).
- group (int): The number of group convolutions. Default: 1.
- base_width (int): The width of per group. Default: 64.
- norm (nn.Cell, optional): The module specifying the normalization layer to use. Default: None.
- down_sample (nn.Cell, optional): Residual block in every resblock. It can transform the input feature to the same channel count as the output. Default: Unit3D.
- kwargs (dict, optional): Keyword arguments for "make_res_layer" and resblocks.

**Inputs:**

- **x** (Tensor) - Tensor of shape :math:`(N, C_{in}, T_{in}, H_{in}, W_{in})`.

**Return:**

Tensor of shape :math:`(N, 2048, 7, 7, 7)`

### Roll3D

> class mindvideo.model.layers.Roll3D(shift)

Roll Tensors of shape (B, D, H, W, C).

- base: nn.Cell

**Parameters:**

- shift (tuple[int]): Shift size for target rolling.

**Inputs:**

Tensor of shape (B, D, H, W, C).

**Return:**

Rolled Tensor.

### make_divisible

> def mindvideo.model.layers.make_divisible(v: float, divisor: int, min_value: Optional[int] = None)

Ensures that all layers have a channel number that is divisible by 8.

**Parameters:**

- v (int): Original channel of the kernel.
- divisor (int): Divisor of the original channel.
- min_value (int, optional): Minimum number of channels.

**Return:**

Number of channels.

### SqueezeExcite3D

> class mindvideo.model.layers.SqueezeExcite3D(dim_in, ratio, act_fn: Union[str, nn.Cell] = Swish)

Squeeze-and-Excitation (SE) block implementation.

- base: nn.Cell

**Parameters:**

- dim_in (int): The channel dimensions of the input.
- ratio (float): The channel reduction ratio for squeeze.
- act_fn (Union[str, nn.Cell]): The activation of conv_expand. Default: Swish.

**Return:**

Tensor.

### Swish

> class mindvideo.model.layers.Swish()

Swish activation function: x * sigmoid(x).
- base: nn.Cell

**Parameters:**

None

**Return:**

Tensor.

### Unit3D

> class mindvideo.model.layers.Unit3D(in_channels: int, out_channels: int, kernel_size: Union[int, Tuple[int]] = 3, stride: Union[int, Tuple[int]] = 1, pad_mode: str = 'pad', padding: Union[int, Tuple[int]] = 0, dilation: Union[int, Tuple[int]] = 1, group: int = 1, activation: Optional[nn.Cell] = nn.ReLU, norm: Optional[nn.Cell] = nn.BatchNorm3d, pooling: Optional[nn.Cell] = None, has_bias: bool = False)

Conv3d fused with normalization and activation blocks definition.

- base: nn.Cell

**Parameters:**

- in_channels (int): The number of channels of input frame images.
- out_channels (int): The number of channels of output frame images.
- kernel_size (Union[int, Tuple[int]]): The size of the conv3d kernel. Default: 3.
- stride (Union[int, Tuple[int]]): Stride size for the first convolutional layer. Default: 1.
- pad_mode (str): Specifies the padding mode. The optional values are "same", "valid" and "pad". Default: "pad".
- padding (Union[int, Tuple[int]]): Implicit paddings on both sides of the input x. If `pad_mode` is "pad" and `padding` is not specified by the user, then the padding size will be `(kernel_size - 1) // 2` for the D, H and W dimensions.
- dilation (Union[int, Tuple[int]]): Specifies the dilation rate to use for dilated convolution. Default: 1.
- group (int): Splits the filter into groups. in_channels and out_channels must be divisible by the number of groups. Default: 1.
- activation (Optional[nn.Cell]): Activation function which will be stacked on top of the normalization layer (if not None), otherwise on top of the conv layer. Default: nn.ReLU.
- norm (Optional[nn.Cell]): Norm layer that will be stacked on top of the convolution layer. Default: nn.BatchNorm3d.
- pooling (Optional[nn.Cell]): Pooling layer (if not None) that will be stacked on top of all the former layers. Default: None.
- has_bias (bool): Whether to use a bias. Default: False.

**Return:**

Tensor, output tensor.
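The `(kernel_size - 1) // 2` rule quoted for `Unit3D` padding can be checked with a few lines of arithmetic. The sketch below is illustrative only: `default_padding` and `conv_out_size` are hypothetical helpers, and the output-size formula is the standard one for dilation 1, not code taken from `Unit3D`.

```python
def default_padding(kernel_size):
    """Per-axis padding (kernel_size - 1) // 2 for a 3D kernel.

    Accepts an int (same kernel along all three axes) or a 3-tuple.
    """
    if isinstance(kernel_size, int):
        kernel_size = (kernel_size,) * 3
    return tuple((k - 1) // 2 for k in kernel_size)


def conv_out_size(size_in, kernel, stride, pad):
    """Standard convolution output length along one axis (dilation 1)."""
    return (size_in + 2 * pad - kernel) // stride + 1


cube_pad = default_padding(3)             # same padding on every axis
spatial_pad = default_padding((1, 7, 7))  # no temporal padding for a 1x7x7 kernel

# With stride 1 and an odd kernel, this padding preserves the input length.
same_length = conv_out_size(16, kernel=3, stride=1, pad=cube_pad[0])
```

For odd kernel sizes and stride 1, this choice keeps the output the same size as the input along each padded axis.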
### TransformerDecoder

> class mindvideo.model.layers.TransformerDecoder(decoder_layers, norm=None, return_intermediate=False)

Transformer decoder is a stack of N decoder layers.

- base: nn.Cell

**Parameters:**

- decoder_layers (nn.Cell): An instance of the TransformerDecoderLayer() class.
- norm (nn.Cell): The layer normalization component (optional). Default: None.
- return_intermediate (bool): Whether to return intermediate results. Default: False.

**Inputs:**

- tgt (Tensor): The sequence to the decoder.
- memory (Tensor): The sequence from the last layer of the encoder.
- tgt_key_padding_mask (Tensor): The mask for the tgt keys per batch.
- memory_key_padding_mask (Tensor): The mask for the memory keys per batch.
- pos (Tensor): Memory's encoded position.
- query_pos (Tensor): Tgt's encoded position.

**Return:**

Tensor.

### TransformerDecoderLayer

> class mindvideo.model.layers.TransformerDecoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation="relu", normalize_before=False)

Transformer decoder layer is made up of self-attn and feedforward network.

- base: nn.Cell

**Parameters:**

- d_model (int): The number of expected features in the input.
- nhead (int): The number of heads in the multi-head attention models.
- dim_feedforward (int): The dimension of the feedforward network model. Default: 2048.
- dropout (float): The dropout value. Default: 0.1.
- activation (str): The activation function of the intermediate layer. Can be a string ("relu" or "gelu") or a unary callable. Default: "relu".
- normalize_before (bool): Whether to apply normalization before the decoder layer. Default: False.

**Inputs:**

- tgt (Tensor): The sequence to the decoder.
- memory (Tensor): The sequence from the last layer of the encoder.
- tgt_key_padding_mask (Tensor): The mask for the tgt keys per batch.
- memory_key_padding_mask (Tensor): The mask for the memory keys per batch.
- pos (Tensor): Memory's encoded position.
- query_pos (Tensor): Tgt's encoded position.

**Return:**

Tensor.
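The `normalize_before` flag is commonly used to choose between the "post-norm" and "pre-norm" orderings of a residual sub-layer. The pure-Python sketch below illustrates those two orderings on a single feature vector. That this is how `TransformerDecoderLayer` interprets the flag is an assumption, and `layer_norm`, `post_norm_step`, `pre_norm_step` and `double` are hypothetical names.

```python
import math


def layer_norm(values, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def post_norm_step(x, sublayer):
    """Post-norm ordering: norm(x + sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


def pre_norm_step(x, sublayer):
    """Pre-norm ordering: x + sublayer(norm(x))."""
    y = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, y)]


def double(values):
    """Toy stand-in for an attention or feedforward sub-layer."""
    return [2.0 * v for v in values]


post = post_norm_step([1.0, 3.0], double)  # normalized after the residual add
pre = pre_norm_step([1.0, 3.0], double)    # residual path left unnormalized
```

In the post-norm ordering the output is always normalized, whereas in the pre-norm ordering the residual path carries the raw input through unchanged.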
### TransformerEncoder

> class mindvideo.model.layers.TransformerEncoder(encoder_layers, norm=None)

Transformer encoder is a stack of N encoder layers.

- base: nn.Cell

**Parameters:**

- encoder_layers: A list of instances of the TransformerEncoderLayer class.
- norm: The layer normalization component.

**Inputs:**

- src: The sequence to the encoder.
- src_key_padding_mask: The mask for the src keys per batch.
- pos: The sequence's encoded position.

**Return:**

Tensor.

### TransformerEncoderLayer

> class mindvideo.model.layers.TransformerEncoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation="relu", normalize_before=False)

Transformer encoder layer is made up of self-attn and feedforward network.

- base: nn.Cell

**Parameters:**

- d_model (int): The number of expected features in the input.
- nhead (int): The number of heads in the multi-head attention models.
- dim_feedforward (int): The dimension of the feedforward network model. Default: 2048.
- dropout (float): The dropout value. Default: 0.1.
- activation (str): The activation function of the intermediate layer. Can be a string ("relu" or "gelu") or a unary callable. Default: "relu".
- normalize_before (bool): Whether to apply normalization before the encoder layer. Default: False.

**Inputs:**

- src: The sequence to the encoder.
- src_key_padding_mask: The mask for the src keys per batch.
- pos: The sequence's encoded position.

**Return:**

Tensor.
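The phrase "a stack of N encoder layers" with an optional final `norm` can be pictured with the following pure-Python sketch. It deliberately ignores `src_key_padding_mask` and `pos`, uses plain callables in place of `nn.Cell` layers, and the names `run_encoder`, `add_one` and `halve` are hypothetical.

```python
def run_encoder(src, layers, norm=None):
    """Apply each layer in order, then the optional final normalization.

    `layers` is a list of callables standing in for encoder layers;
    masks and positional encodings are omitted for brevity.
    """
    out = src
    for layer in layers:
        out = layer(out)
    if norm is not None:
        out = norm(out)
    return out


def add_one(values):
    """Toy stand-in for one encoder layer."""
    return [v + 1.0 for v in values]


def halve(values):
    """Toy stand-in for the final normalization component."""
    return [v / 2.0 for v in values]


stacked = run_encoder([0.0, 1.0], [add_one, add_one, add_one])
stacked_and_normed = run_encoder([0.0, 1.0], [add_one, add_one, add_one], norm=halve)
```

The output of each layer feeds the next one, so the stack depth is simply the length of the layer list.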