The EfficientDet Architecture in PyTorch
In this blog post, we will look at how to implement the EfficientDet architecture in PyTorch from scratch.
This blog post is a direct continuation of my previous blog post explaining EfficientDets. In my previous post, we looked at what's inside an EfficientDet and read about the various components, such as BiFPN and Compound Scaling, that make an EfficientDet network so powerful.
Today, our focus will be to build on top of that knowledge and showcase how to implement the network in PyTorch step by step. Throughout this blog post, I have added some side notes to explain things better.
These side-notes would look something like this.
NOTE: The code implementations shared below are not my own. All code shown below has been directly copied from Ross Wightman's wonderful repo efficientdet-pytorch. efficientdet-pytorch makes heavy use of timm to create the backbone network and also for several other operations. As part of this blog post, we will not be looking at the source code of timm; we will only be looking at the implementation inside the efficientdet-pytorch repo. This is a conscious decision to keep this blog post from blowing up. Also, everything below is based on "my understanding" of the code. It is possible that Ross might have implemented things differently from the way in which I have understood them.
There are a few notable things in the architecture above that we must look at before starting with the implementation:
1. The BiFPN Layer only interacts with the feature maps at levels 3-7 of the backbone network.
2. EfficientNets are used as the backbone network for EfficientDets.
3. There are bottom-up and top-down connections between the feature maps at different levels. Thus, we need to be able to upsample or downsample the features.
4. The BiFPN network consists of multiple BiFPN layers, and the number of BiFPN layers depends on the size of the EfficientDet (compound scaling).
5. The EfficientDet architecture consists of two main components - backbone + BiFPN network.
6. Each "Node" inside a BiFPN layer can accept either 2 or 3 inputs, and it combines them to produce a single output.
We are going to use a bottom-up approach this time and build the EfficientDet together, component by component.
1 ResampleFeatureMap
So, the first thing we are going to implement is a class called ResampleFeatureMap that is able to upsample or downsample an input feature map based on a parameter called reduction_ratio. This class essentially represents the "arrows" in the architecture diagram above.
```python
class ResampleFeatureMap(nn.Sequential):

    def __init__(
            self, in_channels, out_channels, reduction_ratio=1., pad_type='', downsample=None, upsample=None,
            norm_layer=nn.BatchNorm2d, apply_bn=False, conv_after_downsample=False, redundant_bias=False):
        super(ResampleFeatureMap, self).__init__()
        downsample = downsample or 'max'
        upsample = upsample or 'nearest'
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.reduction_ratio = reduction_ratio
        self.conv_after_downsample = conv_after_downsample

        conv = None
        if in_channels != out_channels:
            conv = ConvBnAct2d(
                in_channels, out_channels, kernel_size=1, padding=pad_type,
                norm_layer=norm_layer if apply_bn else None,
                bias=not apply_bn or redundant_bias, act_layer=None)

        if reduction_ratio > 1:
            if conv is not None and not self.conv_after_downsample:
                self.add_module('conv', conv)
            if downsample in ('max', 'avg'):
                stride_size = int(reduction_ratio)
                downsample = create_pool2d(
                    downsample, kernel_size=stride_size + 1, stride=stride_size, padding=pad_type)
            else:
                downsample = Interpolate2d(scale_factor=1. / reduction_ratio, mode=downsample)
            self.add_module('downsample', downsample)
            if conv is not None and self.conv_after_downsample:
                self.add_module('conv', conv)
        else:
            if conv is not None:
                self.add_module('conv', conv)
            if reduction_ratio < 1:
                scale = int(1 // reduction_ratio)
                self.add_module('upsample', Interpolate2d(scale_factor=scale, mode=upsample))
```
Here is the general idea - if out_channels is not equal to in_channels, then use a 1x1 convolution to make them the same. Also, if the reduction ratio is not equal to 1, then either upsample or downsample the input feature map as required: if reduction_ratio < 1, upsample the input; if reduction_ratio > 1, downsample the input.

> Upsampling or downsampling in simple terms refers to making the spatial dimensions of the input feature map larger or smaller. Upsampling is generally done using interpolation (nearest-neighbour by default in this implementation) and downsampling is generally done using pooling.
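To make these two operations concrete, here is a minimal sketch using plain PyTorch ops (just an illustration, not part of the repo):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 40, 64, 64)

# upsample: nearest-neighbour interpolation doubles the spatial dimensions
up = F.interpolate(x, scale_factor=2, mode='nearest')

# downsample: max pooling with stride 2 halves the spatial dimensions
down = F.max_pool2d(x, kernel_size=3, stride=2, padding=1)

print(up.shape, down.shape)
# torch.Size([1, 40, 128, 128]) torch.Size([1, 40, 32, 32])
```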
So an example of using this class, assuming all imports work, would be:
```python
# downsampling
inp = torch.randn(1, 40, 64, 64)
resample = ResampleFeatureMap(in_channels=40, out_channels=112, reduction_ratio=2)
out = resample(inp)
print(inp.shape, out.shape)

>> torch.Size([1, 40, 64, 64]) torch.Size([1, 112, 32, 32])

# upsampling
inp = torch.randn(1, 40, 64, 64)
resample = ResampleFeatureMap(in_channels=40, out_channels=112, reduction_ratio=0.5)
out = resample(inp)
print(inp.shape, out.shape)

>> torch.Size([1, 40, 64, 64]) torch.Size([1, 112, 128, 128])
```
One key thing that’s part of this class, is that the class does not have a forward
method defined that is common to almost all layers in PyTorch. The reason is that this class inherits from nn.Sequential
instead of nn.Module
. This class does not need a forward
method to be defined and automatically calls the modules defined in this class one by one. That is why we do things like self.add_module
inside the ResampleFeatureMap
class.
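As a quick illustration of this behaviour (a toy example, unrelated to the repo), subclassing nn.Sequential and registering modules via add_module gives us a working forward pass for free:

```python
import torch
import torch.nn as nn

class TwoSteps(nn.Sequential):
    def __init__(self):
        super(TwoSteps, self).__init__()
        # modules are executed in the order they are added; no forward method needed
        self.add_module('conv', nn.Conv2d(3, 8, kernel_size=1))
        self.add_module('pool', nn.MaxPool2d(2))

print(TwoSteps()(torch.randn(1, 3, 32, 32)).shape)
# torch.Size([1, 8, 16, 16])
```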
Another thing: the convolution operation inside ResampleFeatureMap calls ConvBnAct2d and not nn.Conv2d. ConvBnAct2d, as the name suggests, is a convolution operation followed by Batch Normalization and an activation function.

So, here is the implementation of ConvBnAct2d:
```python
class ConvBnAct2d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, stride=1, dilation=1, padding='', bias=False,
                 norm_layer=nn.BatchNorm2d, act_layer=_ACT_LAYER):
        super(ConvBnAct2d, self).__init__()
        self.conv = create_conv2d(
            in_channels, out_channels, kernel_size, stride=stride, dilation=dilation, padding=padding, bias=bias)
        self.bn = None if norm_layer is None else norm_layer(out_channels)
        self.act = None if act_layer is None else act_layer(inplace=True)

    def forward(self, x):
        x = self.conv(x)
        if self.bn is not None:
            x = self.bn(x)
        if self.act is not None:
            x = self.act(x)
        return x
```
create_conv2d is a function from timm that, in our case, creates an nn.Conv2d layer. We won't go into the source code of this function as it is part of the timm library, which we will look into in a series of blog posts later.
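As a quick usage sketch (assuming the repo's imports are in scope; the choice of nn.SiLU as act_layer is mine, purely for illustration):

```python
# hypothetical usage of ConvBnAct2d; act_layer choice is an assumption
layer = ConvBnAct2d(40, 112, kernel_size=1, norm_layer=nn.BatchNorm2d, act_layer=nn.SiLU)
out = layer(torch.randn(1, 40, 64, 64))
print(out.shape)
# torch.Size([1, 112, 64, 64])
```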
Now, let’s start to get into the tricky bits. Let’s see how could we implement a single BiFpnLayer
.
2 BiFPN Layer
Looking at the image above, we can see that a BiFPN Layer has Nodes. To be specific, each BiFPN Layer has 5 input nodes (numbered 0-4) and 8 internal nodes (numbered 5-12). The input nodes for the first BiFPN Layer are the feature outputs from the EfficientNet backbone. For the subsequent BiFpn Layers, the feature outputs come from the previous BiFPN Layer.

Also, each arrow is the ResampleFeatureMap class, where the blue arrows perform downsampling and the red arrows perform upsampling. From a code perspective, there are some things we need in order to be able to implement the BiFPN Layer:
1. We need to be able to extract the feature maps from the EfficientNet backbone. timm will do this for us. As you'll notice later, we call the timm.create_model method passing in a parameter called out_indices and also features_only=True. This tells timm to create a model that extracts the required feature maps for us at the correct levels.
2. We need to be able to combine the features coming from different nodes at different levels. The class FpnCombine, which we will look at below, will take care of this for us.
3. We need to define the numbers and structures of the nodes in Python, similar to the diagram.
> For example, our implementation should know that Node-6 is the intermediate Node at level **P5** and it accepts the outputs of Node-5 and Node-2 as inputs.
4. Not all Nodes accept the same number of inputs. Some accept 2 inputs, whereas some nodes (such as 9, 10, 11) accept 3 inputs, as shown in fig-2. We can satisfy this requirement by passing the inputs as a List of Tensors.
5. The output features from the EfficientNet-B0 backbone at levels P3-P5 have 40, 112 and 320 channels respectively, and each spatial dimension is half that of the previous level (see the quick check right after this list).
> This is important to note: considering an input image of size [3, 512, 512], the sizes of the feature maps at levels P3-P5 would be [40, 64, 64], [112, 32, 32] and [320, 16, 16] respectively.
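A quick way to verify point 5 yourself, assuming you have timm installed (the model name here is my choice for this check):

```python
import timm
import torch

m = timm.create_model('efficientnet_b0', features_only=True, out_indices=(2, 3, 4))
feats = m(torch.randn(1, 3, 512, 512))
print([f.shape for f in feats])
# [torch.Size([1, 40, 64, 64]), torch.Size([1, 112, 32, 32]), torch.Size([1, 320, 16, 16])]
```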
With this general understanding, let’s get to work.
First things first, how does our implementation get to know about the numbers and structures of the nodes? There is a function called get_fpn_config that returns a Python dictionary like so:
```python
fpn_config = get_fpn_config()
fpn_config

>> {'nodes': [
     {'reduction': 64, 'inputs_offsets': [3, 4], 'weight_method': 'fastattn'},
     {'reduction': 32, 'inputs_offsets': [2, 5], 'weight_method': 'fastattn'},
     {'reduction': 16, 'inputs_offsets': [1, 6], 'weight_method': 'fastattn'},
     {'reduction': 8, 'inputs_offsets': [0, 7], 'weight_method': 'fastattn'},
     {'reduction': 16, 'inputs_offsets': [1, 7, 8], 'weight_method': 'fastattn'},
     {'reduction': 32, 'inputs_offsets': [2, 6, 9], 'weight_method': 'fastattn'},
     {'reduction': 64, 'inputs_offsets': [3, 5, 10], 'weight_method': 'fastattn'},
     {'reduction': 128, 'inputs_offsets': [4, 11], 'weight_method': 'fastattn'}]}
```
For now, let’s not worry about where this function comes from. Let’s just consider we know that such a function exists which returns a dictionary output like above. And let’s assume that we need to build our
BiFPN Layer
using the config returned from theget_fpn_config()
.
By looking at the fpn_config, we can see that nodes is a List of Dicts, and each item in the List represents a single Node. Specifically, the list represents Nodes 5-12. As can be seen and confirmed with the help of fig-2, Node-5, which is the first item in the List, accepts the outputs from Nodes 3 & 4, as represented by {'reduction': 64, 'inputs_offsets': [3, 4], 'weight_method': 'fastattn'}; Node-6, which is the second item in the List, accepts the outputs from Nodes 2 & 5, as represented by {'reduction': 32, 'inputs_offsets': [2, 5], 'weight_method': 'fastattn'}; and so on.
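To see this mapping at a glance, we can enumerate the printed dictionary ourselves (internal node ids start at 5 because ids 0-4 are the input feature maps; here I treat fpn_config as the plain dict shown above):

```python
for i, node in enumerate(fpn_config['nodes']):
    print('Node-{}: reduction={}, inputs={}'.format(i + 5, node['reduction'], node['inputs_offsets']))
# Node-5: reduction=64, inputs=[3, 4]
# Node-6: reduction=32, inputs=[2, 5]
# ...
# Node-12: reduction=128, inputs=[4, 11]
```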
I repeat, let’s not worry about where the
fpn_config
comes from but let’s just say there is such a config that god created for us and we will use it to build theBiFPN Layer
.
You might ask what’s this reduction
inside the fpn_config
? Can you see in fig-2
that there are somethings written like “input, P1/2, P2/4, P3/8…, well the denominator number is the reduction
. For example, at level P5, where Node-5
exists, the reduction
is 32. What this means is that the spatial dimensions of the feature map at this level are of size H/32 x W/32
where H
and W
are the original image Height and Width.
Great, good work so far! At least now we have a basic structure to build our BiFPN Layer on top of. Also, we now know which Nodes are linked to which other Nodes, as defined inside the fpn_config.

So, for now, let's move on to implementing the Nodes before implementing the BiFpnLayer itself.
2.1 FNode
Now, inside a Node, we need to be able to accept some inputs, combine those together, perform some computation on this combined input, and output a tensor.
```python
class Fnode(nn.Module):
    """ A simple wrapper used in place of nn.Sequential for torchscript typing
    Handles input type List[Tensor] -> output type Tensor
    """
    def __init__(self, combine: nn.Module, after_combine: nn.Module):
        super(Fnode, self).__init__()
        self.combine = combine
        self.after_combine = after_combine

    def forward(self, x: List[torch.Tensor]) -> torch.Tensor:
        return self.after_combine(self.combine(x))
```
This is exactly what the class Fnode above does. As can be seen in the forward method, it accepts a List of Tensors, applies the combine operation to merge the inputs together, then applies the after_combine operation on the result and returns the output.
Now, what are these combine and after_combine operations? The combine operation is the one that will make sure that the input tensors are changed to be of the same resolution, and once they are of the same size, it will combine them together. Remember fig-2? A Node can accept inputs from various other Nodes that might not necessarily be at the same level. Thus, there might be a need to do a resampling operation before we can combine the inputs, to make them be of the same size and have the same number of channels. Only then can we sum the inputs up.
> In the actual implementation, we do not actually sum the input tensors but rather do something called Fast normalized fusion, which has been described in section 3.3 of the paper. But it is completely okay if, for now, we assume that we combine the inputs by simply summing them up once they are of the same size.
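For reference, here is a minimal sketch of what fast normalized fusion computes for two same-shaped inputs, following section 3.3 of the paper and using the same epsilon as the repo (an illustration, not the repo's code):

```python
import torch
import torch.nn.functional as F

edge_weights = torch.nn.Parameter(torch.ones(2))  # one learnable weight per input edge

def fast_normalized_fusion(i0, i1):
    w = F.relu(edge_weights)  # keep the fusion weights non-negative
    return (i0 * w[0] + i1 * w[1]) / (w.sum() + 0.0001)
```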
Next, we still need to perform the "fusion". Simply combining the inputs might not be enough; we still need to do some more computation on top to get a good representation, i.e. do the actual "fusion" of the Node outputs. This "fusion" operation is a Depthwise Separable Convolution followed by a BatchNorm and an activation layer.
This has been mentioned in the paper at the end of section 3 as well:

> Notably, to further improve the efficiency, we use depthwise separable convolution for feature fusion, and add batch normalization and activation after each convolution.
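If you haven't come across one before, a depthwise separable convolution is a depthwise convolution (one filter per channel, via groups=channels) followed by a pointwise 1x1 convolution. A minimal sketch in plain PyTorch, with the activation choice assumed for illustration:

```python
import torch.nn as nn

def depthwise_separable_conv(channels):
    return nn.Sequential(
        # depthwise: groups=channels applies a separate 3x3 filter to each channel
        nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels, bias=False),
        # pointwise: a 1x1 conv mixes information across channels
        nn.Conv2d(channels, channels, kernel_size=1, bias=False),
        nn.BatchNorm2d(channels),
        nn.SiLU(inplace=True),  # activation assumed; the repo configures this via act_layer
    )
```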
Great, now that we have a general understanding of the combine and after_combine operations, let's implement them below.
2.1.1 The “combine” method - FpnCombine
Note: This is the most complicated part of the code. So, please, bear with me. Re-read this section multiple times if needed. :)
Let’s understand the general idea of this class before looking at the code.
Assuming that the combine operation is simply a sum
operation for now. As can be seen from the figure above, Node-5
accepts the inputs from Node-3
and Node-4
. Now these feature maps are of different sizes and have different number of channels so we simply can’t sum them up. The feature map size at Node-4
is [64, 4, 4]
whereas at Node-3
is [64, 8, 8]
. So to be able to combine at Node-5
, we will convert both feature maps to be of the size [64, 8, 8]
cause that’s what the size of feature map at Node-5
should be.
This class, merely does this operation. It will first resample
both feature maps to be of the same size as the required [64, 8, 8]
and then it will combine them together.
So, now that we have some idea of what we want to accomplish, let’s look at the code implementation.
```python
class FpnCombine(nn.Module):
    def __init__(self, feature_info, fpn_config, fpn_channels, inputs_offsets, target_reduction, pad_type='',
                 downsample=None, upsample=None, norm_layer=nn.BatchNorm2d, apply_resample_bn=False,
                 conv_after_downsample=False, redundant_bias=False, weight_method='attn'):
        super(FpnCombine, self).__init__()
        self.inputs_offsets = inputs_offsets
        self.weight_method = weight_method

        self.resample = nn.ModuleDict()
        for idx, offset in enumerate(inputs_offsets):
            in_channels = fpn_channels
            if offset < len(feature_info):
                in_channels = feature_info[offset]['num_chs']
                input_reduction = feature_info[offset]['reduction']
            else:
                node_idx = offset - len(feature_info)
                input_reduction = fpn_config.nodes[node_idx]['reduction']
            reduction_ratio = target_reduction / input_reduction
            self.resample[str(offset)] = ResampleFeatureMap(
                in_channels, fpn_channels, reduction_ratio=reduction_ratio, pad_type=pad_type,
                downsample=downsample, upsample=upsample, norm_layer=norm_layer, apply_bn=apply_resample_bn,
                conv_after_downsample=conv_after_downsample, redundant_bias=redundant_bias)

        if weight_method == 'attn' or weight_method == 'fastattn':
            self.edge_weights = nn.Parameter(torch.ones(len(inputs_offsets)), requires_grad=True)  # WSM
        else:
            self.edge_weights = None

    def forward(self, x: List[torch.Tensor]):
        dtype = x[0].dtype
        nodes = []
        for offset, resample in zip(self.inputs_offsets, self.resample.values()):
            input_node = x[offset]
            input_node = resample(input_node)
            nodes.append(input_node)

        if self.weight_method == 'attn':
            normalized_weights = torch.softmax(self.edge_weights.to(dtype=dtype), dim=0)
            out = torch.stack(nodes, dim=-1) * normalized_weights
        elif self.weight_method == 'fastattn':
            edge_weights = nn.functional.relu(self.edge_weights.to(dtype=dtype))
            weights_sum = torch.sum(edge_weights)
            out = torch.stack(
                [(nodes[i] * edge_weights[i]) / (weights_sum + 0.0001) for i in range(len(nodes))], dim=-1)
        elif self.weight_method == 'sum':
            out = torch.stack(nodes, dim=-1)
        else:
            raise ValueError('unknown weight_method {}'.format(self.weight_method))
        out = torch.sum(out, dim=-1)
        return out
```
There’s actually quite a bit happening in this layer but hey, don’t be worried. Take a deep breath and read on! This layer will make sense. :)
Something new that we have encountered in this class is feature_info. What is it? It's something that comes from timm. Do you remember that we are using the EfficientNet backbone? This backbone has something called feature_info, which we can see below.
Let’s not worry about how this
get_feature_info
function is actually implemented. But, let’s just assume there is this beatiful function that gives us the desired outputs.
```python
backbone = timm.create_model(
    config.backbone_name, features_only=True, out_indices=(2, 3, 4),
    pretrained=True, **config.backbone_args)
feature_info = get_feature_info(backbone)
print(feature_info, '\n')

>> [{'num_chs': 40, 'reduction': 8}, {'num_chs': 112, 'reduction': 16}, {'num_chs': 320, 'reduction': 32}]
```
So the feature_info is something that tells us how many channels there are at each reduction level. For example, the number of channels at level P5, or reduction 32, is 320, as shown in the feature_info dictionary. Note that this matches the number of channels shown in fig-2. Also note that this feature_info is actually missing levels P6 and P7, where the reduction is 64 and 128 respectively. Let's again assume there is some part of the code that updates this feature_info so it actually looks something like below for the first BiFpnLayer:

```
>> [{'num_chs': 40, 'reduction': 8}, {'num_chs': 112, 'reduction': 16}, {'num_chs': 320, 'reduction': 32}, {'num_chs': 64, 'reduction': 64}, {'num_chs': 64, 'reduction': 128}]
```
Great, now let’s dissect the __init__
method of this FpnCombine
class.
```python
    def __init__(self, feature_info, fpn_config, fpn_channels, inputs_offsets, target_reduction, pad_type='',
                 downsample=None, upsample=None, norm_layer=nn.BatchNorm2d, apply_resample_bn=False,
                 conv_after_downsample=False, redundant_bias=False, weight_method='attn'):
        super(FpnCombine, self).__init__()
        self.inputs_offsets = inputs_offsets
        self.weight_method = weight_method

        self.resample = nn.ModuleDict()
        for idx, offset in enumerate(inputs_offsets):
            in_channels = fpn_channels
            if offset < len(feature_info):
                in_channels = feature_info[offset]['num_chs']
                input_reduction = feature_info[offset]['reduction']
            else:
                node_idx = offset - len(feature_info)
                input_reduction = fpn_config.nodes[node_idx]['reduction']
            reduction_ratio = target_reduction / input_reduction
            self.resample[str(offset)] = ResampleFeatureMap(
                in_channels, fpn_channels, reduction_ratio=reduction_ratio, pad_type=pad_type,
                downsample=downsample, upsample=upsample, norm_layer=norm_layer, apply_bn=apply_resample_bn,
                conv_after_downsample=conv_after_downsample, redundant_bias=redundant_bias)

        if weight_method == 'attn' or weight_method == 'fastattn':
            self.edge_weights = nn.Parameter(torch.ones(len(inputs_offsets)), requires_grad=True)  # WSM
        else:
            self.edge_weights = None
```
As a general idea:

> This FpnCombine layer accepts a list of nodes as input nodes. Then it calculates some parameters which are then passed to ResampleFeatureMap to make sure that we resample/resize the feature maps from the input nodes such that we can combine them.
The class accepts feature_info, fpn_config, fpn_channels, inputs_offsets and target_reduction as required inputs. We will focus just on these. We already know the values of feature_info and fpn_config. Let me share them below once again for reference:
```
>> feature_info = [{'num_chs': 40, 'reduction': 8}, {'num_chs': 112, 'reduction': 16}, {'num_chs': 320, 'reduction': 32}, {'num_chs': 64, 'reduction': 64}, {'num_chs': 64, 'reduction': 128}]

>> fpn_config = {'nodes': [{'reduction': 64, 'inputs_offsets': [3, 4], 'weight_method': 'fastattn'}, {'reduction': 32, 'inputs_offsets': [2, 5], 'weight_method': 'fastattn'}, {'reduction': 16, 'inputs_offsets': [1, 6], 'weight_method': 'fastattn'}, {'reduction': 8, 'inputs_offsets': [0, 7], 'weight_method': 'fastattn'}, {'reduction': 16, 'inputs_offsets': [1, 7, 8], 'weight_method': 'fastattn'}, {'reduction': 32, 'inputs_offsets': [2, 6, 9], 'weight_method': 'fastattn'}, {'reduction': 64, 'inputs_offsets': [3, 5, 10], 'weight_method': 'fastattn'}, {'reduction': 128, 'inputs_offsets': [4, 11], 'weight_method': 'fastattn'}]}
```
Let’s just assume that we are currently creating an instance of this class for Node-5
as an example.
The inputs_offsets
would be a list of Node
id’s such as [3,4]
for Node-5
. This states that Node-5
accepts the outputs of Node-3
and Node-4
as inputs and has to combine them.
The only variables that we do not know the values of are fpn_channels and target_reduction. fpn_channels has a value of 64. Great, what about target_reduction? target_reduction just refers to the reduction value of the current Node for which we are creating this FpnCombine class. So, from the fpn_config we can see that the reduction for Node-5 is 64. Thus, target_reduction=64.
> Note that the value of target_reduction for Node-6 will be 32, for Node-7 it will be 16, and so on.
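As a hint for the exercise coming up, here is how the __init__ loop works out for Node-5 (inputs_offsets=[3, 4], target_reduction=64), using the updated feature_info shown earlier:

```python
feature_info = [
    {'num_chs': 40, 'reduction': 8}, {'num_chs': 112, 'reduction': 16},
    {'num_chs': 320, 'reduction': 32}, {'num_chs': 64, 'reduction': 64},
    {'num_chs': 64, 'reduction': 128}]

for offset in [3, 4]:
    # both offsets are < len(feature_info), so they index the (updated) backbone levels
    input_reduction = feature_info[offset]['reduction']
    print(offset, 64 / input_reduction)  # reduction_ratio = target_reduction / input_reduction
# 3 1.0  -> keep the spatial size as-is
# 4 0.5  -> upsample by a factor of 2
```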
I leave it to the reader to see how the self.resample inside FpnCombine is a ModuleDict of ResampleFeatureMaps that looks something like below for Node-5:

```
>> self.resample

ModuleDict(
  (3): ResampleFeatureMap: (inp_ch:64, out_ch:64, reduction:1.0)
  (4): ResampleFeatureMap: (inp_ch:64, out_ch:64, reduction:0.5)
)
```
For Node-6, this looks something like:

```
>> ModuleDict(
  (2): ResampleFeatureMap: (inp_ch:320, out_ch:64, reduction:1.0)
  (5): ResampleFeatureMap: (inp_ch:64, out_ch:64, reduction:0.5)
)
```
And finally for Node-7, as an example, this looks like:

```
>> ModuleDict(
  (1): ResampleFeatureMap: (inp_ch:112, out_ch:64, reduction:1.0)
  (6): ResampleFeatureMap: (inp_ch:64, out_ch:64, reduction:0.5)
)
```
> If you follow along with the for loop inside the __init__ method, what I have stated above will become pretty clear to you. It might be a good idea right now to take out a pen and paper and actually try to trace the values that get passed to ResampleFeatureMap for each input offset. If you don't get it, feel free to reach out to me and I'll share the solution. Contact details have been provided at the end of this blog post.
I hope that you’ve been able to trace the values of self.resample
for the various Node
s. Now that we have already looked at __init__
, the forward
method is pretty straightforward:
```python
    def forward(self, x: List[torch.Tensor]):
        dtype = x[0].dtype
        nodes = []
        for offset, resample in zip(self.inputs_offsets, self.resample.values()):
            input_node = x[offset]
            input_node = resample(input_node)
            nodes.append(input_node)

        if self.weight_method == 'attn':
            normalized_weights = torch.softmax(self.edge_weights.to(dtype=dtype), dim=0)
            out = torch.stack(nodes, dim=-1) * normalized_weights
        elif self.weight_method == 'fastattn':
            edge_weights = nn.functional.relu(self.edge_weights.to(dtype=dtype))
            weights_sum = torch.sum(edge_weights)
            out = torch.stack(
                [(nodes[i] * edge_weights[i]) / (weights_sum + 0.0001) for i in range(len(nodes))], dim=-1)
        elif self.weight_method == 'sum':
            out = torch.stack(nodes, dim=-1)
        else:
            raise ValueError('unknown weight_method {}'.format(self.weight_method))
        out = torch.sum(out, dim=-1)
        return out
```
There isn’t a lot happening. We just go over the inputs one by one, perform the required resample
operation as in self.resample
to make the input feature maps to be of the required size and finally do the “combination” inside the forward.
2.1.2 The after_combine method
The after_combine method is nothing but a **Depthwise Separable Convolution**, which we will look at as part of the BiFpnLayer implementation. But for completeness of FNode, I state it below too:
```python
conv_kwargs = dict(
    in_channels=fpn_channels, out_channels=fpn_channels, kernel_size=3, padding=pad_type,
    bias=False, norm_layer=norm_layer, act_layer=act_layer)
after_combine = nn.Sequential()
after_combine.add_module('conv', SeparableConv2d(**conv_kwargs))
```
3 BiFPN Layer Implementation
Finally, we are ready to look at the implementation of the BiFPN Layer.
> If you have understood the implementation of FpnCombine, and also the general idea so far, then you will find that the implementation of the BiFPN Layer is something that brings all the pieces together. It should feel intuitive rather than complex.
So, let’s have a look at it.
```python
class BiFpnLayer(nn.Module):
    def __init__(self, feature_info, fpn_config, fpn_channels, num_levels=5, pad_type='',
                 downsample=None, upsample=None, norm_layer=nn.BatchNorm2d, act_layer=_ACT_LAYER,
                 apply_resample_bn=False, conv_after_downsample=True, conv_bn_relu_pattern=False,
                 separable_conv=True, redundant_bias=False):
        super(BiFpnLayer, self).__init__()
        self.num_levels = num_levels
        self.conv_bn_relu_pattern = False

        self.feature_info = []
        self.fnode = nn.ModuleList()
        for i, fnode_cfg in enumerate(fpn_config.nodes):
            logging.debug('fnode {} : {}'.format(i, fnode_cfg))
            reduction = fnode_cfg['reduction']
            combine = FpnCombine(
                feature_info, fpn_config, fpn_channels, tuple(fnode_cfg['inputs_offsets']),
                target_reduction=reduction, pad_type=pad_type, downsample=downsample, upsample=upsample,
                norm_layer=norm_layer, apply_resample_bn=apply_resample_bn, conv_after_downsample=conv_after_downsample,
                redundant_bias=redundant_bias, weight_method=fnode_cfg['weight_method'])

            after_combine = nn.Sequential()
            conv_kwargs = dict(
                in_channels=fpn_channels, out_channels=fpn_channels, kernel_size=3, padding=pad_type,
                bias=False, norm_layer=norm_layer, act_layer=act_layer)
            if not conv_bn_relu_pattern:
                conv_kwargs['bias'] = redundant_bias
                conv_kwargs['act_layer'] = None
                after_combine.add_module('act', act_layer(inplace=True))
            after_combine.add_module(
                'conv', SeparableConv2d(**conv_kwargs) if separable_conv else ConvBnAct2d(**conv_kwargs))

            self.fnode.append(Fnode(combine=combine, after_combine=after_combine))
            self.feature_info.append(dict(num_chs=fpn_channels, reduction=reduction))

        self.feature_info = self.feature_info[-num_levels::]

    def forward(self, x: List[torch.Tensor]):
        for fn in self.fnode:
            x.append(fn(x))
        return x[-self.num_levels::]
```
We already know the values of the required parameters feature_info, fpn_config and fpn_channels. I share them below once again for reference:

```python
feature_info = [{'num_chs': 40, 'reduction': 8}, {'num_chs': 112, 'reduction': 16}, {'num_chs': 320, 'reduction': 32}, {'num_chs': 64, 'reduction': 64}, {'num_chs': 64, 'reduction': 128}]

fpn_config = {'nodes': [
    {'reduction': 64, 'inputs_offsets': [3, 4], 'weight_method': 'fastattn'},
    {'reduction': 32, 'inputs_offsets': [2, 5], 'weight_method': 'fastattn'},
    {'reduction': 16, 'inputs_offsets': [1, 6], 'weight_method': 'fastattn'},
    {'reduction': 8, 'inputs_offsets': [0, 7], 'weight_method': 'fastattn'},
    {'reduction': 16, 'inputs_offsets': [1, 7, 8], 'weight_method': 'fastattn'},
    {'reduction': 32, 'inputs_offsets': [2, 6, 9], 'weight_method': 'fastattn'},
    {'reduction': 64, 'inputs_offsets': [3, 5, 10], 'weight_method': 'fastattn'},
    {'reduction': 128, 'inputs_offsets': [4, 11], 'weight_method': 'fastattn'}]}

fpn_channels = 64
```
The BiFPN Layer implementation is the glue that binds the FNode, FpnCombine and ResampleFeatureMap classes together. Let's see how.
Inside the __init__ method, we iterate over the Nodes. For each node, combine is an instance of FpnCombine and after_combine is a SeparableConv2d nn.Module. That is, we create an FNode for each of the Nodes inside the fpn_config, with each FNode having its own combine and after_combine values.
In the forward method, the input x, which is a list of feature maps from levels P3-P7, is passed through each node and we append the outputs to the list. Finally, we return the last 5 outputs (feature maps), which are then passed on to another BiFpnLayer that does the same thing again.
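To make this concrete, here is the forward loop again with comments tracing how the list grows (the comments are mine; the logic is unchanged):

```python
def forward(self, x: List[torch.Tensor]):
    # x starts as the 5 feature maps [P3, P4, P5, P6, P7] -> node ids 0-4 in fig-2
    for fn in self.fnode:
        # each of the 8 FNodes picks its inputs from x via inputs_offsets
        # and appends one output, so len(x) grows from 5 to 13 (ids 0-12)
        x.append(fn(x))
    # the last 5 entries are the outputs of Nodes 8-12, one per level P3-P7
    return x[-self.num_levels::]
```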
If this sounds confusing, then please feel free to reach out to me and I will share the solution. But I really want you to think about what I have stated above first; it's better to spend some time reasoning it through than to just ask me for the explanation.
4 BiFPN
Having already understood how to implement the BiFPN Layer, let's now look at how to implement the BiFPN network. It's really a simple case of having multiple BiFPN layers inside a single BiFPN network.
```python
class BiFpn(nn.Module):

    def __init__(self, config, feature_info):
        super(BiFpn, self).__init__()
        self.num_levels = config.num_levels
        norm_layer = config.norm_layer or nn.BatchNorm2d
        if config.norm_kwargs:
            norm_layer = partial(norm_layer, **config.norm_kwargs)
        act_layer = get_act_layer(config.act_type) or _ACT_LAYER
        fpn_config = config.fpn_config or get_fpn_config(
            config.fpn_name, min_level=config.min_level, max_level=config.max_level)

        self.resample = nn.ModuleDict()
        for level in range(config.num_levels):
            if level < len(feature_info):
                in_chs = feature_info[level]['num_chs']
                reduction = feature_info[level]['reduction']
            else:
                # Adds a coarser level by downsampling the last feature map
                reduction_ratio = 2
                self.resample[str(level)] = ResampleFeatureMap(
                    in_channels=in_chs,
                    out_channels=config.fpn_channels,
                    pad_type=config.pad_type,
                    downsample=config.downsample_type,
                    upsample=config.upsample_type,
                    norm_layer=norm_layer,
                    reduction_ratio=reduction_ratio,
                    apply_bn=config.apply_resample_bn,
                    conv_after_downsample=config.conv_after_downsample,
                    redundant_bias=config.redundant_bias,
                )
                in_chs = config.fpn_channels
                reduction = int(reduction * reduction_ratio)
                feature_info.append(dict(num_chs=in_chs, reduction=reduction))

        self.cell = SequentialList()
        for rep in range(config.fpn_cell_repeats):
            logging.debug('building cell {}'.format(rep))
            fpn_layer = BiFpnLayer(
                feature_info=feature_info,
                fpn_config=fpn_config,
                fpn_channels=config.fpn_channels,
                num_levels=config.num_levels,
                pad_type=config.pad_type,
                downsample=config.downsample_type,
                upsample=config.upsample_type,
                norm_layer=norm_layer,
                act_layer=act_layer,
                separable_conv=config.separable_conv,
                apply_resample_bn=config.apply_resample_bn,
                conv_after_downsample=config.conv_after_downsample,
                conv_bn_relu_pattern=config.conv_bn_relu_pattern,
                redundant_bias=config.redundant_bias,
            )
            self.cell.add_module(str(rep), fpn_layer)
            feature_info = fpn_layer.feature_info

    def forward(self, x: List[torch.Tensor]):
        for resample in self.resample.values():
            x.append(resample(x[-1]))
        x = self.cell(x)
        return x
```
So let’s have a look at what is self.resample
inside here? As you will see in the implementation of the EfficientDet
Architecture in the next section, the backbone EfficientNet
only returns a total of 3 feature maps for levels P3-P5. We still need to calculate feature maps for levels P6 & P7. This is what the self.resample
layer does here.
As you can see in the forward method:

```python
for resample in self.resample.values():
    x.append(resample(x[-1]))
```
As I mentioned, x has a total of 3 feature maps representing levels P3-P5. We call each resample module on the last feature map and append 2 more feature maps to x, making the total length 5 and representing the feature maps for levels P3-P7.
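For a concrete picture, here is an illustrative shape trace for a 512x512 input with fpn_channels=64 (the shapes follow from the reductions discussed earlier; treat this as a sketch):

```python
# x = [P3: (1, 40, 64, 64), P4: (1, 112, 32, 32), P5: (1, 320, 16, 16)]
# self.resample['3'] downsamples x[-1] (P5) by 2 -> P6: (1, 64, 8, 8)
# self.resample['4'] downsamples x[-1] (now P6) by 2 -> P7: (1, 64, 4, 4)
# x now holds 5 feature maps, one per level P3-P7
```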
Finally, what is self.cell? It's simply a repetition of BiFPN Layers that represents the BiFPN network. Since self.cell is a SequentialList, each BiFpnLayer is called one by one. :)
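In case you are wondering what SequentialList is: it behaves like nn.Sequential, but its forward is typed for List[Tensor] -> List[Tensor], which TorchScript needs. A minimal sketch of such a class (my paraphrase of the idea, not necessarily the repo's exact code):

```python
from typing import List
import torch
import torch.nn as nn

class SequentialList(nn.Sequential):
    """ A Sequential whose forward passes a List[Tensor] through each child module. """
    def forward(self, x: List[torch.Tensor]) -> List[torch.Tensor]:
        for module in self:
            x = module(x)
        return x
```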
5 EfficientDet Architecture
This class is the main one - or is it? Considering we have all the pieces already, this class merely puts them all together.
```python
class EfficientDet(nn.Module):

    def __init__(self, config, pretrained_backbone=True, alternate_init=False):
        super(EfficientDet, self).__init__()
        self.config = config
        set_config_readonly(self.config)
        self.backbone = create_model(
            config.backbone_name, features_only=True, out_indices=(2, 3, 4),
            pretrained=pretrained_backbone, **config.backbone_args)
        feature_info = get_feature_info(self.backbone)
        self.fpn = BiFpn(self.config, feature_info)
        self.class_net = HeadNet(self.config, num_outputs=self.config.num_classes)
        self.box_net = HeadNet(self.config, num_outputs=4)

        for n, m in self.named_modules():
            if 'backbone' not in n:
                if alternate_init:
                    _init_weight_alt(m, n)
                else:
                    _init_weight(m, n)

    def forward(self, x):
        x = self.backbone(x)
        x = self.fpn(x)
        x_class = self.class_net(x)
        x_box = self.box_net(x)
        return x_class, x_box
```
The reader should by now be able to read this code and understand what's going on; we have already done the bulk of the work together. Still, there are some notable things that I would like to mention:
1. self.backbone returns a total of 3 feature maps because we passed out_indices=(2, 3, 4). As to how? This comes from timm.
2. I have not explained what a HeadNet is. It is basically a custom head that takes the final outputs of the BiFPN network and returns either class predictions or bounding box coordinates. The self.class_net and self.box_net together represent the Class and Box Prediction Nets as in fig-1.
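To give a rough sense of the outputs: for a (1, 3, 512, 512) input with the default efficientdet_d0 config, x_class and x_box should each be a list of 5 tensors, one per level P3-P7. Assuming the default 9 anchors per location and 90 classes (my reading of the default config, so treat these shapes as a sketch):

```python
# x_class[0].shape -> (1, 9 * 90, 64, 64)  # level P3: num_anchors * num_classes channels
# x_box[0].shape   -> (1, 9 * 4, 64, 64)   # level P3: num_anchors * 4 box offsets
```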
6 Conclusion
I truly believe that it wasn't easy to understand how to implement EfficientDets in TensorFlow or PyTorch. The official implementation exists here and the PyTorch version is here.
Thanks to Ross Wightman for his wonderful work in providing us with a PyTorch implementation of the EfficientDet network. It really makes things super easy for native PyTorch users.
Something I have realised during this exercise is that when we are doing something like this:

```python
from effdet import get_efficientdet_config, EfficientDet

config = get_efficientdet_config('efficientdet_d0')
efficientdet = EfficientDet(config)
output = efficientdet(inputs)
```
- that is, when we are merely using the library - it is hard to appreciate the hard work that the author has put into making things so simple for us. In doing this exercise of going through the source code and trying to explain it to everyone, I have noticed the minor details and have gained tremendous appreciation for Ross's hard work. So, thanks so much, Ross! And of course, thanks to the authors of EfficientDet for open-sourcing the implementation in TensorFlow.
I hope that in today’s blog post I have been able to explain how to implement EfficientDet
s in Code and take away all the confusion and doubt in case you had any.
As always, constructive feedback is always welcome at @amaarora.
Also, feel free to subscribe to my blog here to receive regular updates regarding new blog posts. Thanks for reading!