mv aws-security-group shoot-foot
How can a seemingly harmless change, renaming an AWS security group through Terraform, lead to unplanned downtime?

Scenario:

  1. Create an EC2 instance (or any other resource which uses security groups).
  2. Associate one or more security groups to the instance.
  3. Rename the security group.

The above infra can be created using rename-security-group/v1/main.tf
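For context, here is a minimal sketch of what that configuration contains. This is not the repo's exact code: the resource names are inferred from the outputs below, the AMI id is a placeholder, and a second group, test-2, would be defined the same way.

```hcl
provider "aws" {
  region = "ap-south-1"
}

resource "aws_security_group" "test_1" {
  name        = "test-1"
  description = "Managed by Terraform"
}

# Rules are managed as separate resources, referencing the group by id.
resource "aws_security_group_rule" "sg_1_rule_1" {
  type              = "ingress"
  from_port         = 80
  to_port           = 80
  protocol          = "tcp"
  cidr_blocks       = ["0.0.0.0/0"]
  security_group_id = aws_security_group.test_1.id
}

# The instance references the group by id as well -- this reference is
# what makes the rename interesting.
resource "aws_instance" "test_1" {
  ami                    = "ami-xxxxxxxx" # placeholder AMI id
  instance_type          = "t2.micro"
  vpc_security_group_ids = [aws_security_group.test_1.id]
}
```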

$ terraform init
$ terraform apply

Sample Terraform output:

aws_instance_test_1 = i-0db7a4160425910bf
security_group_test_1 = sg-0822ccfe609ecd0e2
security_group_test_2 = sg-0c09e69ffcf9e1acf

Suppose we want to rename security group test-1 to test-1-new-name. We go ahead and change the block

resource "aws_security_group" "test_1" {
  name = "test-1"
}

to

resource "aws_security_group" "test_1" {
  name = "test-1-new-name"
}

and run terraform apply.

Ideally, this should show one resource to change: the renamed security group. However, the plan that terraform apply prints looks like the following (some lines elided to save space):

  # aws_instance.test_1 will be updated in-place
  ~ resource "aws_instance" "test_1" {
       .
       .
      ~ vpc_security_group_ids       = [
          - "sg-0822ccfe609ecd0e2",
          - "sg-0c09e69ffcf9e1acf",
        ] -> (known after apply)
        .
        .
    }

   # aws_security_group.test_1 must be replaced

-/+ resource "aws_security_group" "test_1" {
      ~ arn                    = "arn:aws:ec2:ap-south-1:xxxxxx:security-group/sg-0822ccfe609ecd0e2" -> (known after apply)
        description            = "Managed by Terraform"
      ~ egress                 = [] -> (known after apply)
      ~ id                     = "sg-0822ccfe609ecd0e2" -> (known after apply)
      ~ ingress                = [
          - {
              - cidr_blocks      = [
                  - "0.0.0.0/0",
                ]
              - description      = ""
              - from_port        = 443
              - ipv6_cidr_blocks = []
              - prefix_list_ids  = []
              - protocol         = "tcp"
              - security_groups  = []
              - self             = false
              - to_port          = 443
            },
            .
            .

        ] -> (known after apply)

      ~ name                   = "test-1" -> "test-1-new-name" # forces replacement
      + name_prefix            = (known after apply)
      ~ owner_id               = "xxxxxx" -> (known after apply)
        revoke_rules_on_delete = false
      - tags                   = {} -> null
      ~ vpc_id                 = "vpc-xxxxxxx" -> (known after apply)

    }

  # aws_security_group_rule.sg_1_rule_1 must be replaced

-/+ resource "aws_security_group_rule" "sg_1_rule_1" {
        cidr_blocks              = [
            "0.0.0.0/0",
        ]
        from_port                = 80
        to_port                  = 80
      ~ id                       = "sgrule-1889571072" -> (known after apply)
      ~ security_group_id        = "sg-0822ccfe609ecd0e2" -> (known after apply) # forces replacement
      + source_security_group_id = (known after apply)
      .
    }

  # aws_security_group_rule.sg_1_rule_2 must be replaced

-/+ resource "aws_security_group_rule" "sg_1_rule_2" {
        cidr_blocks              = [
            "0.0.0.0/0",
        ]
        from_port                = 443
        to_port                  = 443
      ~ id                       = "sgrule-1889571072" -> (known after apply)
      ~ security_group_id        = "sg-0822ccfe609ecd0e2" -> (known after apply) # forces replacement
      + source_security_group_id = (known after apply)
        .
        .
    }

    Plan: 3 to add, 1 to change, 3 to destroy.

Red flag 1: Instead of renaming the security group, the resource is being replaced. It turns out you cannot rename an AWS security group; you can only add, edit, or remove its rules and tags. Renaming therefore means deleting the old group and creating a new one. A new security group gets a new security group id, which in turn forces updates to every entity referencing the current group. Hence the following change is shown for the EC2 instance:

  # aws_instance.test_1 will be updated in-place
  ~ resource "aws_instance" "test_1" {
       .
       .
      ~ vpc_security_group_ids       = [
          - "sg-0822ccfe609ecd0e2",
          - "sg-0c09e69ffcf9e1acf",
        ] -> (known after apply)
        .
        .
    }

So, instead of changing the security group name, a new group will get created and referenced. No big deal, right? You type yes and go ahead with the change.

Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value: yes

This leads to the following output:

aws_security_group_rule.sg_1_rule_2: Destroying... [id=sgrule-556511616]
aws_security_group_rule.sg_1_rule_1: Destroying... [id=sgrule-1889571072]
aws_security_group_rule.sg_1_rule_1: Destruction complete after 0s
aws_security_group_rule.sg_1_rule_2: Destruction complete after 0s

aws_security_group.test_1: Destroying... [id=sg-0822ccfe609ecd0e2]
aws_security_group.test_1: Still destroying... [id=sg-0822ccfe609ecd0e2, 10s elapsed]
.
.
aws_security_group.test_1: Still destroying... [id=sg-0822ccfe609ecd0e2, 1m0s elapsed]
.
aws_security_group.test_1: Still destroying... [id=sg-0822ccfe609ecd0e2, 2m0s elapsed]
.
aws_security_group.test_1: Still destroying... [id=sg-0822ccfe609ecd0e2, 7m0s elapsed]

Red flag 2: Why is Terraform taking so long to destroy the security group? Terraform is trying to:

  1. Destroy the old security group.
  2. Create a security group with the new name.
  3. Associate the new security group with the instance.

It seems we are stuck at step 1. Terraform will not destroy the old security group until it has been disassociated from the EC2 instance, and it cannot disassociate it from the instance until it has a new security group to put in its place. This means that instead of the above series of steps, we need something that does:

  1. Create a security group with the new name.
  2. Destroy the old security group.
  3. Associate the new security group with the instance.

This can be achieved by using the create_before_destroy lifecycle rule

lifecycle {
  create_before_destroy = true
}

in the aws_security_group and aws_security_group_rule blocks.
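In context, the change looks roughly like this (a sketch reusing the resource names from above; the repo's code is the authoritative version):

```hcl
resource "aws_security_group" "test_1" {
  name        = "test-1-new-name"
  description = "Managed by Terraform"

  # Build the replacement group first, let referrers switch over to it,
  # then delete the old group -- breaking the deadlock described above.
  lifecycle {
    create_before_destroy = true
  }
}
```

A closely related pattern is to use name_prefix instead of name together with create_before_destroy: Terraform then generates a unique name for each replacement, so the new group can never collide with an existing group of the same name.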

However, what about the current Terraform run that is hung in this deadlock?

The only way to fix this is to do the disassociation manually: find the entities that reference the security group and remove the group from them, so that Terraform is unblocked and can go ahead with the rest of the steps.

Use the following command to find the network interfaces associated with this security group:

aws ec2 describe-network-interfaces --filters Name=group-id,Values=sg-0822ccfe609ecd0e2 --region ap-south-1 --output json --query 'NetworkInterfaces[*].[NetworkInterfaceId,Description,PrivateIpAddress,VpcId]'

This shows

[
    [
        "eni-0c2e49517cb12fbdd",
        "",
        "172.31.2.242",
        "vpc-xxxxx"
    ]
]

which is the network interface of our EC2 instance. Editing the instance in the console and removing the security group lets the stuck Terraform run complete:

aws_security_group.test_1: Still destroying... [id=sg-0822ccfe609ecd0e2, 7m0s elapsed]
aws_security_group.test_1: Destruction complete after 8m46s
aws_security_group.test_1: Creating...
aws_security_group.test_1: Creation complete after 1s [id=sg-0d4e8f63b345acbd3]
aws_security_group_rule.sg_1_rule_2: Creating...
aws_security_group_rule.sg_1_rule_1: Creating...
aws_instance.test_1: Modifying... [id=i-0db7a4160425910bf]
aws_security_group_rule.sg_1_rule_1: Creation complete after 1s [id=sgrule-2655916068]
aws_security_group_rule.sg_1_rule_2: Creation complete after 1s [id=sgrule-1249643902]
aws_instance.test_1: Modifications complete after 2s [id=i-0db7a4160425910bf]

Apply complete! Resources: 3 added, 1 changed, 3 destroyed.

The

lifecycle {
  create_before_destroy = true
}

rule is a lifesaver in this case. You can try renaming security group test-2 the right way using the updated code in rename-security-group-post-code/v2/main.tf and validate that, with create_before_destroy, replacing the security group is done in a matter of seconds:

aws_security_group_rule.sg_2_rule_1: Destroying... [id=sgrule-667181795]
aws_security_group_rule.sg_2_rule_1: Destruction complete after 0s
aws_security_group_rule.sg_2_rule_1: Creating...
aws_security_group_rule.sg_2_rule_1: Creation complete after 1s [id=sgrule-2721852789]
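For reference, the relevant block in the updated v2 code would look something like this (a sketch with an assumed new name; see rename-security-group-post-code/v2/main.tf for the actual code):

```hcl
resource "aws_security_group" "test_2" {
  name        = "test-2-new-name" # the rename that previously deadlocked
  description = "Managed by Terraform"

  lifecycle {
    create_before_destroy = true # new group exists before the old one is destroyed
  }
}
```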

I learnt the hard way that this is a known issue. But I am sure many others will run into this scenario, since the need to rename security groups is rare and rarely accounted for when creating them.

That's about it. In the next post, we will see how create_before_destroy, which solved the problem of renaming a security group here, can itself cause problems when used in the wrong place.
