If it ain't broke...
A Terraform lifecycle rule in the right place can help prevent a deadlock. But the same lifecycle rule in the wrong place?

In the previous post, we saw how the Terraform lifecycle rule create_before_destroy can help prevent a deadlock when renaming security groups. In this post, we will see how using the same lifecycle rule in the wrong place will create a problem.

To recap, when renaming a security group, you need to replace

resource "aws_security_group" "test_1" {
  name = "test-1-new-name"
}

with

resource "aws_security_group" "test_1" {
  name = "test-1-new-name"
  lifecycle {
    create_before_destroy = true
  }
}

This ensures the following series of steps:

  1. Create a security group with the new name.
  2. Associate the new security group with the instance.
  3. Destroy the old security group, which is no longer attached to anything.

This made me think: using a lifecycle rule seems like a good practice, so let me use it for the aws_security_group_rule resource as well. That was a presumptuous mistake. Let us see why.

We will replicate the same infrastructure setup scenario:

  1. Create an EC2 instance (or any other resource which uses security groups).
  2. Associate one or more security groups to the instance.
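
For orientation, here is a minimal sketch of what such a setup might look like. It is not the actual file from the repository: the variable ami_id, the instance type, and the group names "test-1" and "test-2" are assumptions made for illustration, and it presumes an AWS provider that is already configured with a default VPC. The output names match the sample output shown further down.

```hcl
# Hypothetical sketch; the real v1/main.tf may differ.
variable "ami_id" {
  description = "AMI to launch (placeholder, supply one valid in your region)"
  type        = string
}

# Two security groups in the default VPC (names are assumed).
resource "aws_security_group" "test_1" {
  name = "test-1"
}

resource "aws_security_group" "test_2" {
  name = "test-2"
}

# The instance that both security groups are attached to.
resource "aws_instance" "test_1" {
  ami           = var.ami_id
  instance_type = "t3.micro" # assumed size

  vpc_security_group_ids = [
    aws_security_group.test_1.id,
    aws_security_group.test_2.id,
  ]
}

output "aws_instance_test_1" {
  value = aws_instance.test_1.id
}

output "security_group_test_1" {
  value = aws_security_group.test_1.id
}

output "security_group_test_2" {
  value = aws_security_group.test_2.id
}
```

The ingress rule on test_2 that this post is about is shown in full below.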

The above infra can be created with update-security-group-rule/v1/main.tf:

$ terraform init
$ terraform apply

Sample Terraform output:

aws_instance_test_1 = i-02d50e0a62110bbc6
security_group_test_1 = sg-03cc308342b10ebe5
security_group_test_2 = sg-03c1cbe2eb0ace857

However, for the next step, instead of renaming the security group, we will add one more entry to cidr_blocks in our aws_security_group_rule, i.e. we will update

resource "aws_security_group_rule" "sg_2_rule_1" {
  from_port         = 8080
  protocol          = "tcp"
  to_port           = 8080
  security_group_id = aws_security_group.test_2.id

  cidr_blocks = ["0.0.0.0/0"] # this line will be changed

  lifecycle {
    create_before_destroy = true
  }
  type = "ingress"
}

to

resource "aws_security_group_rule" "sg_2_rule_1" {
  from_port         = 8080
  protocol          = "tcp"
  to_port           = 8080
  security_group_id = aws_security_group.test_2.id

  cidr_blocks = ["0.0.0.0/0", "1.1.1.1/32"]

  lifecycle {
    create_before_destroy = true
  }
  type = "ingress"
}

Applying this change fails:

$ terraform apply
# aws_security_group_rule.sg_2_rule_1 must be replaced
+/- resource "aws_security_group_rule" "sg_2_rule_1" {
      ~ cidr_blocks              = [ # forces replacement
            "0.0.0.0/0",
          + "1.1.1.1/32",
        ]
        from_port                = 8080
      ~ id                       = "sgrule-1489633736" -> (known after apply)
         .
         .
   }
aws_security_group_rule.sg_2_rule_1: Creating...

Error: [WARN] A duplicate Security Group rule was found on (sg-03c1cbe2eb0ace857). This may be a side effect of a now-fixed Terraform issue causing two security groups with identical attributes but different source_security_group_ids to overwrite each other in the state. See https://github.com/hashicorp/terraform/pull/2376 for more information and instructions for recovery.
Error message: the specified rule "peer: 0.0.0.0/0, TCP, from port: 8080, to port: 8080, ALLOW" already exists

What happened?

1. Initially, the security group had the following rule associated with it:

direction | from_port | to_port | source     | rule
ingress   | 8080      | 8080    | 0.0.0.0/0  | allow

2. We tried creating a new rule which has the following entries:

direction | from_port | to_port | source     | rule
ingress   | 8080      | 8080    | 0.0.0.0/0  | allow
ingress   | 8080      | 8080    | 1.1.1.1/32 | allow

3. Because of the lifecycle rule create_before_destroy, Terraform tries to create the step-2 rule before destroying the step-1 rule. The new rule includes the entry

direction | from_port | to_port | source     | rule
ingress   | 8080      | 8080    | 0.0.0.0/0  | allow

which already exists in the security group through the old rule. A security group cannot hold two identical entries (try creating a duplicate entry in the AWS console), so the create call fails with the error

Error message: the specified rule "peer: 0.0.0.0/0, TCP, from port: 8080, to port: 8080, ALLOW" already exists

This can be fixed by, you guessed it, removing the lifecycle block from the aws_security_group_rule resource, as in update-security-group-rule/v2/main.tf. Terraform then destroys the old rule before creating the new one:

aws_security_group_rule.sg_2_rule_1: Destroying... [id=sgrule-1489633736]
aws_security_group_rule.sg_2_rule_1: Destruction complete after 0s
aws_security_group_rule.sg_2_rule_1: Creating...
aws_security_group_rule.sg_2_rule_1: Creation complete after 1s [id=sgrule-2162410043]
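
For reference, the rule after the fix would look roughly like the block below. This is a sketch that assumes v2/main.tf differs from the earlier block only in dropping the lifecycle block; the actual file may differ in other details.

```hcl
resource "aws_security_group_rule" "sg_2_rule_1" {
  type              = "ingress"
  from_port         = 8080
  to_port           = 8080
  protocol          = "tcp"
  security_group_id = aws_security_group.test_2.id

  cidr_blocks = ["0.0.0.0/0", "1.1.1.1/32"]

  # No lifecycle block here: without create_before_destroy, Terraform
  # uses its default replacement order (destroy the old rule, then
  # create the new one), so the 0.0.0.0/0 entry never exists twice.
}
```

The trade-off is a short window between the destroy and the create in which the ingress rule does not exist, about a second in the run above.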

Lessons learnt:

  1. The create_before_destroy lifecycle rule is useful in the aws_security_group block, but harmful in the aws_security_group_rule block.
  2. If it ain't broke, don't fix it. Again.