AWS security groups: canned answers and exploratory questions
While using a Terraform lifecycle rule, what do you do when you get a canned response from a security group?

In the last two posts, we saw how to use the Terraform lifecycle rule create_before_destroy the right way and the wrong way. Both posts used updating AWS security groups as the reference task.

One interesting command that we used in the first post was

aws ec2 describe-network-interfaces --filters Name=group-id,Values=sg-0822ccfe609ecd0e2 --region ap-south-1 --output json --query 'NetworkInterfaces[*].[NetworkInterfaceId,Description,PrivateIpAddress,VpcId]'

The above command:

  1. Queries EC2 network interfaces.
  2. Filters the ones associated with a specific security group.
  3. Lists the network interface ID, description, private IP and VPC ID of each match.

Sample output

[
    [
        "eni-xxxxx",
        "",
        "10.9.9.223",
        "vpc-xxxxx"
    ]
]

But for some security groups, the above output will be empty. This means that the security group is not associated with any network interface, which implies that the security group:

  1. Was created but not attached to any entity (EC2, RDS, etc.)
  2. Was created and attached to entities, but was not cleaned up after those entities were destroyed.
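
The check above can be repeated for every group at once. Here is a minimal sketch, assuming you have already fetched the SecurityGroups list (from describe-security-groups) and the NetworkInterfaces list (from describe-network-interfaces) for one region. The function name and the sample IDs are made up for illustration. Note that "unused" here only means "not attached to any network interface"; such a group may still be referenced by another group's rules.

```python
def unused_security_groups(security_groups, network_interfaces):
    """Return the IDs of groups that no network interface references."""
    # Every network interface lists its groups under the "Groups" key.
    attached = {
        group["GroupId"]
        for eni in network_interfaces
        for group in eni.get("Groups", [])
    }
    return sorted(
        sg["GroupId"] for sg in security_groups if sg["GroupId"] not in attached
    )


# Hand-written sample data shaped like the AWS responses (IDs are made up).
security_groups = [
    {"GroupId": "sg-aaa", "GroupName": "web"},
    {"GroupId": "sg-bbb", "GroupName": "launch-wizard-1"},
]
network_interfaces = [
    {
        "NetworkInterfaceId": "eni-111",
        "Groups": [{"GroupId": "sg-aaa", "GroupName": "web"}],
    },
]

print(unused_security_groups(security_groups, network_interfaces))
# prints ['sg-bbb']
```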

It is worthwhile to answer the same question for all of your security groups. A quick Google search for “find unused security groups” turns up different ways to find unused security groups from the console, as well as one-off custom scripts that identify them. You run the script, find the unused groups and delete them. All is well in your world.

Except that you have to go through the same process manually again in a few weeks. To fix this, search for “delete unused security groups automatically” and you will see solutions using AWS Config, AWS Lambda and so on.

The ease with which one can find canned answers to common problems is both a gift and a curse. A gift because it takes less time to solve that specific problem. A curse because one doesn’t ponder long enough on the solution to know if there are related problems that can be pre-empted while solving this one.

This post will not attempt to showcase Yet Another AWS Security Group Deletion Script (commonly known as YAASGDS amongst the cool kids). It will try to use that scenario as a stepping stone to ask some interesting questions.

Let’s step back and understand what we gained by deleting unused security groups.

  1. We cleaned up after ourselves.

However, the cleanup should also involve asking the following questions:

  1. Do I need to clean up other regions/accounts also?
  2. What about security groups that are in use but too open, i.e. they allow access from 0.0.0.0/0 as the source IP?
  3. The open-to-the-world rule might be OK for HTTP/HTTPS (ports 80/443) but not for SSH (port 22). How do I filter these?
  4. What about security groups that are not open to the world, but open to unwanted IPs, i.e. IPs other than your company’s network, VPN, etc.? Remember that time when you added a security group rule for port 22 for your ISP's public IP but forgot to delete it?
  5. There might be security groups referencing other security groups – how do I find the ones that have the max reference count so that I can prioritize their review?
  6. Are my security groups using any ports other than 80, 443, 22? (Useful to filter these out and review other security groups e.g. database access for 3306 - MySQL or 5432 - PostgreSQL)
  7. Are there security group rules present for the AWS default VPC? This implies that someone has spun up infra in the default VPC. They’d better have a good reason for doing so.
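
To make one of these concrete, question 5 (which groups are referenced the most?) could be sketched as below. This assumes the groups are dicts shaped like the describe-security-groups response, where each ingress rule lists its source groups under UserIdGroupPairs. The function name and sample IDs are hypothetical, and only ingress rules (IpPermissions) are counted.

```python
from collections import Counter


def reference_counts(security_groups):
    """Count how many other groups name each group as an ingress source."""
    counts = Counter()
    for sg in security_groups:
        referenced = {
            pair["GroupId"]
            for perm in sg.get("IpPermissions", [])
            for pair in perm.get("UserIdGroupPairs", [])
            if "GroupId" in pair
        }
        referenced.discard(sg["GroupId"])  # a self-reference is not a dependency
        counts.update(referenced)
    return counts


# Hand-written sample data (IDs are made up).
groups = [
    {"GroupId": "sg-lb", "IpPermissions": []},
    {"GroupId": "sg-app", "IpPermissions": [
        {"FromPort": 8080, "ToPort": 8080,
         "UserIdGroupPairs": [{"GroupId": "sg-lb"}]},
    ]},
    {"GroupId": "sg-db", "IpPermissions": [
        {"FromPort": 5432, "ToPort": 5432,
         "UserIdGroupPairs": [{"GroupId": "sg-app"}, {"GroupId": "sg-db"}]},
    ]},
    {"GroupId": "sg-cache", "IpPermissions": [
        {"FromPort": 6379, "ToPort": 6379,
         "UserIdGroupPairs": [{"GroupId": "sg-app"}]},
    ]},
]

# Most-referenced groups first: these are the ones to review before the rest.
print(reference_counts(groups).most_common())
# prints [('sg-app', 2), ('sg-lb', 1)]
```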

We have now shifted focus from cleanup to security. Each of the above questions can be answered by a custom script. But more questions might come and one script per question might be hard to maintain. We need to:

  • Create a solution that answers the above questions AND lays the foundation for asking more questions.

To put it another way – for extracting data from a database, would you rather hunt for scripts that are thinly veiled wrappers over SQL or write raw SQL and get better at exploring data?

I found myself in this position and wrote scan_security_groups.py. This script:

  1. Scans one or more AWS regions for security groups.
  2. Finds their associated network interfaces.
  3. Finds their from-to ports.
  4. Finds the security group rule type (references IP addresses, other security groups, etc.)
  5. Categorizes source IPs (AWS public IPs, default VPC private IPs, your organization's public IPs explicitly passed via a CLI input, the public range 0.0.0.0/0, etc.)
  6. Dumps the output in a CSV format.

The last step is the most important one. Instead of answering the questions directly, we are creating a data structure that allows us to explore these scenarios.
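
The idea of "one row per rule source" can be sketched as follows. This is not the actual scan_security_groups.py, only an illustration under stated assumptions: the column names are borrowed from the queries in this post, the input dict is shaped like a describe-security-groups entry, and the status labels other than public-0.0.0.0, as well as the security_group source type, are hypothetical.

```python
import csv
import io
import ipaddress

# Column names as used by the queries in this post.
COLUMNS = ["Region", "VpcId", "GroupId", "GroupName", "GroupAssociationsCount",
           "RuleSourceType", "RuleSource", "RuleFromPort", "RuleToPort",
           "RuleIpRangeStatus"]


def classify_cidr(cidr, org_networks):
    """Label an IPv4 CIDR (labels other than public-0.0.0.0 are hypothetical)."""
    if cidr == "0.0.0.0/0":
        return "public-0.0.0.0"
    network = ipaddress.ip_network(cidr, strict=False)
    for org in org_networks:
        if org.version == network.version and network.subnet_of(org):
            return "org"
    return "private" if network.is_private else "public-other"


def flatten(region, group, association_count, org_networks):
    """Yield one CSV row per (rule, source) pair of a security group."""
    for perm in group.get("IpPermissions", []):
        base = {
            "Region": region,
            "VpcId": group.get("VpcId", ""),
            "GroupId": group["GroupId"],
            "GroupName": group.get("GroupName", ""),
            "GroupAssociationsCount": association_count,
            "RuleFromPort": perm.get("FromPort", ""),
            "RuleToPort": perm.get("ToPort", ""),
        }
        for ip_range in perm.get("IpRanges", []):
            cidr = ip_range["CidrIp"]
            yield {**base, "RuleSourceType": "ip_address", "RuleSource": cidr,
                   "RuleIpRangeStatus": classify_cidr(cidr, org_networks)}
        for pair in perm.get("UserIdGroupPairs", []):
            yield {**base, "RuleSourceType": "security_group",
                   "RuleSource": pair.get("GroupId", ""),
                   "RuleIpRangeStatus": ""}


# Hand-written sample data (IDs and addresses are made up).
group = {
    "GroupId": "sg-aaa", "GroupName": "web", "VpcId": "vpc-123",
    "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}, {"CidrIp": "198.51.100.7/32"}],
         "UserIdGroupPairs": []},
        {"IpProtocol": "tcp", "FromPort": 5432, "ToPort": 5432,
         "IpRanges": [{"CidrIp": "10.0.0.0/16"}],
         "UserIdGroupPairs": [{"GroupId": "sg-bbb"}]},
    ],
}
org_networks = [ipaddress.ip_network("198.51.100.0/24")]

rows = list(flatten("ap-south-1", group, 1, org_networks))
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

Keeping the rows flat means every new question becomes a new query over the same file, not a new script.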

Run

python scan_security_groups.py --org-ip-addr-file ./ip-addrs/org.json | tee /tmp/output.csv

and unleash the power of q (as seen before here)

Find security groups with SSH port 22 open to the world

q -H -d ',' -O "select Region as region, VpcId as vpc, GroupId as id, GroupName as name, GroupAssociationsCount as grpcount, RuleSource as src, RuleFromPort as from_port, RuleToPort as to_port, RuleIpRangeStatus as status
from /tmp/output.csv 
where 
RuleSourceType == 'ip_address' and 
GroupAssociationsCount > 0 and 
RuleFromPort in (22) and RuleIpRangeStatus in ('public-0.0.0.0')" | \
ROW_TEXTWRAP_LEN=50 TABLE_HAS_HEADER=1 csv2table

The where clause reads as follows:

RuleSourceType == 'ip_address'                # Find ip_address type sg
and GroupAssociationsCount > 0                # which is in use
and RuleFromPort in (22)                      # allows SSH access
and RuleIpRangeStatus in ('public-0.0.0.0')   # and is open to the world
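
One caveat worth keeping in mind: security group rules describe port ranges, so a rule can expose port 22 without its from-port being 22 (for example, a rule covering ports 0 to 65535). A minimal sketch of a range check follows. It assumes RuleFromPort and RuleToPort hold the bounds of the range, and that a missing bound means "all ports", as is the case for rules with IpProtocol "-1". The function name is hypothetical, and ICMP rules (where these fields carry type and code) are ignored.

```python
def rule_covers_port(from_port, to_port, port):
    """Return True when `port` falls inside the rule's from/to range.

    Assumption: a missing bound (None or an empty CSV cell) means the rule
    applies to all ports.
    """
    if from_port in (None, "") or to_port in (None, ""):
        return True
    return int(from_port) <= port <= int(to_port)


print(rule_covers_port(22, 22, 22))       # prints True
print(rule_covers_port(0, 65535, 22))     # prints True
print(rule_covers_port(80, 443, 22))      # prints False
```

In a q query, the same idea would be expressed with a condition such as RuleFromPort <= 22 and RuleToPort >= 22, assuming both columns are populated with numbers.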

Find security groups associated with ports other than ports 80, 443, 22.

q -H -d ',' -O "select Region as region, VpcId as vpc, GroupId as id, GroupName as name, GroupAssociationsCount as grpcount, RuleSource as src, RuleFromPort as from_port, RuleToPort as to_port, RuleIpRangeStatus as status
from /tmp/output.csv where 
RuleSourceType == 'ip_address' and
GroupAssociationsCount > 0 and
RuleFromPort not in (80, 443, 22)" | ROW_TEXTWRAP_LEN=50 TABLE_HAS_HEADER=1 csv2table

Sample output screenshot

Check out more queries that answer the above questions here.

Creating a queryable data structure also opens up possibilities for future questions, e.g. finding security groups with throwaway names (those using the launch-wizard prefix) because someone never bothered to name them correctly.
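
If q is not at hand, the same kind of exploration can be sketched with Python's standard library, since q itself runs SQLite-style SQL over the CSV. The example below looks for launch-wizard names. The table name sg_rules, the helper load_csv and the sample rows are made up for illustration; a real run would read /tmp/output.csv.

```python
import csv
import io
import sqlite3

# Hand-written sample with a subset of the columns (values are made up).
SAMPLE_CSV = """Region,GroupId,GroupName,GroupAssociationsCount
ap-south-1,sg-aaa,web,2
ap-south-1,sg-bbb,launch-wizard-1,1
us-east-1,sg-ccc,launch-wizard-12,0
"""


def load_csv(conn, table, text):
    """Create `table` from CSV text, using the header row as column names."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f"create table {table} ({columns})")
    conn.executemany(f"insert into {table} values ({placeholders})", data)


conn = sqlite3.connect(":memory:")
load_csv(conn, "sg_rules", SAMPLE_CSV)

wizard_groups = conn.execute(
    "select distinct Region, GroupId, GroupName from sg_rules "
    "where GroupName like 'launch-wizard%' order by GroupId"
).fetchall()
print(wizard_groups)
# prints [('ap-south-1', 'sg-bbb', 'launch-wizard-1'), ('us-east-1', 'sg-ccc', 'launch-wizard-12')]
```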

Although I had to write a “custom” script, I did so to answer more than one question. This made me want to write such tooling for other AWS components like RDS to catch scenarios like – How many RDS databases haven’t had any connections over the last X days?

But writing one CSV-generating script per AWS component is not scalable. What the world needs is an osquery-like model, but for AWS. Also, most of the security-related questions require a compliance policy-based approach, where tooling like cloud-custodian comes in handy. Take your timelines into consideration and pace yourself accordingly.

That’s about it. To summarize:

  1. Get canned answers for the short term.
  2. Generate explorable data structures for the long term.
  3. Have fun while you are at it.

P.S: The SQL query that answers the question “Are there any security group rules present for AWS default VPC?” is not included here and is left as an exercise for the reader. Do reply back on Twitter with your answers. Fist bumps guaranteed.
