Amazon Elastic Compute Cloud (EC2) is a key part of Amazon Web Services (AWS) that lets you run instances of your software stack on demand. You can scale almost instantly and pay as you grow, making infrastructure management easier.
However, there can be complications, like slower system calls or various configuration issues, that can cause your EC2 instance to become unreachable. Knowing about these issues ahead of time can help you make wise decisions when choosing what tech to deploy.
In this article, you’ll learn about some of the issues users have reported and see how you can avoid them. You’ll also briefly learn about potential alternatives to EC2.
EC2: The Upside
EC2’s secure and reliable compute capacity offers teams several benefits when deploying to the cloud. Its instances come in several different types, and these types are optimized for various purposes. For example, there are balanced, general-purpose instances, as well as those optimized for compute power, storage, and memory.
You can also pick instances with specific features that match your requirements. These include cluster networking, Elastic Block Store (EBS) optimization, and Intel processor features, like the deep learning boost for AI.
You can quickly scale up, which means as your traffic grows, so does your server capacity. New instances do take a few minutes to spin up, so it’s not instantaneous. Nevertheless, you’ll still get fast scaling that responds to changing traffic levels very effectively.
Another AWS feature, Elastic Load Balancing, ensures incoming requests are spread among multiple instances. It can also safely protect instances that are under strain, distributing traffic elsewhere.
Amazon Simple Storage Service (S3) offers you persistent storage that isn’t tied to your EC2 instances. This means you can freely create or destroy your instances without data loss.
You can also run servers in multiple Availability Zones and Regions, giving you a measure of fault tolerance and keeping your services available even if one location fails.
Use Cases for EC2
Below are a few typical use cases where EC2 is a good fit. This is just a brief overview; there are many, many more.
Workloads in the Cloud
AWS lets you create distinct workloads to handle processing tasks. With EC2, these can be easily deployed to the cloud, running on multiple instances, with new ones created as required.
Flexible Database Customization
EC2 lets you create custom-built databases. You can prioritize speed, performance, scalability, or whatever else you need.
For instance, Amazon’s Image Builder lets you create images that you can instantly create new instances from. You’re free to configure these images as you see fit, and you can add software customized to work with the database of your choice and update it when necessary.
EC2 is also well suited to AI and machine learning workloads. You can choose a specialized instance with extra computing power and scale out to as many copies as you need.
It’s important to note that your choice of processor can have an impact on your costs here. Using a more expensive, faster processor can save you money if it gets your work done more quickly.
For example, some have found that the NVIDIA Tesla V100 GPU outperforms the cheaper M60 under specific types of workload.
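The cost argument above comes down to simple arithmetic: total cost is the hourly rate multiplied by the hours the job takes. The sketch below illustrates this with made-up numbers. The rates and run times are hypothetical and are not real AWS prices or benchmark results.

```python
def job_cost(hourly_rate: float, hours: float) -> float:
    """Total cost of a job: hourly instance price times run time in hours."""
    return hourly_rate * hours


def break_even_speedup(cheap_rate: float, fast_rate: float) -> float:
    """Speedup the pricier instance needs before it becomes the cheaper option."""
    return fast_rate / cheap_rate


# Hypothetical figures, for illustration only.
cheap_total = job_cost(hourly_rate=1.00, hours=10.0)  # slower, cheaper instance
fast_total = job_cost(hourly_rate=3.00, hours=2.5)    # faster, pricier instance

if __name__ == "__main__":
    print(f"cheap instance: ${cheap_total:.2f}, fast instance: ${fast_total:.2f}")
    print(f"break-even speedup: {break_even_speedup(1.00, 3.00):.1f}x")
```

In this made-up case the faster instance costs three times as much per hour but finishes four times sooner, so it works out cheaper overall. Whether that holds for a real workload depends on measured run times and current pricing.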
Is Using EC2 a Bad Idea?
Before you run a database on EC2, there are a few problems you should be aware of.
Several users have reported problems with databases on EC2. These are edge cases that occur only in specific circumstances. They won’t affect everyone, but they can have a significant impact when they do.
There are also speed issues with system calls, storage access problems, and instance reachability issues. This article will go through them one by one, along with the consequences and possible solutions.
Instance Reachability Check
Network or start-up configuration issues can cause your EC2 instance to become unreachable. This can also happen because of infrastructure problems, overloaded instances, or boot problems. Instances can also develop file system corruption or compatibility issues that prevent them from starting correctly.
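As a starting point for diagnosis, you can look at the status checks EC2 reports for an instance. The sketch below assumes you fetch them with boto3's `describe_instance_status` call (boto3 is not mentioned in the original article) and shows one way to pull out any check that has not passed. The helper name `failed_checks` is hypothetical.

```python
def failed_checks(response: dict) -> list:
    """List every status check in a describe_instance_status response
    that is not in the "passed" state.

    Returns tuples of (instance_id, scope, check_name, check_status).
    """
    failures = []
    for status in response.get("InstanceStatuses", []):
        # "InstanceStatus" covers the instance itself, "SystemStatus" the host side.
        for scope in ("InstanceStatus", "SystemStatus"):
            for detail in status.get(scope, {}).get("Details", []):
                if detail.get("Status") != "passed":
                    failures.append(
                        (status["InstanceId"], scope, detail["Name"], detail["Status"])
                    )
    return failures


# Possible usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   ec2 = boto3.client("ec2")
#   resp = ec2.describe_instance_status(IncludeAllInstances=True)
#   for failure in failed_checks(resp):
#       print(failure)
```

A failed `reachability` check under `InstanceStatus` usually points at the instance's own configuration, while a failure under `SystemStatus` suggests a problem on the underlying host.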
One way you can diagnose and fix these problems is to create a temporary instance and attach the storage from the failed instance to it. Then you can investigate the configuration files and remove any potentially troublesome entries. Once completed, you can then reattach the volume to your instance, and hopefully, the issues are fixed.
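The volume-swap procedure described above could be scripted roughly as follows. This is a minimal sketch, assuming a boto3 EC2 client is passed in as `ec2`. The function name, the instance and volume IDs, and the `/dev/sdf` device name are all hypothetical, and the rescue instance is assumed to be in the same Availability Zone as the volume.

```python
def move_volume_to_rescue(ec2, failed_id, rescue_id, volume_id, device="/dev/sdf"):
    """Stop the failed instance and attach its volume to a rescue instance.

    `ec2` is expected to behave like a boto3 EC2 client.
    """
    # Stop the failed instance so its volume can be detached cleanly.
    ec2.stop_instances(InstanceIds=[failed_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[failed_id])

    # Detach the volume and wait until it is free.
    ec2.detach_volume(VolumeId=volume_id, InstanceId=failed_id)
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    # Attach it to the rescue instance as a secondary device.
    ec2.attach_volume(VolumeId=volume_id, InstanceId=rescue_id, Device=device)
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])


# Possible usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   move_volume_to_rescue(boto3.client("ec2"), "i-FAILED", "i-RESCUE", "vol-EXAMPLE")
```

Once the configuration has been repaired on the rescue instance, the same steps in reverse return the volume to the original instance. If it is the root volume, it needs to be reattached under the original root device name.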
You can read more about diagnosing this issue in the article Instance reachability check failed at AWS EC2, with a helpful run-through of potential causes.
Slower System Calls
When using EC2, Heap found that they were experiencing a high CPU overhead in system calls to the clock.
A regular system call has to switch the CPU from user mode to kernel mode and back, which adds overhead on every call. The vDSO (virtual dynamic shared object) avoids this for some frequently used calls by letting them run entirely in user space.
However, with the Xen clock source used on older, Xen-based EC2 instances, the vDSO can’t serve time-related calls. Each one falls back to a full system call, which is much slower.
Packagecloud’s blog discusses two specific time-related system calls that run slower on EC2. The two calls in question, gettimeofday and clock_gettime, both trigger the issue. In their specific tests, the calls were about 77 percent slower.
Both posts include code to check whether the issue affects your system. You can write a short program that calls these functions in a loop, run it under strace, and see how many system calls are actually made. If the number of system calls matches the number of loop iterations, the functions are not being served by the vDSO.
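The posts mentioned above use their own test programs. As a rough stand-in, the sketch below calls `clock_gettime` in a loop from Python and reports the average time per call. The function name and iteration count are arbitrary choices, and Python's interpreter overhead is included in the figure, so treat the result as a relative indicator rather than a precise measurement.

```python
import time


def avg_ns_per_call(iterations: int = 200_000) -> float:
    """Average wall-clock cost, in nanoseconds, of one clock_gettime call."""
    start = time.perf_counter()
    for _ in range(iterations):
        # Goes through libc, which uses the vDSO when the clock source allows it.
        time.clock_gettime(time.CLOCK_REALTIME)
    elapsed = time.perf_counter() - start
    return elapsed / iterations * 1e9


if __name__ == "__main__":
    print(f"{avg_ns_per_call():.0f} ns per clock_gettime call")
```

To count the real system calls, you could save this as a script (say, `clock_loop.py`, a hypothetical name) and run `strace -c -e trace=clock_gettime,gettimeofday python3 clock_loop.py`. A count close to the number of iterations suggests the calls are bypassing the vDSO, while a count near zero suggests the vDSO is handling them.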
It’s possible to switch the clock source between Xen and TSC, but this can lead to other issues. There are problems associated with TSC, including overflow issues and out-of-order execution. Virtualization doesn’t help with these, with hypervisors potentially introducing timing and synchronization problems.
Changing the clock source can also cause the time to drift backward. That’s clearly a big problem if you’re doing something time dependent or need to have detailed audits of when user actions occur.
Newer EC2 instances using the Nitro hypervisor use kvm-clock as their source, which works much better. If your code depends on time-related calls, be sure to pick one of these instances.
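To see which clock source an instance is actually using, you can read it from sysfs on Linux. The sketch below reads the current and available clock sources from the standard kernel path. The helper name `read_clocksource` is hypothetical, and the `base` parameter exists only so the function can be pointed at a different directory for testing.

```python
from pathlib import Path

# Standard sysfs location for clock source information on Linux.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource(base=CLOCKSOURCE_DIR):
    """Return (current clock source, list of available clock sources)."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


# Possible usage on a Linux instance:
#   current, available = read_clocksource()
#   print(f"current: {current}, available: {', '.join(available)}")
```

A result of `xen` means the instance is using the Xen clock source discussed earlier, while `kvm-clock` is what you would expect on a Nitro-based instance.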
Read-Only NVMe Volumes
There’s also an issue with I/O operations. If an I/O request to an NVMe volume times out, the file system can be remounted as read-only. At that point you can’t write any data, which can bring a database to a halt.
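A quick way to confirm whether a volume has ended up in this state is to look at its mount options. The sketch below parses text in the format of `/proc/mounts` and lists block devices that are currently mounted read-only. The function name is hypothetical, and the sample device names in the usage note are only examples.

```python
def read_only_mounts(mounts_text: str) -> list:
    """Return (device, mount_point) pairs for /dev devices mounted read-only.

    Expects text in /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    read_only = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, _fs_type, options = fields[:4]
        if device.startswith("/dev/") and "ro" in options.split(","):
            read_only.append((device, mount_point))
    return read_only


# Possible usage on a Linux instance:
#   with open("/proc/mounts") as f:
#       for device, mount_point in read_only_mounts(f.read()):
#           print(f"{device} is mounted read-only at {mount_point}")
```

Some file systems are mounted read-only on purpose, so a hit here is only a sign of the time-out problem if the volume is one you expect to be writable, such as `/dev/nvme1n1` holding your database files.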
Fortunately, there’s a fix. You can add the nvme_core.io_timeout kernel parameter to your GRUB configuration and set it to a higher value.
Amazon’s documentation also covers this setting for EBS volumes exposed as NVMe devices. To get behavior similar to EBS volumes attached to Xen instances, it suggests setting nvme_core.io_timeout as high as possible. That means setting it to 255 on older kernels and 4294967295 on more modern systems.
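As an illustration of the GRUB change, the sketch below adds or updates the `nvme_core.io_timeout` setting in a kernel command-line string. It only manipulates text. The function name is hypothetical, and where the string lives (typically the `GRUB_CMDLINE_LINUX` entry in `/etc/default/grub`) is an assumption that varies by distribution.

```python
import re

PARAM = "nvme_core.io_timeout"


def set_io_timeout(cmdline: str, timeout: int) -> str:
    """Return the kernel command line with nvme_core.io_timeout set to `timeout`.

    Replaces an existing setting if present, otherwise appends one.
    """
    setting = f"{PARAM}={timeout}"
    # Match the parameter only when it starts a whitespace-separated token.
    pattern = re.compile(rf"(?<!\S){re.escape(PARAM)}=\S+")
    if pattern.search(cmdline):
        return pattern.sub(setting, cmdline)
    return f"{cmdline} {setting}".strip()


if __name__ == "__main__":
    # 4294967295 is the maximum on modern kernels; older kernels cap at 255.
    print(set_io_timeout("console=ttyS0 quiet", 4294967295))
```

After editing the command line, the GRUB configuration has to be regenerated and the instance rebooted for the new value to take effect. On a running system, the active value can usually be read from `/sys/module/nvme_core/parameters/io_timeout`.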
This discussion on Y Combinator’s Hacker News, which includes engineers from Heap and Amazon, shows that these solutions aren’t necessarily a silver bullet. The problems can recur after a while, so if you have high call volumes, you’ll need to find another way forward.
These are edge cases, but their implications need to be considered.
Alternatives to EC2
If the problems introduced here outweigh the benefits of using EC2, you can always turn to Amazon Relational Database Service (RDS). It’s easier to manage than EC2, giving you high performance, automatic maintenance, and scalability.
RDS may also be cheaper in many scenarios, so if you don’t need all that EC2 offers, RDS might be your best solution.
However, EC2 gives you far more scope for customization and is the better choice if you want to take advantage of its specialized features.
If you’re deploying to the cloud, you need to ensure you won’t run into issues as you scale. Getting things right early in your journey has long-term repercussions for the viability of your business.
As you’ve read, EC2 instances can run into several problems, including slow system calls, volumes that turn read-only, and unreachable instances. All of these can be serious if not addressed quickly.
However, these problems do have solutions. They aren’t all perfect, but in most cases, you can solve the problem and keep using EC2. That way, you can enjoy its advantages and avoid the cost of shifting to other technology.
(A big thank you to James Konik for his contribution to this article)