How To Troubleshoot Connectivity Issues In AWS Deployments?
Whether you are just learning AWS or have been using Amazon Web Services for some time, you will invariably run into connectivity issues in your deployments. For example, you may not be able to SSH into an EC2 instance, or the application tier may not be able to talk to the database. There may even be times when connectivity worked fine when your stack was deployed but broke some time later. I am sure you can relate to at least some of these experiences. In this post, I will talk about how to troubleshoot and resolve some of the commonly encountered connectivity issues in AWS deployments.
Using Telnet For Port Check
Before we start talking about the issues, let’s first learn a simple troubleshooting technique called a port check. In networking, a port check tests whether a port on a given node is listening and reachable from where you are. For example, to check the standard SSH port on a machine, you would check port 22. Why is a port check important? Because it is one of the most fundamental checks you can do to test connectivity between two components without knowing much about the components themselves. For example, if an application running on an EC2 instance is failing to connect to an RDS instance, and the port check for the database port also fails from that EC2 instance, you have confirmed that there is a connectivity issue between the two rather than a problem inside the application.
You can use the telnet utility for a port check. It is available on most platforms and is often pre-installed or can be easily installed. To do the port check, run a command like the one shown below.
telnet <target-ip-address-or-dns-name> <port>
If you are able to telnet successfully, the port check is successful. Otherwise, it has failed. Simple!
The following screenshot shows a successful port check using telnet on an EC2 instance with the IP address 10.1.0.195 on the SSH port 22.
Here is another screenshot that shows a failed port check. In this case, the telnet connection has simply hung (that is, it is not able to connect successfully). You may also see other variants, such as an immediate "Connection refused" error.
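If telnet is not installed on the machine you are testing from, the same port check can be approximated with a few lines of Python. The following is a minimal sketch using only the standard library; the function name check_port and the three-second default timeout are my own choices for illustration, not part of any AWS tooling.

```python
import socket


def check_port(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a TCP connection and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <details>".
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer at all, which is what a hung telnet session looks like.
        return "timeout"
    except OSError as exc:
        # Anything else, for example a DNS name that does not resolve.
        return f"error: {exc}"
```

For example, check_port("10.1.0.195", 22) would test the SSH port on the instance used in the screenshots above. As a rough rule of thumb, a "timeout" result often means that something is silently dropping the traffic, while "refused" usually means the host was reached but no service is listening on that port.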
Common Connectivity Issues In AWS Deployments
Let’s talk about some common connectivity issues in AWS deployments now. Here is a list of such issues.
- Not able to connect to an EC2 instance via SSH.
- An application running on an EC2 instance is not able to connect to the RDS instance.
- The users are not able to access the web application.
These are just a few examples, but they represent common patterns, and you may find other issues that follow the same patterns. So, let’s discuss how to troubleshoot and resolve these.
Not able to connect to an EC2 instance via SSH
If you are not able to SSH into an EC2 instance, first do a port check on the SSH port using the instance IP or DNS name to see whether basic connectivity is working.
If the port check was successful (that is, you were able to telnet to the SSH port), but you are still not able to connect via SSH, check for the following.
- Use of incorrect SSH user: Although the ec2-user is used quite commonly, certain AMIs use a different user. For example, the default user for the CentOS AMI is centos. You can check out Get Information About Your Instance to find the default user for various AMIs.
However, if the port check failed, the following could be some common reasons.
- Use of incorrect IP address or DNS name for the EC2 instance: This can happen due to a user error, or because the EC2 instance was stopped and started again, which changes its public IP and DNS name (a plain reboot keeps them). If you have assigned an Elastic IP to the instance, the address will not change. But Elastic IPs are a limited resource that can incur charges, so they are typically used only for important instances. So, make sure to check that the IP/DNS name is correct.
- A firewall is blocking the SSH connection: At times, the corporate InfoSec team may block SSH connectivity to public IP addresses. This is typically done to avoid the scenario where an attacker takes advantage of an SSH vulnerability to get into the corporate network. You can do some initial troubleshooting for this by using the SSH verbose option (-v or -vvv) when establishing the SSH connection. If the corporate firewall turns out to be the cause, you will need to work with your InfoSec team on the resolution. One possible solution is to assign an Elastic IP to your EC2 instance and get an exception from them to allow SSH to this IP address. Another potential firewall issue is that the OS firewall on the instance (such as the iptables configuration) is blocking the SSH connection. To fix this, you will have to modify the firewall configuration (preferably) or turn the firewall off.
- Incorrect Security Group configuration: This is quite common especially in manual deployments wherein either the Security Group that permits SSH connectivity has not been assigned to the EC2 instance or it does not have the proper ingress rule. So, double-check the Security Group assignment and ingress rules. Remember that you can assign multiple security groups to an EC2 instance and you can change the Security Group assignment post instance creation as well. The following screenshot shows a sample security group assignment to the EC2 instance.
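To make the ingress rule check more concrete, here is a small sketch that evaluates whether a set of rules would permit a connection from a given source IP to a given port. It assumes the rules are plain dictionaries shaped like the IpPermissions entries that the EC2 DescribeSecurityGroups API returns (for example, through boto3). The function name ingress_allows and the sample data are hypothetical, and the sketch only looks at IPv4 CIDR ranges; rules that reference another Security Group or IPv6 ranges are ignored.

```python
import ipaddress


def ingress_allows(permissions: list, source_ip: str, port: int,
                   protocol: str = "tcp") -> bool:
    """Return True if any rule permits traffic from source_ip to port."""
    source = ipaddress.ip_address(source_ip)
    for rule in permissions:
        rule_protocol = rule.get("IpProtocol")
        if rule_protocol == "-1":
            # "-1" means all protocols, which also covers all ports.
            pass
        elif rule_protocol == protocol:
            # The port must fall inside the rule's FromPort..ToPort range.
            if not rule.get("FromPort", 0) <= port <= rule.get("ToPort", 65535):
                continue
        else:
            continue
        # Only IPv4 CIDR sources are considered in this sketch.
        for ip_range in rule.get("IpRanges", []):
            if source in ipaddress.ip_network(ip_range["CidrIp"]):
                return True
    return False


# Hypothetical rule set: SSH allowed only from one office network.
sample_permissions = [
    {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
     "IpRanges": [{"CidrIp": "203.0.113.0/24"}]},
]
```

With this sample data, ingress_allows(sample_permissions, "203.0.113.10", 22) is True, while the same call from an address outside 203.0.113.0/24, or for a different port, is False. That is the kind of mismatch to look for when a port check fails.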
An application running on an EC2 instance is not able to connect to the RDS instance
To troubleshoot this, do a port check from the EC2 instance to the RDS instance on the database port and see if that works.
The following screenshot shows a successful telnet port check to an RDS instance.
If the port check was successful but the application still cannot connect, check for the following.
- Use of incorrect RDS instance name or port: Check your application configuration to see if it has the correct RDS instance name and port value.
If the port check failed, these could be some common reasons.
- Use of incorrect RDS instance name or port: Double-check the instance name and port to ensure these are correct.
- Is the RDS instance up? RDS lets you stop instances, which is useful for cost control when an instance is not in use. For example, a development RDS instance may be stopped during the weekends. So, check that the RDS instance is available.
- Incorrect Security Group configuration: Check whether the Security Group assigned to the RDS instance has correct ingress rules to permit connectivity from the EC2 instance. This would typically involve ensuring the ingress rule has the correct source subnet (which should be your EC2 subnet) and target (database) port specified. The following screenshot shows ingress rule entries for MySQL/Aurora RDS instance.
- Missing VPC Peering configuration: Resources like RDS instances are often shared between application teams to reduce cost. In such cases, the RDS instance is hosted in a different VPC (say, a common VPC) and the applications are typically deployed in their own VPCs. VPC Peering is used to connect each application VPC with the common VPC so that the applications can talk to the RDS instance. If the port check failed, check that a Peering Connection has been established between the two VPCs and that route table entries have been created in both VPCs to route traffic to each other's address range. For example, the screenshot below shows adding an entry in the common VPC route table for the application VPC address range 10.1.0.0/16 that is routed via the Peering Connection pcx-09fbbdba113f3bb44. Similarly, an entry will need to be created in the application VPC’s route table with the common VPC address range as the destination.
Although we talked about an RDS instance here, these troubleshooting techniques can be used for connectivity issues with other application stack components, such as Consul, Elasticsearch, and so on.
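The route table part of the peering check can be sketched as follows. This assumes the routes are dictionaries shaped like the Routes entries that the EC2 DescribeRouteTables API returns (for example, through boto3). The function name peering_route_for is hypothetical, and so is the common VPC range 10.0.0.0/16 in the sample data; only 10.1.0.0/16 and the Peering Connection ID come from the example above.

```python
import ipaddress


def peering_route_for(routes: list, destination_ip: str):
    """Return the Peering Connection ID that traffic to destination_ip would
    use, or None if the best matching route is not a peering route."""
    destination = ipaddress.ip_address(destination_ip)
    best_route, best_prefix = None, -1
    for route in routes:
        cidr = route.get("DestinationCidrBlock")
        # Skip non-IPv4 routes and routes that are not active (e.g. blackhole).
        if cidr is None or route.get("State") != "active":
            continue
        network = ipaddress.ip_network(cidr)
        # The most specific (longest prefix) matching route wins.
        if destination in network and network.prefixlen > best_prefix:
            best_route, best_prefix = route, network.prefixlen
    if best_route is None:
        return None
    return best_route.get("VpcPeeringConnectionId")


# Route table of the common VPC (the local range here is an assumed value).
common_vpc_routes = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local",
     "State": "active"},
    {"DestinationCidrBlock": "10.1.0.0/16",
     "VpcPeeringConnectionId": "pcx-09fbbdba113f3bb44", "State": "active"},
]
```

Here peering_route_for(common_vpc_routes, "10.1.0.195") returns the Peering Connection ID, which tells you that return traffic to the application VPC has a path. A result of None for an address in the other VPC points to a missing or inactive route, and the same check has to pass in the application VPC's route table as well.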
The users are not able to access the web application
Again, a port check can be used here to see if the connectivity is working.
If the port check is successful, the following could be some common reasons.
- Application issue: Your application may be having issues. For example, the application stack may not be up, or it may be returning errors. So, check the application logs for any potential errors.
- Are you using a load balancer? For a deployment that uses a load balancer (LB), check the instance health in the load balancer setup to ensure that the LB is able to use the instances to serve traffic. Note that the LB routes traffic only to healthy instances. If any of the instances are shown as unhealthy, check their logs to troubleshoot further.
If the port check failed, these could be some common reasons.
- Use of incorrect IP address or DNS name: Ensure that the correct IP address/DNS name is being used to access the application.
- Use of incorrect port: Is your application running on a non-standard web port (that is, a port other than 80 for HTTP and 443 for HTTPS)? For example, if you are using Tomcat, the application may be running on port 8080 instead. So, check your configuration and ensure that the users are using the correct port.
- Incorrect Security Group configuration: Check whether the correct Security Group has been assigned to the web tier resource(s) and verify that its ingress rules permit connectivity to the correct port. For example, the screenshot below shows an ingress rule to permit traffic on port 8080 from any IP address.
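For a web application, the port check can be taken one step further by sending an actual HTTP request. This helps separate the two cases discussed in this section: the network path is broken, or the port is reachable but the application itself is returning errors. The following is a minimal sketch using only the Python standard library; the function name probe_http and the result strings are my own choices for illustration.

```python
import urllib.error
import urllib.request


def probe_http(url: str, timeout: float = 5.0) -> str:
    """Send a GET request and report whether the application answered."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return f"reachable, HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # The server answered with an error status (4xx/5xx), so the network
        # path is fine and the problem is likely in the application.
        return f"reachable, HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # No HTTP answer at all: wrong name, wrong port, blocked or timed out.
        return f"unreachable: {exc}"
```

A call such as probe_http("http://<target-ip-address-or-dns-name>:8080/") that reports a 5xx status suggests looking at the application logs or the load balancer health checks, whereas an "unreachable" result sends you back to the IP/DNS name, port, and Security Group checks listed above.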
In this post, we covered the port check, one of the most commonly used troubleshooting techniques. It can be used in combination with other tools and techniques to effectively troubleshoot and identify the root cause. Troubleshooting connectivity issues can be challenging, so it is worth spending some time familiarizing yourself with these tools and techniques in advance so that you are better prepared when issues arise.
Be a smart troubleshooter!
If you liked this post, you may find my AWS Advanced For Developers course helpful. It focuses on many such best practices and techniques for designing and deploying real-world applications in AWS.
Also published on Medium.