RDS Database Connection Pooling
This article presents the experience and practices our team gathered while working with the MySQL engine running on Amazon Relational Database Service (RDS). I focus especially on the connection layer and on tuning the Tomcat JDBC Connection Pool.
In our team we use two stand-alone service frameworks: Dropwizard and Spring Boot. Currently we support nearly two dozen services, and new ones are under development. Most of our services use MySQL for storing data, and usually the engine is problem-free. There were, however, some situations that required a look into the source code of the connection pool provider and some troubleshooting. Most problems were caused by an incomplete understanding of the limitations and mechanisms RDS comes with. Both Dropwizard and Spring Boot ship with default database connection pool settings, and both use Tomcat JDBC Connection Pool by default.
Number of connections opened by pool
For most developers RDS is completely transparent: they use MySQL running on RDS the same way they would use a self-hosted MySQL. There are, however, some limitations all users must be aware of. The first one, a barrier we hit at quite an early phase of the project, is that there is no direct control over the maximum number of connections accepted by MySQL running on RDS. We started by running our services on small instances (like db.t2.medium or even db.m1.small), using the default connection pool settings. It surprised us that under load tests these settings were not sufficient. We simply ran out of connections, which resulted in the following exceptions being thrown:
org.apache.tomcat.jdbc.pool.PoolExhaustedException: Timeout: Pool empty. Unable to fetch a connection in 30 seconds, none available[size:10; busy:10; idle:0; lastwait:30000].
! at org.apache.tomcat.jdbc.pool.ConnectionPool.borrowConnection(ConnectionPool.java:672) ~[tomcat-jdbc-8.0.18.jar:na]
! at org.apache.tomcat.jdbc.pool.ConnectionPool.getConnection(ConnectionPool.java:186) ~[tomcat-jdbc-8.0.18.jar:na]
! at org.apache.tomcat.jdbc.pool.DataSourceProxy.getConnection(DataSourceProxy.java:127) ~[tomcat-jdbc-8.0.18.jar:na]
! at org.hibernate.engine.jdbc.connections.internal.DatasourceConnectionProviderImpl.getConnection(DatasourceConnectionProviderImpl.java:139) ~[hibernate-core-4.3.6.Final.jar:4.3.6.Final]
! at org.hibernate.internal.AbstractSessionImpl$NonContextualJdbcConnectionAccess.obtainConnection(AbstractSessionImpl.java:380) ~[hibernate-core-4.3.6.Final.jar:4.3.6.Final]
! at org.hibernate.engine.jdbc.internal.LogicalConnectionImpl.obtainConnection(LogicalConnectionImpl.java:228) ~[hibernate-core-4.3.6.Final.jar:4.3.6.Final]
! ... 31 common frames omitted
OK, so the error message says we have at most 10 connections in the pool, and when none are available an HTTP thread waits 30 seconds before giving up. Neither of these values seemed right. The connection wait timeout was way too high: we couldn't afford to wait 30 seconds just to acquire a DB connection while processing a client's request. The maximum pool size was too low, as we wanted to handle higher traffic. The solution was to lower the maximum connection wait time to 1000 ms and increase the maximum number of connections to 50 per instance.
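In Dropwizard these two settings live in the `database` section of the service configuration. A minimal sketch (the URL and credentials are placeholders):

```yaml
database:
  driverClass: com.mysql.jdbc.Driver
  url: jdbc:mysql://example-db.abc123.eu-west-1.rds.amazonaws.com:3306/mydb
  user: service_user
  password: changeme
  # Fail fast instead of blocking HTTP threads for the default 30 seconds
  maxWaitForConnection: 1s
  # Raised from the default of 10 to handle the load-test traffic
  maxSize: 50
```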
Our deployment consisted of three instances reading data from the MySQL database (yes, I agree this may be too much for a db.t2.small instance type). There was also a periodic ETL process (AWS Data Pipeline) moving MySQL data to a data warehouse (Redshift). Furthermore, some developers were connected to the MySQL instance using MySQL Workbench. The following drawing presents this situation:
After tweaking the connection pool settings and running the high-traffic test again, we encountered the following error:
ERROR [2017-01-27 13:01:00,847]  org.apache.tomcat.jdbc.pool.ConnectionPool: Unable to create initial connections of pool.
! com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: Too many connections
! at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[na:1.8.0_20]
! at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source) ~[na:1.8.0_20]
! at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source) ~[na:1.8.0_20]
! at java.lang.reflect.Constructor.newInstance(Unknown Source) ~[na:1.8.0_20]
! at com.mysql.jdbc.Util.handleNewInstance(Util.java:408) ~[mysql-connector-java-5.1.31.jar:na]
! at com.mysql.jdbc.Util.getInstance(Util.java:383) ~[mysql-connector-java-5.1.31.jar:na]
Well, that was kind of unexpected! This error says that MySQL refused to open a new connection because the max_connections limit had been reached. After a short investigation we found out why: the RDS MySQL connection limit depends on the RDS instance class, so the smaller the DB instance, the fewer connections it accepts. According to this blog entry a db.t2.small instance accepts only 150 connections, and we tried to open about 170. The generic AWS rule for determining the maximum number of connections is:
max_connections = DBInstanceClassMemory / 12582880
Here are some samples:
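The rule above boils down to simple division, sketched below in Java. Note that DBInstanceClassMemory is somewhat less than the nominal instance RAM (the OS and the engine reserve some of it), which is why real limits, like the 150 we observed on db.t2.small, come out lower than this naive calculation suggests:

```java
public class MaxConnectionsEstimate {

    // AWS default for RDS MySQL: max_connections = DBInstanceClassMemory / 12582880
    static final long BYTES_PER_CONNECTION = 12582880L;

    static long maxConnections(long dbInstanceClassMemoryBytes) {
        return dbInstanceClassMemoryBytes / BYTES_PER_CONNECTION;
    }

    public static void main(String[] args) {
        // Naive estimate for a nominal 8 GiB instance; the real
        // DBInstanceClassMemory is a bit smaller than 8 GiB.
        long eightGiB = 8L * 1024 * 1024 * 1024;
        System.out.println(maxConnections(eightGiB)); // prints 682
    }
}
```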
After that incident we defined a set of rules for DB connection pool sizing, which I describe in the conclusions section.
At a later stage of the project we decided to perform destructive tests and check how our application behaves during an RDS failover event. The Amazon documentation says the following about RDS failover:
Amazon RDS provides high availability and failover support for DB instances using Multi-AZ deployments. Multi-AZ deployments for Oracle, PostgreSQL, MySQL, and MariaDB DB instances use Amazon technology, while SQL Server DB instances use SQL Server Mirroring.
In a Multi-AZ deployment, Amazon RDS automatically provisions and maintains a synchronous standby replica in a different Availability Zone. The primary DB instance is synchronously replicated across Availability Zones to a standby replica to provide data redundancy, eliminate I/O freezes, and minimize latency spikes during system backups. Running a DB instance with high availability can enhance availability during planned system maintenance, and help protect your databases against DB instance failure and Availability Zone disruption.
Basically, failover is a mechanism that provides a highly available MySQL deployment. AWS automatically changes the DNS records to point to the standby DB instance when it detects problems with the primary MySQL instance. The entire process usually takes 2 – 4 minutes. In case of failover, clients need to re-establish any existing connections to the DB instance (more information here).
This image represents the concept of Multi-AZ deployment of RDS:
To "artificially" trigger a MySQL failover we can use the AWS Management Console: right-click the RDS instance and choose the "Reboot" option, then on the next screen check the "Reboot With Failover?" box and click Reboot.
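The same failover can be triggered from the command line with the AWS CLI (the instance identifier is a placeholder; the command needs valid AWS credentials):

```shell
# Reboot the instance and force a failover to the standby replica
aws rds reboot-db-instance \
    --db-instance-identifier my-db-instance \
    --force-failover
```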
The initial test results were unsatisfactory. The application did not recover within 5 minutes, which was unacceptable. After some investigation we realized that the default connection validation settings were not optimal. We set the following connection pool properties:
- testOnBorrow – true
- testOnConnect – true
- logAbandonedConnections – true
- logValidationErrors – true
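In Dropwizard configuration terms these correspond to the following `database` properties (a configuration sketch):

```yaml
database:
  # Validate a connection before handing it to application code
  checkConnectionOnBorrow: true
  # Validate a connection the first time it is established
  checkConnectionOnConnect: true
  # Log stack traces for code that abandons connections
  logAbandonedConnections: true
  # Log validation failures as errors - invaluable during failover tests
  logValidationErrors: true
```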
After that change the application recovered faster (in about three minutes); however, connection validation errors kept appearing in the logs for about three more minutes. (They did not affect client traffic: Tomcat JDBC Connection Pool runs a background job that verifies idle connections.) These were much better, acceptable results. We could potentially shorten that time further by setting a low JVM DNS cache TTL, but that would save only a few seconds, because our JVM implementation sets it to 30 seconds by default.
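Lowering the JVM DNS cache TTL can be done programmatically, as long as it happens before the first lookup of the RDS hostname; a sketch:

```java
import java.security.Security;

public class DnsCacheConfig {

    public static void main(String[] args) {
        // Cache successful DNS lookups for at most 5 seconds so that the
        // pool picks up the new primary's address quickly after failover.
        // Must run before the first name resolution; without a security
        // manager the JVM defaults to caching for 30 seconds.
        Security.setProperty("networkaddress.cache.ttl", "5");

        System.out.println(Security.getProperty("networkaddress.cache.ttl"));
    }
}
```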
These two tests made us realize that RDS comes with constraints and mechanisms that need to be appropriately addressed, and that developers should be aware of them. Our tests resulted in a guidance document describing best practices for our services and developers using MySQL on RDS.
Database connection pool sizing recommendations:
- Take the size of the DB instance into consideration. Calculate or check the maximum number of connections allowed.
- Find out how many instances of the application will connect to the DB instance. If you use an Auto Scaling Group, check the upper limit of the group.
- Remember all background jobs that require a DB connection (for example AWS Data Pipeline, AWS Glue, or other ETL tools). Check how many connections these jobs use.
- Leave a few connections available for developers so they can connect to the DB for troubleshooting.
- Adjust the pool's connection wait timeout; in most cases it directly affects your service response time.
- Test the chosen settings under extreme conditions.
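The sizing rules above boil down to simple arithmetic. A sketch with illustrative numbers (the reserved connection counts are assumptions, not measured values):

```java
public class PoolSizeCalculator {

    static int maxPoolSizePerInstance(int dbMaxConnections,
                                      int appInstances,
                                      int etlConnections,
                                      int developerConnections) {
        // Connections left after background jobs and troubleshooting
        // sessions, split evenly across all application instances.
        int available = dbMaxConnections - etlConnections - developerConnections;
        return available / appInstances;
    }

    public static void main(String[] args) {
        // db.t2.small accepts ~150 connections; assume the ETL jobs use 15,
        // we keep 15 spare for developers, and run 3 service instances.
        System.out.println(maxPoolSizePerInstance(150, 3, 15, 15)); // prints 40
    }
}
```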
Here are recommended settings we use:
| Dropwizard Property | Spring Boot Property | Recommended value | Comment |
|---|---|---|---|
| checkConnectionOnBorrow | test-on-borrow | true | Whether connections are validated before being borrowed from the pool. If a connection fails validation, it is dropped from the pool and another one is borrowed. |
| checkConnectionOnConnect | test-on-connect | true | Set to true to run the validation query when a connection to the database is first established. |
| checkConnectionWhileIdle | test-while-idle | true | Set to true to validate connections while they are idle. |
| initialSize | initial-size | equal to minSize | The number of connections established when the pool starts. Be aware that this parameter affects start-up time (the pool is fully created while the application starts). If this value is smaller than minSize, the idle connection validation job will not validate all connections in the pool; this is why initialSize should be equal to minSize. |
| logAbandonedConnections | log-abandoned | true | Set to true to log stack traces for application code which abandoned a connection. Helpful for troubleshooting. |
| logValidationErrors | log-validation-errors | true | Set to true to log validation errors as error messages. Helpful for troubleshooting. |
| maxSize | max-idle | depends on many factors (see recommendations above) | The maximum number of connections that should be kept in the idle pool. |
| maxSize | max-active | depends on many factors (see recommendations above) | The maximum number of active connections that can be allocated from this pool at the same time. |
| maxWaitForConnection | max-wait | 1000 | The maximum number of milliseconds the pool waits (when no connections are available and the maximum size has been reached) for a connection to be returned before throwing a PoolExhaustedException. If set too high, client response time may suffer. |
| minSize | min-idle | depends on many factors | The minimum number of established connections that should be kept in the pool at all times. The pool can shrink below this number if validation queries fail and connections get closed; the idle pool will not shrink below this value during an eviction run. |
| validationQuery | validation-query | /* Service Name Health Check */ SELECT 1 | The SQL query used to validate connections before returning them to the caller or to the pool. The query does not have to return any data, it just can't throw a SQLException. It is worth putting the service name in the comment; it will show up in MySQL logs in case of problems. |
Sample Dropwizard configuration
validationQuery: "/* Payment Platform Health Check */ SELECT 1"
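The single line above is only a fragment. A fuller `database` section applying the recommendations from the table might look like this (the pool sizes are illustrative assumptions, not universal values):

```yaml
database:
  checkConnectionOnBorrow: true
  checkConnectionOnConnect: true
  checkConnectionWhileIdle: true
  logAbandonedConnections: true
  logValidationErrors: true
  initialSize: 10
  minSize: 10
  maxSize: 40
  maxWaitForConnection: 1s
  validationQuery: "/* Payment Platform Health Check */ SELECT 1"
```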
Sample Spring Boot configuration
validation-query: /* Service Name Health Check */ SELECT 1
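A fuller Spring Boot equivalent, using the property names from the table (pool sizes are illustrative; depending on your Spring Boot version the Tomcat-specific properties may need to go under the `spring.datasource.tomcat` prefix instead):

```yaml
spring:
  datasource:
    test-on-borrow: true
    test-on-connect: true
    test-while-idle: true
    log-abandoned: true
    log-validation-errors: true
    initial-size: 10
    min-idle: 10
    max-idle: 40
    max-active: 40
    max-wait: 1000
    validation-query: "/* Service Name Health Check */ SELECT 1"
```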
I hope this article helps someone make their application more resilient and reliable.