Connections

Handling connections in production

Every meaningful application relies on multiple components, both internal and external, to work seamlessly. These components might include microservices, databases, messaging services, or external systems.

To interact with these systems, the application establishes connections over which communication takes place. Without diving too deep into how these connections are formed or which packets are exchanged, it’s important to understand that establishing a connection is relatively expensive in both time and memory. And because these connections are usually made over a network, they can also be unreliable at times.

Network reliability is a broader topic that I'll address separately. For now, let's focus on scenarios where frequent communication with these systems is necessary, leading us to the concept of connection pooling.

Connection Pooling and Tuning Parameters

Most of us have used connection pooling, whether through custom solutions or third-party libraries like HikariCP, Apache DBCP, or C3P0. When developing locally, we often connect to systems running on a local network or use mocks, so connection loss and pooling settings may not seem critical. In production, however, our assumptions are constantly challenged. To ensure your application performs well under varying loads, several connection pooling parameters should be carefully tuned (a configuration sketch follows the list):

  1. Max Pool Size (maximumPoolSize):

    • Description: This parameter controls the maximum number of connections that can be pooled at a time.

    • Tuning: Setting this too low can cause a bottleneck during peak loads, leading to connection timeouts. Setting it too high might overwhelm your database or other external systems. The optimal size depends on your workload and the system's capacity to handle concurrent connections.

  2. Min Idle Connections (minimumIdle):

    • Description: Defines the minimum number of idle connections that should always be maintained in the pool.

    • Tuning: A higher value ensures that requests can be served faster during sudden traffic spikes, as connections are already available. However, maintaining too many idle connections can consume unnecessary resources.

  3. Connection Timeout (connectionTimeout):

    • Description: The maximum time that the pool will wait for a connection to be available before throwing an exception.

    • Tuning: If set too low, your application might frequently encounter timeouts during high load periods. If set too high, it might delay error reporting in case of an actual issue with connection availability.

  4. Idle Timeout (idleTimeout):

    • Description: The maximum amount of time that a connection is allowed to sit idle in the pool before it is closed.

    • Tuning: Idle Timeout is crucial because it helps manage resources by ensuring that idle connections do not persist longer than necessary. If idle connections are left open indefinitely, they can consume valuable database resources, potentially leading to connection exhaustion, memory leaks, or unexpected database issues. Conversely, if the timeout is too short, connections might close too frequently, leading to increased overhead from re-establishing connections. A balanced timeout ensures resources are used efficiently while maintaining readiness for new requests.

  5. Max Lifetime (maxLifetime):

    • Description: The maximum lifetime of a connection in the pool.

    • Tuning: This setting ensures that connections are recycled regularly, preventing issues such as memory leaks, stale connections, or connections that might be forcibly closed by the database server due to prolonged inactivity. Setting this slightly shorter than the database’s timeout helps avoid unexpected disconnections and keeps the connection pool healthy.

  6. Validation Query (connectionTestQuery):

    • Description: A query used to check if the connection is still valid before it is handed off to the application.

    • Tuning: This is essential in environments where connections might be dropped due to inactivity or network issues. Without validation, your application might end up using a dead or broken connection, leading to errors. However, frequent validation adds overhead, so it's important to balance the need for reliability against the cost of executing these queries. Note that with a JDBC4-compliant driver, pools such as HikariCP validate connections via Connection.isValid() when no test query is configured, so an explicit query is often unnecessary.

  7. Time to Live (TTL):

    • Description: The duration for which a connection can remain open before it is automatically closed.

    • Tuning: TTL ensures that connections are not held indefinitely, which could lead to resource exhaustion or stale data issues. By enforcing a TTL, you regularly refresh connections, ensuring that your application is always using fresh, valid connections. The exact knob varies by library; in HikariCP, maxLifetime effectively serves this purpose.

  8. Ping Heartbeats:

    • Description: Periodic signals sent to keep the connection alive and to detect if the connection has been dropped.

    • Tuning: Ping Heartbeats are particularly useful in long-running connections where inactivity might lead to the connection being dropped by the database server. Regular pings can prevent the connection from being closed due to inactivity, ensuring that the connection pool remains ready to serve requests without the need to frequently re-establish connections.

  9. Acquire Increment:

    • Description: The number of new connections to create when the pool is exhausted.

    • Tuning: Adjusting this can help manage the rate at which new connections are opened during spikes in demand. A smaller increment conserves resources but might cause delays; a larger increment can quickly satisfy high demand but at the cost of increased resource consumption.
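
To make these parameters concrete, here is a minimal HikariCP configuration sketch in Java. The JDBC URL, credentials, and the numeric values are placeholders for illustration only; the right numbers depend on your workload, and keepaliveTime is only available in recent HikariCP versions.

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PoolSetup {

    public static HikariDataSource createDataSource() {
        HikariConfig config = new HikariConfig();

        // Placeholder connection details -- replace with your own.
        config.setJdbcUrl("jdbc:postgresql://db-host:5432/appdb");
        config.setUsername("app_user");
        config.setPassword("app_password");

        // 1. Max pool size: upper bound on concurrent connections.
        config.setMaximumPoolSize(20);
        // 2. Min idle: connections kept ready for sudden spikes.
        config.setMinimumIdle(5);
        // 3. Connection timeout: how long a caller waits for a free connection (ms).
        config.setConnectionTimeout(30_000);
        // 4. Idle timeout: close connections idle longer than this (ms).
        config.setIdleTimeout(600_000);
        // 5. Max lifetime: recycle connections before the database or a firewall
        //    closes them; keep this a bit shorter than the server-side timeout (ms).
        config.setMaxLifetime(1_800_000);
        // 6. Validation query: only needed for drivers that are not JDBC4-compliant;
        //    JDBC4 drivers are validated via Connection.isValid() instead.
        // config.setConnectionTestQuery("SELECT 1");
        // 8. Heartbeats: keepaliveTime (recent HikariCP versions) pings idle
        //    connections so intermediaries do not silently drop them (ms).
        config.setKeepaliveTime(300_000);

        return new HikariDataSource(config);
    }
}
```

Items 7 and 9 do not map to dedicated HikariCP settings: maxLifetime effectively covers TTL, while pools such as c3p0 expose acquireIncrement directly.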

Why These Settings Matter

When developing locally, these settings may seem trivial. However, in production, they play a critical role in ensuring that your application can handle real-world scenarios.

  • Idle Timeout: If idle connections are not closed after a certain period, they might consume unnecessary resources, leading to potential performance degradation or even system crashes under heavy load. Properly tuning the idle timeout helps maintain a healthy balance between resource utilization and application readiness.

  • Validation Queries: Without validation, your application might attempt to use a broken or stale connection, leading to unexpected errors and potential downtime. Although validation adds overhead, it's a necessary step to ensure the reliability of your connections in a production environment.

  • Ping Heartbeats: In environments where network stability is a concern, ping heartbeats help ensure that connections remain active and are not dropped due to inactivity. This prevents sudden connection loss and ensures that your application can continue to operate smoothly without interruption.

Preparing for Production

To make sure your application is ready for production, it's important to run load and stress tests. These tests help determine the appropriate tuning settings based on the observed load and patterns. Even with these settings in place, production monitoring and observability related to connections are crucial.

Monitoring tools like Prometheus, Grafana, or custom logging can help track metrics such as connection pool utilization, wait times, and failure rates. These insights can guide further tuning efforts and help you adapt to changes in production load.
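
As a sketch of how this can look with Micrometer (whose registries Prometheus can scrape and Grafana can chart), HikariCP can be wired to a meter registry directly. The connection details below are placeholders, and the exact metric set depends on your HikariCP and Micrometer versions.

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import com.zaxxer.hikari.metrics.micrometer.MicrometerMetricsTrackerFactory;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

public class PoolMetrics {

    public static HikariDataSource createMonitoredPool() {
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://db-host:5432/appdb"); // placeholder
        config.setUsername("app_user");
        config.setPassword("app_password");

        // Publishes pool gauges and timers (active, idle, pending, usage,
        // acquire time) to the registry for scraping and dashboards.
        config.setMetricsTrackerFactory(new MicrometerMetricsTrackerFactory(registry));

        return new HikariDataSource(config);
    }
}
```

In a Spring Boot application with Actuator and Micrometer on the classpath, HikariCP metrics (hikaricp.connections.*) are typically published automatically, without this manual wiring.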

Continuous Monitoring and Tuning

The results from monitoring can be surprising, often differing from what you observed in your test environment. Continuous learning from these observations is key to further tuning your application.

Typical Issues You Will Encounter in Production

1. Connection Leaks

  • Description: A connection leak occurs when a connection is not properly closed and returned to the pool after use. Over time, this can exhaust the pool, causing new requests to fail as they cannot obtain a connection.

  • Impact: Leads to resource exhaustion, resulting in application downtime or degraded performance.
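
Most leaks come from code paths that fail to return a connection on error. A minimal sketch of the safe pattern, using try-with-resources against an assumed DataSource and a hypothetical orders table, looks like this:

```java
import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class OrderRepository {

    private final DataSource dataSource; // e.g. a HikariDataSource injected elsewhere

    public OrderRepository(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public int countOrders() throws SQLException {
        // try-with-resources guarantees the connection, statement and result set
        // are closed (i.e. returned to the pool) even when an exception is thrown.
        try (Connection connection = dataSource.getConnection();
             PreparedStatement statement = connection.prepareStatement("SELECT COUNT(*) FROM orders");
             ResultSet resultSet = statement.executeQuery()) {
            return resultSet.next() ? resultSet.getInt(1) : 0;
        }
    }
}
```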

2. Connection Pool Exhaustion

  • Description: Under heavy load, demand for connections may exceed the maximum pool size, causing requests to wait for a connection to become available.

  • Impact: Increased latency, timeouts, and potential application crashes if the pool cannot serve all requests.
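
There is no universally correct maximum pool size, but a commonly cited starting heuristic (from the HikariCP pool-sizing guidance) is sketched below. Treat it as a first guess to validate under load, not a rule; the spindle count here is an assumption.

```java
public class PoolSizing {

    public static void main(String[] args) {
        // Commonly cited starting point:
        //   pool size ≈ (cpu_cores * 2) + effective_spindle_count
        int cpuCores = Runtime.getRuntime().availableProcessors();
        int effectiveSpindleCount = 1; // assumption: a single SSD-backed volume
        int suggestedPoolSize = (cpuCores * 2) + effectiveSpindleCount;
        System.out.println("Suggested starting maximumPoolSize: " + suggestedPoolSize);
    }
}
```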

3. Timeouts

  • Description: Connections can time out if they take too long to establish, or if a query execution exceeds the configured timeout limit. This is more common on unreliable networks where latency is high.

  • Impact: Failed transactions, partial updates, or inconsistencies in the application state.
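
Connection acquisition timeouts are governed by the pool (connectionTimeout above), but query execution needs its own limit. A sketch using the standard JDBC setQueryTimeout, with an assumed DataSource, a hypothetical payments table, and an arbitrary five-second limit:

```java
import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class ReportQuery {

    private final DataSource dataSource;

    public ReportQuery(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public int slowAggregate() throws SQLException {
        try (Connection connection = dataSource.getConnection();
             PreparedStatement statement =
                     connection.prepareStatement("SELECT SUM(amount) FROM payments")) {
            // Abort the query if it runs longer than 5 seconds, instead of
            // holding the connection (and the caller) indefinitely.
            statement.setQueryTimeout(5);
            try (ResultSet resultSet = statement.executeQuery()) {
                return resultSet.next() ? resultSet.getInt(1) : 0;
            }
        }
    }
}
```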

4. Connection Resets

  • Description: In unreliable networks, connections may be unexpectedly reset by the server or network infrastructure, interrupting active transactions.

  • Impact: Data loss, incomplete transactions, and the need for complex retry logic.

5. Database Overload

  • Description: High volumes of concurrent connections can overwhelm the database, especially if it's not properly tuned or scaled.

  • Impact: Slow query performance, deadlocks, and potential database crashes.

6. Stale Connections

  • Description: Connections that remain idle for too long may become stale, especially in environments with frequent network changes. When these stale connections are reused, they might fail.

  • Impact: Increased latency as the system must re-establish connections, leading to possible timeouts.

7. Connection Validation Failures

  • Description: If connections are not validated before use, the application might try to use invalid or stale connections, leading to failures.

  • Impact: Failed transactions and degraded performance.

8. Insufficient Connection Pooling Configuration

  • Description: Poorly configured connection pools (e.g., incorrect max-pool-size, idle-timeout, max-lifetime, etc.) can result in suboptimal performance or resource exhaustion.

  • Impact: Inefficient resource utilization, leading to latency and potential downtime.

9. Inconsistent Connection Performance

  • Description: In a distributed system, different nodes might experience varying network conditions, leading to inconsistent connection performance across the application.

  • Impact: Unpredictable application behavior and difficulties in troubleshooting and monitoring.

10. Resource Contention

  • Description: Multiple applications or services sharing the same database or message queue can lead to resource contention, especially under heavy load.

  • Impact: Reduced performance, increased latency, and potential deadlocks.

11. Inadequate Monitoring and Observability

  • Description: Lack of proper monitoring for connection pools and network health can make it difficult to identify and resolve connection-related issues.

  • Impact: Prolonged downtime, inefficient troubleshooting, and reactive rather than proactive maintenance.

12. Message Queue (MQ) Backpressure

  • Description: In message-based systems, if consumers cannot process messages quickly enough, the queue can build up, leading to backpressure and connection issues.

  • Impact: Delayed message processing, potential message loss, and overwhelmed producers.
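
Real brokers expose this as prefetch limits and bounded queues, but the generic shape of the defense, a bounded buffer that pushes back on producers instead of growing without limit, can be sketched in plain Java; the queue size and timings are arbitrary.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class BackpressureSketch {

    // A bounded buffer: when consumers fall behind, offers start failing
    // instead of the backlog (and memory use) growing without limit.
    private static final BlockingQueue<String> BUFFER = new ArrayBlockingQueue<>(1_000);

    public static boolean publish(String message) throws InterruptedException {
        // Wait briefly for space; on timeout the caller must slow down,
        // retry later, or shed load -- that is the backpressure signal.
        return BUFFER.offer(message, 100, TimeUnit.MILLISECONDS);
    }

    public static void startConsumer() {
        Thread consumer = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    process(BUFFER.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "consumer");
        consumer.setDaemon(true);
        consumer.start();
    }

    private static void process(String message) {
        // Placeholder for real work (e.g. writing to a database).
    }
}
```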

13. Failure to Handle Connection Failures Gracefully

  • Description: Without proper handling of connection failures (e.g., retries, fallbacks), the application can fail catastrophically when a connection issue occurs.

  • Impact: Complete application failure and loss of service availability.

Mitigation Strategies

  • Use Connection Pools: Properly configure connection pools with reasonable sizes, timeouts, and validation queries.

  • Implement Retry Logic: For transient network issues, use retry logic with exponential backoff (a sketch follows this list).

  • Monitor and Tune Regularly: Continuously monitor connection metrics and adjust configurations based on observed patterns.

  • Graceful Degradation: Implement fallback mechanisms and circuit breakers to gracefully handle connection failures.

  • Use Heartbeats: Implement heartbeat mechanisms to detect and recover from connection issues early.

  • Load Balancing: Distribute load evenly across servers and databases to prevent bottlenecks.
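
For the retry strategy above, a minimal sketch of exponential backoff with jitter in plain Java might look like the following; the attempt counts, delays, and the choice of SQLTransientException as the "retryable" signal are assumptions, and libraries such as Resilience4j or Spring Retry offer the same idea with more control.

```java
import java.sql.SQLTransientException;
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

public class Retry {

    /**
     * Retries the given call on transient failures, doubling the delay each
     * time and adding jitter so that many clients do not retry in lockstep.
     */
    public static <T> T withBackoff(Callable<T> call, int maxAttempts, long initialDelayMillis)
            throws Exception {
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (SQLTransientException e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts: surface the failure (or fall back)
                }
                long jitter = ThreadLocalRandom.current().nextLong(delay / 2 + 1);
                Thread.sleep(delay + jitter);
                delay *= 2; // exponential backoff
            }
        }
    }
}
```

Only retry operations that are safe to repeat (idempotent), and catch only exceptions that genuinely indicate a transient condition; retrying everything can amplify an outage.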

My Experience:

  • In one of our applications, we used the Open Session in View Filter (OSIVF) along with background threads performing cleanup tasks. These threads reused code from the REST API, which relied on OSIVF for Hibernate session and transaction management. Unfortunately, the background threads copied the code from the filter but did not handle errors or close connections correctly on failure. This led to occasional connection leaks in production, where leaked connections would gradually accumulate until response times degraded for the same user load. To address this, we implemented leak detection to identify connections that were not returned to the pool in a timely manner (see the sketch after these examples).

  • In another application, we encountered an issue with stale connections increasing on the database side, despite the pool stats appearing normal. We were using HikariCP with Spring Boot in a containerized environment running on Kubernetes (K8s). To investigate, we enabled verbose logging for HikariCP and discovered that when the connection idle timeout was reached, the pool attempted to close connections. However, due to a bug in the K8s version and platform compatibility issues, some connection closures resulted in errors that were not logged as ERROR by the HikariCP version we were using. This oversight caused an increase in stale connections, which went undetected by our monitoring systems.
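
The leak detection mentioned in the first example maps, in HikariCP, to a single setting; the sketch below is illustrative, and the 60-second threshold is an assumption that should be tuned to your longest legitimate transactions to avoid false positives.

```java
import com.zaxxer.hikari.HikariConfig;

public class LeakDetection {

    public static void enable(HikariConfig config) {
        // Log a warning (with a stack trace of the borrowing call site) whenever
        // a connection has been out of the pool for more than 60 seconds.
        config.setLeakDetectionThreshold(60_000);
    }
}
```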

These are just a few examples among many that I've encountered. If I were to list them all, it would be extensive.

In future posts, I’ll explore a Java Spring Boot application connected to a database server. We’ll dive into various connection pooling tuning parameters and how to prepare for them effectively in production.