After a client explained that users were experiencing slow response times on a recently launched system, we met with the project team to discuss the system in more detail and created a plan to investigate the problems reported. Some of the detail has been omitted, but a high-level overview of the approach is outlined below.
The client’s system was an in-house developed service hosted externally with a cloud provider. Users accessed the service from within a large, geographically dispersed IT estate with over 35,000 users. As a result, the full end-to-end data paths were not clear to the team, and shared infrastructure workloads were not known. The project team had successfully identified and fixed a number of performance issues, but these did not resolve the escalating problems reported by users.
After an initial discussion we reviewed the system architecture as well as performance-related tickets on the team’s Jira board. Notably, no distributed tracing was in place and the system had gone live without performance testing.
A meeting was scheduled with end users to step through the problems they were seeing. From this we identified specific instances when problems had occurred and the application features being used. It also highlighted two points: some features of the application were taking a long time to run and clearly needed reviewing, but, more interestingly, not all users at the same location were experiencing a problem.
System metrics and log data were collected from as many points in the processing chain as possible to analyse system health (see Brendan Gregg’s USE method). Resource data over different time periods and levels of granularity was reviewed and correlated with known problem periods supplied by users. Log data was analysed to collate any errors present and to build a latency picture from key timing points.
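As an illustration of the log analysis step, the sketch below groups request durations by endpoint to build a simple latency picture. The log lines and field names (`endpoint`, `duration_ms`) are hypothetical, not the client’s actual format:

```python
import statistics
from collections import defaultdict

# Hypothetical key=value log lines; the project's real log format differed.
sample_logs = [
    "2024-01-10T09:15:02Z request_id=a1 endpoint=/search duration_ms=1250",
    "2024-01-10T09:15:04Z request_id=a2 endpoint=/search duration_ms=90",
    "2024-01-10T09:15:06Z request_id=a3 endpoint=/report duration_ms=4300",
    "2024-01-10T09:15:09Z request_id=a4 endpoint=/search duration_ms=110",
]

def latency_by_endpoint(lines):
    """Group request durations by endpoint to build a latency picture."""
    durations = defaultdict(list)
    for line in lines:
        # Skip the timestamp, then parse the key=value fields.
        fields = dict(f.split("=", 1) for f in line.split()[1:])
        durations[fields["endpoint"]].append(int(fields["duration_ms"]))
    return {
        ep: {"count": len(ds), "max_ms": max(ds), "mean_ms": round(statistics.mean(ds), 1)}
        for ep, ds in durations.items()
    }
```

A summary like this makes it easy to correlate the slowest endpoints and times against the problem periods users reported.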
Latency data showed that some long-running requests could be accounted for within the application space and correlated to application errors; these were flagged and tickets raised. The latency data also showed that the duration of other long-running requests could not be accounted for: the source of this latency lay outside the boundaries of the application and its hosting. This tallied with the fact that not all users at the same locations were reporting a problem.
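The attribution step can be sketched as below. The sample requests, field layout, and the one-second gap threshold are illustrative assumptions, not figures from the engagement:

```python
# Hypothetical samples: (request_id, total duration observed at the edge,
# duration recorded inside the application), all in milliseconds.
samples = [
    ("r1", 5200, 4900),  # slow, but the application accounts for the time
    ("r2", 5100, 300),   # slow, with a large gap outside the app boundary
    ("r3", 450, 400),    # fast enough either way
]

def split_by_attribution(samples, gap_threshold_ms=1000):
    """Separate requests whose latency the application can account for
    from those with significant unexplained time outside its boundary."""
    app_side, outside = [], []
    for req_id, total_ms, app_ms in samples:
        (outside if total_ms - app_ms > gap_threshold_ms else app_side).append(req_id)
    return app_side, outside
```

Requests landing in the second bucket are the ones that point away from the application and towards the network path.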
A list of client end user configurations was created and a session was held with the network team to break down the end-to-end dataflows to our application for each configuration. This session was key: it identified significant differences in the network path depending on the end user’s device (EUD).
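To make the dataflow comparison concrete, the sketch below (with made-up device names and hop labels) highlights the hops that differ between device types, which is essentially what the session with the network team surfaced:

```python
# Hypothetical end-to-end hops per end user device (EUD) type.
paths = {
    "laptop-standard": ["site-lan", "wan-link-a", "cloud-ingress", "app"],
    "thin-client": ["site-lan", "security-appliance", "wan-link-b", "cloud-ingress", "app"],
}

def path_differences(paths):
    """Return, per device, the hops that are not common to every path."""
    common = set.intersection(*(set(p) for p in paths.values()))
    return {dev: [h for h in p if h not in common] for dev, p in paths.items()}
```

The differing hops are the natural suspects when only some devices at a site see a problem.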
Steps were taken to reproduce the issues experienced and measure round-trip times to the application from different end user devices. We developed a custom utility that could be run in production, within the organisation’s security policies, to collect observational data from different access points and end user devices. Tests identified significant differences in network connect and download times across certain paths, as well as significant slowdowns at particular points in the day.
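A minimal sketch of the kind of measurement such a utility performs, using only the Python standard library; the hostname is a placeholder, and the real tool also handled the organisation’s security constraints and TLS specifics:

```python
import socket
import time
import http.client
import statistics

def probe(host, path="/", port=443, timeout=10):
    """Measure TCP connect time and total download time for one request.
    (TLS handshake timing, proxies, retries, etc. are omitted for brevity.)"""
    t0 = time.perf_counter()
    socket.create_connection((host, port), timeout=timeout).close()
    connect_ms = (time.perf_counter() - t0) * 1000

    t1 = time.perf_counter()
    conn = http.client.HTTPSConnection(host, port, timeout=timeout)
    conn.request("GET", path)
    resp = conn.getresponse()
    resp.read()  # time the full download, not just the first byte
    conn.close()
    download_ms = (time.perf_counter() - t1) * 1000
    return {"connect_ms": connect_ms, "download_ms": download_ms, "status": resp.status}

def summarise(samples_ms):
    """Summarise repeated measurements for one path: median and worst case."""
    return {"median_ms": statistics.median(samples_ms), "max_ms": max(samples_ms)}
```

Running `probe` on a schedule from each device type and summarising per path is what exposes both the path-to-path differences and the time-of-day slowdowns.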
Breaking down the paths further, we discovered an overloaded network security appliance performing SSL termination and deep packet inspection for users of certain devices. We also discovered bandwidth saturation at peak periods on a network link used by these devices.
To de-escalate the problem quickly, key user groups were identified and migrated to non-impacted EUDs and network links, and application fixes were rolled out to address the issues identified from the log analysis. To improve the experience for users remaining on impacted routes, we used the network monitoring utility to test and tune the affected appliances, reducing round-trip times by over 50%. Additional data was supplied to the network team to aid in addressing both the link and appliance capacity constraints, and further recommendations were made to close the gaps in system monitoring (e.g. synthetic monitoring, distributed tracing) and to implement a test lab for performance testing.
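As an example of the synthetic monitoring recommendation, a check might replay a representative request on a schedule and alert when latency breaches an agreed budget. The budget, percentile choice, and sample figures below are illustrative assumptions:

```python
def evaluate_check(samples_ms, budget_ms=500):
    """Evaluate one synthetic-check window: breach if the (approximate)
    95th-percentile latency exceeds the agreed budget."""
    ordered = sorted(samples_ms)
    p95 = ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]
    return {"p95_ms": p95, "breached": p95 > budget_ms}
```

Had a check like this been running from representative devices before go-live, the path-dependent slowdowns would have been visible without waiting for user reports.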