Identifying an off-CPU bottleneck

Stress-test data for a new Java microservice showed significantly lower throughput for HTTP POST requests than for GET requests. This post summarises an approach to identifying the off-CPU bottleneck responsible, using some powerful and freely available Linux tools: thread activity was traced under load and mapped to the application architecture and threading model.

Application

The diagram below outlines the primary interactions of the service under test (Service A). Service A is a new microservice built with Quarkus for running in a containerised environment.

Primary interactions of system under test service A

Environment

Dedicated test environments were created on AWS to test the service. Service A was hosted on a single-node (c6g.large) AWS EKS cluster with sufficient capacity to run two 500m/500Mi containers per node. The service was set to auto-scale containers at 80% CPU, and the JVM max heap was configured to 85% of the container memory allocation. Fully working and tested instances of Service B and the database were provisioned on their own dedicated hosting and seeded with data of the size, shape and volume of production. No caching was implemented for reads against Service B or the database. Test tools were connected to an ELB via an AWS VPC private link endpoint. A single load generator rated up to 9,000 req./sec was used for the testing. All data transfers and packet rates were significantly less than the limits in the AWS stress-testing policy.
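
For illustration, the scaling and heap settings described above correspond to configuration along these lines (a sketch only; the deployment name and replica bounds are assumptions, and the exact mechanism for passing JVM options depends on the base image):

# auto-scale the Service A deployment at 80% CPU (name and replica bounds are illustrative)
$ kubectl autoscale deployment service-a --cpu-percent=80 --min=1 --max=4

# size the JVM max heap as a percentage of the container memory limit
$ export JAVA_OPTS="-XX:MaxRAMPercentage=85.0"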

Throughput

The table below summarises some of the metrics produced by a series of stress tests against Service A. Throughput and CPU utilisation are much lower for the POST test than for the GET test, which suggests the application is spending time waiting for an event. The POST tests also fail to trigger container auto-scaling because of the lower CPU utilisation, and requests take longer to complete, with significant queuing at maximum throughput.

Metric                                             GET Stress Test   POST Stress Test
-------------------------------------------------  ---------------   ----------------
Response Time Min. (ms)                                          4                 12
Max. System Throughput (Req/Sec)                               594                 78
Response Time Mean @ Max. System Throughput (ms)                86                299
Container Count @ Max. System Throughput                         2                  1
Container Max. Throughput (Ops/sec)                            303                 78
Container CPU @ Max. Throughput (%)                             99                 53
JVM Heap @ Max. Throughput (MB)                                121                 85
Application Errors                                               0                  0
Bottleneck                                                  on-CPU            off-CPU

Threads

To investigate why throughput and CPU are lower during the POST test, the tools below were used to obtain more information while the system was under load. As some of the tools capture a snapshot at a particular point in time, it was useful to combine them into a script for sampling and offline analysis (a sketch follows the ctr example below). EKS runs containerd, so we can get the process id of the container we are interested in using ctr as below.

$ for i in `ctr --namespace k8s.io containers ls | grep <name> | cut -d' ' -f1`;
do
ctr --namespace k8s.io task list | grep $i;
done
a087b8b5e433f80114dc5104f818d1a183af48bdb23774f0e28f9ee4b43ab5ae    1234567    RUNNING
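
The sampling wrapper referred to above can be as simple as the sketch below (the tool selection, sample count, interval and output locations are assumptions, not the exact script used):

#!/bin/bash
# sample.sh - capture periodic snapshots of thread and socket state for offline analysis
# usage: ./sample.sh <pid> [samples] [interval-seconds]
PID=$1; SAMPLES=${2:-10}; INTERVAL=${3:-5}
OUTDIR=samples-$(date +%s); mkdir -p "$OUTDIR"

for n in $(seq 1 "$SAMPLES"); do
  TS=$(date +%s)
  # thread states, ids and names for the container process
  ps -T -p "$PID" -o stat,tid,comm --no-header > "$OUTDIR/ps.$TS"
  # socket statistics from inside the container network namespace
  nsenter -t "$PID" -n ss -i -p > "$OUTDIR/ss.$TS"
  sleep "$INTERVAL"
done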

Using ps [1] we can see details about the threads the process is running and their commands. The output below shows that the application is running a set of threads for the memory management and other functions of the JVM, a set of threads for the Vert.x reactive application toolkit, and a set of threads for the Java executor pool and JDBC connectivity.

$ ps -T -p 1234567 -o stat,comm --no-header | sort | uniq -c | sort -n -r
     51 Ssl  executor-thread
      2 Ssl  vert.x-internal
      2 Ssl  vert.x-eventloo
      2 Ssl  Transaction
      2 Ssl  java
      1 Ssl  VM
      1 Ssl  VM
      1 Ssl  vertx-blocked-t
      1 Ssl  vert.x-acceptor
      1 Ssl  Sweeper
      1 Ssl  SmallRye
      1 Ssl  Signal
      1 Ssl  Service
      1 Ssl  Reference
      1 Ssl  PostgreSQL-JDBC
      1 Ssl  Notification
      1 Ssl  Monitor
      1 Ssl  Finalizer
      1 Ssl  Common-Cleaner
      1 Ssl  C2
      1 Ssl  C1
      1 Ssl  agroal-11

Vert.x [2] has an event-driven architecture that uses an event-loop thread for non-blocking event dispatching and executor threads for longer-running blocking requests. It provides a set of modules for building reactive applications and uses Netty for the NIO event loop. We can see from the output above that 51 executor (worker) threads have been created, which suggests they are handling long-running or blocking tasks. All of the threads except one were in an interruptible sleep state (waiting for an event), which can be confirmed as below.
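
The thread states can be checked directly from /proc; a quick count such as the one below (using the process id found earlier) shows how many threads are in each state, where S is interruptible sleep and R is running.

$ grep -h '^State:' /proc/1234567/task/*/status | sort | uniq -c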

System Calls

When a thread goes off-CPU it can be for a number of reasons, but the most common are network I/O, disk I/O, acquiring a lock, or waiting for work. Using perf trace we can obtain a summary of the system calls each thread is making under load to find out more about their activity.
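
The per-thread summaries below were gathered with perf trace in summary mode; a hedged example of the invocation (the output file name is an assumption) is:

# per-thread system call summary for the container process;
# let it run for the sampling window, then stop with Ctrl-C to print the summary
$ perf trace -s -p 1234567 -o perf-trace-summary.txt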

The table below shows the top system calls by total time for each thread during a GET stress test and a POST stress test. The data shows more time was spent waiting for file descriptor I/O during the POST test than the GET test, and the event-loop thread is making more read and write calls in the POST test. A key difference is that the POST test makes a call to Service B.

                                                   GET                   POST
 thread                syscall            calls       total      calls       total
                                                       (msec)                 (msec)
 --------------------- --------------- --------  -----------  --------  -----------
 vert.x-acceptor       epoll_pwait           48    28791.242        65    25531.706
                       futex                 14      150.530       132        2.093
                       accept                99       10.856       201        1.652                        
                       setsockopt           153        9.549       134        0.865
 vert.x-eventloop      futex               4112     7257.119      6857       83.480
                       epoll_pwait          405      744.088      7059    19843.399
                       read                3555      584.133      7182       65.832
                       writev              3452      463.772      1051       31.407
                       write                  6        1.022      3543       63.328
 executor-thread       futex                980    22990.778       318      437.100
                       ppoll                119      444.694       269    23991.239
                       write                418      173.901       423        9.966
                       read                 531      103.196       814        6.302

We can see the event-loop thread is monitoring descriptors using the epoll_pwait system call and the executor threads are using ppoll. As this is an HTTP service that makes calls to Service B and Postgres, it is likely that the event-loop and executor threads are waiting on network I/O.

Running strace we can obtain further detail about the system calls above and the data being sent. Note: in this example strace was used for a short 10-second window; similar information can be obtained using perf, which is also less intrusive.

$ strace -rttTf -p 1234567 -o strace-output.txt

3393265 17:47:03.760375 (+     0.000200) futex(0xffff4c1b44b8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
1246868 17:47:03.760399 (+     0.000019) <... futex resumed> ) = 1 <0.000243>
1246868 17:47:03.760453 (+     0.000019) write(22, "GET /serviceb/123" ..., 232 <unfinished ...>
3393613 17:47:03.760524 (+     0.000070) <... ppoll resumed> ) = 1 ([{fd=69, revents=POLLIN}]) <0.001514>
3393613 17:47:03.762510 (+     0.000070) ppoll([{fd=69, events=POLLIN}], 1, NULL, NULL, 0) = 1 ([{fd=69, revents=POLLIN}]) <0.001717>
1246868 17:47:03.765582 (+     0.000035) read(22, "HTTP1.1 200 \r\nVary: Origin\r\nVar" ..., 4096) = 2083 <0.000011>

The output above can be analysed to find out which file descriptors the event-loop threads and the executor threads are using, and how often each is being read from or written to.
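
A rough way to pull that out of the raw strace file is a pass like the one below (a sketch; the field positions assume the -rttTf output format shown above, and each call is counted from its initiating line):

# count read/write/writev calls per thread and file descriptor
$ awk '$5 ~ /^(read|write|writev)\(/ {split($5, a, "[(,]"); print $1, a[1], "fd=" a[2]}' \
      strace-output.txt | sort | uniq -c | sort -rn | head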

Using nsenter with /proc/<pid>/fd, lsof and ss we can find out what type the file descriptors are. From lsof and the /proc/<pid>/fd listing we can see that fd=157 and fd=159 are eventfd file descriptors and fd=1 is /dev/null, with the remaining descriptors being sockets. The output of ss shows that the event-loop threads (1246868, 1246869) were writing and reading to 5 TCP sockets connected to Service B (fd=22, fd=87, fd=125, fd=130, fd=131), and the executor threads (for example 3393288) were writing and reading to TCP sockets connected to Postgres (one socket per thread). The commands below show the steps: find the process id of the service inside the container (in this case 1), list its file descriptors, then run ss.

# find the process id of java inside the container
$ nsenter -t 1234567 -n ls -lrt /proc/*/exe

# list the file descriptors for that process id
$ nsenter -t 1234567 -n ls -lrt /proc/1/fd > fd-output.txt

# get socket statistics for the container
# note: this will need to be run more than once to capture any transient connections
$ nsenter -t 1234567 -n ss -i -p > ss-output.txt.`date +%s`

#get details for file descriptors used for socket I/O
$ grep -e "fd=125" -e "fd=130" -e "fd=131" -e "fd=132" -e "fd=22" -e "fd=87" -e "fd=43" ss-output.txt.*
tcp   ESTAB 0      0      [::ffff:xxx.xxx.xxx.xxx]:48862   [::ffff:xxx.xxx.xxx.xxx]:postgres       users:(("java",pid=1234567,fd=43))
tcp   ESTAB 0      0      [::ffff:xxx.xxx.xxx.xxx]:34370   [::ffff:xxx.xxx.xxx.xxx]:8090       
    users:(("java",pid=1234567,fd=22))
tcp   ESTAB 0      0      [::ffff:xxx.xxx.xxx.xxx]:34346   [::ffff:xxx.xxx.xxx.xxx]:8090       
    users:(("java",pid=1234567,fd=87))
tcp   ESTAB 0      0      [::ffff:xxx.xxx.xxx.xxx]:34349   [::ffff:xxx.xxx.xxx.xxx]:8090       
    users:(("java",pid=1234567,fd=125))
tcp   ESTAB 0      0      [::ffff:xxx.xxx.xxx.xxx]:34360   [::ffff:xxx.xxx.xxx.xxx]:8090       
    users:(("java",pid=1234567,fd=130))
tcp   ESTAB 0      0      [::ffff:xxx.xxx.xxx.xxx]:34363   [::ffff:xxx.xxx.xxx.xxx]:8090       
    users:(("java",pid=1234567,fd=131))
tcp   ESTAB 0      0      [::ffff:xxx.xxx.xxx.xxx]:34371   [::ffff:xxx.xxx.xxx.xxx]:8090       
    users:(("java",pid=1234567,fd=132))

Combining this with the output from the BCC offwaketime tool gives a better understanding of the data flow and the calls being made. A POST request is dispatched to a worker thread by the event-loop thread; the worker thread then makes a call to Service B via the event-loop thread and waits for the response. All of the worker threads are making their requests to Service B via the event-loop threads.
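
A hedged example of how that off/wake data can be captured for this process (the install path and duration are assumptions):

# record off-CPU time with waker stacks for 30 seconds, in folded format for flame graphs
$ /usr/share/bcc/tools/offwaketime -f -p 1234567 30 > offwaketime.folded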

The next step is to check the time spent and the concurrency on the event-loop threads, and to look at the application code.

Code

Event-loop threads process their events sequentially and are designed to run one per processor core to minimise context-switching overhead. For this model to work, blocking calls must not be made from the event-loop threads (https://vertx.io/docs/vertx-core/java/#golden_rule), yet this is what appears to be happening.

The data obtained at runtime shows that the event-loop threads for Service A are spending a lot of time waiting on network I/O to Service B.

Service A is using the RESTEasy client to make the calls to Service B.

Why are the event-loop threads making blocking calls to Service B?

Why don’t we see messages from the Vert.x blocked thread checker?
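
One quick check for the second question (a sketch; the deployment name is an assumption, and the exact warning wording may vary) is to search the Service A logs for the Vert.x blocked-thread warning, which is normally emitted when an event-loop thread is held for longer than the configured limit:

# look for blocked thread checker warnings in the Service A logs
$ kubectl logs deploy/service-a | grep -i "has been blocked for"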

References

[1] Linux /proc man page: https://man7.org/linux/man-pages/man5/proc.5.html

[2] Vert.x architecture: https://vertx.io/introduction-to-vertx-and-reactive/
